What are the stages of a RAG pipeline?

Ingestion (load and clean source data), chunking (split it into retrievable units), embedding (turn chunks into vectors), storage (a vector index, often with metadata), retrieval (find relevant chunks for a query, ideally hybrid + reranked), and grounded generation (prompt the LLM with the retrieved context and require citations).

What is the most important part of a RAG pipeline to get right?

Retrieval quality. If the right chunks don't surface, no amount of prompting fixes the answer — the model can only reason over what it's given. Most RAG failures are retrieval failures, usually traceable to bad chunking or relying on vector search alone.

What chunk size should I use for RAG?

There's no universal number, but chunk on semantic boundaries (headings, paragraphs) rather than fixed character counts, keep chunks focused on one idea, and add a little overlap so context isn't severed mid-thought. Then measure retrieval quality and adjust — don't guess once and move on.

Do I need a vector database for RAG?

For anything beyond a prototype you need a vector index, but not necessarily a dedicated vector database: pgvector on the Postgres you already run is often enough. The point is fast similarity search with metadata filtering; many teams provision a separate vector DB before they need one.

Build a RAG Pipeline From Scratch (Production Patterns That Actually Matter)

Most RAG tutorials stop at “embed your docs, do a similarity search, stuff the results in a prompt.” That gets you a demo. It does not get you something that gives correct, grounded answers on real data — and the gap between those two is where all the actual engineering lives.

A RAG pipeline is a series of stages, and a weak link in any one of them caps the quality of the whole thing. You can have a frontier model and a beautiful prompt, and still ship garbage if your chunking is wrong. So this is the pipeline end to end, with the production patterns that decide whether it works — not just the happy-path demo.

If you’re still deciding whether RAG is even the right tool versus fine-tuning, read RAG vs Fine-Tuning for LLMs first. This post assumes you’ve decided to retrieve.

TL;DR

RAG is a pipeline: ingest → chunk → embed → store → retrieve → generate. The output is only as good as the weakest stage.
Retrieval quality is everything. Most “the LLM hallucinated” bugs are actually “the right chunk never got retrieved” bugs.
Chunk on meaning, not character counts. Semantic boundaries plus light overlap beat fixed-size splits.
Don’t rely on vector search alone. Hybrid (keyword + vector) retrieval with a reranker is the production default.
Ground the generation. Pass only retrieved context, require citations, and refuse when context is thin.
You can’t improve what you don’t measure. Build a retrieval eval before you tune anything.

The pipeline, stage by stage

1. Ingestion

Load your sources and clean them before anything else. Strip boilerplate, nav chrome, and duplicated headers/footers. Garbage here propagates through every downstream stage, and a bad answer is very hard to trace back to it. Preserve structure (headings, lists, tables), because that structure is what makes good chunking possible.
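
One way to strip duplicated headers and footers is to drop any line that recurs on most pages of a source. This is a minimal sketch, not a complete cleaner: the function name `strip_repeated_lines` and the `min_share` threshold are hypothetical, and it assumes each source is already split into pages of plain text.

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop lines that recur on most pages (likely headers, footers, nav)."""
    # Count each distinct non-empty line at most once per page.
    seen = Counter()
    for page in pages:
        seen.update({line.strip() for line in page.splitlines() if line.strip()})

    # A line is treated as boilerplate if it appears on at least this many pages.
    threshold = max(2, int(len(pages) * min_share))
    boilerplate = {line for line, count in seen.items() if count >= threshold}

    cleaned = []
    for page in pages:
        kept = [ln for ln in page.splitlines() if ln.strip() not in boilerplate]
        cleaned.append("\n".join(kept).strip())
    return cleaned
```

Headings and body lines that appear on only one page pass through untouched, so the structure the chunker needs is still there.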

2. Chunking — where most pipelines quietly fail

Chunking is the highest-leverage, most-underrated stage. The naive move is to split every document into fixed 500-character windows. Don’t. Fixed-size splitting severs sentences and merges unrelated ideas, and then retrieval surfaces fragments that don’t mean anything on their own.

Instead:

Split on semantic boundaries — headings, paragraphs, list items. Respect the document’s own structure.
One idea per chunk. A chunk should be retrievable and self-contained.
Add light overlap so context isn’t cut mid-thought between adjacent chunks.
Attach metadata to every chunk: source, title, section, date, URL. You’ll use it for filtering and citations.

chunk = {
  id, text,
  metadata: { source, title, section, url, date }
}
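
A rough sketch of that shape in Python, with a chunker that splits on headings and paragraphs and carries one sentence of overlap. The names `Chunk` and `chunk_document` are hypothetical, the input is assumed to be Markdown-style text with `#` headings and blank lines between paragraphs, and only `source` and `section` are filled in; title, URL, and date would be attached the same way.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_document(doc: str, source: str, overlap_sentences: int = 1) -> list[Chunk]:
    chunks: list[Chunk] = []
    section = ""
    prev_tail = ""  # trailing sentence(s) of the previous paragraph in this section
    for block in re.split(r"\n\s*\n", doc.strip()):
        lines = block.strip().splitlines()
        if not lines:
            continue
        if lines[0].startswith("#"):
            # A heading starts a new section; overlap never crosses it.
            section = lines[0].lstrip("#").strip()
            prev_tail = ""
            lines = lines[1:]
            if not lines:
                continue
        body = " ".join(line.strip() for line in lines)
        chunks.append(Chunk(
            id=f"{source}#{len(chunks)}",
            text=f"{prev_tail} {body}".strip(),
            metadata={"source": source, "section": section},
        ))
        if overlap_sentences > 0:
            sentences = re.split(r"(?<=[.!?])\s+", body)
            prev_tail = " ".join(sentences[-overlap_sentences:])
    return chunks
```

Resetting the overlap at each heading is a design choice: it keeps a chunk from opening with a sentence that belongs to an unrelated section.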

💡 Key insight: If retrieval is bad, fix chunking before you touch the model or the prompt. The retriever can only find what chunking made findable.

3. Embedding

Turn each chunk into a vector with an embedding model. Two rules that save pain later:

Embed the same way at index time and query time. Same model, same preprocessing. A mismatch silently wrecks relevance.
Version your embeddings. Vectors from different models live in different spaces and can’t be compared, so when you change the embedding model, you must re-embed the whole corpus. Track which model produced which vectors so you know when a reindex is due.
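
Both rules can be sketched in a few lines. Everything named here is an assumption for illustration: `embed` is a deterministic stand-in built on a hash (it is not semantic and would be replaced by a real embedding call), and the model name `example-embedder-v2` is made up. The point is the shape: one shared `preprocess` path, and a model tag stored next to every vector.

```python
import hashlib
from dataclasses import dataclass

EMBEDDING_MODEL = "example-embedder-v2"  # hypothetical model name


def preprocess(text: str) -> str:
    """Single normalization path shared by indexing and querying."""
    return " ".join(text.lower().split())


def embed(text: str, dim: int = 8) -> list[float]:
    """Stand-in for a real embedding call: deterministic, not semantic."""
    digest = hashlib.sha256(preprocess(text).encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


@dataclass
class StoredVector:
    chunk_id: str
    vector: list[float]
    model: str  # which model produced this vector


def index_chunk(chunk_id: str, text: str) -> StoredVector:
    return StoredVector(chunk_id, embed(text), EMBEDDING_MODEL)


def needs_reindex(stored: list[StoredVector], current_model: str = EMBEDDING_MODEL) -> list[str]:
    """Return ids of chunks whose vectors came from a different model."""
    return [v.chunk_id for v in stored if v.model != current_model]
```

Because queries go through the same `embed` function, there is no second preprocessing path to drift out of sync.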

4. Storage

Store vectors in an index that does fast similarity search with metadata filtering. You don’t necessarily need a dedicated vector database — pgvector on the Postgres you already run handles a surprising amount before a specialized store (Qdrant, Weaviate, Pinecone) earns its keep.

What actually matters: filtering. “Search only this customer’s docs” or “only documents from the last year” is a metadata WHERE clause combined with vector similarity. Without it, retrieval leaks across boundaries it shouldn’t.
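
To make the filter-plus-similarity idea concrete, here is an in-memory sketch. It is not how a real index executes the query (pgvector or a dedicated store would do this with a WHERE clause and an index), and the row layout and the `filtered_search` name are assumptions. What it shows is the ordering that matters: the metadata filter is applied before ranking, so out-of-scope rows can never reach the results.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec: list[float], rows: list[dict], where: dict, k: int = 5) -> list[str]:
    """rows are dicts with 'id', 'vector', 'metadata'. Filter first, then rank."""
    allowed = [
        row for row in rows
        if all(row["metadata"].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(allowed, key=lambda row: cosine(query_vec, row["vector"]), reverse=True)
    return [row["id"] for row in ranked[:k]]
```

Calling `filtered_search(query_vec, rows, {"tenant": "acme"})` only ever scores rows tagged with that tenant, however similar another tenant’s vectors are.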

5. Retrieval — go hybrid, then rerank

This is the stage that most separates a demo from a product.

Vector search alone is not enough. Embeddings are great at semantic similarity and bad at exact matches — error codes, product SKUs, proper nouns, acronyms. Keyword search (BM25) is the opposite. Hybrid retrieval runs both and merges the results, so you catch both “what they meant” and “the exact term they typed.”

Then rerank. Initial retrieval optimizes for recall — pull a generous candidate set (say, top 20). A cross-encoder reranker then scores those candidates against the query far more precisely and keeps the top handful you’ll actually pass to the model. Retrieve broad, rerank narrow.

candidates = vectorSearch(q, k=20) ∪ keywordSearch(q, k=20)
top = rerank(q, candidates)[:5]
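
The pseudocode above merges with a plain union. One common alternative for the merge step is reciprocal rank fusion (RRF), which rewards documents that rank well in either list; that swap is an assumption of this sketch, not something the pipeline requires. `vector_search`, `keyword_search`, and `score_pair` are hypothetical callables you would back with your index, a BM25 engine, and a cross-encoder respectively.

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def retrieve(query, vector_search, keyword_search, score_pair, k=20, top_n=5):
    """Retrieve broad from both searches, fuse, then rerank narrow."""
    fused = rrf_merge([vector_search(query, k), keyword_search(query, k)])
    candidates = fused[:k]  # bound how many pairs the reranker has to score
    reranked = sorted(candidates, key=lambda doc_id: score_pair(query, doc_id), reverse=True)
    return reranked[:top_n]
```

The constant `k=60` in `rrf_merge` is a conventional default for RRF, not a tuned value; treat it as a starting point and check it against your eval set.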

6. Grounded generation

Now — and only now — the LLM. The job here is to keep it honest:

Pass only the retrieved context. Don’t let the model fall back on parametric memory for facts it should be reading.
Require citations. Ask it to cite the chunk/source for each claim. Citations are both a UX feature and a hallucination check.
Give it permission to say “I don’t know.” If the retrieved context doesn’t answer the question, the correct output is a refusal, not a confident guess. Tell it that explicitly.

System: Answer ONLY from the context below. Cite sources by id.
If the context doesn't contain the answer, say you don't know.

Context:
[1] {chunk_1}
[2] {chunk_2}
...

Question: {user_query}
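
Assembling that prompt in code is also where the refusal path can be enforced before the model is ever called. A minimal sketch, assuming each retrieved chunk arrives as a dict with `text` and a reranker `score`; the `build_prompt` name, the `min_score` cutoff of 0.3, and the `REFUSAL` wording are all hypothetical and would need tuning against real data.

```python
from typing import Optional

REFUSAL = "I don't know based on the provided sources."

SYSTEM = (
    "Answer ONLY from the context below. Cite sources by id.\n"
    "If the context doesn't contain the answer, say you don't know."
)


def build_prompt(question: str, chunks: list[dict], min_score: float = 0.3) -> Optional[str]:
    """Return a grounded prompt, or None when no chunk clears the score bar."""
    usable = [chunk for chunk in chunks if chunk["score"] >= min_score]
    if not usable:
        # Context is too thin: the caller should reply with REFUSAL
        # instead of sending anything to the model.
        return None
    context = "\n".join(
        f"[{i}] {chunk['text']}" for i, chunk in enumerate(usable, start=1)
    )
    return f"{SYSTEM}\n\nContext:\n{context}\n\nQuestion: {question}"
```

Returning `None` rather than an empty-context prompt keeps the “I don’t know” decision in code, where it can be tested, instead of hoping the model follows the instruction.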

The patterns that separate prod from demo

Hybrid + rerank, not bare vector search. The single biggest quality jump.
Metadata filtering for security and scoping — never retrieve across tenant or permission boundaries.
Citations and refusal wired into the prompt, so wrong answers become “I don’t know” instead of confident fiction.
Caching. Cache embeddings (don’t re-embed unchanged chunks) and cache answers to repeated queries.
A retrieval eval set. A fixed set of question → expected-source pairs you can score on every change.
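
The embedding half of the caching pattern can be sketched as a lookup keyed by content hash. The `EmbeddingCache` class and its in-memory dict are assumptions for illustration; in production the store would more likely be a table or a key-value service, and `embed_fn` stands for whatever embedding call you use.

```python
import hashlib


class EmbeddingCache:
    """Cache vectors by content hash so unchanged chunks are never re-embedded."""

    def __init__(self, embed_fn, model: str):
        self.embed_fn = embed_fn
        self.model = model
        self.store: dict[str, list[float]] = {}
        self.misses = 0

    def _key(self, text: str) -> str:
        # The model name is part of the key, so switching models
        # never serves vectors produced by the old one.
        return hashlib.sha256(f"{self.model}\x00{text}".encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]
```

On a re-ingest, only chunks whose text actually changed produce a miss, which is usually a small fraction of the corpus.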

Common mistakes

Fixed-size chunking. The default that quietly caps your ceiling. Chunk on meaning.
Vector-only retrieval. You’ll routinely miss exact-match queries (error codes, SKUs, acronyms). Add keyword search.
No reranking. Stuffing the raw top-k into the prompt wastes context on near-misses.
Tuning the prompt to fix a retrieval problem. If the right chunk isn’t retrieved, the prompt is irrelevant. Diagnose retrieval first.
No evaluation. “It looks better” isn’t a metric. Without an eval set you’re guessing, and you’ll regress silently.

Best practices

Measure retrieval separately from generation. Most failures are retrieval failures; isolate them. Track recall@k on your eval set: how often the expected source shows up in the top k results.
Chunk on structure, then iterate. Start with semantic boundaries and light overlap; adjust based on retrieval scores.
Default to hybrid + rerank. Treat it as the baseline, not an optimization.
Filter by metadata for scope and security. Especially in multi-tenant systems.
Force grounding and citations. Answer only from context; cite; allow “I don’t know.”
Re-embed on model change. Version vectors so you know when a reindex is required.
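
Measuring retrieval on its own can start very small. This sketch scores a retriever against question → expected-source pairs; with one expected source per question, recall@k reduces to a hit rate. The `recall_at_k` name and the shape of `retrieve` (a callable returning ranked source ids) are assumptions.

```python
def recall_at_k(eval_set, retrieve, k: int = 5) -> float:
    """Share of questions whose expected source appears in the top-k results."""
    if not eval_set:
        return 0.0
    hits = sum(
        1 for question, expected_source in eval_set
        if expected_source in retrieve(question)[:k]
    )
    return hits / len(eval_set)
```

Run it before and after every chunking, embedding, or retrieval change; a drop in the score is a regression you would otherwise only hear about from users.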

Conclusion

RAG isn’t one trick — it’s a pipeline, and quality is set by its weakest stage. Get chunking right, retrieve hybrid and rerank, ground the generation, and measure retrieval so you’re improving the right thing. Do that and you cross the line from impressive demo to a system people can trust with real questions.

Skip the engineering — relying on naive chunking and bare vector search — and you’ll ship something that demos well and fails the moment real users ask real questions.

Go deeper across LLM Engineering — RAG, Fine-Tuning & Production LLMs, revisit the RAG vs Fine-Tuning decision framework, or explore AI Coding Agents for the agentic side of LLM systems.
