
RAG in Production: Patterns That Actually Work at Scale

Naive RAG breaks in production. Here are the architecture patterns we use when building retrieval-augmented generation systems for real users — chunking, reranking, eval harnesses, and cost controls.

April 5, 2026 · 11 min read · Akforges Studio

Naive RAG — embed documents, store in a vector DB, retrieve top-k, stuff into context — works in a demo. It breaks in production for predictable reasons: low retrieval precision, context window overflow, hallucinations on edge cases, and runaway costs.

Here are the patterns we've settled on after building RAG systems for production use.


Why naive RAG fails

The fundamental problem with naive RAG is that it conflates document storage with retrieval quality. A vector similarity search finds semantically related chunks — but "semantically related" and "answers the question" are different things.

Common failure modes:

Retrieval precision is low: Top-5 retrieved chunks contain the answer 60–70% of the time in typical business document corpora. That means 30–40% of queries go to the LLM with wrong or missing context, and the model hallucinates rather than saying "I don't know."

Chunk boundary problems: Fixed-size chunking cuts through sentences, paragraphs, and tables at arbitrary points. A chunk containing "the answer is 47%" with no surrounding context is useless.

Context stuffing: Cramming 8,000 tokens of loosely relevant context into the prompt costs money and degrades answer quality. Models perform better with focused, relevant context than with a haystack.

No eval loop: Without a ground-truth eval suite, you don't know when retrieval quality degrades — model updates, new documents, or embedding model changes can silently break your system.


Pattern 1: Hierarchical chunking

Instead of fixed-size chunks, use a parent-child hierarchy:

  • Parent chunks: Large (512–1024 tokens), semantically complete units (full paragraphs, sections, table-plus-caption)
  • Child chunks: Small (64–128 tokens), granular units for precise retrieval

At query time, retrieve small child chunks by semantic similarity, then return their parent chunks to the LLM. This gives you retrieval precision (small chunks match queries well) and context completeness (large chunks give the model the full picture).

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Parent splitter — larger, semantically complete units
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " "],
)

# Child splitter — granular, for retrieval
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=128,
    chunk_overlap=20,
    separators=["\n\n", "\n", ".", " "],
)

Store both. Retrieve children, return parents to the LLM.


Pattern 2: Hybrid retrieval (dense + sparse)

Pure vector (dense) retrieval misses exact keyword matches — product codes, person names, specific numeric values. Pure keyword (sparse) retrieval misses semantic similarity.

Combine both with Reciprocal Rank Fusion (RRF):

def reciprocal_rank_fusion(dense_results, sparse_results, k=60):
    """Combine dense and sparse retrieval rankings (k=60 is the usual RRF constant)."""
    scores = {}
    for rank, doc in enumerate(dense_results):
        doc_id = doc.metadata["id"]
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, doc in enumerate(sparse_results):
        doc_id = doc.metadata["id"]
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    # Return the top documents sorted by combined score
    sorted_ids = sorted(scores, key=scores.get, reverse=True)
    return [get_doc(doc_id) for doc_id in sorted_ids[:10]]  # get_doc: look the chunk up by id

Pgvector + tsvector in PostgreSQL handles both dense and sparse retrieval in a single database. For most use cases under ~10M documents, you don't need a dedicated vector DB.


Pattern 3: Reranking

After retrieval, before sending to the LLM, run a reranker. A cross-encoder reranker (Cohere Rerank, BGE-Reranker, or Jina Reranker) re-scores retrieved chunks using both the query and chunk text together. This is significantly more accurate than cosine similarity because the query and chunk are evaluated in context.

import cohere

co = cohere.Client(api_key)

def rerank(query: str, documents: list[str], top_n: int = 5) -> list[str]:
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_n,
    )
    return [documents[r.index] for r in results.results]

Reranking from 10 candidates down to 3–5 typically improves answer quality measurably and reduces context tokens (which reduces cost).


Pattern 4: Query transformation

User queries are often ambiguous, too short, or assume context the system doesn't have. Transform the query before retrieval.

Hypothetical Document Embedding (HyDE): Generate a hypothetical answer to the query, then retrieve documents similar to that hypothetical answer. This works because the hypothetical answer is in the same semantic space as real documents.

async def hyde_retrieve(query: str, vectorstore) -> list[Document]:
    # Generate a hypothetical answer
    hypothetical = await llm.ainvoke(
        f"Write a short paragraph that would answer this question: {query}"
    )
    # Retrieve using the hypothetical answer, not the original query
    return await vectorstore.asimilarity_search(hypothetical.content, k=10)

Multi-query retrieval: Generate 3–5 variations of the query, retrieve for each, deduplicate. Catches documents that one phrasing misses.


Pattern 5: Structured retrieval metadata

Add structured metadata to every document chunk at indexing time. This enables filtered retrieval that's faster and more precise than semantic search alone.

chunk_metadata = {
    "source": "employee-handbook-v3.pdf",
    "section": "Benefits",
    "date_updated": "2025-11-01",
    "document_type": "policy",
    "jurisdiction": "US",
}

Then filter at retrieval time:

results = vectorstore.similarity_search(
    query=query,
    k=10,
    filter={"document_type": "policy", "jurisdiction": "US"},
)

This prevents the LLM from citing a US benefits policy when answering a UK question, or an outdated document when a newer version exists.


Pattern 6: Confidence scoring and fallback

A well-designed RAG system should know when it doesn't know. Implement confidence scoring and handle the low-confidence case explicitly.

import json

async def answer_with_confidence(query: str, context: list[str]) -> dict:
    response = await llm.ainvoke(
        system="You are a helpful assistant. Respond as a JSON object with keys "
               "'answer', 'confidence' ('HIGH' or 'LOW'), and 'missing_information'. "
               "If you cannot answer confidently from the provided context, set "
               "confidence to LOW and explain what information is missing.",
        user=f"Context:\n{format_context(context)}\n\nQuestion: {query}",
        response_format={"type": "json_object"},
    )

    parsed = json.loads(response.content)
    if parsed.get("confidence") == "LOW":
        # Route to a human agent, a broader search, or an explicit "I don't know"
        return handle_low_confidence(query, parsed["missing_information"])

    return parsed

"I don't know" is a correct answer. A confident wrong answer causes user trust erosion that's hard to recover from.


Pattern 7: Eval harness for retrieval quality

You need to measure retrieval quality independently of generation quality. Build an eval suite with:

  • Retrieval precision: For each test query, does the correct document appear in the top-k retrieved?
  • Retrieval recall: Are all relevant documents retrieved?
  • Answer accuracy: Does the final answer match the ground truth?

A minimal harness for the precision metric:

def evaluate_retrieval(
    test_cases: list[dict],  # [{query, expected_doc_ids, expected_answer}]
    retriever,
) -> dict:
    precision_scores = []
    for case in test_cases:
        retrieved = retriever.retrieve(case["query"], k=5)
        retrieved_ids = {doc.metadata["id"] for doc in retrieved}
        expected_ids = set(case["expected_doc_ids"])
        precision = len(retrieved_ids & expected_ids) / len(retrieved_ids)
        precision_scores.append(precision)
    return {"retrieval_precision@5": sum(precision_scores) / len(precision_scores)}

Run this on every PR. A drop in retrieval precision from 0.78 to 0.61 is a significant regression — catch it before it reaches users.


Cost controls

RAG systems can be expensive because every query does at minimum: 1 embedding call + vector search + 1 LLM call. At scale:

Cache embeddings aggressively: Query embeddings are cheap to compute, but caching them for repeated queries still removes a redundant API round-trip and shaves latency on every hit; the bigger saving comes from caching full responses, below.

Cache LLM responses: For deterministic queries (same query → same documents → same answer), a Redis cache keyed on hash(query + context_hash) eliminates the most expensive part.

Route by complexity: Use a cheap model (GPT-4o mini, Claude Haiku) for simple factual lookups. Reserve expensive models for complex reasoning. A classifier that routes queries adds one cheap inference call but saves money on 60–70% of requests.

Set hard context limits: Never send more than N tokens of context regardless of what retrieval returns. If retrieval returns 10 chunks and you only need 5, the reranker decides which 5.


The production checklist

Before shipping a RAG system:

  1. ☐ Hierarchical chunking (not fixed-size)
  2. ☐ Hybrid retrieval (dense + sparse)
  3. ☐ Reranker on retrieved candidates
  4. ☐ Structured metadata on all chunks
  5. ☐ Query transformation for short/ambiguous queries
  6. ☐ Confidence scoring with explicit low-confidence handling
  7. ☐ Eval harness with 100+ hand-labelled test cases
  8. ☐ Response caching (Redis or equivalent)
  9. ☐ Per-user cost limits
  10. ☐ Distributed tracing on every retrieval + LLM call

A prototype that hits 7/10 of these is ready for a beta. All 10 before public launch.


Building a RAG system that needs to be production-hardened? We specialise in taking AI prototypes to production — evals, guardrails, observability, and cost controls included.

Work with us

Need help applying this to your stack?

Free 30-min strategy call. We'll scope your problem and tell you honestly what the fix looks like.

Book a strategy call