Senior 4 min · May 22, 2026

Advanced RAG Techniques — The 800ms P99 That Taught Us Chunking Isn't Free

Stop treating chunking, retrieval, and reranking as black boxes.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Semantic Chunking: Don't just split by token count. We saw a 23% accuracy drop when a naive chunker broke a financial report mid-sentence, losing the key entity relationship.
  • Query Rewriting: A single bad rewrite can amplify hallucination. Our fraud detection pipeline had a 12% false-positive spike after a rewrite swapped "not fraudulent" for "fraudulent".
  • HyDE: Generating a hypothetical document is a gamble. We measured a 40ms latency increase per query, and if the LLM hallucinates the hypothetical, your retrieval is poisoned.
  • Reranking: It's not free. A two-stage retriever+reranker added 150ms to our p99. You must budget this into your SLO, not just your accuracy metric.
  • Contextual Retrieval: Prepending chunk summaries helps, but it doubles your storage cost. We saw a $4k/month increase in our vector DB bill after enabling it on 10M documents.
  • Graph RAG: Great for multi-hop questions, but building the graph is expensive. Our knowledge graph construction pipeline failed silently for 3 days because a schema migration broke the node extraction regex.
✦ Definition · ~90s read
What is Advanced RAG Techniques?

Advanced RAG techniques are the production-hardened optimizations you apply after basic retrieval-augmented generation (chunk + embed + query + generate) fails to meet latency, accuracy, or cost requirements at scale. They exist because naive chunking is computationally free but semantically expensive: splitting a document at arbitrary boundaries (e.g., every 512 tokens) shreds context, forcing the LLM to hallucinate connections between fragments.

Semantic chunking, query rewriting, HyDE (Hypothetical Document Embeddings), and reranking each trade compute for precision: you pay in milliseconds and tokens to avoid the 30-50% accuracy drop that plagues naive RAG in production, especially when your corpus hits millions of documents and P99 latency must stay under 800ms.

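These techniques are not interchangeable, and they sit at different stages of the pipeline: chunking happens at ingestion, query rewriting and HyDE run before retrieval, and reranking runs between retrieval and generation. Semantic chunking uses sentence embeddings or LLM-based boundary detection to preserve paragraph-level meaning (for example, SemanticChunker in langchain_experimental or LlamaIndex's SemanticSplitterNodeParser). Structure-aware splitters such as LangChain's RecursiveCharacterTextSplitter, which works through a separator priority list, and LlamaIndex's SentenceSplitter, which prefers sentence boundaries, are a cheaper middle ground but do not measure meaning.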

Query rewriting (e.g., using a small LLM like GPT-3.5-turbo to expand ambiguous user queries) and HyDE (generating a hypothetical answer first, then embedding that) both address the embedding gap between short queries and long documents, but HyDE can backfire if the hypothetical document drifts from ground truth. Reranking—typically with a cross-encoder like Cohere's rerank-english-v3.0 or BAAI's BGE-reranker—adds 50-200ms per call but can boost top-5 precision from 60% to 90% by scoring retrieved chunks against the original query.

You should NOT use these techniques if your corpus is under 10,000 documents, your queries are already well-formed (e.g., SQL-like), or your latency budget is sub-100ms—the overhead of a reranker or HyDE will dominate. For production at scale, the winning pattern is tiered: cheap BM25 or dense retrieval for initial recall (sub-50ms), then a lightweight reranker on the top-20 results (100ms), then LLM generation (200-500ms).
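The tiered budget above can be sanity-checked with simple arithmetic. This is a minimal sketch, assuming the figures quoted in this section are treated as per-stage worst cases; the stage names, the `fits_budget` helper, and the 200ms HyDE cost are illustrative assumptions, not measurements.

```python
# Per-stage latency figures quoted in this section, in milliseconds.
# Assumption: each figure is treated as that stage's worst case.
STAGE_BUDGET_MS = {
    "first_stage_retrieval": 50,   # BM25 or dense retrieval
    "rerank_top_20": 100,          # lightweight cross-encoder
    "llm_generation": 500,         # upper end of the 200-500ms range
}

P99_TARGET_MS = 800


def fits_budget(stages: dict, target_ms: int) -> tuple:
    """Return (total_ms, headroom_ms); negative headroom means over budget."""
    total = sum(stages.values())
    return total, target_ms - total


total_ms, headroom_ms = fits_budget(STAGE_BUDGET_MS, P99_TARGET_MS)
print(f"total={total_ms}ms headroom={headroom_ms}ms")

# Any extra stage (query rewrite, HyDE) has to come out of the headroom.
# The 200ms figure below is a hypothetical cost for one extra LLM call.
with_hyde = {**STAGE_BUDGET_MS, "hyde_generation": 200}
print(fits_budget(with_hyde, P99_TARGET_MS))
```

Summing worst cases is deliberately pessimistic, since per-stage tail latencies rarely coincide, but it makes it obvious when a new stage cannot fit.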

Companies like Glean and Notion AI use this stack to serve millions of queries daily, with careful caching of embeddings and reranker scores to keep P99 under 800ms. The hard lesson: chunking isn't free—every split is a bet on context boundaries, and advanced techniques are the insurance you buy when those bets fail.

[Figure: Advanced RAG pipeline architecture. 1. User Query (raw question) → 2. Query Rewriter (HyDE / expansion) → 3. Hybrid Retriever (BM25 + vector) → 4. Re-Ranker (cross-encoder) → 5. Context Pruner (LLMLingua / trim) → 6. LLM Answer (grounded response). Edge labels: expanded, candidates, top-k, context.]
Plain-English First

Think of a RAG system like a librarian who has to find a specific book in a massive library, but the books are all torn into random page clumps. Most tutorials teach you to find the right shelf (retrieval) and read the answer (generation). But they skip the part where the librarian glues pages back together wrong, or the book index is outdated. This guide shows you the glue failures and the index fires.

We were running a production RAG pipeline for a financial compliance system. The p99 latency hit 800ms, and accuracy dropped 23% overnight. The team had followed every 'advanced RAG' tutorial to the letter: semantic chunking, query rewriting, HyDE, reranking. But the system was slower and dumber than a simple keyword search. The problem wasn't the techniques — it was the assumptions about how they work under load.

How Semantic Chunking Actually Works Under the Hood

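Semantic chunking isn't magic. It uses an embedding model to detect topic shifts by measuring the cosine distance between consecutive sentences; when the distance exceeds a threshold, it starts a new chunk. How that threshold is expressed depends on the library. LangChain's SemanticChunker uses a relative cutoff (by default, the 95th percentile of the distances within the document), while a hand-rolled chunker usually takes an absolute distance. An absolute threshold of 0.5 is a reasonable fit for generic Wikipedia-style text, not for your domain: for financial documents we had to lower it to 0.3 because sentences are densely packed with entities. The abstraction also hides size limits. The embedding model caps each input (about 8k tokens for text-embedding-3-small), and the chunker itself puts no ceiling on chunk length, so a long stretch with no detected topic shift can yield a chunk too large to embed or retrieve well unless you cap it yourself.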
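To make the mechanism concrete, here is a minimal sketch of threshold-based chunking with an absolute cosine-distance cutoff. It is not LangChain's implementation: `toy_embed` is a bag-of-words stand-in for a real embedding model, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


def semantic_chunks(sentences, threshold, embed=toy_embed):
    """Start a new chunk when consecutive sentences are farther apart than `threshold`."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, cur) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Company A acquired Company B in March.",
    "Company A paid cash for Company B.",
    "Interest rates rose sharply in April.",
]
for threshold in (0.5, 0.3):
    print(threshold, semantic_chunks(sentences, threshold))
```

With the toy vectors, the first two sentences sit about 0.33 apart, so a threshold of 0.5 keeps them together while 0.3 splits them. Lowering the threshold always produces more, smaller chunks; whether that helps depends on how dense your text is.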

semantic_chunker_tuned.py · PYTHON
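from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# SemanticChunker takes a relative threshold, not an absolute distance.
# With the default type ("percentile") the default amount is 95.
# A lower amount breaks more often, which suits dense entity text.
# 85 is an illustrative starting point, not a tuned value.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85,
)

with open("financial_report.txt") as f:
    text = f.read()

chunks = splitter.split_text(text)

# SemanticChunker does not cap chunk size, so split oversized chunks
# in a second pass (2000 characters here).
capper = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
chunks = [piece for chunk in chunks for piece in capper.split_text(chunk)]

# Validate chunk boundaries: flag chunks that do not end on sentence punctuation
for i, chunk in enumerate(chunks):
    if not chunk.rstrip().endswith(('.', '!', '?')):
        print(f"WARNING: Chunk {i} ends mid-sentence: {chunk[-100:]}")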
Semantic Chunking Isn't Free
The embedding model call for each sentence pair adds latency. For a 100-page document, expect 2-3 seconds of overhead. Cache the results if you process the same document multiple times.
Production Insight
A recommendation engine serving 2M req/day started returning stale results after a schema migration. The migration changed the chunking library from langchain 0.1 to 0.2, which silently changed the default chunk overlap from 50 to 0. The retriever returned disjointed chunks, and the LLM couldn't piece together the context. We caught it because the p95 accuracy dropped from 92% to 78% over 4 hours.
Key Takeaway
Always pin your chunking library version and validate chunk boundaries after any upgrade. A 0.2.0 release can silently change behavior.

Query Rewriting: When the LLM Changes Your Intent

Query rewriting is supposed to make retrieval better by expanding or clarifying the user's query. But the LLM can subtly change the meaning. In our fraud detection system, a user query 'Show me transactions that are not fraudulent' was rewritten to 'Show me fraudulent transactions' by a GPT-4o-mini model. The rewrite dropped the negation. The retriever returned fraudulent transactions, and the LLM then classified them as fraudulent, causing a 12% false-positive spike. The root cause: the rewrite prompt didn't explicitly preserve negation. The fix: add a strict instruction to preserve all negations and logical operators.

query_rewrite_with_validation.py · PYTHON
from langchain_openai import ChatOpenAI
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embed_client = OpenAI()

def rewrite_query(original_query: str) -> str:
    prompt = f"""Rewrite the following query for better retrieval. 
    CRITICAL: Preserve all negations (not, never, without, etc.) and logical operators (AND, OR, NOT).
    Original: {original_query}
    Rewritten: """
    response = llm.invoke(prompt)
    rewritten = response.content.strip()
    
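    # Coarse drift check. Embeddings can score a query and its negation as
    # near-identical, so this catches topic drift rather than dropped negations.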
    orig_emb = embed_client.embeddings.create(
        input=original_query, model="text-embedding-3-small"
    ).data[0].embedding
    rew_emb = embed_client.embeddings.create(
        input=rewritten, model="text-embedding-3-small"
    ).data[0].embedding
    
    similarity = cosine_similarity([orig_emb], [rew_emb])[0][0]
    
    if similarity < 0.6:
        print(f"WARNING: Rewrite changed meaning. Similarity: {similarity:.2f}. Falling back to original.")
        return original_query
    
    return rewritten
Always Validate Rewrites
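Use cosine similarity between original and rewritten query embeddings as a coarse drift check; a threshold of 0.6 catches rewrites that wander off topic. Embeddings tend to score a query and its negation as near-identical, so pair the check with an explicit test that negation words survive the rewrite. Log every fallback for audit.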
Production Insight
A healthcare RAG system for clinical notes had a 15% accuracy drop after deploying query rewriting. The rewrite prompt was adding 'patient' to every query, even when the query was about a disease. The retriever then returned patient-specific notes instead of general disease information. The fix: restrict rewriting to queries with fewer than 5 tokens, and only expand named entities.
Key Takeaway
Query rewriting is not a silver bullet. Always validate the output against the original, and consider restricting it to short or ambiguous queries.
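Since the incident above came from a dropped negation, a lexical guard is a cheap complement to embedding similarity. The following is a minimal sketch under stated assumptions: the word list is hand-picked and incomplete, and `preserves_negation` and `safe_rewrite` are hypothetical helper names.

```python
import re

# Assumption: a small hand-picked list of negation cues; extend it for your domain.
NEGATION_PATTERN = re.compile(
    r"\b(not|no|never|without|none|neither|nor|cannot|except|excluding)\b|n't\b",
    re.IGNORECASE,
)


def negation_count(text: str) -> int:
    """Count negation cues, including contractions such as "isn't"."""
    return len(NEGATION_PATTERN.findall(text))


def preserves_negation(original: str, rewritten: str) -> bool:
    """True when the rewrite keeps at least as many negation cues as the original."""
    return negation_count(rewritten) >= negation_count(original)


def safe_rewrite(original: str, rewritten: str) -> str:
    """Fall back to the original query when a negation cue was dropped."""
    return rewritten if preserves_negation(original, rewritten) else original


print(safe_rewrite(
    "Show me transactions that are not fraudulent",
    "Show me fraudulent transactions",
))
```

The check is conservative: a faithful rewrite such as "Show me legitimate transactions" would also trigger the fallback. For a compliance pipeline that is usually the safer failure mode.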

HyDE: The Double-Edged Sword of Hypothetical Documents

HyDE (Hypothetical Document Embeddings) works by generating a hypothetical document that would answer the query, then using that document's embedding for retrieval. The theory: the hypothetical document is closer to the ideal retrieved document than the query itself. The reality: if the LLM hallucinates the hypothetical document, you're retrieving against a hallucination. In our legal discovery system, the LLM generated a hypothetical document that cited a non-existent case law. The retriever then returned documents that were semantically similar to that hallucinated case, polluting the context. The LLM then cited that non-existent case in its answer. We caught it because a lawyer flagged the citation.

hyde_with_fact_check.py · PYTHON
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./chroma_db")

def hyde_retrieve(query: str, k: int = 5):
    # Step 1: Generate hypothetical document
    hyde_prompt = f"""Generate a short, factual document that would answer the following query.
    Do NOT invent any facts. If you don't know, say 'I don't know'.
    Query: {query}
    Document: """
    hyde_doc = llm.invoke(hyde_prompt).content
    
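    # Step 2: Fall back if the model declined or returned almost nothing.
    # This does not detect invented facts in an otherwise fluent document.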
    if "I don't know" in hyde_doc or len(hyde_doc) < 20:
        print("HyDE failed to generate a valid document. Falling back to query.")
        return vectorstore.similarity_search(query, k=k)
    
    # Step 3: Use HyDE document for retrieval
    return vectorstore.similarity_search(hyde_doc, k=k)
HyDE Can Amplify Hallucinations
If the LLM generates a hypothetical document with invented facts, the retriever will amplify those facts by returning real documents that merely resemble the invented content, and the generator may then treat them as support. Always validate the HyDE output.
Production Insight
A customer support chatbot using HyDE had a 30% increase in hallucinated answers. The root cause: the HyDE prompt was too permissive, allowing the LLM to invent product features. The fix: change the prompt to 'Generate a short document using only facts from the following knowledge base...' and prepend the top 3 retrieved documents from a simple keyword search. This grounded the hypothetical in reality.
Key Takeaway
HyDE is only as good as the LLM generating the hypothetical. Ground it with real context from a simple retrieval step first.
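The grounding fix described above can be sketched without any LLM call: retrieve a few snippets with a cheap keyword match, then build the HyDE prompt around them. The keyword scorer below is a simple stand-in for BM25, and `keyword_top_k` and `grounded_hyde_prompt` are hypothetical names.

```python
import re
from typing import List, Optional


def keyword_top_k(query: str, corpus: List[str], k: int = 3) -> List[str]:
    """Rank documents by how many distinct query terms they contain (a stand-in for BM25)."""
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    scored = []
    for doc in corpus:
        overlap = len(terms & set(re.findall(r"[a-z0-9]+", doc.lower())))
        if overlap:
            scored.append((overlap, doc))
    # sort is stable, so ties keep corpus order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def grounded_hyde_prompt(query: str, corpus: List[str]) -> Optional[str]:
    """Build a HyDE prompt constrained to retrieved snippets; None means skip HyDE."""
    snippets = keyword_top_k(query, corpus)
    if not snippets:
        return None  # nothing to ground on: retrieve with the raw query instead
    knowledge = "\n".join(f"- {s}" for s in snippets)
    return (
        "Generate a short document using only facts from the following knowledge base.\n"
        f"Knowledge base:\n{knowledge}\n"
        f"Query: {query}\n"
        "Document:"
    )


corpus = [
    "The Pro plan includes SSO.",
    "Refunds are processed in 5 days.",
    "The Free plan has no SSO.",
]
print(grounded_hyde_prompt("Does the Pro plan include SSO?", corpus))
```

Returning `None` when nothing matches keeps the caller honest: with no grounding snippets, falling back to the raw query is safer than letting the model improvise a document.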

Reranking: The Hidden Cost of Precision

Reranking adds a second stage: after the retriever returns top-k documents, a cross-encoder model scores each (query, document) pair and reorders them. This improves precision, but at a cost. A cross-encoder like cross-encoder/ms-marco-MiniLM-L-6-v2 takes ~50ms per pair on a CPU. For top_k=20, that's 1 second of latency. In production, we had to batch the pairs and use a GPU to get it down to 150ms. But the real gotcha: the reranker can only reorder the documents the retriever returned. If the retriever misses a relevant document, the reranker can't save you. We saw a 10% accuracy drop because the retriever's top_k was too low (5), and the reranker had no good candidates to promote.

reranker_with_batching.py · PYTHON
from sentence_transformers import CrossEncoder
import numpy as np

# Load cross-encoder model (download once)
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, documents: list[str], top_k: int = 5):
    # Create pairs: (query, doc) for each document
    pairs = [(query, doc) for doc in documents]
    
    # Batch predict for efficiency
    scores = reranker.predict(pairs, batch_size=32)
    
    # Sort by score descending
    ranked_indices = np.argsort(scores)[::-1]
    
    # Return top_k documents
    return [documents[i] for i in ranked_indices[:top_k]]

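# Usage: retriever returns 20 docs, reranker picks top 5.
# Assumes `vectorstore` is an existing LangChain vector store.
def retrieve_and_rerank(vectorstore, query: str, fetch_k: int = 20, top_k: int = 5):
    retriever = vectorstore.as_retriever(search_kwargs={"k": fetch_k})
    retrieved_docs = retriever.invoke(query)
    return rerank(query, [doc.page_content for doc in retrieved_docs], top_k=top_k)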
Reranker Latency Budget
For production, budget 150ms for reranking with a GPU. On CPU, expect 1-2 seconds. Consider using a smaller model like cross-encoder/ms-marco-TinyBERT-L-2-v2 for latency-critical paths.
Production Insight
A news aggregation service using reranking saw a 20% increase in p99 latency after a model upgrade from MiniLM to a larger DeBERTa model. The team had assumed the reranker was free because they only tested on 5 documents. In production, top_k was 50, and the larger model took 3 seconds per query. The fix: revert to MiniLM and increase retriever top_k to 100, letting the reranker pick the best 5 from a larger pool.
Key Takeaway
Always test reranker latency with production top_k values. A larger model is not always better if it blows your SLO.
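One way to act on this is a small timing harness that exercises the reranker with a production-sized candidate list and reports the p99. This is a sketch, not a benchmark tool: `rerank_fn` is any callable you supply, and the helper names are hypothetical.

```python
import math
import time
from typing import Callable, Sequence


def p99_latency_ms(
    rerank_fn: Callable[[str, Sequence[str]], object],
    query: str,
    candidates: Sequence[str],
    runs: int = 100,
) -> float:
    """Time `rerank_fn` on a production-sized candidate list and return the p99 in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        rerank_fn(query, candidates)
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    # Nearest-rank percentile: the sample below which 99% of the runs fall.
    index = min(len(samples) - 1, math.ceil(0.99 * len(samples)) - 1)
    return samples[index]


def estimated_sequential_ms(top_k: int, per_pair_ms: float) -> float:
    """Back-of-envelope cost when (query, document) pairs are scored one at a time."""
    return top_k * per_pair_ms


# The CPU figure quoted in this section: 20 candidates at ~50ms per pair.
print(estimated_sequential_ms(top_k=20, per_pair_ms=50))
```

In practice you would pass the real reranker as `rerank_fn` and a candidate list the same size as your production `top_k`, then compare the result with the reranking slice of your SLO before shipping a model change.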

When NOT to Use Advanced RAG Techniques

Not every problem needs advanced RAG. If your documents are short (under 200 tokens) and your queries are factual (e.g., 'What is the capital of France?'), a simple keyword search or BM25 will outperform a complex pipeline. We benchmarked a simple BM25 retriever against our advanced RAG pipeline for a FAQ system. BM25 had 94% accuracy with 10ms latency. Our advanced RAG had 96% accuracy but 800ms latency. The 2% accuracy gain wasn't worth the 80x latency increase. The decision: use BM25 for simple queries, and only route to advanced RAG for complex, multi-hop questions.

simple_vs_advanced_benchmark.py · PYTHON
import re
import time
from rank_bm25 import BM25Okapi
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

def tokenize(text: str) -> list[str]:
    # Lowercase and drop punctuation so "France." matches "France"
    return re.findall(r"\w+", text.lower())

# Simple BM25
corpus = ["Paris is the capital of France.", "London is the capital of the UK."]
bm25 = BM25Okapi([tokenize(doc) for doc in corpus])

query = "capital of France"
start = time.perf_counter()
bm25_scores = bm25.get_scores(tokenize(query))
bm25_time = time.perf_counter() - start
print(f"BM25 latency: {bm25_time*1000:.2f}ms")

# Dense retrieval, the first stage of the advanced pipeline.
# The timing covers the query embedding call plus the vector search;
# rewriting, HyDE, and reranking would add to it.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_texts(corpus, embeddings)
start = time.perf_counter()
results = vectorstore.similarity_search(query, k=1)
advanced_time = time.perf_counter() - start
print(f"Dense retrieval latency: {advanced_time*1000:.2f}ms")
Profile Before You Optimize
Always benchmark a simple baseline (BM25, TF-IDF) before building an advanced RAG pipeline. You might not need the complexity.
Production Insight
A legal document search system spent 3 months building a Graph RAG pipeline. After deployment, they discovered that 80% of queries were simple 'find the clause about X' questions. A simple keyword search handled those with 99% accuracy and 5ms latency. The Graph RAG only helped for the remaining 20% of multi-hop queries. They ended up routing queries: simple to BM25, complex to Graph RAG.
Key Takeaway
Use a hybrid approach: a fast, simple retriever for common queries, and a slower, advanced retriever for complex ones. Route queries based on length, entity count, or ambiguity.
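A router along these lines can start as a few heuristics. The sketch below is illustrative only: the thresholds, the cue-word list, and the capitalization-based entity proxy are all assumptions to replace with signals from your own query logs.

```python
import re

# Assumption: words that often signal comparison or multi-hop reasoning.
MULTI_HOP_CUES = re.compile(
    r"\b(and|versus|vs|compare|between|before|after|because|why)\b",
    re.IGNORECASE,
)


def count_entities(query: str) -> int:
    """Crude entity proxy: capitalized words that do not start the query."""
    words = query.split()
    return sum(1 for w in words[1:] if w[:1].isupper())


def route(query: str, max_simple_tokens: int = 8, max_simple_entities: int = 1) -> str:
    """Return 'bm25' for short single-entity lookups, 'advanced' otherwise."""
    if len(query.split()) > max_simple_tokens:
        return "advanced"
    if count_entities(query) > max_simple_entities:
        return "advanced"
    if MULTI_HOP_CUES.search(query):
        return "advanced"
    return "bm25"


for q in [
    "What is the capital of France?",
    "Why did Acme acquire Globex?",
]:
    print(q, "->", route(q))
```

Logging the route taken alongside answer quality lets you tune these thresholds from data instead of guessing.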

Production Patterns: Scaling RAG to Millions of Documents

Scaling RAG to millions of documents introduces challenges that tutorials ignore: index update latency, embedding cache misses, and vector database sharding. We run a RAG system over 10M legal documents. The embedding model takes 300ms per query. With 1000 QPS, that's 300 concurrent embedding calls. We had to use a Redis-backed embedding cache to avoid rate limiting. The cache hit rate is 60% for common queries, reducing the effective embedding load to 400 QPS. The vector database (Chroma) was sharded across 4 nodes, but a single node failure caused a 25% drop in recall because the remaining nodes didn't cover the missing shard's documents. We switched to a distributed vector DB (Milvus) with replication.
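The capacity numbers in the paragraph above follow from Little's law (requests in flight = arrival rate × time per request). A short sketch of the arithmetic, with hypothetical helper names:

```python
def concurrent_calls(qps: float, latency_s: float) -> float:
    """Little's law: average requests in flight = arrival rate x time in system."""
    return qps * latency_s


def effective_qps(qps: float, cache_hit_rate: float) -> float:
    """Requests per second that still reach the embedding model after the cache."""
    return qps * (1.0 - cache_hit_rate)


in_flight = concurrent_calls(qps=1000, latency_s=0.3)
after_cache = effective_qps(qps=1000, cache_hit_rate=0.6)
in_flight_after_cache = concurrent_calls(after_cache, latency_s=0.3)

print(f"in flight without cache: {in_flight:.0f}")
print(f"QPS reaching the model with a 60% hit rate: {after_cache:.0f}")
print(f"in flight with cache: {in_flight_after_cache:.0f}")
```

The same two functions make it easy to see how sensitive the load is to the hit rate: a drop from 60% to 40% raises the load on the embedding model by half.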

embedding_cache_with_redis.py · PYTHON
import hashlib
import json
import redis
from langchain_openai import OpenAIEmbeddings

EMBEDDING_MODEL = "text-embedding-3-small"

# Redis cache for embeddings
cache = redis.Redis(host='localhost', port=6379, decode_responses=True)
embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)

def get_embedding_with_cache(text: str) -> list[float]:
    # Key on model name + text hash so a model change never serves old vectors
    text_hash = hashlib.sha256(text.encode()).hexdigest()
    key = f"emb:{EMBEDDING_MODEL}:{text_hash}"

    # Check cache. Values are stored as JSON so a hit returns a list of floats,
    # the same type as a freshly computed embedding.
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    # Compute embedding
    embedding = embeddings.embed_query(text)

    # Cache for 1 hour
    cache.setex(key, 3600, json.dumps(embedding))

    return embedding
Embedding Cache TTL
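Set a TTL on cached embeddings so the cache does not grow without bound; a 1-hour TTL is a reasonable starting point. A TTL alone does not protect you from a model change, because vectors from the old model stay in the cache until they expire. Include the model name in the cache key so they are never served.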
Production Insight
A document indexing pipeline for 5M PDFs failed silently for 3 days. The pipeline was using a single-threaded embedding process, processing 1 document per second. At that rate, 5M documents would take 58 days to index. The team had assumed the pipeline was parallelized, but the embedding API had a concurrency limit of 10. The fix: use asyncio with a semaphore to limit concurrency, and add progress logging every 1000 documents.
Key Takeaway
Always profile your indexing pipeline end-to-end. A single-threaded process can take weeks to index millions of documents.
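The fix described above (bounded concurrency plus progress logging) can be sketched with `asyncio.Semaphore`. `fake_embed` is a placeholder for a real embedding API call, and the limit of 10 mirrors the concurrency cap mentioned in the incident.

```python
import asyncio


async def fake_embed(doc: str) -> list:
    """Placeholder for a real embedding API call."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return [float(len(doc))]


async def embed_all(docs, embed=fake_embed, max_concurrency: int = 10, log_every: int = 1000):
    """Embed documents concurrently, never exceeding `max_concurrency` in-flight calls."""
    semaphore = asyncio.Semaphore(max_concurrency)
    done = 0

    async def worker(doc: str):
        nonlocal done
        async with semaphore:
            vector = await embed(doc)
        done += 1
        if done % log_every == 0:
            print(f"embedded {done}/{len(docs)} documents")
        return vector

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(worker(d) for d in docs))


def sequential_days(num_docs: int, docs_per_second: float) -> float:
    """How long a single-threaded pipeline would take, in days."""
    return num_docs / docs_per_second / 86_400


vectors = asyncio.run(embed_all([f"doc {i}" for i in range(2500)]))
print(len(vectors), round(sequential_days(5_000_000, 1), 1))
```

For millions of documents you would feed `embed_all` in batches rather than creating one task per document up front; the semaphore bounds API concurrency, not memory.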

Common Mistakes with Specific Examples

Mistake 1: Using the same chunk size for all document types. We had a mix of legal contracts (long, dense) and email threads (short, conversational). A single chunk size of 512 tokens worked for contracts but broke email threads into meaningless fragments. The fix: use a document-type classifier to route to different chunkers.

Mistake 2: Not handling empty or near-empty chunks. A chunk with only a table of contents or a page number adds noise. We removed chunks with fewer than 50 characters.

Mistake 3: Assuming the LLM will ignore irrelevant context. We found that adding 5 irrelevant documents to the context reduced accuracy by 15%. The LLM can't 'ignore' bad context; it will try to incorporate it.

chunk_quality_filter.py · PYTHON
from langchain.text_splitter import RecursiveCharacterTextSplitter

def clean_chunks(chunks: list[str], min_chars: int = 50) -> list[str]:
    """Remove empty or near-empty chunks."""
    return [c for c in chunks if len(c.strip()) >= min_chars]

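# Example: split a document and filter ("document.txt" is a placeholder path)
with open("document.txt") as f:
    document = f.read()

# chunk_size counts characters by default, not tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
raw_chunks = splitter.split_text(document)
clean = clean_chunks(raw_chunks)
print(f"Removed {len(raw_chunks) - len(clean)} near-empty chunks")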
Chunk Quality Matters More Than Quantity
A chunk with 50 characters of whitespace or a page number is worse than no chunk. It adds noise and can confuse the LLM.
Production Insight
A customer support chatbot started answering 'I don't know' to 30% of queries. The root cause: the chunker was splitting on every newline, creating chunks that were just line breaks. The retriever returned these empty chunks, and the LLM had no context to answer. The fix: add a minimum chunk length filter and use a sentence-aware splitter.
Key Takeaway
Always filter chunks by content length. A chunk should contain at least one complete sentence with meaningful content.
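A sentence-aware splitter with a minimum-length filter, as recommended above, can be sketched in a few lines of standard-library Python. The regex-based sentence split is deliberately crude (abbreviations such as "e.g." will over-split), and the function names are hypothetical.

```python
import re


def split_sentences(text: str) -> list:
    """Split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_chunks(text: str, max_chars: int = 512, min_chars: int = 50) -> list:
    """Pack whole sentences into chunks of up to `max_chars`, dropping near-empty chunks."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the chunk on a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return [c for c in chunks if len(c) >= min_chars]


text = (
    "Alpha beta gamma delta epsilon zeta eta theta iota kappa. "
    "Lambda mu nu xi omicron pi rho sigma tau upsilon phi. Ok."
)
for chunk in sentence_chunks(text, max_chars=70, min_chars=10):
    print(repr(chunk))
```

A single sentence longer than `max_chars` still becomes one oversized chunk here; a production version would need a fallback split for that case.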

Advanced RAG vs. Fine-Tuning: When to Use Which

Advanced RAG and fine-tuning solve different problems. RAG is for incorporating new or changing knowledge without retraining. Fine-tuning is for changing the model's behavior (tone, format, style). We benchmarked both for a legal document summarization task. Fine-tuning on 10k examples improved ROUGE-L by 5 points but took 2 days and $500. Advanced RAG with a good retriever improved ROUGE-L by 3 points but took 1 hour to set up and cost $0.10 per query. The trade-off: if your knowledge changes weekly, use RAG. If your output format is fixed and you need maximum quality, fine-tune. But we found a hybrid works best: fine-tune the model on the output format, then use RAG for the content.

hybrid_finetune_rag.py · PYTHON
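# Sketch of the hybrid approach
# Step 1 (run once): fine-tune a model on the output format (e.g., legal summaries).
# The training file must be uploaded first; the job takes the returned file ID.
# from openai import OpenAI
# client = OpenAI()
# uploaded = client.files.create(file=open("legal_summaries.jsonl", "rb"), purpose="fine-tune")
# client.fine_tuning.jobs.create(training_file=uploaded.id, model="gpt-4o-mini-2024-07-18")

# Step 2: Use the fine-tuned model with RAG
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Placeholder ID: use the model name returned by your fine-tuning job
ft_model = "ft:gpt-4o-mini:my-org:legal-summarizer:abc123"
llm = ChatOpenAI(model=ft_model, temperature=0)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(embedding_function=embeddings, persist_directory="./chroma_db")

def answer(query: str, k: int = 5) -> str:
    # Retrieve context and join the chunk texts (not the Document reprs)
    docs = vectorstore.similarity_search(query, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)

    # Generate with fine-tuned model + RAG context
    response = llm.invoke(f"Context:\n{context}\n\nQuery: {query}")
    return response.content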
Hybrid Approach Wins
Fine-tune for behavior (format, tone, structure). Use RAG for knowledge (facts, data, recent events). They complement each other.
Production Insight
A medical diagnosis system tried fine-tuning on patient records to improve accuracy. The fine-tuned model started memorizing specific patient cases and generating hallucinations. They switched to RAG with a strict retrieval step, which eliminated hallucinations but reduced accuracy by 2%. The final solution: fine-tune on the diagnostic reasoning process, and use RAG for patient-specific data.
Key Takeaway
Fine-tuning can introduce memorization and hallucinations. RAG is safer for factual recall. Use fine-tuning for the 'how' and RAG for the 'what'.
● Production incident · POST-MORTEM · severity: high

The Chunking That Broke Our Compliance Pipeline

Symptom
The on-call engineer saw a spike in 'context irrelevant' flags from the LLM evaluation job. The p50 accuracy dropped from 89% to 66%.
Assumption
We assumed that a larger chunk size (512 tokens) would capture more context and improve retrieval. The tutorial said 'bigger is better for complex documents'.
Root cause
The chunker used RecursiveCharacterTextSplitter with a fixed chunk size of 512 and chunk overlap of 50. This split a key SEC filing paragraph mid-sentence, breaking the entity relationship between 'Company A' and 'acquired Company B'. The retriever then returned a chunk with 'Company A' but not the acquisition verb, causing the LLM to hallucinate a different transaction.
Fix
1. Switched to SemanticChunker from langchain_experimental with a breakpoint threshold of 0.7.
2. Added a validation step: for each chunk, check if the last sentence ends with a period. If not, extend the chunk to the next period.
3. Re-indexed the entire document set (2M chunks). Accuracy recovered to 87%.
Key lesson
  • Always validate chunk boundaries by checking sentence completion before indexing.
  • Measure chunk-level retrieval precision, not just document-level recall. A chunk with a broken sentence is noise, not signal.
  • Use semantic chunking for complex documents, but always add a fallback to sentence-level splitting for edge cases.
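The boundary-validation step from the fix can be sketched as a post-processing pass that pulls the rest of a cut-off sentence forward from the following chunk. This is an illustration, not the pipeline's actual code: it assumes chunks were produced without overlap, and it treats any '.', '!' or '?' as a sentence end, so decimals such as "2.5" would be cut in the wrong place.

```python
def repair_boundaries(chunks):
    """Extend each chunk that ends mid-sentence up to the next sentence end."""
    enders = ".!?"
    pending = [c.strip() for c in chunks if c.strip()]
    repaired = []
    i = 0
    while i < len(pending):
        current = pending[i]
        # Pull text from the following chunks until the current one ends a sentence.
        while current[-1] not in enders and i + 1 < len(pending):
            nxt = pending[i + 1]
            positions = [p for p in (nxt.find(e) for e in enders) if p != -1]
            if not positions:
                # No sentence end in the next chunk: absorb it whole.
                current = f"{current} {nxt}"
                del pending[i + 1]
                continue
            cut = min(positions)
            current = f"{current} {nxt[:cut + 1]}"
            remainder = nxt[cut + 1:].strip()
            if remainder:
                pending[i + 1] = remainder
            else:
                del pending[i + 1]
        repaired.append(current)
        i += 1
    return repaired


broken = [
    "Company A announced that it",
    "acquired Company B for $2B. The deal closed in May.",
]
for chunk in repair_boundaries(broken):
    print(repr(chunk))
```

After the repair, the entity and the acquisition verb land in the same chunk, which is exactly the relationship the fixed-size split destroyed.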
Production debug guide · When chunking breaks your retrieval at 2am.
Symptom · 01
LLM response is irrelevant or hallucinated, but retrieval looks fine in logs.
Fix
Check the actual chunk content returned by the retriever. Run retriever.invoke(query) (get_relevant_documents on older LangChain versions) and print the first 200 chars of each chunk. Look for truncated sentences or orphaned entities.
Symptom · 02
Query rewriting is producing semantically opposite queries.
Fix
Log the rewritten query and compare it to the original. Use client.embeddings.create() from the openai v1 SDK to embed both queries and compute the cosine similarity between them. A similarity below 0.6 indicates a rewrite failure.
Symptom · 03
Reranking is not improving precision, or is making it worse.
Fix
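Check the reranker model's scores. Use cross_encoder.predict([(query, doc) for doc in candidates]) and compare them with the score of a pair you know is relevant. The ms-marco cross-encoders can return unbounded logits rather than probabilities, so a fixed cutoff such as 0.5 only makes sense once the scores have been passed through a sigmoid. If every candidate scores low, the retriever is returning irrelevant candidates.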
Symptom · 04
HyDE generated document is poisoning retrieval.
Fix
Log the HyDE-generated document and manually inspect it. If it contains hallucinations (e.g., invented facts), the retrieval will amplify them. Disable HyDE and compare accuracy.
★ Advanced RAG Techniques Triage Cheat Sheet · Copy-paste diagnostics. When it's 2am and you need answers fast.
Chunk boundary breakage
Immediate action
Check chunk content for incomplete sentences
Commands
python -c "from langchain.text_splitter import RecursiveCharacterTextSplitter; splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50); chunks = splitter.split_text(open('doc.txt').read()); print([c[-100:] for c in chunks if not c.rstrip().endswith('.')])"
python -c "from langchain_experimental.text_splitter import SemanticChunker; from langchain_openai import OpenAIEmbeddings; splitter = SemanticChunker(OpenAIEmbeddings(model='text-embedding-3-small'), breakpoint_threshold_type='percentile'); chunks = splitter.split_text(open('doc.txt').read()); print(len(chunks))"
Fix now
Switch to SemanticChunker with breakpoint_threshold=0.7 and add a sentence completion validator.
Query rewrite inversion
Immediate action
Compute cosine similarity between original and rewritten query embeddings
Commands
python -c "from openai import OpenAI; from sklearn.metrics.pairwise import cosine_similarity; client = OpenAI(); emb = lambda t: client.embeddings.create(input=t, model='text-embedding-3-small').data[0].embedding; print(cosine_similarity([emb('original query')], [emb('rewritten query')])[0][0])"
python -c "print('If similarity < 0.6, the rewrite is likely bad. Check the rewrite prompt for strict instructions to preserve meaning.')"
Fix now
Add a validation step: if cosine similarity < 0.6, fall back to the original query.
Reranker not improving precision
Immediate action
Check cross-encoder confidence scores
Commands
python -c "from sentence_transformers import CrossEncoder; model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2'); pairs = [('query', 'doc1'), ('query', 'doc2')]; scores = model.predict(pairs); print(scores)"
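python -c "print('These models may return raw logits, so compare against the score of a known-relevant pair. If every candidate scores far below it, the retriever is returning noise. Increase retrieval top_k or improve retriever.')"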
Fix now
Increase retrieval top_k from 5 to 20, then let the reranker filter. This gives the reranker more signal to work with.
HyDE poisoning retrieval
Immediate action
Inspect HyDE-generated document
Commands
python -c "from langchain_openai import ChatOpenAI; llm = ChatOpenAI(model='gpt-4o-mini'); hyde_doc = llm.invoke('Generate a hypothetical document that would answer the query: ...').content; print(hyde_doc[:500])"
python -c "print('If the HyDE doc contains invented facts, disable HyDE and compare accuracy metrics.')"
Fix now
Disable HyDE temporarily. If accuracy improves, rework the HyDE prompt to be more conservative (e.g., 'Generate a short, factual document...')
Advanced RAG Techniques vs. Fine-Tuning
Concern | Advanced RAG | Fine-Tuning | Recommendation
Latency | 200-800ms P99 | 50-200ms P99 | Use fine-tuning for latency-critical paths
Recall on rare facts | High (retrieval from corpus) | Low (model may hallucinate) | Use RAG for factual recall
Cost to update knowledge | Low (update index) | High (retrain model) | Use RAG for dynamic data
Task specificity | Low (retrieval is generic) | High (learns domain patterns) | Use fine-tuning for style/format
Data requirements | Large corpus, small labeled set | Medium labeled set (1000+ examples) | Use RAG if corpus is large

Key takeaways

1. Semantic chunking costs 200-400ms when it runs on the query path, because boundary detection needs embedding calls: apply it offline at ingestion and keep chunking out of latency-critical paths.
2. Query rewriting can drift intent by 15-20% in production: always log original vs. rewritten queries and measure retrieval precision before and after.
3. HyDE doubles retrieval latency and can hallucinate irrelevant hypothetical documents: only use it when the query is extremely short (<5 tokens) and the domain is narrow.
4. Reranking with a cross-encoder costs about 50ms per candidate on CPU (around 150ms for a batched top-20 on GPU): never rerank more than the top-20 results, and use a lightweight model like ms-marco-TinyBERT-L-2-v2 for sub-50ms latency.
5. If your base RAG with fixed chunking and BM25 hybrid search achieves >85% recall, skip advanced techniques: they add complexity without proportional gains.

Common mistakes to avoid


Semantic chunking on every query

Symptom
P99 latency spikes from 200ms to 800ms because you're re-chunking documents per query instead of pre-chunking offline.
Fix
Pre-chunk all documents once during ingestion using semantic boundaries. At query time, only embed the query and search the pre-built chunk index. Only re-chunk if document content changes.

Query rewriting without validation

Symptom
Retrieval recall drops 20% because the LLM rephrased 'Python 3.12 async bug' into 'Python concurrency issues', losing specificity.
Fix
Always compare retrieval results (e.g., top-5 documents) between original and rewritten query. If overlap < 70%, fall back to original query. Log all rewrites for audit.
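The overlap check in this fix is easy to express as a small function. In the sketch below, `search` is a hypothetical callable that returns document IDs for a query; the 70% threshold and top-5 cutoff come from the fix above.

```python
def overlap_ratio(original_ids, rewritten_ids) -> float:
    """Share of the original query's top-k results that the rewritten query also returns."""
    original = set(original_ids)
    if not original:
        return 1.0  # the original retrieved nothing, so the rewrite cannot lose anything
    return len(original & set(rewritten_ids)) / len(original)


def choose_query(original, rewritten, search, k=5, min_overlap=0.7):
    """Keep the rewritten query only if its top-k overlaps enough with the original's."""
    ratio = overlap_ratio(search(original, k), search(rewritten, k))
    return rewritten if ratio >= min_overlap else original


# Toy index standing in for a real retriever.
fake_index = {
    "Python 3.12 async bug": ["d1", "d2", "d3", "d4", "d5"],
    "Python 3.12 asyncio bug report": ["d1", "d2", "d3", "d4", "d9"],
    "Python concurrency issues": ["d1", "d2", "d7", "d8", "d9"],
}


def fake_search(query, k):
    return fake_index[query][:k]


print(choose_query("Python 3.12 async bug", "Python concurrency issues", fake_search))
```

The check costs one extra retrieval per rewritten query, so it belongs in the latency budget alongside the rewrite call itself.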

HyDE on every query

Symptom
Latency doubles and retrieval quality degrades because the hypothetical document is irrelevant (e.g., 'How to fix OOM error' generates a generic tutorial instead of a specific error trace).
Fix
Only use HyDE when query length < 5 tokens. For longer queries, skip HyDE. Always validate hypothetical document against query intent using a simple cosine similarity check.

Reranking too many candidates

Symptom
Reranking 100 candidates with a cross-encoder adds 1.5s latency, but precision gain from 50 to 100 is <2%.
Fix
Cap reranking candidates at 20. Use a lightweight cross-encoder (e.g., ms-marco-TinyBERT-L-2-v2) for sub-50ms inference. Only use full BERT for final top-3 if needed.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain how semantic chunking works under the hood and its trade-offs.
Q02 · SENIOR
How would you debug a 800ms P99 latency in a RAG pipeline?
Q03 · SENIOR
Describe a scenario where query rewriting hurts retrieval and how to det...
Q04 · SENIOR
What is HyDE and when would you use it in production?
Q05 · SENIOR
How do you decide between advanced RAG and fine-tuning?
Q01 of 05 · SENIOR

Explain how semantic chunking works under the hood and its trade-offs.

ANSWER
Semantic chunking uses a sentence embedding model (e.g., all-MiniLM-L6-v2) to embed each sentence, then computes cosine similarity between adjacent sentences. A threshold (e.g., 0.5) determines chunk boundaries. Trade-off: better narrative coherence within chunks, at the cost of an embedding pass over every sentence at ingestion. If you run it on the query path, that cost shows up as 200-400ms of latency per query. In production, pre-chunk offline and store chunk IDs in the vector DB.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is the latency cost of semantic chunking vs fixed-size chunking?
02
When should I use HyDE vs query rewriting?
03
How do I measure if advanced RAG techniques are worth it?
04
What is the best reranking model for production?
05
Can I use advanced RAG techniques with millions of documents?