Embeddings and Semantic Search — The 3AM Incident Where Our Vector DB Returned 100% Wrong Results
We deployed semantic search and got 100% irrelevant results.
20+ years shipping production ML systems and the infrastructure behind them. Drawn from code that ran under real load.
- Embedding Models: Not all are equal. We saw a 40% accuracy drop switching from text-embedding-3-small to all-MiniLM-L6-v2 on a legal document search. Test on your domain.
- Vector Index: FAISS IVF with 100 centroids gave us 95% recall at 10ms query time. HNSW was faster but used 3x memory. Profile your latency vs. memory budget.
- Hybrid Search: Pure vector search failed on exact-match queries like order IDs. Adding a BM25 reranker fixed that. We now use reciprocal rank fusion with weights.
- Embedding Drift: Model updates change the vector space silently. We pinned a specific model version after a sentence-transformers upgrade silently broke our index.
- Normalization: Forgetting to normalize embeddings before cosine similarity search caused a 15% recall drop. Normalize once at write time, not at query time.
- Chunking Strategy: Overlapping chunks of 256 tokens with 32-token overlap gave us the best balance of context and precision for our RAG pipeline.
Imagine trying to find a book in a library by describing its meaning instead of its title. Embeddings turn every sentence into a unique 'fingerprint' of numbers. Semantic search compares these fingerprints to find the closest match. If your fingerprint is wrong (bad model) or the library's catalog is corrupted (index drift), you get the wrong book.
We deployed a semantic search system for a legal document retrieval service. At 2AM, the on-call engineer got paged: the top-5 results for a user query were completely irrelevant — documents about 'contract termination' returned results about 'employee onboarding.' The p99 latency had also spiked from 50ms to 2.3 seconds. The root cause? A silent embedding model upgrade that changed the vector space, combined with a FAISS index that wasn't rebuilt. This is the story of that night and everything we learned since.
How Embeddings Actually Work Under the Hood
Embeddings are dense vector representations of text. They are generated by transformer models that convert tokens into a fixed-size vector (e.g., 384 dimensions for all-MiniLM-L6-v2). The key insight: these vectors encode semantic meaning such that similar texts have similar vectors (high cosine similarity).
Under the hood, the model applies a series of attention layers, pooling, and normalization. The output is a vector where each dimension captures some latent feature of the input. The abstraction hides the fact that the model's behavior can change with library updates.
The production implication: you must treat the embedding model as a black box that can silently change. Pin the exact model revision, not just the library version. Use a hash of the model's configuration to detect drift.
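One way to act on the "hash of the model's configuration" idea is sketched below. It assumes you record the model name, pinned revision, library version, and output settings in a plain dict when the index is built. The field names and values are hypothetical placeholders, not anything sentence-transformers defines.

```python
import hashlib
import json


def model_fingerprint(config: dict) -> str:
    """Return a stable SHA-256 hex digest of a model configuration dict."""
    # sort_keys makes the digest independent of key insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_same_model(index_fingerprint: str, query_config: dict) -> None:
    """Raise if the query-time model differs from the one that built the index."""
    current = model_fingerprint(query_config)
    if current != index_fingerprint:
        raise RuntimeError(
            f"embedding model mismatch: index={index_fingerprint[:12]} "
            f"query={current[:12]}"
        )


# Hypothetical configuration values, for illustration only.
indexed_with = {
    "model": "all-MiniLM-L6-v2",
    "revision": "<pinned-revision-hash>",
    "library": "sentence-transformers==2.2.0",
    "dimension": 384,
    "normalize": True,
}
stored = model_fingerprint(indexed_with)  # saved next to the index at build time

assert_same_model(stored, dict(indexed_with))  # same config: passes silently

upgraded = dict(indexed_with, library="sentence-transformers==2.3.0")
try:
    assert_same_model(stored, upgraded)
except RuntimeError as exc:
    print(exc)
```

A configuration hash only catches changes you recorded. It will not notice a model whose weights changed under the same name and revision, so it works best alongside a test that compares actual embeddings.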
Set normalize_embeddings=True in model.encode().
Practical Implementation: Building a Semantic Search Pipeline
We'll build a complete pipeline: load documents, generate embeddings, index with FAISS, and query. We'll use the all-MiniLM-L6-v2 model and FAISS IVF index. This is production-ready for up to 1 million documents on a single machine.
Key choices: IVF with 100 centroids gives a good trade-off between speed and recall. We use faiss.IndexFlatIP as the coarse quantizer and faiss.IndexIVFFlat for the inverted file. We set nprobe=10 at query time for 95% recall at 10ms latency.
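To make the nprobe trade-off concrete, here is a toy inverted-file index in pure Python. It is not FAISS and not the pipeline described above: the centroids are simply the first ten vectors rather than the result of k-means training, and the data is random. It only illustrates why probing fewer lists is faster but can miss neighbors.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


class ToyIVF:
    """Inverted-file index: vectors are bucketed by their closest centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.lists = [[] for _ in centroids]  # one inverted list per centroid

    def add(self, doc_id, vec):
        best = max(range(len(self.centroids)),
                   key=lambda c: dot(vec, self.centroids[c]))
        self.lists[best].append((doc_id, vec))

    def search(self, query, k, nprobe):
        # Rank centroids by similarity and scan only the nprobe closest lists.
        order = sorted(range(len(self.centroids)),
                       key=lambda c: dot(query, self.centroids[c]), reverse=True)
        candidates = [item for c in order[:nprobe] for item in self.lists[c]]
        scored = sorted(((dot(query, vec), doc_id) for doc_id, vec in candidates),
                        reverse=True)
        return [doc_id for _, doc_id in scored[:k]]


rng = random.Random(0)
vectors = [normalize([rng.gauss(0, 1) for _ in range(16)]) for _ in range(500)]
index = ToyIVF(centroids=vectors[:10])  # simplification: no k-means training
for i, v in enumerate(vectors):
    index.add(i, v)

query = normalize([rng.gauss(0, 1) for _ in range(16)])
exact = index.search(query, k=5, nprobe=10)   # probing every list is exhaustive
approx = index.search(query, k=5, nprobe=3)   # scans only 3 of the 10 lists
recall = len(set(exact) & set(approx)) / 5
print(f"recall@5 with nprobe=3: {recall:.2f}")
```

Raising nprobe moves the result toward the exhaustive answer at the cost of scanning more vectors, which is the same dial the nprobe=10 setting turns in FAISS.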
Use METRIC_INNER_PRODUCT with normalized embeddings.
Use faiss.index_cpu_to_all_gpus to accelerate training on GPU. For large datasets (>10M), we use faiss.IndexIVFPQ to reduce memory by 4x at the cost of 1% recall.
When NOT to Use Semantic Search
Semantic search is not a silver bullet. It fails on exact-match queries (e.g., order IDs, product codes, dates). It also struggles with highly specialized domains where the embedding model has not been fine-tuned (e.g., medical jargon, legal citations).
In these cases, hybrid search (vector + keyword) is better. Use BM25 for exact matches and semantic search for meaning. Combine results with reciprocal rank fusion (RRF).
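A minimal sketch of weighted reciprocal rank fusion follows. Each document scores the sum of weight / (k + rank) over the lists it appears in. The constant k=60 is a commonly used default and an assumption here, as are the document IDs and the equal weights.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse ranked lists of doc IDs: score(d) = sum_i w_i / (k + rank_i(d))."""
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Sort by fused score, breaking ties by doc ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["order-8841", "doc-12", "doc-7"]   # hypothetical keyword results
vector_hits = ["doc-7", "doc-12", "doc-99"]     # hypothetical semantic results
fused = reciprocal_rank_fusion([bm25_hits, vector_hits], weights=[1.0, 1.0])
print(fused)
```

RRF uses only ranks, so BM25 scores and cosine similarities never need to be put on a common scale. Raising the BM25 weight is one way to favor exact matches such as order IDs.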
Another case: if your corpus is small (<1000 documents), a simple TF-IDF or BM25 may be faster and equally effective. Semantic search overhead (model loading, embedding generation) may not be worth it.
Production Patterns & Scale: Handling 10M+ Documents
At scale, FAISS IVF with PQ (Product Quantization) is your friend. It reduces memory by 4x with minimal recall loss. Use faiss.IndexIVFPQ with M=8 (8 sub-vectors) and nbits=8. This compresses each vector to an 8-byte code: one byte for each of the 8 sub-vectors. The vector dimension must be divisible by M.
For distributed search, use FAISS with a sharded index. Each shard handles a subset of documents. At query time, broadcast the query to all shards and merge results.
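The broadcast-and-merge step can be sketched as follows. Shards here are in-memory dicts standing in for separate FAISS indexes, and the loop is sequential. A real deployment would query shards concurrently over the network, so treat this as an illustration of the merge logic only.

```python
import heapq


def search_shard(shard, query, k):
    """Return one shard's top-k (score, doc_id) pairs by inner product."""
    scored = ((sum(q * x for q, x in zip(query, vec)), doc_id)
              for doc_id, vec in shard.items())
    return heapq.nlargest(k, scored)


def search_all(shards, query, k):
    """Send the query to every shard, then merge the per-shard top-k lists."""
    partial = [hit for shard in shards for hit in search_shard(shard, query, k)]
    return heapq.nlargest(k, partial)


# Hypothetical two-shard corpus with 2-dimensional unit vectors.
shards = [
    {"a": [1.0, 0.0], "b": [0.6, 0.8]},
    {"c": [0.0, 1.0], "d": [0.8, 0.6]},
]
print(search_all(shards, query=[1.0, 0.0], k=2))
```

Each shard must return its own top-k, not fewer, because the global top-k could come entirely from one shard.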
Another pattern: use a vector database like ChromaDB or Qdrant for persistence and replication. They handle index updates, rebalancing, and replication out of the box.
Monitoring: track embedding generation latency, index query latency, and recall. Use a hold-out set of 1000 known queries to measure recall weekly.
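Measuring recall on a hold-out set needs only the retrieved IDs and the known relevant IDs for each query. The sketch below assumes both are available as dicts keyed by query ID. The names and sample data are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc IDs that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def mean_recall_at_k(results, ground_truth, k=5):
    """Average recall@k over every query in the hold-out set."""
    scores = [recall_at_k(results.get(query_id, []), relevant, k)
              for query_id, relevant in ground_truth.items()]
    return sum(scores) / len(scores)


# Hypothetical hold-out set: query ID -> relevant doc IDs.
ground_truth = {"q1": ["d1", "d2"], "q2": ["d9"]}
# Hypothetical search output: query ID -> ranked doc IDs.
results = {"q1": ["d1", "d5", "d2"], "q2": ["d3", "d4"]}
print(mean_recall_at_k(results, ground_truth, k=5))
```

Queries missing from the results count as zero recall, so a broken pipeline lowers the metric instead of silently shrinking the sample.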
Move the index to GPU with index = faiss.index_cpu_to_all_gpus(index). Training 10M vectors on CPU can take hours. On GPU, it's minutes.
Common Mistakes with Specific Examples
- Not normalizing embeddings: Inner product on unnormalized vectors is not cosine similarity, so rankings come out wrong. Always normalize. Example: model.encode(text, normalize_embeddings=True).
- Using the wrong metric: FAISS default is L2 distance. For cosine similarity, use inner product after normalization. Set faiss.METRIC_INNER_PRODUCT.
- Not rebuilding index after model upgrade: We learned this the hard way (see incident). Pin model revision.
- Ignoring chunking strategy: For RAG, chunk size matters. Too small: lost context. Too large: irrelevant results. We use 256 tokens with 32-token overlap.
- Not testing recall: We deployed with 80% recall and users complained. Use a hold-out set of known queries to measure recall weekly.
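The normalization and metric mistakes in the list above come down to one fact: inner product equals cosine similarity only for unit-length vectors. A small pure-Python sketch with made-up 2-dimensional vectors shows the ranking flip.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 1.0]
short_match = [1.0, 1.0]      # same direction as the query
long_offtopic = [10.0, 0.0]   # different direction, but large magnitude

# Raw inner product rewards magnitude, so the off-topic vector wins.
print(inner_product(query, short_match), inner_product(query, long_offtopic))

# After normalization, inner product is cosine similarity and direction wins.
q = l2_normalize(query)
print(inner_product(q, l2_normalize(short_match)),
      inner_product(q, l2_normalize(long_offtopic)))
```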
Comparison vs Alternatives: FAISS vs ChromaDB vs Qdrant
FAISS is a library, not a database. It gives you full control over indexing and search, but you manage persistence, replication, and updates yourself.
ChromaDB is a lightweight vector database. It's easy to set up (pip install) and supports metadata filtering. Good for small to medium datasets (<1M documents).
Qdrant is a production-grade vector database. It supports filtering, sharding, replication, and CRUD operations. Better for large-scale, multi-tenant systems.
Our recommendation: start with FAISS for experimentation, move to ChromaDB for simple deployments, and use Qdrant for production at scale.
Debugging and Monitoring in Production
Monitoring semantic search in production requires tracking both the system (latency, throughput) and the quality (recall, relevance).
- Embedding generation latency: p50, p99
- Index query latency: p50, p99
- Recall@5 (measured weekly)
- Cosine similarity distribution (should be stable)
- Index size and memory usage
Tools: Prometheus for metrics, Grafana for dashboards. Use OpenTelemetry for tracing.
Alert on: recall drop >5%, latency spike >2x, index size change >10%.
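The three alert conditions can be expressed as a small check, sketched below. It assumes "recall drop >5%" means five absolute percentage points and that index size is tracked as a vector count. Both are interpretations, and the metric names are hypothetical. In practice these would usually live as alert rules in your metrics system rather than in application code.

```python
def check_alerts(baseline, current):
    """Return alert messages for recall, p99 latency, and index size changes."""
    alerts = []
    # Assumption: the 5% threshold is in absolute recall points (0.05).
    if baseline["recall_at_5"] - current["recall_at_5"] > 0.05:
        alerts.append("recall dropped by more than 5 points")
    if current["p99_ms"] > 2 * baseline["p99_ms"]:
        alerts.append("p99 latency more than doubled")
    size_change = (abs(current["index_vectors"] - baseline["index_vectors"])
                   / baseline["index_vectors"])
    if size_change > 0.10:
        alerts.append("index size changed by more than 10%")
    return alerts


# Hypothetical snapshots; the latency values mirror the incident described earlier.
baseline = {"recall_at_5": 0.95, "p99_ms": 50, "index_vectors": 1_000_000}
current = {"recall_at_5": 0.40, "p99_ms": 2300, "index_vectors": 1_000_000}
for message in check_alerts(baseline, current):
    print(message)
```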
The Cold Start Problem: Why Your First 1,000 Embeddings Will Lie to You
When you deploy semantic search fresh, your first batch of embeddings looks great on a laptop. In production, it's another story. The root cause? Your vector space hasn't stabilized. New documents shift the distribution. Your nearest neighbors logic is based on a sparsely populated space that doesn't represent real-world queries.
Here's the fix: Warm-start your index with a representative dataset. That means 10,000+ documents that mirror your production traffic. Don't seed with your training data — seed with the data your users will actually query. Use a stratified sample if you have categories. This prevents the dreaded 'N-nearest neighbors returning irrelevant results' bug that I've seen take down three separate search pipelines.
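A stratified seed set can be drawn with the standard library alone. This sketch assumes each document carries a category label. The function name, the category accessor, and the sample corpus are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(docs, category_of, n, seed=0):
    """Sample about n docs, keeping each category close to its share of docs."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for doc in docs:
        by_category[category_of(doc)].append(doc)
    sample = []
    for category in sorted(by_category):
        group = by_category[category]
        # Proportional quota, with at least one document per category.
        quota = max(1, round(n * len(group) / len(docs)))
        sample.extend(rng.sample(group, min(quota, len(group))))
    return sample


# Hypothetical corpus: 80% contract documents, 20% HR documents.
docs = [("contract", i) for i in range(800)] + [("hr", i) for i in range(200)]
seed_set = stratified_sample(docs, category_of=lambda d: d[0], n=100)
print(len(seed_set))
```

Because quotas are rounded per category, the result can differ from n by a few documents when there are many small categories.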
During the warm-start phase, run batch inference at lower concurrency. Embedding models are stateless, but your vector database isn't. Build the index before the first user hits the endpoint. Otherwise, you're asking your retriever to swim in an empty pool.
Why Batch Encoding Breaks Your Latency Budget (And How to Fix It)
Single-query embedding inference is fast. But when you have 10,000+ documents to encode for an index refresh, doing it one-by-one is a death march. The common reaction? Crank up batch size to 512. That's wrong.
Large batches cause memory spikes on your embeddings server — especially with transformer models like Sentence-BERT. I've seen an otherwise stable service OOM-kill itself because a batch of 512 768-dimensional vectors consumed 4GB of RAM for one encode call. The fix is batch sizing based on model architecture AND available VRAM.
Rule of thumb: keep model hidden size × sequence length × batch size × precision (bytes) under 70% of GPU memory. For all-MiniLM-L6-v2 with a 384-dim hidden size and 128-token sequences, a batch of 32 is safe on an 8GB card.
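The rule of thumb and fixed-size batching can be sketched together. The product below is the size of a single activation tensor, so treat it as a floor: real peak memory is higher because of per-layer activations and attention matrices. The encode_batch argument is a hypothetical stand-in for a call such as model.encode on a list of texts.

```python
from itertools import islice


def activation_bytes(hidden_size, seq_len, batch_size, bytes_per_value=4):
    """Rule-of-thumb size of one activation tensor, in bytes."""
    return hidden_size * seq_len * batch_size * bytes_per_value


def fits_budget(hidden_size, seq_len, batch_size, gpu_bytes,
                bytes_per_value=4, fraction=0.70):
    """True if the estimate stays under the given fraction of GPU memory."""
    estimate = activation_bytes(hidden_size, seq_len, batch_size, bytes_per_value)
    return estimate <= fraction * gpu_bytes


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def encode_corpus(texts, encode_batch, batch_size=32):
    """Encode texts in fixed-size batches instead of one huge call."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(encode_batch(batch))
    return vectors


print(activation_bytes(384, 128, 32))           # fp32, batch of 32
print(fits_budget(384, 128, 32, 8 * 1024**3))   # 8GB card
```

Because the estimate is a floor, the safer workflow is to start at a modest batch size and raise it while watching measured peak memory.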
Also: never interleave document encoding with query encoding unless your model has explicit support. Positional encodings differ between training and inference unless you manage sequence lengths carefully.
The Silent Embedding Drift That Broke Our Semantic Search
We assumed that upgrading sentence-transformers from 2.2.0 to 2.3.0 was a minor patch that would not affect embedding quality. We did not pin the model version in our requirements.txt.
The all-MiniLM-L6-v2 model in sentence-transformers 2.3.0 had a different internal tokenizer configuration than 2.2.0. The same input text produced a different embedding vector. Our FAISS index was built with the old vectors, so queries encoded with the new model were searching in a different space. Cosine similarity dropped from an average of 0.85 to 0.12.
1. Pinned sentence-transformers==2.2.0 in requirements.txt and pinned the model by its Hugging Face revision hash.
2. Rebuilt the FAISS index from scratch using the correct model version.
3. Added a CI pipeline that compares embedding cosine similarity for a fixed set of test sentences before and after any model upgrade. If the mean similarity drops below 0.95, the build fails.
4. Added a version field to the index metadata so we can detect mismatches at query time.
- Pin both the library version and the model revision hash. A model upgrade is not a patch.
- Add a regression test that measures embedding stability. Compare cosine similarity of a fixed test set across versions.
- Store the embedding model version in index metadata. Validate it at query time and return a clear error if mismatched.
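The embedding stability regression test can be sketched as below. It assumes you keep the old model's vectors for a fixed set of sentences as a golden file and compute the new vectors with the candidate model. Tiny hand-written vectors stand in for both here, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embedding_stability(old_vectors, new_vectors):
    """Mean cosine similarity between old and new embeddings of the same texts."""
    if len(old_vectors) != len(new_vectors):
        raise ValueError("old and new vector sets must have the same length")
    sims = [cosine(old, new) for old, new in zip(old_vectors, new_vectors)]
    return sum(sims) / len(sims)


def check_upgrade(old_vectors, new_vectors, threshold=0.95):
    """Fail the build if the mean similarity falls below the threshold."""
    mean_sim = embedding_stability(old_vectors, new_vectors)
    if mean_sim < threshold:
        raise AssertionError(
            f"embedding drift: mean cosine {mean_sim:.3f} < {threshold}"
        )
    return mean_sim


golden = [[1.0, 0.0], [0.0, 1.0]]    # stand-in for stored old embeddings
same = [[2.0, 0.0], [0.0, 3.0]]      # same directions: no drift
drifted = [[0.0, 1.0], [1.0, 0.0]]   # axes swapped: total drift

print(check_upgrade(golden, same))
try:
    check_upgrade(golden, drifted)
except AssertionError as exc:
    print(exc)
```

A failing check does not mean the new model is worse, only that the existing index is no longer compatible with it and must be rebuilt before the upgrade ships.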
- python -c "import numpy as np; from sentence_transformers import SentenceTransformer; model = SentenceTransformer('all-MiniLM-L6-v2'); emb1 = model.encode('test query'); emb2 = model.encode('test document'); print(emb1 @ emb2 / (np.linalg.norm(emb1)*np.linalg.norm(emb2)))" Compare with a known-good embedding from a previous version.
- An index built with faiss.index_factory with IVF might have too few centroids. Check index.nlist and index.nprobe to verify. Also check if the index is on disk or in memory. Use faiss.read_index to load it and faiss.index_cpu_to_all_gpus for GPU acceleration.
- index.ntotal should be > 0. If it's zero, the index was not built or was corrupted. Check the indexing pipeline logs for errors. Also verify that the embedding dimension matches: index.d vs len(query_embedding).
- Use import tracemalloc; tracemalloc.start() to track allocations. Also check for memory growth around the embedding model, such as code that keeps every array returned by model.encode. Note that model.encode(sentences, show_progress_bar=False) only suppresses the progress bar and does not change memory use.
- python -c "import sentence_transformers; print(sentence_transformers.__version__)"
- python -c "from sentence_transformers import SentenceTransformer; m = SentenceTransformer('all-MiniLM-L6-v2'); print(m._modules)"
- pip install sentence-transformers==2.2.0 then re-run indexing script.
Key takeaways
Common mistakes to avoid
- Forgetting to normalize embeddings: vec = vec / np.linalg.norm(vec). Use inner product search instead of cosine if normalized.
- Mixing embedding models across index/query: Pin the model version (e.g., sentence-transformers/all-MiniLM-L6-v2@v1). Re-embed entire corpus on model change.
- Using top-k without a similarity cutoff: results = [r for r in results if r.score > 0.7]. Return empty set if none pass.
- Not chunking documents properly
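For the chunking mistake, here is a minimal sketch of fixed-size chunks with overlap, using the 256-token size and 32-token overlap mentioned earlier. It operates on a pre-tokenized list. The placeholder tokens stand in for real model tokens, which you would get from the embedding model's own tokenizer.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into chunks that share `overlap` tokens with the next."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks


tokens = [f"tok{i}" for i in range(600)]  # placeholder tokens for illustration
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])
```

The early break avoids emitting a trailing chunk that contains only tokens already covered by the previous one.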
Interview Questions on This Topic
Explain how embeddings are generated for semantic search. What happens under the hood?