Embeddings and Semantic Search — The 3AM Incident Where Our Vector DB Returned 100% Wrong Results
We deployed semantic search and got 100% irrelevant results.
- Embedding Models: Not all are equal. We saw a 40% accuracy drop switching from text-embedding-3-small to all-MiniLM-L6-v2 on a legal document search. Test on your domain.
- Vector Index: FAISS IVF with 100 centroids gave us 95% recall at 10ms query time. HNSW was faster but used 3x memory. Profile your latency vs. memory budget.
- Hybrid Search: Pure vector search failed on exact-match queries like order IDs. Adding a BM25 reranker fixed that. We now use reciprocal rank fusion with weights.
- Embedding Drift: Model updates change the vector space silently. We pinned a specific model version after a sentence-transformers upgrade silently broke our index.
- Normalization: Forgetting to normalize embeddings before cosine similarity search caused a 15% recall drop. Normalize once at write time, not at query time.
- Chunking Strategy: Overlapping chunks of 256 tokens with 32-token overlap gave us the best balance of context and precision for our RAG pipeline.
Embeddings are dense vector representations of data—text, images, audio—that capture semantic meaning in a high-dimensional space (typically 384 to 4096 dimensions). They exist because traditional keyword search (BM25, TF-IDF) fails on synonyms, context, and intent: searching 'car repair' won't match 'auto mechanic' unless you've manually built a synonym list.
Embeddings solve this by mapping similar concepts to nearby points in vector space, enabling semantic search where you find results by meaning rather than exact token matches. Under the hood, transformer models like all-MiniLM-L6-v2 (384 dimensions, 80MB) or OpenAI's text-embedding-3-large (3072 dimensions) convert input into a fixed-length float array through a final pooling layer that averages token-level representations.
In the ecosystem, embeddings are the foundation of retrieval-augmented generation (RAG), recommendation systems, and clustering. You'd use them when you need to find 'conceptually related' items—like matching a bug report to similar past issues, or finding relevant documentation for a user query.
But they're not a universal hammer: for exact ID lookups, embeddings are overkill (use a hash map). For structured filtering (e.g., 'price < $50'), you still need metadata filters—pure vector search ignores numeric ranges. And for rare or domain-specific terms (e.g., 'CVE-2024-1234'), keyword search often outperforms embeddings because the vector space hasn't seen enough training examples.
Real-world implementations use approximate nearest neighbor (ANN) indexes like FAISS (Facebook's library, 10x faster than brute force at 1M+ vectors), ChromaDB (embedded, good for prototyping), or Qdrant (Rust-based, production-grade with filtering). The 3AM incident in the title likely stems from a common pitfall: cosine similarity on unnormalized vectors, or using a model trained on general text for a specialized domain (e.g., legal documents).
When your vector DB returns 100% wrong results, it's almost always a data issue—not the algorithm—like failing to normalize embeddings, using the wrong distance metric, or index corruption from concurrent writes.
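The normalization and metric pitfalls can be reproduced with two toy vectors. The sketch below uses plain Python lists instead of real embeddings, and the document names and values are made up for illustration. It shows inner-product ranking on raw vectors favoring a long vector, while the same search on normalized vectors matches cosine ranking.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity, independent of vector length."""
    return dot(l2_normalize(a), l2_normalize(b))

def rank(query, docs, score):
    """Return document names ordered by descending score."""
    return sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)

# Illustrative vectors (not real embeddings).
query = [1.0, 0.0]
docs = {
    "doc_close": [0.9, 0.1],     # almost the same direction, small norm
    "doc_far_long": [5.0, 5.0],  # 45 degrees away, large norm
}

raw_order = rank(query, docs, dot)        # inner product on raw vectors
cosine_order = rank(query, docs, cosine)  # what we actually wanted

# Normalizing at write time makes inner product agree with cosine.
normalized_docs = {name: l2_normalize(vec) for name, vec in docs.items()}
ip_on_normalized = rank(l2_normalize(query), normalized_docs, dot)

print(raw_order)         # ['doc_far_long', 'doc_close']
print(cosine_order)      # ['doc_close', 'doc_far_long']
print(ip_on_normalized)  # ['doc_close', 'doc_far_long']
```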
Imagine trying to find a book in a library by describing its meaning instead of its title. Embeddings turn every sentence into a unique 'fingerprint' of numbers. Semantic search compares these fingerprints to find the closest match. If your fingerprint is wrong (bad model) or the library's catalog is corrupted (index drift), you get the wrong book.
We deployed a semantic search system for a legal document retrieval service. At 2AM, the on-call engineer got paged: the top-5 results for a user query were completely irrelevant — documents about 'contract termination' returned results about 'employee onboarding.' The p99 latency had also spiked from 50ms to 2.3 seconds. The root cause? A silent embedding model upgrade that changed the vector space, combined with a FAISS index that wasn't rebuilt. This is the story of that night and everything we learned since.
How Embeddings Actually Work Under the Hood
Embeddings are dense vector representations of text. They are generated by transformer models that convert tokens into a fixed-size vector (e.g., 384 dimensions for all-MiniLM-L6-v2). The key insight: these vectors encode semantic meaning such that similar texts have similar vectors (high cosine similarity).
Under the hood, the model applies a series of attention layers, pooling, and normalization. The output is a vector where each dimension captures some latent feature of the input. The abstraction hides the fact that the model's behavior can change with library updates.
The production implication: you must treat the embedding model as a black box that can silently change. Pin the exact model revision, not just the library version. Use a hash of the model's configuration to detect drift.
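One way to hash the model's configuration is sketched here, using only the standard library. The config fields, the placeholder revision string, and the function names are assumptions for illustration. In a real pipeline the dictionary would be filled from whatever settings your encoder actually exposes.

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Hash a model configuration in a key-order-independent way."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def index_matches_model(index_metadata: dict, live_config: dict) -> bool:
    """True if the index was built with the configuration now serving queries."""
    return index_metadata.get("model_fingerprint") == config_fingerprint(live_config)

# Hypothetical configuration captured when the index was built.
build_config = {
    "model_name": "all-MiniLM-L6-v2",
    "revision": "<pinned-revision-hash>",  # placeholder, not a real hash
    "max_seq_length": 256,
    "normalize_embeddings": True,
}
index_metadata = {"model_fingerprint": config_fingerprint(build_config)}

# The same settings after a silent change to one field.
drifted_config = dict(build_config, max_seq_length=128)

print(index_matches_model(index_metadata, build_config))    # True
print(index_matches_model(index_metadata, drifted_config))  # False
```

Storing the fingerprint next to the index lets the query path refuse to search, with a clear error, when the two disagree.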
Set normalize_embeddings=True in model.encode().
Practical Implementation: Building a Semantic Search Pipeline
We'll build a complete pipeline: load documents, generate embeddings, index with FAISS, and query. We'll use the all-MiniLM-L6-v2 model and FAISS IVF index. This is production-ready for up to 1 million documents on a single machine.
Key choices: IVF with 100 centroids gives a good trade-off between speed and recall. We use faiss.IndexFlatIP as the coarse quantizer and faiss.IndexIVFFlat for the inverted file. We set nprobe=10 at query time for 95% recall at 10ms latency.
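To make the roles of the centroid count and nprobe concrete, here is a toy inverted-file index in plain Python. It is not FAISS and not meant for production. The class name ToyIVF, the use of the first few vectors as centroids (FAISS trains centroids with k-means), and the small random dataset are all assumptions for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

class ToyIVF:
    """Inverted-file index: each vector is stored in the list of its nearest centroid."""

    def __init__(self, centroids):
        self.centroids = [normalize(c) for c in centroids]
        self.lists = [[] for _ in centroids]  # one inverted list per centroid
        self.nprobe = 1                       # how many lists a query scans

    def _nearest_centroids(self, vec, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: dot(vec, self.centroids[i]), reverse=True)
        return order[:n]

    def add(self, doc_id, vec):
        vec = normalize(vec)  # normalize at write time
        self.lists[self._nearest_centroids(vec, 1)[0]].append((doc_id, vec))

    def search(self, query, k):
        query = normalize(query)
        candidates = []
        for list_id in self._nearest_centroids(query, self.nprobe):
            candidates.extend(self.lists[list_id])
        scored = sorted(((dot(query, v), doc_id) for doc_id, v in candidates),
                        reverse=True)
        return [(doc_id, score) for score, doc_id in scored[:k]]

rng = random.Random(0)
dim, n_docs, nlist = 8, 200, 10
docs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_docs)]

index = ToyIVF(centroids=docs[:nlist])  # simplification: no k-means training
for doc_id, vec in enumerate(docs):
    index.add(doc_id, vec)

query = [rng.gauss(0, 1) for _ in range(dim)]

index.nprobe = nlist  # scan every list: exact search
exact_top5 = index.search(query, 5)

index.nprobe = 3      # scan only the 3 closest lists: approximate search
approx_top5 = index.search(query, 5)

recall = len({d for d, _ in approx_top5} & {d for d, _ in exact_top5}) / 5
print(f"recall@5 with nprobe=3: {recall:.2f}")
```

Raising nprobe scans more lists, which costs latency and buys recall. That is the same trade-off the nprobe=10 setting above is tuning.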
Use METRIC_INNER_PRODUCT with normalized embeddings.
Use faiss.index_cpu_to_all_gpus to accelerate training on GPU. For large datasets (>10M), we use faiss.IndexIVFPQ to reduce memory by 4x at the cost of 1% recall.
When NOT to Use Semantic Search
Semantic search is not a silver bullet. It fails on exact-match queries (e.g., order IDs, product codes, dates). It also struggles with highly specialized domains where the embedding model has not been fine-tuned (e.g., medical jargon, legal citations).
In these cases, hybrid search (vector + keyword) is better. Use BM25 for exact matches and semantic search for meaning. Combine results with reciprocal rank fusion (RRF).
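A minimal sketch of weighted reciprocal rank fusion follows. The constant k=60 is a commonly used default rather than a value from our system, and the document IDs and weights are invented for illustration. Each ranker contributes weight / (k + rank) for every document it returns.

```python
def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of doc IDs (best first) into one ranking.

    rankings: dict mapping ranker name -> list of doc IDs
    weights:  dict mapping ranker name -> weight for that ranker
    """
    scores = {}
    for name, ranked_ids in rankings.items():
        weight = weights[name]
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; doc ID breaks ties deterministically.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Illustrative results from the two retrievers.
rankings = {
    "bm25": ["order-4417", "faq-12", "doc-7"],
    "vector": ["doc-7", "doc-3", "faq-12"],
}

equal = weighted_rrf(rankings, {"bm25": 1.0, "vector": 1.0})
keyword_heavy = weighted_rrf(rankings, {"bm25": 3.0, "vector": 1.0})

print(equal)          # ['doc-7', 'faq-12', 'order-4417', 'doc-3']
print(keyword_heavy)  # ['faq-12', 'doc-7', 'order-4417', 'doc-3']
```

RRF uses only ranks, so BM25 scores and cosine similarities never need to be put on a common scale.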
Another case: if your corpus is small (<1000 documents), a simple TF-IDF or BM25 may be faster and equally effective. Semantic search overhead (model loading, embedding generation) may not be worth it.
Production Patterns & Scale: Handling 10M+ Documents
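At scale, FAISS IVF with PQ (Product Quantization) is your friend. It reduces memory by 4x or more with minimal recall loss. Use faiss.IndexIVFPQ with M=8 (8 sub-vectors) and nbits=8. This stores each vector as 8 one-byte codes, which is 8 bytes per vector (not per component), compared with 1,536 bytes for a raw 384-dimensional float32 vector.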
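The storage arithmetic for these PQ settings can be checked with a few lines. This is a back-of-the-envelope sketch: the helper names are made up, and it counts only the stored vector payload, leaving out vector IDs, centroids, and codebooks.

```python
import math

def raw_bytes_per_vector(dim: int, bytes_per_float: int = 4) -> int:
    """Size of one uncompressed float32 vector."""
    return dim * bytes_per_float

def pq_code_bytes_per_vector(m: int, nbits: int) -> int:
    """Size of one PQ code: m sub-vector codes of nbits each, rounded up to bytes."""
    return math.ceil(m * nbits / 8)

dim, m, nbits, n_vectors = 384, 8, 8, 10_000_000

# PQ splits the vector into m equal parts, so dim must be divisible by m.
assert dim % m == 0

raw_total = raw_bytes_per_vector(dim) * n_vectors
pq_total = pq_code_bytes_per_vector(m, nbits) * n_vectors

print(f"raw: {raw_total / 1e9:.2f} GB, PQ codes: {pq_total / 1e6:.0f} MB")
```

For 10M vectors of 384 dimensions this gives about 15.36 GB of raw floats against 80 MB of PQ codes, before per-vector IDs and index overhead are added back.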
For distributed search, use FAISS with a sharded index. Each shard handles a subset of documents. At query time, broadcast the query to all shards and merge results.
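The broadcast-and-merge step can be sketched without any vector library. Here each shard is a hypothetical in-memory callable that returns its own top-k as (score, doc_id) pairs. In a real deployment the shards would be separate index processes queried in parallel, and the vectors below are invented for illustration.

```python
import heapq

def make_shard(vectors):
    """Build a toy shard: a search function over a dict of doc_id -> vector."""
    def search(query, k):
        scored = [(sum(q * x for q, x in zip(query, vec)), doc_id)
                  for doc_id, vec in vectors.items()]
        return heapq.nlargest(k, scored)
    return search

def search_shards(shards, query, k):
    """Send the query to every shard, then keep the global top-k."""
    partial_results = [shard(query, k) for shard in shards]
    all_hits = (hit for hits in partial_results for hit in hits)
    return heapq.nlargest(k, all_hits)

# Two illustrative shards holding unit-length vectors.
shard_a = make_shard({"a1": [1.0, 0.0], "a2": [0.6, 0.8]})
shard_b = make_shard({"b1": [0.8, 0.6], "b2": [0.0, 1.0]})

top3 = search_shards([shard_a, shard_b], query=[1.0, 0.0], k=3)
print([doc_id for _, doc_id in top3])  # ['a1', 'b1', 'a2']
```

Each shard must return at least k hits of its own for the merged result to equal a search over the full corpus. Asking shards for fewer trades accuracy for bandwidth.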
Another pattern: use a vector database like ChromaDB or Qdrant for persistence and replication. They handle index updates, rebalancing, and replication out of the box.
Monitoring: track embedding generation latency, index query latency, and recall. Use a hold-out set of 1000 known queries to measure recall weekly.
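A recall measurement over a hold-out set might look like the sketch below. The hold-out structure, the fake_search stand-in, and the document IDs are assumptions for illustration. Recall@k is defined here as the fraction of a query's known-relevant documents that appear in the top k, averaged over queries; other definitions exist.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant doc IDs found in the first k retrieved IDs."""
    if not relevant:
        raise ValueError("each hold-out query needs at least one relevant document")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_recall_at_k(holdout, search_fn, k):
    """Average recall@k over a dict of query -> set of relevant doc IDs."""
    scores = [recall_at_k(search_fn(query, k), relevant, k)
              for query, relevant in holdout.items()]
    return sum(scores) / len(scores)

# Illustrative hold-out set: query text -> known-relevant document IDs.
holdout = {
    "contract termination clause": {"d1", "d2"},
    "notice period for resignation": {"d3"},
}

def fake_search(query, k):
    """Stand-in for the real index query; returns canned results."""
    canned = {
        "contract termination clause": ["d1", "d9", "d2", "d4", "d5"],
        "notice period for resignation": ["d7", "d8", "d6", "d5", "d4"],
    }
    return canned[query][:k]

print(mean_recall_at_k(holdout, fake_search, k=5))  # 0.5
```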
To train on GPU, use index = faiss.index_cpu_to_all_gpus(index). Training 10M vectors on CPU can take hours. On GPU, it's minutes.
Common Mistakes with Specific Examples
- Not normalizing embeddings: Inner-product search on unnormalized vectors does not compute cosine similarity and gives wrong rankings. Always normalize. Example: model.encode(text, normalize_embeddings=True).
- Using the wrong metric: FAISS default is L2 distance. For cosine similarity, use inner product after normalization. Set faiss.METRIC_INNER_PRODUCT.
- Not rebuilding index after model upgrade: We learned this the hard way (see incident). Pin model revision.
- Ignoring chunking strategy: For RAG, chunk size matters. Too small: lost context. Too large: irrelevant results. We use 256 tokens with 32-token overlap.
- Not testing recall: We deployed with 80% recall and users complained. Use a hold-out set of known queries to measure recall weekly.
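The chunking setup from the list above (256 tokens with a 32-token overlap) can be sketched as a sliding window. The function name is hypothetical, and the integers below stand in for tokens. In practice you would pass the token list produced by your embedding model's tokenizer.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Illustrative input: 600 integer "tokens".
tokens = list(range(600))
chunks = chunk_tokens(tokens)

print(len(chunks))                    # 3
print([len(c) for c in chunks])       # [256, 256, 152]
print(chunks[0][-32:] == chunks[1][:32])  # True
```

The last chunk is shorter than the rest. Whether to keep it, pad it, or merge it into the previous chunk is a design choice this sketch leaves open.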
Comparison vs Alternatives: FAISS vs ChromaDB vs Qdrant
FAISS is a library, not a database. It gives you full control over indexing and search, but you manage persistence, replication, and updates yourself.
ChromaDB is a lightweight vector database. It's easy to set up (pip install) and supports metadata filtering. Good for small to medium datasets (<1M documents).
Qdrant is a production-grade vector database. It supports filtering, sharding, replication, and CRUD operations. Better for large-scale, multi-tenant systems.
Our recommendation: start with FAISS for experimentation, move to ChromaDB for simple deployments, and use Qdrant for production at scale.
Debugging and Monitoring in Production
Monitoring semantic search in production requires tracking both the system (latency, throughput) and the quality (recall, relevance).
- Embedding generation latency: p50, p99
- Index query latency: p50, p99
- Recall@5 (measured weekly)
- Cosine similarity distribution (should be stable)
- Index size and memory usage
Tools: Prometheus for metrics, Grafana for dashboards. Use OpenTelemetry for tracing.
Alert on: recall drop >5%, latency spike >2x, index size change >10%.
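These three alert rules can be expressed as a small check that runs after each metrics scrape. The metric key names are hypothetical, and the recall rule is read here as an absolute drop of more than 0.05, which is one possible interpretation of "5%". The latency figures come from the incident described earlier, and the recall figures are illustrative.

```python
def check_alerts(baseline, current):
    """Return the names of the alert rules that fire for the current metrics."""
    alerts = []
    # Recall dropped by more than 5 points (absolute) against the baseline.
    if baseline["recall_at_5"] - current["recall_at_5"] > 0.05:
        alerts.append("recall")
    # p99 latency more than doubled.
    if current["p99_latency_ms"] > 2 * baseline["p99_latency_ms"]:
        alerts.append("latency")
    # Index size changed by more than 10% in either direction.
    size_change = abs(current["index_size_bytes"] - baseline["index_size_bytes"])
    if size_change / baseline["index_size_bytes"] > 0.10:
        alerts.append("index_size")
    return alerts

baseline = {"recall_at_5": 0.95, "p99_latency_ms": 50.0, "index_size_bytes": 1_000_000}
incident = {"recall_at_5": 0.40, "p99_latency_ms": 2300.0, "index_size_bytes": 1_000_000}

print(check_alerts(baseline, incident))  # ['recall', 'latency']
print(check_alerts(baseline, baseline))  # []
```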
The Silent Embedding Drift That Broke Our Semantic Search
We assumed that upgrading sentence-transformers from 2.2.0 to 2.3.0 was a minor patch that would not affect embedding quality. We did not pin the model version in our requirements.txt.
The all-MiniLM-L6-v2 model in sentence-transformers 2.3.0 had a different internal tokenizer configuration than 2.2.0. The same input text produced a different embedding vector. Our FAISS index was built with the old vectors, so queries encoded with the new model were searching in a different space. Cosine similarity dropped from an average of 0.85 to 0.12.
1. Pinned sentence-transformers==2.2.0 in requirements.txt and pinned the model by its Hugging Face revision hash.
2. Rebuilt the FAISS index from scratch using the correct model version.
3. Added a CI pipeline that compares embedding cosine similarity for a fixed set of test sentences before and after any model upgrade. If the mean similarity drops below 0.95, the build fails.
4. Added a version field to the index metadata so we can detect mismatches at query time.
- Pin both the library version and the model revision hash. A model upgrade is not a patch.
- Add a regression test that measures embedding stability. Compare cosine similarity of a fixed test set across versions.
- Store the embedding model version in index metadata. Validate it at query time and return a clear error if mismatched.
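The embedding stability check can be sketched as a comparison between stored known-good vectors and freshly computed ones for the same fixed sentences. The function names and toy vectors are assumptions for illustration. In CI the old vectors would be loaded from a file committed with the index, and the new ones produced by the candidate model version.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_similarity(old_embeddings, new_embeddings):
    """Average cosine similarity between old and new vectors, sentence by sentence."""
    if len(old_embeddings) != len(new_embeddings):
        raise ValueError("both sets must embed the same test sentences")
    sims = [cosine(old, new) for old, new in zip(old_embeddings, new_embeddings)]
    return sum(sims) / len(sims)

def embeddings_are_stable(old_embeddings, new_embeddings, threshold=0.95):
    """True if the new model still maps the test sentences to (nearly) the same vectors."""
    return mean_similarity(old_embeddings, new_embeddings) >= threshold

# Toy vectors standing in for embeddings of two fixed test sentences.
known_good = [[1.0, 0.0], [0.0, 1.0]]
same_space = [[2.0, 0.0], [0.0, 3.0]]  # same directions, different lengths
drifted = [[0.0, 1.0], [1.0, 0.0]]     # directions changed

print(embeddings_are_stable(known_good, same_space))  # True
print(embeddings_are_stable(known_good, drifted))     # False
```

A build step would fail the pipeline when the function returns False, using the 0.95 threshold from the fix list above.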
Check the embedding output directly:
python -c "import numpy as np; from sentence_transformers import SentenceTransformer; model = SentenceTransformer('all-MiniLM-L6-v2'); emb1 = model.encode('test query'); emb2 = model.encode('test document'); print(emb1 @ emb2 / (np.linalg.norm(emb1)*np.linalg.norm(emb2)))"
Compare with a known-good embedding from a previous version.
An index built through faiss.index_factory with IVF might have too few centroids. Inspect index.nlist and index.nprobe to verify. Also check if the index is on disk or in memory. Use faiss.read_index and faiss.index_cpu_to_all_gpus for GPU acceleration.
index.ntotal should be > 0. If it's zero, the index was not built or was corrupted. Check the indexing pipeline logs for errors. Also verify that the embedding dimension matches: index.d vs len(query_embedding).
Use import tracemalloc; tracemalloc.start() to track allocations. Also check for memory growth around the embedding model. Note that show_progress_bar=False in model.encode only silences the progress bar and does not change memory use; encoding in smaller batches (the batch_size argument) keeps peak memory bounded.
python -c "import sentence_transformers; print(sentence_transformers.__version__)"
python -c "from sentence_transformers import SentenceTransformer; m = SentenceTransformer('all-MiniLM-L6-v2'); print(m._modules)"
pip install sentence-transformers==2.2.0, then re-run the indexing script.
Key takeaways
Common mistakes to avoid
- Forgetting to normalize embeddings: Normalize with vec = vec / np.linalg.norm(vec). Use inner product search instead of cosine if normalized.
- Mixing embedding models across index/query: Pin the exact model identifier (e.g., sentence-transformers/all-MiniLM-L6-v2@v1). Re-embed entire corpus on model change.
- Using top-k without a similarity cutoff: Filter with results = [r for r in results if r.score > 0.7]. Return empty set if none pass.
- Not chunking documents properly
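The similarity cutoff can be wrapped in a small helper so callers always get either confident hits or nothing. The Hit class and function name are hypothetical. The 0.7 threshold is the one quoted above and would need tuning per model and corpus.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    doc_id: str
    score: float  # cosine similarity between query and document

def apply_cutoff(hits, min_score=0.7, top_k=5):
    """Keep at most top_k hits whose score exceeds min_score, best first."""
    kept = [hit for hit in hits if hit.score > min_score]
    kept.sort(key=lambda hit: hit.score, reverse=True)
    return kept[:top_k]

# Illustrative search results.
results = [Hit("d3", 0.40), Hit("d1", 0.91), Hit("d2", 0.72)]
weak_results = [Hit("d8", 0.12), Hit("d9", 0.15)]

print([hit.doc_id for hit in apply_cutoff(results)])  # ['d1', 'd2']
print(apply_cutoff(weak_results))                     # []
```

Returning an empty list lets the caller show "no relevant results" instead of the nearest but unrelated documents.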
Interview Questions on This Topic
Explain how embeddings are generated for semantic search. What happens under the hood?
Frequently Asked Questions