Senior 3 min · May 22, 2026

Embeddings and Semantic Search — The 3AM Incident Where Our Vector DB Returned 100% Wrong Results

We deployed semantic search and got 100% irrelevant results.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Embedding Models: Not all are equal. We saw a 40% accuracy drop switching from text-embedding-3-small to all-MiniLM-L6-v2 on a legal document search. Test on your domain.
  • Vector Index: FAISS IVF with 100 centroids gave us 95% recall at 10ms query time. HNSW was faster but used 3x memory. Profile your latency vs. memory budget.
  • Hybrid Search: Pure vector search failed on exact-match queries like order IDs. Adding BM25 keyword retrieval fixed that. We now fuse the two rankings with reciprocal rank fusion.
  • Embedding Drift: Model updates change the vector space silently. We pinned a specific model version after a sentence-transformers upgrade silently broke our index.
  • Normalization: Forgetting to normalize embeddings before inner-product search caused a 15% recall drop. Normalize stored vectors once at write time rather than on every query, and normalize each query vector before searching.
  • Chunking Strategy: Overlapping chunks of 256 tokens with 32-token overlap gave us the best balance of context and precision for our RAG pipeline.
What Are Embeddings and Semantic Search?

Embeddings are dense vector representations of data—text, images, audio—that capture semantic meaning in a high-dimensional space (typically 384 to 4096 dimensions). They exist because traditional keyword search (BM25, TF-IDF) fails on synonyms, context, and intent: searching 'car repair' won't match 'auto mechanic' unless you've manually built a synonym list.

Embeddings solve this by mapping similar concepts to nearby points in vector space, enabling semantic search where you find results by meaning rather than exact token matches. Under the hood, transformer models like all-MiniLM-L6-v2 (384 dimensions, 80MB) or OpenAI's text-embedding-3-large (3072 dimensions) convert input into a fixed-length float array through a final pooling layer that averages token-level representations.

In the ecosystem, embeddings are the foundation of retrieval-augmented generation (RAG), recommendation systems, and clustering. You'd use them when you need to find 'conceptually related' items—like matching a bug report to similar past issues, or finding relevant documentation for a user query.

But they're not a universal hammer: for exact ID lookups, embeddings are overkill (use a hash map). For structured filtering (e.g., 'price < $50'), you still need metadata filters—pure vector search ignores numeric ranges. And for rare or domain-specific terms (e.g., 'CVE-2024-1234'), keyword search often outperforms embeddings because the vector space hasn't seen enough training examples.

Real-world implementations use approximate nearest neighbor (ANN) indexes like FAISS (Facebook's library, 10x faster than brute force at 1M+ vectors), ChromaDB (embedded, good for prototyping), or Qdrant (Rust-based, production-grade with filtering). The 3AM incident in the title came from one of several common pitfalls: a silent model upgrade that left the index and the queries in different vector spaces. Others include inner-product search on unnormalized vectors and using a model trained on general text for a specialized domain (e.g., legal documents).

When your vector DB returns 100% wrong results, it's almost always a data issue rather than the algorithm: embeddings that were never normalized, the wrong distance metric, a model mismatch between index and query, or index corruption from concurrent writes.

[Figure: Embeddings & Semantic Search architecture. (1) Raw text: query or document → (2) Tokenizer: BPE / WordPiece → (3) Encoder: BERT / Ada-002 → (4) Vector space: 1536-dim float32 → (5) k-NN search: HNSW index → (6) Ranked results: cosine similarity.]
Plain-English First

Imagine trying to find a book in a library by describing its meaning instead of its title. Embeddings turn every sentence into a unique 'fingerprint' of numbers. Semantic search compares these fingerprints to find the closest match. If your fingerprint is wrong (bad model) or the library's catalog is corrupted (index drift), you get the wrong book.

We deployed a semantic search system for a legal document retrieval service. At 2AM, the on-call engineer got paged: the top-5 results for a user query were completely irrelevant — documents about 'contract termination' returned results about 'employee onboarding.' The p99 latency had also spiked from 50ms to 2.3 seconds. The root cause? A silent embedding model upgrade that changed the vector space, combined with a FAISS index that wasn't rebuilt. This is the story of that night and everything we learned since.

How Embeddings Actually Work Under the Hood

Embeddings are dense vector representations of text. They are generated by transformer models that convert tokens into a fixed-size vector (e.g., 384 dimensions for all-MiniLM-L6-v2). The key insight: these vectors encode semantic meaning such that similar texts have similar vectors (high cosine similarity).

Under the hood, the model applies a series of attention layers, pooling, and normalization. The output is a vector where each dimension captures some latent feature of the input. The abstraction hides the fact that the model's behavior can change with library updates.

The production implication: you must treat the embedding model as a black box that can silently change. Pin the exact model revision, not just the library version. Use a hash of the model's configuration to detect drift.

embedding_stability_test.py (Python)
import numpy as np
from sentence_transformers import SentenceTransformer

# Pin the exact model revision
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_revision = "8b3219a92973c328a8e22fadcfa821b5dc75636a"  # Git hash
model = SentenceTransformer(model_name, revision=model_revision)

# Test sentences that are semantically stable
test_sentences = [
    "The cat sat on the mat.",
    "A dog is playing in the park.",
    "The weather is nice today.",
]

# Compute embeddings
embeddings = model.encode(test_sentences, normalize_embeddings=True)

# Compute pairwise cosine similarity (already normalized, so dot product)
similarity_matrix = np.dot(embeddings, embeddings.T)
print("Similarity matrix:")
print(similarity_matrix)

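# Sanity check: unit-length vectors have self-similarity 1.0. This verifies
# normalization only; to detect drift, compare against embeddings saved from
# the previous model version.
assert np.allclose(np.diag(similarity_matrix), 1.0, atol=1e-5), "Diagonal not 1.0"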
print("Test passed.")
Normalize stored embeddings once, at write time
Normalize document embeddings once when you store them instead of re-normalizing the corpus at query time, and normalize each query vector before searching. If only one side is normalized, the scores are meaningless. With both sides at unit length, use dot product instead of cosine similarity: it's faster and equivalent.
Production Insight
We once deployed a search system where embeddings were normalized at index time but not at query time. The cosine similarity scores were consistently low (0.1-0.2), and we spent two days debugging before realizing the mismatch. Fix: normalize both sides. Use normalize_embeddings=True in model.encode().
Key Takeaway
Embeddings are not stable across model versions. Pin both the library and the model revision. Normalize at write time. Test embedding stability in CI.
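
One way to test embedding stability in CI is to keep the embeddings of a fixed sentence set from the known-good model and compare them pairwise with freshly computed ones. The sketch below is a minimal illustration under assumptions: the helper names (`mean_paired_cosine`, `check_drift`) are hypothetical, the 0.95 threshold is the one quoted in the incident fix, and random arrays stand in for real model output so the example runs without downloading a model.

```python
import numpy as np


def mean_paired_cosine(baseline: np.ndarray, current: np.ndarray) -> float:
    """Mean cosine similarity between row i of baseline and row i of current."""
    if baseline.shape != current.shape:
        # A dimension change is itself a sign that the model changed.
        raise ValueError(f"shape mismatch: {baseline.shape} vs {current.shape}")
    b = baseline / np.linalg.norm(baseline, axis=1, keepdims=True)
    c = current / np.linalg.norm(current, axis=1, keepdims=True)
    return float(np.mean(np.sum(b * c, axis=1)))


def check_drift(baseline, current, min_similarity=0.95):
    """Return (ok, score); ok is False when the mean similarity is below the threshold."""
    score = mean_paired_cosine(baseline, current)
    return score >= min_similarity, score


# Stand-in data (assumption): in CI, `baseline` would be loaded from a file stored
# with the index, and `current` would come from model.encode(test_sentences).
rng = np.random.default_rng(0)
baseline = rng.normal(size=(3, 384)).astype(np.float32)
same_model = baseline + rng.normal(scale=1e-4, size=baseline.shape).astype(np.float32)
different_model = rng.normal(size=(3, 384)).astype(np.float32)

print("same model:", check_drift(baseline, same_model))
print("different model:", check_drift(baseline, different_model))
```

Failing the build when `ok` is False turns a silent vector-space change into a visible CI error before any index is queried with the new model.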

Practical Implementation: Building a Semantic Search Pipeline

We'll build a complete pipeline: load documents, generate embeddings, index with FAISS, and query. We'll use the all-MiniLM-L6-v2 model and FAISS IVF index. This is production-ready for up to 1 million documents on a single machine.

Key choices: IVF with 100 centroids gives a good trade-off between speed and recall. We use faiss.IndexFlatIP as the coarse quantizer and faiss.IndexIVFFlat for the inverted file. We set nprobe=10 at query time for 95% recall at 10ms latency.

semantic_search_pipeline.py (Python)
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# 1. Load model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# 2. Example documents
documents = [
    "The cat sat on the mat.",
    "A dog is playing in the park.",
    "The weather is nice today.",
    "I enjoy reading books.",
]

# 3. Generate embeddings (normalized)
embeddings = model.encode(documents, normalize_embeddings=True)

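# 4. Build FAISS index
d = embeddings.shape[1]  # e.g., 384
# 100 centroids suits a real corpus. FAISS needs at least nlist training vectors
# (and recommends about 39 per centroid), so scale down for this 4-document demo.
nlist = max(1, min(100, len(documents) // 39))
quantizer = faiss.IndexFlatIP(d)  # inner product = cosine similarity after normalization
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)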

# 5. Train and add
index.train(embeddings.astype(np.float32))
index.add(embeddings.astype(np.float32))

# 6. Query
query = "a feline on a rug"
query_emb = model.encode([query], normalize_embeddings=True)
index.nprobe = min(10, nlist)  # search up to 10 nearest centroids
scores, indices = index.search(query_emb.astype(np.float32), k=3)

print("Query:", query)
for i, idx in enumerate(indices[0]):
    # Inner product on unit vectors is cosine similarity: higher is better
    print(f"Result {i+1}: {documents[idx]} (score: {scores[0][i]:.4f})")
Use `IndexFlatIP` as quantizer for cosine similarity
After normalization, inner product equals cosine similarity. This is faster than using cosine distance. Always use METRIC_INNER_PRODUCT with normalized embeddings.
Production Insight
In production, we batch encode documents in chunks of 1000 to avoid OOM. We also use faiss.index_cpu_to_all_gpus to accelerate training on GPU. For large datasets (>10M), we use faiss.IndexIVFPQ to reduce memory by 4x at the cost of 1% recall.
Key Takeaway
Use IVF with normalized embeddings and inner product. Set nprobe to balance recall and latency. Batch encode to avoid memory issues.
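
The claim that inner product equals cosine similarity after normalization can be checked on toy vectors. This is a small sketch in plain NumPy; the vectors and helper names are made up for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def l2_normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


doc = np.array([3.0, 4.0, 0.0])
query = np.array([6.0, 8.0, 1.0])

raw_ip = float(np.dot(doc, query))  # grows with vector length
unit_ip = float(np.dot(l2_normalize(doc), l2_normalize(query)))

print("raw inner product:       ", raw_ip)
print("normalized inner product:", unit_ip)
print("cosine similarity:       ", cosine_similarity(doc, query))
```

The raw inner product depends on magnitude, while the normalized inner product matches the cosine value. That is why METRIC_INNER_PRODUCT is only a safe stand-in for cosine when both the stored vectors and the query are unit length.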

When Not to Use Semantic Search

Semantic search is not a silver bullet. It fails on exact-match queries (e.g., order IDs, product codes, dates). It also struggles with highly specialized domains where the embedding model has not been fine-tuned (e.g., medical jargon, legal citations).

In these cases, hybrid search (vector + keyword) is better. Use BM25 for exact matches and semantic search for meaning. Combine results with reciprocal rank fusion (RRF).

Another case: if your corpus is small (<1000 documents), a simple TF-IDF or BM25 may be faster and equally effective. Semantic search overhead (model loading, embedding generation) may not be worth it.

hybrid_search_example.py (Python)
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi

# Example corpus
corpus = [
    "Order ID: 12345, status: shipped",
    "The cat sat on the mat.",
    "Order ID: 67890, status: pending",
]

# 1. Semantic search
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(corpus, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings.astype(np.float32))

query = "Order ID 12345"
query_emb = model.encode([query], normalize_embeddings=True)
sem_dist, sem_idx = index.search(query_emb.astype(np.float32), k=3)

# 2. BM25 search
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
tokenized_query = query.lower().split()
bm25_scores = bm25.get_scores(tokenized_query)
bm25_ranked = np.argsort(bm25_scores)[::-1]

# 3. Reciprocal rank fusion (RRF)
def rrf(sem_idx, bm25_ranked, k=60):
    scores = {}
    for rank, idx in enumerate(sem_idx[0]):
        scores[idx] = 1 / (k + rank + 1)
    for rank, idx in enumerate(bm25_ranked):
        scores[idx] = scores.get(idx, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

hybrid_results = rrf(sem_idx, bm25_ranked)
print("Hybrid results:")
for idx, score in hybrid_results[:3]:
    print(f"{corpus[idx]} (score: {score:.4f})")
Hybrid search with RRF is production standard
Use RRF with k=60. It's simple, effective, and does not require tuning weights. It gives higher rank to documents that appear in both result sets.
Production Insight
A payment service we worked on used pure semantic search. Users searching for 'transaction 12345' got results about 'transaction fees' instead. Adding BM25 and RRF fixed it. The exact-match recall went from 40% to 98%.
Key Takeaway
Use hybrid search when exact matches matter. RRF is a simple, effective fusion method. Semantic search alone is not enough for many production use cases.
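
For reference, the fusion score used in the example can be written out as a formula. This is the standard RRF definition with 1-based ranks (the code uses 0-based ranks and adds 1, which is the same thing); the worked numbers assume k=60 and two ranked lists.

```latex
\mathrm{RRF}(d) \;=\; \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}

% Worked example with k = 60 and two rankers (semantic, BM25):
% ranked 1st in both lists:      1/61 + 1/61 = 2/61 \approx 0.0328
% ranked 2nd in both lists:      1/62 + 1/62 = 2/62 \approx 0.0323
% ranked 1st in one list only:   1/61        \approx 0.0164
```

A document that both rankers agree on outscores one that tops a single list, which is the behavior the exact-match case relies on.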

Production Patterns & Scale: Handling 10M+ Documents

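At scale, FAISS IVF with PQ (Product Quantization) is your friend. It trades a small amount of recall for a large memory saving (we measured roughly 4x in our deployment; the exact ratio depends on M and nbits). Use faiss.IndexIVFPQ with M=8 (8 sub-vectors) and nbits=8. Each vector is then stored as an 8-byte code, one byte per sub-vector, plus ID and inverted-list overhead.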
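
To see what M and nbits imply for memory, the arithmetic can be sketched as follows. The helper names are hypothetical, and the totals cover vector payload only: they leave out the 8-byte ID that IVF indexes store per vector (shown separately) as well as centroids and codebooks.

```python
def pq_code_bytes(m: int, nbits: int) -> int:
    """Bytes per PQ code: m sub-quantizer indices of nbits each, rounded up."""
    return (m * nbits + 7) // 8


def raw_vector_bytes(d: int, bytes_per_float: int = 4) -> int:
    """Bytes per uncompressed float32 vector."""
    return d * bytes_per_float


n, d, m, nbits = 10_000_000, 384, 8, 8

raw_mb = n * raw_vector_bytes(d) / 1e6     # uncompressed float32 vectors
pq_mb = n * pq_code_bytes(m, nbits) / 1e6  # PQ codes only
ids_mb = n * 8 / 1e6                       # 64-bit IDs stored alongside the codes

print(f"raw vectors: {raw_mb:.0f} MB")
print(f"PQ codes:    {pq_mb:.0f} MB")
print(f"IDs:         {ids_mb:.0f} MB")
```

Larger M keeps more information per vector (better recall) at the cost of proportionally larger codes, so the ratio you observe in practice depends on the settings you choose.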

For distributed search, use FAISS with a sharded index. Each shard handles a subset of documents. At query time, broadcast the query to all shards and merge results.
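
The merge step of a sharded search can be sketched in a few lines. This is an illustration under assumptions: each shard returns a `(scores, ids)` pair like one row of a FAISS result, scores are inner products (higher is better), and the IDs are already globally unique across shards. The function name is hypothetical.

```python
import heapq

import numpy as np


def merge_shard_results(shard_results, k):
    """Merge per-shard (scores, ids) pairs into a global top-k by descending score."""
    candidates = []
    for scores, ids in shard_results:
        candidates.extend(zip(scores.tolist(), ids.tolist()))
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])


# Two shards, each returning its own local top-2 for the same query.
shard_a = (np.array([0.91, 0.40], dtype=np.float32), np.array([12, 7]))
shard_b = (np.array([0.88, 0.75], dtype=np.float32), np.array([103, 150]))

top = merge_shard_results([shard_a, shard_b], k=3)
for score, doc_id in top:
    print(f"doc {doc_id}: {score:.2f}")
```

Each shard must return at least k candidates for the merged top-k to be exact; asking shards for fewer saves bandwidth but can drop true results.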

Another pattern: use a vector database like ChromaDB or Qdrant for persistence and replication. They handle index updates, rebalancing, and replication out of the box.

Monitoring: track embedding generation latency, index query latency, and recall. Use a hold-out set of 1000 known queries to measure recall weekly.

faiss_ivfpq_example.py (Python)
import faiss
import numpy as np

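# Simulate 10M embeddings of dimension 384.
# Note: this needs about 15 GB of RAM as float32; lower n to try it locally.
n = 10_000_000
d = 384
rng = np.random.default_rng(42)
embeddings = rng.random((n, d), dtype=np.float32)  # generate float32 directly
# Normalize in place (for cosine similarity via inner product)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)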

# IVF with PQ
nlist = 1000  # more centroids for larger dataset
m = 8  # number of sub-vectors
nbits = 8  # bits per sub-vector
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits, faiss.METRIC_INNER_PRODUCT)

# Train (use a subset for speed)
index.train(embeddings[:100000])
index.add(embeddings)

# Query
query = np.random.random((1, d)).astype(np.float32)
query = query / np.linalg.norm(query, axis=1, keepdims=True)
index.nprobe = 50
distances, indices = index.search(query, k=10)
print("Top 10 indices:", indices[0])
# Approximate PQ code size: m sub-vectors x nbits each (excludes IDs and centroids)
print("PQ code memory (MB):", index.ntotal * (m * nbits // 8) / 1e6)
IVFPQ training is slow on CPU
Use GPU for training: index = faiss.index_cpu_to_all_gpus(index). Training 10M vectors on CPU can take hours. On GPU, it's minutes.
Production Insight
We ran a recommendation engine serving 2M req/day. The FAISS index was 8GB in memory (flat index). After switching to IVFPQ, memory dropped to 2GB with 99% recall. The trade-off: 1ms extra latency per query.
Key Takeaway
Use IVFPQ for large-scale deployments. Monitor recall weekly. Use GPU for training. Shard if you need more than 10M documents.

Common Mistakes with Specific Examples

  1. Not normalizing embeddings: Inner-product search on unnormalized vectors is not cosine similarity, so longer vectors win regardless of meaning. Always normalize. Example: model.encode(text, normalize_embeddings=True).
  2. Using the wrong metric: FAISS default is L2 distance. For cosine similarity, use inner product after normalization. Set faiss.METRIC_INNER_PRODUCT.
  3. Not rebuilding index after model upgrade: We learned this the hard way (see incident). Pin model revision.
  4. Ignoring chunking strategy: For RAG, chunk size matters. Too small: lost context. Too large: irrelevant results. We use 256 tokens with 32-token overlap.
  5. Not testing recall: We deployed with 80% recall and users complained. Use a hold-out set of known queries to measure recall weekly.
recall_test.py (Python)
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Hold-out set: queries with known relevant document IDs
# (3 shown here for brevity; use a much larger set in practice)
hold_out_queries = ["cat on mat", "dog park", "weather nice"]
hold_out_relevant = [0, 1, 2]  # document indices

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(hold_out_queries, normalize_embeddings=True)

# Assume index is already built
index = faiss.read_index("index.faiss")
index.nprobe = 10

# Measure recall at k=5
recall_at_5 = 0
for i, query_emb in enumerate(embeddings):
    distances, indices = index.search(query_emb.reshape(1, -1).astype(np.float32), k=5)
    if hold_out_relevant[i] in indices[0]:
        recall_at_5 += 1

print(f"Recall@5: {recall_at_5 / len(hold_out_queries):.2%}")
Set up a recall dashboard
Track recall@k over time. Alert if it drops below 90%. This catches embedding drift early.
Production Insight
We had a recall drop from 95% to 80% after a model upgrade. The recall test caught it in staging before it hit production. Saved us from another 3AM page.
Key Takeaway
Test recall regularly with a hold-out set. Catch embedding drift before it reaches users. Use recall@5 as a standard metric.
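
The chunking advice in the list above (256 tokens with a 32-token overlap) can be sketched as a sliding window. This is a minimal illustration: `chunk_tokens` is a hypothetical helper, and plain strings stand in for the token IDs your model's tokenizer would produce.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Placeholder tokens (assumption): real code would use the model's tokenizer.
tokens = [f"tok{i}" for i in range(600)]
chunks = chunk_tokens(tokens)

print([len(c) for c in chunks])
print(chunks[0][-32:] == chunks[1][:32])  # neighbouring chunks share 32 tokens
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of indexing some tokens twice.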

Comparison vs Alternatives: FAISS vs ChromaDB vs Qdrant

FAISS is a library, not a database. It gives you full control over indexing and search, but you manage persistence, replication, and updates yourself.

ChromaDB is a lightweight vector database. It's easy to set up (pip install) and supports metadata filtering. Good for small to medium datasets (<1M documents).

Qdrant is a production-grade vector database. It supports filtering, sharding, replication, and CRUD operations. Better for large-scale, multi-tenant systems.

Our recommendation: start with FAISS for experimentation, move to ChromaDB for simple deployments, and use Qdrant for production at scale.

chromadb_example.py (Python)
import chromadb
from sentence_transformers import SentenceTransformer

# Initialize ChromaDB client (persistent)
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection = client.get_or_create_collection(
    name="documents",
    embedding_function=None  # We'll provide embeddings manually
)

# Add documents with embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
documents = ["cat on mat", "dog park"]
embeddings = model.encode(documents, normalize_embeddings=True).tolist()
ids = ["doc1", "doc2"]

collection.add(
    embeddings=embeddings,
    documents=documents,
    ids=ids
)

# Query
query = "feline on rug"
query_emb = model.encode([query], normalize_embeddings=True).tolist()
results = collection.query(
    query_embeddings=query_emb,
    n_results=2
)
print("Results:", results["documents"])
ChromaDB is great for prototyping, Qdrant for production
ChromaDB's simplicity is a double-edged sword: it does not support sharding natively. For >1M documents, use Qdrant.
Production Insight
We used ChromaDB for a prototype that grew to 500K documents. It started crashing due to memory pressure. Migrating to Qdrant with sharding solved it. The migration took 2 days.
Key Takeaway
Choose your vector store based on scale. FAISS for experiments, ChromaDB for small deployments, Qdrant for production at scale.

Debugging and Monitoring in Production

Monitoring semantic search in production requires tracking both the system (latency, throughput) and the quality (recall, relevance).

Key metrics
  • Embedding generation latency: p50, p99
  • Index query latency: p50, p99
  • Recall@5 (measured weekly)
  • Cosine similarity distribution (should be stable)
  • Index size and memory usage

Tools: Prometheus for metrics, Grafana for dashboards. Use OpenTelemetry for tracing.

Alert on: recall drop >5%, latency spike >2x, index size change >10%.
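
The alert thresholds above can be expressed as a small rule check. This sketch makes two assumptions that the text leaves open: "recall drop >5%" is read as more than 5 percentage points below the baseline, and all names (`Snapshot`, `evaluate_alerts`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    recall_at_5: float      # fraction in [0, 1]
    p99_latency_ms: float
    index_size_bytes: int


def evaluate_alerts(baseline: Snapshot, current: Snapshot) -> list:
    """Return the names of the alert rules that fire for `current` vs `baseline`."""
    alerts = []
    # Assumption: a "5% drop" means 5 percentage points of recall.
    if baseline.recall_at_5 - current.recall_at_5 > 0.05:
        alerts.append("recall_drop")
    if current.p99_latency_ms > 2 * baseline.p99_latency_ms:
        alerts.append("latency_spike")
    size_change = abs(current.index_size_bytes - baseline.index_size_bytes)
    if size_change / baseline.index_size_bytes > 0.10:
        alerts.append("index_size_change")
    return alerts


baseline = Snapshot(recall_at_5=0.95, p99_latency_ms=50, index_size_bytes=8_000_000_000)
current = Snapshot(recall_at_5=0.80, p99_latency_ms=2300, index_size_bytes=8_100_000_000)

print(evaluate_alerts(baseline, current))
```

In a real deployment these rules would more likely live in the alerting system's own configuration; the function form is mainly useful for unit-testing the thresholds.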

monitoring_setup.py (Python)
from prometheus_client import Histogram, Gauge, generate_latest
import time
import faiss
import numpy as np

# Define metrics
embedding_latency = Histogram(
    'embedding_latency_seconds',
    'Time to generate embeddings',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0]
)
query_latency = Histogram(
    'query_latency_seconds',
    'Time to search index',
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1]
)
index_size = Gauge('index_size_bytes', 'Size of FAISS index in bytes')
recall_gauge = Gauge('recall_at_5', 'Recall@5 measured weekly')

# Example usage
@embedding_latency.time()
def generate_embeddings(texts):
    # Placeholder: actual model call
    time.sleep(0.05)
    return np.random.random((len(texts), 384))

@query_latency.time()
def search_index(query_emb):
    # Placeholder: actual FAISS search
    time.sleep(0.002)
    return np.array([[0, 1]])

# Update index size periodically
def update_index_size(path):
    import os
    size = os.path.getsize(path)
    index_size.set(size)

# Update recall weekly
def update_recall(recall_value):
    recall_gauge.set(recall_value)
Use OpenTelemetry for distributed tracing
If your pipeline involves multiple services (embedding service, indexing service, query service), tracing helps identify bottlenecks. Instrument each step.
Production Insight
We added tracing and discovered that 80% of query latency was in the network round trip to the embedding service. We moved the embedding model to the same machine as the index. Latency dropped from 50ms to 5ms.
Key Takeaway
Monitor both system and quality metrics. Use tracing to find bottlenecks. Alert on recall drops and latency spikes.
Production incident · Post-mortem · Severity: high

The Silent Embedding Drift That Broke Our Semantic Search

Symptom
P99 latency spiked from 50ms to 2.3s. User-facing search returned irrelevant documents. Monitoring showed a sudden drop in cosine similarity scores for all queries.
Assumption
We assumed that upgrading sentence-transformers from 2.2.0 to 2.3.0 was a minor patch that would not affect embedding quality. We did not pin the model version in our requirements.txt.
Root cause
The all-MiniLM-L6-v2 model in sentence-transformers 2.3.0 had a different internal tokenizer configuration than 2.2.0. The same input text produced a different embedding vector. Our FAISS index was built with the old vectors, so queries encoded with the new model were searching in a different space. Cosine similarity dropped from an average of 0.85 to 0.12.
Fix
1. Pinned sentence-transformers==2.2.0 in requirements.txt and pinned the model by its Hugging Face revision hash.
2. Rebuilt the FAISS index from scratch using the correct model version.
3. Added a CI pipeline that compares embedding cosine similarity for a fixed set of test sentences before and after any model upgrade. If the mean similarity drops below 0.95, the build fails.
4. Added a version field to the index metadata so we can detect mismatches at query time.
Key lesson
  • Pin both the library version and the model revision hash. A model upgrade is not a patch.
  • Add a regression test that measures embedding stability. Compare cosine similarity of a fixed test set across versions.
  • Store the embedding model version in index metadata. Validate it at query time and return a clear error if mismatched.
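
Storing the model version next to the index and validating it at query time could look like the sketch below. It assumes a JSON sidecar file beside the index; the file name, the field names, the `ModelMismatchError` class, and the `rev-a`/`rev-b` revision strings are all hypothetical placeholders.

```python
import json
import tempfile
from pathlib import Path


class ModelMismatchError(RuntimeError):
    """Raised when the query-time model differs from the one that built the index."""


def write_index_metadata(path, model_name, model_revision, dim):
    """Write a JSON sidecar next to the index at build time."""
    meta = {"model_name": model_name, "model_revision": model_revision, "dim": dim}
    Path(path).write_text(json.dumps(meta))


def validate_index_metadata(path, model_name, model_revision, dim):
    """Compare the sidecar against the model loaded at query time."""
    stored = json.loads(Path(path).read_text())
    expected = {"model_name": model_name, "model_revision": model_revision, "dim": dim}
    if stored != expected:
        raise ModelMismatchError(f"index built with {stored}, query model is {expected}")


with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "index.meta.json"
    write_index_metadata(meta_path, "all-MiniLM-L6-v2", "rev-a", 384)
    validate_index_metadata(meta_path, "all-MiniLM-L6-v2", "rev-a", 384)  # passes silently
    try:
        validate_index_metadata(meta_path, "all-MiniLM-L6-v2", "rev-b", 384)
        mismatch_detected = False
    except ModelMismatchError as err:
        mismatch_detected = True
        print("refusing to search:", err)
```

Running the validation once at service startup, rather than per query, gives the same protection without adding work to the hot path.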
Production debug guide: when your vector search returns garbage at 2am
Symptom · 01
Search results are irrelevant or random
Fix
Check cosine similarity scores. If they are all below 0.5, suspect embedding drift. Run: python -c "import numpy as np; from sentence_transformers import SentenceTransformer; model = SentenceTransformer('all-MiniLM-L6-v2'); emb1 = model.encode('test query'); emb2 = model.encode('test document'); print(emb1 @ emb2 / (np.linalg.norm(emb1)*np.linalg.norm(emb2)))" Compare the result with the value recorded from a known-good version.
Symptom · 02
P99 latency spike
Fix
Check FAISS index type and parameters. An IVF index may have too few centroids for the corpus size, or nprobe may be set too high; print index.nprobe to verify. Also check whether the index is held in memory or re-read from disk with faiss.read_index on each request. faiss.index_cpu_to_all_gpus can move it to GPU for acceleration.
Symptom · 03
Search returns no results
Fix
Check if the index is empty. index.ntotal should be > 0. If it's zero, the index was not built or was corrupted. Check the indexing pipeline logs for errors. Also verify that the embedding dimension matches: index.d vs len(query_embedding).
Symptom · 04
Memory usage grows unbounded
Fix
Check if the index is being rebuilt in place without releasing the old one. Use import tracemalloc; tracemalloc.start() to track allocations. Also check the embedding path: model.encode does not cache results (show_progress_bar=False only hides the progress bar), so growth there usually means your own code is holding on to returned arrays. Encode in bounded batches and drop references you no longer need.
Embeddings and Semantic Search Triage Cheat Sheet
Copy-paste diagnostics. When it's 2am and you need answers fast.
Irrelevant results
Immediate action
Check embedding model version
Commands
python -c "import sentence_transformers; print(sentence_transformers.__version__)"
python -c "from sentence_transformers import SentenceTransformer; m = SentenceTransformer('all-MiniLM-L6-v2'); print(m._modules)"
Fix now
Rebuild index with pinned model version. Example: pip install sentence-transformers==2.2.0 then re-run indexing script.
High latency
Immediate action
Check FAISS index parameters
Commands
python -c "import faiss; index = faiss.read_index('index.faiss'); print('ntotal:', index.ntotal, 'd:', index.d, 'nprobe:', index.nprobe if hasattr(index, 'nprobe') else 'N/A')"
python -c "import timeit; import numpy as np; import faiss; index = faiss.read_index('index.faiss'); query = np.random.random((1, index.d)).astype('float32'); print(timeit.timeit(lambda: index.search(query, 10), number=100)/100)"
Fix now
Lower nprobe if it was raised (e.g. index.nprobe = 10; higher values improve recall but scan more inverted lists per query), or switch to HNSW: index = faiss.index_factory(d, 'HNSW32')
No results
Immediate action
Check index is populated
Commands
python -c "import faiss; index = faiss.read_index('index.faiss'); print('ntotal:', index.ntotal)"
python -c "import faiss; index = faiss.read_index('index.faiss'); print('is_trained:', index.is_trained)"
Fix now
If ntotal == 0, rebuild index: run indexing pipeline again. If not trained, call index.train(embeddings) before adding.
Memory leak
Immediate action
Check which objects are holding memory
Commands
python -c "import gc; print(len(gc.get_objects()))"
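python -c "import tracemalloc; tracemalloc.start(); pass; snapshot = tracemalloc.take_snapshot(); stats = snapshot.statistics('lineno'); print(stats[:10])" (replace pass with the search call you want to profile; an inline # comment would comment out the rest of the one-liner)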
Fix now
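Encode in bounded batches, e.g. model.encode(sentences, batch_size=32), release references to old embeddings and indexes, and on GPU call torch.cuda.empty_cache() after large batches. Note that show_progress_bar=False only hides the progress bar; it does not affect memory.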
Vector Database Comparison for Production Semantic Search
Concern | FAISS | ChromaDB | Qdrant
Index type | IVF, HNSW, flat (C++) | HNSW (Python) | HNSW (Rust)
Metadata filtering | Post-filter only (slow) | Basic equality filters | Nested, range, geo, full-text
Horizontal scaling | Manual sharding | Single-node only | Built-in sharding + replication
CRUD support | No (rebuild index) | Yes (limited) | Yes (full)
Query speed (10M, 384d) | <10ms (GPU) | ~50ms | ~20ms
Best for | High-throughput, no filters | Prototyping, small scale | Production with complex queries

Key takeaways

1. Always normalize embeddings to unit length before indexing; inner-product search on raw vectors silently returns garbage.
2. Use the same embedding model and tokenizer version at index and query time: a model update invalidates all stored vectors.
3. Set a minimum similarity threshold (e.g., 0.7) to reject low-confidence matches; don't rely on top-k alone.
4. Monitor embedding drift weekly by comparing centroid shifts; a 5% shift means your data distribution changed.
5. Shard by document domain or language to avoid cross-domain semantic collisions that pollute results.
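
The weekly centroid comparison can be sketched as below. The text does not define how a "5% shift" is measured, so this example assumes one reasonable reading: the L2 distance between the two mean vectors, relative to the norm of the reference centroid. The function name and the synthetic data are illustrative only.

```python
import numpy as np


def centroid_shift(reference: np.ndarray, recent: np.ndarray) -> float:
    """Relative L2 distance between the mean vectors of two embedding batches."""
    ref_centroid = reference.mean(axis=0)
    new_centroid = recent.mean(axis=0)
    return float(np.linalg.norm(new_centroid - ref_centroid) / np.linalg.norm(ref_centroid))


# Synthetic stand-ins for last month's and this week's document embeddings.
rng = np.random.default_rng(0)
reference = rng.normal(loc=1.0, scale=0.1, size=(1000, 384))
same_distribution = rng.normal(loc=1.0, scale=0.1, size=(1000, 384))
shifted = rng.normal(loc=1.2, scale=0.1, size=(1000, 384))

stable_shift = centroid_shift(reference, same_distribution)
drifted_shift = centroid_shift(reference, shifted)

print(f"stable:  {stable_shift:.4f}")
print(f"drifted: {drifted_shift:.4f}")
```

A centroid comparison only catches shifts in the average direction of the data; it complements, rather than replaces, a recall test on known queries.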

Common mistakes to avoid

Forgetting to normalize embeddings
Symptom
Inner-product scores are dominated by vector length, so completely unrelated text can outrank true matches because vectors have different magnitudes.
Fix
L2-normalize every embedding before storing: vec = vec / np.linalg.norm(vec). Use inner product search instead of cosine if normalized.

Mixing embedding models across index/query
Symptom
After a model update, all queries return random results because the vectors are in different latent spaces.
Fix
Pin the model version in your config (e.g., sentence-transformers/all-MiniLM-L6-v2@v1). Re-embed entire corpus on model change.

Using top-k without a similarity cutoff
Symptom
For out-of-domain queries, you still get k results, all with similarity < 0.3, but shown as 'relevant'.
Fix
Add a threshold filter: results = [r for r in results if r.score > 0.7]. Return empty set if none pass.

Not chunking documents properly
Symptom
Long documents produce a single embedding that averages away meaning; short queries match noise.
Fix
Chunk by semantic boundaries (paragraphs, not fixed tokens). Overlap chunks by 10-20% to avoid boundary loss.
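
The similarity cutoff from the third pattern can be applied directly to one row of a FAISS result. A minimal sketch, assuming inner-product scores on normalized vectors; `filter_by_score` is a hypothetical helper, and the 0.7 default is the example value from the text, which you would tune per model and domain.

```python
import numpy as np


def filter_by_score(scores, ids, min_score=0.7):
    """Keep (id, score) pairs above the cutoff; FAISS pads missing hits with id -1."""
    return [
        (int(i), float(s))
        for s, i in zip(scores, ids)
        if i != -1 and s > min_score
    ]


# One row of a top-4 search result: scores sorted descending, -1 marks "no hit".
scores = np.array([0.82, 0.71, 0.29, 0.0], dtype=np.float32)
ids = np.array([4, 9, 2, -1])

kept = filter_by_score(scores, ids)
print(kept)

# An out-of-domain query where nothing clears the bar returns an empty list.
print(filter_by_score(np.array([0.28, 0.21], dtype=np.float32), np.array([5, 1])))
```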
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
Explain how embeddings are generated for semantic search. What happens u...
Q02 · SENIOR
How would you design a semantic search pipeline for 10M documents with r...
Q03 · SENIOR
What causes embedding drift and how do you detect it in production?
Q04 · SENIOR
You're paged at 3AM because semantic search returns 100% wrong results. ...
Q05 · SENIOR
Compare FAISS, ChromaDB, and Qdrant for a production semantic search sys...

Q01 of 05 · JUNIOR

Explain how embeddings are generated for semantic search. What happens under the hood?

ANSWER
An embedding model (e.g., BERT) tokenizes input text into subword tokens, passes them through transformer layers, and pools the final hidden states (usually CLS token or mean pooling) into a fixed-size vector (e.g., 384 dimensions). The vector captures semantic meaning in a latent space where similar texts are close. Under the hood, it's a series of matrix multiplications and attention computations — no magic, just linear algebra on token representations.
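
The pooling step in that answer can be shown in isolation. This is a sketch of masked mean pooling in NumPy with toy numbers; in a real model the token vectors would come from the final transformer layer and the mask from the tokenizer.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padding positions (mask == 0)."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid divide-by-zero
    return summed / counts


# Batch of 1 sequence, 3 token positions (the last is padding), hidden size 2.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = mean_pool(hidden, mask)
print(pooled)  # [[2. 3.]]
```

The padding position is excluded from the average, so the large values at the third position do not affect the result.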
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. Why did my vector DB return completely wrong results after a deployment?
02. How do I choose between FAISS, ChromaDB, and Qdrant for production?
03. What similarity metric should I use for semantic search?
04. How often should I re-embed my document corpus?
05. Can I use semantic search for exact keyword matching?