Senior 7 min · May 22, 2026

Vector Databases Explained — How a 23% Recall Drop in Production Cost Us $40k in One Night

Learn what vector databases actually do under the hood, why cosine similarity can silently fail, and how to debug a 23% recall drop at 2am.

Naren · Founder
Quick Answer
  • Embedding Index: The core data structure is an approximate nearest neighbor (ANN) index, not a SQL index. If you query without filtering, you might get results from the wrong partition entirely.
  • Distance Metric: Cosine similarity itself ignores magnitude, but many indexes score with a plain dot product or L2 distance and assume unit-length vectors. If your embeddings aren't normalized, those shortcuts produce wrong rankings. We learned this when a 23% recall drop was caused by unnormalized vectors from a batch job.
  • Metadata Filtering: Many vector DBs apply filters after the ANN search, not during. If your filter is too selective, you can get few or zero results even though matching vectors exist.
  • Index Build Time: HNSW index construction is O(n log n) but memory-bound. A 10M-vector index can take 45 minutes and add about 8GB of graph overhead on top of the raw vectors. Plan for it.
  • Query Latency: p99 latency for a 100k-vector HNSW search with ef_search=100 is ~5ms on CPU. Push ef_search to 500 and it jumps to 40ms. Know your SLA.
  • Vector Dimension: Dimensionality is the #1 hidden cost. 1536-dim float32 vectors from text-embedding-3-small are 6KB each. 10M vectors = 60GB just for the raw vectors, before index overhead.
What Is a Vector Database?

A vector database is a specialized storage and retrieval system designed for high-dimensional vector embeddings: arrays of floating-point numbers that represent semantic meaning in machine learning models. Unlike traditional databases that query exact matches or range filters on structured columns, vector databases use approximate nearest neighbor (ANN) algorithms to find the most similar vectors in milliseconds, even across very large collections.

They exist because semantic search, recommendation systems, and AI-powered retrieval require similarity matching on unstructured data (text, images, audio) that SQL's exact-match paradigm can't handle. The core tradeoff is accuracy vs. speed: you're trading deterministic results for probabilistic ones, and when that probability drops — like a 23% recall loss in production — you're not just losing relevance, you're burning cash on failed retrievals, re-embeddings, and degraded user experience.

Under the hood, vector databases like Pinecone, Weaviate, Qdrant, or Milvus implement ANN via algorithms such as HNSW (Hierarchical Navigable Small World graphs), IVF (Inverted File Index), or product quantization. HNSW builds a multi-layer graph where each layer is a coarser approximation of the data, enabling logarithmic search complexity — you start at the top layer and descend, greedily traversing neighbors.

IVF clusters vectors into Voronoi cells, then searches only the nearest clusters during query time. The index structure is memory-mapped or kept in RAM for speed, with disk-based persistence for durability. Production systems shard across nodes by hashing vector IDs or using consistent hashing, replicate for fault tolerance, and cache frequent queries or hot vectors in Redis or similar.
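To make the IVF idea concrete, here is a minimal, dependency-free sketch. It assumes four hand-picked centroids instead of the k-means training a real IVF index would run, and the names `build_ivf`, `ivf_search`, and `nprobe` are illustrative rather than any library's API. It shows how scanning only the nearest cells trades recall for speed, and how raising `nprobe` recovers the exact answer.

```python
import math
import random

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector index to its nearest centroid (one inverted list per cell)."""
    cells = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        cells[nearest].append(idx)
    return cells

def ivf_search(query, vectors, centroids, cells, k=3, nprobe=1):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in cells[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]

def exact_search(query, vectors, k=3):
    """Brute-force baseline used to measure recall."""
    return sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))[:k]

random.seed(0)
vectors = [[random.uniform(0, 10), random.uniform(0, 10)] for _ in range(200)]
# Assumption: fixed centroids, one per quadrant. Real IVF learns these with k-means.
centroids = [[2.5, 2.5], [2.5, 7.5], [7.5, 2.5], [7.5, 7.5]]
cells = build_ivf(vectors, centroids)

query = [5.1, 5.1]  # close to a cell boundary, the hard case for IVF
exact = exact_search(query, vectors)
for nprobe in (1, 4):
    approx = ivf_search(query, vectors, centroids, cells, nprobe=nprobe)
    recall = len(set(approx) & set(exact)) / len(exact)
    print(f"nprobe={nprobe}: recall@3 = {recall:.2f}")
```

Queries that land near a cell boundary are the ones most likely to miss true neighbors at low `nprobe`, which is one reason recall has to be measured rather than assumed.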

The critical insight: recall isn't just a metric; it's a direct cost driver. A 23% drop means close to one in four of the true nearest neighbors no longer shows up in your results. Users see less relevant items, which forces retries, fallback logic, or user abandonment.

You should not use a dedicated vector database when your data is purely structured (use a relational database, and reach for PostgreSQL with pgvector or Elasticsearch only if you need hybrid search), when you need exact nearest neighbor results (use brute-force kNN, with GPU acceleration if needed, for small datasets), or when your workload is dominated by CRUD on scalar fields (use a relational DB). Vector databases shine for semantic search over unstructured data, real-time recommendation, anomaly detection on embeddings, and RAG (Retrieval-Augmented Generation) pipelines.

The alternatives are: in-memory libraries like FAISS or Annoy (good for static datasets, no operational overhead), SQL extensions like pgvector (good for hybrid queries, but slower at scale), or managed services like Pinecone (zero ops, but vendor lock-in and cost at high QPS). Choose based on your latency SLA, recall requirements, and whether you need real-time index updates — if you're doing batch indexing with nightly rebuilds, FAISS on S3 might save you $40k a year.

If you need sub-50ms queries on streaming data with 99% recall, you'll pay for a production vector DB and tune it obsessively.

[Figure: Vector Database Architecture. Components: (1) Raw Documents (PDF / Text / Images), (2) Embedding Model (text-embedding-3-small), (3) Vector Index (FAISS IVF + PQ), (4) Query Pipeline (top-k ANN, Recall Drop: 23%), (5) Redis Cache (TTL: 5 min) with cache-hit and cache-miss paths.]
Plain-English First

Imagine you have a giant library where every book is described only by its smell. A vector database is like a super-sniffer dog that finds the closest-smelling books in milliseconds. But if someone spills coffee on a book (bad embedding), the dog gets confused and brings you a cookbook when you asked for a mystery novel.

Our recommendation engine served 2 million requests per day. It was fast, cheap, and everyone was happy. Then one night, recall dropped 23%. Users started seeing irrelevant products. Our p99 latency went from 5ms to 800ms. The root cause? A batch of unnormalized vectors from a job that had run for 3 years without issue. That's the problem with vector databases: they work until they silently don't.

Most tutorials show you how to insert a few vectors and run a similarity search. They skip the part where your production data drifts, your embedding model changes, or your index rebuild takes 45 minutes and brings down your service. They don't tell you that the dot-product and L2 shortcuts for cosine similarity only work on normalized vectors, or that metadata filtering often happens after the ANN search, not during.

This article covers exactly what you need to run vector databases in production: how ANN indices work under the hood, when to use (and not use) them, how to debug a recall drop, and the exact commands to run when your p99 latency spikes at 2am. We'll use ChromaDB 0.4.x, OpenAI embeddings (text-embedding-3-small), and LangChain 0.2.x. All code is Python 3.11+ and runnable.

How Vector Databases Actually Work Under the Hood

A vector database is not a database in the traditional sense. It's an approximate nearest neighbor (ANN) index with a thin persistence layer. The core data structure is usually HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index). HNSW builds a multi-layer graph where each layer is a coarser representation of the data. When you query, it starts at the top layer (fewest nodes) and navigates down, greedily moving to closer neighbors at each step. The key parameters are M (number of connections per node) and ef_construction (size of the dynamic candidate list during build). Higher M means better recall but more memory. Higher ef_construction means better index quality but slower build.
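The greedy descent described above can be sketched in a few lines. This is a single-layer simplification, not full HNSW: there are no upper layers and no ef-sized candidate list, and the graph is built by brute force. The helper names (`build_knn_graph`, `greedy_search`) and the parameter `m` (standing in for M) are hypothetical.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_knn_graph(points, m):
    """Link every point to its m nearest neighbours (brute force, fine for a demo)."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        graph[i] = others[:m]
    return graph

def greedy_search(query, points, graph, entry):
    """Walk from `entry`, always hopping to the neighbour closest to the query.

    The walk stops at a local minimum, which is why the result is approximate.
    """
    current, hops = entry, 0
    while True:
        best = min(graph[current], key=lambda j: dist(query, points[j]))
        if dist(query, points[best]) >= dist(query, points[current]):
            return current, hops
        current, hops = best, hops + 1

random.seed(7)
points = [[random.random(), random.random()] for _ in range(500)]
graph = build_knn_graph(points, m=8)

query = [0.5, 0.5]
found, hops = greedy_search(query, points, graph, entry=0)
exact = min(range(len(points)), key=lambda i: dist(query, points[i]))
print(f"greedy walk ended at node {found} after {hops} hops; exact nearest is node {exact}")
```

The walk only inspects a handful of nodes per hop instead of all 500, but nothing guarantees it ends at the true nearest neighbor. Full HNSW reduces that risk with more links per node (M), a wider candidate list (ef), and the layered entry points.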

What the abstraction hides from you: the distance computation. When you call collection.query(), the vector DB computes the distance between your query vector and every candidate vector it visits in the graph. Cosine distance is defined as 1 - dot_product(query, vector) / (norm(query) * norm(vector)), which does not depend on magnitude. Many engines skip the denominator and compute 1 - dot_product(query, vector), which is only correct when both vectors are unit-length. If your vectors aren't normalized and the index takes that shortcut (or uses L2), magnitude leaks into the score and rankings shift. The graph is built with the same distance function, so inserting unnormalized vectors degrades the graph structure itself, not just the final ranking.
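A tiny worked example makes the failure mode visible. The two document vectors below are made up for illustration: one points almost exactly along the query but is short, the other is 45 degrees off but long. True cosine ranks the first one higher; a raw dot product on unnormalized vectors ranks the second one higher; normalizing first makes the dot product agree with cosine again.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Full cosine similarity, including the norm denominator."""
    return dot(a, b) / (norm(a) * norm(b))

def unit(a):
    """Scale a vector to length 1."""
    n = norm(a)
    return [x / n for x in a]

query = [1.0, 0.0]
docs = {
    "close_small": [0.9, 0.1],  # almost the same direction, small magnitude
    "far_large": [5.0, 5.0],    # 45 degrees away, large magnitude
}

def rank(score):
    """Return document names ordered from best to worst under `score`."""
    return sorted(docs, key=lambda name: score(docs[name]), reverse=True)

by_cosine = rank(lambda v: cosine(query, v))
by_raw_dot = rank(lambda v: dot(query, v))
by_unit_dot = rank(lambda v: dot(unit(query), unit(v)))

print("cosine:        ", by_cosine)
print("raw dot:       ", by_raw_dot)
print("normalized dot:", by_unit_dot)
```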

The second hidden detail is memory. HNSW stores the entire graph in memory. For 10M vectors of 1536 dimensions, that's about 60GB for the vectors plus ~8GB for the graph edges. If you're on a node with 64GB RAM, you're at 106% usage. The OS starts swapping, and your p99 latency goes from 5ms to 800ms.
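The arithmetic behind that paragraph is easy to script as a capacity check. The sketch below assumes float32 storage (4 bytes per value) and takes the ~8GB graph overhead as an input straight from the figure quoted above rather than deriving it; the function names are illustrative. Note the exact raw size is slightly above the rounded 60GB.

```python
GB = 10**9  # decimal gigabytes, to match the rounded figures in the text

def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Size of the raw embeddings alone (float32 by default)."""
    return num_vectors * dim * bytes_per_value

def memory_pressure(num_vectors: int, dim: int,
                    graph_overhead_bytes: int, node_ram_bytes: int) -> float:
    """Fraction of node RAM needed for vectors plus graph; above 1.0 means swapping."""
    total = raw_vector_bytes(num_vectors, dim) + graph_overhead_bytes
    return total / node_ram_bytes

vectors = raw_vector_bytes(10_000_000, 1536)
ratio = memory_pressure(10_000_000, 1536,
                        graph_overhead_bytes=8 * GB,  # figure quoted in the text
                        node_ram_bytes=64 * GB)
print(f"raw vectors: {vectors / GB:.1f} GB, RAM needed vs available: {ratio:.0%}")
```

Running a check like this before provisioning is cheaper than discovering the shortfall through swap-induced latency.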

hnsw_internals_demo.py (Python)
import numpy as np
import chromadb
from chromadb.utils import embedding_functions

# Simulate 100k 1536-dim vectors (like text-embedding-3-small)
np.random.seed(42)
num_vectors = 100_000
dim = 1536
raw_vectors = np.random.randn(num_vectors, dim).astype(np.float32)

# Normalize to unit length — critical for cosine similarity
norms = np.linalg.norm(raw_vectors, axis=1, keepdims=True)
normalized_vectors = raw_vectors / norms  # shape: (100000, 1536)

# Initialize ChromaDB with persistent storage
client = chromadb.PersistentClient(path="/tmp/chroma_demo")
collection = client.create_collection(
    name="hnsw_demo",
    metadata={"hnsw:space": "cosine", "hnsw:construction_ef": 200, "hnsw:M": 16}
)

# Add vectors in batches to avoid OOM
batch_size = 10000
for i in range(0, num_vectors, batch_size):
    batch = normalized_vectors[i:i+batch_size]
    ids = [f"vec_{j}" for j in range(i, i+len(batch))]
    collection.add(
        embeddings=batch.tolist(),
        ids=ids,
        metadatas=[{"index": j} for j in range(i, i+len(batch))]
    )

# Query with the first stored vector, so the top hit should be vec_0 at distance ~0
query_vec = normalized_vectors[0].tolist()
results = collection.query(
    query_embeddings=[query_vec],
    n_results=5
)
print("Top 5 results:", results['ids'][0])
print("Distances:", results['distances'][0])
# Note: distances are cosine distances (0 = identical, 1 = orthogonal, 2 = opposite)
# Values outside [0, 2] point to a normalization or metric problem
Normalization is not optional
ChromaDB's cosine distance is computed as 1 - cos_sim, which stays between 0 and 2. If unnormalized vectors are scored with a plain dot product instead, the value can leave that range and rankings become meaningless. Always normalize before insert.
Production Insight
A fraud detection pipeline serving 500K transactions/day used cosine similarity without normalization for 6 months. When they upgraded the embedding model, the new model didn't normalize output. Recall dropped 15% overnight. They caught it because a monitoring alert fired on 'distance > 1.5' appearing in query results.
Key Takeaway
HNSW is a greedy graph search. It's fast but sensitive to vector quality. Always normalize, always pin your embedding model version, and always monitor the distribution of distances returned by queries.

Practical Implementation: Building a Production-Ready Vector Search Pipeline

Most tutorials show you how to insert a few vectors and query them. In production, you need to handle: 1) batch ingestion with retries, 2) embedding model version pinning, 3) index rebuild scheduling, 4) query monitoring, and 5) fallback strategies. Let's build a pipeline that does all of this.

We'll call OpenAI's text-embedding-3-small model (dim=1536) through the openai Python client, use ChromaDB 0.4.x as the vector store, and add a simple retry loop. The key pattern is to separate ingestion from querying: ingestion runs as a batch job that writes to a staging collection, then swaps the production collection atomically. This avoids serving stale or partial data.

production_vector_pipeline.py (Python)
import os
import time
import numpy as np
from typing import List, Dict
from openai import OpenAI
import chromadb
from chromadb.config import Settings

# Initialize clients
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
chroma_client = chromadb.PersistentClient(
    path="/data/chroma",
    settings=Settings(anonymized_telemetry=False)  # disable telemetry in prod
)

def embed_texts(texts: List[str], model: str = "text-embedding-3-small") -> np.ndarray:
    """Get embeddings and normalize them."""
    response = openai_client.embeddings.create(model=model, input=texts)
    embeddings = np.array([d.embedding for d in response.data], dtype=np.float32)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms  # critical: normalize

def create_production_collection(name: str, index_params: Dict = None):
    """Create a collection with production-ready index settings."""
    if index_params is None:
        index_params = {"hnsw:space": "cosine", "hnsw:construction_ef": 200, "hnsw:M": 16}
    return chroma_client.create_collection(
        name=name,
        metadata=index_params
    )

def batch_ingest(collection, texts: List[str], ids: List[str], metadatas: List[Dict], batch_size: int = 100):
    """Ingest texts in batches with retries."""
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        batch_ids = ids[i:i+batch_size]
        batch_metadatas = metadatas[i:i+batch_size]
        
        # Embed with retry
        for attempt in range(3):
            try:
                embeddings = embed_texts(batch_texts)
                break
            except Exception as e:
                print(f"Embedding attempt {attempt+1} failed: {e}")
                time.sleep(2 ** attempt)
        else:
            raise RuntimeError("Failed to embed batch after 3 attempts")
        
        # Insert into ChromaDB
        collection.add(
            embeddings=embeddings.tolist(),
            ids=batch_ids,
            metadatas=batch_metadatas
        )
        print(f"Ingested batch {i//batch_size + 1}: {len(batch_ids)} vectors")

# Usage example
if __name__ == "__main__":
    texts = ["Product A description", "Product B description"]
    ids = ["prod_a", "prod_b"]
    metadatas = [{"category": "electronics"}, {"category": "books"}]
    
    col = create_production_collection("products_v2")
    batch_ingest(col, texts, ids, metadatas)
    print("Ingestion complete. Run `collection.count()` to verify.")
Staging collection pattern
Always ingest into a staging collection (e.g., 'products_v2_staging'), validate with a test query, then swap it into production in one step. ChromaDB has no built-in atomic swap between two collections, so one workaround is to keep each version in its own persistent directory and repoint a symlink at the new one.
Production Insight
A recommendation engine serving 2M req/day used a single collection that was updated in-place every 6 hours. During an ingestion job, the index was partially rebuilt, causing 30 seconds of 500 errors. The fix: use a staging collection and swap via a symlink — zero downtime.
Key Takeaway
Separate ingestion from serving. Use staging collections, batch with retries, and always normalize embeddings. Monitor ingestion latency and error rates.
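The symlink swap mentioned above can be sketched with the standard library alone. This assumes a POSIX filesystem, where renaming one symlink over another is atomic; the directory names are throwaway stand-ins for two index versions, and `swap_symlink` is a hypothetical helper, not part of ChromaDB.

```python
import os
import tempfile

def swap_symlink(link_path: str, new_target: str) -> None:
    """Repoint link_path at new_target with a single atomic rename (POSIX)."""
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)          # clear a leftover from a crashed swap
    os.symlink(new_target, tmp_link)
    os.replace(tmp_link, link_path)  # rename() replaces the old link in one step

# Demo with temporary directories standing in for two persisted index versions.
base = tempfile.mkdtemp()
v1 = os.path.join(base, "products_v1")
v2 = os.path.join(base, "products_v2_staging")
os.mkdir(v1)
os.mkdir(v2)
live = os.path.join(base, "live")

swap_symlink(live, v1)  # initial deploy
swap_symlink(live, v2)  # promote staging after it passes validation
print("live ->", os.readlink(live))
```

A process that already opened the old directory will likely keep using it until it reconnects, so the swap is usually paired with a client reload or a rolling restart of the serving processes.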

When NOT to Use a Vector Database

Vector databases are not a silver bullet. They're terrible for exact matches, range queries, and aggregations. If you need to find 'all products with price < $50', a vector database is the wrong tool — use PostgreSQL with a B-tree index. If you need to count 'how many products are in category X', use a columnar store. Vector databases are optimized for approximate nearest neighbor search, not for SQL-like queries.

Another anti-pattern: using a vector database as a primary data store. They don't support ACID transactions, joins, or complex filters efficiently. We've seen teams try to store all product metadata in ChromaDB metadata fields, then query with complex $and filters. The result: 10-second queries because metadata filtering is O(n) without an index. Keep metadata minimal — just enough for filtering — and store the rest in a relational database.

Finally, don't use a vector database for small datasets (<10k vectors). A brute-force search over 10k 1536-dim vectors takes ~2ms on CPU. The overhead of building an HNSW index (45 minutes for 10M vectors) is not justified. Use scikit-learn's NearestNeighbors with brute-force for small datasets.

when_not_to_use_vdb.py (Python)
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Small dataset: 5k vectors, 1536 dims
np.random.seed(42)
X = np.random.randn(5000, 1536).astype(np.float32)
norms = np.linalg.norm(X, axis=1, keepdims=True)
X = X / norms  # normalize

# Brute-force nearest neighbors (algorithm='brute' makes the exact search explicit)
nn = NearestNeighbors(n_neighbors=5, metric='cosine', algorithm='brute')
nn.fit(X)

# Query
query = X[0].reshape(1, -1)
distances, indices = nn.kneighbors(query)
print("Nearest neighbors (indices):", indices[0])
print("Distances:", distances[0])
# This takes ~2ms for 5k vectors. No need for a vector DB.

# For comparison, ChromaDB with HNSW on the same data takes ~1ms,
# but you pay for infrastructure and operational complexity.
# Rule of thumb: <10k vectors -> sklearn, >100k vectors -> vector DB
Metadata filtering is not free
ChromaDB applies metadata filters after the ANN search. If your filter is selective (e.g., 'category=electronics' matches 1% of data), you may get zero results even though matching vectors exist. Always test queries with and without filters.
Production Insight
A team built a product search on ChromaDB with 50k products. They stored all product metadata (price, brand, category, rating) in the metadata field and used complex filters. Queries took 8-12 seconds. The fix: move metadata to PostgreSQL, use vector DB only for semantic search, and join results in the application layer.
Key Takeaway
Vector databases are for approximate nearest neighbor search, not general-purpose querying. Use them for semantic search, recommendations, and similarity matching. For everything else, use a relational database.
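The post-filtering trap is easy to reproduce without a database. The toy below places 1,000 items on a line (a 1-D stand-in for embedding distance) with only 1% in the 'electronics' category, then compares a naive top-k-then-filter query against an over-fetch loop that widens the candidate pool. All names here are hypothetical, and a real engine may filter differently.

```python
def post_filter_search(query, items, k, fetch, predicate):
    """Take the `fetch` nearest items first, then apply the metadata filter."""
    nearest = sorted(items, key=lambda it: abs(it["pos"] - query))[:fetch]
    return [it for it in nearest if predicate(it)][:k]

def overfetch_search(query, items, k, predicate, factor=4):
    """Widen the candidate pool until k filtered hits are found or data runs out."""
    fetch = k
    while True:
        hits = post_filter_search(query, items, k, fetch, predicate)
        if len(hits) >= k or fetch >= len(items):
            return hits
        fetch = min(fetch * factor, len(items))

def is_electronics(item):
    return item["category"] == "electronics"

# Only every 100th item matches the filter (1% selectivity).
items = [
    {"id": i, "pos": float(i), "category": "electronics" if i % 100 == 0 else "other"}
    for i in range(1000)
]

naive = post_filter_search(500.2, items, k=5, fetch=5, predicate=is_electronics)
wide = overfetch_search(500.2, items, k=5, predicate=is_electronics)
print(f"{len(naive)} hit(s) without over-fetching, {len(wide)} with it")
```

Over-fetching costs extra distance computations per query, so for very selective filters it can be cheaper to resolve the filter first in a relational store and search only the matching IDs.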

Production Patterns & Scale: Indexing, Sharding, and Caching

At scale, the vector database becomes the bottleneck. Here are patterns we've used in production for collections with 10M-100M vectors.

1. Index rebuild strategy: HNSW index build is O(n log n) and memory-bound. For 10M vectors, it takes ~45 minutes and ~8GB RAM for the graph on top of the raw vectors. Never rebuild on the same node that serves queries. Use a separate indexing pipeline: write to a staging collection, let the index build there, then swap. ChromaDB 0.4.x has no separate index-build call; the HNSW graph is built as vectors are added, so a full rebuild means re-adding every vector to a fresh collection. If you do that on the serving node, the bulk insert competes with live queries for CPU and memory.

2. Sharding: ChromaDB doesn't support native sharding. For >10M vectors, you need to shard manually by some key (e.g., tenant ID, region). Each shard is a separate collection. Query all shards in parallel and merge results. We wrote a simple shard router that fans out queries to 4 shards and merges the top-k results.

3. Caching: Vector search queries are often repetitive (e.g., top-10 recommendations for a user). Cache the results with a TTL of 5-15 minutes. Use Redis or a local LRU cache. This reduced our query load by 60%.
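The caching pattern in item 3 can be sketched as a small in-process TTL cache. This is an assumption-laden stand-in for Redis: `TTLCache` and `fake_vector_search` are hypothetical names, and the clock is injectable only so the expiry behaviour can be demonstrated without sleeping.

```python
import time
from typing import Callable, Dict, Hashable, Tuple

class TTLCache:
    """Tiny time-to-live cache; entries older than ttl_seconds are recomputed."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[Hashable, Tuple[float, object]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: Hashable, compute: Callable[[], object]):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = compute()
        self._store[key] = (now, value)
        return value

def fake_vector_search(user_id: str):
    # Stand-in for a real collection.query(...) call; returns canned ids.
    return [f"{user_id}_rec_{i}" for i in range(3)]

fake_now = [0.0]  # mutable cell so the lambda below sees updates
cache = TTLCache(ttl_seconds=300, clock=lambda: fake_now[0])

cache.get_or_compute("user_42", lambda: fake_vector_search("user_42"))  # miss
cache.get_or_compute("user_42", lambda: fake_vector_search("user_42"))  # hit
fake_now[0] = 301.0                                                     # past the 5-minute TTL
cache.get_or_compute("user_42", lambda: fake_vector_search("user_42"))  # miss again
print(f"hits={cache.hits} misses={cache.misses}")
```

The cache key matters as much as the TTL: keying on a user or query ID works for repeat requests, while keying on raw float embeddings rarely produces hits.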

sharded_vector_search.py (Python)
import chromadb
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Dict

class ShardedVectorSearch:
    def __init__(self, shard_names: List[str], base_path: str = "/data/chroma"):
        self.shard_names = shard_names
        self.clients = {}
        for name in shard_names:
            client = chromadb.PersistentClient(path=f"{base_path}/{name}")
            self.clients[name] = client.get_or_create_collection(name)
    
    def query(self, query_embedding: List[float], n_results: int = 10, n_per_shard: int = 20) -> List[Dict]:
        """Query all shards in parallel and merge results."""
        all_results = []
        with ThreadPoolExecutor(max_workers=len(self.shard_names)) as executor:
            futures = {
                executor.submit(
                    col.query, query_embeddings=[query_embedding], n_results=n_per_shard
                ): name
                for name, col in self.clients.items()
            }
            for future in as_completed(futures):
                results = future.result()
                for i in range(len(results['ids'][0])):
                    all_results.append({
                        'id': results['ids'][0][i],
                        'distance': results['distances'][0][i],
                        'metadata': results['metadatas'][0][i] if results['metadatas'] else {},
                        'shard': futures[future]
                    })
        # Sort by distance and return top n_results
        all_results.sort(key=lambda x: x['distance'])
        return all_results[:n_results]

# Usage
searcher = ShardedVectorSearch(shard_names=["products_shard_0", "products_shard_1", "products_shard_2"])
results = searcher.query(query_embedding=[0.1]*1536, n_results=10)
print("Top 10 results across shards:", results)
Parallel query with ThreadPoolExecutor
Use ThreadPoolExecutor to query all shards concurrently. ChromaDB's query is I/O bound (reading from disk), so threads work well. Set n_per_shard to 2x the desired n_results to account for uneven distribution.
Production Insight
A social media platform with 50M user embeddings used a single ChromaDB collection. Queries took 800ms p99. They sharded by user ID into 10 collections. After sharding, p99 dropped to 45ms. The trade-off: they had to manage 10 collections and a fan-out query pattern.
Key Takeaway
For >10M vectors, shard by a natural key and query in parallel. Cache frequent queries. Never rebuild the index on the serving node.
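Sharding by a natural key needs a routing function that gives the same answer in every process. A minimal sketch, assuming a SHA-256 based hash and the shard naming used in the example above; `shard_for` and `shard_name` are hypothetical helpers. Python's built-in hash() is salted per process for strings, so it is not suitable here.

```python
import hashlib
from collections import Counter

def shard_for(key: str, num_shards: int) -> int:
    """Map a routing key to a shard index with a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def shard_name(key: str, num_shards: int, prefix: str = "products_shard_") -> str:
    """Collection name for the shard that owns `key`."""
    return f"{prefix}{shard_for(key, num_shards)}"

# Check how evenly 10,000 synthetic user ids spread over 10 shards.
counts = Counter(shard_for(f"user_{i}", 10) for i in range(10_000))
print("vectors per shard:", sorted(counts.values()))
print("user_123 lives in:", shard_name("user_123", 10))
```

With modulo routing, changing the shard count remaps most keys, so pick the count with growth in mind or move to consistent hashing. If a query only ever needs one key's data, route it to that single shard instead of fanning out.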

Common Mistakes with Specific Examples

We've seen the same mistakes in production across multiple teams. Here are the top five, with exact examples.

Mistake 1: Not normalizing embeddings. This is the #1 cause of recall drops. Example: a batch job that re-embeds 500k products uses a different model version that doesn't normalize. Cosine similarity gives wrong rankings. Fix: always normalize after embedding.

Mistake 2: Using the wrong distance metric. Cosine similarity ignores magnitude; L2 distance and raw dot product do not. If you use L2 distance on unnormalized vectors, the magnitude dominates. Example: vectors with norm 12.7 will be far from vectors with norm 0.3 even if they point in the same direction. Fix: normalize and use cosine, or use L2 on normalized vectors, which gives the same ranking because ||a - b||^2 = 2 - 2 * cos(a, b) for unit vectors.

Mistake 3: Ignoring index build time. A team scheduled an index rebuild during a deployment. The rebuild took 45 minutes, during which queries fell back to brute-force and latency spiked to 800ms. Fix: schedule rebuilds during low traffic, and use a staging collection.

Mistake 4: Over-filtering metadata. A query with where={'category': 'electronics', 'price': {'$lt': 10}} and n_results=10 returned only 2 results: the ANN search found 100 candidates, but only 2 of them matched the filter. ChromaDB returned those 2, but the team expected 10. Fix: increase n_results to account for filter selectivity, or use a two-stage approach: first ANN search, then filter in application.

Mistake 5: Not pinning the embedding model version. OpenAI's text-embedding-3-small had a minor update that changed the output distribution. The team didn't pin the version, and embeddings drifted. Recall dropped 10%. Fix: always specify model='text-embedding-3-small' and pin the version in your requirements.

common_mistakes_fixes.py (Python)
import numpy as np

# Mistake 1: Not normalizing
raw_vec = np.random.randn(1536).astype(np.float32)
# Wrong: using raw_vec directly
# Correct:
normalized = raw_vec / np.linalg.norm(raw_vec)

# Mistake 2: Wrong distance metric
# If you use L2 on unnormalized vectors, magnitude dominates.
# Example: two vectors pointing in same direction but different magnitudes
v1 = np.array([1.0, 0.0])
v2 = np.array([10.0, 0.0])
l2_dist = np.linalg.norm(v1 - v2)  # 9.0 — far apart even though same direction
cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))  # 1.0 — identical direction
print(f"L2: {l2_dist:.2f}, Cosine: {cos_sim:.2f}")

# Mistake 3: Index rebuild blocking
# Never bulk-rebuild on the serving node: re-adding millions of vectors
# can tie up CPU and memory for ~45 minutes.
# Instead, rebuild on staging and swap.

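# Mistake 4: Over-filtering
# Assumes `collection` and `query_vec` are defined as in hnsw_internals_demo.py.
# Increase n_results to compensate for filter selectivity
results = collection.query(
    query_embeddings=[query_vec],
    n_results=100,  # not 10
    where={'category': 'electronics'},
    include=['metadatas', 'distances']
)
# Then filter on price in the application, pairing each id with its metadata
filtered = [
    doc_id
    for doc_id, meta in zip(results['ids'][0], results['metadatas'][0])
    if meta.get('price', float('inf')) < 10
]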

# Mistake 5: Model version drift
# Pin the model in your code:
# openai_client.embeddings.create(model="text-embedding-3-small", ...)
# And record the model name in the collection metadata when you create it:
# client.create_collection(name="products", metadata={"embedding_model": "text-embedding-3-small"})
Pinning model versions is not optional
OpenAI's embedding models can change behavior without a major version bump. We learned this when a minor update caused a 10% recall drop. Pin the exact model name and log it in your collection metadata.
Production Insight
A team at a major e-commerce company used ChromaDB for product search. They didn't normalize embeddings, used L2 distance, and didn't pin the model version. When they upgraded the embedding model, recall dropped 23% and they spent 3 days debugging. The fix was all three corrections.
Key Takeaway
Normalize, pin model versions, use the right distance metric, account for filter selectivity, and never rebuild index on serving nodes. Test these assumptions in a staging environment.
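The claim in Mistake 2 that L2 on normalized vectors ranks the same as cosine follows from expanding the squared distance for unit vectors: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b = 2 - 2 cos(a, b). A quick numeric check, using two random 16-dimensional vectors chosen only for illustration:

```python
import math
import random

def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

random.seed(1)
a = unit([random.gauss(0, 1) for _ in range(16)])
b = unit([random.gauss(0, 1) for _ in range(16)])

cos_sim = sum(x * y for x, y in zip(a, b))           # dot product of unit vectors
l2_squared = sum((x - y) ** 2 for x, y in zip(a, b))  # squared Euclidean distance

print(f"||a-b||^2 = {l2_squared:.6f}")
print(f"2 - 2cos  = {2 - 2 * cos_sim:.6f}")
```

Because squared L2 is a decreasing function of cosine similarity on unit vectors, sorting by either one yields the same neighbor order.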

Vector Databases vs Alternatives: When to Choose What

You have options: vector databases (ChromaDB, Pinecone, Weaviate), approximate nearest neighbor libraries (FAISS, Annoy, ScaNN), and relational databases with vector extensions (pgvector). Here's our production experience with each.

ChromaDB (0.4.x): Best for small-to-medium deployments (<10M vectors) where you want a simple API and don't need high availability. It's single-node, no replication, no sharding. We use it for prototyping and internal tools. Not suitable for production with >10M vectors or SLA requirements.

FAISS (Facebook AI Similarity Search): One of the fastest ANN libraries, with GPU indexing support. We use FAISS for batch similarity jobs (e.g., deduplication of 100M products). But it's a library, not a database: beyond saving and loading index files with faiss.write_index and faiss.read_index, it has no durable storage engine, no metadata filtering, and no general CRUD. You need to build your own persistence layer.

pgvector (PostgreSQL extension): Best for teams that already use PostgreSQL. It adds ANN search via IVFFlat or HNSW indices. The trade-off: slower than dedicated vector DBs (5-10x), but you get ACID transactions, joins, and all of PostgreSQL's features. We use pgvector when the vector search is a secondary feature, not the primary use case.

Pinecone: Fully managed, scales to billions of vectors. Expensive ($0.10/GB/hour). We use it when we need high availability and don't want to manage infrastructure. The lock-in is real — migrating out is painful.

Our rule of thumb: <10M vectors and simple use case -> ChromaDB. >10M vectors and need PostgreSQL features -> pgvector. >100M vectors and need maximum performance -> FAISS with custom persistence. Need fully managed -> Pinecone.

faiss_vs_chromadb.py (Python)
import numpy as np
import faiss
import time

# Generate 1M 1536-dim vectors
np.random.seed(42)
d = 1536
nb = 1_000_000
x = np.random.randn(nb, d).astype(np.float32)
norms = np.linalg.norm(x, axis=1, keepdims=True)
x = x / norms

# Build FAISS index (HNSW)
index = faiss.IndexHNSWFlat(d, 32)  # M=32
index.hnsw.efConstruction = 200
print("Building FAISS index...")
t0 = time.time()
index.add(x)
print(f"FAISS index built in {time.time() - t0:.2f}s")

# Query
query = x[0].reshape(1, -1)
index.hnsw.efSearch = 100
t0 = time.time()
distances, indices = index.search(query, 10)
print(f"FAISS query time: {(time.time() - t0)*1000:.2f}ms")
print("Distances:", distances[0])

# Compare with ChromaDB (conceptual, not run here)
# ChromaDB with same data would take ~45 min to build index and ~5ms per query.
# FAISS builds in ~2 min and queries in ~1ms.
# But FAISS has no metadata filtering and no general CRUD, and persistence is
# limited to index snapshots via faiss.write_index / faiss.read_index.
# Choose based on your needs.
FAISS is not a database
FAISS is a library for vector similarity search. It can snapshot an index to a file (faiss.write_index) and load it back (faiss.read_index), but it has no durable storage engine, no general CRUD, and no metadata filtering. You need to build your own persistence and versioning layer.
Production Insight
A team used ChromaDB for a 50M-vector product catalog. Queries took 800ms and the index rebuild took 6 hours. They migrated to FAISS with a custom persistence layer (S3 + Redis cache). Query time dropped to 2ms, but they spent 2 months building the infrastructure. The trade-off was worth it for their scale.
Key Takeaway
Choose your vector search technology based on scale, feature requirements, and operational complexity. ChromaDB for simplicity, FAISS for performance, pgvector for SQL integration, Pinecone for managed service.

Debugging and Monitoring Vector Databases in Production

Monitoring a vector database is different from monitoring a relational database. The key metrics are: query latency (p50, p99), recall (measured against brute-force), distance distribution, and index build time. We use a custom monitoring script that runs every 5 minutes and logs these metrics to Datadog.

Recall monitoring: Periodically run a set of known queries against both the ANN index and a brute-force search. Compare the top-10 results. If recall drops below 95%, alert. This catches embedding drift, normalization issues, and index corruption.

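Distance distribution: Log the distances returned by queries. Cosine distance ranges from 0 to 2, and healthy top-k neighbors rarely sit near the top of that range. If you see distances > 1.5 (or values outside 0 to 2), something is wrong: likely unnormalized vectors or a model mismatch. We alert on 'max_distance > 1.5'.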

Index build time: Monitor how long a full index rebuild takes. If it's increasing over time, your data volume is growing faster than expected, or the index parameters need tuning.

monitor_vector_search.py (Python)
import numpy as np
import chromadb
import time
from sklearn.neighbors import NearestNeighbors

def monitor_recall(collection, query_vectors: np.ndarray, k: int = 10):
    """Compare ANN results with brute-force to compute recall."""
    # Get all vectors from collection (assumes small set for monitoring)
    all_data = collection.get(include=['embeddings'])
    all_embeddings = np.array(all_data['embeddings'])
    all_ids = all_data['ids']
    
    # Brute-force search
    nn = NearestNeighbors(n_neighbors=k, metric='cosine')
    nn.fit(all_embeddings)
    
    recall_sum = 0
    for query in query_vectors:
        # ANN search
        ann_results = collection.query(
            query_embeddings=[query.tolist()],
            n_results=k
        )
        ann_ids = set(ann_results['ids'][0])
        
        # Brute-force search
        distances, indices = nn.kneighbors(query.reshape(1, -1))
        bf_ids = set([all_ids[i] for i in indices[0]])
        
        # Compute recall
        intersection = ann_ids.intersection(bf_ids)
        recall = len(intersection) / k
        recall_sum += recall
    
    avg_recall = recall_sum / len(query_vectors)
    print(f"Average recall@{k}: {avg_recall:.3f}")
    if avg_recall < 0.95:
        print("ALERT: Recall below 0.95! Check embedding normalization and model version.")
    return avg_recall

# Usage
client = chromadb.PersistentClient(path="/data/chroma")
collection = client.get_collection("products")
# Generate 10 random query vectors for monitoring
query_vecs = np.random.randn(10, 1536).astype(np.float32)
query_vecs = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
monitor_recall(collection, query_vecs)
Brute-force recall check is expensive
Running a brute-force search over all vectors is O(n). For large collections (>1M), do this on a separate node or only on a sample of the data. We run it every 5 minutes on a 10k-vector sample.
Production Insight
A team had a recall drop from 98% to 60% over 3 weeks. They didn't notice because they only monitored latency, not recall. By the time they caught it, 15% of users had churned. They now run recall monitoring every 5 minutes and alert on <95%.
Key Takeaway
Monitor recall, not just latency. Log distance distributions. Track index build times. These metrics catch the silent failures that latency monitoring misses.
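One way to act on "log distance distributions" is to summarize the top-k distances returned for each monitoring query and compare that summary against a known-good baseline. The sketch below is plain Python with no dependencies. The names summarize_distances and drifted, and the 0.15 tolerance, are hypothetical choices for illustration and not part of any vector database API.

```python
import statistics


def summarize_distances(distances):
    """Return min, median, p95 and max for a non-empty list of top-k distances."""
    ordered = sorted(distances)
    # Nearest-rank p95 using integer math: index of ceil(0.95 * n) - 1.
    p95_index = (95 * len(ordered) + 99) // 100 - 1
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


def drifted(baseline, current, tolerance=0.15):
    """Flag a median shift larger than `tolerance` (a hypothetical threshold)."""
    return abs(current["median"] - baseline["median"]) > tolerance


# Toy numbers: a healthy baseline and a batch where neighbors moved farther away.
baseline = summarize_distances([0.1, 0.2, 0.3, 0.4])
current = summarize_distances([0.5, 0.6, 0.7, 0.8])
print(baseline, current, drifted(baseline, current))
```

A shift in the median or p95 distance often shows up before click-through metrics move, so it is worth logging the summary next to recall rather than instead of it.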
● Production incident · POST-MORTEM · severity: high

The 23% Recall Drop That Cost $40k

Symptom
On-call engineer saw a 23% drop in click-through rate on recommendations at 3:14 AM. Grafana showed p99 latency for vector search jumped from 5ms to 800ms. No deployment had occurred in 48 hours.
Assumption
The team assumed that cosine similarity handled any vector magnitude automatically, and that the embedding model (text-embedding-3-small) always returned unit-normalized vectors.
Root cause
A batch job that re-embedded 500k product descriptions used an older version of the embedding model that did not normalize its output, so the vectors had magnitudes ranging from 0.3 to 12.7. An index that scores by dot product agrees with cosine similarity only when vectors are unit-length, so the unnormalized batch produced wrong rankings. The latency spike came from the index rebuild that followed the batch insert: the HNSW graph construction hit memory limits and queries fell back to brute-force search.
Fix
1. Immediately disable the recommendation service by setting the feature flag 'use_vector_search' to False. Fall back to a simple SQL-based popularity sort.
2. Delete the corrupted embeddings from the collection: collection.delete(where={'batch_id': '2026-05-21'})
3. Re-run the batch job with normalized embeddings by adding embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True) after calling client.embeddings.create().
4. Rebuild the HNSW index with collection.create_index(index_type='hnsw', ef_construction=200, M=16). This took 45 minutes on our 8M-vector collection. Treat create_index as shorthand for your database's rebuild procedure; Chroma, for example, takes its HNSW settings when the collection is created.
5. Gradually re-enable the service, monitoring recall and latency for 30 minutes before full rollout.
Key lesson
  • Always validate embedding normalization before inserting into a cosine-similarity index. Add a unit test that checks np.allclose(np.linalg.norm(embeddings, axis=1), 1.0).
  • Pin your embedding model version in production. A minor version bump can change output behavior silently.
  • Never rebuild an index on the same node that serves queries. Use a separate indexing pipeline with a blue-green deployment pattern.
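The root cause and the first lesson can be reproduced on two toy vectors. The sketch below is dependency-free Python, not the team's pipeline: rank, assert_unit_length, and the sample vectors are hypothetical, with magnitudes chosen to sit inside the 0.3 to 12.7 range from the incident. It shows a dot-product ranking flipping on unnormalized vectors, then a pre-insert gate that mirrors the np.allclose check above.

```python
import math


def l2_norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def normalize(vector):
    """Scale a vector to unit length; a zero vector has no direction to keep."""
    norm = l2_norm(vector)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (l2_norm(a) * l2_norm(b))


def rank(score, query, docs):
    """Return doc ids ordered from highest to lowest score."""
    return sorted(docs, key=lambda doc_id: score(query, docs[doc_id]), reverse=True)


def assert_unit_length(batch, tolerance=1e-3):
    """Raise before insertion if any vector's norm strays from 1.0."""
    for index, vector in enumerate(batch):
        norm = l2_norm(vector)
        if abs(norm - 1.0) > tolerance:
            raise ValueError(f"vector {index} has norm {norm:.3f}, expected 1.0")


# Toy data: one short vector aligned with the query, one long vector off-axis.
query = [1.0, 0.0]
docs = {"aligned_short": [0.3, 0.0], "off_axis_long": [8.0, 9.0]}

print(rank(dot, query, docs))     # dot product favors the long vector
print(rank(cosine, query, docs))  # cosine favors the aligned vector

unit_docs = {doc_id: normalize(vec) for doc_id, vec in docs.items()}
assert_unit_length(unit_docs.values())  # passes only after normalization
print(rank(dot, query, unit_docs))      # now agrees with the cosine ranking
```

Running the gate in the batch job itself, right before the insert call, means a model that stops normalizing fails the job instead of quietly corrupting rankings.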
Production debug guide
When recall drops and latency spikes at 2am. 4 entries
Symptom · 01
Recall drops >10% but latency is normal
Fix
Check if your query vectors are normalized. Run: import numpy as np; norms = np.linalg.norm(query_vectors, axis=1); print(norms.min(), norms.max()). If any norm is not close to 1.0, you have a normalization mismatch.
Symptom · 02
p99 latency jumps from 5ms to 800ms
Fix
Check if the index was rebuilt recently. Inspect collection.metadata for an index_build_time entry (this assumes your indexing pipeline records one; it is not a built-in field). If it's missing or recent, queries may have fallen back to brute-force. Cross-check with collection.count(): if the count is above 1M and latency is high, the index is likely not built.
Symptom · 03
Zero results returned even though matching vectors exist
Fix
Check metadata filter selectivity. Run the same query without filters. If you get results, the filter is too restrictive for the number of candidates the search returns. Also check whether metadata fields are indexed: without an index, filtering is an O(n) scan, and a filter applied after a fixed top-k search can silently drop matching vectors.
Symptom · 04
Results are from the wrong semantic cluster
Fix
Check the distance metric. If you switched from cosine to L2, the rankings can differ significantly on unnormalized vectors. Inspect collection.metadata and verify the configured metric (in Chroma this is typically the hnsw:space key). Also check that your query vector comes from the same embedding model as the indexed vectors, because a model mismatch produces garbage.
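Symptom 03 is easier to recognize once you have seen it on three vectors. The following is a minimal sketch in plain Python, assuming a database that filters after the nearest-neighbor search; post_filter_search, pre_filter_search, and the sample items are hypothetical and do not model any specific product's query planner.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def post_filter_search(query, items, k, predicate):
    """Take the k nearest items first, then drop those that fail the filter."""
    nearest = sorted(items, key=lambda item: cosine_distance(query, item["vector"]))[:k]
    return [item["id"] for item in nearest if predicate(item)]


def pre_filter_search(query, items, k, predicate):
    """Apply the filter first, then take the k nearest of what remains."""
    candidates = [item for item in items if predicate(item)]
    nearest = sorted(candidates, key=lambda item: cosine_distance(query, item["vector"]))[:k]
    return [item["id"] for item in nearest]


def is_electronics(item):
    return item["category"] == "electronics"


items = [
    {"id": "a", "vector": [1.0, 0.0], "category": "books"},
    {"id": "b", "vector": [0.9, 0.1], "category": "books"},
    {"id": "c", "vector": [0.0, 1.0], "category": "electronics"},
]
query = [1.0, 0.0]

# The two nearest vectors are both books, so post-filtering leaves nothing.
print(post_filter_search(query, items, k=2, predicate=is_electronics))
# Filtering first still finds the matching vector.
print(pre_filter_search(query, items, k=2, predicate=is_electronics))
```

If your database only post-filters, requesting more candidates than you need and trimming afterwards is a common workaround, at the cost of extra latency.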
★ Vector Databases Explained Triage Cheat Sheet
Copy-paste diagnostics. When it's 2am and you need answers fast.
Recall drop
Immediate action
Check embedding normalization
Commands
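python -c "import chromadb, numpy as np; col = chromadb.PersistentClient(path='/data/chroma').get_collection('products'); emb = np.array(col.get(limit=1000, include=['embeddings'])['embeddings']); norms = np.linalg.norm(emb, axis=1); print('Min norm:', norms.min(), 'Max norm:', norms.max())"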
python -c "import chromadb; client = chromadb.PersistentClient(path='/data/chroma'); col = client.get_collection('products'); print(col.metadata)"
Fix now
Normalize vectors: embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True) and re-insert.
High latency
Immediate action
Check if index exists and is being used
Commands
python -c "import chromadb; client = chromadb.PersistentClient(path='/data/chroma'); col = client.get_collection('products'); print('Count:', col.count()); print('Metadata:', col.metadata)"
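python -c "import time, chromadb; col = chromadb.PersistentClient(path='/data/chroma').get_collection('products'); vec = [float(x) for x in col.get(limit=1, include=['embeddings'])['embeddings'][0]]; t0 = time.time(); col.query(query_embeddings=[vec], n_results=10); print('Query time:', time.time() - t0)"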
Fix now
Rebuild the index on a separate node: collection.create_index(index_type='hnsw', ef_construction=200, M=16). The call is schematic; use your database's own rebuild procedure.
Empty results
Immediate action
Test query without filters
Commands
python -c "import chromadb; col = chromadb.PersistentClient(path='/data/chroma').get_collection('products'); vec = [float(x) for x in col.get(limit=1, include=['embeddings'])['embeddings'][0]]; print('Results without filter:', len(col.query(query_embeddings=[vec], n_results=10)['ids'][0])); print('Results with filter:', len(col.query(query_embeddings=[vec], n_results=10, where={'category': 'electronics'})['ids'][0]))"
Fix now
Add a metadata index if your database supports one (collection.create_index(index_type='flat', field='category') is a schematic call, not a specific client API), or reduce filter selectivity.
Wrong semantic results
Immediate action
Verify embedding model consistency
Commands
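python -c "import chromadb; col = chromadb.PersistentClient(path='/data/chroma').get_collection('products'); print('Collection model:', (col.metadata or {}).get('embedding_model')); print('Stored dim:', len(col.get(limit=1, include=['embeddings'])['embeddings'][0]))"
# Compare both values with the model name and vector length your query pipeline uses. A hash of the query vector cannot identify the model.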
Fix now
Re-embed all vectors with the same model version. Pin model version in your pipeline.
Vector Database vs. Alternative Search Methods
| Concern | Vector Database (ANN) | Traditional Search (e.g., Elasticsearch) | Exact Nearest Neighbor (Flat Index) |
|---|---|---|---|
| Latency at 1M vectors | 1-10ms (ANN) | 10-100ms (keyword + embedding hybrid) | 100ms-1s (O(n) scan) |
| Recall guarantee | 95-99% (tunable) | 100% (exact match) | 100% |
| Memory footprint | 2-10x vector size (HNSW) | 1-2x (inverted index) | 1x (raw vectors) |
| Update throughput | 100-1000 ops/s (batch) | 10k+ ops/s (streaming) | 1k ops/s |
| Best use case | Semantic search, recommendations | Full-text + metadata filtering | Small datasets (<100k) |
| Cost per query | $0.0001-0.001 | $0.001-0.01 | $0.01-0.1 |

Key takeaways

1. Always monitor recall at query time with a holdout set. A 23% drop can happen silently due to unnormalized embeddings, index corruption, or parameter drift.
2. Use HNSW for high recall at low latency and IVF for memory-constrained workloads. Reserve flat search for small datasets (<100k vectors) rather than making it the production default.
3. Shard by vector cluster (e.g., k-means centroids), not by ID range. Cross-shard queries kill latency.
4. Cache frequent query embeddings and their top-k results at the application layer, not in the vector DB's internal cache.
5. Set ef_search and ef_construction explicitly per workload. Defaults are optimized for benchmarks, not your data distribution.
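The application-layer cache from takeaway 4 can be sketched as a small LRU keyed on the query embedding. This is an illustration under assumptions, not a recommended production design: TopKCache, its capacity, and the rounding precision are hypothetical, and the sketch leaves out invalidation when the index changes.

```python
from collections import OrderedDict


class TopKCache:
    """Tiny LRU cache mapping (query embedding, k) to a list of result ids."""

    def __init__(self, capacity=1024, precision=6):
        self.capacity = capacity
        self.precision = precision
        self._entries = OrderedDict()

    def _key(self, embedding, k):
        # Round so that float noise does not turn identical queries into misses.
        return (tuple(round(x, self.precision) for x in embedding), k)

    def get(self, embedding, k):
        key = self._key(embedding, k)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, embedding, k, result_ids):
        key = self._key(embedding, k)
        self._entries[key] = list(result_ids)
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used


cache = TopKCache(capacity=2)
cache.put([0.1, 0.2], 10, ["a"])
cache.put([0.3, 0.4], 10, ["b"])
cache.get([0.1, 0.2], 10)         # touch the first entry so it survives
cache.put([0.5, 0.6], 10, ["c"])  # evicts the entry for [0.3, 0.4]
print(cache.get([0.3, 0.4], 10))
```

Cached results go stale whenever vectors are inserted, deleted, or re-embedded, so a real deployment would clear or version the cache on every index change.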

Common mistakes to avoid

4 patterns

Using default HNSW parameters

Symptom
Recall drops 15-30% after data insertion because ef_construction is too low for your data cardinality.
Fix
Set ef_construction = 2 * ef_search; for 1M+ vectors, start with ef_construction=400 and ef_search=200.

Not normalizing embeddings before cosine similarity

Symptom
Top-10 results contain irrelevant vectors because dot product and cosine similarity diverge on unnormalized data.
Fix
L2-normalize all embeddings before insertion and query; then use dot product (which equals cosine similarity on normalized vectors).

Ignoring index rebuild after bulk deletes

Symptom
Recall degrades over time as deleted vectors leave gaps in the graph structure, causing ANN to skip valid neighbors.
Fix
Schedule periodic index rebuilds (e.g., every 100k deletes) or use a tombstone-aware index like DiskANN.

Sharding by primary key range

Symptom
Queries hit all shards because similar vectors are scattered across partitions, increasing latency 4-10x.
Fix
Shard by vector cluster (e.g., k-means centroids) so each shard contains semantically similar vectors; route queries to top-2 closest centroids.
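Routing a query to the closest centroids, as the last fix describes, could look like the sketch below. It assumes the centroids were already computed offline (for example with k-means) and stored per shard; route_to_shards and the shard names are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance; the square root is not needed for ranking."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def route_to_shards(query, centroids, fanout=2):
    """Return the ids of the `fanout` shards whose centroids are closest to the query."""
    ranked = sorted(centroids, key=lambda shard_id: squared_l2(query, centroids[shard_id]))
    return ranked[:fanout]


# Hypothetical centroids, one per shard.
centroids = {
    "shard-0": [0.0, 0.0],
    "shard-1": [10.0, 0.0],
    "shard-2": [0.0, 10.0],
    "shard-3": [10.0, 10.0],
}

# Only two of the four shards receive this query.
print(route_to_shards([9.0, 2.0], centroids))
```

The fanout trades recall for latency: a query near a cluster boundary may have true neighbors in a shard that was not searched, so it is worth measuring recall for each fanout value before settling on one.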
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain how HNSW works under the hood.
Q02 · SENIOR
What's the difference between cosine similarity and dot product, and whe...
Q03 · SENIOR
How would you design a vector search system that handles 10M updates per...
Q04 · SENIOR
What happens to recall when you delete 50% of vectors from an HNSW index...
Q05 · SENIOR
How do you measure and guarantee recall in a vector database?
Q01 of 05 · SENIOR

Explain how HNSW works under the hood.

ANSWER
HNSW (Hierarchical Navigable Small World) builds a multi-layer graph. The bottom layer contains all vectors; higher layers are sparser subsets. During search, it starts at the top layer, greedily traverses to the nearest neighbor, then descends to the next layer, repeating until the bottom layer. The ef_search parameter controls the beam width—higher values increase recall but cost more distance computations. Insertion uses ef_construction to determine neighbor candidates, then prunes to M neighbors per node. The hierarchical structure gives O(log n) average search time.
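The search half of that answer can be sketched on a hand-built graph. This toy is not a full HNSW implementation: it skips construction and neighbor pruning entirely, the layers are written out by hand, and the names greedy_step, beam_search, and search are hypothetical. It only illustrates the greedy descent through the sparse upper layer and the ef-bounded search on the bottom layer.

```python
import heapq
import math


def greedy_step(query, entry, graph, vectors):
    """Move to a closer neighbor until no neighbor improves on the current node."""
    current = entry
    improved = True
    while improved:
        improved = False
        for neighbor in graph.get(current, []):
            if math.dist(query, vectors[neighbor]) < math.dist(query, vectors[current]):
                current = neighbor
                improved = True
    return current


def beam_search(query, entry, graph, vectors, ef):
    """Best-first search that keeps at most `ef` results, the role ef_search plays."""
    start = math.dist(query, vectors[entry])
    visited = {entry}
    candidates = [(start, entry)]  # min-heap: closest unexplored node first
    best = [(-start, entry)]       # max-heap via negation: worst kept result on top
    while candidates:
        dist, node = heapq.heappop(candidates)
        if dist > -best[0][0]:
            break  # the closest candidate is already worse than the worst kept result
        for neighbor in graph.get(node, []):
            if neighbor in visited:
                continue
            visited.add(neighbor)
            d = math.dist(query, vectors[neighbor])
            if len(best) < ef or d < -best[0][0]:
                heapq.heappush(candidates, (d, neighbor))
                heapq.heappush(best, (-d, neighbor))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the farthest kept result
    return [node for _, node in sorted((-neg, node) for neg, node in best)]


def search(query, layers, vectors, entry, k, ef):
    """Descend greedily through the upper layers, then beam-search the bottom layer."""
    for graph in layers[:-1]:
        entry = greedy_step(query, entry, graph, vectors)
    return beam_search(query, entry, layers[-1], vectors, max(ef, k))[:k]


# Eight points on a line. The upper layer holds only nodes 0 and 4.
vectors = {i: (float(i),) for i in range(8)}
bottom = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
upper = {0: [4], 4: [0]}

print(search((5.2,), [upper, bottom], vectors, entry=0, k=2, ef=3))
```

The upper layer lets the search jump from node 0 to node 4 in one hop instead of walking the chain, which is the intuition behind the logarithmic search cost in the answer above.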
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What causes recall to drop in a vector database?
02
How do I monitor vector search recall in production?
03
When should I use IVF instead of HNSW?
04
Can I use a vector database for exact nearest neighbor search?
05
How do I handle vector database caching?