MongoDB $in Query COLLSCAN — Non-Selective Index Gotcha
A $in query on status caused 30-second response times due to COLLSCAN—single-field index was non-selective.
- MongoDB indexes are B-tree data structures that map indexed field values to document locations
- A single-field index sorts values and enables fast lookup, range queries, and sort operations
- Compound indexes support queries on multiple fields; the order of fields must follow the ESR rule (Equality, Sort, Range)
- explain() reveals whether a query uses an index (IXSCAN) or scans the entire collection (COLLSCAN)
- Performance insight: index lookup is O(log n) vs COLLSCAN's O(n) — on 10M documents, that's ~25 reads vs 10M reads
- Production insight: wrong index order can make queries slower than no index because MongoDB may still read many documents to filter unindexed fields
- Covered queries skip document fetch — fast reads, great for high-frequency paths
Imagine a 1,000-page cookbook with no table of contents. Every time you want a pasta recipe, you flip every single page until you find one. An index is that table of contents — it tells MongoDB exactly which page (document) to turn to without reading every other page first. The difference between a query taking 50ms and 50 seconds is almost always whether the right index exists.
MongoDB can hold hundreds of millions of documents and still answer a query in single-digit milliseconds — but only if you've built the right indexes. Without them, every query triggers a COLLSCAN (collection scan), meaning MongoDB reads every document in the collection sequentially. At scale, that's the difference between a snappy user experience and a timed-out API call that wakes your on-call engineer at 3 AM.
The problem indexes solve is deceptively simple: avoid reading data you don't need. But the implementation details are where engineers get into trouble. The wrong index order in a compound index, a missing index on a sort field, or a wildcard index used carelessly can make performance worse than no index at all — because now MongoDB has to maintain extra data structures on every write.
By the end of this article you'll understand exactly how MongoDB's B-tree indexes are structured on disk, how to construct compound indexes using the ESR rule, how to read an explain() plan like a senior engineer, and the production gotchas that trip up teams who learned indexing from a five-minute tutorial. The silent killer: a missing index on a sort field can turn a 10ms query into a 10-second blocking sort. Don't let that be you.
What is MongoDB Indexing?
MongoDB Indexing is a core concept in Database. Rather than starting with a dry definition, let's see it in action and understand why it exists.
Indexes in MongoDB are special data structures that store a small subset of the collection's data in an easy-to-traverse form. They point to the location of documents so the query engine can find data without scanning every document. Think of them as a book's index — you don't read every page to find a topic, you flip to the index and then jump directly to the relevant pages.
In production, forgetting an index on a frequently queried field is the #1 cause of performance degradation. The worst part? It's silent until your collection hits a few million documents and your API latency spikes from 50ms to 30s. Indexing isn't optional — it's the difference between your database feeling like a spreadsheet and feeling like a proper DBMS.
Here's the thing: a single COLLSCAN on a 10M document collection can take 10 seconds even on fast hardware. That same query with an index runs in under 10ms. But indexes cost you on writes — each insert or update must update every index. Over-indexing is a real problem, especially in high-throughput systems. We'll cover the trade-offs later.
How B-Tree Indexes Work Internally
MongoDB indexes use a B-tree data structure — the same family as MySQL and PostgreSQL indexes. The B-tree organises keys in sorted order and uses internal nodes to speed up search, insert, and delete. A typical index entry stores the indexed field value(s) and a pointer to the document's location (BSON object ID or disk location).
The branching factor (number of children per node) is typically around 1000 for modern MongoDB versions. This keeps tree depth low: a 10-million document index is only 3-4 levels deep. Searching a B-tree is O(log n) — for 10M documents, that's about 25 node accesses compared to a full collection scan of 10M documents.
MongoDB stores indexes in wiredTiger's B-tree engine by default. Each index is a separate B-tree that must be maintained on every write. That's why over-indexing hurts write performance.
Here's the thing many engineers miss: the B-tree is also used for sorting. If your sort field matches the index order, MongoDB can traverse the index leaf nodes in order and avoid an in-memory sort. That's where the ESR rule becomes critical.
Additionally, B-tree indexes are stored as multikey structures when indexing arrays. Each array element becomes a separate index entry, which can cause index explosion. If you have an array with 1000 elements, each document will create 1000 index entries for that one index.
Another subtlety: B-tree page splits happen when a node overflows. On high-write systems with monotonic keys, constant right-side splits can cause fragmentation and wasted space. That's one reason hashed indexes help on shard keys.
- The root node is the directory: it tells you which drawer (child node) to open.
- Each internal node narrows the search range by comparing the query value with the keys stored in the node.
- Leaf nodes contain the actual index entries (key + doc pointer) and are linked together for efficient range scans.
- The tree is self-balancing — every leaf is at the same depth, ensuring consistent performance.
Compound Indexes and the ESR Rule
A compound index is an index on multiple fields. The order of fields in a compound index is critical because MongoDB can only use the index to support queries that filter prefixes of the indexed fields. For example, an index on {a:1, b:1} can support queries on {a} and {a,b} but not on {b} alone.
The ESR rule — Equality, Sort, Range — is the mental model senior engineers use to design compound indexes: - Fields that are compared by equality (=) must come first. - Next comes the field used for sorting (sort). - Last come fields used for range queries ($gt, $lt, $in, etc.).
ESR ensures that MongoDB can use the index to satisfy both the filter and the sort in one pass, avoiding an in-memory sort. Violating ESR usually forces a blocking sort stage, which can consume huge amounts of memory and slow queries.
In practice, I've seen production outages caused by a compound index with the sort field at the beginning. The query would use the index for the equality match, but then MongoDB had to sort 100k documents in memory, hitting the 32MB limit and throwing an error. The fix? Move the range field last and add the sort field as the second key.
But ESR isn't always perfect. When the sort field differs from the range field, you have a dilemma: you can satisfy either the sort with an index scan or the range with an index scan, but not both in one pass. MongoDB will choose an index based on selectivity. In such cases, consider two indexes: one for the filter and one for the sort, or use a covered query if possible.
A common mistake: the sort direction must match the index direction. If the index is {createdAt: 1} but the query sorts by {createdAt: -1}, MongoDB can still use the index but will traverse it backwards. That's fine for single-field indexes, but for compound indexes the direction of each key matters. MongoDB can reverse the entire index order, but not individual fields. So {a: 1, b: -1} can support sort {a: -1, b: 1} but not {a: 1, b: 1}.
Reading Explain Plans Like a Senior Engineer
When a query is slow, the command is your first diagnostic tool. Run explain()db.collection.find(...).explain('executionStats'). The output has three key sections: - queryPlanner: the plan MongoDB chose, including the winning index. - executionStats: actual execution metrics — nReturned, totalDocsExamined, totalKeysExamined, executionTimeMillis. - rejectedPlans: why other indexes were not chosen.
The key ratio is totalKeysExamined / nReturned. If it's close to 1, the index is highly selective. Ratios above 10 indicate the query is scanning many index entries for each returned document — consider adding more fields to the index.
Also check for stage values: IXSCAN (index scan), COLLSCAN (collection scan), SORT (in-memory sort), FETCH (retrieving documents after index lookup). A SORT stage combined with IXSCAN means the index is not covering the sort.
Senior engineers don't just look at executionTimeMillis. They look at the ratio of keys examined to documents returned, and they look at the number of documents fetched. If totalDocsExamined equals nReturned but totalKeysExamined is 10x higher, you have a selectivity problem, not a missing index.
Another hidden metric: totalDocsExamined. If that number is high, it means MongoDB is fetching many documents from disk even after the index. That happens when the index doesn't cover the projected fields. Consider a covered query.
And remember: the rejectedPlans array is gold. It shows why the optimizer rejected other index candidates. If you see a plan that looks better but wasn't chosen, you may need to hint or rewrite the query.
db.collection.explain('executionStats').find(...) to get execution stats without running the query twice. But be careful — in production, running executionStats on a heavy query may consume resources. For read-heavy workloads, consider using db.collection.explain('queryPlanner') first.Common Index Types and When to Use Them
MongoDB offers several index types beyond the basic single-field and compound indexes: - Multikey Index: automatically created when indexing an array field. Each array element gets an index entry. Useful for querying tags, categories. - Text Index: for full-text search on string fields. Supports stemming and stop words. Only one text index per collection. - Hashed Index: uses a hash of the field value to provide uniform distribution. Good for equality queries only, not range. - Wildcard Index: indexes all fields in a document (or a subset) without specifying field names. Useful for schemaless data but large and write-heavy. - 2dsphere Index: for geospatial queries on GeoJSON objects. Supports $near, $geoWithin, $geoIntersects.
Each type has trade-offs. For example, wildcard indexes consume significant memory and CPU on writes. Multikey indexes can cause index explosion if array elements are many.
I've seen teams create a wildcard index on a collection they used for logging — and then wonder why write throughput dropped by 80%. Wildcard indexes are not a shortcut; they're a last resort when query patterns are truly unpredictable. For most production workloads, explicit indexes on known query fields are always better.
Text indexes are particularly tricky: they support stemming and stop words in many languages, but they're not as fast as Elasticsearch. If you need full-text search at scale, consider using a dedicated search engine instead.
Another subtlety: TTL indexes automatically expire documents after a specified time. They're built on a single date field and run a background thread every 60 seconds. Great for session data, logs, or temporary records. But don't rely on them for precise expiry — the deletion can be delayed up to 60 seconds.
Production Gotchas: Over-Indexing, Covered Queries, and Sparse Indexes
Even well-designed indexes can cause problems at scale. Three common gotchas:
Over-indexing: More indexes means slower writes because each index B-tree must be updated. MongoDB uses an internal write lock per collection, but index maintenance adds CPU and memory pressure. A typical rule of thumb: no more than 5 indexes per collection for high-write workloads.
Covered queries: A query is 'covered' when all the fields it needs are in the index and MongoDB doesn't need to fetch documents from the collection. Covered queries are blazing fast because they only touch the index B-tree. To achieve coverage, include projected fields in the index (but be aware of the 1024-byte limit on index keys — MongoDB indexes can only store values up to 1024 bytes; larger values are skipped).
Sparse indexes: A sparse index only includes documents where the indexed field exists. This can save space but can cause queries to miss documents that don't have the field. For example, if you have a sparse index on email, a query for {email: null} will return no results.
I once debugged a production issue where a sparse index on an optional field caused all queries filtering on that field to miss documents that simply didn't have the field. The team assumed the index covered everything, but it only covered documents with the field set. The fix was to remove the sparse option and let the index handle nulls explicitly.
Another hidden gotcha: indexes on large fields (>1024 bytes) are silently ignored — MongoDB stores only the first 1024 bytes, making them useless for equality matching on long strings or binary data.
Also, collation can break index usage. If your index is created with a specific collation (e.g., case-insensitive), but the query doesn't specify the same collation, MongoDB may not use the index. Always match collation settings.
- If all the fields your query needs are in the index, MongoDB never touches the collection.
- Check the explain output for a 'FETCH' stage. If it's missing, the query is covered.
- Covered queries are the fastest possible read path in MongoDB.
- Trade-off: adding more fields to an index makes it larger and slower to insert.
Index Maintenance and Monitoring in Production
Indexes aren't a 'set and forget' concern. Over time, query patterns change, data grows, and old indexes become dead weight. MongoDB provides $indexStats to see index usage: how many times each index was used, and how long it took to build.
Every quarter, I run db.collection.aggregate([{$indexStats:{}}]) and look for indexes with zero access in the last 30 days. Those are candidates for removal. Dropping unused indexes is the easiest performance win — it reduces write overhead and frees up RAM.
Another common maintenance task is rebuilding indexes after major data changes. In MongoDB, you can rebuild an index with db.collection.reIndex() but be aware: this blocks all reads and writes on the collection. For production, use createIndex() with a background build option (though in recent versions this is non-blocking by default for new builds).
Also monitor index size vs available RAM. If the hot index keys don't fit in memory, MongoDB will page from disk, and your query latency will spike. Use and compare db.collection.stats()indexSizes with the WiredTiger cache size.
For large indexes, consider using createIndexes() with multiple indexes in one command to reduce round trips. Also, index builds can be resource-intensive — schedule them during low-traffic windows.
Additionally, use the MongoDB profiler to capture slow queries. db.setProfilingLevel(1, { slowms: 100 }) logs all queries over 100ms to the system.profile collection. That's your early warning system for missing or misused indexes.
createIndex() with {background: true} (default in 4.2+ for builds but still has some impact). For large collections, consider rolling index builds with replica sets. Also, index builds can cause replication lag, so monitor secondary replicas.Indexing Strategies for High-Throughput Systems
In high-throughput environments (e.g., real-time analytics, event logging, IoT), every millisecond counts. Here are proven strategies:
Partial Indexes: Use createIndex({ status: 1 }, { partialFilterExpression: { active: true } }) to index only documents matching a condition. This reduces index size and write overhead when only a subset of documents are queried.
TTL Indexes: For time-series data, use a TTL index on a timestamp field to automatically expire old documents. This keeps the index lean and avoids manual cleanup.
Rarely Used Indexes on Write-Heavy Collections: If you have a collection that receives 100k writes/second but only occasional reads, consider using an index only for the read-heavy secondary replica, not on the primary.
Sorted vs Unsorted Index Maintenance: If your insert patterns create keys in monotonic order (e.g., timestamps), the B-tree can become right-heavy, causing page splits and fragmentation. Use {key: 1} on a hashed field or shard key to distribute writes evenly.
I once worked on a system where a monotonically increasing _id field caused index fragmentation on a high-write collection. The fix was to use a hashed shard key and rebuild the index periodically.
Another tactic: compound indexes with a hashed prefix can spread writes across the B-tree while still supporting equality queries on the hash. For example, { hashedUserId: 1, timestamp: 1 }. The hash ensures each user's writes go to different parts of the tree.
{ readPreference: 'secondary' }The Missing Index That Killed Search Latency
db.orders.find({ status: { $in: ['pending', 'shipped', 'cancelled'] } }). While a single-field index on 'status' existed, the query still scanned millions of documents because the index only stored the field value — filtering by multiple values required scanning a large portion of the index tree. MongoDB decided to COLLSCAN instead because the index was not selective enough.- Always analyze query patterns before creating indexes — single-field indexes may not be sufficient for multi-value queries.
- Use
explain()with executionStats to see 'totalDocsExamined' vs 'nReturned'. If they differ significantly, the index is not covering the query well. - Monitor slow queries with the profiler; don't assume indexes will automatically be used.
- Always test with production-like data volume before release.
db.collection.getIndexes(). Check if the query uses operators like $regex or $where that prevent index usage. Also check if the query includes a sort field that is not part of the index.db.collection.aggregate([ { $indexStats: {} } ]) to see index usage. Remove unused indexes. Consider whether some indexes can be merged into compound indexes.db.collection.find({field: {$exists: false}}).count() to find documents missing the field. Sparse indexes exclude missing fields — design accordingly.Key takeaways
Common mistakes to avoid
10 patternsMemorising syntax before understanding the concept
explain() on real queries. Understand the B-tree structure and impact on execution.Skipping practice and only reading theory
explain() results.Creating indexes on every field
Using wildcard indexes as a default
Not monitoring index usage
Assuming a compound index covers all queries regardless of field order
Ignoring the 1024-byte key limit
Overusing sparse indexes without understanding null semantics
Not covering sort direction in compound index
Using $in with index but still seeing high keysExamined
Interview Questions on This Topic
How does a B-tree index work in MongoDB? Explain with respect to read and write performance.
Frequently Asked Questions
That's NoSQL. Mark it forged?
11 min read · try the examples if you haven't