MongoDB $in Query COLLSCAN — Non-Selective Index Gotcha
A $in query on status caused 30-second response times due to COLLSCAN—single-field index was non-selective.
20+ years shipping high-throughput database systems. Written from production experience, not tutorials.
- MongoDB indexes are B-tree data structures that map indexed field values to document locations
- A single-field index sorts values and enables fast lookup, range queries, and sort operations
- Compound indexes support queries on multiple fields; the order of fields must follow the ESR rule (Equality, Sort, Range)
- explain() reveals whether a query uses an index (IXSCAN) or scans the entire collection (COLLSCAN)
- Performance insight: index lookup is O(log n) vs COLLSCAN's O(n) — on 10M documents, that's ~25 reads vs 10M reads
- Production insight: wrong index order can make queries slower than no index because MongoDB may still read many documents to filter unindexed fields
- Covered queries skip document fetch — fast reads, great for high-frequency paths
Imagine a 1,000-page cookbook with no table of contents. Every time you want a pasta recipe, you flip every single page until you find one. An index is that table of contents — it tells MongoDB exactly which page (document) to turn to without reading every other page first. The difference between a query taking 50ms and 50 seconds is almost always whether the right index exists.
MongoDB can hold hundreds of millions of documents and still answer a query in single-digit milliseconds — but only if you've built the right indexes. Without them, every query triggers a COLLSCAN (collection scan), meaning MongoDB reads every document in the collection sequentially. At scale, that's the difference between a snappy user experience and a timed-out API call that wakes your on-call engineer at 3 AM.
The problem indexes solve is deceptively simple: avoid reading data you don't need. But the implementation details are where engineers get into trouble. The wrong index order in a compound index, a missing index on a sort field, or a wildcard index used carelessly can make performance worse than no index at all — because now MongoDB has to maintain extra data structures on every write.
By the end of this article you'll understand exactly how MongoDB's B-tree indexes are structured on disk, how to construct compound indexes using the ESR rule, how to read an explain() plan like a senior engineer, and the production gotchas that trip up teams who learned indexing from a five-minute tutorial. The silent killer: a missing index on a sort field can turn a 10ms query into a 10-second blocking sort. Don't let that be you.
How MongoDB Indexing Actually Works — And Why It Fails
MongoDB indexing is a B-tree data structure that maps indexed field values to disk locations, enabling the query engine to locate documents without scanning every document in the collection. Without an index, every query triggers a collection scan (COLLSCAN) — an O(n) operation that reads every document. With a proper index, lookups become O(log n) via tree traversal.
An index stores a sorted reference of field values plus a pointer to the document. When a query uses the indexed field, the query planner selects the most efficient index based on cardinality, selectivity, and sort order. A compound index can cover multiple fields, but only the leftmost prefix is used for equality matches. Range queries, sorting, and $in clauses each interact with the index differently — $in, for example, can degrade to multiple index scans if the list is large or the index is non-selective.
Use indexes when a field appears frequently in query filters, sort operations, or join conditions. In production, a single missing index can cause latency spikes from 2ms to 2s as collections grow beyond memory. Indexes also consume RAM and slow writes, so you must balance read performance against write overhead. The rule: index for your most critical read paths, not every field.
How B-Tree Indexes Work Internally
MongoDB indexes use a B-tree data structure — the same family as MySQL and PostgreSQL indexes. The B-tree organises keys in sorted order and uses internal nodes to speed up search, insert, and delete. A typical index entry stores the indexed field value(s) and a pointer to the document's location (BSON object ID or disk location).
The branching factor (number of children per node) is typically around 1000 for modern MongoDB versions. This keeps tree depth low: a 10-million document index is only 3-4 levels deep. Searching a B-tree is O(log n) — for 10M documents, that's about 25 node accesses compared to a full collection scan of 10M documents.
MongoDB stores indexes in wiredTiger's B-tree engine by default. Each index is a separate B-tree that must be maintained on every write. That's why over-indexing hurts write performance.
Here's the thing many engineers miss: the B-tree is also used for sorting. If your sort field matches the index order, MongoDB can traverse the index leaf nodes in order and avoid an in-memory sort. That's where the ESR rule becomes critical.
Additionally, B-tree indexes are stored as multikey structures when indexing arrays. Each array element becomes a separate index entry, which can cause index explosion. If you have an array with 1000 elements, each document will create 1000 index entries for that one index.
Another subtlety: B-tree page splits happen when a node overflows. On high-write systems with monotonic keys, constant right-side splits can cause fragmentation and wasted space. That's one reason hashed indexes help on shard keys.
- The root node is the directory: it tells you which drawer (child node) to open.
- Each internal node narrows the search range by comparing the query value with the keys stored in the node.
- Leaf nodes contain the actual index entries (key + doc pointer) and are linked together for efficient range scans.
- The tree is self-balancing — every leaf is at the same depth, ensuring consistent performance.
Compound Indexes and the ESR Rule
A compound index is an index on multiple fields. The order of fields in a compound index is critical because MongoDB can only use the index to support queries that filter prefixes of the indexed fields. For example, an index on {a:1, b:1} can support queries on {a} and {a,b} but not on {b} alone.
The ESR rule — Equality, Sort, Range — is the mental model senior engineers use to design compound indexes: - Fields that are compared by equality (=) must come first. - Next comes the field used for sorting (sort). - Last come fields used for range queries ($gt, $lt, $in, etc.).
ESR ensures that MongoDB can use the index to satisfy both the filter and the sort in one pass, avoiding an in-memory sort. Violating ESR usually forces a blocking sort stage, which can consume huge amounts of memory and slow queries.
In practice, I've seen production outages caused by a compound index with the sort field at the beginning. The query would use the index for the equality match, but then MongoDB had to sort 100k documents in memory, hitting the 32MB limit and throwing an error. The fix? Move the range field last and add the sort field as the second key.
But ESR isn't always perfect. When the sort field differs from the range field, you have a dilemma: you can satisfy either the sort with an index scan or the range with an index scan, but not both in one pass. MongoDB will choose an index based on selectivity. In such cases, consider two indexes: one for the filter and one for the sort, or use a covered query if possible.
A common mistake: the sort direction must match the index direction. If the index is {createdAt: 1} but the query sorts by {createdAt: -1}, MongoDB can still use the index but will traverse it backwards. That's fine for single-field indexes, but for compound indexes the direction of each key matters. MongoDB can reverse the entire index order, but not individual fields. So {a: 1, b: -1} can support sort {a: -1, b: 1} but not {a: 1, b: 1}.
Reading Explain Plans Like a Senior Engineer
When a query is slow, the command is your first diagnostic tool. Run explain()db.collection.find(...).explain('executionStats'). The output has three key sections: - queryPlanner: the plan MongoDB chose, including the winning index. - executionStats: actual execution metrics — nReturned, totalDocsExamined, totalKeysExamined, executionTimeMillis. - rejectedPlans: why other indexes were not chosen.
The key ratio is totalKeysExamined / nReturned. If it's close to 1, the index is highly selective. Ratios above 10 indicate the query is scanning many index entries for each returned document — consider adding more fields to the index.
Also check for stage values: IXSCAN (index scan), COLLSCAN (collection scan), SORT (in-memory sort), FETCH (retrieving documents after index lookup). A SORT stage combined with IXSCAN means the index is not covering the sort.
Senior engineers don't just look at executionTimeMillis. They look at the ratio of keys examined to documents returned, and they look at the number of documents fetched. If totalDocsExamined equals nReturned but totalKeysExamined is 10x higher, you have a selectivity problem, not a missing index.
Another hidden metric: totalDocsExamined. If that number is high, it means MongoDB is fetching many documents from disk even after the index. That happens when the index doesn't cover the projected fields. Consider a covered query.
And remember: the rejectedPlans array is gold. It shows why the optimizer rejected other index candidates. If you see a plan that looks better but wasn't chosen, you may need to hint or rewrite the query.
db.collection.explain('executionStats').find(...) to get execution stats without running the query twice. But be careful — in production, running executionStats on a heavy query may consume resources. For read-heavy workloads, consider using db.collection.explain('queryPlanner') first.Common Index Types and When to Use Them
MongoDB offers several index types beyond the basic single-field and compound indexes: - Multikey Index: automatically created when indexing an array field. Each array element gets an index entry. Useful for querying tags, categories. - Text Index: for full-text search on string fields. Supports stemming and stop words. Only one text index per collection. - Hashed Index: uses a hash of the field value to provide uniform distribution. Good for equality queries only, not range. - Wildcard Index: indexes all fields in a document (or a subset) without specifying field names. Useful for schemaless data but large and write-heavy. - 2dsphere Index: for geospatial queries on GeoJSON objects. Supports $near, $geoWithin, $geoIntersects.
Each type has trade-offs. For example, wildcard indexes consume significant memory and CPU on writes. Multikey indexes can cause index explosion if array elements are many.
I've seen teams create a wildcard index on a collection they used for logging — and then wonder why write throughput dropped by 80%. Wildcard indexes are not a shortcut; they're a last resort when query patterns are truly unpredictable. For most production workloads, explicit indexes on known query fields are always better.
Text indexes are particularly tricky: they support stemming and stop words in many languages, but they're not as fast as Elasticsearch. If you need full-text search at scale, consider using a dedicated search engine instead.
Another subtlety: TTL indexes automatically expire documents after a specified time. They're built on a single date field and run a background thread every 60 seconds. Great for session data, logs, or temporary records. But don't rely on them for precise expiry — the deletion can be delayed up to 60 seconds.
Production Gotchas: Over-Indexing, Covered Queries, and Sparse Indexes
Even well-designed indexes can cause problems at scale. Three common gotchas:
Over-indexing: More indexes means slower writes because each index B-tree must be updated. MongoDB uses an internal write lock per collection, but index maintenance adds CPU and memory pressure. A typical rule of thumb: no more than 5 indexes per collection for high-write workloads.
Covered queries: A query is 'covered' when all the fields it needs are in the index and MongoDB doesn't need to fetch documents from the collection. Covered queries are blazing fast because they only touch the index B-tree. To achieve coverage, include projected fields in the index (but be aware of the 1024-byte limit on index keys — MongoDB indexes can only store values up to 1024 bytes; larger values are skipped).
Sparse indexes: A sparse index only includes documents where the indexed field exists. This can save space but can cause queries to miss documents that don't have the field. For example, if you have a sparse index on email, a query for {email: null} will return no results.
I once debugged a production issue where a sparse index on an optional field caused all queries filtering on that field to miss documents that simply didn't have the field. The team assumed the index covered everything, but it only covered documents with the field set. The fix was to remove the sparse option and let the index handle nulls explicitly.
Another hidden gotcha: indexes on large fields (>1024 bytes) are silently ignored — MongoDB stores only the first 1024 bytes, making them useless for equality matching on long strings or binary data.
Also, collation can break index usage. If your index is created with a specific collation (e.g., case-insensitive), but the query doesn't specify the same collation, MongoDB may not use the index. Always match collation settings.
- If all the fields your query needs are in the index, MongoDB never touches the collection.
- Check the explain output for a 'FETCH' stage. If it's missing, the query is covered.
- Covered queries are the fastest possible read path in MongoDB.
- Trade-off: adding more fields to an index makes it larger and slower to insert.
Index Maintenance and Monitoring in Production
Indexes aren't a 'set and forget' concern. Over time, query patterns change, data grows, and old indexes become dead weight. MongoDB provides $indexStats to see index usage: how many times each index was used, and how long it took to build.
Every quarter, I run db.collection.aggregate([{$indexStats:{}}]) and look for indexes with zero access in the last 30 days. Those are candidates for removal. Dropping unused indexes is the easiest performance win — it reduces write overhead and frees up RAM.
Another common maintenance task is rebuilding indexes after major data changes. In MongoDB, you can rebuild an index with db.collection.reIndex() but be aware: this blocks all reads and writes on the collection. For production, use createIndex() with a background build option (though in recent versions this is non-blocking by default for new builds).
Also monitor index size vs available RAM. If the hot index keys don't fit in memory, MongoDB will page from disk, and your query latency will spike. Use and compare db.collection.stats()indexSizes with the WiredTiger cache size.
For large indexes, consider using createIndexes() with multiple indexes in one command to reduce round trips. Also, index builds can be resource-intensive — schedule them during low-traffic windows.
Additionally, use the MongoDB profiler to capture slow queries. db.setProfilingLevel(1, { slowms: 100 }) logs all queries over 100ms to the system.profile collection. That's your early warning system for missing or misused indexes.
createIndex() with {background: true} (default in 4.2+ for builds but still has some impact). For large collections, consider rolling index builds with replica sets. Also, index builds can cause replication lag, so monitor secondary replicas.Indexing Strategies for High-Throughput Systems
In high-throughput environments (e.g., real-time analytics, event logging, IoT), every millisecond counts. Here are proven strategies:
Partial Indexes: Use createIndex({ status: 1 }, { partialFilterExpression: { active: true } }) to index only documents matching a condition. This reduces index size and write overhead when only a subset of documents are queried.
TTL Indexes: For time-series data, use a TTL index on a timestamp field to automatically expire old documents. This keeps the index lean and avoids manual cleanup.
Rarely Used Indexes on Write-Heavy Collections: If you have a collection that receives 100k writes/second but only occasional reads, consider using an index only for the read-heavy secondary replica, not on the primary.
Sorted vs Unsorted Index Maintenance: If your insert patterns create keys in monotonic order (e.g., timestamps), the B-tree can become right-heavy, causing page splits and fragmentation. Use {key: 1} on a hashed field or shard key to distribute writes evenly.
I once worked on a system where a monotonically increasing _id field caused index fragmentation on a high-write collection. The fix was to use a hashed shard key and rebuild the index periodically.
Another tactic: compound indexes with a hashed prefix can spread writes across the B-tree while still supporting equality queries on the hash. For example, { hashedUserId: 1, timestamp: 1 }. The hash ensures each user's writes go to different parts of the tree.
{ readPreference: 'secondary' }Dropping Indexes Without Causing a Meltdown
Deleting an index sounds trivial. In production, it’s a live grenade. If that index was supporting a critical query path, your database instantly falls back to a full collection scan. Latency spikes. Connections queue. Your pager goes off.
Why drop an index at all? Bloat from unused indexes wastes RAM and slows writes. You should audit every index quarterly. A query that hasn’t run in six months doesn’t need its crutch.
Use db.collection.dropIndex("index_name") — but always drop in a maintenance window. First confirm no query touches it by checking the output of $indexStats. Then drop, wait thirty seconds, and verify explain("executionStats") on the target queries still shows IXSCAN. If you see COLLSCAN for even one query, recreate the index and rethink your strategy.
Never drop a compound index under load. The index rebuild blocks writes on that collection. That’s how you turn a five-second query into a five-minute outage.
explain() to confirm safety.View All Indexes Without Guessing
You inherited a Mongo cluster from someone who left no documentation. Or you’ve been adding indexes ad hoc for six months. Either way, you need a single source of truth. getIndexes() is that source.
Run db.collection.getIndexes() and it spits back every index on that collection — name, key pattern, and any options like sparse, unique, or TTL. The output is a JSON array. Pipe it to before you try to read it raw.pretty()
But a collection-by-collection approach is slow in a schema with dozens of collections. Use a short JavaScript loop in the shell to dump all indexes across the database at once. Store the output somewhere — a file, a wiki, a Slack thread. That snapshot is your baseline.
The real insight comes from cross-referencing this list with your application’s query patterns. An index on user_email_1 that is unique? That’s a constraint, not a performance tool. A compound index on status_1_created_at_-1? That’s your report query. If you see a single-field index that is never used in a compound index, you may have a fragmentation problem — drop and consolidate.
Introduction
Indexes are the single most impactful performance lever in MongoDB, yet they are frequently misunderstood or misapplied. This guide cuts through the noise to give you a senior engineer's understanding of index internals, practical selection strategies, and production pitfalls. We start with why indexes exist: without them, MongoDB performs a collection scan on every query, reading every document to find matches. That works for tiny datasets but collapses under real-world loads. Indexes trade write overhead for read speed—a tradeoff you must evaluate for every collection. You will learn how B-Tree structures enable logarithmic lookups, how compound indexes obey the ESR rule (Equality, Sort, Range), and how to read explain plans to diagnose performance. We also cover over-indexing, covered queries, maintenance, and high-throughput strategies. By the end, you will not only know which index to create but also when not to create one, avoiding the silent production meltdowns that plague teams who treat indexing as a checklist item instead of a design decision.
{ background: true } or build during maintenance windows to avoid downtime.Step 1 — Preparing the Sample Database
Before you can test index performance, you need a realistic dataset. We'll create a database called shop with a collection orders that contains at least 100,000 documents. This volume ensures B-Tree depth becomes visible in explain plans. Start by switching to the shop database: use shop;. Insert a batch of documents with fields: orderId, customerId, total, status, and createdAt. Use a script or the insertMany command with a loop. For example, generate 100k documents where orderId increments, customerId cycles through 1000 distinct values, total is random between 10.00 and 1000.00, status is one of 'pending', 'shipped', 'delivered', and createdAt is a random date in the last year. This distribution mirrors a real e-commerce workload. After insertion, confirm the count with db.orders.countDocuments(). Then create a simple index on status to observe the difference between a collection scan and an index scan in the next steps. Ensure your MongoDB instance is running locally or on a test server—never experiment on production data.
initializeUnorderedBulkOp() for large inserts—it’s faster and handles partial failures gracefully. Always validate the document count after insertion.The Missing Index That Killed Search Latency
db.orders.find({ status: { $in: ['pending', 'shipped', 'cancelled'] } }). While a single-field index on 'status' existed, the query still scanned millions of documents because the index only stored the field value — filtering by multiple values required scanning a large portion of the index tree. MongoDB decided to COLLSCAN instead because the index was not selective enough.- Always analyze query patterns before creating indexes — single-field indexes may not be sufficient for multi-value queries.
- Use
explain()with executionStats to see 'totalDocsExamined' vs 'nReturned'. If they differ significantly, the index is not covering the query well. - Monitor slow queries with the profiler; don't assume indexes will automatically be used.
- Always test with production-like data volume before release.
db.collection.getIndexes(). Check if the query uses operators like $regex or $where that prevent index usage. Also check if the query includes a sort field that is not part of the index.db.collection.aggregate([ { $indexStats: {} } ]) to see index usage. Remove unused indexes. Consider whether some indexes can be merged into compound indexes.db.collection.find({field: {$exists: false}}).count() to find documents missing the field. Sparse indexes exclude missing fields — design accordingly.db.collection.find({status: 'active'}).explain('executionStats')Check 'stage' field: 'IXSCAN' means index used, 'COLLSCAN' means full collection scanKey takeaways
Common mistakes to avoid
10 patternsMemorising syntax before understanding the concept
explain() on real queries. Understand the B-tree structure and impact on execution.Skipping practice and only reading theory
explain() results.Creating indexes on every field
Using wildcard indexes as a default
Not monitoring index usage
Assuming a compound index covers all queries regardless of field order
Ignoring the 1024-byte key limit
Overusing sparse indexes without understanding null semantics
Not covering sort direction in compound index
Using $in with index but still seeing high keysExamined
Interview Questions on This Topic
How does a B-tree index work in MongoDB? Explain with respect to read and write performance.
Frequently Asked Questions
20+ years shipping high-throughput database systems. Written from production experience, not tutorials.
That's NoSQL. Mark it forged?
14 min read · try the examples if you haven't