Mid-level 12 min · March 05, 2026

Silent Write Failures in MongoDB — 16MB Document Wall

A viral post with 32K comments hit MongoDB's 16MB document limit; the resulting BSONObjectTooLarge errors were silently swallowed, and comments were lost.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • MongoDB stores data as flexible BSON documents inside collections — no fixed schema, no mandatory columns
  • Each document is a self-contained JSON-like object that can nest arrays and sub-objects natively
  • CRUD uses JSON filter objects: find(), insertOne(), updateOne() with $set, deleteOne()
  • Indexes on filter/sort fields are mandatory in production — a missing index causes COLLSCAN at scale
  • Aggregation pipeline ($match → $group → $sort) replaces SQL GROUP BY — filter early or pay in latency
  • Biggest production trap: a replacement-style write (replaceOne, or the legacy update() with a bare object) overwrites the entire document; use updateOne with $set to change individual fields
Plain-English First

Imagine your school keeps student records not in a giant shared spreadsheet — where every row must have the same columns — but in a filing cabinet full of individual folders. Each folder can hold whatever papers that student needs. Some folders have report cards, others have medical notes, some have both, and a few have an extra section for extracurricular achievements that most other folders don't even have a slot for. MongoDB is that filing cabinet. Each folder is a document, and the cabinet itself is a collection. No two folders have to look the same, and you can find any folder instantly by its label. The trade-off: if you later need to add a field to every folder, you have to walk through the entire cabinet and update them one by one — there's no 'add a column' equivalent that updates everything at once. That's not a flaw; it's the deal you make for the flexibility.

Every app you use daily — from your food delivery tracker to your social media feed — stores data somewhere. Relational databases like PostgreSQL are brilliant when your data is predictable and heavily interconnected. But the moment your data gets irregular, deeply nested, or needs to scale horizontally across dozens of servers, SQL starts fighting you. That's the real world MongoDB was built for.

MongoDB solves a specific, painful problem: storing data that doesn't fit neatly into rows and columns. A product in an e-commerce store might have two attributes or twenty. A user profile on one platform needs a bio field; on another it needs a portfolio array. Forcing that variety into a rigid table schema means either wasting columns, creating awkward join tables, or writing painful migration scripts every time requirements change. MongoDB lets the data own its shape.

But MongoDB is not just a relaxed version of PostgreSQL. It makes different trade-offs: embedding related data inside documents eliminates JOINs at read time but complicates writes when that data needs updating everywhere. Flexible schema means your application owns validation rather than the database. Horizontal sharding is built in, but multi-document transactions carry more overhead than they do in Postgres.

By the end of this article you'll understand not just how to run MongoDB CRUD commands, but why the document model exists, when to choose it over SQL, how to design collections that won't haunt you at 5 million documents, and the query patterns that show up in production systems every day. You'll also walk away knowing exactly what to say when an interviewer asks you to compare MongoDB to a relational database.

The Document Model — Why JSON-Like Storage Changes Everything

In a relational database, a user lives across multiple tables. Basic info in users, their addresses in user_addresses, their preferences in user_settings. To reconstruct one complete user, you JOIN three tables. That JOIN is fast when your dataset fits on one server and the query planner has good statistics. When the data is spread across ten servers, that JOIN is suddenly a network call — and network calls are slow and unpredictable in ways that local disk reads are not.

MongoDB stores that entire user as a single document. One read, no joins. The document is stored in BSON (Binary JSON) format internally, which means it supports richer types than plain JSON: native Date objects, 64-bit integers, binary data, and ObjectId values that embed their creation timestamp alongside a per-process random value and a counter, with no string conversion hacks.
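To make the ObjectId point concrete, here is a minimal sketch in plain JavaScript (no driver required). It relies on one documented property of ObjectId: the first 4 bytes are a big-endian count of seconds since the Unix epoch. The helper name `objectIdToDate` is hypothetical; official drivers and mongosh expose the same value through `ObjectId.getTimestamp()`.

```javascript
// Hypothetical helper: decode the creation time embedded in an ObjectId.
// Assumes a 24-character hex string; the first 8 hex characters (4 bytes)
// are seconds since the Unix epoch.
function objectIdToDate(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new TypeError('expected a 24-character hex ObjectId string');
  }
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

// The ObjectId used in this article's examples.
const createdAt = objectIdToDate('664a1f3b2c1d4e5f6a7b8c9d');
console.log(createdAt.toISOString());
```

Because the timestamp leads the value, sorting by `_id` roughly follows insertion order, which is why a separate created-at index is sometimes unnecessary for "newest first" listings.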

Every document lives inside a collection. A collection is roughly equivalent to a SQL table, but it enforces no schema by default. Two documents in the same collection can have completely different fields. This is not chaos — it's intentional flexibility. You're trading schema enforcement at the database level for schema ownership at the application level.

This trade matters in practice because in fast-moving products, your schema changes weekly. With MongoDB, you add a new field to new documents without touching old ones, and your application handles the absence gracefully. No ALTER TABLE, no downtime, no migration script that locks a 50-million-row table for three hours during a deploy.
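Handling the absence gracefully usually means reading optional fields with explicit defaults. A minimal sketch, assuming two product shapes like the ones inserted in this article's first listing; `summarizeProduct` is a hypothetical helper, not part of any MongoDB API.

```javascript
// Hypothetical helper: read a product document whose shape may vary.
// Older documents may lack `variants` or use `price_usd` instead of
// `base_price_usd`, so every optional field gets an explicit fallback.
function summarizeProduct(doc) {
  const variants = doc.variants ?? [];                       // missing array -> empty
  const price = doc.base_price_usd ?? doc.price_usd ?? null; // either price field
  const totalStock = variants.length > 0
    ? variants.reduce((sum, v) => sum + (v.stock ?? 0), 0)
    : (doc.stock_count ?? 0);
  return { name: doc.name, price, variantCount: variants.length, totalStock };
}

const mug = { name: 'Ceramic Coffee Mug', price_usd: 12.99, stock_count: 150 };
const shirt = {
  name: 'Custom Logo T-Shirt',
  base_price_usd: 24.99,
  variants: [
    { size: 'S', stock: 80 },
    { size: 'M', stock: 120 },
    { size: 'L', stock: 60 }
  ]
};

console.log(summarizeProduct(mug));
console.log(summarizeProduct(shirt));
```

Centralising the fallbacks in one function keeps the "field may be missing" knowledge out of the rest of the codebase.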

The flip side is real and worth stating directly: your application must own validation. MongoDB will not tell you that you stored a string where you expected a number. Libraries like Mongoose, Zod, or MongoDB's own JSON Schema validators fill this gap. Treating MongoDB as schema-free rather than schema-flexible is how teams end up with inconsistent data that's painful to query and report on.
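As one illustration of database-side validation, the sketch below builds a `$jsonSchema` validator for product documents. The specific rules (which fields are required, a non-negative price) are assumptions made for this example; the commented-out `createCollection` call shows where mongosh would receive the object.

```javascript
// Illustrative validator; the chosen rules are assumptions, not a schema
// taken from a real system. $jsonSchema, bsonType, required, properties,
// minimum and items are standard MongoDB schema-validation keywords.
const productValidator = {
  $jsonSchema: {
    bsonType: 'object',
    required: ['name', 'sku', 'category'],
    properties: {
      name:      { bsonType: 'string' },
      sku:       { bsonType: 'string' },
      category:  { bsonType: 'string' },
      price_usd: { bsonType: ['double', 'int'], minimum: 0 },
      tags:      { bsonType: 'array', items: { bsonType: 'string' } }
    }
  }
};

// In mongosh, the object would be passed when creating the collection:
//   db.createCollection('products', { validator: productValidator });
// With the default validationAction ('error'), an insert such as
//   { name: 'Mug', sku: 'MUG-002', category: 'kitchenware', price_usd: '12.99' }
// is rejected because price_usd is a string.
console.log(Object.keys(productValidator.$jsonSchema.properties));
```

Fields not listed under `properties` are still allowed, so the collection stays schema-flexible while the fields you depend on are type-checked.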

document_model_intro.js · JavaScript
// Written for mongosh: use() and the global db object are shell helpers.
// With the Node.js driver (the mongodb package), connect through MongoClient and await each call instead.

// --- Step 1: Switch to (or create) our working database ---
use('ecommerce_store');

// --- Step 2: Insert two product documents with intentionally DIFFERENT shapes ---
// Notice: simple_mug has no variants; custom_tshirt has a nested variants array.
// In SQL this would require a separate 'product_variants' table and a JOIN.
// In MongoDB it's just an array field inside the document — one read gets everything.

db.products.insertMany([
  {
    // A simple product — flat structure, no variants needed
    name: 'Ceramic Coffee Mug',
    sku: 'MUG-001',
    price_usd: 12.99,
    stock_count: 150,
    category: 'kitchenware',
    tags: ['ceramic', 'handmade', 'dishwasher-safe'],
    created_at: new Date('2024-01-15')
    // No 'variants' field — and that's fine. MongoDB won't error on a missing field.
  },
  {
    // A complex product with nested variants (size + colour combinations)
    // This structure would require 3 tables in a relational schema.
    // Here it lives in one document — one read, no JOINs.
    name: 'Custom Logo T-Shirt',
    sku: 'TSH-042',
    base_price_usd: 24.99,
    category: 'apparel',
    tags: ['cotton', 'customizable', 'unisex'],
    variants: [
      { size: 'S',  color: 'black', stock: 80  },
      { size: 'M',  color: 'black', stock: 120 },
      { size: 'L',  color: 'white', stock: 60  }
    ],
    customization_options: {
      max_logo_size_cm: 10,
      allowed_positions: ['chest', 'back', 'sleeve']
    },
    created_at: new Date('2024-03-22')
  }
]);

// --- Step 3: Query all apparel products ---
// The filter object mirrors the document shape — just use the field name
const apparelProducts = db.products.find(
  { category: 'apparel' },                       // filter
  { name: 1, base_price_usd: 1, _id: 0 }        // projection: only return these fields
).toArray();

console.log('Apparel products found:', JSON.stringify(apparelProducts, null, 2));

// --- Step 4: Query inside a nested array using dot notation ---
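// Find products that have a size 'M' variant (this filter does not check stock;
// use $elemMatch on variants to require size and stock on the same element)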
const hasMedium = db.products.find(
  { 'variants.size': 'M' },   // dot-notation queries nested fields and array elements
  { name: 1, _id: 0 }
).toArray();

console.log('Products with size M variant:', JSON.stringify(hasMedium, null, 2));
Output
Apparel products found: [
{
"name": "Custom Logo T-Shirt",
"base_price_usd": 24.99
}
]
Products with size M variant: [
{
"name": "Custom Logo T-Shirt"
}
]
The Document Model Mental Model
  • In SQL, you normalize at write time and pay JOIN cost at read time — the read path requires assembling multiple tables
  • In MongoDB, you denormalize at write time and pay update complexity at write time — the read path is a single document fetch
  • The right choice depends on your read/write ratio — read-heavy workloads favour denormalization; write-heavy or update-heavy workloads often favour referencing
  • A document is an I/O boundary: everything inside it is one read operation, everything outside it is an additional round-trip
  • Schema flexibility means your application owns validation — the database will not catch type mismatches, and neither will your logs until a query breaks
Production Insight
In production, the document model's biggest advantage is read-path simplicity — a single findOne() replaces a 3-table JOIN and the associated query planner overhead.
The trade-off is write-path complexity: updating a field embedded inside 50,000 documents means 50,000 individual update operations or one updateMany(), which still modifies documents one at a time and is not atomic across the whole set.
Rule: embed when reads dominate and the nested data belongs exclusively to one parent; reference when writes are frequent, data is shared, or arrays can grow without bound.
Key Takeaway
Documents are pre-joined records — you pay the denormalization cost at write time to eliminate JOINs at read time.
The trade-off is real: updating embedded data across millions of documents is expensive and not atomic across documents by default.
Choose embed vs reference based on your access pattern, not your data model preferences.

CRUD in the Real World — Beyond the Basic Insert and Find

Most tutorials show you insertOne, findOne, updateOne and deleteOne in isolation with trivial examples. That's fine for learning syntax, but it hides the decisions you'll actually make in production. Let's walk through a realistic user-account lifecycle — creating a user, enriching their profile incrementally, querying by nested fields and array membership, and cleaning up test data — because that pattern mirrors what real application code does.

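The most critical update operator to understand deeply is $set. It does not replace a document: it modifies only the fields you name and leaves everything else untouched. Compare that to a replacement-style write, where the second argument is a bare object with no operators. replaceOne() and the legacy update() treat that object as the new document, so every field not in it is permanently gone with no error and no warning. Current mongosh and official drivers reject a bare object passed to updateOne() with an error about missing atomic operators, but older code paths and wrappers that still issue replacement writes remain a well-known source of silent data loss in MongoDB production systems.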
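The difference between a field-level update and a full replacement can be sketched in plain JavaScript. This is an in-memory analogy only, not how the server applies updates: `applySet` and `applyReplacement` are hypothetical helpers, and `applySet` handles top-level fields only (no dot notation).

```javascript
// In-memory analogy (not MongoDB internals): compare what survives a
// $set-style merge versus a full document replacement.
function applySet(doc, fields) {
  return { ...doc, ...fields };            // named fields change, the rest is kept
}

function applyReplacement(doc, replacement) {
  return { _id: doc._id, ...replacement }; // only _id carries over
}

const user = {
  _id: 1,
  email: 'priya.sharma@example.com',
  plan: 'free',
  permissions: ['read', 'comment']
};

const merged = applySet(user, { plan: 'pro' });
const replaced = applyReplacement(user, { plan: 'pro' });

console.log(Object.keys(merged));   // [ '_id', 'email', 'plan', 'permissions' ]
console.log(Object.keys(replaced)); // [ '_id', 'plan' ]
```

The replaced document keeps its `_id` and nothing else, which is exactly the outcome to avoid when the intent was to change a single field.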

For queries, the filter object mirrors the document shape. Want to query a nested field? Use dot-notation: { 'address.city': 'Mumbai' }. Want to check if an array contains a value? Pass the value directly — MongoDB automatically checks for membership: { permissions: 'write' }. Want all users who joined in the last 30 days? Use comparison operators: { joined_at: { $gte: thirtyDaysAgo } }. These patterns appear in virtually every MongoDB-backed application.

user_account_crud.js · JavaScript
use('saas_platform');

// ─────────────────────────────────────────
// CREATE — Register a new user
// insertOne returns an object with insertedId — always check it
// ─────────────────────────────────────────
const insertResult = db.users.insertOne({
  email: 'priya.sharma@example.com',
  display_name: 'Priya Sharma',
  hashed_password: '$2b$12$exampleHashedPasswordHere',
  plan: 'free',
  address: {
    city: 'Mumbai',
    country: 'IN'
  },
  permissions: ['read', 'comment'],
  joined_at: new Date(),
  last_login: null   // null is valid — she hasn't logged in yet
});

console.log('Inserted ID:', insertResult.insertedId);
// Inserted ID: ObjectId('664a1f3b2c1d4e5f6a7b8c9d')

// ─────────────────────────────────────────
// READ — Find users in Mumbai on the free plan
// Dot-notation queries nested sub-document fields directly
// Array field with a scalar value checks for array membership automatically
// ─────────────────────────────────────────
const mumbaiFreeUsers = db.users.find(
  {
    'address.city': 'Mumbai',  // dot-notation: queries the nested 'city' field
    plan: 'free'
  },
  { email: 1, display_name: 1, _id: 0 }  // projection: include only these fields
).toArray();

console.log('Mumbai free-plan users:', mumbaiFreeUsers);

// ─────────────────────────────────────────
// RANGE QUERY — Users who joined in the last 30 days
// ─────────────────────────────────────────
const thirtyDaysAgo = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
const recentUsers = db.users.find(
  { joined_at: { $gte: thirtyDaysAgo } },
  { email: 1, joined_at: 1, _id: 0 }
).sort({ joined_at: -1 }).toArray();

console.log('Recent signups:', recentUsers.length);

// ─────────────────────────────────────────
// UPDATE — Priya upgrades to 'pro' and gets write permission
//
// $set: modifies ONLY the named fields — everything else is untouched
// $push: appends ONE item to an array without overwriting the array
// $addToSet: like $push but ignores the item if it already exists
//
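// DANGER: a bare object such as { plan: 'pro' } is a REPLACEMENT document.
// replaceOne() and the legacy update() would overwrite the whole document with it
// (email gone, everything gone); modern updateOne() rejects it with an error instead.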
// ─────────────────────────────────────────
const updateResult = db.users.updateOne(
  { email: 'priya.sharma@example.com' },  // filter: which document to update
  {
    $set:  { plan: 'pro', last_login: new Date() },  // surgical field update
    $push: { permissions: 'write' }                   // append to array
  }
);

console.log('Matched:', updateResult.matchedCount, 'Modified:', updateResult.modifiedCount);
// Matched: 1  Modified: 1

// ─────────────────────────────────────────
// VERIFY — Read back the updated document to confirm
// ─────────────────────────────────────────
const updatedUser = db.users.findOne(
  { email: 'priya.sharma@example.com' },
  { email: 1, plan: 1, permissions: 1, last_login: 1, _id: 0 }
);

console.log('Updated user:', JSON.stringify(updatedUser, null, 2));

// ─────────────────────────────────────────
// UPSERT — Update if exists, insert if not
// { upsert: true } creates the document when the filter matches nothing
// Useful for 'create or update' patterns without a separate existence check
// ─────────────────────────────────────────
const upsertResult = db.users.updateOne(
  { email: 'new.user@example.com' },
  {
    $set: { display_name: 'New User', plan: 'free' },
    $setOnInsert: { joined_at: new Date(), permissions: ['read'] }  // only on new doc
  },
  { upsert: true }
);

console.log('Upserted:', upsertResult.upsertedCount === 1 ? 'inserted new doc' : 'updated existing');

// ─────────────────────────────────────────
// DELETE — Remove a test or spam account
// deleteOne removes the FIRST match only — it won't throw if nothing matches
// ─────────────────────────────────────────
const deleteResult = db.users.deleteOne({ email: 'spam-bot@junk.io' });
console.log('Deleted count:', deleteResult.deletedCount);
// Deleted count: 1 (or 0 if the email did not exist — no error thrown either way)
Output
Inserted ID: ObjectId('664a1f3b2c1d4e5f6a7b8c9d')
Mumbai free-plan users: [ { email: 'priya.sharma@example.com', display_name: 'Priya Sharma' } ]
Recent signups: 1
Matched: 1 Modified: 1
Updated user: {
"email": "priya.sharma@example.com",
"plan": "pro",
"permissions": ["read", "comment", "write"],
"last_login": "2024-04-22T10:31:00.000Z"
}
Upserted: inserted new doc
Deleted count: 1
Watch Out: A Bare Object Is a Replacement, Not a Merge
If you pass { plan: 'pro' } as the second argument without an operator, you are not describing a change to the plan field; you are supplying a whole new document. replaceOne() and the legacy update() will replace the entire document with { plan: 'pro' }: Priya's email, name, permissions, and join date are all permanently deleted, MongoDB throws no error, and the write result shows modifiedCount: 1. Current mongosh and official drivers refuse a bare object in updateOne() and raise an error instead, but that safety net disappears the moment the same object reaches a replacement call. Always use { $set: { field: value } } inside updateOne. Reserve bare objects for replaceOne() when you explicitly intend a full document replacement.
Production Insight
The $set vs bare-object mistake is one of the most damaging MongoDB data-loss bugs in production. A replacement call looks nearly identical to a correct update in code review unless you know exactly what to look for.
In Node.js applications, an update object assembled dynamically from request body data is particularly dangerous: if it reaches a replacement-style call (replaceOne, findOneAndReplace, or the legacy update()) without $set, any field present in the DB but absent from the request body is deleted. Passed to updateOne(), the same object fails with an error rather than merging.
Rule: treat the absence of $set in an updateOne call as a code review red flag. Every updateOne should have an operator as the outermost key.
Key Takeaway
$set modifies only the named fields; a bare replacement object overwrites the entire document. The two look nearly identical in code and are completely different operations.
This distinction is invisible in application logs: updateOne with $set and replaceOne both return modifiedCount: 1.
Always structure update arguments as { $set: { ... } } unless replaceOne is your explicit intent.
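One way to enforce this rule in application code is to validate update documents before they reach the driver. The sketch below is a hypothetical guard, not a MongoDB API: `buildSetUpdate` copies only whitelisted fields from a request body into `$set`, and `assertOperatorUpdate` rejects any update whose top-level keys are not all operators. The allowed field names are assumptions, and aggregation-pipeline updates (arrays) are out of scope.

```javascript
// Hypothetical guards for dynamically assembled updates.
function assertOperatorUpdate(update) {
  const keys = Object.keys(update ?? {});
  if (keys.length === 0 || !keys.every((k) => k.startsWith('$'))) {
    throw new Error('update must use operators such as $set at the top level');
  }
  return update;
}

// Copy only whitelisted fields from untrusted input into a $set document.
function buildSetUpdate(body, allowedFields) {
  const fields = {};
  for (const name of allowedFields) {
    if (Object.prototype.hasOwnProperty.call(body, name)) {
      fields[name] = body[name];
    }
  }
  if (Object.keys(fields).length === 0) {
    throw new Error('no updatable fields supplied');
  }
  return assertOperatorUpdate({ $set: fields });
}

// is_admin is not whitelisted, so it never reaches the database.
const requestBody = { display_name: 'Priya S.', plan: 'pro', is_admin: true };
const update = buildSetUpdate(requestBody, ['display_name', 'plan']);
console.log(JSON.stringify(update));
// {"$set":{"display_name":"Priya S.","plan":"pro"}}
```

The whitelist doubles as protection against mass assignment: fields the client should never control are dropped before the update is built.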

Indexes and Schema Design — The Two Decisions That Make or Break Performance

A MongoDB collection with no indexes is a filing cabinet where every search requires opening every folder one at a time. That's acceptable at 100 documents. At 5 million documents it produces queries that take 10-15 seconds and saturate disk I/O, which cascades into timeouts across your entire application. An index is a sorted shortcut: MongoDB builds and maintains a separate data structure mapping field values to document locations so it can jump directly to the relevant documents instead of scanning all of them.

The golden rule: create an index on every field you filter or sort by in production queries. MongoDB's explain('executionStats') method is your best diagnostic tool — it tells you whether a query used an index (IXSCAN) or scanned the entire collection (COLLSCAN), how many documents were examined versus returned, and how long execution took. The ratio of totalDocsExamined to nReturned tells you the efficiency of your query. A ratio of 1:1 is ideal. A ratio of 100,000:1 means you examined 100,000 documents to return 1 — you need an index.
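The examined-to-returned ratio is easy to compute from an explain result. A minimal sketch: `scanEfficiency` is a hypothetical helper, `totalDocsExamined` and `nReturned` are the executionStats fields named above, and the threshold of 10 is an arbitrary assumption rather than a MongoDB recommendation.

```javascript
// Hypothetical helper: summarise an explain('executionStats') result.
// The default threshold of 10 is an arbitrary assumption for illustration.
function scanEfficiency(executionStats, threshold = 10) {
  const examined = executionStats.totalDocsExamined;
  const returned = executionStats.nReturned;
  // Avoid dividing by zero when the query returns nothing.
  const ratio = returned === 0 ? examined : examined / returned;
  return { examined, returned, ratio, needsIndex: ratio > threshold };
}

console.log(scanEfficiency({ totalDocsExamined: 43, nReturned: 43 }));
console.log(scanEfficiency({ totalDocsExamined: 100000, nReturned: 1 }));
```

In practice the input would be `queryPlan.executionStats` from a real `explain('executionStats')` call; a check like this can run in a test suite against a realistically sized dataset.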

For compound indexes, field order matters in a specific way: put equality filter fields first, then sort fields, then range filter fields. This ordering maximises the portion of the query that can be resolved by the index. A compound index on { plan: 1, joined_at: -1 } serves a query filtering by plan and sorting by join date without loading any documents into memory for the sort.
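The equality, sort, range ordering can be expressed as a small builder. This is a sketch under stated assumptions: `buildEsrIndexKey` is a hypothetical helper, the `last_login` range filter is an invented query shape, and field names are assumed not to look like integers (JavaScript would reorder integer-like keys).

```javascript
// Hypothetical helper: assemble a compound index key in
// equality -> sort -> range order. Plain objects keep string keys in
// insertion order, which is what createIndex relies on.
function buildEsrIndexKey({ equality = [], sort = {}, range = [] }) {
  const key = {};
  for (const field of equality) key[field] = 1;
  for (const [field, direction] of Object.entries(sort)) key[field] = direction;
  for (const field of range) {
    if (!(field in key)) key[field] = 1; // skip fields already covered
  }
  return key;
}

// Assumed query shape: filter plan = 'pro', sort by joined_at descending,
// range filter on last_login.
const indexKey = buildEsrIndexKey({
  equality: ['plan'],
  sort: { joined_at: -1 },
  range: ['last_login']
});
console.log(JSON.stringify(indexKey));
// {"plan":1,"joined_at":-1,"last_login":1}
// In mongosh: db.users.createIndex(indexKey)
```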

Schema design in MongoDB comes down to one core question that has a real answer: do you embed or reference? The answer depends entirely on your access pattern. Embed when the nested data belongs exclusively to one parent, you always read them together, and the array is bounded in size. Reference when the data is shared across multiple parents, needs independent queries, or can grow without a predictable upper limit. Getting this wrong at design time — embedding an unbounded array — is how you hit the 16MB document limit at the worst possible moment.

indexes_and_schema_design.js · JavaScript
use('saas_platform');

// ─────────────────────────────────────────
// INDEXES — Match your indexes to your actual query patterns
// ─────────────────────────────────────────

// Single-field unique index: login lookups always filter by email
// unique: true enforces no duplicate emails at the database level
db.users.createIndex(
  { email: 1 },
  { unique: true, name: 'idx_users_email_unique' }
);

// Compound index for the admin dashboard: filters by plan, sorts by join date
// Field order: equality filter (plan) first, sort field (joined_at) second
// This allows the query to use the index for both filtering AND sorting
db.users.createIndex(
  { plan: 1, joined_at: -1 },
  { name: 'idx_users_plan_joined' }
);

// Sparse index: only indexes documents where last_login exists
// Useful when many documents don't have the field at all
db.users.createIndex(
  { last_login: -1 },
  { sparse: true, name: 'idx_users_last_login' }
);

// TTL index: automatically deletes documents after a time period
// Useful for session tokens, temporary verification codes, expiring cache docs
db.password_reset_tokens.createIndex(
  { created_at: 1 },
  { expireAfterSeconds: 3600, name: 'idx_tokens_ttl_1h' }  // auto-delete after 1 hour
);

// Text index: enables full-text search on multiple string fields
db.products.createIndex(
  { name: 'text', description: 'text' },
  { name: 'idx_products_text_search' }
);

// ─────────────────────────────────────────
// EXPLAIN — Verify every production query uses an index
// Run this BEFORE shipping a new query — never assume
// ─────────────────────────────────────────
const queryPlan = db.users
  .find({ plan: 'pro' })
  .sort({ joined_at: -1 })
  .explain('executionStats');

console.log('Winning stage:',    queryPlan.queryPlanner.winningPlan.inputStage.stage);
console.log('Docs examined:',    queryPlan.executionStats.totalDocsExamined);
console.log('Docs returned:',    queryPlan.executionStats.nReturned);
console.log('Execution time ms:', queryPlan.executionStats.executionTimeMillis);

// Good: stage is IXSCAN, examined equals returned (ratio 1:1)
// Bad:  stage is COLLSCAN, examined >> returned (ratio 1000:1 or worse)

// ─────────────────────────────────────────
// SCHEMA DESIGN — Embed vs Reference examples side by side
// ─────────────────────────────────────────

// EMBED: Order stores its own line items
// Rationale: line items belong exclusively to this order, always read together,
// bounded in size (a realistic order has 1-50 items, never 50,000)
db.orders.insertOne({
  order_number: 'ORD-20240312-001',
  customer_id: ObjectId('664a1f3b2c1d4e5f6a7b8c9d'),  // reference to users
  status: 'shipped',
  placed_at: new Date('2024-03-12'),
  line_items: [
    // Embedded sub-documents — no separate collection needed for this pattern
    { sku: 'MUG-001', name: 'Ceramic Coffee Mug',   qty: 2, unit_price_usd: 12.99 },
    { sku: 'TSH-042', name: 'Custom Logo T-Shirt',  qty: 1, unit_price_usd: 24.99 }
  ],
  total_usd: 50.97
});

// REFERENCE: Blog post stores author as an ObjectId, not embedded author data
// Rationale: author exists independently, writes many posts.
// If we embedded author name and the author changes their name,
// we'd need to update every post they ever wrote — one update vs thousands.
db.blog_posts.insertOne({
  title: 'Getting Started with MongoDB Indexes',
  slug: 'mongodb-indexes-guide',
  author_id: ObjectId('664a1f3b2c1d4e5f6a7b8c9d'),  // reference — not embedded
  body: 'Indexes are the single biggest performance lever in MongoDB...',
  published_at: new Date('2024-04-01'),
  tags: ['mongodb', 'performance', 'indexing'],
  view_count: 0
});

// Index the reference field so find({ author_id: ... }) and $lookups from users into blog_posts stay fast
db.blog_posts.createIndex({ author_id: 1 }, { name: 'idx_posts_author_id' });

console.log('Indexes created and schema examples inserted.');
Output
Winning stage: IXSCAN
Docs examined: 43
Docs returned: 43
Execution time ms: 2
Indexes created and schema examples inserted.
Pro Tip: TTL Indexes and the 16MB Document Limit
Two index patterns that are underused in practice: TTL indexes automatically delete documents after a set time period, which is ideal for session tokens, verification codes, and temporary data — no cron job required. And the 16MB document limit catches teams off guard when an embedded array grows beyond expectations. If you can't guarantee an array stays under 100-200 items, reference it. The Bucket Pattern — grouping items into fixed-size bucket documents of 100 items each — handles the middle ground where you want some locality without the unbounded growth risk.
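The Bucket Pattern is commonly implemented as an upsert that appends to a bucket with room and starts a new bucket when none has room. The helper below only builds the arguments for `updateOne`; the collection name `comment_buckets`, the field names, and the bucket size of 100 are assumptions for illustration.

```javascript
// Hypothetical helper: build updateOne arguments that append one comment
// to a bucket document holding at most `bucketSize` comments per post.
function buildBucketAppend(postId, comment, bucketSize = 100) {
  return {
    // Match a bucket for this post that still has room.
    filter: { post_id: postId, count: { $lt: bucketSize } },
    update: {
      $push: { comments: comment },
      $inc: { count: 1 },
      // Only applied when the upsert creates a fresh bucket.
      $setOnInsert: { created_at: new Date() }
    },
    // upsert: true creates a new bucket when every existing one is full.
    options: { upsert: true }
  };
}

const { filter, update, options } = buildBucketAppend('post-123', {
  author: 'priya',
  text: 'Great post!'
});
console.log(JSON.stringify(filter));
// {"post_id":"post-123","count":{"$lt":100}}
// In mongosh: db.comment_buckets.updateOne(filter, update, options)
```

Each bucket stays far below the 16MB limit no matter how many comments a post attracts, and reading the latest page of comments is still a single-document fetch. Under concurrent writes this sketch may leave more than one partially filled bucket, which is harmless for most read patterns.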
Production Insight
A missing index is completely invisible during development with 500 test documents — queries return in under 5ms via COLLSCAN because the collection fits in memory.
In production with 5 million documents, that same COLLSCAN takes 8-15 seconds and saturates disk read throughput, which cascades into application-level timeouts and connection pool exhaustion across the entire service.
Rule: run explain('executionStats') on every query you write before it ships. Not once a week, not before the next deployment — before every single query goes to production.
Key Takeaway
Indexes are non-negotiable in production — a COLLSCAN on 5M documents takes seconds, not milliseconds, and the latency is invisible in your development environment.
Schema design is a single recurring decision: embed for bounded exclusive data you always read together, reference for shared, independent, or potentially unbounded data.
Never ship a query without running explain() first — confirm IXSCAN before you commit to the query pattern.
Embed vs Reference — Decision Framework
If: Data belongs exclusively to one parent, always read together, bounded size (under 100 items)
Use: Embed. One read gets everything, no extra round-trips, no additional collection to manage
If: Data is shared across many parents, e.g., an author who writes many posts
Use: Reference. Update the shared document once rather than in every parent that embeds it
If: Sub-data needs independent queries, e.g., 'show all comments by user X across all posts'
Use: Reference. You need a separate indexed collection to query the data without loading every parent
If: Array can grow without a predictable upper bound (comments, messages, events)
Use: Reference or Bucket Pattern. Never embed unbounded arrays; the 16MB limit is real and will hit on your most popular content
If: Read-heavy workload where parent and child are always fetched together
Use: Embed. Denormalization trades write complexity for read speed; ideal when the parent-child relationship is exclusive and bounded
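The framework can also be written down as a function, which makes the order of the checks explicit. This is a rough sketch only: `embedOrReference` and its input flags are hypothetical, and the 100-item bound is the rule of thumb quoted in this section, not a MongoDB limit.

```javascript
// Hypothetical encoding of the decision framework. Checks run from the
// strongest reason to reference down to the case where embedding is safe.
function embedOrReference({ exclusiveToParent, readTogether, maxItems, needsIndependentQueries }) {
  if (!exclusiveToParent) return 'reference';          // shared across parents
  if (needsIndependentQueries) return 'reference';     // queried on its own
  if (maxItems === undefined || maxItems === Infinity) {
    return 'reference or bucket';                      // unbounded growth
  }
  if (readTogether && maxItems <= 100) return 'embed'; // bounded and exclusive
  return 'reference';
}

// Order line items: exclusive, read with the order, realistically bounded.
console.log(embedOrReference({ exclusiveToParent: true, readTogether: true, maxItems: 50 }));
// Blog author: shared by many posts.
console.log(embedOrReference({ exclusiveToParent: false, readTogether: true, maxItems: 1 }));
// Comments on a post: no predictable upper bound.
console.log(embedOrReference({ exclusiveToParent: true, readTogether: true }));
```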

Aggregation Pipelines — MongoDB's Answer to SQL GROUP BY and JOINs

The find() method takes you far. The moment you need to summarise, group, reshape, or join data across collections, you need the Aggregation Pipeline. Think of it as an assembly line: each stage receives a stream of documents from the previous stage, does exactly one job, and passes the results forward. The pipeline is the unit of work — you compose complex analytics queries by chaining simple stages.

The most-used stages: $match filters documents just like a find() query, $group aggregates and accumulates values like COUNT and SUM, $sort orders results, $project reshapes fields and controls what's returned, $lookup joins another collection, and $unwind flattens arrays into individual documents (essential before grouping on array elements).

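The single most impactful pipeline rule: put $match as early as possible, ideally as the first stage. A $match that reduces 2 million documents to 50,000 before the $group stage leaves every subsequent stage with 40x fewer documents to process, and a leading $match can use an index. The optimizer can reorder some stages on its own (for example, moving a $match ahead of a $sort), but relying on that is fragile. Putting $group ahead of a filter that could have run first forces the pipeline to process the entire collection before filtering, a completely avoidable performance tax that can time out pipelines on large collections.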
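The effect of filtering early can be illustrated without a database. The sketch below is a plain-JavaScript analogy, not MongoDB's execution engine: it counts how many line items the grouping step touches when the filter runs before versus after the grouping. The data set and the 1-in-40 match rate are invented to mirror the 2 million to 50,000 ratio.

```javascript
// Plain-JavaScript analogy: 1,000 orders, of which 1 in 40 is from March.
const orders = [];
for (let i = 0; i < 1000; i++) {
  orders.push({
    month: i % 40 === 0 ? '2024-03' : '2023-12',
    line_items: [{ sku: 'MUG-001', qty: 2 }, { sku: 'TSH-042', qty: 1 }]
  });
}

// Group units by a computed key, counting every line item touched.
function groupUnits(docs, keyOf, counter) {
  const totals = {};
  for (const order of docs) {
    for (const item of order.line_items) {
      counter.touched += 1;
      const key = keyOf(order, item);
      totals[key] = (totals[key] ?? 0) + item.qty;
    }
  }
  return totals;
}

// Filter first: only March orders reach the grouping step.
const early = { touched: 0 };
const marchOrders = orders.filter((o) => o.month === '2024-03');
const earlyTotals = groupUnits(marchOrders, (o, item) => item.sku, early);

// Filter last: group everything by month and SKU, then discard other months.
const late = { touched: 0 };
const allTotals = groupUnits(orders, (o, item) => `${o.month}|${item.sku}`, late);
const lateTotals = {};
for (const [key, units] of Object.entries(allTotals)) {
  const [month, sku] = key.split('|');
  if (month === '2024-03') lateTotals[sku] = units;
}

console.log(earlyTotals, early.touched); // same totals, 50 items touched
console.log(lateTotals, late.touched);   // same totals, 2000 items touched
```

Both orderings produce identical results; only the amount of work differs, by the same factor of 40 as the filter's selectivity.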

$lookup deserves special mention because it's MongoDB's JOIN equivalent, and it behaves very differently from a SQL JOIN. It runs per-document in the left collection — if your left collection has 100,000 documents, that's 100,000 individual index lookups against the foreign collection. The foreign field must be indexed, or you've just caused a COLLSCAN per document. $lookup is expensive by nature; prefer embedding when possible and reach for $lookup only when the data genuinely needs to live in separate collections.

aggregation_pipeline_examples.js · JavaScript
use('ecommerce_store');

// ─────────────────────────────────────────
// EXAMPLE 1: Revenue by product for March 2024
// This replaces a multi-JOIN GROUP BY in SQL.
// Pipeline order: $match first (filter early), then $unwind, $group, $sort, $limit, $project
// ─────────────────────────────────────────
const revenueByProduct = db.orders.aggregate([

  // Stage 1 — $match: filter to March 2024 shipped/delivered orders
  // THIS MUST BE FIRST — reduces 2M docs to ~50K before any grouping
  {
    $match: {
      placed_at: {
        $gte: new Date('2024-03-01'),
        $lt:  new Date('2024-04-01')
      },
      status: { $in: ['shipped', 'delivered'] }  // exclude cancelled orders
    }
  },

  // Stage 2 — $unwind: flatten line_items array
  // Before: 1 order doc with 3 line items
  // After:  3 docs, one per line item, each carrying the parent order fields
  { $unwind: '$line_items' },

  // Stage 3 — $group: calculate revenue and units per SKU
  {
    $group: {
      _id: '$line_items.sku',
      total_revenue: {
        $sum: { $multiply: ['$line_items.qty', '$line_items.unit_price_usd'] }
      },
      total_units_sold: { $sum: '$line_items.qty' },
      order_count: { $sum: 1 }
    }
  },

  // Stage 4 — $sort: highest revenue first
  { $sort: { total_revenue: -1 } },

  // Stage 5 — $limit: top 5 products only
  { $limit: 5 },

  // Stage 6 — $project: rename _id to sku, round currency to 2 decimal places
  {
    $project: {
      _id: 0,
      sku: '$_id',
      total_revenue:    { $round: ['$total_revenue', 2] },
      total_units_sold: 1,
      order_count: 1
    }
  }

]).toArray();

console.log('Top 5 products by March revenue:');
console.log(JSON.stringify(revenueByProduct, null, 2));

// ─────────────────────────────────────────
// EXAMPLE 2: $lookup — Join blog posts with author display name
// MongoDB's LEFT JOIN equivalent — runs per-document, so foreign field MUST be indexed
// ─────────────────────────────────────────
const postsWithAuthors = db.blog_posts.aggregate([

  // $match first — filter before joining to reduce the number of $lookup operations
  { $match: { published_at: { $gte: new Date('2024-01-01') } } },

  {
    $lookup: {
      from: 'users',           // the collection to join against
      localField: 'author_id', // field in blog_posts
      foreignField: '_id',     // field in users; _id is always indexed automatically
      as: 'author_details'     // result stored as an array field
    }
  },

  // $lookup always produces an array; unwind it since each post has exactly one author.
  // Note: $unwind drops posts with no matching author unless preserveNullAndEmptyArrays is set.
  { $unwind: '$author_details' },

  // $project: shape the final output — pull author name up from the nested join result
  {
    $project: {
      _id: 0,
      title: 1,
      slug: 1,
      author_name: '$author_details.display_name',
      published_at: 1,
      tags: 1
    }
  }

]).toArray();

console.log('Posts with authors:', JSON.stringify(postsWithAuthors, null, 2));

// ─────────────────────────────────────────
// EXAMPLE 3: Cohort analysis — users grouped by signup month
// Real-world reporting pattern used in SaaS dashboards
// ─────────────────────────────────────────
const signupCohorts = db.users.aggregate([
  {
    $group: {
      _id: {
        year:  { $year:  '$joined_at' },
        month: { $month: '$joined_at' }
      },
      new_users: { $sum: 1 },
      pro_users: {
        $sum: { $cond: [{ $eq: ['$plan', 'pro'] }, 1, 0] }  // conditional count
      }
    }
  },
  { $sort: { '_id.year': 1, '_id.month': 1 } },
  {
    $project: {
      _id: 0,
      month: {
        $concat: [
          { $toString: '$_id.year' }, '-',
          { $toString: '$_id.month' }
        ]
      },
      new_users: 1,
      pro_users: 1,
      conversion_rate: {
        $round: [{ $multiply: [{ $divide: ['$pro_users', '$new_users'] }, 100] }, 1]
      }
    }
  }
]).toArray();

console.log('Signup cohorts with conversion rates:', JSON.stringify(signupCohorts, null, 2));
Output
Top 5 products by March revenue:
[
{ "sku": "TSH-042", "total_revenue": 749.70, "total_units_sold": 30, "order_count": 22 },
{ "sku": "MUG-001", "total_revenue": 519.60, "total_units_sold": 40, "order_count": 31 },
{ "sku": "HAT-007", "total_revenue": 389.25, "total_units_sold": 25, "order_count": 18 },
{ "sku": "BAG-019", "total_revenue": 299.00, "total_units_sold": 10, "order_count": 9 },
{ "sku": "PIN-003", "total_revenue": 89.55, "total_units_sold": 15, "order_count": 14 }
]
Posts with authors: [
{
"title": "Getting Started with MongoDB Indexes",
"slug": "mongodb-indexes-guide",
"author_name": "Priya Sharma",
"published_at": "2024-04-01T00:00:00.000Z",
"tags": ["mongodb", "performance", "indexing"]
}
]
Signup cohorts with conversion rates: [
{ "month": "2024-1", "new_users": 142, "pro_users": 18, "conversion_rate": 12.7 },
{ "month": "2024-2", "new_users": 198, "pro_users": 31, "conversion_rate": 15.7 },
{ "month": "2024-3", "new_users": 267, "pro_users": 52, "conversion_rate": 19.5 }
]
Interview Gold: Pipeline Order Is an Architectural Decision
Interviewers love asking why aggregation pipelines are slow. The answer is almost always $match placed too late in the pipeline. A $match at stage 1 that cuts your working set from 2 million to 50,000 documents hands every subsequent stage 40x fewer documents to process, with a matching drop in CPU, memory, and I/O. This is not a style preference: it is the difference between a pipeline that completes in 200ms and one that runs until your maxTimeMS or client-side timeout kills it.
Production Insight
A reporting pipeline ran $group before $match, processing the entire 3M-document orders collection before filtering by date range.
The pipeline timed out daily in production and passed every test in development because the test collection had 500 documents.
Moving $match to stage 1 reduced the working set from 3M to 45K documents and cut execution time from timeout to 340ms.
Rule: $match first, always. Treat any pipeline where $group or $sort appears before $match as a bug.
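The rule above is mechanical enough to check in code review or CI. The following is a minimal sketch, assuming pipelines are available as plain arrays of stage objects; stagesBeforeFirstMatch and the sample pipelines are hypothetical names made up for illustration, not part of any MongoDB API.

```javascript
// Hypothetical lint helper: list the $group/$sort stages that run before the first $match.
const EXPENSIVE_STAGES = ['$group', '$sort'];

function stagesBeforeFirstMatch(pipeline) {
  // Each stage object has exactly one key: the stage operator.
  const names = pipeline.map((stage) => Object.keys(stage)[0]);
  const firstMatch = names.indexOf('$match');
  // No $match at all means every stage sees the whole collection.
  const head = firstMatch === -1 ? names : names.slice(0, firstMatch);
  return head.filter((name) => EXPENSIVE_STAGES.includes(name));
}

const slowPipeline = [
  { $sort: { created_at: -1 } },
  { $match: { status: 'paid' } },
];
const fastPipeline = [
  { $match: { status: 'paid' } },
  { $sort: { created_at: -1 } },
];

console.log(stagesBeforeFirstMatch(slowPipeline)); // [ '$sort' ]
console.log(stagesBeforeFirstMatch(fastPipeline)); // []
```

A non-empty result is a prompt to look at the pipeline, not proof of a bug: a $match that filters on a field computed by $group legitimately has to come after it.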
Key Takeaway
Pipeline stage order is a performance decision, not a stylistic one — $match must be the first stage to filter early and make every subsequent stage cheaper.
$lookup is a per-document operation, not a set-based JOIN — the foreign collection's join field must be indexed or you get a COLLSCAN per document.
The aggregation pipeline is MongoDB's answer to GROUP BY, JOINs, and analytics queries, but you must architect the stage order and index strategy yourself.

JSON vs BSON — What Makes MongoDB's Storage Format Different

When you insert a document into MongoDB, the data is stored on disk as BSON (Binary JSON), not plain JSON. BSON is a binary serialization format designed to be lightweight, traversable, and efficient for both storage and scanning. Understanding the differences between JSON and BSON is critical for estimating storage costs, choosing data types, and debugging size-related issues like BSONObjectTooLarge.

BSON extends the JSON data model with extra types that matter in real applications:
  • ObjectId: 12-byte identifier (4-byte timestamp + 5-byte per-process random value + 3-byte counter), so there is no need for UUID strings or auto-increment integers
  • Date: milliseconds since the Unix epoch, with no string parsing overhead
  • Int32 / Int64 / Double: explicit numeric types, so there is no ambiguity between integers and floats
  • Binary Data: raw byte storage with subtype support, for images or encrypted values
  • Regular Expression: native regex type, with no string escaping for pattern matching

The BSON format is not a compression scheme; it adds a small overhead per field because it stores a type byte and the field name with every value. In exchange, numbers and dates are encoded in fixed-width binary rather than variable-length text, so a document with many large numbers or dates can come out smaller than its JSON form, while one dominated by short strings or small integers can come out slightly larger.

| Feature | JSON | BSON |
| --- | --- | --- |
| Data types | Objects, Arrays, Strings, Numbers (all IEEE-754 doubles), Booleans, Null | All JSON types + ObjectId, Date, Int32, Int64, Decimal128, Binary, Regex, Timestamp |
| Encoding | UTF-8 text | Binary with type markers and field-length prefixes |
| Number handling | All numbers parsed as double; integer precision loss above 2^53 | Explicit int32/int64/double/decimal; no precision loss for large integers |
| Date storage | String (ISO 8601); requires parsing and conversion | 64-bit signed integer of milliseconds since epoch; native Date type |
| Size overhead | Variable; numbers as strings can be large | Fixed-size binary for numbers and dates; field names stored per document |
| Traversal | Full parsing required to find a field | Field marking with length prefixes allows O(1) skip of fields during scanning |
| Sorting | String comparison for numbers can produce incorrect order | Native numeric comparison works correctly |

In practice, BSON's richer type system eliminates entire classes of bugs. Storing MongoDB IDs as strings leads to lexicographic sorting issues; storing dates as strings makes range queries require string comparison; storing large integers as JSON numbers loses precision above 2^53. BSON avoids all these problems at the storage layer. The trade-off is that field names are stored in every document — renaming a field after data is loaded requires a migration that updates every document. Use short, meaningful field names to balance clarity with storage efficiency.
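Two of the failure modes described above can be reproduced in plain JavaScript without a database. This is only an illustration of the underlying number and string behaviour; the order_id field is a made-up example.

```javascript
// 2^53 + 1 cannot be represented as a double, so JSON.parse silently rounds it.
const parsed = JSON.parse('{"order_id": 9007199254740993}');
console.log(parsed.order_id === 9007199254740992); // true: the last digit changed
console.log(Number.isSafeInteger(parsed.order_id)); // false

// A 64-bit integer type keeps the exact value; BigInt shows the same idea in JS.
console.log(BigInt('9007199254740993').toString()); // 9007199254740993

// Numbers stored as strings sort lexicographically, not numerically.
console.log(['10', '9', '100'].sort());          // [ '10', '100', '9' ]
console.log([10, 9, 100].sort((a, b) => a - b)); // [ 9, 10, 100 ]
```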

BSON Size Calculation Tip
Use bsonsize(doc) in mongosh (the legacy mongo shell spells it Object.bsonsize(doc)) to get the exact BSON byte size of any document. This is the reliable way to measure how close you are to the 16MB limit: JSON.stringify approximations will be wrong because BSON encodes types differently. Run it on a sample document from your largest collection to establish a baseline.
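To see where the bytes go, here is a simplified size estimator that follows the BSON layout for a handful of types. It is a sketch for building intuition, not a substitute for measuring in the shell: estimateBsonSize and valueSize are hypothetical names, types such as ObjectId, Decimal128, and Binary are not handled, and it assumes integers that fit in 32 bits are stored as Int32 and every other number as an 8-byte Double.

```javascript
// Document = 4-byte length prefix + elements + one trailing 0x00 byte.
function estimateBsonSize(doc) {
  let size = 4 + 1;
  for (const [key, value] of Object.entries(doc)) {
    size += 1 + Buffer.byteLength(key, 'utf8') + 1; // type byte + field name + 0x00
    size += valueSize(value);
  }
  return size;
}

function valueSize(value) {
  if (value === null) return 0;
  if (typeof value === 'boolean') return 1;
  // String = 4-byte length + UTF-8 bytes + 0x00
  if (typeof value === 'string') return 4 + Buffer.byteLength(value, 'utf8') + 1;
  if (typeof value === 'number') {
    const fitsInt32 = Number.isInteger(value) && value >= -(2 ** 31) && value < 2 ** 31;
    return fitsInt32 ? 4 : 8;
  }
  if (value instanceof Date) return 8;
  // Arrays are encoded as documents whose keys are "0", "1", "2", ...
  if (typeof value === 'object') return estimateBsonSize(value);
  throw new TypeError(`unsupported type: ${typeof value}`);
}

console.log(estimateBsonSize({ hello: 'world' }));                          // 22
console.log(estimateBsonSize({ city: 'Pune' }));                            // 20
console.log(estimateBsonSize({ customer_shipping_address_city: 'Pune' }));  // 46
```

The last two lines hold the same value; the 26-byte difference is entirely the longer field name, repeated in every document that uses it.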
Production Insight
When migrating from a relational database to MongoDB, teams often continue storing timestamps as ISO-format strings because 'that's how the API sends them.' A 24-character ISO string costs roughly 20 bytes more per field than an 8-byte BSON Date, and range queries fall back to string comparison, which only works while every value uses exactly the same format and time zone. Store dates as BSON Date objects, convert to and from strings only at the API boundary, and use the $dateFromString aggregation operator to migrate string fields that already exist.
For integer fields that always fit in the signed 32-bit range (below 2^31), use Int32 explicitly: it takes 4 bytes, half the 8 bytes of a Double or Int64. For monetary values, Decimal128 avoids floating-point rounding errors. These type choices compound across millions of documents.
Key Takeaway
BSON is not a compressed version of JSON — it's a binary format with a richer type system that eliminates precision loss, date-parsing bugs, and sorting issues.
Field names are stored in every document, so short names have a measurable storage impact across large collections.
Use Object.bsonsize() to measure actual storage and understand how your schema choices affect the document size.

Document Structure — A Visual Guide

MongoDB documents are JSON-like objects that can contain nested fields, arrays, and sub-documents. To reason about data modeling, it helps to see the anatomy of a document with its three structural primitives: scalar values, arrays, and embedded objects.

A scalar field holds a single value of a specific BSON type — a string, number, date, or ObjectId. An array holds an ordered list of values (which can themselves be scalars or sub-documents). An embedded object nests a complete sub-document inside a field, creating a hierarchy.

The diagram below shows a representative user document with addresses nested as an array of objects, preferences as an embedded sub-document, and tags as a simple string array.

This structure means a single findOne() call retrieves the user plus all their addresses, preferences, and tags in one operation. In a relational database, this would require at least three JOINs across four tables. The visual highlights how deeply nested data is stored contiguously on disk, which makes reads fast. Updating a single nested array element, however, needs a positional operator such as the filtered positional operator $[elem] with arrayFilters; without one, the application ends up rewriting the whole array or document.
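As a concrete stand-in for the shape described above, here is an illustrative user document written as a JavaScript object, along with the filter and update documents that reach into it. Every field name and value is made up for the example, and the updateOne call appears only as a comment because it assumes a connected driver collection named users.

```javascript
// Illustrative user document; all fields and values are hypothetical.
const userDoc = {
  _id: 'u_1001',                        // scalar (a real document would usually use an ObjectId)
  display_name: 'Priya Sharma',         // scalar string
  joined_at: new Date('2024-01-15'),    // scalar date
  tags: ['beta-tester', 'newsletter'],  // array of scalars
  preferences: {                        // embedded sub-document
    theme: 'dark',
    notifications: { email: true, sms: false },
  },
  addresses: [                          // array of sub-documents
    { label: 'home', city: 'Pune', postcode: '411001' },
    { label: 'work', city: 'Mumbai', postcode: '400001' },
  ],
};

// Dot notation reaches nested fields and array elements in a query filter.
const filter = { 'preferences.theme': 'dark', 'addresses.city': 'Pune' };

// Filtered positional update: touch only the array element whose label is 'home'.
const update = { $set: { 'addresses.$[elem].postcode': '411002' } };
const options = { arrayFilters: [{ 'elem.label': 'home' }] };

// With a connected driver, these would be passed as:
//   await users.updateOne({ _id: userDoc._id }, update, options);
console.log(Object.keys(userDoc));
```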

Production Insight
When fetching a document, the entire BSON payload is loaded into RAM. For documents with large embedded arrays (e.g., thousands of comments), even if you only need the post title, you pay the full I/O cost of loading all comments. Use projections to limit returned fields, but be aware that the database still reads the full document from disk before applying the projection. For read-heavy workloads with large embedded arrays, consider moving the array to a separate collection and using a $lookup only when the array data is needed.
Key Takeaway
A MongoDB document can contain arrays and embedded objects — the structure mirrors your application's native data shapes.
The trade-off: one read fetches everything, but updates to nested fields require special operators and the entire document is loaded into memory even if you only need a subset of fields.

Embedding vs Referencing — Decision Matrix for Production Schema Design

The most consequential schema design decision in MongoDB is whether to embed related data inside the parent document or store it as a separate referenced document with a foreign key. This decision affects query performance, write complexity, data consistency, and the maximum document size. There is no universal answer — the right choice depends on your specific access pattern, data growth characteristics, and consistency requirements.

The following decision matrix formalizes the trade-offs using real-world production patterns. Use it as a checklist during schema design reviews.

The Bucket Pattern: Middle Ground for Medium-Sized Arrays
When an array is too large for practical embedding (1000+ items) but too small or performance-sensitive for fully referenced queries, use the Bucket Pattern. Store items in groups of 100 inside bucket documents keyed by a common grouping field. For example, store 100 comments per bucket document with a post_id field. This keeps each document under ~50KB, allows efficient retrieval of a range of comments, and avoids the 16MB wall. Use a sort field (like created_at) to order comments within the bucket and page through buckets.
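One common way to implement this is a single upsert that appends to the newest bucket with room left and creates a new bucket otherwise. The sketch below only builds the operation documents and simulates the effect in memory so it can run without a database; the bucket shape { post_id, count, comments } and the helper names are assumptions for illustration, not the schema used in the incident.

```javascript
const BUCKET_SIZE = 100; // comments per bucket, as suggested above

// Build the filter / update / options for appending one comment.
function bucketAppendOp(postId, comment) {
  return {
    filter: { post_id: postId, count: { $lt: BUCKET_SIZE } },
    update: { $push: { comments: comment }, $inc: { count: 1 } },
    options: { upsert: true }, // no bucket with room left: a new one is created
  };
}

// In-memory stand-in for collection.updateOne(filter, update, options),
// only to show how buckets fill up.
function applyToMemory(buckets, op) {
  let bucket = buckets.find(
    (b) => b.post_id === op.filter.post_id && b.count < BUCKET_SIZE
  );
  if (!bucket) {
    bucket = { post_id: op.filter.post_id, count: 0, comments: [] };
    buckets.push(bucket);
  }
  bucket.comments.push(op.update.$push.comments);
  bucket.count += op.update.$inc.count;
}

const buckets = [];
for (let i = 0; i < 250; i++) {
  const comment = { author: `user${i}`, text: 'Nice post', created_at: new Date() };
  applyToMemory(buckets, bucketAppendOp('post-42', comment));
}
console.log(buckets.map((b) => b.count)); // [ 100, 100, 50 ]
```

Against a real collection the same triple would go to updateOne(op.filter, op.update, op.options). Under concurrent writers the upsert may occasionally create an extra partially filled bucket, which is usually harmless but worth knowing about.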
Production Insight
A common mistake is to always embed 'for performance' without considering write amplification. If you embed a user's address and the user moves, you update exactly one user document. But if you embed the address in every order they've ever placed, you must update thousands of order documents — each update rewriting the entire order document. This write amplification can saturate your primary's write capacity.
Rule: embed when the child data is read-intensive and infrequently updated; reference when the child data changes often or is shared. Profile your actual read/write ratio: a 90:10 read-heavy workload favors embedding; a 50:50 read/write pattern often favors referencing.
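The rule of thumb can be written down as a small function, which is sometimes useful as a checklist in schema reviews. This is a sketch under stated assumptions: chooseModel and its input fields are hypothetical, the 1000-item cut-off comes from the Bucket Pattern note above, and real decisions should still be driven by profiling.

```javascript
// Hypothetical helper encoding the embed / bucket / reference rule of thumb.
function chooseModel({ shared, queriedIndependently, bounded, maxItems, updatedOften }) {
  if (shared || queriedIndependently) return 'reference'; // data lives on its own
  if (!bounded) return 'reference';                       // array can grow without limit
  if (maxItems >= 1000) return 'bucket';                  // bounded but too large to embed
  if (updatedOften) return 'reference';                   // avoid rewriting parents on every change
  return 'embed';                                         // exclusive, bounded, read together
}

// Line items on an order: exclusive to the order, small, rarely changed.
console.log(chooseModel({
  shared: false, queriedIndependently: false, bounded: true, maxItems: 50, updatedOften: false,
})); // embed

// Comments on a post: no upper bound.
console.log(chooseModel({
  shared: false, queriedIndependently: false, bounded: false, maxItems: Infinity, updatedOften: false,
})); // reference
```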
Key Takeaway
The embed-vs-reference decision is not about data modeling purity — it's about your application's read/write ratio, array growth bounds, and consistency requirements.
Use the decision matrix as a structural guide: embed for exclusive, read-together, bounded data; reference for shared, independently queried, or potentially unbounded data.
The Bucket Pattern provides a middle ground for arrays that are too large to embed but too performance-sensitive to fully reference.

GridFS — Storing Files Larger Than 16MB

When you need to store files larger than MongoDB's 16MB document size limit — audio files, high-resolution images, PDFs, or video clips — you cannot store them as a single document. GridFS is MongoDB's built-in specification for storing and retrieving large binary objects by splitting them into smaller chunks.

GridFS stores the file across two collections in the same database:
  • fs.files: stores metadata about the file (filename, length, chunk size, upload date, and any custom metadata; older drivers also wrote an MD5 hash and content type, both now deprecated)
  • fs.chunks: stores the actual binary data in 255KB chunks by default, each chunk referencing the file via a files_id field

GridFS is not a separate service. It is a convention implemented by the MongoDB drivers and the mongofiles command-line tool. The driver splits the file into chunks on write and reassembles them on read. The 255KB default is a compromise between the number of chunks and the size of each chunk; you can change it when writing the file if your workload benefits from larger or smaller chunks.

When should you use GridFS? When the file size exceeds 16MB and you need to keep it inside MongoDB so it is covered by the same replication and backups as your data, or when you need to access portions of a file (e.g., skip to a specific byte offset in a video). Do not use GridFS for files smaller than 16MB: storing them in a regular document as a Binary (binData) field is simpler and faster. GridFS is also not a replacement for a dedicated object store like S3; it fits best when the file is tightly coupled with your MongoDB data. Note that GridFS operations do not take part in multi-document transactions, so metadata and file content are not updated atomically.

Performance considerations: reading a large file via GridFS involves querying the fs.chunks collection with a range query on n (chunk index). Ensure an index on { files_id: 1, n: 1 } exists to make chunk retrieval efficient. For write-heavy file uploads, the chunk writes are not atomic as a group — each chunk is individually written. If an upload fails mid-way, you must clean up orphaned chunks manually.
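The chunk arithmetic behind byte-range access is simple enough to sketch. The helper names below are hypothetical; the only facts assumed are the default chunk size of 255KB (255 × 1024 bytes) and that chunks are numbered from n = 0.

```javascript
const DEFAULT_CHUNK_SIZE = 255 * 1024; // 261,120 bytes

// How many chunk documents a file produces, and how full the last one is.
function chunkLayout(fileLength, chunkSize = DEFAULT_CHUNK_SIZE) {
  if (fileLength === 0) return { chunkCount: 0, lastChunkBytes: 0 };
  const chunkCount = Math.ceil(fileLength / chunkSize);
  const lastChunkBytes = fileLength - (chunkCount - 1) * chunkSize;
  return { chunkCount, lastChunkBytes };
}

// Which chunk (value of n) holds a given byte offset, and where inside it.
function locateByte(offset, chunkSize = DEFAULT_CHUNK_SIZE) {
  return { n: Math.floor(offset / chunkSize), offsetInChunk: offset % chunkSize };
}

console.log(chunkLayout(25165824));        // { chunkCount: 97, lastChunkBytes: 98304 }
console.log(locateByte(10 * 1024 * 1024)); // { n: 40, offsetInChunk: 40960 }
```

A 24MB file therefore becomes 97 documents in the chunks collection, and seeking to the 10MB mark means reading from chunk n = 40 onward, which is why the { files_id: 1, n: 1 } index matters.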

gridfs_example.jsJAVASCRIPT
// Node.js example using the official 'mongodb' driver (GridFSBucket API).
// Run it with node, not mongosh. From the command line, the separate
// 'mongofiles' tool can also put and get GridFS files.

const { MongoClient, GridFSBucket, ObjectId } = require('mongodb');
const fs = require('fs');
const { pipeline } = require('stream/promises');

const MONGO_URL = 'mongodb://localhost:27017';

// ─────────────────────────────────────────
// WRITE a file to GridFS
// ─────────────────────────────────────────
async function uploadFile() {
  const client = new MongoClient(MONGO_URL);
  try {
    await client.connect();
    const bucket = new GridFSBucket(client.db('myfiles'), { bucketName: 'user_uploads' });

    // The driver splits the stream into 255KB chunks automatically
    const uploadStream = bucket.openUploadStream('profile_photo_hires.jpg', {
      metadata: { userId: new ObjectId('...') }  // placeholder: substitute a real 24-character hex id
    });

    // pipeline() resolves when the upload has finished and rejects on any stream error
    await pipeline(fs.createReadStream('./profile_photo_hires.jpg'), uploadStream);

    console.log('File uploaded successfully. ID:', uploadStream.id);
    // To verify in mongosh:
    // db.user_uploads.files.findOne({ filename: '...' })
  } finally {
    await client.close();  // close the connection on success and on failure
  }
}

// ─────────────────────────────────────────
// READ a file from GridFS
// ─────────────────────────────────────────
async function downloadFile(fileId) {
  const client = new MongoClient(MONGO_URL);
  try {
    await client.connect();
    const bucket = new GridFSBucket(client.db('myfiles'), { bucketName: 'user_uploads' });

    // Chunks are read in order of n and reassembled into one stream
    await pipeline(
      bucket.openDownloadStream(new ObjectId(fileId)),
      fs.createWriteStream('./downloaded_photo.jpg')
    );
    console.log('File downloaded successfully.');
  } finally {
    await client.close();
  }
}

// ─────────────────────────────────────────
// LIST metadata for all files in a bucket
// ─────────────────────────────────────────
async function listFiles() {
  const client = new MongoClient(MONGO_URL);
  try {
    await client.connect();
    // Metadata lives in '<bucketName>.files'
    const files = client.db('myfiles').collection('user_uploads.files');
    for await (const doc of files.find({})) {
      console.log(doc.filename, doc.length, doc.uploadDate);
    }
  } finally {
    await client.close();
  }
}
Output
File uploaded successfully. ID: ObjectId('665...')
File downloaded successfully.
profile_photo_hires.jpg 25165824 2024-05-01T12:00:00.000Z
GridFS Is Not a General-Purpose File System
GridFS performs poorly for large numbers of small files (thousands of files under 1MB) because each file creates multiple documents in fs.files and fs.chunks, increasing index size and query overhead. For such cases, store small files as base64-encoded strings or BSON Binary directly in a document (if under 16MB total document size). For large-scale file storage, consider S3 or a similar object store and store only the URL/path in MongoDB.
Production Insight
GridFS is most valuable when file content needs to live and die with other MongoDB data. For example, a user's profile photo should be deleted when the user account is deleted. MongoDB has no foreign keys, so deleting the fs.files document on its own does not remove the chunks. Call the driver's bucket.delete(fileId), which removes both the files document and its chunks, as part of your account-deletion code.
In high-throughput environments, writing many files simultaneously can cause contention on the _id index of fs.chunks. Consider using a separate database or sharding the fs.chunks collection on files_id if you expect heavy concurrent file uploads.
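Rule: use { writeConcern: { w: 'majority' } } for file uploads so that acknowledged chunk and metadata writes survive a failover. Write concern does not prevent orphaned chunks: the driver writes the chunks first and the fs.files document last, so an upload that fails midway leaves chunks with no files document. Call abort() on the upload stream when an upload fails, or periodically delete chunks whose files_id has no match in fs.files.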
Key Takeaway
GridFS splits a file into 255KB chunks (by default) stored in fs.chunks, with one metadata document per file in fs.files. It is built in, but it has performance characteristics you must understand.
Use GridFS only when files exceed 16MB or when you need byte-range access; for smaller files, a Binary field in a regular document is simpler.
Always index { files_id: 1, n: 1 } on fs.chunks for efficient retrieval and consider write concern to avoid orphaned chunks.
● Production incident · POST-MORTEM · severity: high

The 16MB Document Wall — When Embedding Everything Kills Your Writes

Symptom
Comment insert operations returned success acknowledgment to the application, but comments never appeared on the post. MongoDB logs showed BSONObjectTooLarge errors. The viral post document had grown to 16.2MB. Customer support started receiving complaints about 'lost' comments before the engineering team was paged.
Assumption
The team assumed MongoDB's flexible document model meant 'put everything in one document because joins are expensive.' They treated the 16MB limit as a theoretical concern — something that only happens at Facebook scale, not at a startup with a niche developer blog.
Root cause
MongoDB enforces a hard 16MB limit per document. An unbounded embedded array — like comments on a post that hits the front page of Hacker News — will eventually hit this wall regardless of your traffic expectations. Each comment was roughly 500 bytes; 32,000 comments crossed the threshold. The application code did not inspect write result objects for errors, so BSONObjectTooLarge failures were silently swallowed and the application continued returning HTTP 200 to the commenter.
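A back-of-the-envelope calculation shows why the wall arrives in the low tens of thousands of comments. The sketch below assumes each comment is exactly 500 bytes of BSON and counts the per-element array overhead (one type byte, the array index written as a decimal string, and a terminating zero byte); commentsUntilLimit is a hypothetical helper, and the real threshold depends on actual comment sizes and the rest of the post document.

```javascript
const MAX_BSON_BYTES = 16 * 1024 * 1024; // 16,777,216

// Count how many fixed-size comments fit before the next append would exceed the limit.
function commentsUntilLimit(commentBytes, otherFieldsBytes = 0) {
  let used = 5 + otherFieldsBytes; // minimal document framing plus the rest of the post
  let count = 0;
  while (true) {
    // type byte + index as decimal string + 0x00 + the embedded comment document
    const elementBytes = 1 + String(count).length + 1 + commentBytes;
    if (used + elementBytes > MAX_BSON_BYTES) return count;
    used += elementBytes;
    count += 1;
  }
}

console.log(commentsUntilLimit(500)); // roughly 33,000
console.log(commentsUntilLimit(520)); // fewer: slightly larger comments hit the wall sooner
```

With comments of roughly 500 bytes the limit lands at about 33,000 entries, the same order of magnitude as the 32,000 seen in the incident.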
Fix
Migrated comments to a separate comments collection with a post_id reference field. Created an index on post_id for efficient retrieval. For posts that legitimately needed a denormalized comment count for display purposes without loading all comments, implemented the Bucket Pattern — storing 100 comments per bucket document instead of all comments in one unbounded array. Added write result error checking to all insert and update paths.
Key lesson
  • Never embed arrays that can grow without a fixed upper bound — if you cannot cap the array at 100-200 items with certainty, use a reference collection
  • Always inspect write result objects for errors — MongoDB returning an insertedId does not guarantee the write actually persisted, especially when document-size limits are in play
  • The 16MB limit is real and will hit you on your most popular content, not your average content — design your schema for your best-case traffic spike, not your median case
  • Silent write failures are worse than loud ones — always propagate storage errors to the application layer and log them with enough context to diagnose the cause
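The second and fourth lessons can be enforced with a small guard around every update. This is a minimal sketch, assuming the result object has the shape the Node.js driver returns from updateOne (acknowledged, matchedCount, modifiedCount, upsertedId); assertUpdateApplied is a hypothetical helper, and the commented usage assumes a connected posts collection and a logger that are not defined here.

```javascript
// Hypothetical guard: turn a suspicious update result into a loud error.
function assertUpdateApplied(result, context) {
  if (!result || result.acknowledged !== true) {
    throw new Error(`write not acknowledged: ${JSON.stringify(context)}`);
  }
  if (result.matchedCount === 0 && !result.upsertedId) {
    throw new Error(`update matched no document: ${JSON.stringify(context)}`);
  }
  return result;
}

// Intended usage (assumes a connected collection, so shown as a comment):
//   try {
//     const result = await posts.updateOne({ _id: postId }, { $push: { comments: comment } });
//     assertUpdateApplied(result, { op: 'addComment', postId });
//   } catch (err) {
//     logger.error({ err, postId }, 'comment write failed');
//     throw err; // rethrow so the API returns an error instead of HTTP 200
//   }

const ok = { acknowledged: true, matchedCount: 1, modifiedCount: 1, upsertedCount: 0, upsertedId: null };
const missed = { acknowledged: true, matchedCount: 0, modifiedCount: 0, upsertedCount: 0, upsertedId: null };

console.log(assertUpdateApplied(ok, { op: 'addComment' }).modifiedCount); // 1
try {
  assertUpdateApplied(missed, { op: 'addComment', postId: 'post-42' });
} catch (err) {
  console.log(err.message); // update matched no document: {"op":"addComment","postId":"post-42"}
}
```

The important part is the rethrow in the catch block: an oversized-document error raised by the driver only becomes silent when application code catches it and carries on.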
Production debug guide
Symptom-driven actions for the most common production issues · 5 entries
Symptom · 01
Query latency spikes from under 10ms to over 5 seconds after data growth
Fix
Run .explain('executionStats') on the slow query. Compare executionStats.totalDocsExamined with executionStats.nReturned. If examined is orders of magnitude larger than returned, the query is scanning far more than it returns: usually a COLLSCAN, sometimes an index that is not selective enough. Create an index on the filter field and re-run explain to confirm the winning plan changes from COLLSCAN to IXSCAN.
Symptom · 02
Aggregation pipeline times out on large collections
Fix
Check pipeline stage order — if $group or $sort appears before $match, move $match to stage 1. Verify the working set fits in RAM via db.serverStatus().wiredTiger.cache. If cache used approaches cache max, your working set has outgrown available memory and you need to either add RAM or reduce the dataset with earlier $match filtering.
Symptom · 03
Update call loses fields that were present before the update
Fix
The call replaced the whole document instead of patching it: a replacement object without an operator such as $set was passed to replaceOne, findOneAndReplace, or the legacy update() method, so every field missing from the replacement was deleted. (updateOne in current drivers and mongosh rejects an update document that has no update operators.) Restore the missing fields from a backup, change the call to { $set: { ... } }, and audit the rest of the codebase for the same pattern.
Symptom · 04
Writes fail with BSONObjectTooLarge error
Fix
A document has hit the 16MB limit — almost certainly an unbounded embedded array. Check the document size: Object.bsonsize(db.collection.findOne({_id: yourId})). Migrate the large array to a referenced collection with an appropriate index. Consider the Bucket Pattern if you need some denormalization for performance.
Symptom · 05
Sort operation logs 'Sort exceeded memory limit of 104857600 bytes'
Fix
The sort field lacks an index, or the compound index field order doesn't match the sort. Create an index that matches your sort field and direction. As a temporary relief, add { allowDiskUse: true } to the aggregation options, but treat this as a signal to fix the index — disk-based sort is a performance symptom, not a solution.
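The explain check from the first entry can be automated. The sketch below walks the winning plan and computes the examined-to-returned ratio; it assumes the classic explain shape with queryPlanner.winningPlan, nested inputStage or inputStages, and executionStats, and the sample object is trimmed, made-up output rather than a real capture. Newer server versions may nest the plan differently, so treat summarizeExplain as a hypothetical starting point.

```javascript
// Collect every stage name in a plan by walking inputStage / inputStages.
function planStages(plan) {
  if (!plan) return [];
  const children = plan.inputStages || (plan.inputStage ? [plan.inputStage] : []);
  return [plan.stage, ...children.flatMap((child) => planStages(child))];
}

function summarizeExplain(explain) {
  const stages = planStages(explain.queryPlanner.winningPlan);
  const { totalDocsExamined, nReturned } = explain.executionStats;
  return {
    collscan: stages.includes('COLLSCAN'),
    // Close to 1 is healthy; thousands means the query reads far more than it returns.
    examinedPerReturned: nReturned === 0 ? totalDocsExamined : totalDocsExamined / nReturned,
  };
}

// Trimmed, made-up explain output for an unindexed filter.
const sampleExplain = {
  queryPlanner: { winningPlan: { stage: 'COLLSCAN', direction: 'forward' } },
  executionStats: { nReturned: 12, totalDocsExamined: 3000000 },
};

console.log(summarizeExplain(sampleExplain)); // { collscan: true, examinedPerReturned: 250000 }
```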
★ MongoDB Quick Debug Reference
Commands to run when something is broken in production. No theory: just copy, paste, diagnose.
Query is slow — users reporting latency or timeouts
Immediate action
Check if the query is doing a full collection scan instead of using an index
Commands
db.collection.find({yourFilter}).explain('executionStats')
db.collection.getIndexes()
Fix now
If winningPlan.stage is COLLSCAN, create an index on the filter field: db.collection.createIndex({ field: 1 }). If the field is in a compound query, create a compound index matching the query's equality filters first, then sort fields.
Document won't insert or update: silent failure or BSONObjectTooLarge
Immediate action
Check if the target document has exceeded the 16MB BSON size limit
Commands
Object.bsonsize(db.collection.findOne({_id: ObjectId('your-id-here')}))
db.collection.stats().avgObjSize
Fix now
If bsonsize is close to 16777216 (16MB), split the embedded array into a referenced collection with an indexed foreign key field. Check avgObjSize: if it is climbing steadily over time, you have an array growth problem across many documents.
Aggregation pipeline is slow or timing out
Immediate action
Verify $match is the first stage and that the $match filter field has an index
Commands
db.collection.explain('executionStats').aggregate(yourPipeline)
db.serverStatus().wiredTiger.cache
Fix now
Move $match to stage 1. If cache bytes in use exceeds 80% of maximum cache bytes configured, your working set exceeds RAM — add RAM or reduce dataset size with earlier filters. Check if $lookup join fields are indexed on the foreign collection.
Replica set secondary falling behind primary: replication lag growing
Immediate action
Check replication lag and oplog window to determine if secondary can catch up or needs resync
Commands
rs.printReplicationInfo()
rs.printSecondaryReplicationInfo()
Fix now
If replication lag exceeds the oplog window, the secondary cannot catch up by replaying the oplog and needs a full resync. Resync it by restarting the member with an empty data directory so it performs an initial sync, or restore it from a recent snapshot. Increase oplog size if this recurs (size is in megabytes, and the command runs against each member): db.adminCommand({ replSetResizeOplog: 1, size: 10240 })
Too many open connections: application connection errors or MongoDB connection pool exhausted
Immediate action
Check current connection count against server maximum and identify long-running operations
Commands
db.serverStatus().connections
db.currentOp({ active: true, secs_running: { $gt: 5 } })
Fix now
Kill long-running operations with db.killOp(opid). Set your application connection pool maxPoolSize to a value that aligns with your MongoDB server's ulimit and available resources. Do not set maxPoolSize higher than the server can handle across all application instances combined.
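The 80% cache check from the aggregation entry above can be computed from the serverStatus output rather than eyeballed. This sketch assumes the WiredTiger cache section exposes the fields 'bytes currently in the cache' and 'maximum bytes configured'; cacheUsage is a hypothetical helper and the sample numbers are invented.

```javascript
// Ratio of cache in use to configured maximum, with the 80% threshold from the text.
function cacheUsage(cacheStats, threshold = 0.8) {
  const used = cacheStats['bytes currently in the cache'];
  const max = cacheStats['maximum bytes configured'];
  const ratio = used / max;
  return { ratio, overThreshold: ratio > threshold };
}

// Invented sample in the shape of db.serverStatus().wiredTiger.cache
const sampleCache = {
  'bytes currently in the cache': 7301444403,
  'maximum bytes configured': 8589934592, // 8 GiB
};

const usage = cacheUsage(sampleCache);
console.log(usage.ratio.toFixed(2), usage.overThreshold); // 0.85 true
```

In mongosh the same function could be fed db.serverStatus().wiredTiger.cache directly.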
MongoDB vs PostgreSQL — Feature Comparison
| Feature / Aspect | MongoDB (Document DB) | PostgreSQL (Relational DB) |
| --- | --- | --- |
| Data shape | Flexible: each document in a collection can have different fields and nesting depths | Fixed: all rows in a table must conform to the same column schema |
| Schema changes | Add fields to new documents without migrating old ones; app must handle missing fields gracefully | Requires ALTER TABLE; can lock the table during migration on large datasets without tooling like pg_repack |
| Joins | $lookup in aggregation pipeline: per-document operation, foreign field must be indexed, more expensive than SQL JOIN | Native JOIN with query planner optimisation: first-class, set-based, highly optimised |
| Horizontal scaling | Built-in sharding distributes data across shards using a shard key; designed for horizontal scale from day one | Vertical scaling by default; horizontal sharding requires Citus, manual partitioning, or application-level sharding |
| Transactions | Multi-document ACID transactions since v4.0: available but carry overhead; single-document operations are atomic by default | Full ACID transactions since day one: mature, efficient, widely understood |
| Query language | JSON filter objects + aggregation pipeline: powerful but requires MongoDB-specific knowledge | Declarative SQL: portable, standardised, known by virtually every backend engineer |
| Best for | Variable-structure data, product catalogues, content management, IoT telemetry, rapid iteration with evolving schemas | Financial records, billing systems, heavily relational data, reporting with complex ad-hoc queries |
| Nested data | First-class: embed arrays and objects natively, query with dot-notation, no additional tables needed | Awkward: JSONB columns support nesting but lose relational query optimisations; separate tables are the idiomatic approach |

Key takeaways

1
MongoDB stores data as BSON documents inside collections: no rows, no fixed columns. Two documents in the same collection can have completely different fields. This is intentional flexibility, not chaos, but it means your application must own schema validation rather than relying on the database to enforce it.
2
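Always use $set (or another update operator) unless you intend a full document replacement. A bare replacement object passed to replaceOne, findOneAndReplace, or the legacy update() method replaces the entire document, silently deleting every field you did not include. This produces no error and returns modifiedCount: 1. Current drivers guard updateOne by rejecting update documents without operators, but the replace-style calls remain a common source of silent data loss in MongoDB production systems.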
3
Every field you filter or sort by in production needs an index. Run explain('executionStats') on every query before it ships and confirm the winning plan shows IXSCAN with a totalDocsExamined to nReturned ratio close to 1:1. A missing index is invisible in development and catastrophic in production.
4
The aggregation pipeline is MongoDB's answer to SQL GROUP BY and JOINs, but stage order is your responsibility. Always put $match first to filter the working set early. Treat any pipeline where $group or $sort precedes $match as a bug, not a style choice.

Common mistakes to avoid

5 patterns
×

Passing a bare replacement object where a $set update was intended

Symptom
The entire document is silently replaced, losing all fields not present in the replacement object. This happens with replaceOne, findOneAndReplace, and the legacy update() method: no error is thrown and modifiedCount returns 1. (updateOne in current drivers and mongosh rejects an update without operators instead.) The data is gone and there is nothing in the application logs to indicate it.
Fix
Always structure updates as { $set: { fieldToChange: newValue } } unless you explicitly intend a full document replacement via replaceOne. Audit all existing update calls in your codebase and flag any that lack an update operator as the outermost key.
×

Not creating indexes on filter and sort fields before going to production

Symptom
Queries work in under 5ms in development with 500 test documents, then take 10-15 seconds in production with 5 million documents because every query performs a full COLLSCAN. Disk I/O saturates and timeouts cascade across the application.
Fix
Run db.collection.find(yourFilter).explain('executionStats') and confirm the winning plan shows IXSCAN with a totalDocsExamined to nReturned ratio close to 1:1. Create compound indexes that match your most common query filter and sort patterns. Do this before load testing, not after your first production incident.
×

Embedding unbounded arrays inside documents

Symptom
An array such as post comments or chat messages grows until the 16MB document limit is hit. Insert operations fail with BSONObjectTooLarge errors. If write error results are not inspected, failures are swallowed silently and users see missing data.
Fix
Any array that cannot be capped with certainty at 100-200 items should be a referenced collection with an indexed foreign key field. Use the Bucket Pattern — groups of 100 items per bucket document — if you need some data locality without an unbounded single document.
×

Placing $group or $sort before $match in an aggregation pipeline

Symptom
The aggregation pipeline processes the entire collection before filtering, causing excessive memory usage, slow execution, and timeouts on large collections. The bug is invisible in development with small datasets.
Fix
Always place $match as the very first stage. If you need to filter after grouping, use a second $match after $group — but ensure the first $match eliminates as many documents as possible. Run explain() on the pipeline to confirm the $match stage uses an index.
×

Not inspecting write result objects for errors

Symptom
Insert or update operations fail silently — the application continues without error even though the write did not persist. Common with BSONObjectTooLarge, duplicate key violations, and write concern failures that are swallowed by a bare try/catch.
Fix
Always check the write result: insertedId for insertOne, modifiedCount for updateOne, deletedCount for deleteOne. Catch exceptions and log them with enough context to identify the document, collection, and operation. Never assume success from the absence of an exception.
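The audit suggested for the first mistake in this list (flag update documents that lack an update operator as the outermost key) reduces to a one-line check. usesUpdateOperators is a hypothetical helper; it could be called from a thin wrapper around your data-access layer or from a test that inspects the update documents your code builds.

```javascript
// True when every top-level key is an update operator such as $set or $inc.
function usesUpdateOperators(update) {
  if (Array.isArray(update)) return true; // aggregation-pipeline updates are a separate, valid form
  const keys = Object.keys(update);
  return keys.length > 0 && keys.every((key) => key.startsWith('$'));
}

console.log(usesUpdateOperators({ $set: { plan: 'pro' } }));                   // true
console.log(usesUpdateOperators({ $set: { plan: 'pro' }, $inc: { seats: 1 } })); // true
console.log(usesUpdateOperators({ plan: 'pro' }));                             // false: would replace the document
console.log(usesUpdateOperators({}));                                          // false: nothing to apply
```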
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
What's the difference between embedding and referencing in MongoDB schem...
Q02 · SENIOR
MongoDB is described as 'schema-less' — but experienced engineers say th...
Q03 · SENIOR
If a MongoDB aggregation pipeline is running slowly on a large collectio...
Q04 · SENIOR
Explain the difference between $set and passing a bare object to updateO...
Q01 of 04 · SENIOR

What's the difference between embedding and referencing in MongoDB schema design, and how do you decide which to use for a given relationship?

ANSWER
Embedding stores related data as nested sub-documents or arrays inside the parent document. Referencing stores an ObjectId in the parent that points to a document in another collection. The decision is driven entirely by access pattern. Embed when the nested data belongs exclusively to one parent, you always read parent and child together, and the array size is bounded — a realistic order's line items are a good example. Reference when the data is shared across many parents and updating it in one place needs to propagate everywhere — an author writing many posts is the textbook case. Also reference when the sub-data needs independent queries or when the array can grow without a predictable upper limit. The core trade-off: embedding optimises reads at the cost of write complexity when embedded data needs updates across many documents. Referencing optimises writes and independent queries at the cost of additional round-trips at read time.
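The two shapes can be sketched as plain documents. All names and values below are made up for illustration: an order embeds its bounded line items, while posts reference a shared author by id.

```javascript
// Embedded: line items belong to exactly one order and are read with it.
const order = {
  _id: "order-1001",
  customer: "Ada",
  items: [
    { sku: "A1", qty: 2 },
    { sku: "B7", qty: 1 },
  ],
};

// Referenced: many posts point at one author document.
const author = { _id: "author-1", name: "Grace" };
const posts = [
  { _id: "post-1", authorId: "author-1", title: "Schema design notes" },
  { _id: "post-2", authorId: "author-1", title: "Indexing notes" },
];

// Renaming the author touches one document; every post sees the change
// the next time the reference is resolved.
author.name = "Grace H.";
const resolved = posts.map((post) => ({ title: post.title, author: author.name }));
console.log(order.items.length, resolved.length);
```

In a real deployment the reference would be resolved with a second query or a `$lookup` stage, which is the extra read cost mentioned in the answer.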
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is the difference between MongoDB and a SQL database?
02
Does MongoDB support transactions like SQL databases do?
03
When should I embed data vs reference it with an ObjectId in MongoDB?
04
How do I search for text in MongoDB documents?
05
What is the MongoDB 16MB document size limit and how do I design around it?