Senior 3 min · March 05, 2026

SQL EXPLAIN — Nested Loop on 5M Rows Caused 12-Min Query

Stale statistics on a 5M-row table: planner estimated 100 rows, chose Nested Loop, caused 12-min query.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • EXPLAIN shows the query plan the database will use; EXPLAIN ANALYZE actually runs the query and shows real timings
  • Seq Scan = full table scan — every row read — usually the problem node to fix
  • Index Scan = B-tree traversal + heap fetch per row; Index Only Scan = no heap fetch (fastest)
  • Cost is in arbitrary units: (startup_cost..total_cost rows=estimated width=bytes)
  • Nested Loop = one table drives, the other is looked up per row — good for small outer tables; Hash Join = build hash table from smaller table, probe with larger — good for large tables
  • Loops=N on an inner node means that node ran N times — the O(n²) signal for correlated subqueries
Plain-English First

Every slow query has a story, and EXPLAIN is how you read it. A query can be written in dozens of equivalent ways, and the database makes a decision about the most efficient path to execute it. EXPLAIN shows you that decision — like the GPS navigation plan before you drive. EXPLAIN ANALYZE is like driving the route and recording how long each segment actually took.

Every slow query has a story, and EXPLAIN is how you read it. In production, a query that ran fine on 10,000 rows can grind a system to a halt at 10 million. The difference isn't always the query itself — it's the execution plan the query optimiser chose. Developers who can read execution plans diagnose slow queries in minutes instead of hours of guesswork.

This guide teaches you to read execution plans the same way a senior DBA does: starting from the innermost node and reading the cost signals, not from the top down. By the end, you'll know the difference between every major node type, what the cost numbers actually mean, and how to use the plan to drive targeted fixes.

How to Read an Execution Plan — The Fundamentals

An execution plan is a tree. The query planner breaks your SQL into operations and arranges them from innermost (raw data retrieval) to outermost (final result). You read a plan from the inside out — the innermost nodes execute first.

Every node in the plan shows three things: the operation type (Seq Scan, Index Scan, Hash Join), the cost estimate (startup_cost..total_cost), and the row and width estimates. When you run EXPLAIN ANALYZE, you also get the actual time and actual rows — the ground truth.

The most important comparison: estimated rows vs actual rows. When these diverge by an order of magnitude, the planner is working from stale statistics and the plan choice may be wrong. A planner that thinks a table has 1,000 rows when it has 1,000,000 will choose algorithms designed for small data.

explain_analyze_reading.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
-- Basic EXPLAIN: shows estimated plan without executing
EXPLAIN
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE o.status = 'completed'
GROUP BY c.customer_id, c.name;

-- EXPLAIN ANALYZE: executes and shows actual timings
EXPLAIN ANALYZE
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE o.status = 'completed'
GROUP BY c.customer_id, c.name;

-- EXPLAIN ANALYZE BUFFERS: also shows cache hit ratio
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT c.name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE o.status = 'completed'
GROUP BY c.customer_id, c.name;
Output
HashAggregate (cost=8420.00..8425.00 rows=500 width=40)
(actual time=124.3..125.1 rows=487 loops=1)
-> Hash Join (cost=1250.00..8100.00 rows=12000 width=32)
(actual time=18.2..119.4 rows=11842 loops=1)
Hash Cond: (o.customer_id = c.customer_id)
-> Seq Scan on orders (cost=0.00..6200.00 rows=12000 width=16)
(actual time=0.05..98.3 rows=11842 loops=1)
Filter: ((status)::text = 'completed')
Rows Removed by Filter: 38158
-> Hash (cost=750.00..750.00 rows=50000 width=24)
-> Seq Scan on customers (cost=0..750 rows=50000 width=24)
Always Use EXPLAIN ANALYZE on the Actual Slow Query
EXPLAIN without ANALYZE shows estimated plans based on statistics — it does not run the query. The estimates can be wrong. EXPLAIN ANALYZE runs the query and shows real timing. For read queries on production, this is safe. For destructive queries (DELETE, UPDATE), wrap in a transaction and roll back: BEGIN; EXPLAIN ANALYZE DELETE FROM ...; ROLLBACK;
Production Insight
Read execution plans from the innermost indented node outward — the deepest node executes first.
The most useful signal: actual rows vs estimated rows. A 10x or greater divergence means statistics are stale — run ANALYZE.
For production query analysis, use EXPLAIN (ANALYZE, BUFFERS) — the BUFFERS output shows how many data pages came from cache vs disk, which is often the real performance signal.
Key Takeaway
Execution plans are trees — read from innermost node (deepest indentation) outward.
Actual rows vs estimated rows is the most important signal — large divergence means stale statistics.
EXPLAIN ANALYZE runs the query; EXPLAIN alone uses estimates. Always ANALYZE with care on production destructive queries.

Seq Scan, Index Scan, and Index Only Scan — The Core Node Types

Three scan types appear in most queries. Understanding what each does determines what fix to apply.

Seq Scan (Sequential Scan) reads every page of the table in order. This is not always wrong — on small tables or when the query returns >15-20% of all rows, a sequential scan is cheaper than following millions of index pointers. But Seq Scan on a large table with a selective WHERE clause is the most common performance problem in SQL.

Index Scan uses a B-tree index to find matching row locations, then fetches each matching row from the heap (the main table storage). The index narrows the search; the heap provides the full row data. This is the standard fix for a Seq Scan — add an index on the WHERE column.

Index Only Scan is the fastest path — all data the query needs is in the index itself (a covering index), so the heap is never accessed. Zero heap fetches. Achievable by adding INCLUDE columns to the index definition.

scan_types_diagnosis.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
-- Seq Scan: no index, or planner chose not to use one
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'pending';
-- Seq Scan on orders (cost=0.00..8500.00 rows=1200 width=80)
-- Actual rows: 1200, loops=1  -- if this is >15% of the table, seq scan is correct

-- Add index -> Index Scan
CREATE INDEX idx_orders_status ON orders(status) WHERE status = 'pending';
-- partial index: only indexes the rows that match -- much smaller, more selective

EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'pending';
-- Index Scan using idx_orders_status on orders
-- (cost=0.42..180.00 rows=1200 width=80) -- cost dropped from 8500 to 180

-- Covering index -> Index Only Scan (no heap access)
CREATE INDEX idx_orders_customer_status
    ON orders(customer_id, status)
    INCLUDE (order_id, total, created_at);  -- add SELECT columns

EXPLAIN ANALYZE
SELECT order_id, total, created_at
FROM orders
WHERE customer_id = 42 AND status = 'completed';
-- Index Only Scan using idx_orders_customer_status
-- Heap Fetches: 0  <- fastest possible path
Output
-- Before index: Seq Scan cost=8500
-- After index: Index Scan cost=180
-- With covering index: Index Only Scan, Heap Fetches: 0
-- Cost reduced from 8500 to ~40
Seq Scan Is Not Always Wrong
The planner chooses Seq Scan when it estimates it is cheaper than following index pointers — typically when >15-20% of rows would be returned. On a table of 1,000 rows, Seq Scan is almost always chosen regardless of indexes — and correctly so. Only investigate Seq Scan when it appears on a large table with a highly selective WHERE clause.
Production Insight
Seq Scan on a large table + low estimated rows = missing index or function wrapping the column.
Seq Scan on a large table + high estimated rows = low selectivity — the planner is correct, the query needs redesigning.
Index Only Scan is the goal for your top read-heavy queries — identify them with pg_stat_statements and cover them.
Key Takeaway
Seq Scan = full table read — investigate when on large tables with selective WHERE.
Index Scan = B-tree lookup + heap fetch per row — standard fix for Seq Scan on selective queries.
Index Only Scan = zero heap access — achievable with INCLUDE columns, the highest-performance read path.

Join Algorithms — Nested Loop, Hash Join, Merge Join

Three join algorithms appear in execution plans. The planner chooses based on estimated table sizes, available indexes, and whether the data is sorted.

Nested Loop Join drives with one table (the outer loop) and probes the inner table once per outer row. Fast when the outer table is small and the inner table has an index on the join column. Catastrophic when the outer table is large — the inner probe repeats N times where N is the outer row count. The loops=N in the inner node of EXPLAIN ANALYZE reveals this.

Hash Join builds a hash table from the smaller table, then probes it for each row of the larger table. Two passes over the data, but O(n+m) rather than O(n×m). The standard choice for large table joins. Requires memory — if the hash table spills to disk (HashBatches > 1 in EXPLAIN output), performance degrades significantly.

Merge Join requires both tables to be sorted on the join column. Extremely fast when both sides are already sorted or can use an index scan in sorted order. Rarely chosen by the planner unless sort order is already available.

join_algorithms_explain.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
-- Force specific join type for comparison (PostgreSQL)
-- Normally leave this to the planner -- these are for diagnosis only

SET enable_hashjoin = off;
SET enable_mergejoin = off;
EXPLAIN ANALYZE SELECT * FROM orders o JOIN customers c ON c.customer_id = o.customer_id;
-- Will use Nested Loop -- shows cost with forced algorithm

RESET enable_hashjoin;
RESET enable_mergejoin;
EXPLAIN ANALYZE SELECT * FROM orders o JOIN customers c ON c.customer_id = o.customer_id;
-- Planner chooses optimal -- typically Hash Join for large tables

-- Diagnose: is Hash Join spilling to disk?
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM large_table a JOIN another_large_table b ON a.id = b.a_id;
-- Look for: Batches: 4 (instead of 1)
-- Batches > 1 means the hash table spilled to disk -- increase work_mem
-- SET work_mem = '256MB'; -- session-level increase for expensive joins
Output
-- Nested Loop (forced, 100K rows outer):
-- Nested Loop (actual time=0.04..18420.3 rows=100000 loops=1)
-- -> Seq Scan on orders (loops=1)
-- -> Index Scan on customers (loops=100000) <- inner runs 100K times
-- Hash Join (planner chose, same query):
-- Hash Join (actual time=42.1..380.2 rows=100000 loops=1)
-- -> Seq Scan on orders (loops=1)
-- -> Hash -> Seq Scan on customers (loops=1) <- hash built once
-- Massive improvement: 18420ms -> 380ms
Hash Join Spilling to Disk
If EXPLAIN ANALYZE shows Batches: 4 (or any number > 1) on a Hash Join, the hash table exceeded work_mem and spilled to disk. This dramatically slows the join. Increase work_mem for the session: SET work_mem = '256MB'. This is a session-level setting — increase it for specific expensive queries, not globally, to avoid OOM on concurrent connections.
Production Insight
Nested Loop + large loops= number is the most common join performance problem — it means the planner underestimated the outer table's size (stale statistics).
Hash Join Batches > 1 means memory pressure — increase work_mem for the session, not globally.
Planner forcing (SET enable_hashjoin = off) is a diagnostic tool only — never leave it disabled in production.
Key Takeaway
Nested Loop = good for small outer tables + indexed inner; catastrophic for large outer tables.
Hash Join = large table joins; Batches > 1 means hash spilled to disk — increase work_mem.
Merge Join = when both sides are already sorted — rare but extremely fast when applicable.

Statistics, ANALYZE, and Why Estimates Go Wrong

The query planner does not read your data to make decisions — it reads statistics. PostgreSQL maintains per-column statistics: the number of distinct values, the most common values and their frequencies, and a histogram of the value distribution. These statistics live in pg_statistic and are updated by VACUUM ANALYZE.

When statistics are stale — after a bulk insert, a data migration, or a period of high insert/delete volume — the planner's row estimates can be orders of magnitude off. A planner that thinks a table has 100,000 rows when it has 5,000,000 will choose Nested Loop joins instead of Hash Joins, index scans instead of seq scans on highly selective queries, and estimate sort operations as cheap when they are expensive.

The practical workflow: whenever a query plan looks wrong (estimated rows diverges greatly from actual rows in EXPLAIN ANALYZE), run ANALYZE on the affected table, then re-run EXPLAIN ANALYZE. In most cases the plan improves immediately.

statistics_and_analyze.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
-- Check how fresh the statistics are for a table
SELECT
    relname          AS table_name,
    n_live_tup       AS estimated_live_rows,
    n_dead_tup       AS dead_rows,
    last_analyze     AS last_manual_analyze,
    last_autoanalyze AS last_auto_analyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Force statistics update after bulk import or migration
ANALYZE orders;
-- Or with verbosity:
ANALYZE VERBOSE orders;

-- Check column-level statistics (what the planner sees)
SELECT
    attname          AS column,
    n_distinct,      -- negative = fraction of total rows; -0.5 means 50% distinct
    correlation      -- 1.0 = perfectly sorted, 0 = random -- affects index vs seq scan choice
FROM pg_stats
WHERE tablename = 'orders'
  AND attname IN ('customer_id', 'status', 'created_at');

-- Identify tables with stale or missing statistics
SELECT relname, n_live_tup, last_analyze
FROM pg_stat_user_tables
WHERE last_analyze < NOW() - INTERVAL '7 days'
   OR last_analyze IS NULL
ORDER BY n_live_tup DESC;
Output
-- pg_stat_user_tables:
table_name | estimated_live_rows | last_analyze
orders | 5000000 | 2024-03-14 02:30:00 <- analyzed recently
customer_stats | 85000 | NULL <- NEVER analyzed! Statistics are default
-- pg_stats for orders:
column | n_distinct | correlation
customer_id| -0.8 | 0.12 <- 80% distinct, random distribution
status | 4 | 0.45 <- only 4 distinct values (low cardinality)
created_at | -1.0 | 0.98 <- highly distinct, mostly sorted
Always ANALYZE After Bulk Inserts and Migrations
ANALYZE must be run manually after bulk inserts — autovacuum may not keep up with sudden large row count changes. A table that grows from 100K to 5M rows in a single migration will have statistics based on the old row count until ANALYZE runs. Add ANALYZE to every migration script that inserts or deletes large numbers of rows.
Production Insight
The single most impactful EXPLAIN diagnostic: compare estimated rows vs actual rows in EXPLAIN ANALYZE output — any 10x divergence means stale statistics and a likely suboptimal plan.
For tables with high insert/delete rates, check last_autoanalyze frequency in pg_stat_user_tables — if autovacuum is not running frequently enough, adjust autovacuum_analyze_scale_factor.
Add ANALYZE tablename to every migration script that affects large row counts.
Key Takeaway
The planner uses statistics, not data — stale statistics produce wrong plans.
EXPLAIN ANALYZE shows estimated vs actual rows — large divergence = run ANALYZE.
Add ANALYZE to migration scripts; monitor last_autoanalyze in pg_stat_user_tables for high-volume tables.
● Production incidentPOST-MORTEMseverity: high

A Nested Loop Join on a 5-Million-Row Table Caused a 12-Minute Query

Symptom
A weekly revenue report that generated in 30 seconds started taking 12 minutes after a data migration increased the orders table from 100,000 rows to 5 million. The query had not changed.
Assumption
The query had been running fine for a year. No one checked whether the execution plan had changed after the migration.
Root cause
The query planner's statistics were stale — it estimated 100 rows for the orders table (based on pre-migration statistics) and chose a Nested Loop join as appropriate for small tables. At 5 million rows, a Nested Loop repeats the inner table probe 5 million times. The correct plan for 5 million rows was a Hash Join, which the planner would have chosen with accurate statistics.
Fix
ANALYZE orders; forced the statistics to update. The planner immediately chose a Hash Join on the next run. Query time dropped from 12 minutes to 18 seconds. Added ANALYZE to the post-migration checklist.
Key lesson
  • Run ANALYZE on large tables after bulk inserts or migrations — stale statistics cause wrong plan choices
  • Always run EXPLAIN ANALYZE after any significant data volume change to verify the plan has not regressed
  • The planner estimates rows based on statistics — if the estimate is wildly off from actual rows, statistics are stale
Production debug guideRead the plan systematically — innermost node first4 entries
Symptom · 01
Seq Scan on a large table in the plan
Fix
Check if the WHERE column has an index. If yes, check if the query uses a function on the column (WHERE LOWER(email) kills the index). Check selectivity — if >20% of rows match, the planner may prefer Seq Scan even with an index. Run ANALYZE to refresh statistics.
Symptom · 02
Actual rows is many times higher than estimated rows
Fix
Stale statistics. Run ANALYZE table_name. For tables with frequent inserts/deletes, check autovacuum settings — autovacuum may not be keeping statistics current. The planner chooses plans based on estimates — bad estimates produce bad plans.
Symptom · 03
Nested Loop with loops=N where N is very large
Fix
The outer table is large and the inner probe is repeating N times. Add an index on the inner table's JOIN column. If N is in the millions, the planner may have stale statistics causing it to underestimate the outer table's row count — run ANALYZE.
Symptom · 04
Sort node with high actual time
Fix
The query cannot use an index for ORDER BY — it is performing a filesort. Add an index that matches the ORDER BY column order. For window functions, add a composite index on (PARTITION_BY_col, ORDER_BY_col).
★ EXPLAIN Reading Cheat SheetKey nodes, what they mean, and what to do
Seq Scan on large table
Immediate action
Check for missing index or function wrapping the column
Commands
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
SELECT indexname FROM pg_indexes WHERE tablename = 'orders';
Fix now
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
Stale estimates — actual rows far exceeds estimated rows+
Immediate action
Refresh table statistics
Commands
ANALYZE orders;
EXPLAIN ANALYZE SELECT ... ; -- rerun the slow query
Fix now
If autovacuum is not keeping up, check: SELECT relname, n_live_tup, n_dead_tup, last_analyze FROM pg_stat_user_tables WHERE relname = 'orders';
Node TypeWhat It DoesWhen It AppearsWhat To Do
Seq ScanReads every table page sequentiallyNo index, low selectivity, or small tableAdd index if WHERE is selective; check if >15% of rows returned
Index ScanB-tree lookup + heap fetch per rowSelective WHERE with indexUsually optimal — consider INCLUDE for Index Only Scan
Index Only ScanB-tree lookup only, no heap accessCovering index on all needed columnsIdeal — add INCLUDE columns to achieve this
Nested LoopOuter table drives, inner probed per rowSmall outer table or index on inner JOIN colCatastrophic on large outer — add index to inner or check statistics
Hash JoinBuild hash table, probe with other tableLarge table joins without sort orderStandard for large joins — check Batches > 1 (spill to disk)
SortSort rows for ORDER BY or Merge JoinORDER BY without matching indexAdd index matching ORDER BY order
HashAggregateGroup rows for GROUP BYGROUP BY or DISTINCTUsually fine — check for large estimated distinct count

Key takeaways

1
Read execution plans from the innermost (deepest indented) node outward
innermost nodes execute first.
2
Estimated rows vs actual rows is the most important signal
10x divergence means stale statistics, run ANALYZE.
3
Seq Scan is only a problem on large tables with selective WHERE clauses
on small tables or low-selectivity queries, it is often correct.
4
Always run ANALYZE after bulk inserts and migrations
add it to every migration script that changes large row counts.

Common mistakes to avoid

3 patterns
×

Running EXPLAIN without ANALYZE and treating the estimated plan as ground truth

Symptom
The estimated plan looks fine but the query is slow in practice — actual rows is orders of magnitude higher than estimated rows
Fix
Always use EXPLAIN ANALYZE for diagnosis. For destructive queries, wrap in a transaction: BEGIN; EXPLAIN ANALYZE UPDATE ...; ROLLBACK. The estimated plan from plain EXPLAIN is based on statistics that may be stale.
×

Immediately adding an index after seeing a Seq Scan without checking selectivity

Symptom
The new index is never used by the planner — it still chooses Seq Scan; the index adds write overhead for no read benefit
Fix
Check what percentage of rows the WHERE clause returns. If >15-20% of rows match, Seq Scan is correct and an index will not help. Only add an index when the WHERE clause is highly selective (returns a small fraction of total rows).
×

Not running ANALYZE after bulk inserts or data migrations

Symptom
Queries that were fast before a migration become slow afterward — the plan has regressed to a worse algorithm despite no query changes
Fix
Add ANALYZE table_name to every migration script after bulk inserts or deletes. Verify with EXPLAIN ANALYZE that estimated rows match actual rows after the migration.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
How do you read a PostgreSQL execution plan — where do you start?
Q02SENIOR
A Hash Join in your execution plan shows Batches: 8. What does this mean...
Q03SENIOR
The execution plan estimated 100 rows but the query returned 5 million. ...
Q01 of 03SENIOR

How do you read a PostgreSQL execution plan — where do you start?

ANSWER
Start from the innermost node (the most deeply indented line) and read outward. The innermost nodes are the raw data retrieval operations — Seq Scan or Index Scan — and they execute first. Each outer node receives the result of its inner nodes. For each node, I look at three things: the node type (what operation), the cost estimate (startup..total), and most critically, the estimated rows vs actual rows when using EXPLAIN ANALYZE. A large divergence between estimated and actual rows means stale statistics and a potentially suboptimal plan. I then look for the most expensive nodes by actual time and investigate those first — Seq Scans on large tables, Nested Loops with high loops values, Sort nodes without index backing.
FAQ · 3 QUESTIONS

Frequently Asked Questions

01
What is the difference between EXPLAIN and EXPLAIN ANALYZE?
02
Does EXPLAIN ANALYZE work in MySQL?
03
Why does my query use an index in EXPLAIN but a Seq Scan in production?
🔥

That's SQL Advanced. Mark it forged?

3 min read · try the examples if you haven't

Previous
SQL Query Optimisation
9 / 16 · SQL Advanced
Next
Full-Text Search in SQL