An index is a sorted lookup structure that maps column values to disk locations, avoiding full table scans
B-Tree indexes are default and support equality and range queries; Hash indexes are O(1) for equality only
Clustered indexes reorder rows on disk (one per table); non-clustered indexes are separate structures requiring key lookups
Performance: Index seek on 1M rows is ~0.1ms vs full scan ~2s — a 20,000x difference
Production trap: Every index adds write overhead — 8 indexes on a 10K writes/sec table means 80K extra I/O ops per second
Biggest mistake: Indexing low-cardinality columns alone — the planner will ignore the index at >20% row selectivity
✦ Definition~90s read
What is Indexing in DBMS?
Indexing in a DBMS is the mechanism that transforms a table scan (O(n) — reading every row to find a match) into a lookup (O(log n) or O(1)). Without indexes, your database is a filing cabinet with no tabs: every query, even a simple WHERE user_id = 42, forces the engine to iterate through every row until it finds a match.
★
Imagine a 1,000-page cookbook with no table of contents.
That's why a missing foreign key index on order.user_id can silently crater a checkout flow under load — the join against the users table degrades from microseconds to seconds as the table grows, and your p95 latency spikes into timeout territory.
Indexes work by maintaining a separate data structure — typically a B-Tree or a hash table — that maps key values to their physical locations on disk. A B-Tree index keeps keys sorted and allows range scans (WHERE price BETWEEN 10 AND 20), while a hash index is optimized for exact-match lookups but can't do ranges.
Clustered indexes physically reorder the table's rows on disk to match the index order (InnoDB's primary key is always clustered), so a lookup via the clustered key avoids an extra pointer hop. Non-clustered indexes store the key and a pointer to the row, meaning two I/O operations: one to the index, one to the data.
The trade-off is write performance. Every INSERT, UPDATE, or DELETE must update every index on the table — more indexes mean slower writes. Composite indexes add another layer: column order matters because the index is a sorted list of concatenated values.
An index on (status, created_at) can speed up WHERE status = 'shipped' ORDER BY created_at, but it's useless for WHERE created_at > '2024-01-01' unless status is also filtered. The art of indexing is knowing when the read speedup justifies the write tax — and in production, that missing foreign key index on your checkout flow is almost always worth the cost.
Plain-English First
Imagine a 1,000-page cookbook with no table of contents. To find the recipe for 'chocolate cake' you'd flip every single page. An index at the back of the book says 'chocolate cake → page 847' and you jump straight there. A database index works exactly the same way — it's a separate, pre-sorted lookup structure that tells the database engine exactly where your data lives on disk, so it never has to flip through every page.
Without an index, your database degrades to a full table scan on every query—reading millions of rows just to find ten. That’s why indexing is the single most impactful performance lever you control. Get the wrong index or none at all, and your users will feel the delay; pick the right one, and your queries will stay sub‑millisecond even as data grows.
Why Indexing Is Not Optional
An index is a separate data structure — typically a B+ tree — that maps column values to row locations, enabling O(log n) lookups instead of O(n) full table scans. Without it, every query degenerates into a sequential read of all pages, which is the single most common cause of production latency. The core mechanic is simple: trade write overhead for read speed by maintaining a sorted or hashed copy of a subset of columns.
In practice, a B+ tree index stores key-pointer pairs in leaf nodes linked as a linked list, allowing both point lookups and range scans in logarithmic time. The index is physically separate from the heap or clustered table, so a query that uses an index performs two I/O operations: one to traverse the tree, another to fetch the row. The selectivity of the indexed column determines whether the optimizer will use it — a cardinality below 1% of total rows is the typical threshold.
Use indexes on columns that appear in WHERE, JOIN, ORDER BY, or GROUP BY clauses, especially foreign keys. The canonical failure is a missing foreign key index causing a cascading full table scan on every join — exactly what broke checkout in the motivating example. Every foreign key column should be indexed unless you have measured and proven otherwise.
Index ≠ Performance Guarantee
An index on a low-cardinality column (e.g., boolean 'is_deleted') is often ignored by the optimizer and wastes write throughput — measure selectivity before creating one.
Production Insight
A missing foreign key index on order.user_id caused a 12-second checkout response because every order insert triggered a full scan of the users table for referential integrity checks.
The symptom was intermittent timeouts under load that disappeared when the table was small — the full scan was fast at 10k rows but catastrophic at 10M.
Rule of thumb: index every foreign key column on the referencing side before the first deployment; the cost is negligible, the penalty is unbounded.
Key Takeaway
Indexes convert O(n) scans into O(log n) lookups but add write overhead — profile before and after.
Foreign key columns must be indexed; missing them is the most common cause of join-related production outages.
The optimizer will not use an index if selectivity is below ~1% — know your data distribution.
thecodeforge.io
Indexing in DBMS — Missing Foreign Key Broke Checkout
Indexing Dbms
How a Database Actually Finds Data Without an Index
Before we talk about indexes, we need to understand what they replace. When you run a query like SELECT * FROM orders WHERE customer_id = 4821, the database has two options: scan every row in the table checking each customer_id one by one, or consult an index to jump directly to the answer.
The first option is a Full Table Scan. It sounds terrible, but for a table with 500 rows it's actually fine — the overhead of an index lookup can even be slower for tiny datasets. The problem explodes when your orders table has 50 million rows and you're running that query 10,000 times per second.
Disk I/O is the bottleneck. Data lives in fixed-size blocks (usually 8KB or 16KB) on disk. A full scan means reading every block. An index is designed so that finding a specific value requires reading only a handful of blocks — typically O(log n) reads for a B-Tree index — regardless of whether the table has 1,000 rows or 1 billion.
This is the core contract: indexes trade extra disk space and slower writes for dramatically faster reads. That trade-off is the single most important idea in this entire article.
full_scan_vs_index.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
-- ============================================================-- Demonstrate the cost difference between a full scan-- and an index seek on a large orders table.-- Run EXPLAIN (or EXPLAIN ANALYZE in PostgreSQL) to see the plan.-- ============================================================-- Step 1: Create a realistic orders tableCREATETABLEorders (
order_id SERIALPRIMARYKEY,
customer_id INTNOTNULL,
product_name VARCHAR(200) NOTNULL,
order_total DECIMAL(10,2) NOTNULL,
created_at TIMESTAMPNOTNULLDEFAULTNOW()
);
-- Step 2: Seed 1 million rows (simulate a production-sized table)INSERTINTOorders (customer_id, product_name, order_total, created_at)
SELECT
(RANDOM() * 100000)::INT, -- 100k unique customers
'Product_' || (RANDOM() * 500)::INT, -- 500 unique products
(RANDOM() * 500 + 5)::DECIMAL(10,2), -- order value $5 - $505NOW() - (RANDOM() * INTERVAL '2 years') -- orders from last 2 yearsFROMgenerate_series(1, 1000000);
-- Step 3: Query WITHOUT an index — watch the execution planEXPLAINANALYZESELECT order_id, product_name, order_total
FROM orders
WHERE customer_id = 4821;
-- Expected plan: "Seq Scan on orders" — reads ALL 1M rows-- Typical cost on spinning disk: 800ms - 3000ms-- Step 4: Create a B-Tree index on customer_idCREATEINDEX idx_orders_customer_id
ONorders (customer_id);
-- This builds the index structure — one-time cost, usually seconds-- Step 5: Run the SAME query — the planner now uses the indexEXPLAINANALYZESELECT order_id, product_name, order_total
FROM orders
WHERE customer_id = 4821;
-- Expected plan: "Index Scan using idx_orders_customer_id"-- Typical cost: 0.5ms - 3ms — a 1000x improvement
Output
-- BEFORE index (Seq Scan):
Seq Scan on orders (cost=0.00..28374.00 rows=10 width=54)
(actual time=12.431..1847.223 rows=9 loops=1)
Planning Time: 0.112 ms
Execution Time: 1847.441 ms
-- AFTER index (Index Scan):
Index Scan using idx_orders_customer_id on orders
(cost=0.42..16.85 rows=10 width=54)
(actual time=0.041..0.063 rows=9 loops=1)
Planning Time: 0.218 ms
Execution Time: 0.089 ms
Why the Planner Sometimes Ignores Your Index
If your query matches a very large percentage of rows (say, 30% of the table), the database query planner will deliberately choose a full table scan over your index. Fetching millions of scattered disk pages via an index is actually slower than reading the table sequentially. This is expected, intelligent behaviour — not a bug.
Production Insight
Full table scans are the silent killer of database performance in production.
They aren't bad for small tables — but on a 10M row table, even a single scan can consume seconds of I/O.
Rule: always run EXPLAIN on any query you add to a hot path.
Key Takeaway
An index trades write cost for read speed — but the planner is smart.
If your query returns >20% of rows, it will skip the index and scan.
Know your data distribution before declaring an index useless.
B-Tree vs Hash Indexes — Picking the Right Tool for the Job
There isn't just one type of index. The right choice depends entirely on the kind of queries you run.
B-Tree (Balanced Tree) is the default in every major RDBMS — PostgreSQL, MySQL, Oracle, SQL Server. It stores index entries in a sorted tree structure. The sorted order means it handles equality lookups (=), range queries (BETWEEN, <, >), and ORDER BY efficiently. 90% of the time, B-Tree is what you want.
Hash Index stores a hash of the column value mapped to a row pointer. Hash lookups are O(1) — theoretically faster than B-Tree's O(log n) for equality checks. The catch: hash indexes are useless for range queries. WHERE price > 100 on a hash index results in a full scan because hashed values have no sorted relationship to each other.
Bitmap Index (Oracle, PostgreSQL partial support) is ideal for low-cardinality columns — columns with very few distinct values like status (active/inactive) or country_code. It stores a bit array per distinct value and uses bitwise AND/OR to combine multiple conditions. Never use bitmap indexes on high-cardinality columns like email — the storage overhead becomes catastrophic.
Choosing wrong here is a real interview red flag.
index_types_comparison.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
-- ============================================================-- Compare B-Tree vs Hash index behaviour on a products table.-- PostgreSQL syntax — MySQL uses ENGINE=MEMORY for Hash indexes.-- ============================================================CREATETABLEproducts (
product_id SERIALPRIMARYKEY,
sku VARCHAR(50) NOTNULL, -- unique product code
category VARCHAR(100) NOTNULL, -- e.g. 'electronics'
price DECIMAL(10,2) NOTNULL,
in_stock BOOLEANNOTNULLDEFAULTTRUE
);
-- ── B-TREE INDEX ─────────────────────────────────────────────-- Best for: equality, range queries, ORDER BY, LIKE 'prefix%'CREATEINDEX idx_products_price_btree
ON products USINGBTREE (price);
-- This query USES the B-Tree index efficiently (range query)EXPLAINSELECT product_id, sku
FROM products
WHERE price BETWEEN50.00AND150.00; -- range scan ✓-- ── HASH INDEX ───────────────────────────────────────────────-- Best for: exact equality lookups ONLY — useless for rangesCREATEINDEX idx_products_sku_hash
ON products USINGHASH (sku);
-- This query USES the Hash index efficiently (exact match)EXPLAINSELECT product_id, price
FROM products
WHERE sku = 'ELEC-4K-TV-001'; -- equality lookup ✓-- This query CANNOT use the Hash index — falls back to seq scanEXPLAINSELECT product_id, price
FROM products
WHERE sku LIKE 'ELEC%'; -- range/pattern — hash fails ✗-- ── PARTIAL INDEX (a B-Tree power move) ─────────────────────-- Index ONLY the rows you actually query — smaller, faster indexCREATEINDEX idx_products_instock_price
ONproducts (price)
WHERE in_stock = TRUE; -- only index in-stock items-- Result: index is a fraction of the size but covers the hot query path
Output
-- B-Tree range query plan:
Bitmap Heap Scan on products
Recheck Cond: ((price >= 50.00) AND (price <= 150.00))
-> Bitmap Index Scan on idx_products_price_btree
Index Cond: ((price >= 50.00) AND (price <= 150.00))
-- Hash equality query plan:
Index Scan using idx_products_sku_hash on products
Index Cond: ((sku)::text = 'ELEC-4K-TV-001'::text)
-- Hash + LIKE (falls back to seq scan — hash can't help here):
Seq Scan on products
Filter: ((sku)::text ~~ 'ELEC%'::text)
Partial Indexes Are Criminally Underused
A partial index on WHERE status = 'pending' in an orders table can be 50x smaller than a full index on status, because 98% of orders are 'completed' and never queried that way. Smaller index = fits in memory = faster. If your queries consistently filter on a specific subset of rows, a partial index is almost always the right call.
Production Insight
Using a Hash index on a column that needs range queries silently falls back to full scans.
No error, no warning — just slower queries.
Rule: stick to B-Tree for anything that isn't a simple equality lookup in a high-throughput path.
Key Takeaway
B-Tree is the safe default for 90% of use cases.
Hash indexes are only faster for exact equality — never use them for ranges.
Partial indexes are a free performance win for filtered queries.
Clustered vs Non-Clustered Indexes — The One That Actually Moves Your Data
This distinction trips people up in interviews constantly, so let's nail it.
A Clustered Index physically reorders the rows on disk to match the index order. Because the data itself IS the index, there can only ever be one clustered index per table. In MySQL InnoDB, the PRIMARY KEY is always the clustered index. SQL Server lets you choose which column to cluster on. When you read data via the clustered index, you get the actual row immediately — no second lookup needed.
A Non-Clustered Index is a completely separate structure that stores the indexed column values and a pointer back to the actual row (called a Row Identifier or RID in SQL Server, or the primary key in InnoDB). When the query engine uses a non-clustered index, it first finds the matching pointer in the index, then does a second lookup in the main table to fetch the remaining columns. This second lookup is called a Key Lookup or Bookmark Lookup and shows up in execution plans.
The practical implication: if a non-clustered index query needs many columns beyond what's in the index, those Key Lookups add up. The fix is a Covering Index — an index that includes all the columns a query needs, so the engine never has to touch the main table at all.
Covering indexes are one of the highest-impact performance optimisations you can make with zero application code changes.
clustered_vs_nonclustered.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
-- ============================================================-- Illustrate clustered index layout vs non-clustered,-- and the power of a COVERING index to eliminate key lookups.-- SQL Server syntax — concepts identical in MySQL/PostgreSQL.-- ============================================================CREATETABLEemployees (
employee_id INTNOTNULLIDENTITY(1,1),
department_id INTNOTNULL,
last_name VARCHAR(100) NOTNULL,
first_name VARCHAR(100) NOTNULL,
salary DECIMAL(10,2) NOTNULL,
hire_date DATENOTNULL,
-- PRIMARY KEY creates the CLUSTERED index by default in SQL Server-- Rows are physically sorted on disk by employee_idCONSTRAINT PK_employees PRIMARYKEYCLUSTERED (employee_id)
);
-- Non-clustered index on department_id-- Stores: department_id → employee_id (pointer back to clustered index)CREATENONCLUSTEREDINDEX idx_employees_department
ONemployees (department_id);
-- ── SCENARIO 1: Non-clustered causes a Key Lookup ────────────-- The index has department_id but NOT salary or last_name.-- Engine finds the pointer via the index, then looks up the main-- table again for each matching row — expensive at scale.SELECT employee_id, last_name, salary
FROM employees
WHERE department_id = 7;
-- Execution plan shows: Key Lookup (clustered) — one extra disk read per row-- ── SCENARIO 2: Covering index eliminates the Key Lookup ─────-- Include all columns the query needs directly in the index.-- The INCLUDE clause adds columns to the leaf level of the B-Tree-- without making them part of the sort key.CREATENONCLUSTEREDINDEX idx_employees_dept_covering
ONemployees (department_id) -- the WHERE / JOIN columnINCLUDE (last_name, salary); -- columns needed in SELECT-- Now the SAME query never touches the main table at allSELECT employee_id, last_name, salary
FROM employees
WHERE department_id = 7;
-- Execution plan shows: Index Only Scan — all data in the index ✓-- ── PRO MOVE: Composite index column ORDER matters ────────────-- For a query filtering on department_id AND sorting by hire_date:CREATENONCLUSTEREDINDEX idx_employees_dept_hire
ONemployees (department_id, hire_date); -- filter col first, sort col second-- Putting hire_date first would make the department filter less selective
[All needed columns found in index — main table never touched]
Watch Out: Column Order in Composite Indexes Is Not Arbitrary
In a composite index on (department_id, hire_date), the index can support queries filtering on department_id alone, or on both columns. It CANNOT efficiently support a query filtering only on hire_date — that's a full index scan. The rule: put the most selective, most frequently filtered column first. A query that skips the leading column loses most of the index benefit.
Production Insight
Key lookups are insidious — they add one random read per row, turning a 3ms query into 300ms when 100 rows are returned.
Under load, these random reads cause disk queue contention.
Rule: cover your most frequent queries with INCLUDE columns to eliminate lookups entirely.
Key Takeaway
Clustered index = data is the index (one per table).
Non-clustered = separate structure + optional key lookup.
Covering indexes eliminate the lookup — always build them for hot queries.
When NOT to Add an Index — The Write Performance Tax
Here's what nobody tells beginners: indexes are never free. Every index you add is a liability on write-heavy tables, and getting this wrong in production is painful.
Every time you INSERT, UPDATE, or DELETE a row, the database must update every index on that table to keep it consistent. A table with 8 indexes doesn't just write once — it performs up to 9 writes per operation. On a table receiving 10,000 inserts per second, that's 80,000 extra I/O operations per second before you've even handled reads.
The situation gets worse with UPDATE. If you update an indexed column, the database must delete the old index entry and insert a new one — two write operations per index, per updated row.
The right mental model is to think of each index as a maintenance contract. The database pays that contract on every write, forever. You only justify that contract when read performance gains outweigh the write penalty.
Rules of thumb from the trenches: - Tables with > 80% writes and < 20% reads: index sparingly, only the absolute hottest read paths. - Tables with > 80% reads: index aggressively, your queries will thank you. - Event/audit log tables: almost never index beyond the primary key — they're append-only and rarely read column-by-column. - Foreign key columns: almost always index these — they're hit on every JOIN.
index_write_overhead_demo.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
-- ============================================================-- Demonstrate the measurable write cost of indexes on a-- high-throughput event_log table (append-only pattern).-- ============================================================-- Version A: Over-indexed event log (common beginner mistake)CREATETABLEevent_log_over_indexed (
event_id BIGSERIALPRIMARYKEY,
user_id BIGINTNOTNULL,
session_id UUIDNOTNULL,
event_type VARCHAR(100) NOTNULL,
page_url TEXTNOTNULL,
metadata JSONB,
created_at TIMESTAMPNOTNULLDEFAULTNOW()
);
-- Beginner adds indexes on every column "just in case"CREATEINDEX idx_event_user ONevent_log_over_indexed (user_id);
CREATEINDEX idx_event_session ONevent_log_over_indexed (session_id);
CREATEINDEX idx_event_type ONevent_log_over_indexed (event_type);
CREATEINDEX idx_event_url ONevent_log_over_indexed (page_url);
CREATEINDEX idx_event_time ONevent_log_over_indexed (created_at);
-- 6 total structures to update on every INSERT — silent write killer-- Version B: Lean, thoughtful indexingCREATETABLEevent_log_lean (
event_id BIGSERIALPRIMARYKEY,
user_id BIGINTNOTNULL,
session_id UUIDNOTNULL,
event_type VARCHAR(100) NOTNULL,
page_url TEXTNOTNULL,
metadata JSONB,
created_at TIMESTAMPNOTNULLDEFAULTNOW()
);
-- Only index what your actual queries need:-- Hot query 1: "Show me all events for user X in the last 7 days"CREATEINDEX idx_lean_user_time
ONevent_log_lean (user_id, created_at DESC);
-- Hot query 2: "Count events by type in the last hour"CREATEINDEX idx_lean_type_time
ONevent_log_lean (event_type, created_at DESC);
-- Result: 3 structures updated per INSERT vs 6 — 50% write reduction-- ── Measure index bloat on existing tables (PostgreSQL) ──────SELECT
schemaname,
tablename,
indexname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan AS times_used, -- indexes with 0 here are deadweight
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
ORDERBYpg_relation_size(indexrelid) DESC;
-- Any index with idx_scan = 0 is costing you writes and giving nothing back
public | event_log_over_indexed | idx_event_url | 284 MB | 0
public | event_log_over_indexed | idx_event_session | 201 MB | 3
public | event_log_over_indexed | idx_event_type | 18 MB | 12041
public | event_log_lean | idx_lean_user_time | 94 MB | 89234
public | event_log_lean | idx_lean_type_time | 22 MB | 41109
-- idx_event_url: 284 MB of disk space, 0 uses — pure overhead. DROP IT.
Dead Indexes Are a Hidden Tax You're Paying Right Now
Query pg_stat_user_indexes in PostgreSQL (or sys.dm_db_index_usage_stats in SQL Server) to find indexes with zero or near-zero usage. These are burning disk space, slowing every write, and contributing to table bloat. Dropping an unused index on a write-heavy table can give you an immediate, measurable throughput boost with zero risk to read performance.
Production Insight
Over-indexing a high-write table can cause replication lag — the replica falls behind because it must replay all index writes.
A team once added 5 indexes to a logging table and saw replication delay grow to 45 minutes.
Rule: treat each index as a contract — if it doesn't serve a real query, drop it.
Key Takeaway
Indexes are paid on every write — never add one without a read justification.
Use index usage stats to find and drop dead indexes.
Foreign keys always need indexes; audit logs rarely do.
Composite Indexes — Column Order Makes or Breaks Performance
A composite index spans multiple columns. The order of those columns determines which queries it can serve efficiently. This is one of the most misunderstood concepts in indexing.
The Leading Column Rule: The index can be used for queries that filter on the leading column alone, or on the leading column plus any subsequent columns. It cannot efficiently support a query that skips the leading column — that forces a full index scan or full table scan.
For example, an index on (department_id, hire_date)
WHERE department_id = 7 → uses index (good)
WHERE department_id = 7 AND hire_date > '2024-01-01' → uses index very efficiently
WHERE hire_date > '2024-01-01' → cannot use leading column; likely full scan
Selector First Rule: Put the column with the highest selectivity (most unique values) first. If department_id has 100 distinct values and hire_date has 10,000, then hire_date is more selective. But if queries always filter on department_id first, keep it leading.
The real-world strategy: List all your hot queries, group them by filter columns, then design composite indexes that cover multiple related queries by varying the column order. One index cannot serve all; you often need two or three well-placed indexes rather than fifteen scattered ones.
Covering indexes (using INCLUDE) can add extra columns without affecting the sort order, preserving the selectivity of the leading columns.
composite_index_order.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
-- ============================================================-- Show how column order in a composite index affects query plans.-- Example: an orders table with department and order_date.-- ============================================================CREATETABLEorders100k (
order_id SERIALPRIMARYKEY,
department_id INTNOTNULL,
order_date DATENOTNULL,
amount DECIMAL(10,2),
status VARCHAR(20)
);
-- Insert sample data (100k rows, 10 departments, 1000 dates)INSERTINTOorders100k (department_id, order_date, amount, status)
SELECT
(RANDOM() * 9 + 1)::INT,
'2025-01-01'::DATE + (RANDOM() * 365)::INT,
(RANDOM() * 1000)::DECIMAL(10,2),
CASEWHENRANDOM() < 0.5THEN'completed'ELSE'pending'ENDFROMgenerate_series(1, 100000);
-- Index A: department_id first, order_date secondCREATEINDEX idx_dept_date ONorders100k (department_id, order_date);
-- Query 1: filter on leading column only → efficientEXPLAINANALYZESELECTCOUNT(*)
FROM orders100k
WHERE department_id = 3;
-- Plan: Index Only Scan on idx_dept_date — excellent-- Query 2: filter on both → still efficientEXPLAINANALYZESELECTCOUNT(*)
FROM orders100k
WHERE department_id = 3AND order_date BETWEEN'2025-06-01'AND'2025-06-30';
-- Plan: Index Only Scan — even better because range is narrowed-- Query 3: filter on second column only → NOT efficientEXPLAINANALYZESELECTCOUNT(*)
FROM orders100k
WHERE order_date BETWEEN'2025-06-01'AND'2025-06-30';
-- Plan: Seq Scan (reads whole table) — index cannot help because leading column is missing-- To support Query 3, you'd need an index on (order_date) aloneCREATEINDEX idx_date ONorders100k (order_date);
Output
-- Query 1 (department_id only):
Index Only Scan using idx_dept_date on orders100k
Index Cond: (department_id = 3)
Planning Time: 0.123 ms
Execution Time: 0.456 ms
-- Query 2 (both):
Index Only Scan using idx_dept_date on orders100k
Index Cond: ((department_id = 3) AND (order_date >= '2025-06-01') AND (order_date <= '2025-06-30'))
Execution Time: 0.712 ms
-- Query 3 (order_date only):
Seq Scan on orders100k (cost=0.00..1234.56 rows=3000 width=4)
Filter: ((order_date >= '2025-06-01') AND (order_date <= '2025-06-30'))
Execution Time: 45.234 ms ← 100x slower
Think of a Composite Index Like a Phone Book
Leading column determines the primary sort order — queries must include it to get the index benefit.
Second column is sorted within each leading value — useful for refining after the first filter.
Skipping the leading column forces a full index scan (still reads all leaf pages) or a full table scan.
Design indexes around your actual query patterns, not theoretical coverage — look at slow query logs.
Production Insight
A poorly ordered composite index silently wastes disk space and write overhead.
Your monitoring might show 'index used' but the scan depth is still high.
Rule: validate composite index efficiency by checking the number of index pages read (Buffers in PostgreSQL EXPLAIN).
Key Takeaway
Column order in composite indexes determines which queries benefit.
Leading column must match the first filter — skipping it kills performance.
Profile actual query patterns before designing composite indexes.
Multilevel Indexing — When a Single B-Tree Won't Fit in Memory
You just read about B-Trees. Good. Now imagine your index itself is too large to fit in RAM. This happens when your table has billions of rows. A single-level B-Tree still requires too many disk IOs for the root lookup. Multilevel indexing solves this by building an index on top of the index. Think of it as a directory of directories. The outer index points to blocks of the inner index. Each level shrinks the search space exponentially. Your database almost certainly uses this under the hood for large tables. You never see it. The query planner handles it. But you need to understand why a query that touches 1% of a 10-billion-row table can still return in milliseconds. It's not magic. It's multiple layers of sparse indexes. Each level stores only a few thousand entries. The top level fits entirely in cache. That means you find your data with 3-4 disk reads instead of 20. The cost? Insertions now update multiple index levels. Write-heavy tables pay this tax. If your workload is 90% reads, multilevel indexing is your best friend. If it's 90% writes, reconsider your schema.
multilevel_index_demo.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
-- io.thecodeforge-- Simulating multilevel index behavior in PostgreSQL-- Outer index (level 1) points to inner index blocks (level 2)CREATETABLEorders (
order_id BIGSERIALPRIMARYKEY,
customer_id INTNOTNULL,
order_date TIMESTAMPNOTNULL,
total DECIMAL(10,2)
);
-- PostgreSQL automatically creates a B-Tree index on primary key-- For multilevel, the engine stores index pages in blocks-- Inspect index depth using pg_stat_user_indexesSELECT
indexrelid::regclass AS index_name,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
WHERE tablename = 'orders'ORDERBY idx_scan DESC;
-- Real-world: a 10M row index typically has depth 3-4-- You can estimate depth with:SELECTpg_relation_size('orders_pkey') AS index_size_bytes;
-- Divide by 8192 (page size) to get number of leaf pages-- Then take log base (fill factor * page entries) to estimate levels
Just because your index fits in memory now doesn't mean it will next quarter. Monitor index size growth. When it exceeds available buffer pool, multilevel depth increases automatically — and query latency doubles or triples overnight.
Key Takeaway
If index depth exceeds 4 levels on a 100M row table, your B-Tree fan-out is too narrow. Rebuild with larger page size or consider partitioning.
Index Scan vs Seek — Why Range Queries Kill Hash Indexes
You slapped a hash index on the user email column. Great for login queries. Terrible for 'find all users registered in January'. Hash indexes only support equality lookups. No range scans. No ORDER BY. No prefix matching. The database has to fall back to a full table scan for anything that isn't an exact match. This is the #1 mistake junior devs make. They see 'index' and think it works everywhere. It doesn't. B-Tree indexes support both seek (exact match) and scan (range). When you query WHERE created_at BETWEEN '2025-01-01' AND '2025-01-31', the B-Tree navigates to the first matching leaf page, then walks the linked list of leaf pages forward. That's an index scan — sequential reads from the index, not the table. If your query filters on a leading column of a composite index, the same logic applies. The database decides between seek and scan based on selectivity. High selectivity (few rows) -> seek. Low selectivity (many rows) -> scan. You can force a seek with a hint in some databases, but don't. Trust the optimizer. It knows the statistics.
index_scan_vs_seek.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
-- io.thecodeforge-- Force different access patterns in SQL ServerCREATETABLEevents (
event_id BIGINTIDENTITY(1,1) PRIMARYKEY,
event_type VARCHAR(50) NOTNULL,
recorded_at DATETIME2NOTNULL
);
-- Non-clustered index supporting both seek and scanCREATENONCLUSTEREDINDEX idx_events_recorded_at
ONevents (recorded_at)
INCLUDE (event_type);
-- Query 1: Seek plan (high selectivity)SELECT * FROM events
WHERE recorded_at = '2025-01-15 14:30:00';
-- Estimated rows: ~1, plan shows Index Seek-- Query 2: Scan plan (low selectivity)SELECT * FROM events
WHERE recorded_at >= '2025-01-01'AND recorded_at < '2025-02-01';
-- Estimated rows: ~150K, plan shows Index Scan-- Check actual execution planSETSTATISTICSXMLON;
GO-- Run your query, then examine XML for:-- <PhysicalOp>Index Seek</PhysicalOp> vs <PhysicalOp>Index Scan</PhysicalOp>
Output
_: Seek vs Scan breakdown_
Seek: Random I/O, O(log n) tree traversal, 1-10 rows
Scan: Sequential I/O, O(n) leaf walk, 100K+ rows
Rule: If index scan touches >20% of table, table scan wins anyway
Check estimated rows vs actual — large deviation signals stale statistics
Optimizer Insight:
Always check the estimated vs actual number of rows in your execution plan. If they differ by more than 10x, update statistics immediately. A bad estimate can force an expensive scan when a seek would have been better.
Key Takeaway
B-Tree indexes support both seek and scan. Hash indexes only support seek. Use hash only for exact-match lookups on high-cardinality columns.
● Production incidentPOST-MORTEMseverity: high
The Missing Foreign Key Index That Took Down Checkout at Midnight
Symptom
Checkout API started timing out after 30 seconds. Database CPU pinned at 100%. Queries for SELECT * FROM order_items WHERE order_id = ? took 12 seconds each, even though the table had only 2 million rows.
Assumption
The team assumed the primary key index on order_id would cover all queries on the orders table. They didn't realize that a query filtering on customer_id — a foreign key column with no index — would force a full table scan regardless of the primary key.
Root cause
The orders table had 2 million rows. The query SELECT * FROM orders WHERE customer_id = 4821 had no index on customer_id, so the database performed a sequential scan. Under normal load (5 queries/sec) this was tolerable at ~800ms. During Black Friday, the query ran 200 times per second, saturating disk I/O and causing all other queries to queue.
Fix
Added a B-Tree index on customer_id: CREATE INDEX idx_orders_customer_id ON orders (customer_id);. Query time dropped from 800ms to 2ms. CPU dropped from 100% to 15%.
Key lesson
Every foreign key column needs an index — queries that JOIN or filter on FK columns will otherwise scan the whole table.
Query performance is not linear — a single missing index under load causes system-wide degradation, not just one slow query.
Monitor index usage with pg_stat_user_indexes or sys.dm_db_index_usage_stats; unused indexes are a write tax, missing indexes are a read disaster.
Production debug guideFollow these symptom→action pairs to identify and fix index-related performance issues in production.4 entries
Symptom · 01
Query takes seconds on a table with millions of rows, and the execution plan shows 'Seq Scan' or 'Table Scan'.
→
Fix
Check the WHERE clause columns — if they are not indexed, create a B-Tree index. For multi-column filters, create a composite index with the most selective column first.
Symptom · 02
Query uses an index but still has 'Key Lookup' (clustered) for every row, adding extra page reads.
→
Fix
Convert the index to a covering index by including all SELECT columns using the INCLUDE clause (SQL Server) or adding them to the index (PostgreSQL).
Symptom · 03
A query with a range condition (BETWEEN, <, >) is slow even though the column is indexed.
→
Fix
Verify the index type. If it's a Hash index, it cannot support range scans. Drop it and create a B-Tree index.
Symptom · 04
INSERT/UPDATE performance is degrading over time on a write-heavy table with many indexes.
→
Fix
Query index usage stats and drop any indexes with zero or near-zero scan counts. Also consider index maintenance (rebuild/reorganize) to reduce fragmentation.
★ Index Performance Quick ReferenceUse these commands and checks when you suspect index issues in PostgreSQL or SQL Server.
Query seems slow but you don't know if it's using an index−
Immediate action
Run EXPLAIN ANALYZE on the query to see the execution plan.
Commands
EXPLAIN ANALYZE SELECT ... WHERE ...;
Look for 'Index Scan', 'Index Seek' (good) vs 'Seq Scan' (bad).
Fix now
If Seq Scan appears and the table is large, add an index on the filter column(s).
High write latency on a table with many indexes+
Immediate action
Check which indexes are actually used by the application.
Commands
SELECT * FROM pg_stat_user_indexes WHERE relname = 'your_table';
Look for idx_scan = 0 — those indexes are never used.
Fix now
DROP any index with idx_scan = 0 (after confirming no periodic use).
Index size is huge but the table is moderate+
Immediate action
Check for redundant or overlapping indexes.
Commands
SELECT pg_size_pretty(pg_relation_size(indexrelid)) FROM pg_stat_user_indexes WHERE relname = 'your_table';
Compare with table size via pg_total_relation_size.
Fix now
Remove duplicate indexes and consider partial indexes to shrink size.
Feature / Aspect
Clustered Index
Non-Clustered Index
Physical row ordering
Yes — data rows sorted by index key on disk
No — separate structure, data stays in heap
Count per table
Exactly 1 (the table IS the index)
Up to 999 (SQL Server) / effectively unlimited
Lookup cost (full row)
O(log n) — row returned directly
O(log n) for index + O(1) key lookup for row
Range query speed
Excellent — sequential disk reads
Good, but key lookups add overhead per row
INSERT / UPDATE cost
Higher — rows may need physical reordering
Lower per index, but adds up with many indexes
Best for
Primary key, most-queried sort column
Foreign keys, filter columns, covering queries
Covering index support
N/A — always has full row
Yes — INCLUDE clause eliminates key lookups
Default in MySQL InnoDB
Always the PRIMARY KEY
Any additional CREATE INDEX statement
Key takeaways
1
An index trades disk space and write speed for faster reads
every index is a maintenance contract the database pays on every INSERT, UPDATE and DELETE, forever.
2
B-Tree indexes support equality AND range queries and are the safe default; Hash indexes are faster for pure equality lookups but completely blind to ranges
choosing wrong leaves queries doing full scans silently.
3
A clustered index physically reorders table rows so there can only be one per table; a non-clustered index is a separate structure that points back to the row, requiring a key lookup unless you make it a covering index with INCLUDE.
4
Composite index column order determines which queries benefit
the leading column must appear in the filter; skipping it forces a full scan.
5
Use pg_stat_user_indexes (PostgreSQL) or sys.dm_db_index_usage_stats (SQL Server) to find indexes with zero usage and drop them
they're costing you write performance and giving nothing back.
Common mistakes to avoid
4 patterns
×
Indexing a low-cardinality column alone (e.g., `CREATE INDEX ON orders (status)` where status has only 3 values)
Symptom
The query planner ignores the index because fetching 33% of the table via scattered index lookups is slower than a sequential scan.
Fix
Use a composite index with a high-cardinality leading column (customer_id, status), or use a partial index (WHERE status = 'pending') to target only the rare, frequently-queried value.
×
Wrapping an indexed column in a function inside a WHERE clause
Symptom
A query like WHERE YEAR(created_at) = 2024 or WHERE LOWER(email) = 'user@example.com' cannot use a standard B-Tree index because the function transforms the value before comparison, destroying the sorted order the index depends on.
Fix
Rewrite the predicate to leave the column bare (WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01') or create a function-based index on the expression (CREATE INDEX ON users (LOWER(email))).
×
Never running ANALYZE or UPDATE STATISTICS after bulk data loads
Symptom
Index performance depends on the query planner having accurate statistics about data distribution. After loading millions of rows, stale statistics can cause the planner to choose a full table scan even when a perfect index exists, or vice versa.
Fix
Always run ANALYZE table_name in PostgreSQL or UPDATE STATISTICS table_name in SQL Server after any significant bulk load or data deletion, and ensure your auto-analyze/auto-stats settings are tuned for the table's change rate.
×
Creating an index on a UUID column without considering writes
Symptom
Random UUIDs cause index fragmentation and page splits on B-Tree indexes because new inserts go to random pages instead of sequentially. The index becomes bloated and write performance degrades over time.
Fix
Use sequential UUIDs (UUID v7) or a separate sequential surrogate key as the indexed column. If random UUIDs are unavoidable, schedule periodic index rebuilds during low-traffic windows.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01SENIOR
What is the difference between a clustered and non-clustered index, and ...
Q02SENIOR
You have a query that filters on `department_id` and `hire_date`. How do...
Q03SENIOR
A colleague says 'indexes make queries faster, so let's index every colu...
Q04SENIOR
Explain how a covering index eliminates key lookups and why that matters...
Q01 of 04SENIOR
What is the difference between a clustered and non-clustered index, and why can a table have only one clustered index?
ANSWER
A clustered index physically reorders the data rows on disk to match the index order. Because the data itself is the index's leaf level, there can only be one such ordering per table. Non-clustered indexes are separate structures that store copies of the indexed columns plus a pointer to the actual row (either the clustered index key or a row identifier). They can support multiple lookups from different column combinations, but each lookup may require a key lookup step.
Q02 of 04SENIOR
You have a query that filters on `department_id` and `hire_date`. How do you decide the column order in a composite index, and what happens if a query only filters on `hire_date`?
ANSWER
You put the most selective or most frequently filtered column first. If both are equally frequent but hire_date is more selective (more distinct values), it should be first. However, if the application always filters on department_id first, then keep department_id as the leading column. If a query only filters on hire_date, a composite index with leading department_id cannot be used efficiently; you would need a separate index on hire_date alone. The general rule: the leading column must match the filter condition for the index to be used for seek operations.
Q03 of 04SENIOR
A colleague says 'indexes make queries faster, so let's index every column.' What's wrong with that reasoning, and how would you decide which columns actually deserve an index?
ANSWER
That reasoning ignores the write overhead. Every INSERT, UPDATE, and DELETE must update every index on the table. On a write-heavy table, over-indexing can saturate disk I/O and cause replication lag. The correct approach: identify your most frequent and critical read queries, then create indexes that support those queries — especially for JOIN, WHERE, and ORDER BY columns. Use index usage statistics to find and drop unused indexes. Also consider partial indexes for filtered subsets. The rule: index from actual query patterns, not from theoretical coverage.
Q04 of 04SENIOR
Explain how a covering index eliminates key lookups and why that matters in production.
ANSWER
A covering index includes all columns that a query needs, both in the WHERE clause and in the SELECT list, often via the INCLUDE clause. When the database can satisfy the query entirely from the index (Index Only Scan), it avoids a second lookup (key lookup) to the main table. In production, key lookups add a random disk read per row, which can multiply query time by 10x-100x for result sets of hundreds of rows. Covering indexes are critical for high-frequency queries because they reduce I/O and improve response time without changing application code.
01
What is the difference between a clustered and non-clustered index, and why can a table have only one clustered index?
SENIOR
02
You have a query that filters on `department_id` and `hire_date`. How do you decide the column order in a composite index, and what happens if a query only filters on `hire_date`?
SENIOR
03
A colleague says 'indexes make queries faster, so let's index every column.' What's wrong with that reasoning, and how would you decide which columns actually deserve an index?
SENIOR
04
Explain how a covering index eliminates key lookups and why that matters in production.
SENIOR
FAQ · 4 QUESTIONS
Frequently Asked Questions
01
Does adding an index always make queries faster?
No — and this is a common misconception. For very small tables, a full scan is faster because the overhead of navigating the index structure outweighs the saving. For queries that match a large percentage of rows (typically >20-30%), the planner will also choose a full scan. Indexes are most valuable on large tables with highly selective filter columns.
Was this helpful?
02
What is a covering index and why does it matter?
A covering index includes all the columns a specific query needs — both the filter columns and the SELECT columns — so the database can answer the query entirely from the index without touching the main table. This eliminates the 'key lookup' step and can reduce query time dramatically on frequently-run queries. You create one using the INCLUDE clause in SQL Server, or by adding extra columns to the index in PostgreSQL.
Was this helpful?
03
How do I find out which indexes my database is actually using?
Prefix any query with EXPLAIN ANALYZE (PostgreSQL) or look at the execution plan in SQL Server Management Studio. The plan will show 'Index Scan', 'Index Seek' (good) or 'Seq Scan' / 'Table Scan' (index not used). For system-wide unused index hunting, query pg_stat_user_indexes in PostgreSQL or sys.dm_db_index_usage_stats in SQL Server and look for rows where the usage count is zero or extremely low.
Was this helpful?
04
What is the difference between a full index scan and an index seek?
An index seek uses the B-Tree structure to efficiently locate a specific range of values — it navigates down the tree and only reads relevant leaf pages. A full index scan, however, reads every leaf page in the index sequentially (like a table scan but on the index structure). Full index scans are still cheaper than full table scans if the index is smaller, but they are not as efficient as seeks. They happen when the query matches a large portion of the table or when the leading column of a composite index is missing.