Indexing in DBMS — Missing Foreign Key Broke Checkout
A missing foreign key index on a 2-million-row table caused 12-second scans and 100% CPU, taking down checkout —avoid this with proper indexing.
- An index is a sorted lookup structure that maps column values to disk locations, avoiding full table scans
- B-Tree indexes are default and support equality and range queries; Hash indexes are O(1) for equality only
- Clustered indexes reorder rows on disk (one per table); non-clustered indexes are separate structures requiring key lookups
- Performance: Index seek on 1M rows is ~0.1ms vs full scan ~2s — a 20,000x difference
- Production trap: Every index adds write overhead — 8 indexes on a 10K writes/sec table means 80K extra I/O ops per second
- Biggest mistake: Indexing low-cardinality columns alone — the planner will ignore the index at >20% row selectivity
Imagine a 1,000-page cookbook with no table of contents. To find the recipe for 'chocolate cake' you'd flip every single page. An index at the back of the book says 'chocolate cake → page 847' and you jump straight there. A database index works exactly the same way — it's a separate, pre-sorted lookup structure that tells the database engine exactly where your data lives on disk, so it never has to flip through every page.
Every application you've ever used — Instagram, Spotify, your bank — runs queries against tables that hold millions or even billions of rows. Without indexing, every single query would force the database to read every row from disk, one by one, until it found what it needed. That process is called a full table scan, and at scale it turns a 2ms query into a 30-second disaster that crashes your app under load.
Indexing solves the needle-in-a-haystack problem. Instead of storing data in one giant unsorted heap, the database builds a compact, sorted auxiliary structure (the index) that maps column values directly to the physical location of the matching rows. When a query arrives, the engine consults the index first, finds the exact disk address, and fetches only what it needs. The difference is the same as searching a phone book alphabetically versus reading every name from start to finish.
By the end of this article you'll understand not just what an index is, but why different index types exist, when to create them, when NOT to, and the subtle mistakes — like indexing the wrong column or over-indexing a write-heavy table — that engineers make all the time in production. You'll walk away ready to reason about index choices during system design interviews and during your next database performance investigation.
How a Database Actually Finds Data Without an Index
Before we talk about indexes, we need to understand what they replace. When you run a query like SELECT * FROM orders WHERE customer_id = 4821, the database has two options: scan every row in the table checking each customer_id one by one, or consult an index to jump directly to the answer.
The first option is a Full Table Scan. It sounds terrible, but for a table with 500 rows it's actually fine — the overhead of an index lookup can even be slower for tiny datasets. The problem explodes when your orders table has 50 million rows and you're running that query 10,000 times per second.
Disk I/O is the bottleneck. Data lives in fixed-size blocks (usually 8KB or 16KB) on disk. A full scan means reading every block. An index is designed so that finding a specific value requires reading only a handful of blocks — typically O(log n) reads for a B-Tree index — regardless of whether the table has 1,000 rows or 1 billion.
This is the core contract: indexes trade extra disk space and slower writes for dramatically faster reads. That trade-off is the single most important idea in this entire article.
B-Tree vs Hash Indexes — Picking the Right Tool for the Job
There isn't just one type of index. The right choice depends entirely on the kind of queries you run.
B-Tree (Balanced Tree) is the default in every major RDBMS — PostgreSQL, MySQL, Oracle, SQL Server. It stores index entries in a sorted tree structure. The sorted order means it handles equality lookups (=), range queries (BETWEEN, <, >), and ORDER BY efficiently. 90% of the time, B-Tree is what you want.
Hash Index stores a hash of the column value mapped to a row pointer. Hash lookups are O(1) — theoretically faster than B-Tree's O(log n) for equality checks. The catch: hash indexes are useless for range queries. WHERE price > 100 on a hash index results in a full scan because hashed values have no sorted relationship to each other.
Bitmap Index (Oracle, PostgreSQL partial support) is ideal for low-cardinality columns — columns with very few distinct values like status (active/inactive) or country_code. It stores a bit array per distinct value and uses bitwise AND/OR to combine multiple conditions. Never use bitmap indexes on high-cardinality columns like email — the storage overhead becomes catastrophic.
Choosing wrong here is a real interview red flag.
WHERE status = 'pending' in an orders table can be 50x smaller than a full index on status, because 98% of orders are 'completed' and never queried that way. Smaller index = fits in memory = faster. If your queries consistently filter on a specific subset of rows, a partial index is almost always the right call.Clustered vs Non-Clustered Indexes — The One That Actually Moves Your Data
This distinction trips people up in interviews constantly, so let's nail it.
A Clustered Index physically reorders the rows on disk to match the index order. Because the data itself IS the index, there can only ever be one clustered index per table. In MySQL InnoDB, the PRIMARY KEY is always the clustered index. SQL Server lets you choose which column to cluster on. When you read data via the clustered index, you get the actual row immediately — no second lookup needed.
A Non-Clustered Index is a completely separate structure that stores the indexed column values and a pointer back to the actual row (called a Row Identifier or RID in SQL Server, or the primary key in InnoDB). When the query engine uses a non-clustered index, it first finds the matching pointer in the index, then does a second lookup in the main table to fetch the remaining columns. This second lookup is called a Key Lookup or Bookmark Lookup and shows up in execution plans.
The practical implication: if a non-clustered index query needs many columns beyond what's in the index, those Key Lookups add up. The fix is a Covering Index — an index that includes all the columns a query needs, so the engine never has to touch the main table at all.
Covering indexes are one of the highest-impact performance optimisations you can make with zero application code changes.
When NOT to Add an Index — The Write Performance Tax
Here's what nobody tells beginners: indexes are never free. Every index you add is a liability on write-heavy tables, and getting this wrong in production is painful.
Every time you INSERT, UPDATE, or DELETE a row, the database must update every index on that table to keep it consistent. A table with 8 indexes doesn't just write once — it performs up to 9 writes per operation. On a table receiving 10,000 inserts per second, that's 80,000 extra I/O operations per second before you've even handled reads.
The situation gets worse with UPDATE. If you update an indexed column, the database must delete the old index entry and insert a new one — two write operations per index, per updated row.
The right mental model is to think of each index as a maintenance contract. The database pays that contract on every write, forever. You only justify that contract when read performance gains outweigh the write penalty.
Rules of thumb from the trenches: - Tables with > 80% writes and < 20% reads: index sparingly, only the absolute hottest read paths. - Tables with > 80% reads: index aggressively, your queries will thank you. - Event/audit log tables: almost never index beyond the primary key — they're append-only and rarely read column-by-column. - Foreign key columns: almost always index these — they're hit on every JOIN.
pg_stat_user_indexes in PostgreSQL (or sys.dm_db_index_usage_stats in SQL Server) to find indexes with zero or near-zero usage. These are burning disk space, slowing every write, and contributing to table bloat. Dropping an unused index on a write-heavy table can give you an immediate, measurable throughput boost with zero risk to read performance.Composite Indexes — Column Order Makes or Breaks Performance
A composite index spans multiple columns. The order of those columns determines which queries it can serve efficiently. This is one of the most misunderstood concepts in indexing.
The Leading Column Rule: The index can be used for queries that filter on the leading column alone, or on the leading column plus any subsequent columns. It cannot efficiently support a query that skips the leading column — that forces a full index scan or full table scan.
(department_id, hire_date)WHERE department_id = 7→ uses index (good)WHERE department_id = 7 AND hire_date > '2024-01-01'→ uses index very efficientlyWHERE hire_date > '2024-01-01'→ cannot use leading column; likely full scan
Selector First Rule: Put the column with the highest selectivity (most unique values) first. If department_id has 100 distinct values and hire_date has 10,000, then hire_date is more selective. But if queries always filter on department_id first, keep it leading.
The real-world strategy: List all your hot queries, group them by filter columns, then design composite indexes that cover multiple related queries by varying the column order. One index cannot serve all; you often need two or three well-placed indexes rather than fifteen scattered ones.
Covering indexes (using INCLUDE) can add extra columns without affecting the sort order, preserving the selectivity of the leading columns.
- Leading column determines the primary sort order — queries must include it to get the index benefit.
- Second column is sorted within each leading value — useful for refining after the first filter.
- Skipping the leading column forces a full index scan (still reads all leaf pages) or a full table scan.
- Design indexes around your actual query patterns, not theoretical coverage — look at slow query logs.
The Missing Foreign Key Index That Took Down Checkout at Midnight
SELECT * FROM order_items WHERE order_id = ? took 12 seconds each, even though the table had only 2 million rows.order_id would cover all queries on the orders table. They didn't realize that a query filtering on customer_id — a foreign key column with no index — would force a full table scan regardless of the primary key.orders table had 2 million rows. The query SELECT * FROM orders WHERE customer_id = 4821 had no index on customer_id, so the database performed a sequential scan. Under normal load (5 queries/sec) this was tolerable at ~800ms. During Black Friday, the query ran 200 times per second, saturating disk I/O and causing all other queries to queue.customer_id: CREATE INDEX idx_orders_customer_id ON orders (customer_id);. Query time dropped from 800ms to 2ms. CPU dropped from 100% to 15%.- Every foreign key column needs an index — queries that JOIN or filter on FK columns will otherwise scan the whole table.
- Query performance is not linear — a single missing index under load causes system-wide degradation, not just one slow query.
- Monitor index usage with
pg_stat_user_indexesorsys.dm_db_index_usage_stats; unused indexes are a write tax, missing indexes are a read disaster.
Key takeaways
pg_stat_user_indexes (PostgreSQL) or sys.dm_db_index_usage_stats (SQL Server) to find indexes with zero usage and drop themCommon mistakes to avoid
4 patternsIndexing a low-cardinality column alone (e.g., `CREATE INDEX ON orders (status)` where status has only 3 values)
customer_id, status), or use a partial index (WHERE status = 'pending') to target only the rare, frequently-queried value.Wrapping an indexed column in a function inside a WHERE clause
WHERE YEAR(created_at) = 2024 or WHERE LOWER(email) = 'user@example.com' cannot use a standard B-Tree index because the function transforms the value before comparison, destroying the sorted order the index depends on.WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01') or create a function-based index on the expression (CREATE INDEX ON users (LOWER(email))).Never running ANALYZE or UPDATE STATISTICS after bulk data loads
ANALYZE table_name in PostgreSQL or UPDATE STATISTICS table_name in SQL Server after any significant bulk load or data deletion, and ensure your auto-analyze/auto-stats settings are tuned for the table's change rate.Creating an index on a UUID column without considering writes
Interview Questions on This Topic
What is the difference between a clustered and non-clustered index, and why can a table have only one clustered index?
Frequently Asked Questions
That's DBMS. Mark it forged?
6 min read · try the examples if you haven't