
Indexing in DBMS Explained — Types, Trade-offs and Real-World Usage

In Plain English 🔥
Imagine a 1,000-page cookbook with no table of contents. To find the recipe for 'chocolate cake' you'd flip every single page. An index at the back of the book says 'chocolate cake → page 847' and you jump straight there. A database index works exactly the same way — it's a separate, pre-sorted lookup structure that tells the database engine exactly where your data lives on disk, so it never has to flip through every page.

Every application you've ever used — Instagram, Spotify, your bank — runs queries against tables that hold millions or even billions of rows. Without indexing, every single query would force the database to read every row from disk, one by one, until it found what it needed. That process is called a full table scan, and at scale it turns a 2ms query into a 30-second disaster that crashes your app under load.

Indexing solves the needle-in-a-haystack problem. Instead of storing data in one giant unsorted heap, the database builds a compact, sorted auxiliary structure (the index) that maps column values directly to the physical location of the matching rows. When a query arrives, the engine consults the index first, finds the exact disk address, and fetches only what it needs. The difference is the same as searching a phone book alphabetically versus reading every name from start to finish.
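The phone-book idea can be sketched in a few lines of Python (an illustrative model, not how a real engine stores pages): a full scan touches every row, while a separately maintained sorted index binary-searches straight to the matching entries.

```python
# Illustrative sketch of index-vs-scan, assuming a tiny in-memory "table".
import bisect

# Simulate a heap table: rows stored in arbitrary (insert) order.
table = [(4821, "alice"), (17, "bob"), (9906, "carol"), (4821, "dave")]

def full_table_scan(table, customer_id):
    """Touch every row — O(n), like 'Seq Scan' in an EXPLAIN plan."""
    return [row for row in table if row[0] == customer_id]

# The 'index': sorted (key, row_position) pairs kept alongside the table.
index = sorted((cid, pos) for pos, (cid, _) in enumerate(table))

def index_lookup(index, table, customer_id):
    """Binary-search the sorted index, then fetch only the matching rows."""
    keys = [k for k, _ in index]
    lo = bisect.bisect_left(keys, customer_id)
    hi = bisect.bisect_right(keys, customer_id)
    return [table[pos] for _, pos in index[lo:hi]]

print(full_table_scan(table, 4821))   # inspects all 4 rows
print(index_lookup(index, table, 4821))  # inspects ~log2(4) index entries
```

Both calls return the same rows; the difference is how many entries each one had to inspect to find them.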

By the end of this article you'll understand not just what an index is, but why different index types exist, when to create them, when NOT to, and the subtle mistakes — like indexing the wrong column or over-indexing a write-heavy table — that engineers make all the time in production. You'll walk away ready to reason about index choices during system design interviews and during your next database performance investigation.

How a Database Actually Finds Data Without an Index

Before we talk about indexes, we need to understand what they replace. When you run a query like SELECT * FROM orders WHERE customer_id = 4821, the database has two options: scan every row in the table checking each customer_id one by one, or consult an index to jump directly to the answer.

The first option is a Full Table Scan. It sounds terrible, but for a table with 500 rows it's actually fine — the overhead of an index lookup can even be slower for tiny datasets. The problem explodes when your orders table has 50 million rows and you're running that query 10,000 times per second.

Disk I/O is the bottleneck. Data lives in fixed-size blocks (usually 8KB or 16KB) on disk. A full scan means reading every block. An index is designed so that finding a specific value requires reading only a handful of blocks — typically O(log n) reads for a B-Tree index — regardless of whether the table has 1,000 rows or 1 billion.
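A quick back-of-the-envelope calculation shows why tree height barely grows. The fanout of 500 below is an assumed figure; real values depend on page size and key width.

```python
# Approximate B-Tree height: each page holds ~fanout child pointers,
# so reaching a leaf takes about log_fanout(n) page reads.
import math

def btree_height(rows, fanout=500):
    """Approximate page reads to reach a leaf (fanout is an assumption)."""
    return max(1, math.ceil(math.log(rows) / math.log(fanout)))

for rows in (1_000, 1_000_000, 1_000_000_000):
    print(f"{rows:>13,} rows -> ~{btree_height(rows)} page reads")
```

Going from a thousand rows to a billion multiplies the table size by a million but adds only about two extra page reads per lookup.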

This is the core contract: indexes trade extra disk space and slower writes for dramatically faster reads. That trade-off is the single most important idea in this entire article.

full_scan_vs_index.sql · SQL
-- ============================================================
-- Demonstrate the cost difference between a full scan
-- and an index seek on a large orders table.
-- Run EXPLAIN (or EXPLAIN ANALYZE in PostgreSQL) to see the plan.
-- ============================================================

-- Step 1: Create a realistic orders table
CREATE TABLE orders (
    order_id      SERIAL PRIMARY KEY,
    customer_id   INT            NOT NULL,
    product_name  VARCHAR(200)   NOT NULL,
    order_total   DECIMAL(10,2)  NOT NULL,
    created_at    TIMESTAMP      NOT NULL DEFAULT NOW()
);

-- Step 2: Seed 1 million rows (simulate a production-sized table)
INSERT INTO orders (customer_id, product_name, order_total, created_at)
SELECT
    (RANDOM() * 100000)::INT,              -- 100k unique customers
    'Product_' || (RANDOM() * 500)::INT,   -- 500 unique products
    (RANDOM() * 500 + 5)::DECIMAL(10,2),   -- order value $5 - $505
    NOW() - (RANDOM() * INTERVAL '2 years') -- orders from last 2 years
FROM generate_series(1, 1000000);

-- Step 3: Query WITHOUT an index — watch the execution plan
EXPLAIN ANALYZE
SELECT order_id, product_name, order_total
FROM   orders
WHERE  customer_id = 4821;
-- Expected plan: "Seq Scan on orders" — reads ALL 1M rows
-- Typical cost on spinning disk: 800ms - 3000ms

-- Step 4: Create a B-Tree index on customer_id
CREATE INDEX idx_orders_customer_id
    ON orders (customer_id);
-- This builds the index structure — one-time cost, usually seconds

-- Step 5: Run the SAME query — the planner now uses the index
EXPLAIN ANALYZE
SELECT order_id, product_name, order_total
FROM   orders
WHERE  customer_id = 4821;
-- Expected plan: "Index Scan using idx_orders_customer_id"
-- Typical cost: 0.5ms - 3ms — a 1000x improvement
▶ Output
-- BEFORE index (Seq Scan):
Seq Scan on orders  (cost=0.00..28374.00 rows=10 width=54) (actual time=12.431..1847.223 rows=9 loops=1)
Planning Time: 0.112 ms
Execution Time: 1847.441 ms

-- AFTER index (Index Scan):
Index Scan using idx_orders_customer_id on orders  (cost=0.42..16.85 rows=10 width=54) (actual time=0.041..0.063 rows=9 loops=1)
Planning Time: 0.218 ms
Execution Time: 0.089 ms
🔥
Why the Planner Sometimes Ignores Your Index
If your query matches a very large percentage of rows (say, 30% of the table), the database query planner will deliberately choose a full table scan over your index. Fetching millions of scattered disk pages via an index is actually slower than reading the table sequentially. This is expected, intelligent behaviour — not a bug.

B-Tree vs Hash Indexes — Picking the Right Tool for the Job

There isn't just one type of index. The right choice depends entirely on the kind of queries you run.

B-Tree (Balanced Tree) is the default in every major RDBMS — PostgreSQL, MySQL, Oracle, SQL Server. It stores index entries in a sorted tree structure. The sorted order means it handles equality lookups (=), range queries (BETWEEN, <, >), and ORDER BY efficiently. 90% of the time, B-Tree is what you want.

Hash Index stores a hash of the column value mapped to a row pointer. Hash lookups are O(1) — theoretically faster than B-Tree's O(log n) for equality checks. The catch: hash indexes are useless for range queries. WHERE price > 100 on a hash index results in a full scan because hashed values have no sorted relationship to each other.
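The difference is easy to feel with ordinary Python containers: a dict behaves like a hash index (fast equality, no ordering), while a sorted list behaves like the leaf level of a B-Tree. The price data below is invented for illustration.

```python
# Hash vs B-Tree in miniature: equality works for both, but only the
# sorted structure can answer a range without examining every entry.
import bisect

prices = [19.99, 5.00, 250.00, 99.95, 120.00]
hash_index = {p: i for i, p in enumerate(prices)}        # value -> row position
sorted_index = sorted((p, i) for i, p in enumerate(prices))

# Equality: the hash index answers with a single probe.
print(hash_index[99.95])

# Range "price > 100": the hash index offers no shortcut — you must test
# every key, which is exactly a full scan.
rows_via_hash = [i for p, i in hash_index.items() if p > 100]

# The sorted index binary-searches to the boundary and reads forward.
keys = [p for p, _ in sorted_index]
start = bisect.bisect_right(keys, 100)
rows_via_sorted = [i for _, i in sorted_index[start:]]

print(sorted(rows_via_hash) == sorted(rows_via_sorted))  # same rows, different cost
```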

Bitmap Index (native to Oracle; PostgreSQL builds transient in-memory bitmaps during query execution rather than storing persistent bitmap indexes) is ideal for low-cardinality columns — columns with very few distinct values like status (active/inactive) or country_code. It stores a bit array per distinct value and uses bitwise AND/OR to combine multiple conditions. Never use bitmap indexes on high-cardinality columns like email — the storage overhead becomes catastrophic.
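A toy bitmap index makes the bitwise-AND trick concrete. The row data and function names here are invented for illustration, not any engine's internals.

```python
# Bitmap-index sketch: one bit array per distinct value; combining two
# predicates is a single bitwise AND over the arrays.
rows_status  = ["active", "inactive", "active", "active", "inactive"]
rows_country = ["US", "DE", "DE", "US", "US"]

def bitmap(values, match):
    """Bit i is set iff row i has the given value."""
    bits = 0
    for i, v in enumerate(values):
        if v == match:
            bits |= 1 << i
    return bits

# WHERE status = 'active' AND country_code = 'US'
combined = bitmap(rows_status, "active") & bitmap(rows_country, "US")
matching_rows = [i for i in range(len(rows_status)) if combined >> i & 1]
print(matching_rows)  # row positions satisfying both predicates
```

This is also why high cardinality is fatal for bitmap indexes: a column with a million distinct values needs a million bit arrays.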

Choosing wrong here is a real interview red flag.

index_types_comparison.sql · SQL
-- ============================================================
-- Compare B-Tree vs Hash index behaviour on a products table.
-- PostgreSQL syntax — in MySQL, hash indexes are only available via the
-- MEMORY storage engine (InnoDB ignores USING HASH and builds a B-Tree).
-- ============================================================

CREATE TABLE products (
    product_id    SERIAL PRIMARY KEY,
    sku           VARCHAR(50)    NOT NULL,   -- unique product code
    category      VARCHAR(100)   NOT NULL,   -- e.g. 'electronics'
    price         DECIMAL(10,2)  NOT NULL,
    in_stock      BOOLEAN        NOT NULL DEFAULT TRUE
);

-- ── B-TREE INDEX ─────────────────────────────────────────────
-- Best for: equality, range queries, ORDER BY, LIKE 'prefix%'
CREATE INDEX idx_products_price_btree
    ON products USING BTREE (price);

-- This query USES the B-Tree index efficiently (range query)
EXPLAIN SELECT product_id, sku
FROM   products
WHERE  price BETWEEN 50.00 AND 150.00;  -- range scan ✓

-- ── HASH INDEX ───────────────────────────────────────────────
-- Best for: exact equality lookups ONLY — useless for ranges
CREATE INDEX idx_products_sku_hash
    ON products USING HASH (sku);

-- This query USES the Hash index efficiently (exact match)
EXPLAIN SELECT product_id, price
FROM   products
WHERE  sku = 'ELEC-4K-TV-001';          -- equality lookup ✓

-- This query CANNOT use the Hash index — falls back to seq scan
EXPLAIN SELECT product_id, price
FROM   products
WHERE  sku LIKE 'ELEC%';               -- range/pattern — hash fails ✗

-- ── PARTIAL INDEX (a B-Tree power move) ─────────────────────
-- Index ONLY the rows you actually query — smaller, faster index
CREATE INDEX idx_products_instock_price
    ON products (price)
    WHERE in_stock = TRUE;              -- only index in-stock items
-- Result: index is a fraction of the size but covers the hot query path
▶ Output
-- B-Tree range query plan:
Bitmap Heap Scan on products
  Recheck Cond: ((price >= 50.00) AND (price <= 150.00))
  ->  Bitmap Index Scan on idx_products_price_btree
        Index Cond: ((price >= 50.00) AND (price <= 150.00))

-- Hash equality query plan:
Index Scan using idx_products_sku_hash on products
Index Cond: ((sku)::text = 'ELEC-4K-TV-001'::text)

-- Hash + LIKE (falls back to seq scan — hash can't help here):
Seq Scan on products
Filter: ((sku)::text ~~ 'ELEC%'::text)
⚠️
Partial Indexes Are Criminally Underused
A partial index on `WHERE status = 'pending'` in an orders table can be 50x smaller than a full index on `status`, because 98% of orders are 'completed' and never queried that way. Smaller index = fits in memory = faster. If your queries consistently filter on a specific subset of rows, a partial index is almost always the right call.
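SQLite supports partial indexes too, so the idea can be sketched with the stdlib sqlite3 module; the table and the 2% 'pending' ratio below are invented for illustration.

```python
# Partial-index sketch: index only the rare 'pending' rows, which is a
# fraction of the size of a full index on status.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL)")
con.executemany(
    "INSERT INTO orders (status, total) VALUES (?, ?)",
    [("pending" if i % 50 == 0 else "completed", i * 1.0) for i in range(1, 5001)],
)

# The index covers only ~2% of the table's rows.
con.execute(
    "CREATE INDEX idx_orders_pending ON orders (status) WHERE status = 'pending'"
)

# A query whose WHERE clause implies the index predicate can use it.
pending = con.execute(
    "SELECT COUNT(*) FROM orders WHERE status = 'pending'"
).fetchone()[0]
print(pending)
```

SQLite, like PostgreSQL, will only consider the partial index when the query's WHERE clause implies the index's WHERE clause, which is exactly the "hot query path" discipline the callout describes.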

Clustered vs Non-Clustered Indexes — The One That Actually Moves Your Data

This distinction trips people up in interviews constantly, so let's nail it.

A Clustered Index physically reorders the rows on disk to match the index order. Because the data itself IS the index, there can only ever be one clustered index per table. In MySQL InnoDB, the PRIMARY KEY is always the clustered index. SQL Server lets you choose which column to cluster on. When you read data via the clustered index, you get the actual row immediately — no second lookup needed.

A Non-Clustered Index is a completely separate structure that stores the indexed column values and a pointer back to the actual row (called a Row Identifier or RID in SQL Server, or the primary key in InnoDB). When the query engine uses a non-clustered index, it first finds the matching pointer in the index, then does a second lookup in the main table to fetch the remaining columns. This second lookup is called a Key Lookup or Bookmark Lookup and shows up in execution plans.

The practical implication: if a non-clustered index query needs many columns beyond what's in the index, those Key Lookups add up. The fix is a Covering Index — an index that includes all the columns a query needs, so the engine never has to touch the main table at all.

Covering indexes are one of the highest-impact performance optimisations you can make with zero application code changes.
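SQLite reports covering-index use directly in its EXPLAIN QUERY PLAN output, which makes the key-lookup vs covering distinction easy to see without a full SQL Server setup. The schema below is a simplified stand-in for the employees table.

```python
# Covering-index sketch: the plan text flips from a plain index search
# (row fetch still required) to "USING COVERING INDEX" (table untouched).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    department_id INT, last_name TEXT, salary REAL)""")
con.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)",
                [(i, i % 20, f"name{i}", 50000.0 + i) for i in range(1, 1001)])

def plan(sql):
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT last_name, salary FROM employees WHERE department_id = 7"

con.execute("CREATE INDEX idx_dept ON employees (department_id)")
with_lookup = plan(q)   # SEARCH ... USING INDEX idx_dept — row fetch per match

con.execute(
    "CREATE INDEX idx_dept_covering ON employees (department_id, last_name, salary)"
)
covered = plan(q)       # SEARCH ... USING COVERING INDEX idx_dept_covering

print(with_lookup)
print(covered)
```

SQLite has no INCLUDE clause, so the extra columns go into the index key itself; the effect on the plan is the same, the engine never touches the main table.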

clustered_vs_nonclustered.sql · SQL
-- ============================================================
-- Illustrate clustered index layout vs non-clustered,
-- and the power of a COVERING index to eliminate key lookups.
-- SQL Server syntax — concepts identical in MySQL/PostgreSQL.
-- ============================================================

CREATE TABLE employees (
    employee_id   INT           NOT NULL IDENTITY(1,1),
    department_id INT           NOT NULL,
    last_name     VARCHAR(100)  NOT NULL,
    first_name    VARCHAR(100)  NOT NULL,
    salary        DECIMAL(10,2) NOT NULL,
    hire_date     DATE          NOT NULL,

    -- PRIMARY KEY creates the CLUSTERED index by default in SQL Server
    -- Rows are physically sorted on disk by employee_id
    CONSTRAINT PK_employees PRIMARY KEY CLUSTERED (employee_id)
);

-- Non-clustered index on department_id
-- Stores: department_id → employee_id (pointer back to clustered index)
CREATE NONCLUSTERED INDEX idx_employees_department
    ON employees (department_id);

-- ── SCENARIO 1: Non-clustered causes a Key Lookup ────────────
-- The index has department_id but NOT salary or last_name.
-- Engine finds the pointer via the index, then looks up the main
-- table again for each matching row — expensive at scale.
SELECT employee_id, last_name, salary
FROM   employees
WHERE  department_id = 7;
-- Execution plan shows: Key Lookup (clustered) — one extra disk read per row

-- ── SCENARIO 2: Covering index eliminates the Key Lookup ─────
-- Include all columns the query needs directly in the index.
-- The INCLUDE clause adds columns to the leaf level of the B-Tree
-- without making them part of the sort key.
CREATE NONCLUSTERED INDEX idx_employees_dept_covering
    ON employees (department_id)          -- the WHERE / JOIN column
    INCLUDE (last_name, salary);           -- columns needed in SELECT

-- Now the SAME query never touches the main table at all
SELECT employee_id, last_name, salary
FROM   employees
WHERE  department_id = 7;
-- Execution plan shows: Index Only Scan — all data in the index ✓

-- ── PRO MOVE: Composite index column ORDER matters ────────────
-- For a query filtering on department_id AND sorting by hire_date:
CREATE NONCLUSTERED INDEX idx_employees_dept_hire
    ON employees (department_id, hire_date);  -- filter col first, sort col second
-- Putting hire_date first would make the department filter less selective
▶ Output
-- SCENARIO 1 Execution Plan (with Key Lookup):
|--Nested Loops (Inner Join)
   |--Index Seek(idx_employees_department, department_id=7)
   |--Key Lookup(Clustered)(PK_employees)   ← extra table hit per row

-- SCENARIO 2 Execution Plan (Covering Index — no Key Lookup):
|--Index Seek(idx_employees_dept_covering, department_id=7)
   [All needed columns found in index — main table never touched]
⚠️
Watch Out: Column Order in Composite Indexes Is Not Arbitrary
In a composite index on (department_id, hire_date), the index can support queries filtering on department_id alone, or on both columns. It CANNOT efficiently support a query filtering only on hire_date — that's a full index scan. The rule: put the most selective, most frequently filtered column first. A query that skips the leading column loses most of the index benefit.
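The leading-column rule is easy to verify with the stdlib sqlite3 module (a simplified schema; note that SQLite only attempts skip-scans after ANALYZE, so the fallback here shows up as a plain scan).

```python
# Composite-index sketch: an index on (department_id, hire_date) serves a
# department_id filter, but a hire_date-only filter falls back to a scan.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, department_id INT, hire_date TEXT)"
)
con.execute("CREATE INDEX idx_dept_hire ON employees (department_id, hire_date)")

def plan(sql):
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

uses_index = plan("SELECT id FROM employees WHERE department_id = 7")
skips_index = plan("SELECT id FROM employees WHERE hire_date = '2024-01-01'")

print(uses_index)    # SEARCH ... INDEX idx_dept_hire — leading column present
print(skips_index)   # SCAN employees — leading column missing
```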

When NOT to Add an Index — The Write Performance Tax

Here's what nobody tells beginners: indexes are never free. Every index you add is a liability on write-heavy tables, and getting this wrong in production is painful.

Every time you INSERT, UPDATE, or DELETE a row, the database must update every index on that table to keep it consistent. A table with 8 indexes doesn't just write once — it performs up to 9 writes per operation. On a table receiving 10,000 inserts per second, that's 80,000 extra I/O operations per second before you've even handled reads.

The situation gets worse with UPDATE. If you update an indexed column, the database must delete the old index entry and insert a new one — two write operations per index, per updated row.

The right mental model is to think of each index as a maintenance contract. The database pays that contract on every write, forever. You only justify that contract when read performance gains outweigh the write penalty.
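The maintenance contract can be measured crudely with the stdlib sqlite3 module: identical bulk inserts into a table with no secondary indexes versus four of them. The schema mirrors the event-log example below; absolute timings vary by machine, and the gap, not the numbers, is the point.

```python
# Write-tax sketch: same rows, same engine — the only difference is how
# many index structures must be maintained on every INSERT.
import sqlite3
import time

con = sqlite3.connect(":memory:")
cols = "user_id INT, session_id TEXT, event_type TEXT, page_url TEXT"
con.execute(f"CREATE TABLE lean (id INTEGER PRIMARY KEY, {cols})")
con.execute(f"CREATE TABLE heavy (id INTEGER PRIMARY KEY, {cols})")
for c in ("user_id", "session_id", "event_type", "page_url"):
    con.execute(f"CREATE INDEX idx_heavy_{c} ON heavy ({c})")

rows = [(i % 1000, f"s{i}", f"t{i % 10}", f"/page/{i}") for i in range(20_000)]

def timed_insert(table):
    start = time.perf_counter()
    with con:  # one transaction, so we time index maintenance, not commits
        con.executemany(
            f"INSERT INTO {table} (user_id, session_id, event_type, page_url) "
            "VALUES (?, ?, ?, ?)", rows)
    return time.perf_counter() - start

print(f"no secondary indexes: {timed_insert('lean'):.3f}s")
print(f"4 secondary indexes:  {timed_insert('heavy'):.3f}s")
```

On a typical machine the indexed table takes measurably longer for the identical workload, which is the write tax made visible.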

Rules of thumb from the trenches:

  • Tables with > 80% writes and < 20% reads: index sparingly, only the absolute hottest read paths.
  • Tables with > 80% reads: index aggressively; your queries will thank you.
  • Event/audit log tables: almost never index beyond the primary key; they're append-only and rarely read column-by-column.
  • Foreign key columns: almost always index these; they're hit on every JOIN.

index_write_overhead_demo.sql · SQL
-- ============================================================
-- Demonstrate the measurable write cost of indexes on a
-- high-throughput event_log table (append-only pattern).
-- ============================================================

-- Version A: Over-indexed event log (common beginner mistake)
CREATE TABLE event_log_over_indexed (
    event_id      BIGSERIAL PRIMARY KEY,
    user_id       BIGINT        NOT NULL,
    session_id    UUID          NOT NULL,
    event_type    VARCHAR(100)  NOT NULL,
    page_url      TEXT          NOT NULL,
    metadata      JSONB,
    created_at    TIMESTAMP     NOT NULL DEFAULT NOW()
);
-- Beginner adds indexes on every column "just in case"
CREATE INDEX idx_event_user    ON event_log_over_indexed (user_id);
CREATE INDEX idx_event_session ON event_log_over_indexed (session_id);
CREATE INDEX idx_event_type    ON event_log_over_indexed (event_type);
CREATE INDEX idx_event_url     ON event_log_over_indexed (page_url);
CREATE INDEX idx_event_time    ON event_log_over_indexed (created_at);
-- 6 total structures to update on every INSERT — silent write killer

-- Version B: Lean, thoughtful indexing
CREATE TABLE event_log_lean (
    event_id      BIGSERIAL PRIMARY KEY,
    user_id       BIGINT        NOT NULL,
    session_id    UUID          NOT NULL,
    event_type    VARCHAR(100)  NOT NULL,
    page_url      TEXT          NOT NULL,
    metadata      JSONB,
    created_at    TIMESTAMP     NOT NULL DEFAULT NOW()
);
-- Only index what your actual queries need:
-- Hot query 1: "Show me all events for user X in the last 7 days"
CREATE INDEX idx_lean_user_time
    ON event_log_lean (user_id, created_at DESC);
-- Hot query 2: "Count events by type in the last hour"
CREATE INDEX idx_lean_type_time
    ON event_log_lean (event_type, created_at DESC);
-- Result: 3 structures updated per INSERT vs 6 — 50% write reduction

-- ── Measure index bloat on existing tables (PostgreSQL) ──────
SELECT
    schemaname,
    tablename,
    indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan   AS times_used,    -- indexes with 0 here are deadweight
    idx_tup_read,
    idx_tup_fetch
FROM   pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
-- Any index with idx_scan = 0 is costing you writes and giving nothing back
▶ Output
-- pg_stat_user_indexes output (example):
 schemaname |       tablename        |       indexname        | index_size | times_used
------------+------------------------+------------------------+------------+------------
 public     | event_log_over_indexed | idx_event_url          | 284 MB     |          0
 public     | event_log_over_indexed | idx_event_session      | 201 MB     |          3
 public     | event_log_over_indexed | idx_event_type         | 18 MB      |      12041
 public     | event_log_lean         | idx_lean_user_time     | 94 MB      |      89234
 public     | event_log_lean         | idx_lean_type_time     | 22 MB      |      41109

-- idx_event_url: 284 MB of disk space, 0 uses — pure overhead. DROP IT.
⚠️
Dead Indexes Are a Hidden Tax You're Paying Right Now
Query `pg_stat_user_indexes` in PostgreSQL (or `sys.dm_db_index_usage_stats` in SQL Server) to find indexes with zero or near-zero usage. These are burning disk space, slowing every write, and contributing to table bloat. Dropping an unused index on a write-heavy table can give you an immediate, measurable throughput boost with zero risk to read performance.
| Feature / Aspect | Clustered Index | Non-Clustered Index |
|---|---|---|
| Physical row ordering | Yes — data rows sorted by index key on disk | No — separate structure, data stays in heap |
| Count per table | Exactly 1 (the table IS the index) | Up to 999 (SQL Server) / effectively unlimited |
| Lookup cost (full row) | O(log n) — row returned directly | O(log n) index seek + a key lookup per matching row |
| Range query speed | Excellent — sequential disk reads | Good, but key lookups add overhead per row |
| INSERT / UPDATE cost | Higher — rows may need physical reordering | Lower per index, but adds up with many indexes |
| Best for | Primary key, most-queried sort column | Foreign keys, filter columns, covering queries |
| Covering index support | N/A — always has full row | Yes — INCLUDE clause eliminates key lookups |
| Default in MySQL InnoDB | Always the PRIMARY KEY | Any additional CREATE INDEX statement |

🎯 Key Takeaways

  • An index trades disk space and write speed for faster reads — every index is a maintenance contract the database pays on every INSERT, UPDATE and DELETE, forever.
  • B-Tree indexes support equality AND range queries and are the safe default; Hash indexes are faster for pure equality lookups but completely blind to ranges — choosing wrong leaves queries doing full scans silently.
  • A clustered index physically reorders table rows so there can only be one per table; a non-clustered index is a separate structure that points back to the row, requiring a key lookup unless you make it a covering index with INCLUDE.
  • Use pg_stat_user_indexes (PostgreSQL) or sys.dm_db_index_usage_stats (SQL Server) to find indexes with zero usage and drop them — they're costing you write performance and giving nothing back.

⚠ Common Mistakes to Avoid

  • Mistake 1: Indexing a low-cardinality column alone (e.g., CREATE INDEX ON orders (status) where status has only 3 values) — The query planner ignores the index because fetching 33% of the table via scattered index lookups is slower than a sequential scan. Fix it: use a composite index with a high-cardinality leading column (customer_id, status), or use a partial index (WHERE status = 'pending') to target only the rare, frequently-queried value.
  • Mistake 2: Wrapping an indexed column in a function inside a WHERE clause — A query like WHERE YEAR(created_at) = 2024 or WHERE LOWER(email) = 'user@example.com' cannot use a standard B-Tree index because the function transforms the value before comparison, destroying the sorted order the index depends on. Fix it: rewrite the predicate to leave the column bare (WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01') or create a function-based index on the expression (CREATE INDEX ON users (LOWER(email))).
  • Mistake 3: Never running ANALYZE or UPDATE STATISTICS after bulk data loads — Index performance depends on the query planner having accurate statistics about data distribution. After loading millions of rows, stale statistics can cause the planner to choose a full table scan even when a perfect index exists, or vice versa. Fix it: always run ANALYZE table_name in PostgreSQL or UPDATE STATISTICS table_name in SQL Server after any significant bulk load or data deletion, and ensure your auto-analyze/auto-stats settings are tuned for the table's change rate.
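Mistake 2 can be demonstrated end-to-end with the stdlib sqlite3 module, which also supports expression indexes; the table and email value are invented for illustration.

```python
# Function-on-column sketch: LOWER(email) hides a plain index on email,
# while an index on the expression itself restores the indexed search.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
con.execute("CREATE INDEX idx_users_email ON users (email)")

def plan(sql):
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT id FROM users WHERE LOWER(email) = 'user@example.com'"

hidden = plan(q)
print(hidden)     # still a full scan — LOWER() hides the indexed value

# Expression index on LOWER(email) matches the query's predicate exactly.
con.execute("CREATE INDEX idx_users_email_lower ON users (LOWER(email))")
restored = plan(q)
print(restored)   # SEARCH ... INDEX idx_users_email_lower
```

The same fix exists in PostgreSQL (CREATE INDEX ON users (LOWER(email))); MySQL 8.0+ supports functional index parts with a slightly different syntax.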

Interview Questions on This Topic

  • Q: What is the difference between a clustered and non-clustered index, and why can a table have only one clustered index?
  • Q: You have a query that filters on `department_id` and `hire_date`. How do you decide the column order in a composite index, and what happens if a query only filters on `hire_date`?
  • Q: A colleague says 'indexes make queries faster, so let's index every column.' What's wrong with that reasoning, and how would you decide which columns actually deserve an index?

Frequently Asked Questions

Does adding an index always make queries faster?

No — and this is a common misconception. For very small tables, a full scan is faster because the overhead of navigating the index structure outweighs the saving. For queries that match a large percentage of rows (typically >20-30%), the planner will also choose a full scan. Indexes are most valuable on large tables with highly selective filter columns.

What is a covering index and why does it matter?

A covering index includes all the columns a specific query needs — both the filter columns and the SELECT columns — so the database can answer the query entirely from the index without touching the main table. This eliminates the 'key lookup' step and can reduce query time dramatically on frequently-run queries. You create one using the INCLUDE clause in SQL Server or PostgreSQL 11+, or by adding the extra columns to the index key itself.

How do I find out which indexes my database is actually using?

Prefix any query with EXPLAIN ANALYZE (PostgreSQL) or look at the execution plan in SQL Server Management Studio. The plan will show 'Index Scan', 'Index Seek' (good) or 'Seq Scan' / 'Table Scan' (index not used). For system-wide unused index hunting, query pg_stat_user_indexes in PostgreSQL or sys.dm_db_index_usage_stats in SQL Server and look for rows where the usage count is zero or extremely low.

TheCodeForge Editorial Team

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
