
SQL Indexes Explained — How They Work, When to Use Them, and What to Avoid

In Plain English 🔥
Imagine your school library has 10,000 books with no catalogue — to find one book you'd have to check every shelf one by one. Now imagine a card catalogue at the front that says 'Python books: shelves 42–47'. That card catalogue is an index. SQL indexes work exactly the same way — instead of scanning every row in a table, the database jumps straight to the rows that match your query. The magic is that the index is maintained separately, so your actual data stays untouched.

Every production database eventually hits the same wall: queries that worked fine with 1,000 rows start crawling when the table reaches 1,000,000. Engineers start blaming the network, the ORM, the server — but nine times out of ten, the culprit is a missing index. Indexes are the single highest-leverage performance tool available to you in SQL, and understanding them deeply separates good engineers from great ones.

Without an index, the database engine performs what's called a full table scan — it reads every single row to find the ones matching your WHERE clause. On a small table that's invisible. On a table with 50 million rows, that same query can take 30 seconds and bring an app to its knees. An index lets the engine jump directly to matching rows using a pre-sorted data structure — typically a B-tree — reducing what was O(n) work down to O(log n).

By the end of this article you'll understand exactly how indexes are stored and searched, know when to create a single-column vs composite index, be able to spot queries that can't use an index even if one exists, and avoid the three most common indexing mistakes that silently destroy database performance.

How SQL Indexes Actually Work Under the Hood (The B-Tree Story)

Most SQL databases — PostgreSQL, MySQL, SQL Server, SQLite — default to a B-tree index. Unlike a binary search tree, each B-tree node holds many sorted keys and has many children, and every leaf sits at the same depth, so the tree stays only a few levels deep even with millions of rows. When you create an index on a column, the database builds a separate sorted structure containing the indexed values and pointers (row IDs) back to the actual table rows.

When you run a query like WHERE email = 'alice@example.com', the engine doesn't touch the main table at first. It walks the B-tree, starting at the root and following the child whose key range contains the search value, and arrives at the matching leaf node in O(log n) steps. That leaf holds the physical row address, and only then does the engine fetch the actual row.

This is why indexes are so powerful and also why they have a cost: every INSERT, UPDATE, or DELETE must also update each index tree, so a table with 10 indexes pays that maintenance cost 10 times over on every write. You're trading write speed for read speed — a trade worth making for columns you query often, but not for every column blindly.

One more key insight: indexes store data in sorted order. That's not just useful for equality checks — it means ORDER BY, BETWEEN, and LIKE 'prefix%' queries can all exploit an index. But LIKE '%suffix' cannot, because the sort order doesn't help when you're searching from the middle or end of a string.
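Those query shapes can be sketched directly against the users table and idx_users_email index created in the listing below. One caveat worth knowing: in PostgreSQL, a plain B-tree only serves prefix LIKE under the C collation; otherwise you would create the index with the text_pattern_ops operator class.

```sql
-- Queries that CAN exploit the sorted order of an index on email:

SELECT email FROM users
WHERE  email BETWEEN 'a' AND 'c';      -- range: seek once, read forward

SELECT email FROM users
WHERE  email LIKE 'carol%';            -- prefix: a range scan in disguise

SELECT email FROM users
ORDER  BY email;                       -- rows come back in index order

-- A query that CANNOT: matching values are scattered across the sort
-- order, so the B-tree gives the engine no starting point
SELECT email FROM users
WHERE  email LIKE '%example.com';      -- full table scan
```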

create_and_test_basic_index.sql · SQL
-- Create a realistic users table with a million-row scenario in mind
CREATE TABLE users (
    user_id     SERIAL PRIMARY KEY,  -- PK automatically gets an index
    email       VARCHAR(255) NOT NULL,
    full_name   VARCHAR(100),
    country     VARCHAR(50),
    created_at  TIMESTAMP DEFAULT NOW()
);

-- Insert some sample rows so we can demonstrate the concept
INSERT INTO users (email, full_name, country) VALUES
    ('alice@example.com',   'Alice Nguyen',   'USA'),
    ('bob@example.com',     'Bob Martins',    'UK'),
    ('carol@example.com',   'Carol Osei',     'Ghana'),
    ('dave@example.com',    'Dave Kowalski',  'Poland'),
    ('eve@example.com',     'Eve Santos',     'Brazil');

-- WITHOUT an index: this causes a full table scan on `email`
-- The database reads every row to find a match
EXPLAIN ANALYZE
SELECT user_id, full_name
FROM   users
WHERE  email = 'carol@example.com';
-- Output before index: Seq Scan on users (cost=0.00..1.06 rows=1)
-- On a 10M row table this becomes: cost=0.00..254819.00 rows=1 — very slow

-- CREATE the index — this builds the B-tree sorted by email
CREATE INDEX idx_users_email ON users (email);
-- Name convention: idx_<table>_<column(s)> — makes maintenance easy

-- NOW run the same query — the engine will use the index
EXPLAIN ANALYZE
SELECT user_id, full_name
FROM   users
WHERE  email = 'carol@example.com';
-- Output after index: Index Scan using idx_users_email on users
--                     (cost=0.28..8.29 rows=1) -- dramatically cheaper

-- Verify the index exists
SELECT indexname, indexdef
FROM   pg_indexes
WHERE  tablename = 'users';
▶ Output
-- BEFORE index (EXPLAIN ANALYZE output):
Seq Scan on users (cost=0.00..1.06 rows=1 width=22)
(actual time=0.018..0.024 rows=1 loops=1)

-- AFTER index (EXPLAIN ANALYZE output):
Index Scan using idx_users_email on users
(cost=0.28..8.29 rows=1 width=22)
(actual time=0.021..0.023 rows=1 loops=1)

-- pg_indexes query output:
indexname       | indexdef
----------------+---------------------------------------------------------------
users_pkey      | CREATE UNIQUE INDEX users_pkey ON users USING btree (user_id)
idx_users_email | CREATE INDEX idx_users_email ON users USING btree (email)
⚠️ Pro Tip: EXPLAIN ANALYZE Is Your Best Friend
Always run EXPLAIN ANALYZE before and after creating an index. Look for 'Seq Scan' (bad on large tables) vs 'Index Scan' or 'Index Only Scan' (good). The cost numbers are estimates, but the scan type tells you everything about whether your index is being used.

Composite Indexes — Column Order Is Everything

A composite index covers multiple columns in a single index structure. Done right, it's one of the most powerful optimizations available. Done wrong, it's wasted disk space.

The golden rule: the database can use a composite index from left to right, but it cannot skip columns. An index on (country, created_at) can satisfy queries filtering on country alone, or on country AND created_at together. But it cannot help a query filtering on created_at alone — because the sort order at the top level is by country, not by date.

This is called the leftmost prefix rule and it trips up developers constantly. Think of it like a phone book sorted by last name, then first name. You can find everyone named 'Smith' (left column), or find 'Smith, John' (both columns). But you can't efficiently find everyone named 'John' across all last names — you'd be back to a full scan.

Column order strategy: put the most selective (highest cardinality) column first when the query filters on a single column most of the time. But if your most common query filters on both columns, put the equality condition column first and the range condition column second. Ranges must go last in a composite index because once the engine hits a range scan, it can't use subsequent index columns for filtering.

Composite indexes also enable something powerful called an Index Only Scan — if your SELECT only asks for columns that are all included in the index, the database never touches the main table at all.

composite_index_strategy.sql · SQL
-- Scenario: An e-commerce orders table — a classic real-world case
CREATE TABLE orders (
    order_id     SERIAL PRIMARY KEY,
    customer_id  INT          NOT NULL,
    status       VARCHAR(20)  NOT NULL,  -- 'pending', 'shipped', 'delivered', 'cancelled'
    order_date   DATE         NOT NULL,
    total_amount NUMERIC(10,2)
);

-- ─────────────────────────────────────────────────
-- SCENARIO 1: The wrong index order
-- ─────────────────────────────────────────────────
-- Our most common query: find all pending orders for a specific customer
-- Wrong approach: putting order_date first even though we almost always
-- filter by customer_id
CREATE INDEX idx_wrong_order ON orders (order_date, customer_id, status);

-- This query CANNOT efficiently use idx_wrong_order because
-- customer_id is not the leftmost column
EXPLAIN
SELECT order_id, total_amount
FROM   orders
WHERE  customer_id = 42
  AND  status = 'pending';
-- Result: Seq Scan — the index is skipped entirely

-- ─────────────────────────────────────────────────
-- SCENARIO 2: The right index order
-- ─────────────────────────────────────────────────
-- Rule: equality filters first, then range filters last
-- customer_id (equality) → status (equality) → order_date (range)
DROP INDEX idx_wrong_order;

CREATE INDEX idx_orders_customer_status_date
    ON orders (customer_id, status, order_date);

-- Now this query uses the index perfectly:
-- Step 1: jump to customer_id = 42 subtree
-- Step 2: within that, jump to status = 'pending'
-- Step 3: within that, scan the sorted order_date range
EXPLAIN
SELECT order_id, total_amount
FROM   orders
WHERE  customer_id = 42
  AND  status = 'pending'
  AND  order_date >= '2024-01-01';
-- Result: Index Scan using idx_orders_customer_status_date

-- ─────────────────────────────────────────────────
-- BONUS: Index Only Scan — the database never touches the main table
-- ─────────────────────────────────────────────────
-- If we only SELECT columns that exist IN the index, it's even faster
EXPLAIN
SELECT customer_id, status, order_date  -- all three are IN the index
FROM   orders
WHERE  customer_id = 42
  AND  status = 'pending';
-- Result: Index Only Scan — zero heap (table) access needed!
▶ Output
-- SCENARIO 1 (wrong order) EXPLAIN output:
Seq Scan on orders (cost=0.00..24.12 rows=3 width=16)
Filter: ((customer_id = 42) AND ((status)::text = 'pending'::text))

-- SCENARIO 2 (correct order) EXPLAIN output:
Index Scan using idx_orders_customer_status_date on orders
(cost=0.29..8.32 rows=3 width=16)
Index Cond: ((customer_id = 42) AND (status = 'pending') AND (order_date >= '2024-01-01'))

-- BONUS Index Only Scan output:
Index Only Scan using idx_orders_customer_status_date on orders
(cost=0.29..4.31 rows=3 width=24)
Index Cond: ((customer_id = 42) AND (status = 'pending'))
⚠️ Watch Out: The Leftmost Prefix Rule Bites Hard
An index on (customer_id, status, order_date) does NOT help a query that only filters on status or only on order_date. You'd need separate indexes for those access patterns. Always map your most common WHERE clauses before designing composite indexes — don't guess.

Unique, Partial, and Covering Indexes — The Advanced Toolkit

Once you're comfortable with basic B-tree indexes, three more types unlock serious power.

A UNIQUE index does double duty: it enforces a data constraint AND creates an index simultaneously. When you declare a column UNIQUE, most databases create a unique index automatically. This is why unique constraint violations are so fast to detect — the engine checks the index, not the table.

A partial index (called a filtered index in SQL Server) only indexes rows that satisfy a WHERE condition. This is brilliant for tables where you're almost always querying a small, predictable subset. If 95% of your queries filter on status = 'active' and only 3% of rows are active, a partial index on just those rows is 30x smaller and faster than a full index. It's an underused gem.

A covering index is less a distinct type and more a design strategy: you intentionally include extra columns in an index so the database can answer the entire query from the index alone — producing that Index Only Scan we saw earlier. In PostgreSQL you use the INCLUDE clause; in MySQL you just add the extra columns to the index definition. This eliminates the expensive 'heap fetch' step entirely.

Knowing which type to reach for comes down to one question: what does your most critical query need, and what's the minimum index that gives it everything?

advanced_index_types.sql · SQL
-- ─────────────────────────────────────────────────
-- 1. UNIQUE INDEX — constraint + performance in one shot
-- ─────────────────────────────────────────────────
-- Enforce that no two users share an email, while also making
-- login queries (WHERE email = ?) lightning fast
CREATE UNIQUE INDEX idx_users_email_unique
    ON users (email);

-- Test the constraint: this will fail fast (checked against the index)
INSERT INTO users (email, full_name, country)
VALUES ('alice@example.com', 'Alice Clone', 'USA');
-- ERROR: duplicate key value violates unique constraint

-- ─────────────────────────────────────────────────
-- 2. PARTIAL INDEX — only index the rows you actually query
-- ─────────────────────────────────────────────────
-- Problem: orders table has 10M rows, but 98% are status='delivered'
-- We almost never query delivered orders — we care about active ones
-- A full index on status wastes space on 9.8M rows we never need

-- Full index (expensive, bloated):
CREATE INDEX idx_orders_status_full ON orders (status);
-- Indexes ALL 10M rows including the 9.8M 'delivered' ones

-- Partial index (lean, surgical):
DROP INDEX idx_orders_status_full;

CREATE INDEX idx_orders_active_only
    ON orders (customer_id, order_date)
    WHERE status IN ('pending', 'processing', 'shipped');
-- Only indexes ~200K rows — 50x smaller, much faster to scan

-- This query will use the partial index:
EXPLAIN
SELECT customer_id, order_date
FROM   orders
WHERE  status = 'pending'
  AND  customer_id = 99;
-- The WHERE clause on the index matches, so the planner uses it

-- ─────────────────────────────────────────────────
-- 3. COVERING INDEX with INCLUDE — eliminate heap fetches
-- ─────────────────────────────────────────────────
-- Problem: a dashboard query needs customer_id, order_date, total_amount
-- We want an Index Only Scan but total_amount isn't a filter column
-- INCLUDE lets us store extra data in the leaf nodes without affecting sort order

CREATE INDEX idx_orders_covering
    ON orders (customer_id, order_date)
    INCLUDE (total_amount, status);
-- customer_id and order_date define the B-tree sort/search structure
-- total_amount and status are stored in leaf nodes as 'payload'

-- Now this SELECT never touches the main table at all:
EXPLAIN ANALYZE
SELECT customer_id, order_date, total_amount, status
FROM   orders
WHERE  customer_id = 42
  AND  order_date >= '2024-06-01';
-- Result: Index Only Scan using idx_orders_covering
▶ Output
-- Unique constraint violation:
ERROR: duplicate key value violates unique constraint "idx_users_email_unique"
DETAIL: Key (email)=(alice@example.com) already exists.

-- Partial index EXPLAIN output:
Index Scan using idx_orders_active_only on orders
(cost=0.29..8.31 rows=1 width=12)
Index Cond: (customer_id = 99)
Filter: ((status)::text = 'pending'::text)

-- Covering index EXPLAIN ANALYZE output:
Index Only Scan using idx_orders_covering on orders
(cost=0.29..4.35 rows=5 width=36)
(actual time=0.019..0.025 rows=5 loops=1)
Index Cond: ((customer_id = 42) AND (order_date >= '2024-06-01'))
Heap Fetches: 0 -- <-- this is the magic number. Zero table reads.
🔥 Interview Gold: Heap Fetches: 0
When an interviewer asks about query optimization, mentioning 'Index Only Scan' and 'Heap Fetches: 0' signals you understand how data is physically stored. It shows you're thinking about I/O, not just syntax. Bonus: explain that VACUUM must keep visibility maps fresh for Index Only Scans to work — that's senior-level knowledge.

When Indexes Hurt — The Cases That Kill Performance

Indexes aren't free and they aren't magic. There are well-defined situations where an index either won't be used at all or will actively slow your system down.

The query planner abandons an index when it estimates that a full table scan is cheaper. This happens when your WHERE clause matches a huge percentage of the table — filtering on a boolean column with 50/50 distribution, for example. The planner correctly reasons that sequential I/O across the whole table beats random I/O jumping around via the index.

Functions and type coercions wrapped around indexed columns are silent index killers. WHERE LOWER(email) = 'alice@example.com' will not use an index on email. The engine has to evaluate LOWER() on every row first, which means a full scan. Fix: use a function-based index (CREATE INDEX ON users (LOWER(email))) or store data in a consistent format.

High write load is where over-indexing really hurts. Every INSERT into a table with 8 indexes writes to 9 places (table + 8 index B-trees). For bulk loads or high-throughput event logging, this overhead compounds. A common pattern in data engineering is to drop indexes before a bulk load, load the data, then rebuild indexes in one fast pass.
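In PostgreSQL, that drop-load-rebuild pattern might look like the sketch below (the CSV path is illustrative, and the index name assumes the composite index built in the earlier section):

```sql
-- 1. Drop the secondary index so the load writes only to the heap
DROP INDEX IF EXISTS idx_orders_customer_status_date;

-- 2. Bulk load: COPY is PostgreSQL's fast path for file ingestion
COPY orders (customer_id, status, order_date, total_amount)
FROM '/tmp/orders_batch.csv' WITH (FORMAT csv, HEADER true);

-- 3. Rebuild the index in one sorted pass, far cheaper than
--    maintaining the B-tree row by row during the load
CREATE INDEX idx_orders_customer_status_date
    ON orders (customer_id, status, order_date);
```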

Finally, index bloat is real. Deleted rows leave dead entries in indexes until VACUUM cleans them up (in PostgreSQL). A table with lots of deletes and no regular VACUUM can have an index three times larger than it needs to be, degrading scan performance over time.
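A quick PostgreSQL sketch for spotting and repairing bloat (the index name is illustrative, and REINDEX ... CONCURRENTLY requires PostgreSQL 12 or newer):

```sql
-- How big is the index right now?
SELECT pg_size_pretty(pg_relation_size('idx_users_email')) AS index_size;

-- Rebuild it from live rows only; CONCURRENTLY avoids blocking writes
REINDEX INDEX CONCURRENTLY idx_users_email;

-- Prevention: make autovacuum more aggressive on churn-heavy tables
ALTER TABLE users SET (autovacuum_vacuum_scale_factor = 0.05);
```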

index_pitfalls_and_fixes.sql · SQL
-- ─────────────────────────────────────────────────
-- PITFALL 1: Function on an indexed column = index ignored
-- ─────────────────────────────────────────────────
CREATE INDEX idx_users_email_btree ON users (email);

-- BAD: wrapping the column in LOWER() defeats the index
-- The planner can't use idx_users_email_btree here
EXPLAIN
SELECT user_id FROM users
WHERE LOWER(email) = 'alice@example.com';  -- full scan!
-- Output: Seq Scan on users

-- FIX A: Store emails in lowercase at insert time (best long-term solution)
-- Then your original index on email works fine

-- FIX B: Create a function-based index that matches the query exactly
CREATE INDEX idx_users_email_lower
    ON users (LOWER(email));  -- index stores the lowercased value

-- Now this query WILL use the function-based index:
EXPLAIN
SELECT user_id FROM users
WHERE LOWER(email) = 'alice@example.com';
-- Output: Index Scan using idx_users_email_lower

-- ─────────────────────────────────────────────────
-- PITFALL 2: Implicit type cast silently breaks index usage
-- ─────────────────────────────────────────────────
CREATE TABLE sessions (
    session_id  SERIAL PRIMARY KEY,
    user_id     INT,
    token       VARCHAR(64)  -- stored as VARCHAR
);
CREATE INDEX idx_sessions_user_id ON sessions (user_id);

-- BAD: a type mismatch where the COLUMN, not the literal, gets cast.
-- In MySQL, comparing the VARCHAR column to a number silently casts
-- `token` on every row, so no index on token can help:
--   SELECT session_id FROM sessions WHERE token = 12345;  -- full scan
-- (PostgreSQL rejects varchar = integer outright, but ORMs that bind
-- the wrong parameter type hit the same trap.)
-- Note: the reverse direction is usually safe: WHERE user_id = '42'
-- resolves the untyped literal to INT at parse time and uses the index.

CREATE INDEX idx_sessions_token ON sessions (token);

-- FIX: always match the parameter's type to the column's type
EXPLAIN
SELECT session_id FROM sessions
WHERE token = '12345';  -- string literal matches VARCHAR; index usable

-- ─────────────────────────────────────────────────
-- PITFALL 3: Checking index usage and detecting unused indexes
-- ─────────────────────────────────────────────────
-- In PostgreSQL, pg_stat_user_indexes tracks how often each index is used
-- Run this periodically to find indexes that are costing writes but helping no reads
SELECT
    schemaname,
    tablename,
    indexname,
    idx_scan   AS times_index_was_used,   -- low = likely useless
    idx_tup_read,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM
    pg_stat_user_indexes
ORDER BY
    idx_scan ASC,  -- show least-used indexes first
    pg_relation_size(indexrelid) DESC;  -- biggest wasted space at top
▶ Output
-- PITFALL 1: Before function-based index
Seq Scan on users (cost=0.00..1.06 rows=1 width=4)
Filter: (lower((email)::text) = 'alice@example.com'::text)

-- PITFALL 1: After function-based index
Index Scan using idx_users_email_lower on users
(cost=0.28..8.29 rows=1 width=4)
Index Cond: (lower((email)::text) = 'alice@example.com'::text)

-- PITFALL 3: Unused index report
schemaname | tablename | indexname             | times_index_was_used | index_size
-----------+-----------+-----------------------+----------------------+------------
public     | orders    | idx_wrong_order       |                    0 | 2208 kB
public     | sessions  | idx_sessions_user_id  |                    3 | 16 kB
public     | users     | idx_users_email_btree |                   47 | 8192 bytes
⚠️ Watch Out: LIKE '%value' Is Always a Full Scan
A leading wildcard — WHERE name LIKE '%smith' — cannot use a standard B-tree index because the sort order doesn't help when you're searching from the end of a string. For this pattern you need a trigram index (pg_trgm in PostgreSQL) or a full-text search solution. No amount of regular indexing fixes it.
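A sketch of the trigram fix in PostgreSQL (the index name is illustrative; pg_trgm ships with PostgreSQL's contrib package):

```sql
-- Enable the trigram extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A GIN index over trigrams supports wildcards in ANY position
CREATE INDEX idx_users_name_trgm
    ON users USING gin (full_name gin_trgm_ops);

-- Now even a leading wildcard can use an index
EXPLAIN
SELECT user_id FROM users
WHERE  full_name LIKE '%smith';
-- On a large table, expect: Bitmap Index Scan on idx_users_name_trgm
```

On a tiny table the planner may still choose a Seq Scan, so validate this on realistic data volumes.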
Index Type               | Best Use Case                                                  | Storage Cost               | Supports Range Queries? | Key Limitation
-------------------------+----------------------------------------------------------------+----------------------------+-------------------------+-----------------------------------------------------------
B-Tree (default)         | Equality, ranges, ORDER BY on high-cardinality columns         | Medium                     | Yes                     | Doesn't help with functions on the column
Unique Index             | Enforcing uniqueness (email, username, SSN)                    | Medium                     | Yes                     | Adds write overhead; not for frequently-updated columns
Composite Index          | Multi-column WHERE clauses and covering queries                | Higher                     | Yes (last column only)  | Leftmost prefix rule — column order is critical
Partial Index            | Querying a predictable subset (active records, recent events)  | Low                        | Yes                     | WHERE clause must match the query filter to be used
Covering Index (INCLUDE) | Read-heavy dashboards; eliminating heap fetches                | Higher (stores extra cols) | Yes                     | INCLUDE columns add storage but don't improve search depth
Function-Based Index     | Queries using LOWER(), UPPER(), date_trunc(), etc.             | Medium                     | Limited                 | Must match the function in the query exactly

🎯 Key Takeaways

  • EXPLAIN ANALYZE before and after creating any index — 'Seq Scan' vs 'Index Scan' vs 'Index Only Scan' tells you everything about whether your index is actually being used.
  • The leftmost prefix rule governs composite indexes: an index on (A, B, C) can serve queries filtering on A, on A+B, or on A+B+C — but not on B alone or C alone.
  • Functions on indexed columns silently kill index usage — WHERE LOWER(email) = 'x' ignores an index on email. Fix it with a function-based index or consistent data storage.
  • Partial indexes and covering indexes are the two most underused tools in SQL — a partial index on active rows only can be 50x smaller and faster than indexing the whole table.

⚠ Common Mistakes to Avoid

  • Mistake 1: Indexing every column 'just in case' — Symptom: INSERT/UPDATE/DELETE operations become noticeably slow as the table grows; write latency spikes under load — Fix: audit with pg_stat_user_indexes (PostgreSQL) or sys.dm_db_index_usage_stats (SQL Server) to find indexes with zero or near-zero idx_scan counts, then DROP the unused ones. Only create indexes for columns that appear in actual WHERE, JOIN ON, or ORDER BY clauses of your most frequent queries.
  • Mistake 2: Using a function or expression on an indexed column in a WHERE clause — Symptom: EXPLAIN shows a Seq Scan even though an index clearly exists on that column — Fix: either store the data in the transformed format (e.g., always lowercase emails on insert) so the plain index works, or create a function-based index that exactly mirrors the expression used in the query: CREATE INDEX ON users (LOWER(email)).
  • Mistake 3: Getting composite index column order wrong — Symptom: queries that filter on the non-leftmost column completely ignore the composite index, falling back to full table scans — Fix: always put the column used in equality filters before columns used in range filters. Map your three most common WHERE clauses first, then design a single composite index that covers all three using the leftmost prefix rule rather than creating three separate single-column indexes.

Interview Questions on This Topic

  • Q: Can you explain the difference between a clustered and a non-clustered index, and why a table can only have one clustered index?
  • Q: You have a query running slowly. You create an index on the WHERE clause column but EXPLAIN ANALYZE shows it's still doing a full table scan. Walk me through the five most likely reasons why.
  • Q: If you have a composite index on (A, B, C), which of these queries can use the index, and which can't: WHERE A=1; WHERE B=2; WHERE A=1 AND C=3; WHERE A=1 AND B=2; WHERE B=2 AND C=3?

Frequently Asked Questions

Does adding an index always make queries faster?

No — the query planner will skip an index if it estimates a full table scan is cheaper, which happens when the query matches a large percentage of rows (low-cardinality columns like boolean flags). Indexes also slow down writes because every INSERT, UPDATE, and DELETE must update the index tree. Only add indexes for columns you actually query frequently in WHERE, JOIN, or ORDER BY clauses.
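A sketch of the planner making that call (the is_active column is hypothetical and not part of the users table defined earlier):

```sql
-- Hypothetical low-selectivity column: roughly half the rows match
ALTER TABLE users ADD COLUMN is_active BOOLEAN DEFAULT TRUE;
CREATE INDEX idx_users_is_active ON users (is_active);

EXPLAIN
SELECT count(*) FROM users WHERE is_active = TRUE;
-- On a large table the planner typically picks a Seq Scan anyway:
-- fetching ~50% of rows through random index hops costs more than
-- one sequential pass over the heap
```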

What is the difference between a clustered and a non-clustered index?

A clustered index determines the physical storage order of the rows in the table itself — there can only be one per table, and in most databases the primary key is clustered by default. A non-clustered index is a separate structure with sorted key values and pointers back to the actual rows. All indexes in PostgreSQL are technically non-clustered (heap-based), while SQL Server and MySQL InnoDB use clustered primary keys natively.
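A sketch of the practical difference (illustrative; CLUSTER is PostgreSQL syntax and assumes the idx_orders_customer_status_date index from the composite-index section):

```sql
-- MySQL / InnoDB: the primary key IS the clustered index; the table's
-- rows live in a B-tree physically ordered by order_id
--   CREATE TABLE orders (order_id INT PRIMARY KEY, ...);

-- PostgreSQL: tables are heaps. CLUSTER physically rewrites the heap
-- in the order of an index, once; it is not maintained afterwards
CLUSTER orders USING idx_orders_customer_status_date;
-- New rows are appended unsorted, so re-run CLUSTER periodically
```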

How many indexes should a table have?

There's no fixed rule, but a practical guideline is 3–5 indexes per table for an OLTP workload. Start with the primary key (automatic), add unique indexes for natural keys like email, then add indexes based on your actual slow-query report — never speculatively. For write-heavy tables like event logs or audit trails, minimize indexes to 1–2. For read-heavy reporting tables, you can go higher but always validate with pg_stat_user_indexes that they're actually being used.

🔥
TheCodeForge Editorial Team Verified Author

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
