SQL Indexes: The Ultimate Guide to Query Optimization
- B-tree indexes transform $O(n)$ full table scans into $O(\log n)$ logarithmic lookups.
- The Leftmost Prefix rule is non-negotiable for composite indexes—order your columns by search frequency.
- Covering indexes are a 'cheat code'—they allow the database to answer queries without ever touching the main table file.
An index is a separate data structure (usually a B-tree) that stores a sorted copy of one or more columns with pointers to the actual rows. This lets the database find rows matching a WHERE clause in $O(\log n)$ instead of scanning every row $O(n)$. Indexes speed up reads but slow down writes — every INSERT, UPDATE, and DELETE must update the index too.
How a B-Tree Index Works
Most relational databases (PostgreSQL, MySQL, SQL Server) use a B-tree (Balanced Tree) structure for indexing. A B-tree keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.
When you query a column, the database starts at the root node and follows pointers down to the leaf nodes. Each leaf node contains the indexed value plus a 'Record Identifier' (RID) or a Primary Key value that points to the actual row data in the 'Heap' or Clustered Index.
-- io.thecodeforge.optimization -- Scenario: Searching for a customer without an index SELECT * FROM io_thecodeforge.orders WHERE customer_id = 1234; -- Result: The database performs a 'Sequential Scan' (reads 100% of the table). -- Optimization: Create a standard B-tree index CREATE INDEX idx_orders_customer ON io_thecodeforge.orders(customer_id); -- Now the query engine performs an 'Index Scan' EXPLAIN ANALYZE SELECT * FROM io_thecodeforge.orders WHERE customer_id = 1234;
Composite Indexes and the Leftmost Prefix Rule
A composite index (multi-column index) is sorted primarily by the first column, then the second, and so on. This leads to the Leftmost Prefix Rule: an index on $(A, B)$ can be used for queries filtering by $(A)$ or $(A, B)$, but it is completely useless for a query filtering only by $(B)$.
Choosing the right column order is vital. Generally, you put the column with the highest cardinality (most unique values) first, or the column most frequently used in WHERE clauses.
-- io.thecodeforge.optimization -- Composite index on (last_name, first_name) CREATE INDEX idx_user_full_name ON io_thecodeforge.users(last_name, first_name); -- VALID: Uses index (leftmost prefix) SELECT * FROM io_thecodeforge.users WHERE last_name = 'Jordan'; -- VALID: Uses index (exact match) SELECT * FROM io_thecodeforge.users WHERE last_name = 'Jordan' AND first_name = 'Michael'; -- INVALID: Full Table Scan (skips the first column of the index) SELECT * FROM io_thecodeforge.users WHERE first_name = 'Michael'; -- PRO TIP: Covering Index -- This index contains ALL data needed, avoiding a 'Table Lookup' entirely CREATE INDEX idx_covering_search ON io_thecodeforge.users(email) INCLUDE (user_id, status);
When Indexes Hurt: Write Amplification
Every time you INSERT a row into a table with 5 indexes, the database must perform 6 writes (1 to the table, 5 to the indexes). This is known as Write Amplification. Furthermore, indexes can become 'fragmented' over time as rows are deleted and updated, leading to bloated files and degraded performance.
-- io.thecodeforge.maintenance -- 1. Identifying unused indexes (PostgreSQL example) SELECT relname, indexrelname, idx_scan FROM pg_stat_user_indexes WHERE idx_scan = 0 AND idx_read = 0; -- Action: Drop these to reclaim write performance. -- 2. Avoiding 'Functional' index traps -- This ignores a standard index on 'created_at' SELECT * FROM io_thecodeforge.orders WHERE DATE(created_at) = '2024-01-01'; -- Correct approach: Use a range or a functional index CREATE INDEX idx_functional_date ON io_thecodeforge.orders ((DATE(created_at)));
| Feature | Clustered Index | Non-Clustered Index |
|---|---|---|
| Data Storage | Stores actual row data in the leaf nodes | Stores pointers (RID/PK) to the data |
| Limit | Only 1 per table (the physical order) | Multiple per table (logical order) |
| Best For | Range searches and Primary Key lookups | Frequent filtering on non-PK columns |
| Maintenance | High overhead when updating the PK | Lower overhead but requires a 'lookup' step |
🎯 Key Takeaways
- B-tree indexes transform $O(n)$ full table scans into $O(\log n)$ logarithmic lookups.
- The Leftmost Prefix rule is non-negotiable for composite indexes—order your columns by search frequency.
- Covering indexes are a 'cheat code'—they allow the database to answer queries without ever touching the main table file.
- Always check your execution plans (
EXPLAIN) to ensure your indexes are actually being utilized by the optimizer. - Balance is key: index for your slowest, most frequent queries, but keep an eye on write latency.
⚠ Common Mistakes to Avoid
Interview Questions on This Topic
- QExplain the 'Leftmost Prefix' rule. Can an index on (A, B, C) be used for a query on (A, C)?
- QWhat is a 'Covering Index' and why is it faster than a standard index scan?
- QLeetCode Context: You have a table with 100 million logs. A query for 'logs from the last 24 hours' is slow despite an index. What could be the cause? (Hint: Selectivity/I/O cost).
- QWhat are the disadvantages of having too many indexes on a single table?
- QCompare B-tree and Hash indexes. In what specific scenario would a Hash index be superior?
Frequently Asked Questions
What is the difference between a clustered and non-clustered index?
A clustered index defines the physical order in which data is stored on the disk. There can only be one clustered index per table (usually the Primary Key). A non-clustered index is a separate structure that contains a sorted list of values and pointers back to the actual data. Think of the clustered index as the actual dictionary entries (alphabetical) and the non-clustered index as the 'Index' section at the back.
Does a Primary Key automatically create an index?
Yes. In almost every RDBMS, defining a Primary Key automatically creates a unique clustered index (or a unique B-tree index) on that column to ensure uniqueness and fast lookups.
Why would the database ignore my index and do a full scan?
The Query Optimizer is cost-based. If the table is very small, or if the column has 'low selectivity' (e.g., searching for 'Male' in a table where 50% of the rows match), the optimizer decides it is faster to just read the whole table into memory than to jump back and forth between the index and the data.
What is Index Fragmentation?
As you delete or update data, the logical order in the B-tree may no longer match the physical order on disk, leaving gaps ('internal fragmentation'). This makes index scans slower. Regular maintenance like REINDEX or ALTER INDEX ... REBUILD fixes this.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.