Top SQL Interview Questions: JOINs, Indexes, and Query Optimization Explained
Think of a database like a giant office filing system with dozens of labeled cabinets. SQL is the language you use to ask the office assistant to find, combine, and summarize files across all those cabinets. An interview tests whether you truly understand WHY the assistant needs specific instructions — not just which words to say. A JOIN is like asking the assistant to staple two related files together before handing them to you; an index is like a sticky-tab system that lets them find files in seconds instead of rifling through every drawer; and a window function is like asking the assistant to rank every file by importance while still handing you every single file — not just the top one. But here's what most tutorials won't tell you: the assistant also makes mistakes if your instructions are ambiguous. Tell them to find 'all files where the date is NULL' and they might skip files you expected. Tell them to combine two cabinets but forget to specify how, and they'll staple every file from cabinet A to every file from cabinet B — producing 10,000 stapled pairs when you wanted 50. That's what a Cartesian product is, and it's killed more production databases than any other SQL mistake.
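The stapling accident is easy to reproduce. Here is a minimal sketch using an in-memory SQLite database through Python's built-in sqlite3 module (the cabinet tables and their contents are invented for the analogy):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cabinet_a (id INTEGER);
    CREATE TABLE cabinet_b (id INTEGER);
""")
# 100 files in each cabinet; only ids 50..99 exist in both.
conn.executemany("INSERT INTO cabinet_a VALUES (?)", [(i,) for i in range(100)])
conn.executemany("INSERT INTO cabinet_b VALUES (?)", [(i,) for i in range(50, 150)])

# With a join condition: only genuinely related files are stapled together.
matched = conn.execute(
    "SELECT COUNT(*) FROM cabinet_a a JOIN cabinet_b b ON a.id = b.id"
).fetchone()[0]

# Without one (a CROSS JOIN): every file pairs with every file.
cartesian = conn.execute(
    "SELECT COUNT(*) FROM cabinet_a a CROSS JOIN cabinet_b b"
).fetchone()[0]

print(matched, cartesian)  # 50 10000
```

Fifty related pairs versus ten thousand arbitrary ones, from a single missing ON clause.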
SQL interviews trip up even experienced developers — not because the syntax is hard, but because interviewers probe whether you understand what happens inside the database engine when your query runs. The difference between a candidate who gets hired and one who doesn't usually comes down to three words: 'and why that?'
I've interviewed over 60 candidates for backend and data engineering roles. The pattern is always the same: 80% can write a SELECT with a JOIN. Maybe 30% can explain why their JOIN is slow. Maybe 10% can look at an execution plan and tell me the optimizer chose a Hash Join when it should have used a Nested Loop. That 10% gets the offer.
Most people can write a SELECT statement; very few can explain why their query is scanning 2 million rows when it should be touching 50. I once inherited a reporting service that took 45 seconds to load a dashboard. The fix was one index and a rewritten subquery. The previous developer had been 'optimizing' by adding more RAM to the server for six months. The query was doing a full table scan on a 14-million-row orders table because someone wrapped an indexed column in YEAR() — making it non-SARGable. One line of SQL, six months of wasted infrastructure spend.
This matters because slow queries cost money. A poorly written JOIN on a 10-million-row orders table can turn a 20ms API response into a 12-second timeout. Knowing the difference between a clustered and non-clustered index, or understanding why a LEFT JOIN returns NULLs you didn't expect, is the difference between shipping reliable software and debugging production fires at 2am.
By the end of this article you'll be able to explain JOINs, NULLs, window functions, indexes, subqueries vs CTEs, GROUP BY, and query optimization at a level that earns the interviewer's respect — not just recognition. Every concept comes with a real schema, runnable SQL, actual output, and the exact reasoning an interviewer wants to hear.
Understanding JOINs: Beyond the Venn Diagram
Interviewers often start with JOINs. While Venn diagrams are a popular starting point, they're misleading for senior candidates because SQL JOINs are not set operations — they're row-combination operations. A LEFT JOIN doesn't 'subtract' anything; it preserves every row from the left table and fills in NULLs when there's no match on the right.
Senior candidates should also explain how a JOIN is physically executed. The engine chooses between three physical operators: Nested Loop (best when one input is small and the other is indexed), Hash Join (best for large, unsorted inputs), and Merge Join (best when both inputs are sorted on the join key). Knowing when each is chosen and why shows you understand the engine, not just the syntax.
The most common interview trap: putting a filter in the ON clause vs the WHERE clause of a LEFT JOIN. They produce completely different results. A filter in ON controls which right-side rows participate in the join. A filter in WHERE eliminates rows from the final result — including left-side rows that had no match. This distinction is the single most missed concept in SQL interviews.
I once caught a production bug where a developer filtered on the right table's status column in the WHERE clause of a LEFT JOIN. It silently converted the LEFT JOIN into an INNER JOIN — every customer without an active order disappeared from the report. The finance team made quarterly projections based on that report. The projections were wrong for three months before anyone noticed.
```sql
-- io.thecodeforge: LEFT JOIN — ON vs WHERE filter behavior
-- Schema: customers(customer_id, customer_name)
--         orders(order_id, customer_id, order_date, total_amount, status)

-- QUERY 1: Filter in ON clause
-- Preserves ALL customers. Only joins with orders after 2025-01-01.
-- Customers with no recent orders still appear (with NULL order data).
SELECT c.customer_name,
       o.order_date,
       COALESCE(o.total_amount, 0) AS amount
FROM io_thecodeforge.customers c
LEFT JOIN io_thecodeforge.orders o
       ON c.customer_id = o.customer_id
      AND o.order_date > '2025-01-01';

-- QUERY 2: Filter in WHERE clause
-- Eliminates any row where order_date is NULL (no match or filtered out).
-- This effectively converts the LEFT JOIN into an INNER JOIN.
SELECT c.customer_name,
       o.order_date,
       COALESCE(o.total_amount, 0) AS amount
FROM io_thecodeforge.customers c
LEFT JOIN io_thecodeforge.orders o
       ON c.customer_id = o.customer_id
WHERE o.order_date > '2025-01-01';

-- QUERY 3: Find customers with NO orders after 2025-01-01
-- Classic pattern: LEFT JOIN + WHERE IS NULL
SELECT c.customer_name,
       COALESCE(o.total_amount, 0) AS amount
FROM io_thecodeforge.customers c
LEFT JOIN io_thecodeforge.orders o
       ON c.customer_id = o.customer_id
      AND o.order_date > '2025-01-01'
WHERE o.order_id IS NULL;
```
-- QUERY 1 output (filter in ON — all customers preserved):
| customer_name | order_date | amount |
|---------------|------------|--------|
| Alice Forge | 2025-06-15 | 250.00 |
| Alice Forge | 2025-09-22 | 175.50 |
| Bob Smith | NULL | 0 |
| Carol Wu | 2025-03-10 | 89.99 |
| Dave Patel | NULL | 0 |
-- QUERY 2 output (filter in WHERE — silently becomes INNER JOIN):
| customer_name | order_date | amount |
|---------------|------------|--------|
| Alice Forge | 2025-06-15 | 250.00 |
| Alice Forge | 2025-09-22 | 175.50 |
| Carol Wu | 2025-03-10 | 89.99 |
-- Bob Smith and Dave Patel are MISSING (filtered out by WHERE)
-- QUERY 3 output (customers with no recent orders):
| customer_name | amount |
|---------------|--------|
| Bob Smith | 0 |
| Dave Patel | 0 |
NULL Handling: The Landmine Nobody Practices
NULL is not zero. NULL is not an empty string. NULL is not 'false.' NULL is the absence of a value — it's the database saying 'I don't know.' And SQL's treatment of NULL is the source of more interview failures and production bugs than any other single concept.
The three rules that govern NULL:
1. NULL != NULL. NULL is not equal to anything, including itself. WHERE column = NULL returns zero rows. Always use IS NULL or IS NOT NULL.
2. Any arithmetic operation with NULL produces NULL. 5 + NULL = NULL. If your revenue column has NULLs, SUM(revenue) ignores them — but revenue * 1.1 produces NULL for those rows.
3. Comparison operators (<, >, =) with NULL produce UNKNOWN, not TRUE or FALSE. This means rows with NULL values are excluded from both WHERE column > 100 and WHERE column <= 100. They fall through both filters.
COALESCE is your safety net. It returns the first non-NULL value from its arguments. Use it to provide defaults: COALESCE(discount, 0) means 'use the discount if it exists, otherwise assume zero.' NVL (Oracle), IFNULL (MySQL), and ISNULL (SQL Server) are database-specific alternatives.
Production story: a revenue report was undercounting by 15%. The query was SELECT SUM(price * quantity) FROM order_items. Some rows had NULL quantity (items pending confirmation). Multiplying by NULL produced NULL, and SUM skips NULLs — so those items silently disappeared from the revenue total. The fix was SUM(COALESCE(price, 0) * COALESCE(quantity, 0)). One line. Three months of wrong numbers.
```sql
-- io.thecodeforge: NULL Handling Patterns
-- Schema: products(product_id, product_name, discount_percent, category)

-- WRONG: This returns ZERO rows even if NULLs exist
-- Because NULL = NULL evaluates to UNKNOWN, not TRUE
SELECT * FROM io_thecodeforge.products WHERE discount_percent = NULL;

-- RIGHT: Use IS NULL
SELECT * FROM io_thecodeforge.products WHERE discount_percent IS NULL;

-- WRONG: This excludes NULLs from both sides
-- Products with NULL discount appear in NEITHER result
SELECT * FROM io_thecodeforge.products WHERE discount_percent > 10;
SELECT * FROM io_thecodeforge.products WHERE discount_percent <= 10;
-- Together they don't cover all rows — NULL rows are missing from both.

-- RIGHT: Explicitly handle NULLs with COALESCE
SELECT product_name,
       discount_percent,
       COALESCE(discount_percent, 0) AS safe_discount,
       CASE
           WHEN discount_percent IS NULL THEN 'No discount set'
           WHEN discount_percent > 20 THEN 'Heavy discount'
           ELSE 'Standard'
       END AS discount_category
FROM io_thecodeforge.products;

-- PRODUCTION PATTERN: Safe aggregation with NULLs
-- Without COALESCE, NULL quantity items vanish from revenue totals
SELECT SUM(COALESCE(unit_price, 0) * COALESCE(quantity, 0)) AS total_revenue,
       COUNT(*) AS total_line_items,
       COUNT(quantity) AS items_with_quantity,  -- COUNT(column) skips NULLs
       COUNT(*) AS all_items                    -- COUNT(*) counts everything
FROM io_thecodeforge.order_items;
```
-- COALESCE and CASE output:
| product_name | discount_percent | safe_discount | discount_category |
|---------------|-----------------|---------------|--------------------|
| Widget A | 15 | 15 | Standard |
| Widget B | NULL | 0 | No discount set |
| Widget C | 25 | 25 | Heavy discount |
| Widget D | NULL | 0 | No discount set |
-- Revenue aggregation output:
| total_revenue | total_line_items | items_with_quantity | all_items |
|---------------|-----------------|--------------------|-----------|
| 45890.50 | 1200 | 1185 | 1200 |
-- Notice: 15 items had NULL quantity. COUNT(*) includes them, COUNT(column) doesn't.
Window Functions: The Interview Differentiator
Window functions are the topic that separates mid-level candidates from senior ones. GROUP BY collapses rows. Window functions keep every row but add computed columns based on a 'window' of related rows. If GROUP BY gives you a summary, window functions give you a summary attached to every detail row.
- ROW_NUMBER(): Assigns a unique sequential number. Ties get different numbers.
- RANK(): Assigns the same number to ties, then skips. 1, 1, 3, 4.
- DENSE_RANK(): Assigns the same number to ties, no skip. 1, 1, 2, 3.
- LEAD(column, n): Gets the value from n rows ahead.
- LAG(column, n): Gets the value from n rows behind.
- SUM() OVER(): Running total without collapsing rows.
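The tie behavior of the ranking trio is easiest to internalize by running it. A quick sketch against an in-memory SQLite database (window functions require SQLite 3.25+, bundled with modern Python; the scores table is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("a", 90), ("b", 90), ("c", 80), ("d", 70)])

# Two rows tie at 90 — watch how each function numbers them.
rows = conn.execute("""
    SELECT name, score,
           ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,   -- ties get different numbers
           RANK()       OVER (ORDER BY score DESC) AS rnk,       -- ties share, then skip: 1,1,3,4
           DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk  -- ties share, no skip:   1,1,2,3
    FROM scores
    ORDER BY score DESC, name
""").fetchall()

for name, score, row_num, rnk, dense_rnk in rows:
    print(name, score, row_num, rnk, dense_rnk)
```

Note that which of the two tied rows gets ROW_NUMBER 1 is arbitrary unless you add a tie-breaking column to the window's ORDER BY — a detail interviewers like to probe.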
The interview pattern I see most: 'Find the top 3 orders per customer' or 'Find the second-highest salary per department.' Both require ROW_NUMBER() or DENSE_RANK() with PARTITION BY. Candidates who reach for a subquery with LIMIT or a correlated subquery are showing they don't know window functions — and interviewers notice.
Production use case: I used LAG() to detect anomalies in a time-series metrics table. The query compared each hour's error rate to the previous hour's. If the error rate jumped by more than 200%, it flagged the row. That query ran against 50 million rows in 3 seconds because the window function avoided a self-join that would have taken 40+ seconds.
```sql
-- io.thecodeforge: Window Functions — Interview Patterns
-- Schema: orders(order_id, customer_id, order_date, total_amount)
--         employees(employee_id, name, department, salary)

-- PATTERN 1: Top N per group (top 2 orders per customer by amount)
-- This is the #1 window function interview question.
SELECT *
FROM (
    SELECT customer_id,
           order_id,
           total_amount,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id
               ORDER BY total_amount DESC
           ) AS rn
    FROM io_thecodeforge.orders
) ranked
WHERE rn <= 2;

-- PATTERN 2: Running total (cumulative revenue by date)
SELECT order_date,
       total_amount,
       SUM(total_amount) OVER (
           ORDER BY order_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM io_thecodeforge.orders
ORDER BY order_date;

-- PATTERN 3: DENSE_RANK for 'second-highest salary per department'
-- DENSE_RANK handles ties correctly. ROW_NUMBER would arbitrarily pick one.
SELECT *
FROM (
    SELECT department,
           name,
           salary,
           DENSE_RANK() OVER (
               PARTITION BY department
               ORDER BY salary DESC
           ) AS dept_rank
    FROM io_thecodeforge.employees
) ranked
WHERE dept_rank = 2;

-- PATTERN 4: LAG for detecting day-over-day anomalies
SELECT metric_date,
       error_count,
       LAG(error_count, 1) OVER (ORDER BY metric_date) AS prev_day_errors,
       CASE
           WHEN error_count >
                LAG(error_count, 1) OVER (ORDER BY metric_date) * 2
           THEN 'SPIKE ALERT'
           ELSE 'Normal'
       END AS status
FROM io_thecodeforge.daily_metrics
ORDER BY metric_date;
```
-- PATTERN 1 output (top 2 orders per customer):
| customer_id | order_id | total_amount | rn |
|-------------|----------|-------------|----|
| 101 | 5001 | 450.00 | 1 |
| 101 | 5003 | 320.00 | 2 |
| 102 | 5005 | 890.00 | 1 |
| 102 | 5002 | 210.00 | 2 |
-- PATTERN 2 output (running total):
| order_date | total_amount | running_total |
|------------|-------------|---------------|
| 2025-01-01 | 500.00 | 500.00 |
| 2025-01-02 | 350.00 | 850.00 |
| 2025-01-03 | 720.00 | 1570.00 |
-- PATTERN 4 output (anomaly detection):
| metric_date | error_count | prev_day_errors | status |
|------------|------------|-----------------|-------------|
| 2025-03-01 | 45 | NULL | Normal |
| 2025-03-02 | 52 | 45 | Normal |
| 2025-03-03 | 180 | 52 | SPIKE ALERT |
GROUP BY and HAVING: Aggregation Done Right
GROUP BY collapses multiple rows into summary rows. HAVING filters those summary rows after aggregation. WHERE filters individual rows before aggregation. Mixing them up is a classic interview mistake — and a common production bug.
The logical processing order matters: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. This means:
- WHERE cannot use aggregate functions (SUM, COUNT, AVG) because it runs before GROUP BY.
- HAVING can use aggregate functions because it runs after GROUP BY.
- You can filter on non-aggregated columns in WHERE (faster) or HAVING (slower, but sometimes necessary).
The trap interviewers love: 'Show me departments where the average salary exceeds 50,000.' That's HAVING AVG(salary) > 50000. Candidates who write WHERE AVG(salary) > 50000 get an immediate syntax error — and the interviewer watches to see if you know why.
Another trap: selecting columns not in the GROUP BY clause. In strict SQL mode (MySQL) or PostgreSQL, SELECT department, name, AVG(salary) FROM employees GROUP BY department fails because 'name' is not in the GROUP BY and is not aggregated. In older MySQL modes, it silently returns an arbitrary name from each group — which is almost never what you want. Always ensure every non-aggregated column in SELECT is in the GROUP BY.
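This trap is simple to demonstrate, because SQLite behaves like MySQL's legacy non-strict mode: it accepts a bare, non-aggregated column and silently returns a value from an arbitrary row of each group. A sketch with an invented employees table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    ("Ann", "Eng", 120000), ("Ben", "Eng", 90000),
    ("Cat", "HR", 60000),
])

# 'name' is neither grouped nor aggregated. PostgreSQL and strict-mode MySQL
# reject this query; SQLite runs it and picks SOME name per group.
rows = conn.execute("""
    SELECT department, name, AVG(salary)
    FROM employees
    GROUP BY department
    ORDER BY department
""").fetchall()

print(rows)  # one row per department — but which Eng name appears is arbitrary
```

The query "works" and the averages are right, but the name column is meaningless — exactly the kind of silent wrongness that survives code review and corrupts reports.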
```sql
-- io.thecodeforge: GROUP BY and HAVING Patterns
-- Schema: employees(employee_id, name, department, salary, hire_date)
--         orders(order_id, customer_id, order_date, total_amount, status)

-- PATTERN 1: Basic aggregation with HAVING
-- 'Departments with average salary above 50,000'
SELECT department,
       COUNT(*) AS headcount,
       AVG(salary) AS avg_salary,
       MAX(salary) AS highest_salary
FROM io_thecodeforge.employees
GROUP BY department
HAVING AVG(salary) > 50000
ORDER BY avg_salary DESC;

-- PATTERN 2: WHERE vs HAVING — use WHERE for non-aggregate filters
-- 'Departments with avg salary > 50k, but only for employees hired after 2020'
-- WRONG: HAVING hire_date > '2020-01-01' — errors in most engines
--        (hire_date is neither grouped nor aggregated)
-- RIGHT: Filter rows first with WHERE, then aggregate
SELECT department,
       AVG(salary) AS avg_salary
FROM io_thecodeforge.employees
WHERE hire_date > '2020-01-01'   -- Filters BEFORE grouping (faster)
GROUP BY department
HAVING AVG(salary) > 50000;      -- Filters AFTER grouping

-- PATTERN 3: Multi-column GROUP BY with COUNT DISTINCT
-- 'Revenue per customer per year, only for completed orders'
SELECT customer_id,
       EXTRACT(YEAR FROM order_date) AS order_year,
       COUNT(*) AS order_count,
       SUM(total_amount) AS total_revenue,
       COUNT(DISTINCT EXTRACT(MONTH FROM order_date)) AS active_months
FROM io_thecodeforge.orders
WHERE status = 'completed'
GROUP BY customer_id, EXTRACT(YEAR FROM order_date)
HAVING SUM(total_amount) > 10000
ORDER BY total_revenue DESC;
```
-- PATTERN 1 output (departments with avg salary above 50,000):
| department | headcount | avg_salary | highest_salary |
|-------------|----------|------------|----------------|
| Engineering | 45 | 92000.00 | 185000 |
| Data | 12 | 87000.00 | 145000 |
| Product | 8 | 78000.00 | 130000 |
-- Finance, HR, Marketing excluded (avg salary <= 50,000)
-- PATTERN 3 output:
| customer_id | order_year | order_count | total_revenue | active_months |
|-------------|-----------|-------------|---------------|---------------|
| 101 | 2025 | 47 | 23450.00 | 11 |
| 102 | 2025 | 32 | 18900.00 | 9 |
| 103 | 2025 | 28 | 12400.00 | 8 |
Subqueries vs CTEs vs Temp Tables: When to Use What
Interviewers ask: 'What's the difference between a subquery, a CTE, and a temp table?' The answer is about readability, reusability, and performance.
A subquery is a query nested inside another query. It works, but deeply nested subqueries become unreadable fast. If you have a subquery inside a subquery inside a WHERE clause, nobody — including future you — can debug it at 2am during an incident.
A CTE (Common Table Expression), introduced with the WITH clause, gives a subquery a name. It reads top-to-bottom, can be referenced multiple times, and makes complex queries self-documenting. PostgreSQL (12 and later) and SQL Server treat CTEs as inline views rather than materialized temp tables; PostgreSQL only forces materialization if you explicitly write AS MATERIALIZED. So there's usually no performance difference from a subquery.
A temp table physically stores intermediate results. Use it when you need to reference the same intermediate result many times, when the intermediate result is expensive to compute, or when you need to index the intermediate result. The downside: it's more I/O and the optimizer can't push predicates through it.
My rule of thumb: CTEs for readability. Subqueries for simple one-off filters. Temp tables for heavy multi-step transformations where you reference the intermediate result 5+ times or need to index it.
```sql
-- io.thecodeforge: Subquery vs CTE — Same Query, Different Styles
-- Task: Find customers whose total spend exceeds the overall average

-- APPROACH 1: Correlated Subquery (harder to read)
SELECT c.customer_name,
       (SELECT SUM(o.total_amount)
        FROM io_thecodeforge.orders o
        WHERE o.customer_id = c.customer_id) AS total_spend
FROM io_thecodeforge.customers c
WHERE (SELECT SUM(o.total_amount)
       FROM io_thecodeforge.orders o
       WHERE o.customer_id = c.customer_id) >
      (SELECT AVG(customer_total)
       FROM (SELECT SUM(total_amount) AS customer_total
             FROM io_thecodeforge.orders
             GROUP BY customer_id) sub);

-- APPROACH 2: CTE (clean, readable, self-documenting)
WITH customer_spending AS (
    SELECT customer_id, SUM(total_amount) AS total_spend
    FROM io_thecodeforge.orders
    GROUP BY customer_id
),
average_spending AS (
    SELECT AVG(total_spend) AS avg_spend
    FROM customer_spending
)
SELECT c.customer_name, cs.total_spend
FROM io_thecodeforge.customers c
JOIN customer_spending cs ON c.customer_id = cs.customer_id
CROSS JOIN average_spending av
WHERE cs.total_spend > av.avg_spend
ORDER BY cs.total_spend DESC;

-- APPROACH 3: Temp table (best when intermediate result is reused many times)
CREATE TEMP TABLE customer_spending_tmp AS
SELECT customer_id,
       SUM(total_amount) AS total_spend,
       COUNT(*) AS order_count
FROM io_thecodeforge.orders
GROUP BY customer_id;

-- Now reference it multiple times without recomputing
SELECT c.customer_name, cs.total_spend, cs.order_count
FROM io_thecodeforge.customers c
JOIN customer_spending_tmp cs ON c.customer_id = cs.customer_id
WHERE cs.total_spend > (SELECT AVG(total_spend) FROM customer_spending_tmp)
ORDER BY cs.total_spend DESC;

DROP TABLE IF EXISTS customer_spending_tmp;
```
-- Output (customers above the average spend):
| customer_name | total_spend |
|---------------|-------------|
| Alice Forge | 23450.00 |
| Bob Smith | 18900.00 |
-- Carol Wu and Dave Patel excluded (below average spend)
-- Performance note: CTE and subquery have similar execution plans.
-- Temp table is faster when the intermediate result is referenced 5+ times
-- because it avoids recomputation — but adds I/O overhead for small datasets.
Indexing and SARGability: Why Your Query is Slow
A common interview question is: 'Why isn't my index being used?' This usually leads to the concept of SARGable (Search ARGument-able) queries. If you wrap an indexed column in a function (like WHERE YEAR(created_at) = 2026), the database engine cannot use the index and falls back to a full table scan. Instead, provide a range the engine can 'seek' into.
I've seen this exact pattern kill production performance. A reporting query on a 14-million-row orders table was taking 45 seconds. The developer had WHERE YEAR(order_date) = 2025 AND MONTH(order_date) = 3. The index on order_date was completely ignored because YEAR() and MONTH() wrapped the column. The fix: WHERE order_date >= '2025-03-01' AND order_date < '2025-04-01'. Same result, 12 milliseconds instead of 45 seconds. The index seek touched 40,000 rows instead of scanning all 14 million.
Beyond SARGability, understand covering indexes. If an index contains every column referenced in the query (both in WHERE and SELECT), the engine never touches the actual table — it reads everything from the index. This is called an 'index-only scan' and it's the fastest possible query path.
Clustered vs non-clustered: a clustered index IS the table — data is stored in the order of the clustered index (usually the primary key). A non-clustered index is a separate structure with pointers back to the data. A table can have at most one clustered index (a table without one is stored as a heap) but many non-clustered indexes.
```sql
-- io.thecodeforge: SARGable vs Non-SARGable Queries
-- Schema: logs(log_id, created_at, level, message, service_name)
-- Index: idx_logs_created_at ON logs(created_at)

-- BAD: Index on created_at is IGNORED due to function wrapping
-- The engine must call DATE() on every row — full table scan
SELECT * FROM io_thecodeforge.logs
WHERE DATE(created_at) = '2026-03-06';

-- GOOD: Index seek enabled — engine jumps directly to matching range
SELECT * FROM io_thecodeforge.logs
WHERE created_at >= '2026-03-06 00:00:00'
  AND created_at <  '2026-03-07 00:00:00';

-- BAD: Function wrapped around the column — index cannot be used
SELECT * FROM io_thecodeforge.logs
WHERE LOWER(service_name) = 'payment-service';

-- GOOD: Store data in consistent case, or use a functional index
-- PostgreSQL: CREATE INDEX idx_service_lower ON logs(LOWER(service_name));
-- Then the LOWER() query becomes SARGable on that functional index.

-- COVERING INDEX example. If your query is:
SELECT customer_id, order_date, total_amount
FROM io_thecodeforge.orders
WHERE customer_id = 101
  AND order_date > '2025-01-01';

-- Create a covering index that includes ALL referenced columns:
CREATE INDEX idx_orders_covering
    ON io_thecodeforge.orders(customer_id, order_date)
    INCLUDE (total_amount);

-- Now the engine reads ONLY the index — never touches the table.
-- This is an 'index-only scan' — the fastest possible path.
```
-- Non-SARGable query:
Execution Plan: Seq Scan on logs | Rows: 14,000,000 | Time: 45,200ms
-- SARGable query:
Execution Plan: Index Scan using idx_logs_created_at | Rows: 40,000 | Time: 12ms
-- The difference: 3,750× faster by removing a function call.
Query Optimization: Reading Execution Plans and Avoiding Common Traps
The ultimate interview question: 'How would you optimize a slow query?' Here's the systematic approach I use in production, and the one I want to hear from candidates.
Step 1: Run EXPLAIN. Look for Seq Scan on large tables, Nested Loop on large joins, and high row estimates. If the estimated rows are wildly different from actual rows, your table statistics are stale — run ANALYZE.
Step 2: Check for non-SARGable predicates. Functions on indexed columns, implicit type conversions, LIKE with leading wildcards.
Step 3: Check for missing indexes. If the WHERE clause filters on columns without indexes, add them. If the JOIN uses columns without indexes, add them.
Step 4: Check for SELECT *. You're pulling every column from disk when you need three. Use a covering index or select only required columns.
Step 5: Check for N+1 queries. If your application runs one query to get IDs, then one query per ID, that's N+1. Rewrite as a single JOIN or IN clause.
Step 6: Check for DISTINCT abuse. DISTINCT is expensive — it requires sorting or hashing all results to remove duplicates. If you're using DISTINCT to hide duplicate rows caused by a bad JOIN, fix the JOIN instead.
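Step 5, the N+1 pattern, is worth seeing concretely. A minimal sketch using SQLite through Python's sqlite3 module (the schema is invented to mirror the article's orders example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"c{i}") for i in range(1, 101)])
conn.executemany("INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
                 [(i, 10.0) for i in range(1, 101)])

# N+1 anti-pattern: 1 query for the IDs, then 1 query PER id = 101 round trips.
ids = [row[0] for row in conn.execute("SELECT customer_id FROM customers")]
n_plus_one_queries = 1 + len(ids)
totals = {cid: conn.execute(
              "SELECT SUM(amount) FROM orders WHERE customer_id = ?",
              (cid,)).fetchone()[0]
          for cid in ids}

# Fix: one set-based query — a single round trip, same answer.
single = dict(conn.execute(
    "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"))

print(n_plus_one_queries)  # 101
print(totals == single)    # True — same result from 1 query instead of 101
```

On a local in-memory database the difference is invisible; over a network, each of those 101 round trips adds latency, which is why N+1 bugs only surface in production.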
The implicit conversion trap is worth calling out explicitly: comparing a VARCHAR column to an integer literal (WHERE varchar_col = 123) forces the engine to convert every row's value to an integer before comparing. This prevents index usage and causes a full table scan. Always match types: WHERE varchar_col = '123'.
```sql
-- io.thecodeforge: Query Optimization Patterns

-- TRAP 1: Implicit type conversion — index ignored
-- phone_number is VARCHAR, but we compare to an integer
-- The engine converts EVERY row's phone_number to int before comparing
SELECT * FROM io_thecodeforge.customers
WHERE phone_number = 9876543210;     -- BAD: implicit conversion

SELECT * FROM io_thecodeforge.customers
WHERE phone_number = '9876543210';   -- GOOD: type match, index used

-- TRAP 2: DISTINCT hiding a bad JOIN
-- If you're getting duplicates and adding DISTINCT to 'fix' it,
-- your JOIN condition is wrong.
SELECT DISTINCT c.customer_name, o.total_amount
FROM io_thecodeforge.customers c
JOIN io_thecodeforge.orders o ON c.customer_id = o.customer_id;
-- If this returns duplicates, check: are there multiple orders per customer?
-- If yes, DISTINCT is removing legitimate rows. If no, the JOIN is wrong.

-- TRAP 3: IN with a large subquery — the engine materializes the whole result
SELECT * FROM io_thecodeforge.customers c
WHERE c.customer_id IN (
    SELECT customer_id FROM io_thecodeforge.orders
    WHERE total_amount > 1000
);

-- Better: use EXISTS (short-circuits on first match)
SELECT * FROM io_thecodeforge.customers c
WHERE EXISTS (
    SELECT 1 FROM io_thecodeforge.orders o
    WHERE o.customer_id = c.customer_id
      AND o.total_amount > 1000
);

-- Best: use JOIN with DISTINCT if you only need customer data
SELECT DISTINCT c.*
FROM io_thecodeforge.customers c
JOIN io_thecodeforge.orders o ON c.customer_id = o.customer_id
WHERE o.total_amount > 1000;

-- TRAP 4: LIKE with leading wildcard — index cannot be used
SELECT * FROM io_thecodeforge.logs
WHERE message LIKE '%timeout%';  -- Full table scan, every time

-- If you need full-text search, use a proper full-text index:
-- PostgreSQL: CREATE INDEX idx_message_fts
--     ON logs USING gin(to_tsvector('english', message));
-- Then: WHERE to_tsvector('english', message) @@ to_tsquery('timeout');
```
-- Integer comparison on VARCHAR:
Execution Plan: Seq Scan on customers | Rows: 500,000 | Time: 850ms
-- String comparison on VARCHAR:
Execution Plan: Index Scan using idx_phone | Rows: 1 | Time: 0.5ms
-- 1,700× slower due to implicit conversion.
| Feature | Clustered Index | Non-Clustered Index |
|---|---|---|
| Data Storage | The table IS the index (stores actual data in index order) | Separate structure with pointers back to data pages |
| Count per Table | Only 1 (usually Primary Key) | Multiple (limited by engine, typically dozens) |
| Performance | Faster for range scans and ORDER BY on the key | Faster for targeted single-row lookups |
| Leaf Nodes | Actual data pages (the table rows themselves) | Pointers to the clustered index key or row ID |
| When to Use | Primary key, columns used in range queries and sorting | Columns frequently in WHERE, JOIN, or ORDER BY that aren't the primary key |
🎯 Key Takeaways
- JOINs are not Venn diagrams — they're row-combination operations. In a LEFT JOIN, the ON clause controls which right-side rows participate; the WHERE clause eliminates rows from the final result. Mixing them up silently converts LEFT JOINs into INNER JOINs.
- NULL is the absence of a value, not a value itself. NULL != NULL, any arithmetic with NULL produces NULL, and comparison operators with NULL produce UNKNOWN. Use IS NULL, COALESCE(), and COUNT(*) vs COUNT(column) correctly.
- Window functions (ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG) add computed columns without collapsing rows. They're the correct answer for 'top N per group' and 'running total' questions — not correlated subqueries.
- SARGable queries let the engine use index seeks instead of full table scans. Never wrap indexed columns in functions (YEAR, MONTH, DATE, LOWER). Provide ranges instead: WHERE date >= X AND date < Y.
- Covering indexes contain every column a query needs, enabling index-only scans — the fastest possible query path. Design indexes around your most frequent and expensive queries.
- EXISTS short-circuits on the first match; IN materializes the entire subquery result. For large subquery result sets, EXISTS is usually faster. For small static lists, IN is more readable.
- GROUP BY collapses rows; HAVING filters after aggregation; WHERE filters before. Every non-aggregated column in SELECT must be in GROUP BY or wrapped in an aggregate function.
- CTEs improve readability and can be referenced multiple times. Subqueries are fine for simple one-off filters. Temp tables win when intermediate results are expensive and referenced many times.
- Always run EXPLAIN before optimizing. Look for Seq Scans, high row estimates, and implicit type conversions. Without execution plans, you're guessing.
- SELECT * prevents covering indexes, increases I/O, and breaks backward compatibility. Always specify only the columns you need.
Interview Questions on This Topic
- Q: Explain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN. When would you use each one? Give a real-world example where choosing the wrong JOIN type would produce incorrect business metrics.
- Q: What is the difference between WHERE and HAVING? Can you use aggregate functions in WHERE? Why or why not? What is the logical processing order of a SQL query?
- Q: Explain the difference between ROW_NUMBER(), RANK(), and DENSE_RANK(). Given a table of employee salaries, how would you find the second-highest salary per department using each function? Which would you choose and why?
- Q: What is a SARGable query? Give an example of a non-SARGable query and explain how to rewrite it. What happens to the execution plan when you wrap an indexed column in a function?
- Q: What is the difference between EXISTS and IN? When would you prefer one over the other? What happens to performance when the subquery result set is very large?
- Q: Explain NULL behavior in SQL. Why does WHERE column = NULL return zero rows? What is the difference between COUNT(*) and COUNT(column)? How does NULL affect arithmetic operations and aggregate functions?
- Q: A query runs in 200ms on a table with 10,000 rows but takes 45 seconds on a table with 10 million rows. Walk me through how you would diagnose and optimize it. What would you check first?
- Q: What is a covering index? How does it differ from a regular index? What is an 'index-only scan' and why is it the fastest query path?
- Q: Explain the difference between a subquery, a CTE, and a temporary table. When would you choose each approach? Does a CTE always materialize a temp table, or is it an inline view?
- Q: What is the difference between DELETE, TRUNCATE, and DROP? When would you use each? What happens to the transaction log in each case?
Frequently Asked Questions
What is the difference between WHERE and HAVING in SQL?
WHERE filters individual rows before any grouping or aggregation occurs — it runs early in the logical processing order. HAVING filters groups after GROUP BY has collapsed rows — it can use aggregate functions like SUM, COUNT, and AVG because it runs after aggregation. You cannot use WHERE AVG(salary) > 50000 because WHERE runs before GROUP BY, so no aggregate has been computed yet. Use HAVING AVG(salary) > 50000. For non-aggregate filters, always prefer WHERE — it's more efficient because it reduces the number of rows before the expensive GROUP BY operation.
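A minimal runnable illustration of that ordering, using Python's built-in sqlite3 module (the employees table and its data are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ann", "eng", 90000), ("Bob", "eng", 70000),
     ("Cat", "sales", 40000), ("Dan", "sales", 45000)])

# WHERE trims individual rows BEFORE grouping;
# HAVING filters whole groups AFTER the aggregate is computed.
rows = conn.execute("""
    SELECT dept, AVG(salary) AS avg_salary
    FROM employees
    WHERE salary > 30000          -- row-level filter, runs first
    GROUP BY dept
    HAVING AVG(salary) > 50000    -- group-level filter, runs after AVG
""").fetchall()
print(rows)  # [('eng', 80000.0)] — sales (avg 42500) is filtered out by HAVING
```

Swapping the HAVING condition into WHERE raises an error, because no aggregate exists yet at the point WHERE runs.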
What are Window Functions and why are they useful?
Window functions (ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG, SUM OVER) perform calculations across a set of related rows without collapsing them into a single output row. Unlike GROUP BY, which produces one summary row per group, window functions retain every individual row and add computed columns. This is essential for patterns like 'top N per group,' 'running totals,' 'month-over-month comparison,' and 'rank within a partition.' The syntax uses OVER (PARTITION BY column ORDER BY column) to define the window of rows each calculation considers.
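Here is the classic 'top N per group' pattern as a runnable sketch via Python's sqlite3 module (window functions need SQLite 3.25+, which Python 3.8+ normally bundles; the salaries table is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO salaries VALUES (?, ?, ?)",
                 [("Ann", "eng", 90), ("Bob", "eng", 80), ("Cy", "eng", 70),
                  ("Dee", "sales", 60), ("Eve", "sales", 50)])

# Rank rows inside each department without collapsing them,
# then filter on the computed rank in an outer query.
top2 = conn.execute("""
    SELECT name, dept, salary FROM (
        SELECT name, dept, salary,
               ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
        FROM salaries
    ) WHERE rn <= 2
    ORDER BY dept, salary DESC
""").fetchall()
print(top2)  # two highest earners per department, every row intact
```

Note that the rank column must be filtered in an outer query: window functions are evaluated after WHERE, so `WHERE rn <= 2` cannot appear in the same query level that defines `rn`.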
What is Database Normalization and why do we do it?
Normalization organizes data to reduce redundancy and improve integrity. 1NF eliminates repeating groups (each cell holds a single value). 2NF removes partial dependencies (every non-key column depends on the entire primary key). 3NF removes transitive dependencies (non-key columns don't depend on other non-key columns). In practice, most production databases target 3NF. Denormalization (intentionally violating normalization) is used for read-heavy workloads like analytics and reporting, where JOIN performance matters more than storage efficiency.
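A small schema sketch of the 3NF idea, using Python's sqlite3 module (the customers/orders schema and data are invented): customer facts live in exactly one row, so updating them never touches the order rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Normalized to 3NF: city depends on the customer, not on any order,
# so it lives in customers — not repeated on every order row.
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    );
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ann', 'Oslo')")
conn.executemany("INSERT INTO orders VALUES (?, 1, ?)", [(1, 10.0), (2, 20.0)])

# A city change is now a single UPDATE, not one per order row —
# no chance of two orders disagreeing about where Ann lives.
conn.execute("UPDATE customers SET city = 'Bergen' WHERE customer_id = 1")
city = conn.execute(
    "SELECT city FROM customers WHERE customer_id = 1").fetchone()[0]
print(city)
```

In a denormalized version with `city` on every order row, that update would have to touch every order and a missed row would leave contradictory data.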
What is the difference between EXISTS and IN?
Both check if a value exists in a subquery result set, but they behave differently at scale. EXISTS runs the subquery for each outer row and short-circuits on the first match — it stops scanning as soon as it finds one matching row. IN materializes the entire subquery result into a set, then checks membership. For large subquery results (100,000+ rows), EXISTS is usually faster because it may find a match on the first row. For small static lists (WHERE status IN ('active', 'pending')), IN is more readable. In an interview, say: 'I default to EXISTS for correlated checks and IN for small literal lists.'
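Both forms can be tried side by side with Python's sqlite3 module (a toy customers/orders schema, invented for the demo; at this size they are equivalent, the difference only matters at scale):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob"), (3, "Cat")])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1), (2, 1), (3, 3)])

# Correlated EXISTS: per outer customer, stop at the first matching order.
with_exists = conn.execute("""
    SELECT name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY name
""").fetchall()

# IN: build the set of customer_ids from the subquery, then test membership.
with_in = conn.execute("""
    SELECT name FROM customers
    WHERE id IN (SELECT customer_id FROM orders)
    ORDER BY name
""").fetchall()

print(with_exists, with_in)  # same answer either way: Ann and Cat have orders
```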
What is the difference between a Primary Key and a Foreign Key?
A Primary Key uniquely identifies each row in a table — it cannot be NULL and must be unique. A table has exactly one primary key. A Foreign Key is a column (or set of columns) in one table that references the primary key of another table, establishing a relationship between them. Foreign keys enforce referential integrity — you cannot insert an order with a customer_id that doesn't exist in the customers table, and you cannot delete a customer who has orders (unless you use CASCADE). In an interview, mention that foreign keys also help the optimizer choose better join strategies because they guarantee referential integrity.
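Referential integrity in action, sketched with Python's sqlite3 module (note that SQLite ships with foreign-key enforcement off by default, so the PRAGMA is required; the schema is invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in to FK checks
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id)
    );
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ann')")
conn.execute("INSERT INTO orders VALUES (1, 1)")       # OK: customer 1 exists

error = None
try:
    conn.execute("INSERT INTO orders VALUES (2, 99)")  # no customer 99 exists
except sqlite3.IntegrityError as exc:
    error = exc  # the engine rejects the orphan row
print(error)
```

The rejected insert is exactly the guarantee described above: an order cannot reference a customer that does not exist.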
What are Transactions and ACID properties?
A transaction is a sequence of operations treated as a single logical unit — either all operations succeed (COMMIT) or none do (ROLLBACK). ACID stands for Atomicity (all-or-nothing execution), Consistency (the database moves from one valid state to another), Isolation (concurrent transactions don't interfere with each other), and Durability (committed data survives system failures). In production, transactions prevent partial updates — like debiting one account but failing to credit another. Without transactions, every error is a potential data corruption event.
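The debit/credit example above can be sketched with Python's sqlite3 module, which opens a transaction implicitly before DML under its default settings (the accounts table is invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 0.0)])
conn.commit()

# Start the transfer: the debit half runs inside an open transaction.
conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
# ...suppose the credit half fails here (crash, constraint, network error)...
conn.rollback()  # atomicity: the lone debit is undone, no money vanishes

balances = conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall()
print(balances)  # [(100.0,), (0.0,)] — back to the last committed valid state
```

Had we called commit() after both UPDATEs instead, durability guarantees the new balances would survive a crash.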
What is a CTE (Common Table Expression) and when should you use it?
A CTE is a named temporary result set defined with the WITH clause that exists only for the duration of the query. It improves readability by giving meaningful names to intermediate steps, can be referenced multiple times in the same query, and makes complex queries self-documenting. Contrary to popular belief, CTEs in PostgreSQL and SQL Server are usually inlined as subqueries by the optimizer — they don't create temp tables unless you use MATERIALIZED. Use CTEs when you need to reference the same intermediate result multiple times or when a query has more than two levels of nesting that would be unreadable as plain subqueries.
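A short runnable sketch of the 'referenced multiple times' case, via Python's sqlite3 module (the sales table is invented): the CTE is defined once and used both as the row source and inside the filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100), ("north", 200), ("south", 50), ("south", 70)])

# The WITH clause names the intermediate step; the outer query then
# reads it twice — once for rows, once for the average comparison.
rows = conn.execute("""
    WITH region_totals AS (
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
    )
    SELECT region, total
    FROM region_totals
    WHERE total > (SELECT AVG(total) FROM region_totals)
""").fetchall()
print(rows)  # only 'north' (300) beats the average total (210)
```

Written with plain subqueries, the aggregation logic would have to be duplicated in both places and kept in sync by hand.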
What is the difference between UNION and UNION ALL?
UNION combines results from two queries and removes duplicate rows — it requires a sort or hash operation to deduplicate, which is expensive on large datasets. UNION ALL combines results and keeps all rows including duplicates — it's a simple concatenation with no deduplication overhead. In production, almost always use UNION ALL unless you specifically need deduplication. I've seen queries go from 12 seconds to 200 milliseconds just by switching UNION to UNION ALL when the developer didn't realize duplicates were impossible given the query logic.
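The behavioral difference is easy to see in a few lines with Python's sqlite3 module (toy tables, invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (v INTEGER);
    CREATE TABLE t2 (v INTEGER);
    INSERT INTO t1 VALUES (1), (2), (2);
    INSERT INTO t2 VALUES (2), (3);
""")

# UNION deduplicates (and pays for it with a sort/hash step).
union = conn.execute(
    "SELECT v FROM t1 UNION SELECT v FROM t2 ORDER BY v").fetchall()

# UNION ALL is a raw concatenation — every row survives, no dedup cost.
union_all = conn.execute(
    "SELECT v FROM t1 UNION ALL SELECT v FROM t2 ORDER BY v").fetchall()

print(union)      # [(1,), (2,), (3,)]
print(union_all)  # [(1,), (2,), (2,), (2,), (3,)]
```

If the two branches can never overlap (say, each selects a disjoint date range), the deduplication work in UNION is pure waste — which is exactly the 12-second-to-200ms scenario described above.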