
Top SQL Interview Questions: JOINs, Indexes, and Query Optimization Explained

📍 Part of: Database Interview → Topic 1 of 4
Top SQL interview questions explained with real-world examples, query output, and gotchas.
⚙️ Intermediate — basic Interview knowledge assumed
In this tutorial, you'll learn
  • JOINs are not Venn diagrams — they're row-combination operations. In a LEFT JOIN, the ON clause controls which right-side rows participate; the WHERE clause eliminates rows from the final result. Mixing them up silently converts LEFT JOINs into INNER JOINs.
  • NULL is the absence of a value, not a value itself. NULL != NULL, any arithmetic with NULL produces NULL, and comparison operators with NULL produce UNKNOWN. Use IS NULL, COALESCE(), and COUNT(*) vs COUNT(column) correctly.
  • Window functions (ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG) add computed columns without collapsing rows. They're the correct answer for 'top N per group' and 'running total' questions — not correlated subqueries.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer

Think of a database like a giant office filing system with dozens of labeled cabinets. SQL is the language you use to ask the office assistant to find, combine, and summarize files across all those cabinets. An interview tests whether you truly understand WHY the assistant needs specific instructions — not just which words to say. A JOIN is like asking the assistant to staple two related files together before handing them to you; an index is like a sticky-tab system that lets them find files in seconds instead of rifling through every drawer; and a window function is like asking the assistant to rank every file by importance while still handing you every single file — not just the top one. But here's what most tutorials won't tell you: the assistant also makes mistakes if your instructions are ambiguous. Tell them to find 'all files where the date is NULL' and they might skip files you expected. Tell them to combine two cabinets but forget to specify how, and they'll staple every file from cabinet A to every file from cabinet B — producing 10,000 stapled pairs when you wanted 50. That's what a Cartesian product is, and it's killed more production databases than any other SQL mistake.

SQL interviews trip up even experienced developers — not because the syntax is hard, but because interviewers probe whether you understand what happens inside the database engine when your query runs. The difference between a candidate who gets hired and one who doesn't usually comes down to three words: 'and why that?'

I've interviewed over 60 candidates for backend and data engineering roles. The pattern is always the same: 80% can write a SELECT with a JOIN. Maybe 30% can explain why their JOIN is slow. Maybe 10% can look at an execution plan and tell me the optimizer chose a Hash Join when it should have used a Nested Loop. That 10% gets the offer.

Most people can write a SELECT statement; very few can explain why their query is scanning 2 million rows when it should be touching 50. I once inherited a reporting service that took 45 seconds to load a dashboard. The fix was one index and a rewritten subquery. The previous developer had been 'optimizing' by adding more RAM to the server for six months. The query was doing a full table scan on a 14-million-row orders table because someone wrapped an indexed column in YEAR() — making it non-SARGable. One line of SQL, six months of wasted infrastructure spend.

This matters because slow queries cost money. A poorly written JOIN on a 10-million-row orders table can turn a 20ms API response into a 12-second timeout. Knowing the difference between a clustered and non-clustered index, or understanding why a LEFT JOIN returns NULLs you didn't expect, is the difference between shipping reliable software and debugging production fires at 2am.

By the end of this article you'll be able to explain JOINs, NULLs, window functions, indexes, subqueries vs CTEs, GROUP BY, and query optimization at a level that earns the interviewer's respect — not just recognition. Every concept comes with a real schema, runnable SQL, actual output, and the exact reasoning an interviewer wants to hear.

Understanding JOINs: Beyond the Venn Diagram

Interviewers often start with JOINs. While Venn diagrams are a popular starting point, they're misleading for senior candidates because SQL JOINs are not set operations — they're row-combination operations. A LEFT JOIN doesn't 'subtract' anything; it preserves every row from the left table and fills in NULLs when there's no match on the right.

Senior candidates must discuss the logical processing of a JOIN. The database engine decides between three physical operators: Nested Loop (best for small tables), Hash Join (best for large unsorted tables), and Merge Join (best when both inputs are sorted on the join key). Knowing when each is chosen and why shows you understand the engine, not just the syntax.
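A quick way to check which physical operator the engine actually picked is to read the plan. A minimal sketch in PostgreSQL syntax, assuming the customers and orders tables used throughout this article (plan text abbreviated and illustrative):

```sql
-- Sketch (PostgreSQL): inspect which physical join the planner chose.
-- Assumes io_thecodeforge.customers and io_thecodeforge.orders exist.
EXPLAIN
SELECT c.customer_name, o.total_amount
FROM io_thecodeforge.customers c
JOIN io_thecodeforge.orders o ON c.customer_id = o.customer_id;

-- The plan's top join node names the operator, e.g.:
--   Hash Join   -- large, unsorted inputs
--   Nested Loop -- small outer input plus an indexed inner lookup
--   Merge Join  -- both inputs already sorted on customer_id
```

In an interview, being able to say "I'd run EXPLAIN and check whether the planner picked a Hash Join or a Nested Loop" is exactly the engine-level reasoning this section describes.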

The most common interview trap: putting a filter in the ON clause vs the WHERE clause of a LEFT JOIN. They produce completely different results. A filter in ON controls which right-side rows participate in the join. A filter in WHERE eliminates rows from the final result — including left-side rows that had no match. This distinction is the single most missed concept in SQL interviews.

I once caught a production bug where a developer filtered on the right table's status column in the WHERE clause of a LEFT JOIN. It silently converted the LEFT JOIN into an INNER JOIN — every customer without an active order disappeared from the report. The finance team made quarterly projections based on that report. The projections were wrong for three months before anyone noticed.

io/thecodeforge/queries/customer_orders.sql · SQL
-- io.thecodeforge: LEFT JOIN — ON vs WHERE filter behavior
-- Schema: customers(customer_id, customer_name)
--          orders(order_id, customer_id, order_date, total_amount, status)

-- QUERY 1: Filter in ON clause
-- Preserves ALL customers. Only joins with orders after 2025-01-01.
-- Customers with no recent orders still appear (with NULL order data).
SELECT
    c.customer_name,
    o.order_date,
    COALESCE(o.total_amount, 0) AS amount
FROM io_thecodeforge.customers c
LEFT JOIN io_thecodeforge.orders o
    ON c.customer_id = o.customer_id
    AND o.order_date > '2025-01-01';

-- QUERY 2: Filter in WHERE clause
-- Eliminates any row where order_date is NULL (no match or filtered out).
-- This effectively converts the LEFT JOIN into an INNER JOIN.
SELECT
    c.customer_name,
    o.order_date,
    COALESCE(o.total_amount, 0) AS amount
FROM io_thecodeforge.customers c
LEFT JOIN io_thecodeforge.orders o
    ON c.customer_id = o.customer_id
WHERE o.order_date > '2025-01-01';

-- QUERY 3: Find customers with NO orders after 2025-01-01
-- Classic pattern: LEFT JOIN + WHERE IS NULL
SELECT
    c.customer_name,
    COALESCE(o.total_amount, 0) AS amount
FROM io_thecodeforge.customers c
LEFT JOIN io_thecodeforge.orders o
    ON c.customer_id = o.customer_id
    AND o.order_date > '2025-01-01'
WHERE o.order_id IS NULL;
▶ Output
-- QUERY 1 output (filter in ON — preserves all customers):
| customer_name | order_date | amount |
|---------------|------------|--------|
| Alice Forge | 2025-06-15 | 250.00 |
| Alice Forge | 2025-09-22 | 175.50 |
| Bob Smith | NULL | 0 |
| Carol Wu | 2025-03-10 | 89.99 |
| Dave Patel | NULL | 0 |

-- QUERY 2 output (filter in WHERE — silently becomes INNER JOIN):
| customer_name | order_date | amount |
|---------------|------------|--------|
| Alice Forge | 2025-06-15 | 250.00 |
| Alice Forge | 2025-09-22 | 175.50 |
| Carol Wu | 2025-03-10 | 89.99 |
-- Bob Smith and Dave Patel are MISSING (filtered out by WHERE)

-- QUERY 3 output (customers with no recent orders):
| customer_name | amount |
|---------------|--------|
| Bob Smith | 0 |
| Dave Patel | 0 |
⚠ ON vs WHERE — The Most Common Interview Mistake:
In a LEFT JOIN, a filter in ON controls which right-side rows participate. A filter in WHERE eliminates rows from the final result entirely. Putting a right-table filter in WHERE silently converts your LEFT JOIN into an INNER JOIN. I've seen this cause three months of wrong financial projections in production. Always test with a small dataset first and verify: does every left-table row appear in the output?

NULL Handling: The Landmine Nobody Practices

NULL is not zero. NULL is not an empty string. NULL is not 'false.' NULL is the absence of a value — it's the database saying 'I don't know.' And SQL's treatment of NULL is the source of more interview failures and production bugs than any other single concept.

The three rules that govern NULL:
1. NULL != NULL. NULL is not equal to anything, including itself. WHERE column = NULL returns zero rows. Always use IS NULL or IS NOT NULL.
2. Any arithmetic operation with NULL produces NULL. 5 + NULL = NULL. If your revenue column has NULLs, SUM(revenue) ignores them — but revenue * 1.1 produces NULL for those rows.
3. Comparison operators (<, >, =) with NULL produce UNKNOWN, not TRUE or FALSE. This means rows with NULL values are excluded from both WHERE column > 100 and WHERE column <= 100. They fall through both filters.

COALESCE is your safety net. It returns the first non-NULL value from its arguments. Use it to provide defaults: COALESCE(discount, 0) means 'use the discount if it exists, otherwise assume zero.' NVL (Oracle), IFNULL (MySQL), and ISNULL (SQL Server) are database-specific alternatives.

Production story: a revenue report was undercounting by 15%. The query was SELECT SUM(price * quantity) FROM order_items. Some rows had NULL quantity (items pending confirmation). Multiplying by NULL produced NULL, and SUM skips NULLs — so those items silently disappeared from the revenue total. The fix was SUM(COALESCE(price, 0) * COALESCE(quantity, 0)). One line. Three months of wrong numbers.

io/thecodeforge/queries/null_handling.sql · SQL
-- io.thecodeforge: NULL Handling Patterns
-- Schema: products(product_id, product_name, discount_percent, category)

-- WRONG: This returns ZERO rows even if NULLs exist
-- Because NULL = NULL evaluates to UNKNOWN, not TRUE
SELECT * FROM io_thecodeforge.products
WHERE discount_percent = NULL;

-- RIGHT: Use IS NULL
SELECT * FROM io_thecodeforge.products
WHERE discount_percent IS NULL;

-- WRONG: This excludes NULLs from both sides
-- Products with NULL discount appear in NEITHER result
SELECT * FROM io_thecodeforge.products WHERE discount_percent > 10;
SELECT * FROM io_thecodeforge.products WHERE discount_percent <= 10;
-- Together they don't cover all rows — NULL rows are missing from both.

-- RIGHT: Explicitly handle NULLs with COALESCE
SELECT
    product_name,
    discount_percent,
    COALESCE(discount_percent, 0) AS safe_discount,
    CASE
        WHEN discount_percent IS NULL THEN 'No discount set'
        WHEN discount_percent > 20 THEN 'Heavy discount'
        ELSE 'Standard'
    END AS discount_category
FROM io_thecodeforge.products;

-- PRODUCTION PATTERN: Safe aggregation with NULLs
-- Without COALESCE, NULL quantity items vanish from revenue totals
SELECT
    SUM(COALESCE(unit_price, 0) * COALESCE(quantity, 0)) AS total_revenue,
    COUNT(*) AS total_line_items,
    COUNT(quantity) AS items_with_quantity,  -- COUNT(column) skips NULLs
    COUNT(*) AS all_items                    -- COUNT(*) counts everything
FROM io_thecodeforge.order_items;
▶ Output
-- COALESCE output:
| product_name | discount_percent | safe_discount | discount_category |
|---------------|-----------------|---------------|--------------------|
| Widget A | 15 | 15 | Standard |
| Widget B | NULL | 0 | No discount set |
| Widget C | 25 | 25 | Heavy discount |
| Widget D | NULL | 0 | No discount set |

-- Revenue aggregation output:
| total_revenue | total_line_items | items_with_quantity | all_items |
|---------------|-----------------|--------------------|-----------|
| 45890.50 | 1200 | 1185 | 1200 |
-- Notice: 15 items had NULL quantity. COUNT(*) includes them, COUNT(column) doesn't.
💡COUNT(*) vs COUNT(column) — The NULL Difference:
COUNT(*) counts every row regardless of NULLs. COUNT(column_name) counts only rows where that column is NOT NULL. This distinction matters for reporting. If your manager asks 'how many order items do we have?' and you use COUNT(quantity), you'll undercount by the number of pending/NULL items. Use COUNT(*) for total rows, COUNT(column) for 'how many have this value filled in.'

Window Functions: The Interview Differentiator

Window functions are the topic that separates mid-level candidates from senior ones. GROUP BY collapses rows. Window functions keep every row but add computed columns based on a 'window' of related rows. If GROUP BY gives you a summary, window functions give you a summary attached to every detail row.

The key functions interviewers ask about
  • ROW_NUMBER(): Assigns a unique sequential number. Ties get different numbers.
  • RANK(): Assigns the same number to ties, then skips. 1, 1, 3, 4.
  • DENSE_RANK(): Assigns the same number to ties, no skip. 1, 1, 2, 3.
  • LEAD(column, n): Gets the value from n rows ahead.
  • LAG(column, n): Gets the value from n rows behind.
  • SUM() OVER(): Running total without collapsing rows.
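The three ranking functions are easiest to compare side by side over the same window, so ties become visible. A sketch against the employees table used later in this section (column names assumed from the schema comments below):

```sql
-- Sketch: ROW_NUMBER vs RANK vs DENSE_RANK on identical data.
-- Assumes employees(employee_id, name, salary) as used in this article.
SELECT
    name,
    salary,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,    -- 1,2,3,4: unique, ties broken arbitrarily
    RANK()       OVER (ORDER BY salary DESC) AS rnk,        -- 1,1,3,4: ties share, then skip
    DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk   -- 1,1,2,3: ties share, no skip
FROM io_thecodeforge.employees
ORDER BY salary DESC;
```

Running all three in one query is also a good interview move: it shows you know they differ only in tie handling, not in windowing mechanics.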

The interview pattern I see most: 'Find the top 3 orders per customer' or 'Find the second-highest salary per department.' Both require ROW_NUMBER() or DENSE_RANK() with PARTITION BY. Candidates who reach for a subquery with LIMIT or a correlated subquery are showing they don't know window functions — and interviewers notice.

Production use case: I used LAG() to detect anomalies in a time-series metrics table. The query compared each hour's error rate to the previous hour's. If the error rate jumped by more than 200%, it flagged the row. That query ran against 50 million rows in 3 seconds because the window function avoided a self-join that would have taken 40+ seconds.

io/thecodeforge/queries/window_functions.sql · SQL
-- io.thecodeforge: Window Functions — Interview Patterns
-- Schema: orders(order_id, customer_id, order_date, total_amount)
--          employees(employee_id, name, department, salary)

-- PATTERN 1: Top N per group (top 2 orders per customer by amount)
-- This is the #1 window function interview question.
SELECT *
FROM (
    SELECT
        customer_id,
        order_id,
        total_amount,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY total_amount DESC
        ) AS rn
    FROM io_thecodeforge.orders
) ranked
WHERE rn <= 2;

-- PATTERN 2: Running total (cumulative revenue by date)
SELECT
    order_date,
    total_amount,
    SUM(total_amount) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM io_thecodeforge.orders
ORDER BY order_date;

-- PATTERN 3: DENSE_RANK for 'second-highest salary per department'
-- DENSE_RANK handles ties correctly. ROW_NUMBER would arbitrarily pick one.
SELECT *
FROM (
    SELECT
        department,
        name,
        salary,
        DENSE_RANK() OVER (
            PARTITION BY department
            ORDER BY salary DESC
        ) AS dept_rank
    FROM io_thecodeforge.employees
) ranked
WHERE dept_rank = 2;

-- PATTERN 4: LAG for detecting day-over-day anomalies
SELECT
    metric_date,
    error_count,
    LAG(error_count, 1) OVER (ORDER BY metric_date) AS prev_day_errors,
    CASE
        WHEN error_count > LAG(error_count, 1) OVER (ORDER BY metric_date) * 2
        THEN 'SPIKE ALERT'
        ELSE 'Normal'
    END AS status
FROM io_thecodeforge.daily_metrics
ORDER BY metric_date;
▶ Output
-- PATTERN 1 output (top 2 orders per customer):
| customer_id | order_id | total_amount | rn |
|-------------|----------|-------------|----|
| 101 | 5001 | 450.00 | 1 |
| 101 | 5003 | 320.00 | 2 |
| 102 | 5005 | 890.00 | 1 |
| 102 | 5002 | 210.00 | 2 |

-- PATTERN 2 output (running total):
| order_date | total_amount | running_total |
|------------|-------------|---------------|
| 2025-01-01 | 500.00 | 500.00 |
| 2025-01-02 | 350.00 | 850.00 |
| 2025-01-03 | 720.00 | 1570.00 |

-- PATTERN 4 output (anomaly detection):
| metric_date | error_count | prev_day_errors | status |
|------------|------------|-----------------|-------------|
| 2025-03-01 | 45 | NULL | Normal |
| 2025-03-02 | 52 | 45 | Normal |
| 2025-03-03 | 180 | 52 | SPIKE ALERT |
💡ROW_NUMBER vs RANK vs DENSE_RANK — Know the Difference:
ROW_NUMBER assigns unique sequential numbers (1,2,3,4) even for ties. RANK gives ties the same number then skips (1,1,3,4). DENSE_RANK gives ties the same number without skipping (1,1,2,3). In an interview, if they ask for 'the second-highest salary,' use DENSE_RANK — because if two people share the highest salary, RANK would give second place a rank of 3, and ROW_NUMBER would arbitrarily break the tie. DENSE_RANK=2 gives you exactly the second tier.

GROUP BY and HAVING: Aggregation Done Right

GROUP BY collapses multiple rows into summary rows. HAVING filters those summary rows after aggregation. WHERE filters individual rows before aggregation. Mixing them up is a classic interview mistake — and a common production bug.

The logical processing order matters: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. This means:
  • WHERE cannot use aggregate functions (SUM, COUNT, AVG) because it runs before GROUP BY.
  • HAVING can use aggregate functions because it runs after GROUP BY.
  • You can filter on non-aggregated columns in WHERE (faster) or HAVING (slower, but sometimes necessary).

The trap interviewers love: 'Show me departments where the average salary exceeds 50,000.' That's HAVING AVG(salary) > 50000. Candidates who write WHERE AVG(salary) > 50000 get an immediate error ('aggregate functions are not allowed in WHERE') — and the interviewer watches to see if you know why.

Another trap: selecting columns not in the GROUP BY clause. In strict SQL mode (MySQL) or PostgreSQL, SELECT department, name, AVG(salary) FROM employees GROUP BY department fails because 'name' is not in the GROUP BY and is not aggregated. In older MySQL modes, it silently returns an arbitrary name from each group — which is almost never what you want. Always ensure every non-aggregated column in SELECT is in the GROUP BY.
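As a minimal sketch of that failure mode and its two fixes (PostgreSQL wording; the error text is abbreviated and approximate):

```sql
-- FAILS in PostgreSQL and MySQL strict mode: name is neither grouped nor aggregated
SELECT department, name, AVG(salary)
FROM io_thecodeforge.employees
GROUP BY department;
-- ERROR: column "employees.name" must appear in the GROUP BY clause
--        or be used in an aggregate function

-- FIX 1: add the column to GROUP BY (changes the grain: one row per dept+name)
SELECT department, name, AVG(salary) AS avg_salary
FROM io_thecodeforge.employees
GROUP BY department, name;

-- FIX 2: aggregate it (keeps one row per department)
SELECT department, MAX(name) AS sample_name, AVG(salary) AS avg_salary
FROM io_thecodeforge.employees
GROUP BY department;
```

Which fix is right depends on what the report actually needs: per-employee rows, or one summary row per department.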

io/thecodeforge/queries/group_by_having.sql · SQL
-- io.thecodeforge: GROUP BY and HAVING Patterns
-- Schema: employees(employee_id, name, department, salary, hire_date)
--          orders(order_id, customer_id, order_date, total_amount, status)

-- PATTERN 1: Basic aggregation with HAVING
-- 'Departments with average salary above 50,000'
SELECT
    department,
    COUNT(*) AS headcount,
    AVG(salary) AS avg_salary,
    MAX(salary) AS highest_salary
FROM io_thecodeforge.employees
GROUP BY department
HAVING AVG(salary) > 50000
ORDER BY avg_salary DESC;

-- PATTERN 2: WHERE vs HAVING — use WHERE for non-aggregate filters
-- 'Departments with avg salary > 50k, but only for employees hired after 2020'
-- WRONG: HAVING hire_date > '2020-01-01' — works but inefficient
-- RIGHT: Filter rows first with WHERE, then aggregate
SELECT
    department,
    AVG(salary) AS avg_salary
FROM io_thecodeforge.employees
WHERE hire_date > '2020-01-01'   -- Filters BEFORE grouping (faster)
GROUP BY department
HAVING AVG(salary) > 50000;      -- Filters AFTER grouping

-- PATTERN 3: Multi-column GROUP BY with COUNT DISTINCT
-- 'Revenue per customer per year, only for completed orders'
SELECT
    customer_id,
    EXTRACT(YEAR FROM order_date) AS order_year,
    COUNT(*) AS order_count,
    SUM(total_amount) AS total_revenue,
    COUNT(DISTINCT EXTRACT(MONTH FROM order_date)) AS active_months
FROM io_thecodeforge.orders
WHERE status = 'completed'
GROUP BY customer_id, EXTRACT(YEAR FROM order_date)
HAVING SUM(total_amount) > 10000
ORDER BY total_revenue DESC;
▶ Output
-- PATTERN 1 output:
| department | headcount | avg_salary | highest_salary |
|-------------|----------|------------|----------------|
| Engineering | 45 | 92000.00 | 185000 |
| Data | 12 | 87000.00 | 145000 |
| Product | 8 | 78000.00 | 130000 |
-- Finance, HR, Marketing excluded (avg salary <= 50,000)

-- PATTERN 3 output:
| customer_id | order_year | order_count | total_revenue | active_months |
|-------------|-----------|-------------|---------------|---------------|
| 101 | 2025 | 47 | 23450.00 | 11 |
| 102 | 2025 | 32 | 18900.00 | 9 |
| 103 | 2025 | 28 | 12400.00 | 8 |
⚠ Non-Aggregated Columns in SELECT — The Silent Bug:
If you SELECT a column that isn't in the GROUP BY and isn't wrapped in an aggregate function, PostgreSQL throws an error. Older MySQL versions silently return an arbitrary value from each group — which is almost never what you want. Always ensure every column in SELECT is either in GROUP BY or inside an aggregate function (SUM, COUNT, AVG, MAX, MIN). This is one of the most common sources of 'my report numbers are wrong' bugs.

Subqueries vs CTEs vs Temp Tables: When to Use What

Interviewers ask: 'What's the difference between a subquery, a CTE, and a temp table?' The answer is about readability, reusability, and performance.

A subquery is a query nested inside another query. It works, but deeply nested subqueries become unreadable fast. If you have a subquery inside a subquery inside a WHERE clause, nobody — including future you — can debug it at 2am during an incident.

A CTE (Common Table Expression) with the WITH clause gives a subquery a name. It reads top-to-bottom, can be referenced multiple times, and makes complex queries self-documenting. In PostgreSQL 12+ and SQL Server, the optimizer treats CTEs as inline views (not materialized temp tables) unless you explicitly request materialization (in PostgreSQL, via the MATERIALIZED keyword) — so there's usually no performance difference from a subquery.
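In PostgreSQL 12+ you can control this behavior per CTE. A sketch (the MATERIALIZED / NOT MATERIALIZED keywords are PostgreSQL-specific; table names follow this article's schema):

```sql
-- PostgreSQL 12+ sketch: force the CTE to be computed once and stored
WITH expensive AS MATERIALIZED (
    SELECT customer_id, SUM(total_amount) AS total_spend
    FROM io_thecodeforge.orders
    GROUP BY customer_id
)
SELECT * FROM expensive WHERE total_spend > 10000;

-- With NOT MATERIALIZED (or by default, when referenced once),
-- the optimizer may inline the CTE and push the total_spend
-- predicate down into the aggregation.
```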

A temp table physically stores intermediate results. Use it when you need to reference the same intermediate result many times, when the intermediate result is expensive to compute, or when you need to index the intermediate result. The downside: it's more I/O and the optimizer can't push predicates through it.

My rule of thumb: CTEs for readability. Subqueries for simple one-off filters. Temp tables for heavy multi-step transformations where you reference the intermediate result 5+ times or need to index it.

io/thecodeforge/queries/subquery_vs_cte.sql · SQL
-- io.thecodeforge: Subquery vs CTE — Same Query, Different Styles
-- Task: Find customers whose total spend exceeds the overall average

-- APPROACH 1: Correlated Subquery (harder to read)
SELECT
    c.customer_name,
    (SELECT SUM(o.total_amount)
     FROM io_thecodeforge.orders o
     WHERE o.customer_id = c.customer_id) AS total_spend
FROM io_thecodeforge.customers c
WHERE (SELECT SUM(o.total_amount)
       FROM io_thecodeforge.orders o
       WHERE o.customer_id = c.customer_id)
      > (SELECT AVG(customer_total)
         FROM (SELECT SUM(total_amount) AS customer_total
               FROM io_thecodeforge.orders
               GROUP BY customer_id) sub);

-- APPROACH 2: CTE (clean, readable, self-documenting)
WITH customer_spending AS (
    SELECT
        customer_id,
        SUM(total_amount) AS total_spend
    FROM io_thecodeforge.orders
    GROUP BY customer_id
),
average_spending AS (
    SELECT AVG(total_spend) AS avg_spend
    FROM customer_spending
)
SELECT
    c.customer_name,
    cs.total_spend
FROM io_thecodeforge.customers c
JOIN customer_spending cs ON c.customer_id = cs.customer_id
CROSS JOIN average_spending av
WHERE cs.total_spend > av.avg_spend
ORDER BY cs.total_spend DESC;

-- APPROACH 3: Temp table (best when intermediate result is reused many times)
CREATE TEMP TABLE customer_spending_tmp AS
SELECT
    customer_id,
    SUM(total_amount) AS total_spend,
    COUNT(*) AS order_count
FROM io_thecodeforge.orders
GROUP BY customer_id;

-- Now reference it multiple times without recomputing
SELECT
    c.customer_name,
    cs.total_spend,
    cs.order_count
FROM io_thecodeforge.customers c
JOIN customer_spending_tmp cs ON c.customer_id = cs.customer_id
WHERE cs.total_spend > (SELECT AVG(total_spend) FROM customer_spending_tmp)
ORDER BY cs.total_spend DESC;

DROP TABLE IF EXISTS customer_spending_tmp;
▶ Output
-- All three approaches produce the same result:
| customer_name | total_spend |
|---------------|-------------|
| Alice Forge | 23450.00 |
| Bob Smith | 18900.00 |
-- Carol Wu and Dave Patel excluded (below average spend)

-- Performance note: CTE and subquery have similar execution plans.
-- Temp table is faster when the intermediate result is referenced 5+ times
-- because it avoids recomputation — but adds I/O overhead for small datasets.
💡Interview Answer Framework:
When asked 'subquery vs CTE vs temp table,' structure your answer: 'CTEs improve readability and can be referenced multiple times without duplication. Subqueries are fine for simple one-off filters. Temp tables are best when the intermediate result is expensive, referenced many times, or needs indexing. In practice, I default to CTEs and escalate to temp tables when performance profiling shows the intermediate computation is the bottleneck.'

Indexing and SARGability: Why Your Query is Slow

A common interview question is: 'Why isn't my index being used?' This usually leads to the concept of SARGable (Search ARGument-able) queries. If you wrap an indexed column in a function (like WHERE YEAR(created_at) = 2026), the database engine cannot use the index effectively, leading to a full table scan. Instead, provide a range the engine can 'seek' into.

I've seen this exact pattern kill production performance. A reporting query on a 14-million-row orders table was taking 45 seconds. The developer had WHERE YEAR(order_date) = 2025 AND MONTH(order_date) = 3. The index on order_date was completely ignored because YEAR() and MONTH() wrapped the column. The fix: WHERE order_date >= '2025-03-01' AND order_date < '2025-04-01'. Same result, 12 milliseconds instead of 45 seconds. The index seek touched 40,000 rows instead of scanning all 14 million.

Beyond SARGability, understand covering indexes. If an index contains every column referenced in the query (both in WHERE and SELECT), the engine never touches the actual table — it reads everything from the index. This is called an 'index-only scan' and it's the fastest possible query path.

Clustered vs non-clustered: a clustered index IS the table — data is stored in the order of the clustered index (usually the primary key). A non-clustered index is a separate structure with pointers back to the data. Each table has exactly one clustered index but can have many non-clustered indexes.
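In SQL Server terms, the distinction shows up right in the DDL. A sketch (table and index names are illustrative; in practice the primary key usually creates the clustered index implicitly):

```sql
-- SQL Server sketch: the one clustered index defines the table's physical order
CREATE CLUSTERED INDEX ix_orders_id
ON dbo.orders (order_id);

-- Non-clustered indexes are separate structures whose leaf entries
-- point back to the clustered key; a table can have many of them
CREATE NONCLUSTERED INDEX ix_orders_customer_date
ON dbo.orders (customer_id, order_date);
```

A good interview line: every lookup through a non-clustered index that needs extra columns pays a second hop back to the clustered index — which is exactly why covering indexes exist.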

io/thecodeforge/optimization/indexing.sql · SQL
-- io.thecodeforge: SARGable vs Non-SARGable Queries
-- Schema: logs(log_id, created_at, level, message, service_name)
-- Index: idx_logs_created_at ON logs(created_at)

-- BAD: Index on created_at is IGNORED due to function wrapping
-- The engine must call DATE() on every row — full table scan
SELECT * FROM io_thecodeforge.logs
WHERE DATE(created_at) = '2026-03-06';

-- GOOD: Index seek enabled — engine jumps directly to matching range
SELECT * FROM io_thecodeforge.logs
WHERE created_at >= '2026-03-06 00:00:00'
  AND created_at < '2026-03-07 00:00:00';

-- BAD: Function on both sides — index cannot be used
SELECT * FROM io_thecodeforge.logs
WHERE LOWER(service_name) = 'payment-service';

-- GOOD: Store data in consistent case, or use a functional index
-- PostgreSQL: CREATE INDEX idx_service_lower ON logs(LOWER(service_name));
-- Then the LOWER() query becomes SARGable on that functional index.

-- COVERING INDEX example:
-- If your query is:
SELECT customer_id, order_date, total_amount
FROM io_thecodeforge.orders
WHERE customer_id = 101 AND order_date > '2025-01-01';

-- Create a covering index that includes ALL referenced columns:
CREATE INDEX idx_orders_covering
ON io_thecodeforge.orders(customer_id, order_date)
INCLUDE (total_amount);
-- Now the engine reads ONLY the index — never touches the table.
-- This is an 'index-only scan' — the fastest possible path.
▶ Output
-- Non-SARGable query:
Execution Plan: Seq Scan on logs | Rows: 14,000,000 | Time: 45,200ms

-- SARGable query:
Execution Plan: Index Scan using idx_logs_created_at | Rows: 40,000 | Time: 12ms

-- The difference: 3,750× faster by removing a function call.
💡Always Run EXPLAIN Before Optimizing:
Before changing a single line of SQL, run EXPLAIN (EXPLAIN ANALYZE in PostgreSQL; in SQL Server, inspect the plan via SET SHOWPLAN_ALL ON or the graphical execution plan in SSMS). It tells you whether the engine is doing a Seq Scan (full table — bad at scale), Index Scan (good), or Index-Only Scan (best). Without EXPLAIN, you're guessing. With EXPLAIN, you're diagnosing. I've never optimized a slow query without first reading its execution plan.

Query Optimization: Reading Execution Plans and Avoiding Common Traps

The ultimate interview question: 'How would you optimize a slow query?' Here's the systematic approach I use in production, and the one I want to hear from candidates.

Step 1: Run EXPLAIN. Look for Seq Scan on large tables, Nested Loop on large joins, and high row estimates. If the estimated rows are wildly different from actual rows, your table statistics are stale — run ANALYZE.
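Step 1, sketched in PostgreSQL terms (plan lines abbreviated and illustrative):

```sql
-- Sketch: compare the planner's ESTIMATED rows against ACTUAL rows
EXPLAIN ANALYZE
SELECT * FROM io_thecodeforge.orders
WHERE order_date >= '2025-03-01' AND order_date < '2025-04-01';

-- Look for plan lines of the form:
--   Index Scan using idx_orders_date ... (cost=... rows=41000 ...)
--                                        (actual time=... rows=40213 ...)
-- If estimated rows and actual rows diverge wildly, refresh statistics:
ANALYZE io_thecodeforge.orders;
```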

Step 2: Check for non-SARGable predicates. Functions on indexed columns, implicit type conversions, LIKE with leading wildcards.

Step 3: Check for missing indexes. If the WHERE clause filters on columns without indexes, add them. If the JOIN uses columns without indexes, add them.

Step 4: Check for SELECT *. You're pulling every column from disk when you need three. Use a covering index or select only required columns.

Step 5: Check for N+1 queries. If your application runs one query to get IDs, then one query per ID, that's N+1. Rewrite as a single JOIN or IN clause.
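The N+1 rewrite, sketched (the per-ID loop lives in application code; the fix is one set-based query against this article's schema):

```sql
-- N+1 anti-pattern, conceptually: one query for IDs, then one query PER id
--   SELECT customer_id FROM io_thecodeforge.orders WHERE status = 'pending';
--   ...then, in application code, for each id:
--   SELECT * FROM io_thecodeforge.customers WHERE customer_id = ?;

-- Set-based rewrite: one round trip, one join
SELECT c.customer_id, c.customer_name, o.order_id, o.total_amount
FROM io_thecodeforge.customers c
JOIN io_thecodeforge.orders o ON o.customer_id = c.customer_id
WHERE o.status = 'pending';
```

The win isn't just fewer queries; it's one network round trip and one plan instead of N of each.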

Step 6: Check for DISTINCT abuse. DISTINCT is expensive — it requires sorting or hashing all results to remove duplicates. If you're using DISTINCT to hide duplicate rows caused by a bad JOIN, fix the JOIN instead.

The implicit conversion trap is worth calling out explicitly: comparing a VARCHAR column to an integer literal (WHERE varchar_col = 123) forces the engine to convert every row's value to an integer before comparing. This prevents index usage and causes a full table scan. Always match types: WHERE varchar_col = '123'.

io/thecodeforge/optimization/query_optimization.sql · SQL
-- io.thecodeforge: Query Optimization Patterns

-- TRAP 1: Implicit type conversion — index ignored
-- phone_number is VARCHAR, but we compare to an integer
-- The engine converts EVERY row's phone_number to int before comparing
SELECT * FROM io_thecodeforge.customers
WHERE phone_number = 9876543210;  -- BAD: implicit conversion

SELECT * FROM io_thecodeforge.customers
WHERE phone_number = '9876543210';  -- GOOD: type match, index used

-- TRAP 2: DISTINCT hiding a bad JOIN
-- If you're getting duplicates and adding DISTINCT to 'fix' it,
-- your JOIN condition is wrong.
SELECT DISTINCT c.customer_name, o.total_amount
FROM io_thecodeforge.customers c
JOIN io_thecodeforge.orders o ON c.customer_id = o.customer_id;
-- If this returns duplicates, check: are there multiple orders per customer?
-- If yes, DISTINCT is removing legitimate rows. If no, the JOIN is wrong.

-- TRAP 3: Choosing the wrong subquery form
-- Note: the IN subquery below is NOT correlated — it runs once, but the
-- engine materializes its full result set before checking membership.
SELECT * FROM io_thecodeforge.customers c
WHERE c.customer_id IN (
    SELECT customer_id FROM io_thecodeforge.orders
    WHERE total_amount > 1000
);
-- Better here: EXISTS — correlated, but short-circuits on the first match per outer row
SELECT * FROM io_thecodeforge.customers c
WHERE EXISTS (
    SELECT 1 FROM io_thecodeforge.orders o
    WHERE o.customer_id = c.customer_id
    AND o.total_amount > 1000
);
-- Alternative: JOIN with DISTINCT if you only need customer columns
-- (DISTINCT is legitimate here — one customer can match many qualifying orders)
SELECT DISTINCT c.*
FROM io_thecodeforge.customers c
JOIN io_thecodeforge.orders o ON c.customer_id = o.customer_id
WHERE o.total_amount > 1000;

-- TRAP 4: LIKE with leading wildcard — index cannot be used
SELECT * FROM io_thecodeforge.logs
WHERE message LIKE '%timeout%';  -- Full table scan, every time

-- If you need full-text search, use a proper full-text index:
-- PostgreSQL: CREATE INDEX idx_message_fts ON logs USING gin(to_tsvector('english', message));
-- Then: WHERE to_tsvector('english', message) @@ to_tsquery('timeout');
▶ Output
-- TRAP 1 execution plan difference:
-- Integer comparison on VARCHAR:
Execution Plan: Seq Scan on customers | Rows: 500,000 | Time: 850ms

-- String comparison on VARCHAR:
Execution Plan: Index Scan using idx_phone | Rows: 1 | Time: 0.5ms

-- 1,700× slower due to implicit conversion.
⚠ EXISTS vs IN — Performance Matters at Scale:
For small subquery results, IN and EXISTS perform similarly. With large subquery result sets, EXISTS is often faster because it short-circuits — it stops scanning as soon as it finds one match — while IN materializes the entire subquery result into a set before checking membership. If your subquery returns 100,000 rows, EXISTS might find a match on the first row while IN builds a 100,000-element hash set first. That said, modern optimizers frequently rewrite both forms into the same semi-join plan, so verify with EXPLAIN rather than assuming. In an interview, say: 'I'd default to EXISTS for correlated checks and IN for small static lists — and check the plan.'
| Feature | Clustered Index | Non-Clustered Index |
| --- | --- | --- |
| Data storage | The table IS the index — rows are stored in index order | Separate structure with pointers back to the data pages |
| Count per table | Only 1 (usually the primary key) | Multiple (limited by engine, typically dozens) |
| Performance | Faster for range scans and ORDER BY on the key | Faster for targeted single-row lookups |
| Leaf nodes | The actual data pages (the table rows themselves) | Pointers to the clustered index key or row ID |
| When to use | Primary key; columns used in range queries and sorting | Columns frequently in WHERE, JOIN, or ORDER BY that aren't the primary key |
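One practical consequence of the leaf-node difference: when an index contains every column a query selects, the engine can answer the query from the index alone — an index-only scan. SQLite makes this visible in its plan output. A sketch using Python's built-in sqlite3 (index and table names are made up for illustration):

```python
# SQLite reports an index-only scan as a "COVERING INDEX" in its query plan.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         total REAL, note TEXT);
    CREATE INDEX idx_cust_total ON orders (customer_id, total);
""")

# The query touches only customer_id and total — both live in the index
plan = " ".join(row[-1] for row in conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT customer_id, total FROM orders WHERE customer_id = 7"
))
print("COVERING INDEX" in plan)  # the table itself is never read
```

Add `note` to the SELECT list and the plan falls back to a plain index search plus a table lookup per row — the difference the table's "leaf nodes" row describes.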

🎯 Key Takeaways

  • JOINs are not Venn diagrams — they're row-combination operations. In a LEFT JOIN, the ON clause controls which right-side rows participate; the WHERE clause eliminates rows from the final result. Mixing them up silently converts LEFT JOINs into INNER JOINs.
  • NULL is the absence of a value, not a value itself. NULL != NULL, any arithmetic with NULL produces NULL, and comparison operators with NULL produce UNKNOWN. Use IS NULL, COALESCE(), and COUNT(*) vs COUNT(column) correctly.
  • Window functions (ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG) add computed columns without collapsing rows. They're the correct answer for 'top N per group' and 'running total' questions — not correlated subqueries.
  • SARGable queries let the engine use index seeks instead of full table scans. Never wrap indexed columns in functions (YEAR, MONTH, DATE, LOWER). Provide ranges instead: WHERE date >= X AND date < Y.
  • Covering indexes contain every column a query needs, enabling index-only scans — the fastest possible query path. Design indexes around your most frequent and expensive queries.
  • EXISTS short-circuits on the first match; IN materializes the entire subquery result. For large subquery result sets, EXISTS is usually faster. For small static lists, IN is more readable.
  • GROUP BY collapses rows; HAVING filters after aggregation; WHERE filters before. Every non-aggregated column in SELECT must be in GROUP BY or wrapped in an aggregate function.
  • CTEs improve readability and can be referenced multiple times. Subqueries are fine for simple one-off filters. Temp tables win when intermediate results are expensive and referenced many times.
  • Always run EXPLAIN before optimizing. Look for Seq Scans, high row estimates, and implicit type conversions. Without execution plans, you're guessing.
  • SELECT * prevents covering indexes, increases I/O, and breaks backward compatibility. Always specify only the columns you need.

⚠ Common Mistakes to Avoid

    Using SELECT * in production queries — This pulls every column from disk, increases I/O overhead, and prevents the use of covering indexes. In a table with 50 columns where you need 3, SELECT * reads 16× more data than necessary. It also breaks backward compatibility — if someone adds a column to the table, your application code that depends on column order or count can silently break. Always specify only the columns you need.


    Ignoring NULL logic — NULL is not equal to anything, including another NULL. `WHERE column = NULL` returns zero rows. `WHERE column != 5` excludes NULLs silently. Use IS NULL / IS NOT NULL, and wrap nullable columns in COALESCE for calculations. I've seen revenue reports undercount by 15% because SUM(price * quantity) silently dropped rows where quantity was NULL.
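Every one of these NULL behaviors is quick to verify with Python's built-in sqlite3 (the sales table is illustrative):

```python
# Demonstrating the NULL pitfalls: = NULL, COUNT(*) vs COUNT(col), SUM with NULL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (price REAL, quantity INTEGER);
    INSERT INTO sales VALUES (10.0, 2), (5.0, NULL), (3.0, 4);
""")

# '= NULL' never matches — the comparison yields UNKNOWN, not TRUE
eq_null = conn.execute("SELECT COUNT(*) FROM sales WHERE quantity = NULL").fetchone()[0]
is_null = conn.execute("SELECT COUNT(*) FROM sales WHERE quantity IS NULL").fetchone()[0]

# COUNT(*) counts rows; COUNT(column) silently skips NULLs
count_star, count_qty = conn.execute(
    "SELECT COUNT(*), COUNT(quantity) FROM sales").fetchone()

# price * NULL is NULL, and SUM drops NULL terms — the 5.0 row vanishes
naive = conn.execute("SELECT SUM(price * quantity) FROM sales").fetchone()[0]

print(eq_null, is_null)       # 0 1
print(count_star, count_qty)  # 3 2
print(naive)                  # 32.0 — not 32.0 + anything for the NULL row
```

Wrapping the nullable column in COALESCE(quantity, 0) makes the "treat missing as zero" decision explicit instead of accidental.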


Failing to check execution plans — Without EXPLAIN, you're guessing why a query is slow. I've never optimized a slow query without first reading its execution plan. Run EXPLAIN ANALYZE (PostgreSQL), EXPLAIN (MySQL), or capture the actual execution plan in SQL Server (SET STATISTICS PROFILE ON, or 'Include Actual Execution Plan' in SSMS) and look for Seq Scans on large tables, wildly off row estimates, and missing index usage.


    Implicit type conversions destroying index usage — Comparing a VARCHAR column to an integer literal (`WHERE phone_number = 12345`) forces the engine to convert every row before comparing. This prevents index usage and causes a full table scan. Always match the column's actual type: `WHERE phone_number = '12345'`. This 1,700× performance difference has caused production incidents I've personally debugged.


    Using DISTINCT to hide bad JOINs — If your query returns unexpected duplicates and you slap DISTINCT on it, you're treating the symptom, not the disease. The JOIN condition is likely wrong (missing a join key, or joining on a one-to-many relationship you didn't expect). Fix the JOIN. DISTINCT requires sorting or hashing all results, which is expensive on large datasets.

    Correlated subqueries in WHERE causing O(n²) behavior — A subquery that references the outer query's columns runs once per outer row. On a 100,000-row table, that's 100,000 subquery executions. Rewrite with EXISTS (short-circuits) or a JOIN (set-based, single pass). In production, I've seen correlated subqueries turn a 200ms query into a 45-minute query.


    Putting filters in the WHERE clause of a LEFT JOIN that belong in the ON clause — A WHERE filter on the right table after a LEFT JOIN silently converts it to an INNER JOIN, removing all left-side rows that had no match. This has caused wrong financial reports, missing customer lists, and broken dashboards in production. Always verify: does every left-table row appear when expected?
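This trap is easy to reproduce. A minimal sqlite3 sketch (table and status values are illustrative):

```python
# LEFT JOIN silently becomes an INNER JOIN when the right table is
# filtered in WHERE instead of ON.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 'shipped');  -- Grace has no orders
""")

# WHERE on the right table discards Grace's NULL-extended row
broken = conn.execute("""
    SELECT c.name FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.status = 'shipped'
""").fetchall()

# Filter in the ON clause: every customer survives, matched or not
correct = conn.execute("""
    SELECT c.name FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id AND o.status = 'shipped'
""").fetchall()

print(broken)   # [('Ada',)]           -- Grace is gone
print(correct)  # [('Ada',), ('Grace',)]
```

The sanity check from the paragraph above, in code form: the "correct" result contains every left-table row; the "broken" one doesn't.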


Interview Questions on This Topic

  • Q: Explain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN. When would you use each one? Give a real-world example where choosing the wrong JOIN type would produce incorrect business metrics.
  • Q: What is the difference between WHERE and HAVING? Can you use aggregate functions in WHERE? Why or why not? What is the logical processing order of a SQL query?
  • Q: Explain the difference between ROW_NUMBER(), RANK(), and DENSE_RANK(). Given a table of employee salaries, how would you find the second-highest salary per department using each function? Which would you choose and why?
  • Q: What is a SARGable query? Give an example of a non-SARGable query and explain how to rewrite it. What happens to the execution plan when you wrap an indexed column in a function?
  • Q: What is the difference between EXISTS and IN? When would you prefer one over the other? What happens to performance when the subquery result set is very large?
  • Q: Explain NULL behavior in SQL. Why does WHERE column = NULL return zero rows? What is the difference between COUNT(*) and COUNT(column)? How does NULL affect arithmetic operations and aggregate functions?
  • Q: A query runs in 200ms on a table with 10,000 rows but takes 45 seconds on a table with 10 million rows. Walk me through how you would diagnose and optimize it. What would you check first?
  • Q: What is a covering index? How does it differ from a regular index? What is an 'index-only scan' and why is it the fastest query path?
  • Q: Explain the difference between a subquery, a CTE, and a temporary table. When would you choose each approach? Does a CTE always materialize a temp table, or is it an inline view?
  • Q: What is the difference between DELETE, TRUNCATE, and DROP? When would you use each? What happens to the transaction log in each case?

Frequently Asked Questions

What is the difference between WHERE and HAVING in SQL?

WHERE filters individual rows before any grouping or aggregation occurs — it runs early in the logical processing order. HAVING filters groups after GROUP BY has collapsed rows — it can use aggregate functions like SUM, COUNT, and AVG because it runs after aggregation. You cannot use WHERE AVG(salary) > 50000 because WHERE runs before GROUP BY, so no aggregate has been computed yet. Use HAVING AVG(salary) > 50000. For non-aggregate filters, always prefer WHERE — it's more efficient because it reduces the number of rows before the expensive GROUP BY operation.
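The ordering is easy to see in a runnable example using Python's sqlite3 (the sample data is made up):

```python
# WHERE prunes rows before GROUP BY; HAVING filters the aggregated groups after.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('eng', 90000), ('eng', 60000), ('sales', 40000), ('sales', 45000);
""")

rows = conn.execute("""
    SELECT dept, AVG(salary) FROM employees
    WHERE salary > 30000            -- row-level filter, runs before grouping
    GROUP BY dept
    HAVING AVG(salary) > 50000      -- group-level filter, runs after aggregation
""").fetchall()
print(rows)  # [('eng', 75000.0)] — sales (avg 42,500) fails the HAVING filter
```

Swapping the HAVING condition into WHERE would be a syntax error in most engines, because no aggregate exists at the point WHERE runs.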

What are Window Functions and why are they useful?

Window functions (ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG, SUM OVER) perform calculations across a set of related rows without collapsing them into a single output row. Unlike GROUP BY, which produces one summary row per group, window functions retain every individual row and add computed columns. This is essential for patterns like 'top N per group,' 'running totals,' 'month-over-month comparison,' and 'rank within a partition.' The syntax uses OVER (PARTITION BY column ORDER BY column) to define the window of rows each calculation considers.
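Here's a 'top 1 per group' sketch with ROW_NUMBER, using Python's sqlite3 (requires SQLite 3.25+, which ships with Python 3.8+ on most platforms; the data is illustrative):

```python
# Top earner per department via a window function — no correlated subquery.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ada', 'eng', 90000), ('Bob', 'eng', 60000),
        ('Cid', 'sales', 40000), ('Dee', 'sales', 45000);
""")

top_per_dept = conn.execute("""
    SELECT name, dept, salary FROM (
        SELECT name, dept, salary,
               ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
        FROM employees
    ) WHERE rn = 1
""").fetchall()
print(top_per_dept)  # Ada tops eng, Dee tops sales — every department kept
```

Change rn = 1 to rn = 2 and you have the classic 'second-highest salary per department' interview answer.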

What is Database Normalization and why do we do it?

Normalization organizes data to reduce redundancy and improve integrity. 1NF eliminates repeating groups (each cell holds a single value). 2NF removes partial dependencies (every non-key column depends on the entire primary key). 3NF removes transitive dependencies (non-key columns don't depend on other non-key columns). In practice, most production databases target 3NF. Denormalization (intentionally violating normalization) is used for read-heavy workloads like analytics and reporting, where JOIN performance matters more than storage efficiency.

What is the difference between EXISTS and IN?

Both check whether a value appears in a subquery result set, but they can behave differently at scale. EXISTS runs the subquery for each outer row and short-circuits on the first match — it stops scanning as soon as it finds one matching row. IN materializes the entire subquery result into a set, then checks membership. For large subquery results (100,000+ rows), EXISTS is often faster because it may find a match on the first row — though modern optimizers frequently rewrite both forms into the same semi-join plan, so verify with EXPLAIN. For small static lists (WHERE status IN ('active', 'pending')), IN is more readable. In an interview, say: 'I default to EXISTS for correlated checks and IN for small literal lists.'

What is the difference between a Primary Key and a Foreign Key?

A Primary Key uniquely identifies each row in a table — it cannot be NULL and must be unique. A table has exactly one primary key. A Foreign Key is a column (or set of columns) in one table that references the primary key of another table, establishing a relationship between them. Foreign keys enforce referential integrity — you cannot insert an order with a customer_id that doesn't exist in the customers table, and you cannot delete a customer who has orders (unless you use CASCADE). In an interview, mention that foreign keys also help the optimizer choose better join strategies because they guarantee referential integrity.
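Referential integrity in action, using Python's sqlite3 — note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON (the schema is illustrative):

```python
# A foreign key rejects an order pointing at a customer that doesn't exist.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite!
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id)
    );
    INSERT INTO customers VALUES (1);
""")

conn.execute("INSERT INTO orders VALUES (10, 1)")  # OK — parent row exists
try:
    conn.execute("INSERT INTO orders VALUES (11, 999)")  # no such customer
except sqlite3.IntegrityError as e:
    print(e)  # FOREIGN KEY constraint failed
```

The same mechanism blocks deleting customer 1 while order 10 references it, unless the constraint declares ON DELETE CASCADE.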

What are Transactions and ACID properties?

A transaction is a sequence of operations treated as a single logical unit — either all operations succeed (COMMIT) or none do (ROLLBACK). ACID stands for Atomicity (all-or-nothing execution), Consistency (the database moves from one valid state to another), Isolation (concurrent transactions don't interfere with each other), and Durability (committed data survives system failures). In production, transactions prevent partial updates — like debiting one account but failing to credit another. Without transactions, every error is a potential data corruption event.
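A minimal all-or-nothing sketch using Python's sqlite3, where a CHECK constraint forces the rollback (the account data is made up):

```python
# Atomicity: the failed debit rolls back the whole transfer, not half of it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY,
                           balance INTEGER NOT NULL CHECK (balance >= 0));
    INSERT INTO accounts VALUES (1, 100), (2, 50);
""")

try:
    with conn:  # the connection as context manager = one transaction
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
except sqlite3.IntegrityError:
    pass  # CHECK fails on the debit — everything in the block rolls back

print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())
# [(100,), (50,)] — no partial update survived
```

Without the transaction, a crash between the two UPDATEs would leave money debited but never credited — exactly the corruption ACID exists to prevent.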

What is a CTE (Common Table Expression) and when should you use it?

A CTE is a named temporary result set defined with the WITH clause that exists only for the duration of the query. It improves readability by giving meaningful names to intermediate steps, can be referenced multiple times in the same query, and makes complex queries self-documenting. Contrary to popular belief, CTEs are usually inlined as subqueries by the optimizer — SQL Server has always done this, and PostgreSQL does since version 12 (for CTEs referenced once) — so they don't create temp tables unless you force it with MATERIALIZED in PostgreSQL. Use CTEs when you need to reference the same intermediate result multiple times or when a query has more than two levels of nesting that would be unreadable as plain subqueries.
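A small CTE example using Python's sqlite3 (table, data, and the 100 threshold are illustrative):

```python
# A CTE gives the intermediate aggregation a name the outer query can filter on.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, total REAL);
    INSERT INTO orders VALUES (1, 120.0), (1, 30.0), (2, 500.0), (3, 80.0);
""")

big_spenders = conn.execute("""
    WITH customer_totals AS (
        SELECT customer_id, SUM(total) AS lifetime_value
        FROM orders
        GROUP BY customer_id
    )
    SELECT customer_id, lifetime_value
    FROM customer_totals
    WHERE lifetime_value > 100
    ORDER BY lifetime_value DESC
""").fetchall()
print(big_spenders)  # [(2, 500.0), (1, 150.0)]
```

The same logic as a nested subquery works, but 'customer_totals' documents itself; three or four such steps chained as CTEs stay readable where nested subqueries would not.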

What is the difference between UNION and UNION ALL?

UNION combines results from two queries and removes duplicate rows — it requires a sort or hash operation to deduplicate, which is expensive on large datasets. UNION ALL combines results and keeps all rows including duplicates — it's a simple concatenation with no deduplication overhead. In production, almost always use UNION ALL unless you specifically need deduplication. I've seen queries go from 12 seconds to 200 milliseconds just by switching UNION to UNION ALL when the developer didn't realize duplicates were impossible given the query logic.
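The deduplication difference in one sqlite3 sketch (the city data is made up):

```python
# UNION pays a dedup pass; UNION ALL is plain concatenation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE q1 (city TEXT);
    CREATE TABLE q2 (city TEXT);
    INSERT INTO q1 VALUES ('Pune'), ('Delhi');
    INSERT INTO q2 VALUES ('Delhi'), ('Mumbai');
""")

union = conn.execute(
    "SELECT city FROM q1 UNION SELECT city FROM q2").fetchall()
union_all = conn.execute(
    "SELECT city FROM q1 UNION ALL SELECT city FROM q2").fetchall()

print(len(union), len(union_all))  # 3 4 — UNION dropped the duplicate 'Delhi'
```

On four rows the dedup cost is nothing; on millions of rows it's a full sort or hash of the combined result, which is why UNION ALL is the default choice when duplicates are impossible or acceptable.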

🔥 Naren — Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

Next →DBMS Interview Questions
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged