SQL interviews test engine behavior, not syntax memorization.
LEFT JOIN: ON filter controls join participation; WHERE eliminates rows.
NULL: not a value; use IS NULL and COALESCE for safety.
Window functions: ROW_NUMBER vs RANK vs DENSE_RANK distinguish senior candidates.
Indexes: function-wrapped columns (YEAR(date)) disable index seeks.
Optimization: EXPLAIN ANALYZE reveals full scans, missing indexes, implicit conversions.
✦ Definition~90s read
What is Top SQL Interview Questions?
This article dissects a classic SQL interview trap: using a LEFT JOIN with a WHERE filter that inadvertently converts it into an INNER JOIN, silently dropping rows with NULLs from the right table. It’s a mistake that surfaces in production when you’re trying to find missing customers or orphaned records, and it exposes a fundamental misunderstanding of how SQL evaluates predicates.
★
Think of a database like a giant office filing system with dozens of labeled cabinets.
The WHERE clause filters after the JOIN, so any condition on the right table’s columns (e.g., WHERE orders.amount > 100) excludes rows where that column is NULL—which is exactly what a LEFT JOIN preserves. The fix is to move such filters into the JOIN’s ON clause or use an explicit IS NULL check.
This isn’t just academic; it’s a bug that costs real money in analytics pipelines, and interviewers use it to separate engineers who memorize syntax from those who understand execution order. The article goes beyond the Venn diagram trap, covering NULL handling, window functions for ranking missing records, GROUP BY/HAVING for aggregation, and when to reach for CTEs over subqueries or temp tables.
If you’ve ever wondered why your LEFT JOIN returned fewer rows than expected, or you’re prepping for a senior-level SQL interview, this is the deep dive that saves you from the silent drop.
Plain-English First
Think of a database like a giant office filing system with dozens of labeled cabinets. SQL is the language you use to ask the office assistant to find, combine, and summarize files across all those cabinets. An interview tests whether you truly understand WHY the assistant needs specific instructions — not just which words to say. A JOIN is like asking the assistant to staple two related files together before handing them to you; an index is like a sticky-tab system that lets them find files in seconds instead of rifling through every drawer; and a window function is like asking the assistant to rank every file by importance while still handing you every single file — not just the top one.
But here's what most tutorials won't tell you: the assistant also makes mistakes if your instructions are ambiguous. Tell them to find 'all files where the date is NULL' and they might skip files you expected. Tell them to combine two cabinets but forget to specify how, and they'll staple every file from cabinet A to every file from cabinet B — producing 10,000 stapled pairs when you wanted 50. That's what a Cartesian product is, and it's killed more production databases than any other SQL mistake.
SQL interviews trip up even experienced developers — not because the syntax is hard, but because interviewers probe whether you understand what happens inside the database engine when your query runs. The difference between a candidate who gets hired and one who doesn't usually comes down to three words: 'and why that?'
I've interviewed over 60 candidates for backend and data engineering roles. The pattern is always the same: 80% can write a SELECT with a JOIN. Maybe 30% can explain why their JOIN is slow. Maybe 10% can look at an execution plan and tell me the optimizer chose a Hash Join when it should have used a Nested Loop. That 10% gets the offer.
Most people can write a SELECT statement; very few can explain why their query is scanning 2 million rows when it should be touching 50. I once inherited a reporting service that took 45 seconds to load a dashboard. The fix was one index and a rewritten subquery. The previous developer had been 'optimizing' by adding more RAM to the server for six months. The query was doing a full table scan on a 14-million-row orders table because someone wrapped an indexed column in YEAR() — making it non-SARGable. One line of SQL, six months of wasted infrastructure spend.
This matters because slow queries cost money. A poorly written JOIN on a 10-million-row orders table can turn a 20ms API response into a 12-second timeout. Knowing the difference between a clustered and non-clustered index, or understanding why a LEFT JOIN returns NULLs you didn't expect, is the difference between shipping reliable software and debugging production fires at 2am.
By the end of this article you'll be able to explain JOINs, NULLs, window functions, indexes, subqueries vs CTEs, GROUP BY, and query optimization at a level that earns the interviewer's respect — not just recognition. Every concept comes with a real schema, runnable SQL, actual output, and the exact reasoning an interviewer wants to hear.
Why LEFT JOIN with WHERE Filters Silently Drops Rows
A LEFT JOIN returns all rows from the left table, with matching rows from the right table — or NULLs where no match exists. The trap: placing a WHERE condition on a column from the right table converts the LEFT JOIN into an INNER JOIN, because NULL comparisons (e.g., WHERE right.id = 5) evaluate to UNKNOWN and filter out non-matching rows. This is not a bug; it's how SQL's three-valued logic works. The filter is applied after the join, so any row that didn't match gets a NULL in that column, and the WHERE clause rejects it. To preserve left-side rows while filtering the right side, move the condition into the ON clause: ON left.id = right.id AND right.status = 'active'. This keeps unmatched rows intact with NULLs for right-side columns. In practice, this mistake causes silent data loss — reports show fewer rows than expected, and debugging takes hours because the query looks correct at first glance. Senior engineers internalize that WHERE filters on the outer table's columns are always applied post-join, and the ON clause is the correct place for pre-join filters on the inner table.
The NULL Trap
A WHERE clause on a right-table column turns LEFT JOIN into INNER JOIN — NULLs from non-matches fail the comparison and get dropped.
Production Insight
A daily sales report using LEFT JOIN on orders with WHERE order_date = '2024-01-01' silently excluded customers with no orders on that date, undercounting by 12%. The symptom: totals matched no other system. The rule: if you need all left rows, put right-table filters in the ON clause, not WHERE.
Key Takeaway
WHERE on right-table columns converts LEFT JOIN to INNER JOIN.
Move right-table filters to the ON clause to preserve left rows.
Test with a known non-match row to verify join behavior.
thecodeforge.io
LEFT JOIN WHERE Filter: Missing Customers
Top Sql Interview Questions
Understanding JOINs: Beyond the Venn Diagram
Interviewers often start with JOINs. While Venn diagrams are a popular starting point, they're misleading for senior candidates because SQL JOINs are not set operations — they're row-combination operations. A LEFT JOIN doesn't 'subtract' anything; it preserves every row from the left table and fills in NULLs when there's no match on the right.
Senior candidates must discuss the logical processing of a JOIN. The database engine decides between three physical operators: Nested Loop (best for small tables), Hash Join (best for large unsorted tables), and Merge Join (best when both inputs are sorted on the join key). Knowing when each is chosen and why shows you understand the engine, not just the syntax.
The most common interview trap: putting a filter in the ON clause vs the WHERE clause of a LEFT JOIN. They produce completely different results. A filter in ON controls which right-side rows participate in the join. A filter in WHERE eliminates rows from the final result — including left-side rows that had no match. This distinction is the single most missed concept in SQL interviews.
I once caught a production bug where a developer filtered on the right table's status column in the WHERE clause of a LEFT JOIN. It silently converted the LEFT JOIN into an INNER JOIN — every customer without an active order disappeared from the report. The finance team made quarterly projections based on that report. The projections were wrong for three months before anyone noticed.
io/thecodeforge/queries/customer_orders.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
-- io.thecodeforge: LEFT JOIN — ON vs WHERE filter behavior-- Schema: customers(customer_id, customer_name)-- orders(order_id, customer_id, order_date, total_amount, status)-- QUERY 1: Filter in ON clause-- Preserves ALL customers. Only joins with orders after 2025-01-01.-- Customers with no recent orders still appear (with NULL order data).SELECT
c.customer_name,
o.order_date,
COALESCE(o.total_amount, 0) AS amount
FROM io_thecodeforge.customers c
LEFTJOIN io_thecodeforge.orders o
ON c.customer_id = o.customer_id
AND o.order_date > '2025-01-01';
-- QUERY 2: Filter in WHERE clause-- Eliminates any row where order_date is NULL (no match or filtered out).-- This effectively converts the LEFT JOIN into an INNER JOIN.SELECT
c.customer_name,
o.order_date,
COALESCE(o.total_amount, 0) AS amount
FROM io_thecodeforge.customers c
LEFTJOIN io_thecodeforge.orders o
ON c.customer_id = o.customer_id
WHERE o.order_date > '2025-01-01';
-- QUERY 3: Find customers with NO orders after 2025-01-01-- Classic pattern: LEFT JOIN + WHERE IS NULLSELECT
c.customer_name,
COALESCE(o.total_amount, 0) AS amount
FROM io_thecodeforge.customers c
LEFTJOIN io_thecodeforge.orders o
ON c.customer_id = o.customer_id
AND o.order_date > '2025-01-01'WHERE o.order_id ISNULL;
Output
-- QUERY 1 output (filter in ON — preserves all customers):
| customer_name | order_date | amount |
|---------------|------------|--------|
| Alice Forge | 2025-06-15 | 250.00 |
| Alice Forge | 2025-09-22 | 175.50 |
| Bob Smith | NULL | 0 |
| Carol Wu | 2025-03-10 | 89.99 |
| Dave Patel | NULL | 0 |
-- QUERY 2 output (filter in WHERE — silently becomes INNER JOIN):
| customer_name | order_date | amount |
|---------------|------------|--------|
| Alice Forge | 2025-06-15 | 250.00 |
| Alice Forge | 2025-09-22 | 175.50 |
| Carol Wu | 2025-03-10 | 89.99 |
-- Bob Smith and Dave Patel are MISSING (filtered out by WHERE)
-- QUERY 3 output (customers with no recent orders):
| customer_name | amount |
|---------------|--------|
| Bob Smith | 0 |
| Dave Patel | 0 |
ON vs WHERE — The Most Common Interview Mistake:
In a LEFT JOIN, a filter in ON controls which right-side rows participate. A filter in WHERE eliminates rows from the final result entirely. Putting a right-table filter in WHERE silently converts your LEFT JOIN into an INNER JOIN. I've seen this cause three months of wrong financial projections in production. Always test with a small dataset first and verify: does every left-table row appear in the output?
Production Insight
LEFT JOIN with right-table filter in WHERE caused 3 months of wrong financial reports.
Always verify that every left-table row appears when you need all rows.
Filter right-table columns in ON, not WHERE.
Key Takeaway
LEFT JOIN preserves left rows.
ON controls right row participation; WHERE removes rows from final result.
Always test with small data to verify row counts.
NULL Handling: The Landmine Nobody Practices
NULL is not zero. NULL is not an empty string. NULL is not 'false.' NULL is the absence of a value — it's the database saying 'I don't know.' And SQL's treatment of NULL is the source of more interview failures and production bugs than any other single concept.
The three rules that govern NULL: 1. NULL != NULL. NULL is not equal to anything, including itself. WHERE column = NULL returns zero rows. Always use IS NULL or IS NOT NULL. 2. Any arithmetic operation with NULL produces NULL. 5 + NULL = NULL. If your revenue column has NULLs, SUM(revenue) ignores them — but revenue * 1.1 produces NULL for those rows. 3. Comparison operators (<, >, =) with NULL produce UNKNOWN, not TRUE or FALSE. This means rows with NULL values are excluded from both WHERE column > 100 and WHERE column <= 100. They fall through both filters.
COALESCE is your safety net. It returns the first non-NULL value from its arguments. Use it to provide defaults: COALESCE(discount, 0) means 'use the discount if it exists, otherwise assume zero.' NVL (Oracle), IFNULL (MySQL), and ISNULL (SQL Server) are database-specific alternatives.
Production story: a revenue report was undercounting by 15%. The query was SELECT SUM(price quantity) FROM order_items. Some rows had NULL quantity (items pending confirmation). Multiplying by NULL produced NULL, and SUM skips NULLs — so those items silently disappeared from the revenue total. The fix was SUM(COALESCE(price, 0) COALESCE(quantity, 0)). One line. Three months of wrong numbers.
io/thecodeforge/queries/null_handling.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
-- io.thecodeforge: NULL Handling Patterns-- Schema: products(product_id, product_name, discount_percent, category)-- WRONG: This returns ZERO rows even if NULLs exist-- Because NULL = NULL evaluates to UNKNOWN, not TRUESELECT * FROM io_thecodeforge.products
WHERE discount_percent = NULL;
-- RIGHT: Use IS NULLSELECT * FROM io_thecodeforge.products
WHERE discount_percent ISNULL;
-- WRONG: This excludes NULLs from both sides-- Products with NULL discount appear in NEITHER resultSELECT * FROM io_thecodeforge.products WHERE discount_percent > 10;
SELECT * FROM io_thecodeforge.products WHERE discount_percent <= 10;
-- Together they don't cover all rows — NULL rows are missing from both.-- RIGHT: Explicitly handle NULLs with COALESCESELECT
product_name,
discount_percent,
COALESCE(discount_percent, 0) AS safe_discount,
CASEWHEN discount_percent ISNULLTHEN'No discount set'WHEN discount_percent > 20THEN'Heavy discount'ELSE'Standard'ENDAS discount_category
FROM io_thecodeforge.products;
-- PRODUCTION PATTERN: Safe aggregation with NULLs-- Without COALESCE, NULL quantity items vanish from revenue totalsSELECTSUM(COALESCE(unit_price, 0) * COALESCE(quantity, 0)) AS total_revenue,
COUNT(*) AS total_line_items,
COUNT(quantity) AS items_with_quantity, -- COUNT(column) skips NULLsCOUNT(*) AS all_items -- COUNT(*) counts everythingFROM io_thecodeforge.order_items;
-- Notice: 15 items had NULL quantity. COUNT(*) includes them, COUNT(column) doesn't.
COUNT(*) vs COUNT(column) — The NULL Difference:
COUNT() counts every row regardless of NULLs. COUNT(column_name) counts only rows where that column is NOT NULL. This distinction matters for reporting. If your manager asks 'how many order items do we have?' and you use COUNT(quantity), you'll undercount by the number of pending/NULL items. Use COUNT() for total rows, COUNT(column) for 'how many have this value filled in.'
Production Insight
Revenue report undercounted by 15% because SUM(price * quantity) had NULL quantity.
NULL in arithmetic produces NULL, and SUM skips NULLs.
Always wrap nullable columns in COALESCE for calculations.
Key Takeaway
NULL is not a value.
Use IS NULL, never = NULL.
Always wrap nullable columns in COALESCE for safety.
Window Functions: The Interview Differentiator
Window functions are the topic that separates mid-level candidates from senior ones. GROUP BY collapses rows. Window functions keep every row but add computed columns based on a 'window' of related rows. If GROUP BY gives you a summary, window functions give you a summary attached to every detail row.
The key functions interviewers ask about
ROW_NUMBER(): Assigns a unique sequential number. Ties get different numbers.
RANK(): Assigns the same number to ties, then skips. 1, 1, 3, 4.
DENSE_RANK(): Assigns the same number to ties, no skip. 1, 1, 2, 3.
LEAD(column, n): Gets the value from n rows ahead.
LAG(column, n): Gets the value from n rows behind.
SUM()OVER(): Running total without collapsing rows.
The interview pattern I see most: 'Find the top 3 orders per customer' or 'Find the second-highest salary per department.' Both require ROW_NUMBER() or DENSE_RANK() with PARTITION BY. Candidates who reach for a subquery with LIMIT or a correlated subquery are showing they don't know window functions — and interviewers notice.
Production use case: I used LAG() to detect anomalies in a time-series metrics table. The query compared each hour's error rate to the previous hour's. If the error rate jumped by more than 200%, it flagged the row. That query ran against 50 million rows in 3 seconds because the window function avoided a self-join that would have taken 40+ seconds.
io/thecodeforge/queries/window_functions.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
-- io.thecodeforge: Window Functions — Interview Patterns-- Schema: orders(order_id, customer_id, order_date, total_amount)-- employees(employee_id, name, department, salary)-- PATTERN 1: Top N per group (top 2 orders per customer by amount)-- This is the #1 window function interview question.SELECT *
FROM (
SELECT
customer_id,
order_id,
total_amount,
ROW_NUMBER() OVER (
PARTITIONBY customer_id
ORDERBY total_amount DESC
) AS rn
FROM io_thecodeforge.orders
) ranked
WHERE rn <= 2;
-- PATTERN 2: Running total (cumulative revenue by date)SELECT
order_date,
total_amount,
SUM(total_amount) OVER (
ORDERBY order_date
ROWSBETWEENUNBOUNDEDPRECEDINGANDCURRENTROW
) AS running_total
FROM io_thecodeforge.orders
ORDERBY order_date;
-- PATTERN 3: DENSE_RANK for 'second-highest salary per department'-- DENSE_RANK handles ties correctly. ROW_NUMBER would arbitrarily pick one.SELECT *
FROM (
SELECT
department,
name,
salary,
DENSE_RANK() OVER (
PARTITIONBY department
ORDERBY salary DESC
) AS dept_rank
FROM io_thecodeforge.employees
) ranked
WHERE dept_rank = 2;
-- PATTERN 4: LAG for detecting day-over-day anomaliesSELECT
metric_date,
error_count,
LAG(error_count, 1) OVER (ORDERBY metric_date) AS prev_day_errors,
CASEWHEN error_count > LAG(error_count, 1) OVER (ORDERBY metric_date) * 2THEN'SPIKE ALERT'ELSE'Normal'ENDAS status
FROM io_thecodeforge.daily_metrics
ORDERBY metric_date;
Output
-- PATTERN 1 output (top 2 orders per customer):
| customer_id | order_id | total_amount | rn |
|-------------|----------|-------------|----|
| 101 | 5001 | 450.00 | 1 |
| 101 | 5003 | 320.00 | 2 |
| 102 | 5005 | 890.00 | 1 |
| 102 | 5002 | 210.00 | 2 |
-- PATTERN 2 output (running total):
| order_date | total_amount | running_total |
|------------|-------------|---------------|
| 2025-01-01 | 500.00 | 500.00 |
| 2025-01-02 | 350.00 | 850.00 |
| 2025-01-03 | 720.00 | 1570.00 |
-- PATTERN 4 output (anomaly detection):
| metric_date | error_count | prev_day_errors | status |
ROW_NUMBER vs RANK vs DENSE_RANK — Know the Difference:
ROW_NUMBER assigns unique sequential numbers (1,2,3,4) even for ties. RANK gives ties the same number then skips (1,1,3,4). DENSE_RANK gives ties the same number without skipping (1,1,2,3). In an interview, if they ask for 'the second-highest salary,' use DENSE_RANK — because if two people share the highest salary, RANK would give second place a rank of 3, and ROW_NUMBER would arbitrarily break the tie. DENSE_RANK=2 gives you exactly the second tier.
Production Insight
Used LAG() to detect error spikes in 50M rows in 3 seconds vs self-join that would take 40+ seconds.
Window functions avoid expensive self-joins for row-to-row comparisons.
Choose correctly: DENSE_RANK for tiers, ROW_NUMBER for unique ordering.
Key Takeaway
Window functions keep rows while adding computed values.
ROW_NUMBER - unique; RANK - ties skip; DENSE_RANK - ties no skip.
Use PARTITION BY for per-group ranking.
GROUP BY and HAVING: Aggregation Done Right
GROUP BY collapses multiple rows into summary rows. HAVING filters those summary rows after aggregation. WHERE filters individual rows before aggregation. Mixing them up is a classic interview mistake — and a common production bug.
The logical processing order matters: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. This means: - WHERE cannot use aggregate functions (SUM, COUNT, AVG) because it runs before GROUP BY. - HAVING can use aggregate functions because it runs after GROUP BY. - You can filter on non-aggregated columns in WHERE (faster) or HAVING (slower, but sometimes necessary).
The trap interviewers love: 'Show me departments where the average salary exceeds 50,000.' That's HAVING AVG(salary) > 50000. Candidates who write WHERE AVG(salary) > 50000 get an immediate syntax error — and the interviewer watches to see if you know why.
Another trap: selecting columns not in the GROUP BY clause. In strict SQL mode (MySQL) or PostgreSQL, SELECT department, name, AVG(salary) FROM employees GROUP BY department fails because 'name' is not in the GROUP BY and is not aggregated. In older MySQL modes, it silently returns an arbitrary name from each group — which is almost never what you want. Always ensure every non-aggregated column in SELECT is in the GROUP BY.
io/thecodeforge/queries/group_by_having.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
-- io.thecodeforge: GROUP BY and HAVING Patterns-- Schema: employees(employee_id, name, department, salary, hire_date)-- orders(order_id, customer_id, order_date, total_amount, status)-- PATTERN 1: Basic aggregation with HAVING-- 'Departments with average salary above 50,000'SELECT
department,
COUNT(*) AS headcount,
AVG(salary) AS avg_salary,
MAX(salary) AS highest_salary
FROM io_thecodeforge.employees
GROUPBY department
HAVINGAVG(salary) > 50000ORDERBY avg_salary DESC;
-- PATTERN 2: WHERE vs HAVING — use WHERE for non-aggregate filters-- 'Departments with avg salary > 50k, but only for employees hired after 2020'-- WRONG: HAVING hire_date > '2020-01-01' — works but inefficient-- RIGHT: Filter rows first with WHERE, then aggregateSELECT
department,
AVG(salary) AS avg_salary
FROM io_thecodeforge.employees
WHERE hire_date > '2020-01-01' -- Filters BEFORE grouping (faster)GROUPBY department
HAVINGAVG(salary) > 50000; -- Filters AFTER grouping-- PATTERN 3: Multi-column GROUP BY with COUNT DISTINCT-- 'Revenue per customer per year, only for completed orders'SELECT
customer_id,
EXTRACT(YEARFROM order_date) AS order_year,
COUNT(*) AS order_count,
SUM(total_amount) AS total_revenue,
COUNT(DISTINCTEXTRACT(MONTHFROM order_date)) AS active_months
FROM io_thecodeforge.orders
WHERE status = 'completed'GROUPBY customer_id, EXTRACT(YEARFROM order_date)
HAVINGSUM(total_amount) > 10000ORDERBY total_revenue DESC;
Output
-- PATTERN 1 output:
| department | headcount | avg_salary | highest_salary |
Non-Aggregated Columns in SELECT — The Silent Bug:
If you SELECT a column that isn't in the GROUP BY and isn't wrapped in an aggregate function, PostgreSQL throws an error. Older MySQL versions silently return an arbitrary value from each group — which is almost never what you want. Always ensure every column in SELECT is either in GROUP BY or inside an aggregate function (SUM, COUNT, AVG, MAX, MIN). This is one of the most common sources of 'my report numbers are wrong' bugs.
Production Insight
Non-aggregated column in SELECT with GROUP BY silently returned arbitrary value in older MySQL.
PostgreSQL throws error, which is safer.
Always include all non-aggregated columns in GROUP BY or wrap in aggregate.
Key Takeaway
HAVING filters after aggregation, WHERE before.
WHERE cannot use aggregate functions.
Every non-aggregated column in SELECT must be in GROUP BY or aggregated.
Subqueries vs CTEs vs Temp Tables: When to Use What
Interviewers ask: 'What's the difference between a subquery, a CTE, and a temp table?' The answer is about readability, reusability, and performance.
A subquery is a query nested inside another query. It works, but deeply nested subqueries become unreadable fast. If you have a subquery inside a subquery inside a WHERE clause, nobody — including future you — can debug it at 2am during an incident.
A CTE (Common Table Expression) with the WITH clause gives a subquery a name. It reads top-to-bottom, can be referenced multiple times, and makes complex queries self-documenting. The optimizer in PostgreSQL and SQL Server treats CTEs as inline views (not materialized temp tables) unless you explicitly use MATERIALIZED — so there's usually no performance difference from a subquery.
A temp table physically stores intermediate results. Use it when you need to reference the same intermediate result many times, when the intermediate result is expensive to compute, or when you need to index the intermediate result. The downside: it's more I/O and the optimizer can't push predicates through it.
My rule of thumb: CTEs for readability. Subqueries for simple one-off filters. Temp tables for heavy multi-step transformations where you reference the intermediate result 5+ times or need to index it.
io/thecodeforge/queries/subquery_vs_cte.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
-- io.thecodeforge: Subquery vs CTE — Same Query, Different Styles-- Task: Find customers whose total spend exceeds the overall average-- APPROACH 1: Correlated Subquery (harder to read)SELECT
c.customer_name,
(SELECTSUM(o.total_amount)
FROM io_thecodeforge.orders o
WHERE o.customer_id = c.customer_id) AS total_spend
FROM io_thecodeforge.customers c
WHERE (SELECTSUM(o.total_amount)
FROM io_thecodeforge.orders o
WHERE o.customer_id = c.customer_id)
> (SELECTAVG(customer_total)
FROM (SELECTSUM(total_amount) AS customer_total
FROM io_thecodeforge.orders
GROUPBY customer_id) sub);
-- APPROACH 2: CTE (clean, readable, self-documenting)WITH customer_spending AS (
SELECT
customer_id,
SUM(total_amount) AS total_spend
FROM io_thecodeforge.orders
GROUPBY customer_id
),
average_spending AS (
SELECTAVG(total_spend) AS avg_spend
FROM customer_spending
)
SELECT
c.customer_name,
cs.total_spend
FROM io_thecodeforge.customers c
JOIN customer_spending cs ON c.customer_id = cs.customer_id
CROSSJOIN average_spending av
WHERE cs.total_spend > av.avg_spend
ORDERBY cs.total_spend DESC;
-- APPROACH 3: Temp table (best when intermediate result is reused many times)CREATETEMPTABLE customer_spending_tmp ASSELECT
customer_id,
SUM(total_amount) AS total_spend,
COUNT(*) AS order_count
FROM io_thecodeforge.orders
GROUPBY customer_id;
-- Now reference it multiple times without recomputingSELECT
c.customer_name,
cs.total_spend,
cs.order_count
FROM io_thecodeforge.customers c
JOIN customer_spending_tmp cs ON c.customer_id = cs.customer_id
WHERE cs.total_spend > (SELECTAVG(total_spend) FROM customer_spending_tmp)
ORDERBY cs.total_spend DESC;
DROPTABLEIFEXISTS customer_spending_tmp;
Output
-- All three approaches produce the same result:
| customer_name | total_spend |
|---------------|-------------|
| Alice Forge | 23450.00 |
| Bob Smith | 18900.00 |
-- Carol Wu and Dave Patel excluded (below average spend)
-- Performance note: CTE and subquery have similar execution plans.
-- Temp table is faster when the intermediate result is referenced 5+ times
-- because it avoids recomputation — but adds I/O overhead for small datasets.
Interview Answer Framework:
When asked 'subquery vs CTE vs temp table,' structure your answer: 'CTEs improve readability and can be referenced multiple times without duplication. Subqueries are fine for simple one-off filters. Temp tables are best when the intermediate result is expensive, referenced many times, or needs indexing. In practice, I default to CTEs and escalate to temp tables when performance profiling shows the intermediate computation is the bottleneck.'
Production Insight
Correlated subquery turned a 200ms query into 45 minutes on 100k rows.
Each outer row triggered a subquery execution.
Rewrite as JOIN or use EXISTS for short-circuit.
Key Takeaway
CTEs improve readability.
Subqueries for simple filters; temp tables for heavy reuse.
For large subquery results, EXISTS beats IN.
Indexing and SARGability: Why Your Query is Slow
A common interview question is: 'Why isn't my index being used?' This usually leads to the concept of SARGable (Search ARgument ABLE) queries. If you wrap an indexed column in a function (like WHERE YEAR(created_at) = 2026), the database engine cannot use the index effectively, leading to a full table scan. Instead, provide a range the engine can 'seek' into.
I've seen this exact pattern kill production performance. A reporting query on a 14-million-row orders table was taking 45 seconds. The developer had WHERE YEAR(order_date) = 2025 AND MONTH(order_date) = 3. The index on order_date was completely ignored because YEAR() and MONTH() wrapped the column. The fix: WHERE order_date >= '2025-03-01' AND order_date < '2025-04-01'. Same result, 12 milliseconds instead of 45 seconds. The index seek touched 40,000 rows instead of scanning all 14 million.
Beyond SARGability, understand covering indexes. If an index contains every column referenced in the query (both in WHERE and SELECT), the engine never touches the actual table — it reads everything from the index. This is called an 'index-only scan' and it's the fastest possible query path.
Clustered vs non-clustered: a clustered index IS the table — data is stored in the order of the clustered index (usually the primary key). A non-clustered index is a separate structure with pointers back to the data. Each table has exactly one clustered index but can have many non-clustered indexes.
io/thecodeforge/optimization/indexing.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
-- io.thecodeforge: SARGable vs Non-SARGable Queries-- Schema: logs(log_id, created_at, level, message, service_name)-- Index: idx_logs_created_at ON logs(created_at)-- BAD: Index on created_at is IGNORED due to function wrapping-- The engine must call DATE() on every row — full table scanSELECT * FROM io_thecodeforge.logs
WHEREDATE(created_at) = '2026-03-06';
-- GOOD: Index seek enabled — engine jumps directly to matching rangeSELECT * FROM io_thecodeforge.logs
WHERE created_at >= '2026-03-06 00:00:00'AND created_at < '2026-03-07 00:00:00';
-- BAD: Function on both sides — index cannot be usedSELECT * FROM io_thecodeforge.logs
WHERELOWER(service_name) = 'payment-service';
-- GOOD: Store data in consistent case, or use a functional index-- PostgreSQL: CREATE INDEX idx_service_lower ON logs(LOWER(service_name));-- Then the LOWER() query becomes SARGable on that functional index.-- COVERING INDEX example:-- If your query is:SELECT customer_id, order_date, total_amount
FROM io_thecodeforge.orders
WHERE customer_id = 101AND order_date > '2025-01-01';
-- Create a covering index that includes ALL referenced columns:CREATEINDEX idx_orders_covering
ON io_thecodeforge.orders(customer_id, order_date)
INCLUDE (total_amount);
-- Now the engine reads ONLY the index — never touches the table.-- This is an 'index-only scan' — the fastest possible path.
Execution Plan: Index Scan using idx_logs_created_at | Rows: 40,000 | Time: 12ms
-- The difference: 3,750× faster by removing a function call.
Always Run EXPLAIN Before Optimizing:
Before changing a single line of SQL, run EXPLAIN (or EXPLAIN ANALYZE in PostgreSQL, EXPLAIN EXECUTION PLAN in SQL Server). It tells you whether the engine is doing a Seq Scan (full table — bad at scale), Index Scan (good), or Index-Only Scan (best). Without EXPLAIN, you're guessing. With EXPLAIN, you're diagnosing. I've never optimized a slow query without first reading its execution plan.
Production Insight
YEAR(order_date) wrapped index column caused full scan of 14M rows in 45 seconds.
Rewriting as range condition reduced to 12ms.
Never wrap indexed columns in functions.
Key Takeaway
SARGable queries enable index seeks.
Use range conditions instead of functions on columns.
Covering indexes enable index-only scans.
Query Optimization: Reading Execution Plans and Avoiding Common Traps
The ultimate interview question: 'How would you optimize a slow query?' Here's the systematic approach I use in production, and the one I want to hear from candidates.
Step 1: Run EXPLAIN. Look for Seq Scan on large tables, Nested Loop on large joins, and high row estimates. If the estimated rows are wildly different from actual rows, your table statistics are stale — run ANALYZE.
Step 2: Check for non-SARGable predicates. Functions on indexed columns, implicit type conversions, LIKE with leading wildcards.
Step 3: Check for missing indexes. If the WHERE clause filters on columns without indexes, add them. If the JOIN uses columns without indexes, add them.
Step 4: Check for SELECT *. You're pulling every column from disk when you need three. Use a covering index or select only required columns.
Step 5: Check for N+1 queries. If your application runs one query to get IDs, then one query per ID, that's N+1. Rewrite as a single JOIN or IN clause.
Step 6: Check for DISTINCT abuse. DISTINCT is expensive — it requires sorting or hashing all results to remove duplicates. If you're using DISTINCT to hide duplicate rows caused by a bad JOIN, fix the JOIN instead.
The implicit conversion trap is worth calling out explicitly: comparing a VARCHAR column to an integer literal (WHERE varchar_col = 123) forces the engine to convert every row's value to an integer before comparing. This prevents index usage and causes a full table scan. Always match types: WHERE varchar_col = '123'.
-- io.thecodeforge: Query Optimization Patterns-- TRAP 1: Implicit type conversion — index ignored-- phone_number is VARCHAR, but we compare to an integer-- The engine converts EVERY row's phone_number to int before comparingSELECT * FROM io_thecodeforge.customers
WHERE phone_number = 9876543210; -- BAD: implicit conversionSELECT * FROM io_thecodeforge.customers
WHERE phone_number = '9876543210'; -- GOOD: type match, index used-- TRAP 2: DISTINCT hiding a bad JOIN-- If you're getting duplicates and adding DISTINCT to 'fix' it,-- your JOIN condition is wrong.SELECTDISTINCT c.customer_name, o.total_amount
FROM io_thecodeforge.customers c
JOIN io_thecodeforge.orders o ON c.customer_id = o.customer_id;
-- If this returns duplicates, check: are there multiple orders per customer?-- If yes, DISTINCT is removing legitimate rows. If no, the JOIN is wrong.-- TRAP 3: Subquery in WHERE instead of JOIN-- Correlated subquery runs once PER ROW — O(n²) behaviorSELECT * FROM io_thecodeforge.customers c
WHERE c.customer_id IN (
SELECT customer_id FROM io_thecodeforge.orders
WHERE total_amount > 1000
);
-- Better: use EXISTS (short-circuits on first match)SELECT * FROM io_thecodeforge.customers c
WHEREEXISTS (
SELECT1FROM io_thecodeforge.orders o
WHERE o.customer_id = c.customer_id
AND o.total_amount > 1000
);
-- Best: use JOIN with DISTINCT if you only need customer dataSELECTDISTINCT c.*
FROM io_thecodeforge.customers c
JOIN io_thecodeforge.orders o ON c.customer_id = o.customer_id
WHERE o.total_amount > 1000;
-- TRAP 4: LIKE with leading wildcard — index cannot be usedSELECT * FROM io_thecodeforge.logs
WHERE message LIKE '%timeout%'; -- Full table scan, every time-- If you need full-text search, use a proper full-text index:-- PostgreSQL: CREATE INDEX idx_message_fts ON logs USING gin(to_tsvector('english', message));-- Then: WHERE to_tsvector('english', message) @@ to_tsquery('timeout');
Execution Plan: Index Scan using idx_phone | Rows: 1 | Time: 0.5ms
-- 1,700× slower due to implicit conversion.
EXISTS vs IN — Performance Matters at Scale:
For small subquery results, IN and EXISTS perform similarly. But with large subquery result sets, EXISTS is usually faster because it short-circuits — it stops scanning as soon as it finds one match. IN materializes the entire subquery result into a set, then checks membership. If your subquery returns 100,000 rows, EXISTS might find a match on the first row while IN builds a 100,000-element hash set first. In an interview, say: 'I'd default to EXISTS for correlated checks and IN for small static lists.'
Production Insight
Implicit type conversion (VARCHAR to int) made a query 1700x slower by disabling index.
Always match types in comparisons.
EXISTS generally faster than IN for large subqueries.
Key Takeaway
Run EXPLAIN before optimizing.
Fix non-SARGable predicates first.
Avoid SELECT *, implicit conversions, and correlated subqueries.
Why Your Interviewer Cares About SQL Dialects (And You Should Too)
Interviewers ask about dialects not because they want a vocabulary test, but because they've been burned by someone who only knew MySQL's LIMIT syntax and crashed a Postgres migration. Every major RDBMS — PostgreSQL, SQL Server, Oracle, MySQL — has its own quirks in date functions, string operations, pagination, and window function support.
The real test isn't 'name three dialects,' it's 'can you adapt to our production stack in two weeks?' When I interview candidates, I ask about Oracle's CONNECT BY or SQL Server's TOP because those are the landmines I've seen wipe out entire sprints. If you've only written queries in one dialect, you're not a SQL developer — you're a user of that specific tool.
Know which dialect your target company uses, and at minimum understand how their date truncation, LIMIT/OFFSET alternatives, and transaction isolation levels differ from the ANSI standard. That's the difference between 'I can write a query' and 'I won't break prod on my second day.'
DialectGotchas.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
// io.thecodeforge — interview tutorial
// PostgreSQL dialect: Date truncation
SELECTdate_trunc('month', order_date) AS month_start,
COUNT(*) AS order_count
FROM orders
WHERE order_date >= '2024-01-01'GROUPBYdate_trunc('month', order_date);
// SQLServer equivalent: DATEFROMPARTSSELECTDATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1) AS month_start,
COUNT(*) AS order_count
FROM orders
WHERE order_date >= '2024-01-01'GROUPBYDATEFROMPARTS(YEAR(order_date), MONTH(order_date), 1);
// Oracle equivalent: TRUNCSELECTTRUNC(order_date, 'MM') AS month_start,
COUNT(*) AS order_count
FROM orders
WHERE order_date >= DATE'2024-01-01'GROUPBYTRUNC(order_date, 'MM');
Output
month_start | order_count
2024-01-01 | 12894
2024-02-01 | 14201
2024-03-01 | 11387
Production Trap:
Porting date comparison logic between dialects without testing edge cases (leap years, daylight savings) is a common source of silent data corruption. Always replicate your date predicates in both old and new dialects during migration.
Key Takeaway
If your production stack changes dialects, every date function and pagination query is a potential incident waiting to happen. Learn the three most common dialects' transaction isolation and date truncation syntax.
DELETE vs TRUNCATE vs DROP: The Nuclear, Conventional, and Surgical Options
Junior devs memorize the syntax differences. Senior devs know when each one will make them the person calling the all-hands-on-deck incident response meeting.
DELETE is surgical — row by row, logged, triggerable, rollbackable. It's for cleaning out those 500 orphaned records from 2019. But it generates transaction log traffic proportional to the number of rows killed. You don't DELETE 50 million rows unless you have an afternoon to burn and your DBA isn't watching.
TRUNCATE is conventional warfare — deallocates data pages, minimal logging, resets identity seeds, and can't be used with a WHERE clause. It's for wiping a staging table before the next ETL run. No triggers fire, no per-row logging. Fast. Lethal. But you can't undo it in most databases without a transaction wrapping it.
DROP is the nuclear option. Table definition, constraints, indexes, permissions — all gone. It's not a data operation; it's schema demolition. I've seen a junior type DROP instead of DELETE on the wrong database and take down a production reporting system for four hours. He still gets reminded about it every sprint retro.
The key insight interviewers probe: 'Can you articulate the operational impact of your choice?' If your answer is just syntax fluff about auto-increment and logging, you're showing you haven't actually managed data in production.
CleanupOperations.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
// io.thecodeforge — interview tutorial
-- Safe deletion with explicit transaction for rollback
BEGINTRANSACTION;
-- DELETE: row-by-row, logged, can be rolled back
DELETEFROM order_errors
WHERE created_at < '2023-01-01'AND error_type = 'TIMEOUT';
-- Check count before committing
SELECTCOUNT(*) AS remaining_errors FROM order_errors;
-- Commitor rollback based on sanity check
-- COMMIT; -- uncomment to finalize
-- ROLLBACK; -- uncomment to undo
-- TRUNCATE: fast, minimal logging, no per-row triggers
TRUNCATETABLE staging_warehouse_snapshot;
-- DROP: schema removal, not data operation
-- DROPTABLEIFEXISTS deprecated_archive;
Output
(rows affected: 12,347)
remaining_errors
13456
Senior Shortcut:
Always wrap DELETE in a transaction when targeting more than 1,000 rows. Run a SELECT COUNT(*) first to sanity-check. If the count looks wrong, ROLLBACK before anyone notices.
Key Takeaway
DELETE is for precision surgery, TRUNCATE for scheduled cleanups, DROP for schema destruction. Never confuse them. Always wrap destructive operations in a transaction.
UNION vs UNION ALL: Performance vs Deduplication
UNION and UNION ALL combine result sets from multiple SELECT queries, but they differ critically in performance. UNION automatically removes duplicate rows, requiring an extra sort or hash operation. UNION ALL returns all rows, including duplicates, without deduplication overhead. In interviews, clarify: use UNION only when you need distinct rows and the datasets are small or moderately sized. For large datasets, UNION ALL with explicit deduplication logic (e.g., using DISTINCT or GROUP BY) often outperforms a raw UNION because you control when deduplication happens. Example: merging customer lists from two regions. If duplicates are impossible by design (e.g., region-specific IDs), UNION ALL is strictly faster. The trap: candidates default to UNION thinking it's safer, but they ignore the cost. Always ask: "Does your data guarantee no duplicates?" If yes, UNION ALL is the correct, faster choice. In production, a slow UNION on a 10M-row table can stall reporting—know when to use each.
union_vs_union_all.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
// io.thecodeforge — interview tutorial
-- UNION eliminates duplicates (slower)
SELECT customer_id FROM region_east
UNIONSELECT customer_id FROM region_west;
-- UNIONALL keeps duplicates (faster)
SELECT customer_id FROM region_east
UNIONALLSELECT customer_id FROM region_west;
Output
UNION returns unique customer IDs only; UNION ALL returns all rows including duplicates.
Production Trap:
Using UNION on large tables with indexes can force full table scans for dedup. Benchmark shows UNION ALL is 30-50% faster on 5M+ rows.
Key Takeaway
UNION ALL unless deduplication is explicitly required.
Foreign Keys: Referential Integrity or Performance Tax?
A foreign key (FK) enforces referential integrity: it ensures a column's values exist in another table's primary key. This prevents orphan rows and maintains data consistency. However, FKs impose a performance cost on INSERT, UPDATE, and DELETE operations because the database checks the referenced table on every write. In interviews, show understanding: FKs are not optional for correctness in relational models, but they can be omitted in high-throughput OLTP systems where application-level integrity is preferred. Example: an orders table with customer_id referencing customers.id ensures no order points to a deleted customer. The trap: ignoring FKs leads to silent data corruption; overusing them in high-write environments causes deadlocks. Best practice: enforce FKs during design, disable them during bulk loads, then re-enable. Know your dialect—MySQL's InnoDB checks FKs; MyISAM ignores them silently. The interviewer wants to see you balance correctness with performance.
Error: Cannot add or update a child row: a foreign key constraint fails.
Production Trap:
FKs can cause cascading locks under high concurrency. Monitor lock waits and consider disabling FKs during batch ETL jobs.
Key Takeaway
FKs guarantee consistency but add write overhead—use them or compensate with application checks.
SQL Injection: The Interview Non-Negotiable
SQL injection (SQLi) occurs when user input is concatenated directly into SQL queries, allowing attackers to execute arbitrary SQL. This is the most common web vulnerability. Prevention is mandatory: use parameterized queries (prepared statements) or stored procedures—never string interpolation. In interviews, demonstrate the difference: SELECT * FROM users WHERE username = 'admin' OR '1'='1' bypasses authentication. Parameterized queries separate code from data, making injection impossible. Additional layers: least privilege (read-only accounts for reporting), input validation, and ORM frameworks. The trap: many developers think escaping quotes is enough—it isn't. Attackers exploit encoding, Unicode, and second-order injection. Example: an attacker inserts '<script>' into a name field, then that data is used unsafely in another query. Know your database: MySQL's prepared statements prevent most injections, but PostgreSQL's libpq also supports them. Interviewers ask this to gauge security awareness. Never say "we use regex to filter SQL keywords"—that's a red flag.
Parameterized query executes safely; attacker input is treated as data, not SQL code.
Production Trap:
Even stored procedures can be vulnerable if they use EXEC with concatenated strings inside them. Always paramaterize at every level.
Key Takeaway
Only parameterized queries prevent SQL injection—escaping is not a substitute.
● Production incidentPOST-MORTEMseverity: high
LEFT JOIN Filter in WHERE Silently Corrupted Quarterlies
Symptom
Finance team's quarterly revenue report showed fewer customers than expected. Some high-value customers were missing entirely, but no error was thrown.
Assumption
Developers assumed that filtering on the right table's status column in WHERE was equivalent to filtering in ON.
Root cause
In a LEFT JOIN, a WHERE clause filter on the right table eliminates rows where the right side has no match, effectively turning the LEFT JOIN into an INNER JOIN. The missing customers were those with no active orders.
Fix
Move the status filter from WHERE to the ON clause of the LEFT JOIN. Then the LEFT JOIN preserves all customers, and only the right-side rows are filtered.
Key lesson
Always test LEFT JOIN output with a small dataset.
If you use a LEFT JOIN, verify that every row from the left table appears in the result, even when the right side has no match.
Never put filters on the right table's columns in the WHERE clause of a LEFT JOIN without understanding the semantic change.
Production debug guideCommon performance or logic problems and their first actions5 entries
Symptom · 01
Query returns fewer rows than expected with LEFT JOIN
→
Fix
Check WHERE clause for filters on right-table columns. Move them to ON clause if LEFT JOIN is intended.
Symptom · 02
Query is slow despite having an index
→
Fix
Run EXPLAIN to see if index is used. Check for functions wrapping indexed columns (SARGability).
Symptom · 03
NULL values causing wrong aggregation results
→
Fix
Use COALESCE to supply defaults in arithmetic. Check COUNT(*) vs COUNT(column).
Symptom · 04
Window function returns unexpected ranks
→
Fix
Verify PARTITION BY and ORDER BY clauses. Use ROW_NUMBER vs RANK vs DENSE_RANK appropriately for ties.
Symptom · 05
Correlated subquery is very slow
→
Fix
Rewrite as JOIN or use EXISTS with short-circuit.
★ SQL Query Troubleshooting Cheat SheetQuick commands and fixes for the most common production SQL incidents.
Full table scan on large table−
Immediate action
Run EXPLAIN ANALYZE
Commands
EXPLAIN ANALYZE <your_query>;
\dt+ <table_name> -- PostgreSQL; check table size
Fix now
Add index on columns used in WHERE, ensure no function wrapping.
LEFT JOIN returns missing rows+
Immediate action
Check WHERE clause for right-table filters
Commands
Select the left table's primary key columns and verify count with and without WHERE.
Review the ON clause vs WHERE clause distinction.
Fix now
Move any right-table conditions from WHERE to ON.
Aggregate function returns NULL or wrong total+
Immediate action
Check for NULL values in aggregated columns
Commands
SELECT COUNT(*), COUNT(column), SUM(COALESCE(column,0)) FROM table;
Check if arithmetic inside SUM has NULL operands.
Fix now
Wrap nullable columns in COALESCE before arithmetic or aggregation.
Clustered Index vs Non-Clustered Index
Feature
Clustered Index
Non-Clustered Index
Data Storage
The table IS the index (stores actual data in index order)
Separate structure with pointers back to data pages
Count per Table
Only 1 (usually Primary Key)
Multiple (limited by engine, typically dozens)
Performance
Faster for range scans and ORDER BY on the key
Faster for targeted single-row lookups
Leaf Nodes
Actual data pages (the table rows themselves)
Pointers to the clustered index key or row ID
When to Use
Primary key, columns used in range queries and sorting
Columns frequently in WHERE, JOIN, or ORDER BY that aren't the primary key
Key takeaways
1
LEFT JOIN preserves all rows from the left table; filter right-table columns in ON not WHERE.
2
NULL is not a value
use IS NULL and COALESCE for comparisons and arithmetic.
SARGable queries enable index seeks; function-wrapped indexed columns cause full scans.
5
EXISTS short-circuits and beats IN for large subquery result sets.
6
Run EXPLAIN before optimizing
it reveals the actual execution plan and missing indexes.
Common mistakes to avoid
7 patterns
×
Using SELECT * in production queries
Symptom
Unnecessary I/O overhead, prevents covering index usage, and breaks backward compatibility when new columns are added to the table. In a 50-column table, SELECT * reads 16x more data than needed.
Fix
Always specify only the columns you need. Use a covering index that includes the selected columns.
×
Ignoring NULL logic
Symptom
WHERE column = NULL returns zero rows; arithmetic with NULL produces NULL; SUM skips NULLs silently. Revenue reports can undercount by 15% when NULL quantities exist.
Fix
Use IS NULL / IS NOT NULL for comparisons. Wrap nullable columns in COALESCE for calculations.
×
Failing to check execution plans
Symptom
Guessing why a query is slow leads to wasted effort. Without EXPLAIN, you may add indexes that aren't used or miss the real cause (missing statistics, implicit conversion).
Fix
Run EXPLAIN ANALYZE (PostgreSQL) or SHOW EXECUTION PLAN (SQL Server). Look for Seq Scans, high row estimates, and missing index usage.
×
Implicit type conversions destroying index usage
Symptom
Comparing a VARCHAR column to an integer literal forces a full table scan. A query that should take <1ms takes 850ms on 500k rows.
Fix
Always match the column's actual type: WHERE varchar_col = '123' instead of = 123.
×
Using DISTINCT to hide bad JOINs
Symptom
Adding DISTINCT to remove duplicates is expensive (requires sorting/hashing) and masks a wrong JOIN condition. Duplicate rows indicate a one-to-many join or missing join key.
Fix
Fix the JOIN condition instead. Verify cardinality between tables. Use DISTINCT only when you truly need to deduplicate.
×
Correlated subqueries in WHERE causing O(n²) behavior
Symptom
A subquery that references the outer query's columns runs once per outer row. On 100k rows, that's 100k executions, turning a 200ms query into 45 minutes.
Fix
Rewrite with EXISTS (short-circuits) or a JOIN (set-based, single pass). EXISTS is usually more efficient for large result sets.
×
Putting filters in the WHERE clause of a LEFT JOIN that belong in the ON clause
Symptom
All left-side rows with no match disappear from results. Financial reports, customer lists, and dashboards become incomplete without any error signal.
Fix
Move right-table conditions from WHERE to ON in the LEFT JOIN. The ON clause controls which rows participate; WHERE filters the final result.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01SENIOR
Explain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FU...
Q02JUNIOR
What is the difference between WHERE and HAVING? Can you use aggregate f...
Q03SENIOR
Explain the difference between ROW_NUMBER(), RANK(), and DENSE_RANK(). G...
Q04SENIOR
What is a SARGable query? Give an example of a non-SARGable query and ex...
Q05SENIOR
What is the difference between EXISTS and IN? When would you prefer one ...
Q06JUNIOR
Explain NULL behavior in SQL. Why does `WHERE column = NULL` return zero...
Q07SENIOR
A query runs in 200ms on a table with 10,000 rows but takes 45 seconds o...
Q01 of 07SENIOR
Explain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN. When would you use each one? Give a real-world example where choosing the wrong JOIN type would produce incorrect business metrics.
ANSWER
INNER JOIN returns rows only when there is a match in both tables. LEFT JOIN returns all rows from the left table and matched rows from the right, filling NULLs for non-matches. RIGHT JOIN is symmetric. FULL OUTER JOIN returns all rows from both, with NULLs where no match exists. A wrong choice example: using INNER JOIN to list customers and their orders omits customers without orders, understating the customer base. Use LEFT JOIN when you need all entities from the primary table regardless of related data.
Q02 of 07JUNIOR
What is the difference between WHERE and HAVING? Can you use aggregate functions in WHERE? Why or why not? What is the logical processing order of a SQL query?
ANSWER
WHERE filters rows before grouping; HAVING filters groups after aggregation. Aggregate functions cannot be used in WHERE because WHERE runs before GROUP BY, so aggregates haven't been computed yet. Logical order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. This means you can alias a column in SELECT and use it in ORDER BY, but not in WHERE.
Q03 of 07SENIOR
Explain the difference between ROW_NUMBER(), RANK(), and DENSE_RANK(). Given a table of employee salaries, how would you find the second-highest salary per department using each function? Which would you choose and why?
ANSWER
ROW_NUMBER assigns unique consecutive numbers, breaking ties arbitrarily. RANK gives ties the same number then skips the next rank. DENSE_RANK gives ties the same number without skipping. For second-highest salary per department, use DENSE_RANK because it correctly assigns rank=2 to the second salary tier even if multiple people share highest. With RANK, second-highest becomes rank=3 if two share top. ROW_NUMBER would arbitrarily pick one of the tied top, making 'second' meaningless.
Q04 of 07SENIOR
What is a SARGable query? Give an example of a non-SARGable query and explain how to rewrite it. What happens to the execution plan when you wrap an indexed column in a function?
ANSWER
SARGable (Search ARgument ABLE) means the query can leverage an index seek. Non-SARGable example: WHERE YEAR(order_date) = 2026. Rewrite as WHERE order_date >= '2026-01-01' AND order_date < '2027-01-01'. Wrapping an indexed column in a function forces a full index scan or table scan because the engine must evaluate the function on every row before comparing. The execution plan shows a Seq Scan instead of Index Seek.
Q05 of 07SENIOR
What is the difference between EXISTS and IN? When would you prefer one over the other? What happens to performance when the subquery result set is very large?
ANSWER
EXISTS checks for existence and short-circuits on first match. IN materializes the entire subquery result into a set. For large subquery results, EXISTS is usually faster because it can return as soon as it finds one row. IN builds a hash set of potentially millions of elements. For small, finite lists (like status codes), IN is more readable. I default to EXISTS for correlated subqueries and IN for small literal lists.
Q06 of 07JUNIOR
Explain NULL behavior in SQL. Why does `WHERE column = NULL` return zero rows? What is the difference between COUNT(*) and COUNT(column)? How does NULL affect arithmetic operations and aggregate functions?
ANSWER
NULL is not a value; it means 'unknown'. NULL = NULL evaluates to UNKNOWN, not TRUE, so WHERE column = NULL returns no rows. Use IS NULL. COUNT() counts all rows; COUNT(column) counts non-NULL rows only. Arithmetic with NULL yields NULL, so SUM(price quantity) silently drops rows where quantity is NULL. Always use COALESCE to provide defaults in arithmetic and aggregation.
Q07 of 07SENIOR
A query runs in 200ms on a table with 10,000 rows but takes 45 seconds on a table with 10 million rows. Walk me through how you would diagnose and optimize it. What would you check first?
ANSWER
First, run EXPLAIN ANALYZE to see the execution plan. Look for Seq Scan on large tables, high row estimates, and missing indexes. Check if the WHERE clause functions wrap indexed columns (non-SARGable). Check for implicit type conversions. Then look at join order and join types. Also check if the query uses SELECT * (unnecessary columns). If statistics are stale, run ANALYZE. If there's a correlated subquery, rewrite as JOIN/EXISTS. The systematic approach: EXPLAIN first, then fix SARGability, then consider covering indexes.
01
Explain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN. When would you use each one? Give a real-world example where choosing the wrong JOIN type would produce incorrect business metrics.
SENIOR
02
What is the difference between WHERE and HAVING? Can you use aggregate functions in WHERE? Why or why not? What is the logical processing order of a SQL query?
JUNIOR
03
Explain the difference between ROW_NUMBER(), RANK(), and DENSE_RANK(). Given a table of employee salaries, how would you find the second-highest salary per department using each function? Which would you choose and why?
SENIOR
04
What is a SARGable query? Give an example of a non-SARGable query and explain how to rewrite it. What happens to the execution plan when you wrap an indexed column in a function?
SENIOR
05
What is the difference between EXISTS and IN? When would you prefer one over the other? What happens to performance when the subquery result set is very large?
SENIOR
06
Explain NULL behavior in SQL. Why does `WHERE column = NULL` return zero rows? What is the difference between COUNT(*) and COUNT(column)? How does NULL affect arithmetic operations and aggregate functions?
JUNIOR
07
A query runs in 200ms on a table with 10,000 rows but takes 45 seconds on a table with 10 million rows. Walk me through how you would diagnose and optimize it. What would you check first?
SENIOR
FAQ · 4 QUESTIONS
Frequently Asked Questions
01
What is the most common mistake developers make with LEFT JOINs in SQL interviews?
Putting a filter on the right table's column in the WHERE clause instead of the ON clause. This silently converts the LEFT JOIN into an INNER JOIN, eliminating rows from the left table that don't have a match. Always test with a small dataset to verify row counts.
Was this helpful?
02
Why does COUNT(*) differ from COUNT(column) when NULLs are present?
COUNT() counts every row regardless of NULLs. COUNT(column) counts only rows where that column is NOT NULL. If you need a total row count, use COUNT(). If you need to know how many rows have a value in a specific column, use COUNT(column).
Was this helpful?
03
When should I use a CTE instead of a subquery?
Use a CTE when you need to reference the same intermediate result multiple times in a query, or when the query becomes deeply nested and hard to read. CTEs make complex queries self-documenting. For simple one-off filters, a subquery is fine.
Was this helpful?
04
What does SARGable mean and why does it matter?
SARGable (Search ARgument ABLE) means the query can use an index seek. Wrapping an indexed column in a function (like YEAR(date)) makes the condition non-SARGable, forcing a full table scan. Rewriting with a range condition (date >= '2026-01-01' AND date < '2027-01-01') enables the index seek.