LEFT JOIN WHERE Filter: Missing Customers (SQL Interview)
Finance quarterly report missing high-value customers: a LEFT JOIN WHERE filter silently turned it into an INNER JOIN.
- SQL interviews test engine behavior, not syntax memorization.
- LEFT JOIN: ON filter controls join participation; WHERE eliminates rows.
- NULL: not a value; use IS NULL and COALESCE for safety.
- Window functions: ROW_NUMBER vs RANK vs DENSE_RANK distinguish senior candidates.
- Indexes: function-wrapped columns (YEAR(date)) disable index seeks.
- Optimization: EXPLAIN ANALYZE reveals full scans, missing indexes, implicit conversions.
SQL interviews trip up even experienced developers — not because the syntax is hard, but because interviewers probe whether you understand what happens inside the database engine when your query runs. The difference between a candidate who gets hired and one who doesn't usually comes down to three words: 'and why that?'
I've interviewed over 60 candidates for backend and data engineering roles. The pattern is always the same: 80% can write a SELECT with a JOIN. Maybe 30% can explain why their JOIN is slow. Maybe 10% can look at an execution plan and tell me the optimizer chose a Hash Join when it should have used a Nested Loop. That 10% gets the offer.
Most people can write a SELECT statement; very few can explain why their query is scanning 2 million rows when it should be touching 50. I once inherited a reporting service that took 45 seconds to load a dashboard. The fix was one index and a rewritten subquery. The previous developer had been 'optimizing' by adding more RAM to the server for six months. The query was doing a full table scan on a 14-million-row orders table because someone wrapped an indexed column in YEAR() — making it non-SARGable. One line of SQL, six months of wasted infrastructure spend.
This matters because slow queries cost money. A poorly written JOIN on a 10-million-row orders table can turn a 20ms API response into a 12-second timeout. Knowing the difference between a clustered and non-clustered index, or understanding why a LEFT JOIN returns NULLs you didn't expect, is the difference between shipping reliable software and debugging production fires at 2am.
By the end of this article you'll be able to explain JOINs, NULLs, window functions, indexes, subqueries vs CTEs, GROUP BY, and query optimization at a level that earns the interviewer's respect — not just recognition. Every concept comes with a real schema, runnable SQL, actual output, and the exact reasoning an interviewer wants to hear.
Understanding JOINs: Beyond the Venn Diagram
Interviewers often start with JOINs. While Venn diagrams are a popular starting point, they're misleading for senior candidates because SQL JOINs are not set operations — they're row-combination operations. A LEFT JOIN doesn't 'subtract' anything; it preserves every row from the left table and fills in NULLs when there's no match on the right.
Senior candidates must discuss the logical processing of a JOIN. The database engine decides between three physical operators: Nested Loop (best for small tables), Hash Join (best for large unsorted tables), and Merge Join (best when both inputs are sorted on the join key). Knowing when each is chosen and why shows you understand the engine, not just the syntax.
The most common interview trap: putting a filter in the ON clause vs the WHERE clause of a LEFT JOIN. They produce completely different results. A filter in ON controls which right-side rows participate in the join. A filter in WHERE eliminates rows from the final result — including left-side rows that had no match. This distinction is the single most missed concept in SQL interviews.
I once caught a production bug where a developer filtered on the right table's status column in the WHERE clause of a LEFT JOIN. It silently converted the LEFT JOIN into an INNER JOIN — every customer without an active order disappeared from the report. The finance team made quarterly projections based on that report. The projections were wrong for three months before anyone noticed.
NULL Handling: The Landmine Nobody Practices
NULL is not zero. NULL is not an empty string. NULL is not 'false.' NULL is the absence of a value — it's the database saying 'I don't know.' And SQL's treatment of NULL is the source of more interview failures and production bugs than any other single concept.
The three rules that govern NULL: 1. NULL != NULL. NULL is not equal to anything, including itself. WHERE column = NULL returns zero rows. Always use IS NULL or IS NOT NULL. 2. Any arithmetic operation with NULL produces NULL. 5 + NULL = NULL. If your revenue column has NULLs, SUM(revenue) ignores them — but revenue * 1.1 produces NULL for those rows. 3. Comparison operators (<, >, =) with NULL produce UNKNOWN, not TRUE or FALSE. This means rows with NULL values are excluded from both WHERE column > 100 and WHERE column <= 100. They fall through both filters.
COALESCE is your safety net. It returns the first non-NULL value from its arguments. Use it to provide defaults: COALESCE(discount, 0) means 'use the discount if it exists, otherwise assume zero.' NVL (Oracle), IFNULL (MySQL), and ISNULL (SQL Server) are database-specific alternatives.
Production story: a revenue report was undercounting by 15%. The query was SELECT SUM(price quantity) FROM order_items. Some rows had NULL quantity (items pending confirmation). Multiplying by NULL produced NULL, and SUM skips NULLs — so those items silently disappeared from the revenue total. The fix was SUM(COALESCE(price, 0) COALESCE(quantity, 0)). One line. Three months of wrong numbers.
Window Functions: The Interview Differentiator
Window functions are the topic that separates mid-level candidates from senior ones. GROUP BY collapses rows. Window functions keep every row but add computed columns based on a 'window' of related rows. If GROUP BY gives you a summary, window functions give you a summary attached to every detail row.
ROW_NUMBER(): Assigns a unique sequential number. Ties get different numbers.RANK(): Assigns the same number to ties, then skips. 1, 1, 3, 4.DENSE_RANK(): Assigns the same number to ties, no skip. 1, 1, 2, 3.- LEAD(column, n): Gets the value from n rows ahead.
- LAG(column, n): Gets the value from n rows behind.
SUM()OVER(): Running total without collapsing rows.
The interview pattern I see most: 'Find the top 3 orders per customer' or 'Find the second-highest salary per department.' Both require ROW_NUMBER() or DENSE_RANK() with PARTITION BY. Candidates who reach for a subquery with LIMIT or a correlated subquery are showing they don't know window functions — and interviewers notice.
Production use case: I used LAG() to detect anomalies in a time-series metrics table. The query compared each hour's error rate to the previous hour's. If the error rate jumped by more than 200%, it flagged the row. That query ran against 50 million rows in 3 seconds because the window function avoided a self-join that would have taken 40+ seconds.
GROUP BY and HAVING: Aggregation Done Right
GROUP BY collapses multiple rows into summary rows. HAVING filters those summary rows after aggregation. WHERE filters individual rows before aggregation. Mixing them up is a classic interview mistake — and a common production bug.
The logical processing order matters: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. This means: - WHERE cannot use aggregate functions (SUM, COUNT, AVG) because it runs before GROUP BY. - HAVING can use aggregate functions because it runs after GROUP BY. - You can filter on non-aggregated columns in WHERE (faster) or HAVING (slower, but sometimes necessary).
The trap interviewers love: 'Show me departments where the average salary exceeds 50,000.' That's HAVING AVG(salary) > 50000. Candidates who write WHERE AVG(salary) > 50000 get an immediate syntax error — and the interviewer watches to see if you know why.
Another trap: selecting columns not in the GROUP BY clause. In strict SQL mode (MySQL) or PostgreSQL, SELECT department, name, AVG(salary) FROM employees GROUP BY department fails because 'name' is not in the GROUP BY and is not aggregated. In older MySQL modes, it silently returns an arbitrary name from each group — which is almost never what you want. Always ensure every non-aggregated column in SELECT is in the GROUP BY.
Subqueries vs CTEs vs Temp Tables: When to Use What
Interviewers ask: 'What's the difference between a subquery, a CTE, and a temp table?' The answer is about readability, reusability, and performance.
A subquery is a query nested inside another query. It works, but deeply nested subqueries become unreadable fast. If you have a subquery inside a subquery inside a WHERE clause, nobody — including future you — can debug it at 2am during an incident.
A CTE (Common Table Expression) with the WITH clause gives a subquery a name. It reads top-to-bottom, can be referenced multiple times, and makes complex queries self-documenting. The optimizer in PostgreSQL and SQL Server treats CTEs as inline views (not materialized temp tables) unless you explicitly use MATERIALIZED — so there's usually no performance difference from a subquery.
A temp table physically stores intermediate results. Use it when you need to reference the same intermediate result many times, when the intermediate result is expensive to compute, or when you need to index the intermediate result. The downside: it's more I/O and the optimizer can't push predicates through it.
My rule of thumb: CTEs for readability. Subqueries for simple one-off filters. Temp tables for heavy multi-step transformations where you reference the intermediate result 5+ times or need to index it.
Indexing and SARGability: Why Your Query is Slow
A common interview question is: 'Why isn't my index being used?' This usually leads to the concept of SARGable (Search ARgument ABLE) queries. If you wrap an indexed column in a function (like WHERE YEAR(created_at) = 2026), the database engine cannot use the index effectively, leading to a full table scan. Instead, provide a range the engine can 'seek' into.
I've seen this exact pattern kill production performance. A reporting query on a 14-million-row orders table was taking 45 seconds. The developer had WHERE YEAR(order_date) = 2025 AND MONTH(order_date) = 3. The index on order_date was completely ignored because YEAR() and MONTH() wrapped the column. The fix: WHERE order_date >= '2025-03-01' AND order_date < '2025-04-01'. Same result, 12 milliseconds instead of 45 seconds. The index seek touched 40,000 rows instead of scanning all 14 million.
Beyond SARGability, understand covering indexes. If an index contains every column referenced in the query (both in WHERE and SELECT), the engine never touches the actual table — it reads everything from the index. This is called an 'index-only scan' and it's the fastest possible query path.
Clustered vs non-clustered: a clustered index IS the table — data is stored in the order of the clustered index (usually the primary key). A non-clustered index is a separate structure with pointers back to the data. Each table has exactly one clustered index but can have many non-clustered indexes.
Query Optimization: Reading Execution Plans and Avoiding Common Traps
The ultimate interview question: 'How would you optimize a slow query?' Here's the systematic approach I use in production, and the one I want to hear from candidates.
Step 1: Run EXPLAIN. Look for Seq Scan on large tables, Nested Loop on large joins, and high row estimates. If the estimated rows are wildly different from actual rows, your table statistics are stale — run ANALYZE.
Step 2: Check for non-SARGable predicates. Functions on indexed columns, implicit type conversions, LIKE with leading wildcards.
Step 3: Check for missing indexes. If the WHERE clause filters on columns without indexes, add them. If the JOIN uses columns without indexes, add them.
Step 4: Check for SELECT *. You're pulling every column from disk when you need three. Use a covering index or select only required columns.
Step 5: Check for N+1 queries. If your application runs one query to get IDs, then one query per ID, that's N+1. Rewrite as a single JOIN or IN clause.
Step 6: Check for DISTINCT abuse. DISTINCT is expensive — it requires sorting or hashing all results to remove duplicates. If you're using DISTINCT to hide duplicate rows caused by a bad JOIN, fix the JOIN instead.
The implicit conversion trap is worth calling out explicitly: comparing a VARCHAR column to an integer literal (WHERE varchar_col = 123) forces the engine to convert every row's value to an integer before comparing. This prevents index usage and causes a full table scan. Always match types: WHERE varchar_col = '123'.
| Feature | Clustered Index | Non-Clustered Index |
|---|---|---|
| Data Storage | The table IS the index (stores actual data in index order) | Separate structure with pointers back to data pages |
| Count per Table | Only 1 (usually Primary Key) | Multiple (limited by engine, typically dozens) |
| Performance | Faster for range scans and ORDER BY on the key | Faster for targeted single-row lookups |
| Leaf Nodes | Actual data pages (the table rows themselves) | Pointers to the clustered index key or row ID |
| When to Use | Primary key, columns used in range queries and sorting | Columns frequently in WHERE, JOIN, or ORDER BY that aren't the primary key |
Key Takeaways
- LEFT JOIN preserves all rows from the left table; filter right-table columns in ON not WHERE.
- NULL is not a value — use IS NULL and COALESCE for comparisons and arithmetic.
- Window functions (ROW_NUMBER, RANK, DENSE_RANK) keep detail rows while adding computed columns.
- SARGable queries enable index seeks; function-wrapped indexed columns cause full scans.
- EXISTS short-circuits and beats IN for large subquery result sets.
- Run EXPLAIN before optimizing — it reveals the actual execution plan and missing indexes.
Common Mistakes to Avoid
- Using SELECT * in production queries
Symptom: Unnecessary I/O overhead, prevents covering index usage, and breaks backward compatibility when new columns are added to the table. In a 50-column table, SELECT * reads 16x more data than needed.
Fix: Always specify only the columns you need. Use a covering index that includes the selected columns. - Ignoring NULL logic
Symptom: WHERE column = NULL returns zero rows; arithmetic with NULL produces NULL; SUM skips NULLs silently. Revenue reports can undercount by 15% when NULL quantities exist.
Fix: Use IS NULL / IS NOT NULL for comparisons. Wrap nullable columns in COALESCE for calculations. - Failing to check execution plans
Symptom: Guessing why a query is slow leads to wasted effort. Without EXPLAIN, you may add indexes that aren't used or miss the real cause (missing statistics, implicit conversion).
Fix: Run EXPLAIN ANALYZE (PostgreSQL) or SHOW EXECUTION PLAN (SQL Server). Look for Seq Scans, high row estimates, and missing index usage. - Implicit type conversions destroying index usage
Symptom: Comparing a VARCHAR column to an integer literal forces a full table scan. A query that should take <1ms takes 850ms on 500k rows.
Fix: Always match the column's actual type: WHERE varchar_col = '123' instead of = 123. - Using DISTINCT to hide bad JOINs
Symptom: Adding DISTINCT to remove duplicates is expensive (requires sorting/hashing) and masks a wrong JOIN condition. Duplicate rows indicate a one-to-many join or missing join key.
Fix: Fix the JOIN condition instead. Verify cardinality between tables. Use DISTINCT only when you truly need to deduplicate. - Correlated subqueries in WHERE causing O(n²) behavior
Symptom: A subquery that references the outer query's columns runs once per outer row. On 100k rows, that's 100k executions, turning a 200ms query into 45 minutes.
Fix: Rewrite with EXISTS (short-circuits) or a JOIN (set-based, single pass). EXISTS is usually more efficient for large result sets. - Putting filters in the WHERE clause of a LEFT JOIN that belong in the ON clause
Symptom: All left-side rows with no match disappear from results. Financial reports, customer lists, and dashboards become incomplete without any error signal.
Fix: Move right-table conditions from WHERE to ON in the LEFT JOIN. The ON clause controls which rows participate; WHERE filters the final result.
Interview Questions on This Topic
- QExplain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN. When would you use each one? Give a real-world example where choosing the wrong JOIN type would produce incorrect business metrics.Mid-levelReveal
- QWhat is the difference between WHERE and HAVING? Can you use aggregate functions in WHERE? Why or why not? What is the logical processing order of a SQL query?JuniorReveal
- QExplain the difference between
ROW_NUMBER(),RANK(), andDENSE_RANK(). Given a table of employee salaries, how would you find the second-highest salary per department using each function? Which would you choose and why?Mid-levelReveal - QWhat is a SARGable query? Give an example of a non-SARGable query and explain how to rewrite it. What happens to the execution plan when you wrap an indexed column in a function?SeniorReveal
- QWhat is the difference between EXISTS and IN? When would you prefer one over the other? What happens to performance when the subquery result set is very large?Mid-levelReveal
- QExplain NULL behavior in SQL. Why does
WHERE column = NULLreturn zero rows? What is the difference between COUNT(*) and COUNT(column)? How does NULL affect arithmetic operations and aggregate functions?JuniorReveal - QA query runs in 200ms on a table with 10,000 rows but takes 45 seconds on a table with 10 million rows. Walk me through how you would diagnose and optimize it. What would you check first?SeniorReveal
Frequently Asked Questions
What is the most common mistake developers make with LEFT JOINs in SQL interviews?
Putting a filter on the right table's column in the WHERE clause instead of the ON clause. This silently converts the LEFT JOIN into an INNER JOIN, eliminating rows from the left table that don't have a match. Always test with a small dataset to verify row counts.
Why does COUNT(*) differ from COUNT(column) when NULLs are present?
COUNT() counts every row regardless of NULLs. COUNT(column) counts only rows where that column is NOT NULL. If you need a total row count, use COUNT(). If you need to know how many rows have a value in a specific column, use COUNT(column).
When should I use a CTE instead of a subquery?
Use a CTE when you need to reference the same intermediate result multiple times in a query, or when the query becomes deeply nested and hard to read. CTEs make complex queries self-documenting. For simple one-off filters, a subquery is fine.
What does SARGable mean and why does it matter?
SARGable (Search ARgument ABLE) means the query can use an index seek. Wrapping an indexed column in a function (like YEAR(date)) makes the condition non-SARGable, forcing a full table scan. Rewriting with a range condition (date >= '2026-01-01' AND date < '2027-01-01') enables the index seek.
That's Database Interview. Mark it forged?
9 min read · try the examples if you haven't