Intermediate 9 min · March 06, 2026

LEFT JOIN WHERE Filter: Missing Customers (SQL Interview)

Finance quarterly report missing high-value customers: a LEFT JOIN WHERE filter silently turned it into an INNER JOIN.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
Quick Answer
  • SQL interviews test engine behavior, not syntax memorization.
  • LEFT JOIN: ON filter controls join participation; WHERE eliminates rows.
  • NULL: not a value; use IS NULL and COALESCE for safety.
  • Window functions: ROW_NUMBER vs RANK vs DENSE_RANK distinguish senior candidates.
  • Indexes: function-wrapped columns (YEAR(date)) disable index seeks.
  • Optimization: EXPLAIN ANALYZE reveals full scans, missing indexes, implicit conversions.

SQL interviews trip up even experienced developers — not because the syntax is hard, but because interviewers probe whether you understand what happens inside the database engine when your query runs. The difference between a candidate who gets hired and one who doesn't usually comes down to three words: 'and why that?'

I've interviewed over 60 candidates for backend and data engineering roles. The pattern is always the same: 80% can write a SELECT with a JOIN. Maybe 30% can explain why their JOIN is slow. Maybe 10% can look at an execution plan and tell me the optimizer chose a Hash Join when it should have used a Nested Loop. That 10% gets the offer.

Most people can write a SELECT statement; very few can explain why their query is scanning 2 million rows when it should be touching 50. I once inherited a reporting service that took 45 seconds to load a dashboard. The fix was one index and a rewritten subquery. The previous developer had been 'optimizing' by adding more RAM to the server for six months. The query was doing a full table scan on a 14-million-row orders table because someone wrapped an indexed column in YEAR() — making it non-SARGable. One line of SQL, six months of wasted infrastructure spend.

This matters because slow queries cost money. A poorly written JOIN on a 10-million-row orders table can turn a 20ms API response into a 12-second timeout. Knowing the difference between a clustered and non-clustered index, or understanding why a LEFT JOIN returns NULLs you didn't expect, is the difference between shipping reliable software and debugging production fires at 2am.

By the end of this article you'll be able to explain JOINs, NULLs, window functions, indexes, subqueries vs CTEs, GROUP BY, and query optimization at a level that earns the interviewer's respect — not just recognition. Every concept comes with a real schema, runnable SQL, actual output, and the exact reasoning an interviewer wants to hear.

Understanding JOINs: Beyond the Venn Diagram

Interviewers often start with JOINs. While Venn diagrams are a popular starting point, they're misleading for senior candidates because SQL JOINs are not set operations — they're row-combination operations. A LEFT JOIN doesn't 'subtract' anything; it preserves every row from the left table and fills in NULLs when there's no match on the right.

Senior candidates must discuss the logical processing of a JOIN. The database engine decides between three physical operators: Nested Loop (best for small tables), Hash Join (best for large unsorted tables), and Merge Join (best when both inputs are sorted on the join key). Knowing when each is chosen and why shows you understand the engine, not just the syntax.

The most common interview trap: putting a filter in the ON clause vs the WHERE clause of a LEFT JOIN. They produce completely different results. A filter in ON controls which right-side rows participate in the join. A filter in WHERE eliminates rows from the final result — including left-side rows that had no match. This distinction is the single most missed concept in SQL interviews.

I once caught a production bug where a developer filtered on the right table's status column in the WHERE clause of a LEFT JOIN. It silently converted the LEFT JOIN into an INNER JOIN — every customer without an active order disappeared from the report. The finance team made quarterly projections based on that report. The projections were wrong for three months before anyone noticed.

NULL Handling: The Landmine Nobody Practices

NULL is not zero. NULL is not an empty string. NULL is not 'false.' NULL is the absence of a value — it's the database saying 'I don't know.' And SQL's treatment of NULL is the source of more interview failures and production bugs than any other single concept.

The three rules that govern NULL: 1. NULL != NULL. NULL is not equal to anything, including itself. WHERE column = NULL returns zero rows. Always use IS NULL or IS NOT NULL. 2. Any arithmetic operation with NULL produces NULL. 5 + NULL = NULL. If your revenue column has NULLs, SUM(revenue) ignores them — but revenue * 1.1 produces NULL for those rows. 3. Comparison operators (<, >, =) with NULL produce UNKNOWN, not TRUE or FALSE. This means rows with NULL values are excluded from both WHERE column > 100 and WHERE column <= 100. They fall through both filters.

COALESCE is your safety net. It returns the first non-NULL value from its arguments. Use it to provide defaults: COALESCE(discount, 0) means 'use the discount if it exists, otherwise assume zero.' NVL (Oracle), IFNULL (MySQL), and ISNULL (SQL Server) are database-specific alternatives.

Production story: a revenue report was undercounting by 15%. The query was SELECT SUM(price quantity) FROM order_items. Some rows had NULL quantity (items pending confirmation). Multiplying by NULL produced NULL, and SUM skips NULLs — so those items silently disappeared from the revenue total. The fix was SUM(COALESCE(price, 0) COALESCE(quantity, 0)). One line. Three months of wrong numbers.

Window Functions: The Interview Differentiator

Window functions are the topic that separates mid-level candidates from senior ones. GROUP BY collapses rows. Window functions keep every row but add computed columns based on a 'window' of related rows. If GROUP BY gives you a summary, window functions give you a summary attached to every detail row.

The key functions interviewers ask about
  • ROW_NUMBER(): Assigns a unique sequential number. Ties get different numbers.
  • RANK(): Assigns the same number to ties, then skips. 1, 1, 3, 4.
  • DENSE_RANK(): Assigns the same number to ties, no skip. 1, 1, 2, 3.
  • LEAD(column, n): Gets the value from n rows ahead.
  • LAG(column, n): Gets the value from n rows behind.
  • SUM() OVER(): Running total without collapsing rows.

The interview pattern I see most: 'Find the top 3 orders per customer' or 'Find the second-highest salary per department.' Both require ROW_NUMBER() or DENSE_RANK() with PARTITION BY. Candidates who reach for a subquery with LIMIT or a correlated subquery are showing they don't know window functions — and interviewers notice.

Production use case: I used LAG() to detect anomalies in a time-series metrics table. The query compared each hour's error rate to the previous hour's. If the error rate jumped by more than 200%, it flagged the row. That query ran against 50 million rows in 3 seconds because the window function avoided a self-join that would have taken 40+ seconds.

GROUP BY and HAVING: Aggregation Done Right

GROUP BY collapses multiple rows into summary rows. HAVING filters those summary rows after aggregation. WHERE filters individual rows before aggregation. Mixing them up is a classic interview mistake — and a common production bug.

The logical processing order matters: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. This means: - WHERE cannot use aggregate functions (SUM, COUNT, AVG) because it runs before GROUP BY. - HAVING can use aggregate functions because it runs after GROUP BY. - You can filter on non-aggregated columns in WHERE (faster) or HAVING (slower, but sometimes necessary).

The trap interviewers love: 'Show me departments where the average salary exceeds 50,000.' That's HAVING AVG(salary) > 50000. Candidates who write WHERE AVG(salary) > 50000 get an immediate syntax error — and the interviewer watches to see if you know why.

Another trap: selecting columns not in the GROUP BY clause. In strict SQL mode (MySQL) or PostgreSQL, SELECT department, name, AVG(salary) FROM employees GROUP BY department fails because 'name' is not in the GROUP BY and is not aggregated. In older MySQL modes, it silently returns an arbitrary name from each group — which is almost never what you want. Always ensure every non-aggregated column in SELECT is in the GROUP BY.

Subqueries vs CTEs vs Temp Tables: When to Use What

Interviewers ask: 'What's the difference between a subquery, a CTE, and a temp table?' The answer is about readability, reusability, and performance.

A subquery is a query nested inside another query. It works, but deeply nested subqueries become unreadable fast. If you have a subquery inside a subquery inside a WHERE clause, nobody — including future you — can debug it at 2am during an incident.

A CTE (Common Table Expression) with the WITH clause gives a subquery a name. It reads top-to-bottom, can be referenced multiple times, and makes complex queries self-documenting. The optimizer in PostgreSQL and SQL Server treats CTEs as inline views (not materialized temp tables) unless you explicitly use MATERIALIZED — so there's usually no performance difference from a subquery.

A temp table physically stores intermediate results. Use it when you need to reference the same intermediate result many times, when the intermediate result is expensive to compute, or when you need to index the intermediate result. The downside: it's more I/O and the optimizer can't push predicates through it.

My rule of thumb: CTEs for readability. Subqueries for simple one-off filters. Temp tables for heavy multi-step transformations where you reference the intermediate result 5+ times or need to index it.

Indexing and SARGability: Why Your Query is Slow

A common interview question is: 'Why isn't my index being used?' This usually leads to the concept of SARGable (Search ARgument ABLE) queries. If you wrap an indexed column in a function (like WHERE YEAR(created_at) = 2026), the database engine cannot use the index effectively, leading to a full table scan. Instead, provide a range the engine can 'seek' into.

I've seen this exact pattern kill production performance. A reporting query on a 14-million-row orders table was taking 45 seconds. The developer had WHERE YEAR(order_date) = 2025 AND MONTH(order_date) = 3. The index on order_date was completely ignored because YEAR() and MONTH() wrapped the column. The fix: WHERE order_date >= '2025-03-01' AND order_date < '2025-04-01'. Same result, 12 milliseconds instead of 45 seconds. The index seek touched 40,000 rows instead of scanning all 14 million.

Beyond SARGability, understand covering indexes. If an index contains every column referenced in the query (both in WHERE and SELECT), the engine never touches the actual table — it reads everything from the index. This is called an 'index-only scan' and it's the fastest possible query path.

Clustered vs non-clustered: a clustered index IS the table — data is stored in the order of the clustered index (usually the primary key). A non-clustered index is a separate structure with pointers back to the data. Each table has exactly one clustered index but can have many non-clustered indexes.

Query Optimization: Reading Execution Plans and Avoiding Common Traps

The ultimate interview question: 'How would you optimize a slow query?' Here's the systematic approach I use in production, and the one I want to hear from candidates.

Step 1: Run EXPLAIN. Look for Seq Scan on large tables, Nested Loop on large joins, and high row estimates. If the estimated rows are wildly different from actual rows, your table statistics are stale — run ANALYZE.

Step 2: Check for non-SARGable predicates. Functions on indexed columns, implicit type conversions, LIKE with leading wildcards.

Step 3: Check for missing indexes. If the WHERE clause filters on columns without indexes, add them. If the JOIN uses columns without indexes, add them.

Step 4: Check for SELECT *. You're pulling every column from disk when you need three. Use a covering index or select only required columns.

Step 5: Check for N+1 queries. If your application runs one query to get IDs, then one query per ID, that's N+1. Rewrite as a single JOIN or IN clause.

Step 6: Check for DISTINCT abuse. DISTINCT is expensive — it requires sorting or hashing all results to remove duplicates. If you're using DISTINCT to hide duplicate rows caused by a bad JOIN, fix the JOIN instead.

The implicit conversion trap is worth calling out explicitly: comparing a VARCHAR column to an integer literal (WHERE varchar_col = 123) forces the engine to convert every row's value to an integer before comparing. This prevents index usage and causes a full table scan. Always match types: WHERE varchar_col = '123'.

Clustered Index vs Non-Clustered Index
FeatureClustered IndexNon-Clustered Index
Data StorageThe table IS the index (stores actual data in index order)Separate structure with pointers back to data pages
Count per TableOnly 1 (usually Primary Key)Multiple (limited by engine, typically dozens)
PerformanceFaster for range scans and ORDER BY on the keyFaster for targeted single-row lookups
Leaf NodesActual data pages (the table rows themselves)Pointers to the clustered index key or row ID
When to UsePrimary key, columns used in range queries and sortingColumns frequently in WHERE, JOIN, or ORDER BY that aren't the primary key

Key Takeaways

  • LEFT JOIN preserves all rows from the left table; filter right-table columns in ON not WHERE.
  • NULL is not a value — use IS NULL and COALESCE for comparisons and arithmetic.
  • Window functions (ROW_NUMBER, RANK, DENSE_RANK) keep detail rows while adding computed columns.
  • SARGable queries enable index seeks; function-wrapped indexed columns cause full scans.
  • EXISTS short-circuits and beats IN for large subquery result sets.
  • Run EXPLAIN before optimizing — it reveals the actual execution plan and missing indexes.

Common Mistakes to Avoid

  • Using SELECT * in production queries
    Symptom: Unnecessary I/O overhead, prevents covering index usage, and breaks backward compatibility when new columns are added to the table. In a 50-column table, SELECT * reads 16x more data than needed.
    Fix: Always specify only the columns you need. Use a covering index that includes the selected columns.
  • Ignoring NULL logic
    Symptom: WHERE column = NULL returns zero rows; arithmetic with NULL produces NULL; SUM skips NULLs silently. Revenue reports can undercount by 15% when NULL quantities exist.
    Fix: Use IS NULL / IS NOT NULL for comparisons. Wrap nullable columns in COALESCE for calculations.
  • Failing to check execution plans
    Symptom: Guessing why a query is slow leads to wasted effort. Without EXPLAIN, you may add indexes that aren't used or miss the real cause (missing statistics, implicit conversion).
    Fix: Run EXPLAIN ANALYZE (PostgreSQL) or SHOW EXECUTION PLAN (SQL Server). Look for Seq Scans, high row estimates, and missing index usage.
  • Implicit type conversions destroying index usage
    Symptom: Comparing a VARCHAR column to an integer literal forces a full table scan. A query that should take <1ms takes 850ms on 500k rows.
    Fix: Always match the column's actual type: WHERE varchar_col = '123' instead of = 123.
  • Using DISTINCT to hide bad JOINs
    Symptom: Adding DISTINCT to remove duplicates is expensive (requires sorting/hashing) and masks a wrong JOIN condition. Duplicate rows indicate a one-to-many join or missing join key.
    Fix: Fix the JOIN condition instead. Verify cardinality between tables. Use DISTINCT only when you truly need to deduplicate.
  • Correlated subqueries in WHERE causing O(n²) behavior
    Symptom: A subquery that references the outer query's columns runs once per outer row. On 100k rows, that's 100k executions, turning a 200ms query into 45 minutes.
    Fix: Rewrite with EXISTS (short-circuits) or a JOIN (set-based, single pass). EXISTS is usually more efficient for large result sets.
  • Putting filters in the WHERE clause of a LEFT JOIN that belong in the ON clause
    Symptom: All left-side rows with no match disappear from results. Financial reports, customer lists, and dashboards become incomplete without any error signal.
    Fix: Move right-table conditions from WHERE to ON in the LEFT JOIN. The ON clause controls which rows participate; WHERE filters the final result.

Interview Questions on This Topic

  • QExplain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN. When would you use each one? Give a real-world example where choosing the wrong JOIN type would produce incorrect business metrics.Mid-levelReveal
    INNER JOIN returns rows only when there is a match in both tables. LEFT JOIN returns all rows from the left table and matched rows from the right, filling NULLs for non-matches. RIGHT JOIN is symmetric. FULL OUTER JOIN returns all rows from both, with NULLs where no match exists. A wrong choice example: using INNER JOIN to list customers and their orders omits customers without orders, understating the customer base. Use LEFT JOIN when you need all entities from the primary table regardless of related data.
  • QWhat is the difference between WHERE and HAVING? Can you use aggregate functions in WHERE? Why or why not? What is the logical processing order of a SQL query?JuniorReveal
    WHERE filters rows before grouping; HAVING filters groups after aggregation. Aggregate functions cannot be used in WHERE because WHERE runs before GROUP BY, so aggregates haven't been computed yet. Logical order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. This means you can alias a column in SELECT and use it in ORDER BY, but not in WHERE.
  • QExplain the difference between ROW_NUMBER(), RANK(), and DENSE_RANK(). Given a table of employee salaries, how would you find the second-highest salary per department using each function? Which would you choose and why?Mid-levelReveal
    ROW_NUMBER assigns unique consecutive numbers, breaking ties arbitrarily. RANK gives ties the same number then skips the next rank. DENSE_RANK gives ties the same number without skipping. For second-highest salary per department, use DENSE_RANK because it correctly assigns rank=2 to the second salary tier even if multiple people share highest. With RANK, second-highest becomes rank=3 if two share top. ROW_NUMBER would arbitrarily pick one of the tied top, making 'second' meaningless.
  • QWhat is a SARGable query? Give an example of a non-SARGable query and explain how to rewrite it. What happens to the execution plan when you wrap an indexed column in a function?SeniorReveal
    SARGable (Search ARgument ABLE) means the query can leverage an index seek. Non-SARGable example: WHERE YEAR(order_date) = 2026. Rewrite as WHERE order_date >= '2026-01-01' AND order_date < '2027-01-01'. Wrapping an indexed column in a function forces a full index scan or table scan because the engine must evaluate the function on every row before comparing. The execution plan shows a Seq Scan instead of Index Seek.
  • QWhat is the difference between EXISTS and IN? When would you prefer one over the other? What happens to performance when the subquery result set is very large?Mid-levelReveal
    EXISTS checks for existence and short-circuits on first match. IN materializes the entire subquery result into a set. For large subquery results, EXISTS is usually faster because it can return as soon as it finds one row. IN builds a hash set of potentially millions of elements. For small, finite lists (like status codes), IN is more readable. I default to EXISTS for correlated subqueries and IN for small literal lists.
  • QExplain NULL behavior in SQL. Why does WHERE column = NULL return zero rows? What is the difference between COUNT(*) and COUNT(column)? How does NULL affect arithmetic operations and aggregate functions?JuniorReveal
    NULL is not a value; it means 'unknown'. NULL = NULL evaluates to UNKNOWN, not TRUE, so WHERE column = NULL returns no rows. Use IS NULL. COUNT() counts all rows; COUNT(column) counts non-NULL rows only. Arithmetic with NULL yields NULL, so SUM(price quantity) silently drops rows where quantity is NULL. Always use COALESCE to provide defaults in arithmetic and aggregation.
  • QA query runs in 200ms on a table with 10,000 rows but takes 45 seconds on a table with 10 million rows. Walk me through how you would diagnose and optimize it. What would you check first?SeniorReveal
    First, run EXPLAIN ANALYZE to see the execution plan. Look for Seq Scan on large tables, high row estimates, and missing indexes. Check if the WHERE clause functions wrap indexed columns (non-SARGable). Check for implicit type conversions. Then look at join order and join types. Also check if the query uses SELECT * (unnecessary columns). If statistics are stale, run ANALYZE. If there's a correlated subquery, rewrite as JOIN/EXISTS. The systematic approach: EXPLAIN first, then fix SARGability, then consider covering indexes.

Frequently Asked Questions

What is the most common mistake developers make with LEFT JOINs in SQL interviews?

Putting a filter on the right table's column in the WHERE clause instead of the ON clause. This silently converts the LEFT JOIN into an INNER JOIN, eliminating rows from the left table that don't have a match. Always test with a small dataset to verify row counts.

Why does COUNT(*) differ from COUNT(column) when NULLs are present?

COUNT() counts every row regardless of NULLs. COUNT(column) counts only rows where that column is NOT NULL. If you need a total row count, use COUNT(). If you need to know how many rows have a value in a specific column, use COUNT(column).

When should I use a CTE instead of a subquery?

Use a CTE when you need to reference the same intermediate result multiple times in a query, or when the query becomes deeply nested and hard to read. CTEs make complex queries self-documenting. For simple one-off filters, a subquery is fine.

What does SARGable mean and why does it matter?

SARGable (Search ARgument ABLE) means the query can use an index seek. Wrapping an indexed column in a function (like YEAR(date)) makes the condition non-SARGable, forcing a full table scan. Rewriting with a range condition (date >= '2026-01-01' AND date < '2027-01-01') enables the index seek.

🔥

That's Database Interview. Mark it forged?

9 min read · try the examples if you haven't

Previous
Design a Leaderboard System
1 / 4 · Database Interview
Next
DBMS Interview Questions