Senior 4 min · March 17, 2026

SQL Subqueries — 40-Minute Timeout from Correlated Query

Q: When should I use a CTE instead of a subquery?

Use a CTE when the subquery is reused multiple times, when the logic has multiple sequential steps, or when you need recursion. CTEs are named and easier to debug — you can comment out the outer query and SELECT from the CTE directly. For simple one-off subqueries, either works.

Q: Can a subquery return multiple columns?

In a WHERE clause with IN, the subquery can return multiple columns using row value constructors: WHERE (col1, col2) IN (SELECT a, b FROM ...). In a scalar context, the subquery must return exactly one column and one row — otherwise you get a runtime error.

A correlated subquery ran AVG() 200k times — 2-second report became 40-minute timeout.

Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Everything here is grounded in real deployments.

✓ Production

production tested

May 24, 2026

last updated

1,554

articles · all by Naren

● Production Incident 🔎 Debug Guide

⚡Quick Answer

A subquery is a SELECT nested inside another query — appears in WHERE (filter), SELECT (compute a value), or FROM (derived table)
Non-correlated subquery: executes once, result reused — fast
Correlated subquery: references the outer query and re-executes once per outer row — potentially O(n²)
EXISTS short-circuits on first match; IN collects the full result set first
NOT IN is dangerous when the subquery can return NULL — returns empty set silently; use NOT EXISTS instead
Rule: replace correlated subqueries with JOINs on large tables for an order-of-magnitude speedup

✦ Definition~90s read

What is SQL Subqueries?

A subquery is a SELECT statement nested inside another SQL statement — typically inside WHERE, FROM, or HAVING clauses. It exists because SQL is declarative: you describe what you want, not how to get it. Subqueries let you express multi-step logic in a single query, like "find employees whose salary exceeds the average for their department" without writing procedural loops or temp tables.

★

A subquery is a question inside a question.

They're the SQL equivalent of a function call — encapsulating a data lookup that the outer query consumes as a value, a set, or a boolean check.

Subqueries split into two camps: non-correlated and correlated. A non-correlated subquery runs once, independently, and its result is fed to the outer query — think WHERE salary > (SELECT AVG(salary) FROM employees). That's fast and predictable. A correlated subquery references columns from the outer query and must re-execute for every row the outer query processes — WHERE salary > (SELECT AVG(salary) FROM employees e2 WHERE e2.dept_id = e1.dept_id).

This is where the O(n²) trap lives: for 10,000 employees across 50 departments, the inner query fires 10,000 times. PostgreSQL, MySQL, and SQL Server all handle this differently under the hood, but the pattern is the same — correlated subqueries are powerful but demand respect for their cost.

Alternatives exist. JOINs often outperform subqueries, especially when you need columns from the inner table. Window functions like AVG() OVER (PARTITION BY dept_id) can replace correlated subqueries with a single table scan. Common Table Expressions (CTEs) improve readability for deeply nested logic.

Don't use a subquery when a JOIN gives you the same result with less overhead — and never use a correlated subquery in a FROM clause unless you're prepared for a Cartesian explosion. The rule of thumb: if you're filtering rows based on an aggregate from a related table, a subquery is idiomatic; if you're transforming or enriching data, reach for a JOIN or window function first.

Plain-English First

A subquery is a question inside a question. Asking 'which customers placed an order?' is one question. Asking 'which customers placed an order that was larger than the average order?' requires asking the average question first, then using that answer. SQL lets you nest the inner question directly inside the outer one. The engine answers the inner question first, then uses that result to answer the outer one.

Subqueries are one of the most flexible tools in SQL — they let you compose queries in ways that would otherwise require multiple steps or temporary tables. But they also hide performance traps, especially correlated subqueries that run once per row.

This guide covers the main subquery types, the EXISTS vs IN performance difference, the dangerous NULL behaviour of NOT IN, and when to reach for CTEs instead. Understanding the performance profile of each subquery type is the difference between a query that works and one that works at scale.

What a Subquery Actually Is

A subquery is a SELECT statement nested inside another SQL statement — inside a WHERE, FROM, or HAVING clause. Its result feeds the outer query as a value, a row set, or a correlation reference. The core mechanic: the outer query pauses, the inner query executes (once or per row), and the outer query uses that result to filter, compute, or join.

Subqueries come in two flavors: non-correlated (executed once, independent of the outer query) and correlated (re-executed for each row of the outer query — O(n*m) cost). A non-correlated subquery runs first, materializes its result, then the outer query uses it. A correlated subquery references columns from the outer query, forcing per-row execution. That distinction is the single most important performance property: correlated subqueries can turn a fast query into a 40-minute timeout.

Use subqueries when you need to compare a value against an aggregated result (e.g., "employees earning above department average") or when a JOIN would produce duplicate rows due to one-to-many relationships. In production systems, subqueries often replace complex JOINs for readability, but the trade-off is execution plan complexity. A well-placed subquery can reduce data scanned by orders of magnitude; a poorly placed one can scan the same table millions of times.

Correlated Subquery Trap

A correlated subquery in a WHERE clause with 100k rows executes 100k times — one full scan per row. That's not a bug; it's the feature you didn't ask for.

Production Insight

A reporting query on a 5M-row orders table used a correlated subquery in SELECT to compute customer lifetime value per row. The query ran for 47 minutes and blocked the replica's I/O. Symptom: CPU at 100%, disk reads spiking, query stuck in 'Sending data' state. Rule: never put a correlated subquery in the SELECT list of a large result set — use a window function or a derived table instead.

Key Takeaway

Non-correlated subqueries execute once; correlated subqueries execute per outer row — treat them as nested loops.

A subquery in WHERE can often be rewritten as a JOIN or EXISTS with better performance.

Always check the execution plan: a 'DEPENDENT SUBQUERY' is a red flag for large datasets.

thecodeforge.io

SQL Subqueries: From Basics to Correlated Traps

Sql Subqueries

Scalar and Non-Correlated Subqueries

A scalar subquery returns exactly one value — one row, one column. You can use it anywhere a single value is expected: in SELECT to compute a derived column, in WHERE for a comparison, or in SET for an UPDATE. The key property of a non-correlated scalar subquery is that it executes once and its result is reused for every row of the outer query.

Subqueries in WHERE with IN accept a multi-row, single-column result — the outer query keeps rows where the column value appears in that list. Subqueries in FROM create a derived table (also called an inline view) — the result acts as a temporary table for the outer query to filter or join against.

Performance note: non-correlated subqueries execute once. The engine computes the inner result, then uses it for all outer row comparisons. This is fundamentally different from correlated subqueries, which re-execute per outer row.

non_correlated_subqueries.sqlSQL

-- Scalar subquery: compute one value, reuse for every row
SELECT
    employee_name,
    salary,
    (SELECT AVG(salary) FROM employees) AS company_avg,
    salary - (SELECT AVG(salary) FROM employees) AS diff_from_avg
FROM employees;
-- The AVG subquery executes ONCE, result reused for all rows

-- Subquery in WHERE with IN
SELECT product_name
FROM products
WHERE category_id IN (
    SELECT id FROM categories WHERE type = 'Electronics'
);
-- Inner query executes once, outer query checks membership

-- Subquery in FROM (derived table / inline view)
SELECT dept, avg_salary
FROM (
    SELECT department AS dept, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
) dept_averages
WHERE avg_salary > 60000;
-- Derived table computes once, outer query filters the result

Output

-- Scalar subquery result:

employee_name | salary | company_avg | diff_from_avg

Alice Tanaka | 95000.00 | 72000.00 | 23000.00

Marcus Webb | 58000.00 | 72000.00 | -14000.00

-- Derived table result:

dept | avg_salary

Engineering | 88500.00

Product | 79200.00

When Derived Tables Become CTEs

Derived tables (subqueries in FROM) are anonymous and can only be used once. When you need to reference the same intermediate result more than once, or when the logic has multiple steps, a CTE is cleaner: WITH dept_averages AS (SELECT ...) SELECT ... FROM dept_averages WHERE ...

Production Insight

A scalar subquery in SELECT re-executes once per output row — it is effectively a correlated subquery even when it looks independent.

For SELECT (SELECT COUNT(*) FROM orders WHERE customer_id = c.id) FROM customers — replace with a LEFT JOIN to a pre-aggregated subquery for large tables.

Non-correlated subqueries in WHERE with IN are safe for moderate result set sizes; over ~10,000 values in the IN list, consider a JOIN instead.

Key Takeaway

Non-correlated subqueries execute once — the result is reused for all outer row comparisons.

Scalar subqueries in SELECT look innocent but re-execute per output row at large scale.

When the same intermediate result is needed twice, use a CTE instead of repeating the subquery.

Correlated Subqueries — Power and the O(n²) Trap

A correlated subquery references a column from the outer query. This means it cannot run independently — it must re-execute for each row the outer query processes. On a table with 100 rows, that's 100 executions. On a table with 100,000 rows, it's 100,000 executions. This is the O(n²) trap.

Correlated subqueries are useful for row-by-row comparisons that are hard to express any other way. The canonical example: 'find employees earning more than their own department's average.' The subquery needs to know which department the current outer row belongs to — and that changes row by row.

But for large tables, this pattern is almost always replaceable with a JOIN to a pre-aggregated derived table or CTE. The JOIN version computes the average once per department (not once per row) and is orders of magnitude faster.

correlated_subquery_and_fix.sqlSQL

-- CORRELATED SUBQUERY: executes AVG() once per outer row
-- On 100,000 employees: 100,000 AVG() computations
SELECT e.name, e.salary, e.department
FROM employees e
WHERE e.salary > (
    SELECT AVG(salary)
    FROM employees
    WHERE department = e.department  -- references outer row!
);

-- BETTER: pre-aggregate in a derived table (computes AVG once per dept)
SELECT e.name, e.salary, e.department
FROM employees e
JOIN (
    SELECT department, AVG(salary) AS dept_avg
    FROM employees
    GROUP BY department
) dept ON e.department = dept.department
WHERE e.salary > dept.dept_avg;

-- Or with a CTE (same performance, more readable):
WITH dept_averages AS (
    SELECT department, AVG(salary) AS dept_avg
    FROM employees
    GROUP BY department
)
SELECT e.name, e.salary, e.department
FROM employees e
JOIN dept_averages d ON e.department = d.department
WHERE e.salary > d.dept_avg;

Output

-- Correlated: ~40 minutes on 200,000 rows (real incident)

-- JOIN version: 1.4 seconds on same data

-- EXPLAIN output correlated: loops=200000 on inner AVG()

-- EXPLAIN output JOIN: loops=1 on the aggregation

The Scale Trap

A correlated subquery that takes 2 seconds on 5,000 rows will take roughly (200,000/5,000)² = 1,600× longer on 200,000 rows — that's 53 minutes. This is not an edge case; it is a predictable consequence of O(n²) complexity. Always check execution plans for correlated subqueries before shipping to production.

Production Insight

Correlated subqueries look correct and work fine in development — they only reveal their complexity at production scale.

Run EXPLAIN ANALYZE and look for 'loops=N' on the inner query node where N = outer table row count.

The rewrite rule: correlated subquery in WHERE → JOIN to pre-aggregated CTE. Almost always possible.

Key Takeaway

Correlated subqueries re-execute once per outer row — O(n×m) complexity, lethal at scale.

The fix: pre-aggregate in a CTE or derived table, then JOIN — reduces inner query executions from N to 1.

Always check EXPLAIN for loops count on any subquery involving a large outer table.

EXISTS vs IN — Performance and the NULL Danger

EXISTS and IN are both used to check whether matching rows exist in another table, but they behave differently in important ways.

IN collects the full result set from the subquery, then checks membership for each outer row. If the subquery returns 10,000 rows, IN builds a 10,000-entry lookup structure first. EXISTS, on the other hand, is correlated — it runs the inner query for each outer row and stops as soon as it finds one match. This short-circuit makes EXISTS faster when you expect many matches.

The most important practical difference is NULL handling. NOT IN returns an empty result set if the subquery contains any NULL values — because SQL uses three-valued logic: a value compared to NULL is neither true nor false, it's unknown. One NULL in the subquery poisons the entire NOT IN check. NOT EXISTS does not have this problem because it never compares values directly to NULL.

exists_vs_in.sqlSQL

-- IN: collects all values from subquery, then checks membership
SELECT name FROM customers
WHERE id IN (
    SELECT customer_id FROM orders
);

-- EXISTS: short-circuits on first match — stops looking immediately
SELECT name FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders WHERE customer_id = c.id
);

-- The SELECT 1 is a convention — any column works here,
-- EXISTS only cares whether a row was found, not what it contains

-- DANGEROUS: NOT IN with potential NULLs
SELECT name FROM customers
WHERE id NOT IN (
    SELECT customer_id FROM orders  -- if ANY row has NULL customer_id,
);                                  -- this returns ZERO rows silently!

-- SAFE: NOT EXISTS handles NULLs correctly
SELECT name FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);

-- ALSO SAFE: explicit NULL guard on NOT IN
SELECT name FROM customers
WHERE id NOT IN (
    SELECT customer_id FROM orders
    WHERE customer_id IS NOT NULL  -- explicitly exclude NULLs
);

Output

-- EXISTS result: customers who have at least one order

name

Alice Tanaka

Ben Okafor

-- NOT IN result (if any order has NULL customer_id): empty set

-- NOT EXISTS result: correctly returns customers with no orders

name

Clara Svensson

NOT IN + NULL = Silent Data Loss

This is one of the most dangerous SQL bugs because it produces no error — just silently wrong results. If the orders table allows NULL in customer_id (common in systems with orphaned records), NOT IN returns zero rows every time. Default to NOT EXISTS for negative existence checks. Only use NOT IN when you can guarantee the subquery column is NOT NULL.

Production Insight

NOT IN silent empty results from NULLs have caused incorrect reporting in customer churn analysis, financial reconciliation, and compliance audits — all with zero error messages.

Default to NOT EXISTS for any negative existence check. Add a comment explaining why if the code reviewer might question it.

For EXISTS vs IN performance: EXISTS wins when most outer rows have matches (short-circuits early); IN wins when the subquery returns very few rows and the outer table is large.

Key Takeaway

EXISTS short-circuits on first match — faster when most rows match.

NOT IN is broken when the subquery contains any NULL — use NOT EXISTS as the safe default.

For correlated EXISTS on large tables, always index the correlation column in the inner table.

Subqueries in the FROM Clause — Deriving Battle Tables

You understand subqueries in WHERE. Now weaponize them in FROM. A subquery in the FROM clause creates a derived table — a temporary result set you can treat like a real table. Why bother? Because it lets you pre-aggregate, filter, or reshape data before your outer query touches it, avoiding repeated scans and saving your production DB from unnecessary thrashing. The derived table must have an alias. Every database needs a name to reference it. Execution order: the inner query runs first, materializing the derived table (in memory or tempdb), then the outer query queries that result. Be warned: if your derived table returns millions of rows, you just created a bottleneck. Always check execution plans. Derive only what you need, when you need it.

top_customers.sqlSQL

-- io.thecodeforge
-- Find customers whose total orders exceed 10k
SELECT dt.customer_id, dt.total_spent
FROM (
    SELECT o.customer_id, SUM(o.amount) AS total_spent
    FROM orders o
    WHERE o.status = 'completed'
    GROUP BY o.customer_id
    HAVING SUM(o.amount) > 10000
) AS dt
ORDER BY dt.total_spent DESC;

Output

customer_id | total_spent

1234 | 28500

5678 | 22000

9012 | 15250

Production Trap:

Derived tables don't have indexes. If you filter the outer query on a column from the derived table, you force a full scan. Push filters inside the subquery to reduce rows early.

Key Takeaway

A derived table lets you pre-process data before the outer query runs, but keep it lean — anything large belongs in a temp table with indexes.

Subqueries with UPDATE and DELETE — Precision Surgery

Subqueries aren't just for SELECT. You can embed them in UPDATE and DELETE to target rows dynamically without manual joins or multiple statements. This is how you fix bad data without a full table scan in a transaction. For UPDATE, the subquery provides the new value or determines which rows to change. For DELETE, it identifies the rows to remove based on complex logic. Both patterns let you avoid race conditions where your filter changes between a SELECT and a subsequent write. The inner query executes first, giving you a snapshot of what to change. Critical: If your subquery returns NULL inadvertently, your UPDATE might set a column to NULL or your DELETE might remove nothing silently. Always test with a SELECT first, wrap in a transaction, and commit only when you see the expected row count.

fix_inventory.sqlSQL

-- io.thecodeforge
-- Update product prices to match supplier's latest price
UPDATE products p
SET p.price = (
    SELECT s.unit_price
    FROM supplier_prices s
    WHERE s.sku = p.sku
    AND s.effective_date = CURRENT_DATE
)
WHERE EXISTS (
    SELECT 1
    FROM supplier_prices s
    WHERE s.sku = p.sku
    AND s.effective_date = CURRENT_DATE
);

Output

5 rows affected. Prices updated for SKUs: ABC-123, DEF-456, GHI-789, JKL-012, MNO-345

The Atomic Pattern:

Use EXISTS in your WHERE clause to guard against missing key values. Without it, you'd set prices to NULL for any product without a current supplier price.

Key Takeaway

Subqueries in DML give you surgical precision — but always validate with a SELECT and wrap in a transaction to prevent data loss.

● Production incidentPOST-MORTEMseverity: high

A Correlated Subquery Turned a 2-Second Report into a 40-Minute Timeout

Symptom

A manager dashboard that loaded in 2 seconds in staging started returning 504 Gateway Timeout in production after a data migration that expanded the customer base from 5,000 to 200,000 records.

Assumption

The query had always worked fine. The developer had tested it on a sample dataset and assumed it would scale linearly.

Root cause

The WHERE clause used a correlated subquery: WHERE salary > (SELECT AVG(salary) FROM employees WHERE department = e.department). This re-executed the AVG() computation for every row in the 200,000-row outer table. 200,000 × AVG() = 200,000 full group scans. The query had O(n²) complexity that only became visible at production scale.

Fix

Replaced the correlated subquery with a JOIN to a derived table: WITH dept_avg AS (SELECT department, AVG(salary) AS avg FROM employees GROUP BY department) SELECT * FROM employees e JOIN dept_avg d ON e.department = d.department WHERE e.salary > d.avg. The rewrite computed the average once per department (not once per row) and brought runtime from 40 minutes to 1.4 seconds.

Key lesson

Correlated subqueries have O(n×m) complexity — always verify execution plans on production-scale data
A query that runs in 2 seconds on 5,000 rows can timeout at 200,000 rows with no other changes
Replace correlated subqueries with a pre-aggregated JOIN or CTE whenever the outer table is large

Production debug guideDiagnosing slow queries caused by subquery patterns4 entries

Symptom · 01

Query is slow and execution plan shows the inner query cost appearing many times

→

Fix

You have a correlated subquery. Run EXPLAIN and look for the inner query node with loops = N where N equals the outer row count. Rewrite as a JOIN to a derived table or CTE that computes the inner result once.

Symptom · 02

NOT IN returns zero rows unexpectedly when you know matching rows exist

→

Fix

The subquery contains at least one NULL value. NOT IN with a NULL in the subquery always returns an empty set due to three-valued logic. Switch to NOT EXISTS which handles NULLs correctly, or add WHERE subquery_col IS NOT NULL.

Symptom · 03

EXISTS query is slower than expected despite short-circuiting

→

Fix

Verify the correlated column in the EXISTS subquery is indexed. EXISTS still performs a lookup per outer row — if the inner table has no index on the correlation column, each lookup is a full scan.

Symptom · 04

Subquery in SELECT clause (scalar subquery) runs slow at scale

→

Fix

A scalar subquery in SELECT re-executes once per output row — it is a correlated subquery. Move it to a LEFT JOIN: SELECT c.name, o.order_count FROM customers c LEFT JOIN (SELECT customer_id, COUNT(*) AS order_count FROM orders GROUP BY customer_id) o ON c.id = o.customer_id.

Pattern	Executes	NULL safe?	Short-circuits?	Best for
Non-correlated subquery	Once	Yes	No	Computing a single value or filtering against a static list
Correlated subquery	Once per outer row	Yes	No	Row-by-row comparisons — use JOIN for large tables
EXISTS	Once per outer row, stops at first match	Yes	Yes	Checking existence when most rows match
IN	Once (subquery), then membership check	Yes for IN, NO for NOT IN	No	Filtering against a small, known list
NOT IN	Once (subquery)	NO — NULL poisons result	No	Avoid — use NOT EXISTS instead
NOT EXISTS	Once per outer row, stops at first match	Yes	Yes	Safe negative existence check

Key takeaways

Correlated subqueries run once per outer row

replace with JOINs on pre-aggregated CTEs for large tables.

EXISTS short-circuits on first match; IN collects all values first

choose based on expected match rate.

NOT IN is broken when the subquery contains any NULL

always use NOT EXISTS for negative existence checks.

Subqueries in FROM (derived tables) execute once and are safe for large result sets

they are effectively pre-aggregated CTEs without the WITH syntax.

Common mistakes to avoid

3 patterns

Using a correlated subquery on a large outer table without checking the execution plan

Symptom

Query runs fine in development with sample data but times out in production — execution time grows quadratically with row count

Fix

Run EXPLAIN ANALYZE and look for loops=N where N equals the outer table row count. Replace the correlated subquery with a JOIN to a pre-aggregated CTE or derived table that computes the inner result once.

Using NOT IN when the subquery column can contain NULL values

Symptom

NOT IN returns zero rows silently — no error, just empty results that look plausible until you cross-reference the data manually

Fix

Replace NOT IN with NOT EXISTS, which handles NULLs correctly. If you must use NOT IN, add WHERE subquery_column IS NOT NULL to explicitly exclude NULLs from the subquery result.

Using a scalar subquery in SELECT to compute a per-row derived value

Symptom

Query is slow at scale — EXPLAIN shows the inner subquery executing once per output row

Fix

Replace with a LEFT JOIN to a pre-aggregated derived table. SELECT c.name, o.cnt FROM customers c LEFT JOIN (SELECT customer_id, COUNT(*) AS cnt FROM orders GROUP BY customer_id) o ON c.id = o.customer_id.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

What is a correlated subquery and why can it be slow?

Q02SENIOR

What is the difference between IN and EXISTS — when would you prefer one...

Q03SENIOR

Why is NOT IN dangerous when the subquery can contain NULL, and how do y...

Q01 of 03SENIOR

What is a correlated subquery and why can it be slow?

ANSWER

A correlated subquery references a column from the outer query, which means it cannot execute independently — it must re-run for each row the outer query processes. If the outer query returns 100,000 rows, the inner query executes 100,000 times. This gives the query O(n×m) complexity where n is the outer row count and m is the cost of the inner query. It looks correct and works fine on small datasets, but becomes catastrophically slow at production scale. The fix is almost always to replace the correlated subquery with a JOIN to a pre-aggregated CTE or derived table that computes the inner result once.

FAQ · 2 QUESTIONS

Frequently Asked Questions

When should I use a CTE instead of a subquery?

Can a subquery return multiple columns?

Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Everything here is grounded in real deployments.

✓ Verified

production tested

May 24, 2026

last updated

1,554

articles · all by Naren

🔥

That's SQL Basics. Mark it forged?

4 min read · try the examples if you haven't