SQL GROUP BY and HAVING Explained — Grouping, Filtering and Real-World Patterns
Every real application that touches a database eventually needs to answer questions like 'which customers spent the most last month?' or 'which product categories have fewer than ten sales?' These aren't questions about individual rows — they're questions about groups of rows. That distinction is exactly what GROUP BY and HAVING are built for, and understanding them deeply separates developers who can write SQL from developers who can write SQL well.
The problem GROUP BY solves is aggregation: collapsing many rows into summarised results. Without it, you'd have to pull every row into your application layer and crunch the numbers in code — which is slow, memory-hungry, and frankly the database's job. HAVING then solves a second problem that naturally emerges: once you've grouped and aggregated, you need a way to filter those groups. WHERE can't do it because WHERE runs before aggregation happens. HAVING runs after, so it can see the computed totals and counts that WHERE never gets to touch.
By the end of this article you'll understand exactly when to reach for GROUP BY versus a simple WHERE clause, why HAVING exists as a separate keyword rather than just extending WHERE, how to combine them with multiple aggregate functions in a single query, and the specific mistakes that silently return wrong answers instead of errors. You'll also walk away with three interview questions that trip up a surprising number of candidates.
GROUP BY: Collapsing Rows Into Meaningful Summaries
GROUP BY tells the database to treat all rows that share the same value in a column as a single unit, then apply an aggregate function — SUM, COUNT, AVG, MAX, MIN — across that unit. The result set has one row per unique group, not one row per original record.
This is critical to internalise: after a GROUP BY, every column in your SELECT must either be the grouped column itself or wrapped in an aggregate function. If you select a non-grouped, non-aggregated column the database doesn't know which of the many original rows' values to display. Some databases (MySQL with loose mode) will silently pick a random row's value. PostgreSQL and SQL Server will flat-out error. Neither outcome is what you want.
The mental model that helps most: picture the database first sorting all rows by the GROUP BY column, then drawing a horizontal line between each new value, then running your aggregate function on each block between the lines. You only ever see the aggregated result per block — the individual rows are gone from view.
-- Sample table: orders -- Columns: order_id, customer_id, category, amount, order_date -- ── BASIC GROUP BY ────────────────────────────────────────────── -- Goal: find total revenue and number of orders per product category SELECT category, -- the grouping column — one row per unique value COUNT(order_id) AS order_count, -- how many orders fell into this group SUM(amount) AS total_revenue,-- aggregate: sum of all amounts in the group ROUND(AVG(amount), 2) AS avg_order_value -- aggregate: mean order size FROM orders GROUP BY category -- collapse every category into one summary row ORDER BY total_revenue DESC; -- show highest-revenue category first -- ── GROUPING BY MULTIPLE COLUMNS ──────────────────────────────── -- Goal: break revenue down by category AND by the year of the order -- This gives one row per (category, year) combination SELECT category, EXTRACT(YEAR FROM order_date) AS order_year, -- derive year from the date column COUNT(order_id) AS order_count, SUM(amount) AS total_revenue FROM orders GROUP BY category, -- group first by category... EXTRACT(YEAR FROM order_date) -- ...then by year within each category ORDER BY order_year, total_revenue DESC;
category | order_count | total_revenue | avg_order_value
--------------+-------------+---------------+----------------
Electronics | 142 | 84350.00 | 594.01
Furniture | 89 | 61200.00 | 687.64
Clothing | 310 | 31450.00 | 101.45
Groceries | 503 | 18900.00 | 37.57
-- Result of second query (multi-column GROUP BY):
category | order_year | order_count | total_revenue
------------+------------+-------------+--------------
Clothing | 2023 | 158 | 15800.00
Electronics | 2023 | 71 | 41200.00
Furniture | 2023 | 44 | 30100.00
Clothing | 2024 | 152 | 15650.00
Electronics | 2024 | 71 | 43150.00
Furniture | 2024 | 45 | 31100.00
HAVING: Filtering Groups After Aggregation (Not Before)
Here's the question that unlocks HAVING: once you've run GROUP BY and computed SUM(amount) per category, how do you show only categories where that total exceeds $50,000? You can't use WHERE — by the time WHERE runs, the aggregation hasn't happened yet. WHERE is a row-level filter applied to the raw table. HAVING is a group-level filter applied to the aggregated output.
Think of the SQL execution order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. WHERE filters raw rows before they enter the grouping step. HAVING filters the grouped rows after aggregation is complete. This is why you can reference SUM(amount) in HAVING but not in WHERE.
HAVING is also where you enforce data quality thresholds. 'Show me customers, but only if they've placed at least 5 orders' is a HAVING condition — COUNT(order_id) >= 5. It's common to combine both WHERE and HAVING in one query: WHERE removes irrelevant raw rows early (which improves performance by feeding fewer rows into the grouping step), and HAVING then prunes the resulting groups.
-- ── HAVING: filter groups by their aggregate value ────────────── -- Goal: find product categories that have generated over $50,000 -- and had more than 100 individual orders SELECT category, COUNT(order_id) AS order_count, SUM(amount) AS total_revenue FROM orders GROUP BY category HAVING SUM(amount) > 50000 -- only keep groups where the SUM clears this bar AND COUNT(order_id) > 100 -- AND the group contains more than 100 orders ORDER BY total_revenue DESC; -- ── WHERE + GROUP BY + HAVING together ────────────────────────── -- Goal: same report, but exclude any orders placed before 2023 -- (WHERE trims raw rows BEFORE grouping — faster than filtering in HAVING) SELECT category, COUNT(order_id) AS order_count, SUM(amount) AS total_revenue FROM orders WHERE order_date >= '2023-01-01' -- row-level filter: remove old orders FIRST GROUP BY category HAVING SUM(amount) > 50000 -- group-level filter: applied AFTER aggregation AND COUNT(order_id) > 100 ORDER BY total_revenue DESC; -- ── HAVING without GROUP BY (edge case worth knowing) ──────────── -- HAVING can technically apply to the entire table as one group. -- This returns the row only if the whole orders table has > 1000 rows. -- Rarely useful, but interviewers love asking if it's valid SQL. SELECT COUNT(*) AS total_orders FROM orders HAVING COUNT(*) > 1000;
category | order_count | total_revenue
------------+-------------+--------------
Electronics | 142 | 84350.00
Furniture | 89 | 61200.00
-- (Clothing had 310 orders but only $31,450 revenue — fails the $50k HAVING check)
-- (Groceries fails both conditions)
-- Result of WHERE + GROUP BY + HAVING query (2023 data only):
category | order_count | total_revenue
------------+-------------+--------------
Electronics | 142 | 84350.00
Furniture | 89 | 61200.00
-- Result of HAVING without GROUP BY:
total_orders
------------
1044
-- (Returns the count row because 1044 > 1000 — if < 1000, returns empty set)
Real-World Patterns: How GROUP BY and HAVING Are Actually Used
In production code these clauses rarely appear in isolation. Here are three patterns you'll encounter constantly and should be able to write from memory.
The first is the top-N per group pattern: find the most active customers, highest-revenue regions, or most common error codes. You GROUP BY the entity, aggregate a metric, HAVING optionally filters noise, then ORDER BY the metric with LIMIT to get your top results.
The second is the data quality audit pattern: find anomalies by grouping records that should be unique and counting how many duplicates exist. A COUNT(*) > 1 in HAVING is the classic duplicate-detection query every data engineer writes at some point.
The third is the cohort threshold pattern common in analytics: 'show me all users who took action X at least N times.' This is GROUP BY user_id, HAVING COUNT(*) >= N — the backbone of retention and engagement reports. Once you recognise these shapes, you stop treating GROUP BY and HAVING as separate features and start seeing them as a single analytical toolset.
-- ── PATTERN 1: Top-N customers by total spend ─────────────────── -- Goal: find the 5 highest-spending customers in 2024 SELECT customer_id, COUNT(order_id) AS total_orders, SUM(amount) AS lifetime_spend FROM orders WHERE order_date >= '2024-01-01' -- narrow to 2024 first (WHERE = pre-filter) GROUP BY customer_id HAVING SUM(amount) > 0 -- exclude any customer with $0 net (edge case) ORDER BY lifetime_spend DESC LIMIT 5; -- take only the top 5 after sorting -- ── PATTERN 2: Duplicate detection ────────────────────────────── -- Goal: find any email addresses that appear more than once in the -- customers table — a sign of duplicate registrations SELECT email, COUNT(*) AS registration_count -- count how many rows share this email FROM customers GROUP BY email HAVING COUNT(*) > 1 -- HAVING filters to only the duplicates ORDER BY registration_count DESC; -- worst offenders first -- ── PATTERN 3: Engagement cohort — users who placed 3+ orders ─── -- Goal: identify loyal repeat customers for a marketing campaign SELECT customer_id, COUNT(order_id) AS order_count, MIN(order_date) AS first_order_date, -- when they became a customer MAX(order_date) AS latest_order_date, -- most recent activity SUM(amount) AS total_spend FROM orders GROUP BY customer_id HAVING COUNT(order_id) >= 3 -- only customers with 3 or more orders ORDER BY total_spend DESC;
customer_id | total_orders | lifetime_spend
------------+--------------+---------------
1042 | 17 | 4830.00
891 | 12 | 4215.50
2201 | 9 | 3990.00
334 | 22 | 3740.25
1788 | 8 | 3600.00
-- Pattern 2 result (duplicate emails):
email | registration_count
--------------------------+-------------------
john.doe@example.com | 3
test.user@example.com | 2
-- Pattern 3 result (loyal customers, 3+ orders):
customer_id | order_count | first_order_date | latest_order_date | total_spend
------------+-------------+------------------+-------------------+------------
1042 | 17 | 2022-03-14 | 2024-11-01 | 12840.00
334 | 22 | 2021-07-22 | 2024-10-28 | 9310.50
891 | 12 | 2023-01-05 | 2024-09-15 | 8450.00
Common Mistakes That Silently Break Your Queries
Most GROUP BY and HAVING bugs don't throw errors — they return wrong results that look plausible, which makes them especially dangerous. These are the ones that catch experienced developers off guard, not just beginners.
The first is using WHERE where you need HAVING, or vice versa. It sounds obvious once you know the execution order, but under deadline pressure it's easy to write WHERE SUM(amount) > 50000 and then be confused when the database throws an error about aggregate functions not being allowed in WHERE.
The second is including a column in SELECT that isn't in GROUP BY and isn't aggregated — discussed earlier, but worth re-emphasising because MySQL's lenient default mode will run the query and silently return a non-deterministic value from one of the grouped rows.
The third is a subtler one: filtering on a column alias in HAVING. Because HAVING runs after SELECT in the logical execution order (but before the alias is fully materialised in most engines), referencing an alias defined in SELECT often fails. You need to repeat the expression.
-- ── MISTAKE 1: Using WHERE on an aggregate ─────────────────────── -- WRONG — this throws: "aggregate functions are not allowed in WHERE" SELECT category, SUM(amount) AS total_revenue FROM orders WHERE SUM(amount) > 50000 -- ❌ WHERE runs before aggregation — SUM doesn't exist yet GROUP BY category; -- FIXED — move the aggregate condition to HAVING SELECT category, SUM(amount) AS total_revenue FROM orders GROUP BY category HAVING SUM(amount) > 50000; -- ✅ HAVING runs after aggregation — SUM exists here -- ── MISTAKE 2: Referencing a SELECT alias in HAVING ────────────── -- WRONG — most databases can't resolve the alias 'total_revenue' in HAVING -- (PostgreSQL and SQL Server will error; MySQL may accept it as a non-standard extension) SELECT category, SUM(amount) AS total_revenue FROM orders GROUP BY category HAVING total_revenue > 50000; -- ❌ alias not guaranteed to be visible in HAVING -- FIXED — repeat the full aggregate expression in HAVING SELECT category, SUM(amount) AS total_revenue FROM orders GROUP BY category HAVING SUM(amount) > 50000; -- ✅ reference the expression, not the alias -- ── MISTAKE 3: Forgetting that COUNT(*) includes NULLs ─────────── -- COUNT(*) counts all rows in the group, including rows with NULL values. -- COUNT(column_name) counts only non-NULL values in that column. -- These can give different numbers — and the difference matters. SELECT category, COUNT(*) AS all_rows, -- counts every row, NULLs included COUNT(amount) AS non_null_amounts -- counts only rows where amount IS NOT NULL FROM orders GROUP BY category; -- If your 'amount' column has NULLs, these two columns will differ. -- Use COUNT(column_name) when you specifically want to exclude NULLs from your count.
-- ERROR: aggregate functions are not allowed in WHERE
-- Mistake 1 (fixed): OK
category | total_revenue
------------+--------------
Electronics | 84350.00
Furniture | 61200.00
-- Mistake 2 (wrong version): ERROR in PostgreSQL/SQL Server
-- ERROR: column 'total_revenue' does not exist
-- (MySQL may succeed due to its alias-in-HAVING extension — dangerous portability trap)
-- Mistake 3 output:
category | all_rows | non_null_amounts
------------+----------+-----------------
Electronics | 142 | 138 -- 4 orders had NULL amount
Furniture | 89 | 89 -- no NULLs here
Clothing | 310 | 305 -- 5 NULLs
| Aspect | WHERE | HAVING |
|---|---|---|
| When it runs | Before GROUP BY — filters raw rows | After GROUP BY — filters aggregated groups |
| What it can filter on | Any column value in the raw table | Aggregate results (SUM, COUNT, AVG, etc.) or grouped columns |
| Can use aggregate functions? | No — causes a syntax error | Yes — that's its entire purpose |
| Performance impact | Reduces rows fed into grouping — faster | Runs after aggregation — no early row reduction benefit |
| Use case example | WHERE order_date >= '2024-01-01' | HAVING SUM(amount) > 50000 |
| Works without GROUP BY? | Yes — filters all rows normally | Yes, but treats entire table as one group (rare / edge case) |
| Can they coexist in one query? | Yes — WHERE fires first, then GROUP BY, then HAVING | Yes — the typical production pattern uses both together |
🎯 Key Takeaways
- GROUP BY collapses many rows into one summary row per unique group value — after that point, every SELECT column must be the group key or an aggregate function, with no exceptions.
- SQL execution order is FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY — WHERE never sees aggregate results; HAVING never sees raw pre-grouped rows. Knowing this order makes every GROUP BY / HAVING decision obvious.
- HAVING and WHERE can and should coexist: use WHERE to strip irrelevant raw rows before grouping (cheaper), then use HAVING to filter the resulting groups by their aggregate values (necessary for conditions WHERE can't reach).
- The duplicate-detection pattern — SELECT col, COUNT() FROM table GROUP BY col HAVING COUNT() > 1 — is one of the most practically useful SQL patterns in production data work and one of the most commonly asked interview queries.
⚠ Common Mistakes to Avoid
- ✕Mistake 1: Putting an aggregate function in WHERE — e.g. WHERE SUM(amount) > 1000 — causes 'aggregate functions are not allowed in WHERE' because WHERE executes before the aggregation step. Fix: move any condition involving SUM, COUNT, AVG, MAX or MIN to the HAVING clause.
- ✕Mistake 2: Selecting a non-grouped, non-aggregated column — e.g. SELECT customer_name, category, SUM(amount) ... GROUP BY category omitting customer_name — PostgreSQL and SQL Server error immediately; MySQL silently picks an arbitrary customer_name from the group, returning non-deterministic results that look valid. Fix: every column in SELECT must appear in GROUP BY or be wrapped in an aggregate function.
- ✕Mistake 3: Using COUNT() when you mean COUNT(specific_column) — COUNT() counts every row including NULLs, so if the target column has NULL values your count is inflated. Fix: use COUNT(column_name) when nulls should be excluded, and document the choice with an inline comment so future maintainers know it was intentional.
Interview Questions on This Topic
- QWhat's the difference between WHERE and HAVING, and can you use them together in the same query? Walk me through the execution order.
- QWrite a query to find all email addresses that appear more than once in a users table — then explain why you can't solve this with WHERE instead of HAVING.
- QIf I write HAVING category = 'Electronics' instead of WHERE category = 'Electronics', do I get the same result? If yes, why would you still prefer WHERE?
Frequently Asked Questions
Can I use HAVING without GROUP BY in SQL?
Yes, technically. Without GROUP BY, the entire table is treated as one single group, and HAVING applies to it as a whole. The query SELECT COUNT() FROM orders HAVING COUNT() > 1000 is valid SQL — it returns the count if the table has over 1000 rows, or an empty result set if not. It's a valid edge case but rarely the right tool — a simple WHERE or a subquery is usually clearer.
Why can't I use a column alias defined in SELECT inside my HAVING clause?
Because of SQL's logical execution order: HAVING is evaluated before the SELECT list is fully resolved into its final aliases. Standard SQL requires you to repeat the full aggregate expression in HAVING (e.g. HAVING SUM(amount) > 1000, not HAVING total_revenue > 1000). MySQL allows the alias as a non-standard extension, but it's not portable to PostgreSQL, SQL Server, or Oracle.
What's the performance difference between filtering with WHERE versus HAVING?
WHERE is faster for non-aggregate conditions because it eliminates rows before the GROUP BY step even begins — meaning the database groups fewer rows, which is less work. HAVING runs after the full grouping and aggregation is complete, so it doesn't reduce the cost of the aggregation itself. As a rule: if the condition doesn't involve an aggregate function, put it in WHERE, not HAVING.
Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.