Aggregate functions collapse multiple rows into a single summary value — COUNT, SUM, AVG, MIN, MAX
COUNT(*) counts every row including NULLs; COUNT(column) skips NULLs silently
AVG also skips NULLs — AVG of (10, 20, NULL) returns 15, not 10
GROUP BY groups rows by a column before aggregating — required when mixing aggregate and non-aggregate columns in SELECT
HAVING filters groups after aggregation; WHERE filters rows before aggregation
Biggest mistake: selecting a non-aggregated column without including it in GROUP BY — most databases throw an error
Plain-English First
Imagine you have a spreadsheet with 10,000 sales records. Aggregate functions are the 'Totals' row at the bottom — COUNT tells you how many rows you have, SUM adds them all up, AVG gives you the average, and MIN/MAX find the smallest and largest values. GROUP BY is what lets you get a separate total row for each category, region, or customer rather than just one grand total for everything.
Every real application — whether it's a shopping site showing 'Average rating: 4.7 stars', a bank dashboard showing 'Total spent this month: $1,240', or an HR tool reporting 'Headcount: 342 employees' — uses aggregate functions behind the scenes. They're the bridge between raw row-level data and the summary numbers that drive business decisions.
The rules are simple until NULL gets involved. COUNT(*) and COUNT(column_name) behave completely differently. AVG silently ignores NULL values. GROUP BY requires every non-aggregated column in your SELECT. These aren't obscure edge cases — they're the cause of a surprisingly large percentage of wrong report numbers in production.
What Aggregate Functions Actually Do (And Why They Exist)
A regular SQL query like SELECT name FROM employees returns one row per employee — every single row that matches your WHERE clause. An aggregate function changes that contract entirely. Instead of 'give me all the rows', it says 'give me one summarised value across all the rows.' This is what makes them aggregate — they combine many rows into one.
Five aggregate functions cover the vast majority of real-world analytics: COUNT (how many rows), SUM (total value), AVG (mean value), MIN (smallest value), MAX (largest value). All five share a critical characteristic: they handle NULL values by ignoring them — except COUNT(*), which is the one exception.
aggregate_basics.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
-- Setup: a simple orders tableCREATETABLEorders (
order_id INTPRIMARYKEY,
customer_id INT,
total DECIMAL(8,2),
status VARCHAR(20)
);
INSERTINTO orders VALUES
(1, 101, 49.99, 'completed'),
(2, 102, 120.00, 'completed'),
(3, 101, NULL, 'pending'), -- NULL total: not yet charged
(4, 103, 75.50, 'completed'),
(5, 102, 30.00, 'refunded');
-- Five aggregates on the same columnSELECTCOUNT(*) AS total_rows, -- 5: counts every rowCOUNT(total) AS rows_with_total, -- 4: skips the NULL rowSUM(total) AS revenue, -- 275.49: skips NULLAVG(total) AS avg_order, -- 68.87: 275.49/4, not /5!MIN(total) AS smallest,
MAX(total) AS largest
FROM orders;
AVG Skips NULLs — This Is Almost Never What Stakeholders Expect
AVG(total) on these 5 rows divides by 4, not 5, because NULL is excluded. If the pending order should count as zero revenue, use AVG(COALESCE(total, 0)). If it should genuinely be excluded from the average, the implicit NULL skip is correct — but document it explicitly so the next engineer reading the query understands the intent.
Production Insight
COUNT(*) vs COUNT(column) is the source of countless wrong dashboard numbers in production.
Always run both side by side when debugging a count discrepancy — the difference reveals your NULL row count.
For revenue calculations, always COALESCE NULL numeric columns explicitly rather than relying on implicit skip behavior.
Key Takeaway
Aggregate functions collapse many rows into one summary — that's their entire purpose.
All five functions skip NULLs, except COUNT(*) which counts every row regardless.
Document NULL treatment explicitly in any aggregate query that will be read by business stakeholders.
COUNT, SUM and AVG — The Most-Used Trio
COUNT comes in two forms with very different behavior. COUNT(*) counts every row in the result set, including rows with NULL values in every column. COUNT(column_name) counts only rows where that specific column is not NULL. This distinction bites constantly in production.
SUM adds all non-NULL values. NULL rows contribute nothing. SUM of a column with all NULLs returns NULL, not zero — this surprises developers who expect 0. Use COALESCE(SUM(column), 0) when zero is the correct return for an all-NULL column.
AVG computes the mean of non-NULL values only. It's mathematically equivalent to SUM(column) / COUNT(column) — both numerator and denominator exclude NULLs.
count_sum_avg_queries.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- Demonstrating the COUNT(*) vs COUNT(column) differenceSELECTCOUNT(*) AS all_orders,
COUNT(total) AS charged_orders,
COUNT(*) - COUNT(total) AS pending_not_charged,
SUM(total) AS total_revenue,
COALESCE(SUM(total), 0) AS revenue_safe, -- 0 if all NULLsAVG(total) AS avg_charged, -- excludes NULLsAVG(COALESCE(total, 0)) AS avg_all -- includes NULLs as zeroFROM orders
WHERE status != 'refunded';
-- COUNT DISTINCT: unique customer countSELECTCOUNT(DISTINCT customer_id) AS unique_customers,
COUNT(customer_id) AS total_orders
FROM orders
WHERE status = 'completed';
COUNT(DISTINCT customer_id) gives you unique customers; COUNT(customer_id) gives you total orders. If Alice placed 3 orders, COUNT includes her 3 times but COUNT DISTINCT includes her once. This is the pattern behind 'unique visitors', 'active users', and 'distinct products ordered'.
Production Insight
SUM of an all-NULL column returns NULL, not zero — wrap with COALESCE(SUM(col), 0) for dashboard totals.
COUNT DISTINCT is expensive on large tables — it requires sorting or hashing all values before counting.
For approximate unique counts at scale, consider HyperLogLog (available in PostgreSQL, Redis, BigQuery) which trades 1% accuracy for O(1) memory.
Key Takeaway
COUNT(*) ≠ COUNT(column) — the difference is exactly the number of NULL rows in that column.
SUM returns NULL (not zero) when all input values are NULL — always COALESCE for dashboard totals.
COUNT DISTINCT is the right tool for unique user/product counts but is expensive — index the column.
MIN and MAX — Finding the Extremes in Your Data
MIN and MAX are the simplest aggregate functions to understand, and also the most underestimated. They don't just work on numbers — they work on dates, times, and text too. MIN on a date column gives the earliest date. MAX on a VARCHAR column gives the last value alphabetically. This makes them useful for pattern like 'first order date', 'most recent login', and 'last employee hired'.
min_max_all_five_aggregates.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
-- MIN/MAX on dates: first and most recent order per customerSELECT
customer_id,
COUNT(*) AS order_count,
MIN(created_at) AS first_order_date,
MAX(created_at) AS most_recent_order,
SUM(total) AS lifetime_value,
MAX(total) AS largest_single_order
FROM orders
WHERE status = 'completed'GROUPBY customer_id
ORDERBY lifetime_value DESC;
Like all aggregate functions, MIN and MAX skip NULL values. If a column has 4 values and one NULL, MIN and MAX operate on the 4 non-NULL values. This is usually correct behavior for date columns (NULLs typically mean 'not set yet') but worth being explicit about in documentation.
Production Insight
MIN(created_at) per customer_id is the standard pattern for 'acquisition date' in cohort analysis.
MAX(updated_at) identifies which rows were most recently modified — useful for incremental ETL pipelines.
For the single row with the minimum or maximum value (not just the value itself), use ORDER BY + LIMIT 1 or a window function — MIN/MAX alone can't return the full row.
Key Takeaway
MIN and MAX work on dates and text, not just numbers — first_order_date and last_login are MIN/MAX patterns.
Neither function returns the full row — use ORDER BY + LIMIT 1 or ROW_NUMBER() when you need the whole record.
Both skip NULLs — document explicitly if NULL means something specific in your column.
GROUP BY — Where Aggregate Functions Become Truly Powerful
Everything so far has given us a single row summarising the whole table. GROUP BY changes that: instead of one grand total, it calculates a separate aggregate for each unique value in the grouping column. This is what produces 'revenue by category', 'orders per customer', 'signups per month'.
The rule is strict: every column in SELECT must either be in GROUP BY or wrapped in an aggregate function. You cannot SELECT a column that isn't in GROUP BY and isn't aggregated — it's undefined which row's value should appear for the group.
group_by_with_aggregates.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- Revenue and order count by statusSELECT
status,
COUNT(*) AS order_count,
SUM(total) AS total_revenue,
AVG(total) AS avg_order_value,
MIN(total) AS smallest_order,
MAX(total) AS largest_order
FROM orders
GROUPBY status
ORDERBY total_revenue DESC;
-- HAVING: only show statuses with 2 or more ordersSELECT
status,
COUNT(*) AS order_count,
SUM(total) AS revenue
FROM orders
GROUPBY status
HAVINGCOUNT(*) >= 2-- filter AFTER aggregationORDERBY revenue DESC;
Output
status | order_count | total_revenue | avg_order_value
WHERE runs before GROUP BY — it removes individual rows. HAVING runs after GROUP BY — it removes entire groups. You cannot use aggregate functions in WHERE. HAVING COUNT() > 5 is valid; WHERE COUNT() > 5 throws an error. Use both in the same query when you need to: WHERE to pre-filter rows, HAVING to post-filter groups.
Production Insight
GROUP BY without an ORDER BY returns groups in undefined order — always add ORDER BY for consistent report output.
Grouping by date truncated to a period (DATE_TRUNC('month', created_at)) is the standard pattern for monthly trend reports.
For large tables, GROUP BY on an unindexed column forces a hash aggregate or sort — add an index on the GROUP BY column for repeated report queries.
Key Takeaway
GROUP BY produces one aggregate row per unique value — the foundation of every summary report.
Every SELECT column must be in GROUP BY or an aggregate function — this is a hard rule, not a style preference.
WHERE filters rows before grouping; HAVING filters groups after aggregation — they serve different purposes.
● Production incidentPOST-MORTEMseverity: high
Customer Lifetime Value Report Was 23% Higher Than Reality
Symptom
Monthly business review showed customer LTV 23% above the finance team's number. Both teams were querying the same database. No data discrepancy — just different queries.
Assumption
The analytics team used AVG(lifetime_value) and assumed NULL lifetime_value meant the customer had zero value. Finance used SUM(lifetime_value) / COUNT(*) which treated NULLs as zero in the denominator.
Root cause
AVG(lifetime_value) skips NULL rows entirely — it only averages the rows where a value exists. Customers who had never completed a purchase had NULL lifetime_value. The analytics query divided by the number of paying customers only; finance divided by all customers including nulls.
Fix
Changed to AVG(COALESCE(lifetime_value, 0)) to treat non-purchasing customers as zero value. Added a comment explaining the NULL treatment so the query is unambiguous to future readers.
Key lesson
AVG skips NULLs — this is mathematically valid but rarely what business stakeholders expect
Explicitly COALESCE NULL numeric columns before aggregating when zero is the intended value
When two queries on the same data return different aggregates, NULL treatment is the first thing to check
Production debug guideDiagnosing wrong counts, inflated averages, and GROUP BY errors4 entries
Symptom · 01
COUNT returns fewer rows than expected
→
Fix
You are using COUNT(column_name) on a column with NULLs. Switch to COUNT() to count all rows, or run SELECT COUNT(), COUNT(column_name) side by side to see the gap. The difference is your NULL row count.
Symptom · 02
AVG is higher than expected
→
Fix
NULLs are being excluded from the average. Run SELECT COUNT(), COUNT(column_name), SUM(column_name) to verify. If COUNT() != COUNT(column_name), you have NULLs. Use AVG(COALESCE(column_name, 0)) if zero is the correct treatment.
Symptom · 03
Error: column must appear in GROUP BY or aggregate function
→
Fix
You have a non-aggregated column in SELECT that isn't in GROUP BY. Either add it to GROUP BY, wrap it in an aggregate (MAX, MIN), or remove it from SELECT. PostgreSQL and SQL Server are strict about this; MySQL's ONLY_FULL_GROUP_BY mode controls it.
Symptom · 04
HAVING filter isn't working as expected
→
Fix
Check if you meant WHERE. HAVING filters groups after aggregation — it can use aggregate conditions like HAVING COUNT(*) > 5. If you're filtering on a non-aggregated column, use WHERE before GROUP BY.
Function
What It Computes
NULL Behavior
Works On
COUNT(*)
Total row count
Includes NULLs
Any table
COUNT(col)
Non-NULL row count
Skips NULLs
Any column
SUM(col)
Total of all values
Skips NULLs; returns NULL if all NULL
Numeric
AVG(col)
Mean of all values
Skips NULLs — divides by non-NULL count
Numeric
MIN(col)
Smallest value
Skips NULLs
Numeric, date, text
MAX(col)
Largest value
Skips NULLs
Numeric, date, text
Key takeaways
1
COUNT(*) counts every row including NULLs; COUNT(column) skips NULLs
the difference equals your NULL row count.
2
All aggregate functions except COUNT(*) skip NULL values
document NULL treatment explicitly for any business metric.
3
AVG skips NULLs in the denominator
use AVG(COALESCE(col, 0)) when NULL should mean zero.
4
GROUP BY requires every non-aggregated SELECT column to be in the GROUP BY clause
this is a hard rule.
5
WHERE filters rows before GROUP BY; HAVING filters groups after
aggregate functions belong in HAVING, not WHERE.
Common mistakes to avoid
3 patterns
×
Using COUNT(column) when COUNT(*) is intended
Symptom
Row count is lower than expected — the discrepancy equals the number of NULL rows in that column, but no error is thrown
Fix
Use COUNT(*) to count all rows regardless of NULLs. Only use COUNT(column_name) when you explicitly want to count non-NULL entries in that specific column. Run both side by side to expose the difference.
×
Selecting a non-aggregated column without adding it to GROUP BY
Symptom
Error: column must appear in GROUP BY clause or be used in an aggregate function (PostgreSQL) or column is not in SELECT list (MySQL in strict mode)
Fix
Every column in SELECT must either appear in GROUP BY or be wrapped in an aggregate function (MAX, MIN, COUNT, SUM, AVG). Add the column to GROUP BY if you want one row per unique combination, or remove it from SELECT if you don't need it.
×
Using WHERE to filter aggregate results instead of HAVING
Symptom
Error: aggregate functions are not allowed in WHERE — WHERE SUM(amount) > 1000 throws an error immediately
Fix
Use HAVING to filter on aggregate results: HAVING SUM(amount) > 1000. WHERE filters individual rows before grouping; HAVING filters groups after aggregation. Both can appear in the same query.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01JUNIOR
What is the difference between COUNT(*) and COUNT(column_name)?
Q02JUNIOR
What is the difference between WHERE and HAVING? Why can't you use WHERE...
Q03SENIOR
A column contains five values — 10, 20, NULL, 30, NULL. What does AVG re...
Q01 of 03JUNIOR
What is the difference between COUNT(*) and COUNT(column_name)?
ANSWER
COUNT() counts every row in the result set, including rows where every column is NULL. COUNT(column_name) counts only rows where that specific column has a non-NULL value — NULL rows are silently skipped. If a table has 100 rows and 20 have NULL in the email column, COUNT() returns 100 and COUNT(email) returns 80. The difference tells you the NULL count: COUNT() - COUNT(email) = 20. In practice, use COUNT() when you want row counts and COUNT(column) only when you explicitly want to exclude NULLs from the count.
Q02 of 03JUNIOR
What is the difference between WHERE and HAVING? Why can't you use WHERE to filter aggregate results?
ANSWER
WHERE filters individual rows before GROUP BY runs — it operates on raw row data. HAVING filters groups after GROUP BY and aggregation — it operates on aggregate results. SQL executes in logical order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. When WHERE runs, GROUP BY hasn't happened yet, so aggregate values don't exist — this is why WHERE SUM(amount) > 100 throws an error. HAVING runs after GROUP BY, so HAVING SUM(amount) > 100 works correctly. You can use both in the same query: WHERE filters rows before grouping, HAVING filters groups after.
Q03 of 03SENIOR
A column contains five values — 10, 20, NULL, 30, NULL. What does AVG return?
ANSWER
AVG returns 20, not 12. AVG skips NULL values and computes the mean of non-NULL values only: (10 + 20 + 30) / 3 = 20. It does not treat NULL as zero. If zero is the intended value for NULL rows, use AVG(COALESCE(column, 0)) which returns (10 + 20 + 0 + 30 + 0) / 5 = 12. This distinction matters enormously for business metrics — AVG customer lifetime value, average order value, and average response times can all be significantly inflated if NULL rows are silently excluded rather than counted as zero.
01
What is the difference between COUNT(*) and COUNT(column_name)?
JUNIOR
02
What is the difference between WHERE and HAVING? Why can't you use WHERE to filter aggregate results?
JUNIOR
03
A column contains five values — 10, 20, NULL, 30, NULL. What does AVG return?
SENIOR
FAQ · 3 QUESTIONS
Frequently Asked Questions
01
Can I use multiple aggregate functions in the same SELECT?
Yes. SELECT COUNT(*), SUM(total), AVG(total), MIN(total), MAX(total) FROM orders is perfectly valid. All five aggregate functions run over the same set of rows and return their respective summary values in a single result row.
Was this helpful?
02
What does SUM return if all values in the column are NULL?
SUM returns NULL, not zero. This catches developers off guard when building dashboards — an empty period returns NULL rather than 0. Wrap with COALESCE: COALESCE(SUM(column), 0) to return zero when no non-NULL rows exist.
Was this helpful?
03
How do I count only rows that match a condition within an aggregate?
Use COUNT with a CASE expression: COUNT(CASE WHEN status = 'completed' THEN 1 END) counts only completed orders. This is the conditional count pattern — also works with SUM: SUM(CASE WHEN status = 'completed' THEN total ELSE 0 END). Some databases support COUNT(column) FILTER (WHERE condition) syntax as a cleaner alternative.