Senior 8 min · March 06, 2026

SQL Table Partitioning — Function on Key Disables Pruning

A function on the partition key column in WHERE disables pruning, scanning all 800M rows instead of expected 2M.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • Partitioning splits one table into physical segments based on a rule — date, region, category, or hash
  • Partition pruning is the core benefit: the planner skips partitions that cannot contain matching rows
  • Four strategies: range (dates), list (categories), hash (even distribution), composite (range + sub-hash or list)
  • Dropping old data becomes a metadata operation (DROP PARTITION) instead of a row-by-row DELETE
  • Partition key must appear in WHERE clause for pruning — wrapping it in a function kills pruning silently
  • Biggest mistake: partitioning on a column not used in queries — adds overhead with zero pruning benefit
Plain-English First

Imagine a massive filing cabinet with 10 million folders — finding one folder means searching every drawer. Now imagine splitting those folders into 12 labelled drawers, one per month. You instantly know which drawer to open. SQL table partitioning does exactly that: it physically divides one enormous table into smaller, manageable chunks (called partitions) based on a rule you define — like date, region, or category. The table still looks like one table to your application, but the database engine is quietly routing queries to only the relevant chunk.

At some point in every database engineer's career, a query that used to run in 200ms starts taking 45 seconds. The table hasn't changed structurally — it's just grown from 2 million rows to 800 million. Indexes help, but even a perfectly tuned B-tree index on 800 million rows requires significant I/O just to traverse the tree and resolve heap pages. You start questioning your index design, your query shape, your statistics configuration — and most of the time the real answer is that you've hit the ceiling of what single-segment storage can do efficiently.

This is where table partitioning stops being an academic concept and becomes an operational lifeline.

The core mechanism partitioning exploits is called partition pruning: instead of scanning or even index-seeking across the full dataset, the query planner eliminates entire physical segments it knows cannot contain rows matching your WHERE clause. A query filtering on the last 30 days never touches the 7 years of historical data sitting in other partitions — those segments are excluded at plan time, before a single page is read from disk. Beyond query performance, partitioning unlocks fast bulk operations. Dropping a year's worth of old data becomes a near-instant metadata operation (DROP PARTITION) instead of a DELETE that locks the table for hours, generates hundreds of gigabytes of WAL, and leaves you running VACUUM for days.

I've seen teams partition a table and wonder why nothing improved — usually because they partitioned on the wrong column, or because a single function call in the WHERE clause was silently disabling pruning on every query. I've also seen partitioning done right turn a 45-second dashboard query into a 180ms one without touching a single index or rewriting application logic.

By the end of this article you'll understand how each partitioning strategy works at the storage level, write production-ready DDL for range, list, hash, and composite schemes, diagnose when partition pruning is silently failing, manage partition maintenance without downtime, and avoid the three mistakes that turn partitioning from a performance win into a support nightmare.

Partitioning Strategies — Range, List, Hash, and Composite

SQL offers four partitioning strategies, each designed for a different access pattern. Choosing the wrong strategy is the most common partitioning mistake — it adds storage and planning overhead with zero pruning benefit on your actual queries.

Range partitioning splits data by a continuous value — typically a date or timestamp. Each partition covers a defined interval (e.g., January 2026, February 2026). This is the most common strategy for time-series data, event logs, and financial transactions. Queries filtering on date ranges prune efficiently because the planner can compare your WHERE clause bounds against partition boundaries directly at plan time — no row evaluation needed.

List partitioning splits data by discrete values — a region code, status, or category. Each partition contains rows matching a specific set of known values (e.g., region IN ('US', 'CA')). This works well when your queries filter on a fixed enumeration of categories and you want each category group to live in its own physical segment for both pruning and operational isolation.

Hash partitioning distributes rows evenly across N partitions using a hash function applied to the partition key. There is no semantic meaning to which partition a row lands in — the goal is purely even distribution. Hash partitioning is the right choice when you have no natural range or list dimension but need to reduce per-partition size for maintenance operations like index rebuilds and vacuum. It provides no pruning benefit for range queries.

Composite partitioning combines two strategies — typically range at the first level and hash (or list) at the second level. For example: partition by month (range), then sub-partition each month into 4 hash buckets. This gives you time-based pruning at the top level and even distribution within each time window at the second level. It's the right call when pure range partitioning creates hot partitions because recent months concentrate all writes.

partitioning_strategies.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
-- ============================================================
-- RANGE PARTITIONING: split by date — most common for time-series
-- Each partition holds one month of data
-- ============================================================
CREATE TABLE orders (
    order_id     BIGINT NOT NULL,
    customer_id  INT NOT NULL,
    total_amount DECIMAL(10, 2),
    created_at   TIMESTAMP NOT NULL,
    status       VARCHAR(20)
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE orders_2026_01 PARTITION OF orders
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE orders_2026_02 PARTITION OF orders
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
CREATE TABLE orders_2026_03 PARTITION OF orders
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');
-- ... continue for each month

-- Catch-all partition for values outside defined ranges
-- Without this, out-of-range inserts fail with a hard error
CREATE TABLE orders_default PARTITION OF orders DEFAULT;

-- ============================================================
-- LIST PARTITIONING: split by discrete category values
-- Each partition holds a specific region's data
-- ============================================================
CREATE TABLE transactions (
    txn_id      BIGINT NOT NULL,
    amount      DECIMAL(12, 2),
    region      VARCHAR(10) NOT NULL,
    txn_date    DATE
) PARTITION BY LIST (region);

CREATE TABLE txn_north_america PARTITION OF transactions
    FOR VALUES IN ('US', 'CA', 'MX');
CREATE TABLE txn_europe PARTITION OF transactions
    FOR VALUES IN ('UK', 'DE', 'FR', 'ES', 'IT');
CREATE TABLE txn_asia_pacific PARTITION OF transactions
    FOR VALUES IN ('JP', 'KR', 'AU', 'IN', 'SG');
CREATE TABLE txn_default PARTITION OF transactions DEFAULT;

-- ============================================================
-- HASH PARTITIONING: even distribution — no semantic meaning
-- Use when you have no natural range or list dimension
-- ============================================================
CREATE TABLE user_events (
    event_id    BIGINT NOT NULL,
    user_id     INT NOT NULL,
    event_type  VARCHAR(50),
    event_time  TIMESTAMP
) PARTITION BY HASH (user_id);

CREATE TABLE user_events_p0 PARTITION OF user_events
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE user_events_p1 PARTITION OF user_events
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE user_events_p2 PARTITION OF user_events
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE user_events_p3 PARTITION OF user_events
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- ============================================================
-- COMPOSITE PARTITIONING: range (monthly) + hash sub-partitions
-- Time-based pruning at level 1, even distribution at level 2
-- Solves hot-partition problem when recent months are write-heavy
-- ============================================================
CREATE TABLE audit_log (
    log_id      BIGINT NOT NULL,
    actor_id    INT NOT NULL,
    action      VARCHAR(100),
    log_time    TIMESTAMP NOT NULL
) PARTITION BY RANGE (log_time);

-- Monthly partition, itself partitioned by hash on actor_id
CREATE TABLE audit_log_2026_01 PARTITION OF audit_log
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01')
    PARTITION BY HASH (actor_id);

CREATE TABLE audit_log_2026_01_p0 PARTITION OF audit_log_2026_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE audit_log_2026_01_p1 PARTITION OF audit_log_2026_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE audit_log_2026_01_p2 PARTITION OF audit_log_2026_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE audit_log_2026_01_p3 PARTITION OF audit_log_2026_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);
Output
-- Range partitioning: queries on created_at prune to one monthly partition
-- List partitioning: queries on region prune to one partition group
-- Hash partitioning: inserts distribute evenly; equality queries on user_id hit one partition
-- Composite: time-range queries prune to one month, then hash distributes within that month
How Partition Pruning Works
  • Range: WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01' prunes to exactly one monthly partition
  • List: WHERE region = 'US' prunes to the North America partition group — all other regions are excluded
  • Hash: WHERE user_id = 12345 prunes to exactly one hash bucket — the planner computes hash(12345) % modulus at plan time
  • Pruning happens at plan time — zero I/O for skipped partitions, not just filtered rows
  • Any function applied to the partition key column in WHERE prevents the planner from comparing against boundaries — pruning is disabled entirely
Production Insight
I reviewed a schema where a team had partitioned their events table by HASH on a timestamp column. Their reasoning was that the table was large and needed to be split up. What they got was even distribution across 8 partitions and zero pruning on any of their date-filtered queries — every query scanned all 8 partitions regardless of the time range. Partitioning on the wrong dimension is worse than not partitioning at all because you pay the planner overhead without getting the pruning benefit. The fix was rebuilding with RANGE on the timestamp. Match the strategy to the column your queries actually filter on.
Key Takeaway
Range for dates, list for categories, hash for even distribution, composite for both. The strategy must match your dominant WHERE clause filter — otherwise pruning never activates and you have added complexity with no performance return. Run EXPLAIN on your top 20 queries before committing to any partitioning scheme and verify that pruning fires on the majority of them.
Partitioning Strategy Selection
IfQueries filter primarily by date or timestamp range
UseUse RANGE partitioning — the planner prunes to the exact matching time window using boundary comparison
IfQueries filter by a fixed set of discrete values (region, status, category)
UseUse LIST partitioning — one partition per value group, direct equality pruning
IfNo natural range or list dimension, but table is too large for efficient vacuum and index maintenance
UseUse HASH partitioning — even distribution reduces per-partition size, but accept that range queries scan all partitions
IfNeed time-based pruning AND even distribution within each period to prevent hot partitions
UseUse COMPOSITE (range + hash) — pruning at level 1 on time, distribution at level 2 via hash
IfTable has fewer than 10M rows
UseDo not partition — the planning overhead and operational complexity outweigh any benefit at this scale

Partition Pruning — The Silent Performance Killer When It Fails

Partition pruning is the entire reason partitioning exists. When it works, queries touch only the relevant physical segment — 1 month out of 48, 1 region out of 10. When it fails silently, every query scans all partitions, and partitioning actually degrades performance compared to the unpartitioned table because of the per-partition planning overhead.

The most common pruning killers, in order of how often I see them in the wild:

  1. Function calls on the partition key: WHERE EXTRACT(month FROM created_at) = 1 or WHERE DATE_TRUNC('month', created_at) = '2026-01-01'. The planner cannot compare the function's output against partition boundaries without evaluating the function for every possible input value — which it cannot do at plan time. So it gives up and scans everything.
  2. Implicit type casting: WHERE created_at = '2026-01-15' where created_at is a TIMESTAMP and the literal is a VARCHAR. Some databases resolve this gracefully; others disable pruning rather than risk a type mismatch. Always cast explicitly: WHERE created_at = TIMESTAMP '2026-01-15 00:00:00'.
  3. OR conditions that include a non-partition-key column: WHERE created_at >= '2026-01-01' OR updated_at >= '2026-01-01' when the table is partitioned by created_at. The OR with updated_at forces the planner to consider every partition because it cannot prune based on a column that isn't the partition key.
  4. Predicates buried inside subqueries or CTEs: Some planners cannot push WHERE clause predicates down through a CTE boundary, which means the partition scan at the inner level sees no filter and scans everything. Materializing CTEs in older PostgreSQL versions (pre-12) is a common cause of this.

The diagnostic command is always the same: EXPLAIN (ANALYZE, BUFFERS). Look for 'Partitions scanned: N' in the output. If N equals your total partition count, pruning is not working and you need to find why before the query touches production at scale.

pruning_diagnosis.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
-- ============================================================
-- DIAGNOSIS: verify partition pruning is active
-- Run EXPLAIN before and after any WHERE clause change
-- ============================================================

-- GOOD: bare column with direct range comparison — pruning fires
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount
FROM orders
WHERE created_at >= '2026-01-01'
  AND created_at < '2026-02-01';
-- Expected output: Partitions scanned: 1
-- Execution time: ~200ms on 48-partition table

-- BAD: function call on partition key — pruning disabled
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount
FROM orders
WHERE DATE_TRUNC('month', created_at) = '2026-01-01';
-- Expected output: Partitions scanned: 48
-- Execution time: ~45s — 225x regression

-- BAD: EXTRACT on partition key — same problem, different syntax
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount
FROM orders
WHERE EXTRACT(year FROM created_at) = 2026
  AND EXTRACT(month FROM created_at) = 1;
-- Expected output: Partitions scanned: 48

-- BAD: OR with non-partition-key column — forces full scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount
FROM orders
WHERE created_at >= '2026-01-01'
   OR updated_at >= '2026-01-01';
-- Expected output: Partitions scanned: 48
-- Fix: split into two queries with UNION ALL if the OR is unavoidable

-- ============================================================
-- PARTITION SIZE AUDIT: identify hot or imbalanced partitions
-- Run monthly to catch retention and growth anomalies
-- ============================================================
SELECT
    c.relname AS partition_name,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
    pg_stat_user_tables.n_live_tup AS live_rows
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_stat_user_tables ON pg_stat_user_tables.relname = c.relname
WHERE p.relname = 'orders'
ORDER BY pg_total_relation_size(c.oid) DESC;
Output
-- Good query: Partitions scanned: 1, execution ~200ms
-- Bad query (DATE_TRUNC): Partitions scanned: 48, execution ~45s
-- Bad query (EXTRACT): Partitions scanned: 48 — same penalty, different syntax
-- The only difference between good and bad is the WHERE clause form
Functions on Partition Key Kill Pruning
DATE_TRUNC, EXTRACT, CAST, COALESCE, and any user-defined function applied to the partition key column in WHERE will disable partition pruning entirely. The planner cannot evaluate the function's output against partition boundaries at plan time — it would need to call the function for every possible input value to know which partitions to skip. It doesn't do that. It scans everything instead. Always use bare column comparisons: WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01'.
Production Insight
A single DATE_TRUNC call in the WHERE clause of a reporting query caused the planner to scan 48 partitions instead of 1 — 96M rows instead of 2M. Latency went from 200ms to 45 seconds, a 225x regression. The query looked correct syntactically and returned the right results. The only indication something was wrong was the EXPLAIN output showing all partitions scanned. The team had not run EXPLAIN after adding partitioning. That is the rule: always run EXPLAIN immediately after deploying a partitioned schema and after any WHERE clause change on a partitioned table. Do not trust that it works without verifying.
Key Takeaway
Partition pruning is the entire reason partitioning exists — without it, you have planning overhead and zero performance benefit. Functions on the partition key in WHERE are the number one silent pruning killer and the hardest to catch in code review because the query looks correct. Always verify pruning with EXPLAIN. If partitions scanned equals your total partition count, your partitioning is actively making performance worse.
Pruning Diagnosis Decision Tree
IfEXPLAIN shows 'Partitions scanned: 1' or a small subset
UsePruning is working — query is touching only the relevant partition(s)
IfEXPLAIN shows 'Partitions scanned: all' or count equals total partitions
UsePruning has failed — inspect WHERE clause for functions, implicit casts, or OR conditions involving non-partition columns
IfPartition key is wrapped in DATE_TRUNC, EXTRACT, CAST, or COALESCE
UseRewrite to direct range comparison: WHERE col >= X AND col < Y using typed literals
IfPartition key is compared with a string literal instead of a typed value
UseCast explicitly: WHERE created_at >= TIMESTAMP '2026-01-01 00:00:00' to avoid implicit cast ambiguity
IfWHERE clause uses OR with a column that is not the partition key
UseSplit into two separate queries combined with UNION ALL, or add an index on the non-partition column and accept the full scan as unavoidable given the OR logic

Partition Maintenance — Adding, Dropping, and Managing at Scale

Partitioning shifts database maintenance from row-level operations to metadata-level operations. This is the operational payoff that makes partitioning worth the schema complexity — but only if you manage it correctly. Get maintenance wrong and you end up with the worst of both worlds: the overhead of a partitioned schema and the pain of row-level operations on large datasets.

Dropping old data: Without partitioning, deleting 1 billion rows requires a DELETE statement that generates one WAL entry per deleted row, bloats the table with dead tuples, holds row-level locks during execution, and leaves VACUUM with an enormous cleanup job that takes hours and competes with production reads. With partitioning, ALTER TABLE ... DETACH PARTITION followed by DROP TABLE is a catalog metadata operation — it removes the partition's file pointers from the system catalog in milliseconds. No row-level locks on the parent table, no WAL bloat, no vacuum pressure afterward.

Adding new partitions: For range-partitioned tables, partitions must exist before data arrives. If a row's partition key value falls outside all defined partition ranges and no DEFAULT partition exists, the INSERT fails with a hard error and the transaction rolls back. The standard pattern is a scheduled job (cron or an internal scheduler) that creates partitions 3–6 months ahead using CREATE TABLE IF NOT EXISTS so the job is idempotent.

Sliding window pattern: The production standard for time-series retention is a sliding window: in one monthly maintenance job, create the partition for the month 3 months out and drop the partition from 13 months ago. The table stays at a constant size, the job is fast, and there's no manual intervention needed as long as the job runs.

Partition count limits: Each partition adds overhead to the planner's boundary evaluation step. PostgreSQL handles up to ~500 partitions well; beyond that, query planning time becomes measurable and can exceed execution time for fast queries. If you need more coverage than 500 monthly partitions (roughly 40 years), use composite partitioning — range at the top level with hash sub-partitions — instead of pushing the partition count higher.

partition_maintenance.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
-- ============================================================
-- SLIDING WINDOW: automated partition management
-- Create future partition, drop expired one — zero downtime
-- Run monthly via cron or internal scheduler
-- ============================================================

-- Step 1: Create next month's partition before data arrives
CREATE TABLE IF NOT EXISTS orders_2026_04 PARTITION OF orders
    FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');

-- Step 2: Detach the partition from 13 months ago from the parent
-- DETACH removes catalog entry without locking the parent table
ALTER TABLE orders DETACH PARTITION orders_2025_03;

-- Step 3: Drop the physical files — instant once detached
DROP TABLE orders_2025_03;

-- PostgreSQL 14+: use CONCURRENTLY to avoid any parent table lock during detach
-- ALTER TABLE orders DETACH PARTITION orders_2025_03 CONCURRENTLY;

-- ============================================================
-- AUTOMATED PARTITION CREATION: stored procedure
-- Call monthly to stay 6 months ahead of data arrival
-- Schema: io.thecodeforge.maint for operational procedures
-- ============================================================
CREATE OR REPLACE PROCEDURE io.thecodeforge.maint.create_monthly_partitions(
    p_table_name TEXT,
    p_months_ahead INT DEFAULT 6
)
LANGUAGE plpgsql AS $$
DECLARE
    v_start_date DATE;
    v_end_date   DATE;
    v_partition_name TEXT;
    v_current    DATE := DATE_TRUNC('month', CURRENT_DATE);
BEGIN
    FOR i IN 0..p_months_ahead LOOP
        v_start_date     := v_current + (i || ' months')::INTERVAL;
        v_end_date       := v_start_date + INTERVAL '1 month';
        v_partition_name := p_table_name || '_' || TO_CHAR(v_start_date, 'YYYY_MM');

        EXECUTE format(
            'CREATE TABLE IF NOT EXISTS %I PARTITION OF %I
             FOR VALUES FROM (%L) TO (%L)',
            v_partition_name,
            p_table_name,
            v_start_date,
            v_end_date
        );

        RAISE NOTICE 'Ensured partition exists: % (% to %)',
            v_partition_name, v_start_date, v_end_date;
    END LOOP;
END;
$$;

-- Usage: call at the start of each month to stay ahead
CALL io.thecodeforge.maint.create_monthly_partitions('orders', 6);

-- ============================================================
-- PARTITION HEALTH CHECK: detect gaps and size anomalies
-- Run weekly — catches maintenance job failures before inserts fail
-- ============================================================
SELECT
    c.relname                                                    AS partition_name,
    pg_size_pretty(pg_total_relation_size(c.oid))               AS total_size,
    pg_stat_user_tables.n_live_tup                              AS live_rows,
    pg_stat_user_tables.last_autoanalyze                        AS last_analyzed
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_stat_user_tables ON pg_stat_user_tables.relname = c.relname
WHERE p.relname = 'orders'
ORDER BY c.relname;

-- Monitor DEFAULT partition for unexpected row accumulation
-- Rows landing here mean a future partition is missing
SELECT COUNT(*) AS default_partition_rows
FROM orders_default;
Output
-- DETACH + DROP removes a 13-month-old partition in milliseconds — no WAL, no locks
-- create_monthly_partitions is idempotent — safe to run repeatedly
-- Health check surfaces missing partitions and hot partition size anomalies
-- Default partition monitor catches gaps in future partition coverage before insert failures
Always DETACH Before DROP
Use ALTER TABLE ... DETACH PARTITION before DROP TABLE. DETACH removes the partition from the parent table's catalog entry as a quick metadata update. Dropping the partition directly without detaching first can briefly acquire a lock on the parent table, blocking concurrent queries during that window. For PostgreSQL 14+, DETACH PARTITION CONCURRENTLY avoids any parent table lock entirely — the detach happens in the background while the parent remains fully accessible.
Production Insight
A team was running DELETE FROM orders WHERE created_at < '2025-01-01' to implement their 13-month retention policy. The DELETE ran for 6 hours, generated roughly 200GB of WAL, triggered autovacuum on the entire table, and caused write latency to spike for concurrent inserts throughout the window. They added a maintenance window alert, which made the problem visible but didn't fix it. Switching to DETACH + DROP reduced the retention job from 6 hours to under 30 seconds. The WAL went from 200GB to zero. The maintenance window alert was removed because there was nothing left to protect against.
Key Takeaway
Partitioning turns row-level retention into a metadata operation — DROP instead of DELETE. That's the operational payoff that justifies the schema complexity. Automate partition creation with a job that runs at the start of each month and creates partitions 3–6 months ahead. Monitor the DEFAULT partition — rows accumulating there are an early warning that your creation job has fallen behind.

Composite Partitioning and Cross-Partition Query Patterns

When a single partitioning dimension doesn't fully address your access pattern or your write distribution, composite partitioning (also called sub-partitioning) combines two strategies. The most common production pattern is range at the first level for time-based pruning and hash at the second level for even distribution within each time window.

The problem composite partitioning solves: pure range partitioning on a timestamp creates structurally uneven partitions as traffic grows. Your January 2023 partition has 2M rows. Your January 2026 partition has 40M rows because traffic has grown 20x over three years. Writes concentrate on the current month's partition — that single segment becomes a hotspot for concurrent inserts, autovacuum, and index updates. Adding hash sub-partitions distributes writes across N buckets per time period, keeping per-leaf-partition size roughly constant regardless of when the data was written.

The trade-off is total partition count. Partitioning by month over 4 years gives you 48 partitions. Adding 4 hash sub-partitions per month gives you 192. Adding 8 hash sub-partitions gives you 384. That's still under the 500-partition threshold where planner overhead becomes noticeable. But if you go to 16 sub-partitions over 4 years, you're at 768 — and you'll start seeing planning time increase on fast point-lookup queries.

Cross-partition queries deserve specific attention. When a query cannot prune at either level — because it doesn't filter on the partition key or because the predicate is in a form the planner can't evaluate at plan time — the database must probe every leaf partition. On a composite-partitioned table with 192 partitions, that's 192 separate segment scans. That is materially slower than scanning one unpartitioned table with a good index because of the per-partition coordination overhead. Analytics workloads that routinely scan the entire dataset are not good candidates for row-level partitioning — columnar storage or pre-aggregated materialized views are better tools for that access pattern.

composite_partitioning.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
-- ============================================================
-- COMPOSITE: range (monthly) + hash (4 buckets per month)
-- Prevents hot-partition problem when traffic grows over time
-- Total partitions: months × hash_modulus (e.g., 48 × 4 = 192)
-- ============================================================
CREATE TABLE event_stream (
    event_id    BIGINT NOT NULL,
    user_id     INT NOT NULL,
    event_type  VARCHAR(50),
    event_time  TIMESTAMP NOT NULL,
    payload     JSONB
) PARTITION BY RANGE (event_time);

-- Monthly range partition, itself partitioned by hash on user_id
CREATE TABLE event_stream_2026_01 PARTITION OF event_stream
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01')
    PARTITION BY HASH (user_id);

CREATE TABLE event_stream_2026_01_p0 PARTITION OF event_stream_2026_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE event_stream_2026_01_p1 PARTITION OF event_stream_2026_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE event_stream_2026_01_p2 PARTITION OF event_stream_2026_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE event_stream_2026_01_p3 PARTITION OF event_stream_2026_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- ============================================================
-- PRUNING BEHAVIOR: which queries prune and at which level
-- Understanding this is essential for query design on composite tables
-- ============================================================

-- FULL PRUNING: both range key and hash key in WHERE
-- Planner prunes level 1 (month) then level 2 (hash bucket)
-- Touches exactly 1 leaf partition
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM event_stream
WHERE event_time >= '2026-01-01' AND event_time < '2026-02-01'
  AND user_id = 12345;
-- Partitions scanned: 1 — the hash bucket for user 12345 within January

-- PARTIAL PRUNING: range key only
-- Level 1 prunes to one month; level 2 scans all 4 hash buckets
-- Still useful — touches 4 partitions instead of 192
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM event_stream
WHERE event_time >= '2026-01-01' AND event_time < '2026-02-01';
-- Partitions scanned: 4 (all hash sub-partitions of 2026_01)

-- NO PRUNING: hash key only, no range key
-- Level 1 cannot prune — hash key alone doesn't establish a time boundary
-- Scans every month's hash bucket for the matching remainder
-- Often slower than the unpartitioned equivalent due to per-partition overhead
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM event_stream
WHERE user_id = 12345;
-- Partitions scanned: all leaf partitions across all months

-- ============================================================
-- COMPOSITE PARTITION HEALTH CHECK
-- Verify all sub-partitions exist for each range parent
-- Missing sub-partitions silently drop inserts to the DEFAULT bucket
-- ============================================================
SELECT
    parent.relname  AS range_partition,
    child.relname   AS hash_sub_partition,
    pg_size_pretty(pg_total_relation_size(child.oid)) AS size
FROM pg_inherits i1
JOIN pg_class parent ON parent.oid = i1.inhrelid
JOIN pg_inherits i2  ON i2.inhparent = parent.oid
JOIN pg_class child  ON child.oid = i2.inhrelid
JOIN pg_class root   ON root.oid = i1.inhparent
WHERE root.relname = 'event_stream'
ORDER BY parent.relname, child.relname;
Output
-- Full pruning (both keys): 1 leaf partition scanned — sub-millisecond lookup
-- Partial pruning (range key only): 4 leaf partitions — still fast, month is isolated
-- No pruning (hash key only): all leaf partitions — worse than unpartitioned on small result sets
-- Health check: verifies all sub-partitions are created for each range parent
Composite Pruning Works Top-Down
  • Level 1 (range key): WHERE event_time >= X AND event_time < Y prunes to the matching monthly range partition
  • Level 2 (hash key): WHERE user_id = Z within a pruned range partition prunes to one hash bucket
  • Both keys in WHERE: full pruning — exactly 1 leaf partition scanned regardless of total partition count
  • Only range key in WHERE: partial pruning — all hash sub-partitions of the matching month are scanned
  • Only hash key in WHERE: no level-1 pruning — all months must be evaluated, then hash prunes within each month
Production Insight
A team needed 8 years of daily granularity for compliance reporting, so they created daily range partitions — 2920 partitions. Point-lookup queries that executed in 50ms were taking 800ms to plan because the planner evaluated all 2920 boundaries at plan time. The execution time was fine; the planning time dominated. The fix was restructuring as monthly range partitions with 8 hash sub-partitions each (96 × 8 = 768 total partitions, just under the 500-per-level threshold that starts causing problems). Planning time dropped to under 5ms. The daily granularity they needed for retention was implemented at the application layer, not the partition layer.
Key Takeaway
Composite partitioning solves the hot-partition problem but increases total leaf partition count — watch the 500-partition planning threshold. Pruning works top-down: you only reach one leaf partition when both the range key and the hash key appear as bare columns in WHERE. Design your composite scheme around your dominant query pattern, not around making the partition tree look balanced.
Composite vs Single-Level Partitioning
IfTable is time-partitioned and write throughput has grown significantly — recent partitions are visibly larger and hotter than old ones
UseAdd hash sub-partitions to distribute writes within each time window and prevent autovacuum hotspots
IfQueries almost always filter by date but rarely or never by the hash key column
UseSingle-level range partitioning is sufficient — hash sub-partitions add partition count and operational complexity without meaningful pruning benefit for your query pattern
IfTotal partition count is approaching or exceeding 500
UseConsolidate into larger range windows (monthly instead of weekly, quarterly instead of monthly) or switch to composite to keep each level under the planning overhead threshold
IfAnalytics queries routinely scan the entire dataset regardless of filters
UseRow-level partitioning is not the right tool — consider columnar storage extensions or pre-aggregated materialized views for full-scan analytics workloads
IfNeed both time-based DROP PARTITION for retention and even write distribution across the current window
UseComposite (range + hash) with a small hash modulus (4–8) — range enables clean date-based retention, hash prevents write hotspots on the current partition
● Production incidentPOST-MORTEMseverity: high

Partition pruning silently disabled — query scanned 800M rows instead of 2M

Symptom
Dashboard queries that previously ran in 200ms started taking 45 seconds after the table was migrated to a partitioned schema. EXPLAIN showed sequential scans across all 48 partitions. CPU on the database server spiked to 100% during business hours, and the on-call rotation started getting paged for every reporting window.
Assumption
The team assumed that partitioning would automatically speed up any query that touched the partition key in its WHERE clause, regardless of how the comparison was written. The logic was reasonable on the surface: the column is there, the partition boundaries are defined, the planner should figure it out. It doesn't.
Root cause
The application query used WHERE DATE_TRUNC('month', created_at) = '2026-01-01' instead of WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01'. DATE_TRUNC is a function call applied to the column — the planner cannot compare the function's output against partition boundaries at plan time because it doesn't know what values the function will return for each row without evaluating it. So it conservatively disables pruning and scans every partition. 48 partitions × roughly 2M rows each means the query was reading 96M rows to return results from 2M. The partitioning itself was correct. The query pattern was the problem.
Fix
Rewrote all reporting queries to use direct range comparisons on the bare partition key column: WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01'. Added a lint rule to the query review checklist that rejects any query wrapping a known partition key column in a function inside a WHERE clause. Also wired EXPLAIN output into the CI pipeline for the top 50 reporting queries, so any future pruning regression surfaces in a pull request before it reaches production.
Key lesson
  • Partition pruning requires the partition key to appear as a bare column in WHERE — no functions, no expressions, no casting through a function
  • Always verify pruning with EXPLAIN immediately after deploying a partitioned schema — do not assume it works
  • Add automated EXPLAIN checks in CI targeting your critical queries; pruning regressions are easy to introduce and hard to notice in production until latency spikes
  • DATE_TRUNC, EXTRACT, CAST, and COALESCE applied to the partition key column in WHERE all disable pruning — this is not a bug, it is a fundamental constraint of how the planner evaluates partition boundaries at plan time
Production debug guideSymptom → Action mapping for common partition-related production issues5 entries
Symptom · 01
Query scans all partitions instead of pruning
Fix
Run EXPLAIN (ANALYZE, BUFFERS) and check for 'Partitions scanned: all' or a count equal to your total partition count. Inspect the WHERE clause for function calls on the partition key — DATE_TRUNC, EXTRACT, CAST, and COALESCE are the usual suspects. Remove the function and rewrite to a direct range comparison using bare column comparisons with >= and < operators. Re-run EXPLAIN to confirm the partition scan count drops to 1 or the expected subset.
Symptom · 02
Partition maintenance job runs for hours and blocks writes
Fix
Check whether you're using DELETE instead of DROP PARTITION for data retention. DELETE is a row-level operation — it generates one WAL entry per row, bloats the table until VACUUM runs, and holds row locks that slow concurrent reads. Use ALTER TABLE ... DETACH PARTITION followed by DROP TABLE instead. DETACH is a catalog metadata operation; DROP removes the physical files. The combined operation takes milliseconds and generates no WAL bloat. Schedule this in a sliding window job that runs monthly.
Symptom · 03
Insert performance degrades as partition count grows
Fix
Count your active partitions: SELECT COUNT(*) FROM pg_partitions WHERE tablename = 'your_table'. PostgreSQL's planner evaluates every partition boundary at plan time — at 500+ partitions, this overhead becomes measurable. If you're above that threshold, consider consolidating to larger range windows (quarterly instead of monthly) or restructuring as composite partitioning (range at level one, hash sub-partitions at level two) to reduce total partition count while maintaining distribution.
Symptom · 04
Cross-partition joins are extremely slow
Fix
Check whether the join key aligns with the partition key. If you're joining two tables and the join column is not the partition key on both sides, the database must probe every partition on the inner table for each outer row. The fix is co-partitioning: partition both tables on the same column with identical boundaries so the planner can perform partition-wise joins. If co-partitioning is not feasible, ensure the join column is indexed on both tables and accept the cross-partition overhead as the cost of the data model.
Symptom · 05
Partition key value falls outside all defined ranges — insert fails
Fix
Check whether a DEFAULT partition exists. In PostgreSQL 11+, a DEFAULT partition absorbs any row whose partition key does not match a defined range or list. Without it, inserts with out-of-range values throw a hard error and the transaction rolls back. Add a DEFAULT partition as a safety net and monitor it — if rows are accumulating there, your partition creation job is behind and you need to investigate.
★ SQL Partition Quick Debug Cheat SheetFast diagnostics for partition pruning and maintenance issues
Partition pruning not working
Immediate action
Check EXPLAIN output for partition scan count
Commands
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01';
SELECT schemaname, tablename, n_live_tup AS row_count FROM pg_stat_user_tables WHERE tablename LIKE 'orders_%' ORDER BY tablename;
Fix now
Remove function calls on partition key in WHERE clause; use bare column with direct range operators (>= and <)
Too many partitions causing slow planning+
Immediate action
Count active partitions and check planner time vs execution time in EXPLAIN ANALYZE output
Commands
SELECT COUNT(*) FROM pg_inherits i JOIN pg_class p ON p.oid = i.inhparent WHERE p.relname = 'orders';
SELECT partitionname, partitionrangestart, partitionrangeend FROM pg_partitions WHERE tablename = 'orders' ORDER BY partitionrangestart;
Fix now
If total partition count exceeds 500, consolidate into larger range windows or restructure as composite partitioning (range + hash sub-partitions) to reduce planner evaluation overhead
Partition maintenance blocking production traffic+
Immediate action
Identify long-running DDL or DELETE operations and check for ungranted locks
Commands
SELECT pid, query, state, now() - query_start AS duration FROM pg_stat_activity WHERE state = 'active' AND query ILIKE '%partition%' ORDER BY duration DESC;
SELECT locktype, relation::regclass, mode, granted FROM pg_locks WHERE NOT granted;
Fix now
Replace DELETE-based retention with ALTER TABLE ... DETACH PARTITION followed by DROP TABLE; schedule DDL operations during low-traffic maintenance windows; use DETACH PARTITION CONCURRENTLY on PostgreSQL 14+ to avoid parent table locks entirely
Partitioning Strategies Compared
StrategyBest ForPruning TriggerData DistributionDROP for RetentionMax Practical Partitions
RangeTime-series, event logs, financial transactionsDate/timestamp range in WHERE using direct >= and < comparisonsUneven — recent partitions grow as traffic increases, historical partitions remain staticYes — drop by date range, instant metadata operation~500 (monthly over roughly 40 years)
ListCategorical data — region, status, department, product lineExact value match or IN clause on the discrete partition keyUneven — depends entirely on value distribution in your datasetYes — detach and drop by category value or group50–100 (one partition per logical category group)
HashLarge tables with no natural range or list dimension; workloads needing even distribution for maintenanceEquality on the hash key — the planner computes hash(value) % modulus at plan timeEven — hash function distributes uniformly across all bucketsNo — hash buckets have no semantic meaning; you cannot drop 'old' hash partitions32–128 (use powers of 2 to allow future modulus doubling)
Composite (Range + Hash)Time-series tables where traffic growth creates hot partitions on the current range windowRange key prunes at level 1 (month); hash key prunes at level 2 (bucket within month)Even within each range window — hash distributes writes across sub-partitionsYes — drop the entire range partition (including all its hash sub-partitions) by date~500 total (e.g., 48 months × 4 hash = 192; 48 × 8 = 384)
None (Unpartitioned)Tables under 10M rows; analytics workloads that routinely scan the full datasetN/A — rely on indexes for point lookups and range scansN/AN/A — use time-based DELETE or archive-and-truncate patternsN/A

Key takeaways

1
Partition pruning is the entire reason partitioning works
without it, you have per-partition planning overhead and zero performance benefit. Always verify with EXPLAIN immediately after deploying a partitioned schema and after any WHERE clause change on a partitioned table.
2
The partition key must appear as a bare column in WHERE with direct comparison operators
DATE_TRUNC, EXTRACT, CAST, and COALESCE applied to the partition key column in WHERE silently disable pruning on every query they touch.
3
Range for dates, list for categories, hash for even distribution, composite for both
the strategy must match your dominant query filter. Partitioning on the wrong column is worse than not partitioning at all.
4
DROP PARTITION (DETACH then DROP) is a millisecond metadata operation vs DELETE which runs for hours, generates massive WAL, and triggers autovacuum for days. This is the operational payoff that justifies partitioning's complexity.
5
Keep total leaf partition count under 500
each partition adds to the planner's boundary evaluation work at plan time. Use composite (range + hash) to achieve fine-grained distribution without creating thousands of tiny range partitions.
6
Automate partition management
create future partitions 3–6 months ahead using a monthly scheduled job with idempotent CREATE TABLE IF NOT EXISTS; drop expired partitions in the same job via the sliding window pattern. Monitor the DEFAULT partition — rows accumulating there mean your creation job is behind.

Common mistakes to avoid

5 patterns
×

Partitioning on a column not used in WHERE clauses

Symptom
Partitioning adds per-partition planning overhead but provides zero pruning benefit. Every query scans all partitions. Query latency increases as partition count grows instead of staying constant. Performance is measurably worse than the original unpartitioned table for queries that don't filter on the partition key.
Fix
Before partitioning, run EXPLAIN on your top 20 queries and identify which column appears in WHERE on at least 80% of them. That column is your partition key candidate. Partition on the column your queries actually filter on — typically created_at for time-series workloads or region for categorical data. If no single column dominates your WHERE clauses, partitioning is probably not the right tool for your access pattern.
×

Wrapping the partition key in a function in WHERE

Symptom
Partition pruning silently fails. EXPLAIN shows 'Partitions scanned: all' even though the WHERE clause references the partition key. Query latency increases linearly with partition count instead of staying constant. The query returns correct results, which makes this failure mode particularly hard to notice without inspecting EXPLAIN output.
Fix
Use bare column comparisons with direct range operators: WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01'. Never apply DATE_TRUNC, EXTRACT, CAST, COALESCE, or any other function to the partition key column inside a WHERE clause. Add a lint rule or a pre-commit hook that flags any query wrapping a known partition key column in a function. Wire EXPLAIN output into CI for your critical reporting queries so pruning regressions surface in pull requests.
×

Using DELETE instead of DROP PARTITION for data retention

Symptom
The retention job runs for hours, generates massive WAL output (often hundreds of gigabytes), bloats the table with dead tuples, and triggers aggressive autovacuum that competes with production reads and writes. Concurrent insert latency spikes during the DELETE window because row-level locks are held throughout.
Fix
Replace DELETE-based retention with ALTER TABLE ... DETACH PARTITION followed by DROP TABLE. DETACH removes the partition from the parent table's catalog as a fast metadata update. DROP removes the physical files. The combined operation completes in milliseconds with zero WAL generation and no row-level locks. Automate this with a monthly sliding window job that creates the next period's partition and drops the expired one in the same run.
×

Creating too many fine-grained partitions (thousands of daily or hourly partitions)

Symptom
Query planning time exceeds query execution time. EXPLAIN ANALYZE shows the planner taking 500–800ms to produce a plan for a query that executes in 30–50ms. INSERT performance also degrades as the catalog grows because each insert requires the planner to evaluate partition boundaries to route the row correctly.
Fix
Keep total partition count under 500 leaf partitions. Use monthly or quarterly range windows instead of daily or hourly. If you need finer granularity within a month for distribution purposes, use composite partitioning (range + hash sub-partitions) — 48 monthly partitions with 4 hash sub-partitions each gives you 192 leaf partitions with both time-based pruning and even write distribution.
×

Not creating future partitions before data arrives

Symptom
INSERT fails with 'no partition of relation found for row' at the start of a new month. Application returns 500 errors until a DBA manually creates the missing partition. If the application does not retry on failure, rows written during the gap are permanently lost.
Fix
Automate partition creation with a scheduled job that creates partitions 3–6 months ahead using CREATE TABLE IF NOT EXISTS (making the job idempotent and safe to run repeatedly). Always include a DEFAULT partition as a safety net to absorb unexpected values instead of throwing a hard error. Monitor the DEFAULT partition's row count weekly — rows accumulating there signal that your creation job has fallen behind or your partition key is receiving values outside the expected range.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
What is partition pruning and how would you verify it's working in a pro...
Q02SENIOR
You have a table with 800 million rows partitioned by month. How would y...
Q03SENIOR
When would you choose hash partitioning over range partitioning, and wha...
Q04SENIOR
Explain the difference between partitioning and sharding. When would you...
Q05SENIOR
A query on a partitioned table is scanning all 48 partitions even though...
Q06SENIOR
How does composite partitioning (range + hash) solve the hot-partition p...
Q01 of 06SENIOR

What is partition pruning and how would you verify it's working in a production query?

ANSWER
Partition pruning is the process by which the query planner eliminates entire physical partitions from the scan plan at plan time, before any rows are read, because it can determine from the WHERE clause that those partitions cannot contain matching rows. The planner compares your WHERE clause bounds against the partition boundary definitions stored in the catalog — if the ranges don't overlap, that partition is skipped entirely with zero I/O. To verify pruning: run EXPLAIN (ANALYZE, BUFFERS) on the query and look for 'Partitions scanned: N' in the output. If N is 1 or a small subset of your total partition count, pruning is working. If N equals your total partition count, pruning has failed. Common failure causes in order of frequency: function calls on the partition key in WHERE (DATE_TRUNC, EXTRACT, CAST, COALESCE), implicit type casting between the partition key column type and a literal value, OR conditions that include a column other than the partition key, and predicates buried inside CTEs that the planner cannot push down through the CTE boundary. The fix is always the same: rewrite the WHERE clause to use the partition key as a bare column with direct range comparison operators.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
Does partitioning replace the need for indexes?
02
Can I partition an existing table without downtime?
03
How do I choose between monthly and quarterly partitions?
04
What happens if I insert a row whose partition key value doesn't match any partition?
05
How does partitioning interact with foreign keys and joins?
🔥

That's SQL Advanced. Mark it forged?

8 min read · try the examples if you haven't

Previous
Database Cursors
12 / 16 · SQL Advanced
Next
Materialized Views