Partitioning splits one table into physical segments based on a rule — date, region, category, or hash
Partition pruning is the core benefit: the planner skips partitions that cannot contain matching rows
Four strategies: range (dates), list (categories), hash (even distribution), composite (range + sub-hash or list)
Dropping old data becomes a metadata operation (DROP PARTITION) instead of a row-by-row DELETE
Partition key must appear in WHERE clause for pruning — wrapping it in a function kills pruning silently
Biggest mistake: partitioning on a column not used in queries — adds overhead with zero pruning benefit
Plain-English First
Imagine a massive filing cabinet with 10 million folders — finding one folder means searching every drawer. Now imagine splitting those folders into 12 labelled drawers, one per month. You instantly know which drawer to open. SQL table partitioning does exactly that: it physically divides one enormous table into smaller, manageable chunks (called partitions) based on a rule you define — like date, region, or category. The table still looks like one table to your application, but the database engine is quietly routing queries to only the relevant chunk.
At some point in every database engineer's career, a query that used to run in 200ms starts taking 45 seconds. The table hasn't changed structurally — it's just grown from 2 million rows to 800 million. Indexes help, but even a perfectly tuned B-tree index on 800 million rows requires significant I/O just to traverse the tree and resolve heap pages. You start questioning your index design, your query shape, your statistics configuration — and most of the time the real answer is that you've hit the ceiling of what single-segment storage can do efficiently.
This is where table partitioning stops being an academic concept and becomes an operational lifeline.
The core mechanism partitioning exploits is called partition pruning: instead of scanning or even index-seeking across the full dataset, the query planner eliminates entire physical segments it knows cannot contain rows matching your WHERE clause. A query filtering on the last 30 days never touches the 7 years of historical data sitting in other partitions — those segments are excluded at plan time, before a single page is read from disk. Beyond query performance, partitioning unlocks fast bulk operations. Dropping a year's worth of old data becomes a near-instant metadata operation (DROP PARTITION) instead of a DELETE that locks the table for hours, generates hundreds of gigabytes of WAL, and leaves you running VACUUM for days.
I've seen teams partition a table and wonder why nothing improved — usually because they partitioned on the wrong column, or because a single function call in the WHERE clause was silently disabling pruning on every query. I've also seen partitioning done right turn a 45-second dashboard query into a 180ms one without touching a single index or rewriting application logic.
By the end of this article you'll understand how each partitioning strategy works at the storage level, write production-ready DDL for range, list, hash, and composite schemes, diagnose when partition pruning is silently failing, manage partition maintenance without downtime, and avoid the three mistakes that turn partitioning from a performance win into a support nightmare.
Partitioning Strategies — Range, List, Hash, and Composite
SQL offers four partitioning strategies, each designed for a different access pattern. Choosing the wrong strategy is the most common partitioning mistake — it adds storage and planning overhead with zero pruning benefit on your actual queries.
Range partitioning splits data by a continuous value — typically a date or timestamp. Each partition covers a defined interval (e.g., January 2026, February 2026). This is the most common strategy for time-series data, event logs, and financial transactions. Queries filtering on date ranges prune efficiently because the planner can compare your WHERE clause bounds against partition boundaries directly at plan time — no row evaluation needed.
List partitioning splits data by discrete values — a region code, status, or category. Each partition contains rows matching a specific set of known values (e.g., region IN ('US', 'CA')). This works well when your queries filter on a fixed enumeration of categories and you want each category group to live in its own physical segment for both pruning and operational isolation.
Hash partitioning distributes rows evenly across N partitions using a hash function applied to the partition key. There is no semantic meaning to which partition a row lands in — the goal is purely even distribution. Hash partitioning is the right choice when you have no natural range or list dimension but need to reduce per-partition size for maintenance operations like index rebuilds and vacuum. It provides no pruning benefit for range queries.
Composite partitioning combines two strategies — typically range at the first level and hash (or list) at the second level. For example: partition by month (range), then sub-partition each month into 4 hash buckets. This gives you time-based pruning at the top level and even distribution within each time window at the second level. It's the right call when pure range partitioning creates hot partitions because recent months concentrate all writes.
partitioning_strategies.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
-- ============================================================-- RANGE PARTITIONING: split by date — most common for time-series-- Each partition holds one month of data-- ============================================================CREATETABLEorders (
order_id BIGINTNOTNULL,
customer_id INTNOTNULL,
total_amount DECIMAL(10, 2),
created_at TIMESTAMPNOTNULL,
status VARCHAR(20)
) PARTITIONBYRANGE (created_at);
-- Create monthly partitionsCREATETABLE orders_2026_01 PARTITIONOF orders
FORVALUESFROM ('2026-01-01') TO ('2026-02-01');
CREATETABLE orders_2026_02 PARTITIONOF orders
FORVALUESFROM ('2026-02-01') TO ('2026-03-01');
CREATETABLE orders_2026_03 PARTITIONOF orders
FORVALUESFROM ('2026-03-01') TO ('2026-04-01');
-- ... continue for each month-- Catch-all partition for values outside defined ranges-- Without this, out-of-range inserts fail with a hard errorCREATETABLE orders_default PARTITIONOF orders DEFAULT;
-- ============================================================-- LIST PARTITIONING: split by discrete category values-- Each partition holds a specific region's data-- ============================================================CREATETABLEtransactions (
txn_id BIGINTNOTNULL,
amount DECIMAL(12, 2),
region VARCHAR(10) NOTNULL,
txn_date DATE
) PARTITIONBYLIST (region);
CREATETABLE txn_north_america PARTITIONOF transactions
FORVALUESIN ('US', 'CA', 'MX');
CREATETABLE txn_europe PARTITIONOF transactions
FORVALUESIN ('UK', 'DE', 'FR', 'ES', 'IT');
CREATETABLE txn_asia_pacific PARTITIONOF transactions
FORVALUESIN ('JP', 'KR', 'AU', 'IN', 'SG');
CREATETABLE txn_default PARTITIONOF transactions DEFAULT;
-- ============================================================-- HASH PARTITIONING: even distribution — no semantic meaning-- Use when you have no natural range or list dimension-- ============================================================CREATETABLEuser_events (
event_id BIGINTNOTNULL,
user_id INTNOTNULL,
event_type VARCHAR(50),
event_time TIMESTAMP
) PARTITIONBYHASH (user_id);
CREATETABLE user_events_p0 PARTITIONOF user_events
FORVALUESWITH (MODULUS4, REMAINDER0);
CREATETABLE user_events_p1 PARTITIONOF user_events
FORVALUESWITH (MODULUS4, REMAINDER1);
CREATETABLE user_events_p2 PARTITIONOF user_events
FORVALUESWITH (MODULUS4, REMAINDER2);
CREATETABLE user_events_p3 PARTITIONOF user_events
FORVALUESWITH (MODULUS4, REMAINDER3);
-- ============================================================-- COMPOSITE PARTITIONING: range (monthly) + hash sub-partitions-- Time-based pruning at level 1, even distribution at level 2-- Solves hot-partition problem when recent months are write-heavy-- ============================================================CREATETABLEaudit_log (
log_id BIGINTNOTNULL,
actor_id INTNOTNULL,
action VARCHAR(100),
log_time TIMESTAMPNOTNULL
) PARTITIONBYRANGE (log_time);
-- Monthly partition, itself partitioned by hash on actor_idCREATETABLE audit_log_2026_01 PARTITIONOF audit_log
FORVALUESFROM ('2026-01-01') TO ('2026-02-01')
PARTITIONBYHASH (actor_id);
CREATETABLE audit_log_2026_01_p0 PARTITIONOF audit_log_2026_01
FORVALUESWITH (MODULUS4, REMAINDER0);
CREATETABLE audit_log_2026_01_p1 PARTITIONOF audit_log_2026_01
FORVALUESWITH (MODULUS4, REMAINDER1);
CREATETABLE audit_log_2026_01_p2 PARTITIONOF audit_log_2026_01
FORVALUESWITH (MODULUS4, REMAINDER2);
CREATETABLE audit_log_2026_01_p3 PARTITIONOF audit_log_2026_01
FORVALUESWITH (MODULUS4, REMAINDER3);
Output
-- Range partitioning: queries on created_at prune to one monthly partition
-- List partitioning: queries on region prune to one partition group
-- Hash partitioning: inserts distribute evenly; equality queries on user_id hit one partition
-- Composite: time-range queries prune to one month, then hash distributes within that month
How Partition Pruning Works
Range: WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01' prunes to exactly one monthly partition
List: WHERE region = 'US' prunes to the North America partition group — all other regions are excluded
Hash: WHERE user_id = 12345 prunes to exactly one hash bucket — the planner computes hash(12345) % modulus at plan time
Pruning happens at plan time — zero I/O for skipped partitions, not just filtered rows
Any function applied to the partition key column in WHERE prevents the planner from comparing against boundaries — pruning is disabled entirely
Production Insight
I reviewed a schema where a team had partitioned their events table by HASH on a timestamp column. Their reasoning was that the table was large and needed to be split up. What they got was even distribution across 8 partitions and zero pruning on any of their date-filtered queries — every query scanned all 8 partitions regardless of the time range. Partitioning on the wrong dimension is worse than not partitioning at all because you pay the planner overhead without getting the pruning benefit. The fix was rebuilding with RANGE on the timestamp. Match the strategy to the column your queries actually filter on.
Key Takeaway
Range for dates, list for categories, hash for even distribution, composite for both. The strategy must match your dominant WHERE clause filter — otherwise pruning never activates and you have added complexity with no performance return. Run EXPLAIN on your top 20 queries before committing to any partitioning scheme and verify that pruning fires on the majority of them.
Partitioning Strategy Selection
IfQueries filter primarily by date or timestamp range
→
UseUse RANGE partitioning — the planner prunes to the exact matching time window using boundary comparison
IfQueries filter by a fixed set of discrete values (region, status, category)
→
UseUse LIST partitioning — one partition per value group, direct equality pruning
IfNo natural range or list dimension, but table is too large for efficient vacuum and index maintenance
→
UseUse HASH partitioning — even distribution reduces per-partition size, but accept that range queries scan all partitions
IfNeed time-based pruning AND even distribution within each period to prevent hot partitions
→
UseUse COMPOSITE (range + hash) — pruning at level 1 on time, distribution at level 2 via hash
IfTable has fewer than 10M rows
→
UseDo not partition — the planning overhead and operational complexity outweigh any benefit at this scale
Partition Pruning — The Silent Performance Killer When It Fails
Partition pruning is the entire reason partitioning exists. When it works, queries touch only the relevant physical segment — 1 month out of 48, 1 region out of 10. When it fails silently, every query scans all partitions, and partitioning actually degrades performance compared to the unpartitioned table because of the per-partition planning overhead.
The most common pruning killers, in order of how often I see them in the wild:
Function calls on the partition key: WHERE EXTRACT(month FROM created_at) = 1 or WHERE DATE_TRUNC('month', created_at) = '2026-01-01'. The planner cannot compare the function's output against partition boundaries without evaluating the function for every possible input value — which it cannot do at plan time. So it gives up and scans everything.
Implicit type casting: WHERE created_at = '2026-01-15' where created_at is a TIMESTAMP and the literal is a VARCHAR. Some databases resolve this gracefully; others disable pruning rather than risk a type mismatch. Always cast explicitly: WHERE created_at = TIMESTAMP '2026-01-15 00:00:00'.
OR conditions that include a non-partition-key column: WHERE created_at >= '2026-01-01' OR updated_at >= '2026-01-01' when the table is partitioned by created_at. The OR with updated_at forces the planner to consider every partition because it cannot prune based on a column that isn't the partition key.
Predicates buried inside subqueries or CTEs: Some planners cannot push WHERE clause predicates down through a CTE boundary, which means the partition scan at the inner level sees no filter and scans everything. Materializing CTEs in older PostgreSQL versions (pre-12) is a common cause of this.
The diagnostic command is always the same: EXPLAIN (ANALYZE, BUFFERS). Look for 'Partitions scanned: N' in the output. If N equals your total partition count, pruning is not working and you need to find why before the query touches production at scale.
pruning_diagnosis.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
-- ============================================================-- DIAGNOSIS: verify partition pruning is active-- Run EXPLAIN before and after any WHERE clause change-- ============================================================-- GOOD: bare column with direct range comparison — pruning firesEXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount
FROM orders
WHERE created_at >= '2026-01-01'AND created_at < '2026-02-01';
-- Expected output: Partitions scanned: 1-- Execution time: ~200ms on 48-partition table-- BAD: function call on partition key — pruning disabledEXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount
FROM orders
WHERE DATE_TRUNC('month', created_at) = '2026-01-01';
-- Expected output: Partitions scanned: 48-- Execution time: ~45s — 225x regression-- BAD: EXTRACT on partition key — same problem, different syntaxEXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount
FROM orders
WHEREEXTRACT(year FROM created_at) = 2026ANDEXTRACT(month FROM created_at) = 1;
-- Expected output: Partitions scanned: 48-- BAD: OR with non-partition-key column — forces full scanEXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount
FROM orders
WHERE created_at >= '2026-01-01'OR updated_at >= '2026-01-01';
-- Expected output: Partitions scanned: 48-- Fix: split into two queries with UNION ALL if the OR is unavoidable-- ============================================================-- PARTITION SIZE AUDIT: identify hot or imbalanced partitions-- Run monthly to catch retention and growth anomalies-- ============================================================SELECT
c.relname AS partition_name,
pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
pg_stat_user_tables.n_live_tup AS live_rows
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_stat_user_tables ON pg_stat_user_tables.relname = c.relname
WHERE p.relname = 'orders'ORDERBYpg_total_relation_size(c.oid) DESC;
Output
-- Good query: Partitions scanned: 1, execution ~200ms
-- Bad query (DATE_TRUNC): Partitions scanned: 48, execution ~45s
-- Bad query (EXTRACT): Partitions scanned: 48 — same penalty, different syntax
-- The only difference between good and bad is the WHERE clause form
Functions on Partition Key Kill Pruning
DATE_TRUNC, EXTRACT, CAST, COALESCE, and any user-defined function applied to the partition key column in WHERE will disable partition pruning entirely. The planner cannot evaluate the function's output against partition boundaries at plan time — it would need to call the function for every possible input value to know which partitions to skip. It doesn't do that. It scans everything instead. Always use bare column comparisons: WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01'.
Production Insight
A single DATE_TRUNC call in the WHERE clause of a reporting query caused the planner to scan 48 partitions instead of 1 — 96M rows instead of 2M. Latency went from 200ms to 45 seconds, a 225x regression. The query looked correct syntactically and returned the right results. The only indication something was wrong was the EXPLAIN output showing all partitions scanned. The team had not run EXPLAIN after adding partitioning. That is the rule: always run EXPLAIN immediately after deploying a partitioned schema and after any WHERE clause change on a partitioned table. Do not trust that it works without verifying.
Key Takeaway
Partition pruning is the entire reason partitioning exists — without it, you have planning overhead and zero performance benefit. Functions on the partition key in WHERE are the number one silent pruning killer and the hardest to catch in code review because the query looks correct. Always verify pruning with EXPLAIN. If partitions scanned equals your total partition count, your partitioning is actively making performance worse.
Pruning Diagnosis Decision Tree
IfEXPLAIN shows 'Partitions scanned: 1' or a small subset
→
UsePruning is working — query is touching only the relevant partition(s)
IfEXPLAIN shows 'Partitions scanned: all' or count equals total partitions
→
UsePruning has failed — inspect WHERE clause for functions, implicit casts, or OR conditions involving non-partition columns
IfPartition key is wrapped in DATE_TRUNC, EXTRACT, CAST, or COALESCE
→
UseRewrite to direct range comparison: WHERE col >= X AND col < Y using typed literals
IfPartition key is compared with a string literal instead of a typed value
→
UseCast explicitly: WHERE created_at >= TIMESTAMP '2026-01-01 00:00:00' to avoid implicit cast ambiguity
IfWHERE clause uses OR with a column that is not the partition key
→
UseSplit into two separate queries combined with UNION ALL, or add an index on the non-partition column and accept the full scan as unavoidable given the OR logic
Partition Maintenance — Adding, Dropping, and Managing at Scale
Partitioning shifts database maintenance from row-level operations to metadata-level operations. This is the operational payoff that makes partitioning worth the schema complexity — but only if you manage it correctly. Get maintenance wrong and you end up with the worst of both worlds: the overhead of a partitioned schema and the pain of row-level operations on large datasets.
Dropping old data: Without partitioning, deleting 1 billion rows requires a DELETE statement that generates one WAL entry per deleted row, bloats the table with dead tuples, holds row-level locks during execution, and leaves VACUUM with an enormous cleanup job that takes hours and competes with production reads. With partitioning, ALTER TABLE ... DETACH PARTITION followed by DROP TABLE is a catalog metadata operation — it removes the partition's file pointers from the system catalog in milliseconds. No row-level locks on the parent table, no WAL bloat, no vacuum pressure afterward.
Adding new partitions: For range-partitioned tables, partitions must exist before data arrives. If a row's partition key value falls outside all defined partition ranges and no DEFAULT partition exists, the INSERT fails with a hard error and the transaction rolls back. The standard pattern is a scheduled job (cron or an internal scheduler) that creates partitions 3–6 months ahead using CREATE TABLE IF NOT EXISTS so the job is idempotent.
Sliding window pattern: The production standard for time-series retention is a sliding window: in one monthly maintenance job, create the partition for the month 3 months out and drop the partition from 13 months ago. The table stays at a constant size, the job is fast, and there's no manual intervention needed as long as the job runs.
Partition count limits: Each partition adds overhead to the planner's boundary evaluation step. PostgreSQL handles up to ~500 partitions well; beyond that, query planning time becomes measurable and can exceed execution time for fast queries. If you need more coverage than 500 monthly partitions (roughly 40 years), use composite partitioning — range at the top level with hash sub-partitions — instead of pushing the partition count higher.
partition_maintenance.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
-- ============================================================-- SLIDING WINDOW: automated partition management-- Create future partition, drop expired one — zero downtime-- Run monthly via cron or internal scheduler-- ============================================================-- Step 1: Create next month's partition before data arrivesCREATETABLEIFNOTEXISTS orders_2026_04 PARTITIONOF orders
FORVALUESFROM ('2026-04-01') TO ('2026-05-01');
-- Step 2: Detach the partition from 13 months ago from the parent-- DETACH removes catalog entry without locking the parent tableALTERTABLE orders DETACHPARTITION orders_2025_03;
-- Step 3: Drop the physical files — instant once detachedDROPTABLE orders_2025_03;
-- PostgreSQL 14+: use CONCURRENTLY to avoid any parent table lock during detach-- ALTER TABLE orders DETACH PARTITION orders_2025_03 CONCURRENTLY;-- ============================================================-- AUTOMATED PARTITION CREATION: stored procedure-- Call monthly to stay 6 months ahead of data arrival-- Schema: io.thecodeforge.maint for operational procedures-- ============================================================CREATEORREPLACEPROCEDURE io.thecodeforge.maint.create_monthly_partitions(
p_table_name TEXT,
p_months_ahead INTDEFAULT6
)
LANGUAGE plpgsql AS $$
DECLARE
v_start_date DATE;
v_end_date DATE;
v_partition_name TEXT;
v_current DATE := DATE_TRUNC('month', CURRENT_DATE);
BEGINFOR i IN0..p_months_ahead LOOP
v_start_date := v_current + (i || ' months')::INTERVAL;
v_end_date := v_start_date + INTERVAL'1 month';
v_partition_name := p_table_name || '_' || TO_CHAR(v_start_date, 'YYYY_MM');
EXECUTEformat(
'CREATETABLEIFNOTEXISTS %I PARTITIONOF %I
FORVALUESFROM (%L) TO (%L)',
v_partition_name,
p_table_name,
v_start_date,
v_end_date
);
RAISENOTICE'Ensured partition exists: % (% to %)',
v_partition_name, v_start_date, v_end_date;
ENDLOOP;
END;
$$;
-- Usage: call at the start of each month to stay aheadCALL io.thecodeforge.maint.create_monthly_partitions('orders', 6);
-- ============================================================-- PARTITION HEALTH CHECK: detect gaps and size anomalies-- Run weekly — catches maintenance job failures before inserts fail-- ============================================================SELECT
c.relname AS partition_name,
pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
pg_stat_user_tables.n_live_tup AS live_rows,
pg_stat_user_tables.last_autoanalyze AS last_analyzed
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_stat_user_tables ON pg_stat_user_tables.relname = c.relname
WHERE p.relname = 'orders'ORDERBY c.relname;
-- Monitor DEFAULT partition for unexpected row accumulation-- Rows landing here mean a future partition is missingSELECTCOUNT(*) AS default_partition_rows
FROM orders_default;
Output
-- DETACH + DROP removes a 13-month-old partition in milliseconds — no WAL, no locks
-- create_monthly_partitions is idempotent — safe to run repeatedly
-- Health check surfaces missing partitions and hot partition size anomalies
-- Default partition monitor catches gaps in future partition coverage before insert failures
Always DETACH Before DROP
Use ALTER TABLE ... DETACH PARTITION before DROP TABLE. DETACH removes the partition from the parent table's catalog entry as a quick metadata update. Dropping the partition directly without detaching first can briefly acquire a lock on the parent table, blocking concurrent queries during that window. For PostgreSQL 14+, DETACH PARTITION CONCURRENTLY avoids any parent table lock entirely — the detach happens in the background while the parent remains fully accessible.
Production Insight
A team was running DELETE FROM orders WHERE created_at < '2025-01-01' to implement their 13-month retention policy. The DELETE ran for 6 hours, generated roughly 200GB of WAL, triggered autovacuum on the entire table, and caused write latency to spike for concurrent inserts throughout the window. They added a maintenance window alert, which made the problem visible but didn't fix it. Switching to DETACH + DROP reduced the retention job from 6 hours to under 30 seconds. The WAL went from 200GB to zero. The maintenance window alert was removed because there was nothing left to protect against.
Key Takeaway
Partitioning turns row-level retention into a metadata operation — DROP instead of DELETE. That's the operational payoff that justifies the schema complexity. Automate partition creation with a job that runs at the start of each month and creates partitions 3–6 months ahead. Monitor the DEFAULT partition — rows accumulating there are an early warning that your creation job has fallen behind.
Composite Partitioning and Cross-Partition Query Patterns
When a single partitioning dimension doesn't fully address your access pattern or your write distribution, composite partitioning (also called sub-partitioning) combines two strategies. The most common production pattern is range at the first level for time-based pruning and hash at the second level for even distribution within each time window.
The problem composite partitioning solves: pure range partitioning on a timestamp creates structurally uneven partitions as traffic grows. Your January 2023 partition has 2M rows. Your January 2026 partition has 40M rows because traffic has grown 20x over three years. Writes concentrate on the current month's partition — that single segment becomes a hotspot for concurrent inserts, autovacuum, and index updates. Adding hash sub-partitions distributes writes across N buckets per time period, keeping per-leaf-partition size roughly constant regardless of when the data was written.
The trade-off is total partition count. Partitioning by month over 4 years gives you 48 partitions. Adding 4 hash sub-partitions per month gives you 192. Adding 8 hash sub-partitions gives you 384. That's still under the 500-partition threshold where planner overhead becomes noticeable. But if you go to 16 sub-partitions over 4 years, you're at 768 — and you'll start seeing planning time increase on fast point-lookup queries.
Cross-partition queries deserve specific attention. When a query cannot prune at either level — because it doesn't filter on the partition key or because the predicate is in a form the planner can't evaluate at plan time — the database must probe every leaf partition. On a composite-partitioned table with 192 partitions, that's 192 separate segment scans. That is materially slower than scanning one unpartitioned table with a good index because of the per-partition coordination overhead. Analytics workloads that routinely scan the entire dataset are not good candidates for row-level partitioning — columnar storage or pre-aggregated materialized views are better tools for that access pattern.
composite_partitioning.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
-- ============================================================-- COMPOSITE: range (monthly) + hash (4 buckets per month)-- Prevents hot-partition problem when traffic grows over time-- Total partitions: months × hash_modulus (e.g., 48 × 4 = 192)-- ============================================================CREATETABLEevent_stream (
event_id BIGINTNOTNULL,
user_id INTNOTNULL,
event_type VARCHAR(50),
event_time TIMESTAMPNOTNULL,
payload JSONB
) PARTITIONBYRANGE (event_time);
-- Monthly range partition, itself partitioned by hash on user_idCREATETABLE event_stream_2026_01 PARTITIONOF event_stream
FORVALUESFROM ('2026-01-01') TO ('2026-02-01')
PARTITIONBYHASH (user_id);
CREATETABLE event_stream_2026_01_p0 PARTITIONOF event_stream_2026_01
FORVALUESWITH (MODULUS4, REMAINDER0);
CREATETABLE event_stream_2026_01_p1 PARTITIONOF event_stream_2026_01
FORVALUESWITH (MODULUS4, REMAINDER1);
CREATETABLE event_stream_2026_01_p2 PARTITIONOF event_stream_2026_01
FORVALUESWITH (MODULUS4, REMAINDER2);
CREATETABLE event_stream_2026_01_p3 PARTITIONOF event_stream_2026_01
FORVALUESWITH (MODULUS4, REMAINDER3);
-- ============================================================-- PRUNING BEHAVIOR: which queries prune and at which level-- Understanding this is essential for query design on composite tables-- ============================================================-- FULL PRUNING: both range key and hash key in WHERE-- Planner prunes level 1 (month) then level 2 (hash bucket)-- Touches exactly 1 leaf partitionEXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM event_stream
WHERE event_time >= '2026-01-01'AND event_time < '2026-02-01'AND user_id = 12345;
-- Partitions scanned: 1 — the hash bucket for user 12345 within January-- PARTIAL PRUNING: range key only-- Level 1 prunes to one month; level 2 scans all 4 hash buckets-- Still useful — touches 4 partitions instead of 192EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM event_stream
WHERE event_time >= '2026-01-01'AND event_time < '2026-02-01';
-- Partitions scanned: 4 (all hash sub-partitions of 2026_01)-- NO PRUNING: hash key only, no range key-- Level 1 cannot prune — hash key alone doesn't establish a time boundary-- Scans every month's hash bucket for the matching remainder-- Often slower than the unpartitioned equivalent due to per-partition overheadEXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM event_stream
WHERE user_id = 12345;
-- Partitions scanned: all leaf partitions across all months-- ============================================================-- COMPOSITE PARTITION HEALTH CHECK-- Verify all sub-partitions exist for each range parent-- Missing sub-partitions silently drop inserts to the DEFAULT bucket-- ============================================================SELECT
parent.relname AS range_partition,
child.relname AS hash_sub_partition,
pg_size_pretty(pg_total_relation_size(child.oid)) AS size
FROM pg_inherits i1
JOIN pg_class parent ON parent.oid = i1.inhrelid
JOIN pg_inherits i2 ON i2.inhparent = parent.oid
JOIN pg_class child ON child.oid = i2.inhrelid
JOIN pg_class root ON root.oid = i1.inhparent
WHERE root.relname = 'event_stream'ORDERBY parent.relname, child.relname;
-- Partial pruning (range key only): 4 leaf partitions — still fast, month is isolated
-- No pruning (hash key only): all leaf partitions — worse than unpartitioned on small result sets
-- Health check: verifies all sub-partitions are created for each range parent
Composite Pruning Works Top-Down
Level 1 (range key): WHERE event_time >= X AND event_time < Y prunes to the matching monthly range partition
Level 2 (hash key): WHERE user_id = Z within a pruned range partition prunes to one hash bucket
Both keys in WHERE: full pruning — exactly 1 leaf partition scanned regardless of total partition count
Only range key in WHERE: partial pruning — all hash sub-partitions of the matching month are scanned
Only hash key in WHERE: no level-1 pruning — all months must be evaluated, then hash prunes within each month
Production Insight
A team needed 8 years of daily granularity for compliance reporting, so they created daily range partitions — 2920 partitions. Point-lookup queries that executed in 50ms were taking 800ms to plan because the planner evaluated all 2920 boundaries at plan time. The execution time was fine; the planning time dominated. The fix was restructuring as monthly range partitions with 8 hash sub-partitions each (96 × 8 = 768 total partitions, just under the 500-per-level threshold that starts causing problems). Planning time dropped to under 5ms. The daily granularity they needed for retention was implemented at the application layer, not the partition layer.
Key Takeaway
Composite partitioning solves the hot-partition problem but increases total leaf partition count — watch the 500-partition planning threshold. Pruning works top-down: you only reach one leaf partition when both the range key and the hash key appear as bare columns in WHERE. Design your composite scheme around your dominant query pattern, not around making the partition tree look balanced.
Composite vs Single-Level Partitioning
IfTable is time-partitioned and write throughput has grown significantly — recent partitions are visibly larger and hotter than old ones
→
UseAdd hash sub-partitions to distribute writes within each time window and prevent autovacuum hotspots
IfQueries almost always filter by date but rarely or never by the hash key column
→
UseSingle-level range partitioning is sufficient — hash sub-partitions add partition count and operational complexity without meaningful pruning benefit for your query pattern
IfTotal partition count is approaching or exceeding 500
→
UseConsolidate into larger range windows (monthly instead of weekly, quarterly instead of monthly) or switch to composite to keep each level under the planning overhead threshold
IfAnalytics queries routinely scan the entire dataset regardless of filters
→
UseRow-level partitioning is not the right tool — consider columnar storage extensions or pre-aggregated materialized views for full-scan analytics workloads
IfNeed both time-based DROP PARTITION for retention and even write distribution across the current window
→
UseComposite (range + hash) with a small hash modulus (4–8) — range enables clean date-based retention, hash prevents write hotspots on the current partition
Dashboard queries that previously ran in 200ms started taking 45 seconds after the table was migrated to a partitioned schema. EXPLAIN showed sequential scans across all 48 partitions. CPU on the database server spiked to 100% during business hours, and the on-call rotation started getting paged for every reporting window.
Assumption
The team assumed that partitioning would automatically speed up any query that touched the partition key in its WHERE clause, regardless of how the comparison was written. The logic was reasonable on the surface: the column is there, the partition boundaries are defined, the planner should figure it out. It doesn't.
Root cause
The application query used WHERE DATE_TRUNC('month', created_at) = '2026-01-01' instead of WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01'. DATE_TRUNC is a function call applied to the column — the planner cannot compare the function's output against partition boundaries at plan time because it doesn't know what values the function will return for each row without evaluating it. So it conservatively disables pruning and scans every partition. 48 partitions × roughly 2M rows each means the query was reading 96M rows to return results from 2M. The partitioning itself was correct. The query pattern was the problem.
Fix
Rewrote all reporting queries to use direct range comparisons on the bare partition key column: WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01'. Added a lint rule to the query review checklist that rejects any query wrapping a known partition key column in a function inside a WHERE clause. Also wired EXPLAIN output into the CI pipeline for the top 50 reporting queries, so any future pruning regression surfaces in a pull request before it reaches production.
Key lesson
Partition pruning requires the partition key to appear as a bare column in WHERE — no functions, no expressions, no casting through a function
Always verify pruning with EXPLAIN immediately after deploying a partitioned schema — do not assume it works
Add automated EXPLAIN checks in CI targeting your critical queries; pruning regressions are easy to introduce and hard to notice in production until latency spikes
DATE_TRUNC, EXTRACT, CAST, and COALESCE applied to the partition key column in WHERE all disable pruning — this is not a bug, it is a fundamental constraint of how the planner evaluates partition boundaries at plan time
Production debug guideSymptom → Action mapping for common partition-related production issues5 entries
Symptom · 01
Query scans all partitions instead of pruning
→
Fix
Run EXPLAIN (ANALYZE, BUFFERS) and check for 'Partitions scanned: all' or a count equal to your total partition count. Inspect the WHERE clause for function calls on the partition key — DATE_TRUNC, EXTRACT, CAST, and COALESCE are the usual suspects. Remove the function and rewrite to a direct range comparison using bare column comparisons with >= and < operators. Re-run EXPLAIN to confirm the partition scan count drops to 1 or the expected subset.
Symptom · 02
Partition maintenance job runs for hours and blocks writes
→
Fix
Check whether you're using DELETE instead of DROP PARTITION for data retention. DELETE is a row-level operation — it generates one WAL entry per row, bloats the table until VACUUM runs, and holds row locks that slow concurrent reads. Use ALTER TABLE ... DETACH PARTITION followed by DROP TABLE instead. DETACH is a catalog metadata operation; DROP removes the physical files. The combined operation takes milliseconds and generates no WAL bloat. Schedule this in a sliding window job that runs monthly.
Symptom · 03
Insert performance degrades as partition count grows
→
Fix
Count your active partitions: SELECT COUNT(*) FROM pg_partitions WHERE tablename = 'your_table'. PostgreSQL's planner evaluates every partition boundary at plan time — at 500+ partitions, this overhead becomes measurable. If you're above that threshold, consider consolidating to larger range windows (quarterly instead of monthly) or restructuring as composite partitioning (range at level one, hash sub-partitions at level two) to reduce total partition count while maintaining distribution.
Symptom · 04
Cross-partition joins are extremely slow
→
Fix
Check whether the join key aligns with the partition key. If you're joining two tables and the join column is not the partition key on both sides, the database must probe every partition on the inner table for each outer row. The fix is co-partitioning: partition both tables on the same column with identical boundaries so the planner can perform partition-wise joins. If co-partitioning is not feasible, ensure the join column is indexed on both tables and accept the cross-partition overhead as the cost of the data model.
Symptom · 05
Partition key value falls outside all defined ranges — insert fails
→
Fix
Check whether a DEFAULT partition exists. In PostgreSQL 11+, a DEFAULT partition absorbs any row whose partition key does not match a defined range or list. Without it, inserts with out-of-range values throw a hard error and the transaction rolls back. Add a DEFAULT partition as a safety net and monitor it — if rows are accumulating there, your partition creation job is behind and you need to investigate.
★ SQL Partition Quick Debug Cheat SheetFast diagnostics for partition pruning and maintenance issues
Partition pruning not working−
Immediate action
Check EXPLAIN output for partition scan count
Commands
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01';
SELECT schemaname, tablename, n_live_tup AS row_count FROM pg_stat_user_tables WHERE tablename LIKE 'orders_%' ORDER BY tablename;
Fix now
Remove function calls on partition key in WHERE clause; use bare column with direct range operators (>= and <)
Too many partitions causing slow planning+
Immediate action
Count active partitions and check planner time vs execution time in EXPLAIN ANALYZE output
Commands
SELECT COUNT(*) FROM pg_inherits i JOIN pg_class p ON p.oid = i.inhparent WHERE p.relname = 'orders';
SELECT partitionname, partitionrangestart, partitionrangeend FROM pg_partitions WHERE tablename = 'orders' ORDER BY partitionrangestart;
Fix now
If total partition count exceeds 500, consolidate into larger range windows or restructure as composite partitioning (range + hash sub-partitions) to reduce planner evaluation overhead
Partition maintenance blocking production traffic+
Immediate action
Identify long-running DDL or DELETE operations and check for ungranted locks
Commands
SELECT pid, query, state, now() - query_start AS duration FROM pg_stat_activity WHERE state = 'active' AND query ILIKE '%partition%' ORDER BY duration DESC;
SELECT locktype, relation::regclass, mode, granted FROM pg_locks WHERE NOT granted;
Fix now
Replace DELETE-based retention with ALTER TABLE ... DETACH PARTITION followed by DROP TABLE; schedule DDL operations during low-traffic maintenance windows; use DETACH PARTITION CONCURRENTLY on PostgreSQL 14+ to avoid parent table locks entirely
Partitioning Strategies Compared
Strategy
Best For
Pruning Trigger
Data Distribution
DROP for Retention
Max Practical Partitions
Range
Time-series, event logs, financial transactions
Date/timestamp range in WHERE using direct >= and < comparisons
Tables under 10M rows; analytics workloads that routinely scan the full dataset
N/A — rely on indexes for point lookups and range scans
N/A
N/A — use time-based DELETE or archive-and-truncate patterns
N/A
Key takeaways
1
Partition pruning is the entire reason partitioning works
without it, you have per-partition planning overhead and zero performance benefit. Always verify with EXPLAIN immediately after deploying a partitioned schema and after any WHERE clause change on a partitioned table.
2
The partition key must appear as a bare column in WHERE with direct comparison operators
DATE_TRUNC, EXTRACT, CAST, and COALESCE applied to the partition key column in WHERE silently disable pruning on every query they touch.
3
Range for dates, list for categories, hash for even distribution, composite for both
the strategy must match your dominant query filter. Partitioning on the wrong column is worse than not partitioning at all.
4
DROP PARTITION (DETACH then DROP) is a millisecond metadata operation vs DELETE which runs for hours, generates massive WAL, and triggers autovacuum for days. This is the operational payoff that justifies partitioning's complexity.
5
Keep total leaf partition count under 500
each partition adds to the planner's boundary evaluation work at plan time. Use composite (range + hash) to achieve fine-grained distribution without creating thousands of tiny range partitions.
6
Automate partition management
create future partitions 3–6 months ahead using a monthly scheduled job with idempotent CREATE TABLE IF NOT EXISTS; drop expired partitions in the same job via the sliding window pattern. Monitor the DEFAULT partition — rows accumulating there mean your creation job is behind.
Common mistakes to avoid
5 patterns
×
Partitioning on a column not used in WHERE clauses
Symptom
Partitioning adds per-partition planning overhead but provides zero pruning benefit. Every query scans all partitions. Query latency increases as partition count grows instead of staying constant. Performance is measurably worse than the original unpartitioned table for queries that don't filter on the partition key.
Fix
Before partitioning, run EXPLAIN on your top 20 queries and identify which column appears in WHERE on at least 80% of them. That column is your partition key candidate. Partition on the column your queries actually filter on — typically created_at for time-series workloads or region for categorical data. If no single column dominates your WHERE clauses, partitioning is probably not the right tool for your access pattern.
×
Wrapping the partition key in a function in WHERE
Symptom
Partition pruning silently fails. EXPLAIN shows 'Partitions scanned: all' even though the WHERE clause references the partition key. Query latency increases linearly with partition count instead of staying constant. The query returns correct results, which makes this failure mode particularly hard to notice without inspecting EXPLAIN output.
Fix
Use bare column comparisons with direct range operators: WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01'. Never apply DATE_TRUNC, EXTRACT, CAST, COALESCE, or any other function to the partition key column inside a WHERE clause. Add a lint rule or a pre-commit hook that flags any query wrapping a known partition key column in a function. Wire EXPLAIN output into CI for your critical reporting queries so pruning regressions surface in pull requests.
×
Using DELETE instead of DROP PARTITION for data retention
Symptom
The retention job runs for hours, generates massive WAL output (often hundreds of gigabytes), bloats the table with dead tuples, and triggers aggressive autovacuum that competes with production reads and writes. Concurrent insert latency spikes during the DELETE window because row-level locks are held throughout.
Fix
Replace DELETE-based retention with ALTER TABLE ... DETACH PARTITION followed by DROP TABLE. DETACH removes the partition from the parent table's catalog as a fast metadata update. DROP removes the physical files. The combined operation completes in milliseconds with zero WAL generation and no row-level locks. Automate this with a monthly sliding window job that creates the next period's partition and drops the expired one in the same run.
×
Creating too many fine-grained partitions (thousands of daily or hourly partitions)
Symptom
Query planning time exceeds query execution time. EXPLAIN ANALYZE shows the planner taking 500–800ms to produce a plan for a query that executes in 30–50ms. INSERT performance also degrades as the catalog grows because each insert requires the planner to evaluate partition boundaries to route the row correctly.
Fix
Keep total partition count under 500 leaf partitions. Use monthly or quarterly range windows instead of daily or hourly. If you need finer granularity within a month for distribution purposes, use composite partitioning (range + hash sub-partitions) — 48 monthly partitions with 4 hash sub-partitions each gives you 192 leaf partitions with both time-based pruning and even write distribution.
×
Not creating future partitions before data arrives
Symptom
INSERT fails with 'no partition of relation found for row' at the start of a new month. Application returns 500 errors until a DBA manually creates the missing partition. If the application does not retry on failure, rows written during the gap are permanently lost.
Fix
Automate partition creation with a scheduled job that creates partitions 3–6 months ahead using CREATE TABLE IF NOT EXISTS (making the job idempotent and safe to run repeatedly). Always include a DEFAULT partition as a safety net to absorb unexpected values instead of throwing a hard error. Monitor the DEFAULT partition's row count weekly — rows accumulating there signal that your creation job has fallen behind or your partition key is receiving values outside the expected range.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01SENIOR
What is partition pruning and how would you verify it's working in a pro...
Q02SENIOR
You have a table with 800 million rows partitioned by month. How would y...
Q03SENIOR
When would you choose hash partitioning over range partitioning, and wha...
Q04SENIOR
Explain the difference between partitioning and sharding. When would you...
Q05SENIOR
A query on a partitioned table is scanning all 48 partitions even though...
Q06SENIOR
How does composite partitioning (range + hash) solve the hot-partition p...
Q01 of 06SENIOR
What is partition pruning and how would you verify it's working in a production query?
ANSWER
Partition pruning is the process by which the query planner eliminates entire physical partitions from the scan plan at plan time, before any rows are read, because it can determine from the WHERE clause that those partitions cannot contain matching rows. The planner compares your WHERE clause bounds against the partition boundary definitions stored in the catalog — if the ranges don't overlap, that partition is skipped entirely with zero I/O.
To verify pruning: run EXPLAIN (ANALYZE, BUFFERS) on the query and look for 'Partitions scanned: N' in the output. If N is 1 or a small subset of your total partition count, pruning is working. If N equals your total partition count, pruning has failed.
Common failure causes in order of frequency: function calls on the partition key in WHERE (DATE_TRUNC, EXTRACT, CAST, COALESCE), implicit type casting between the partition key column type and a literal value, OR conditions that include a column other than the partition key, and predicates buried inside CTEs that the planner cannot push down through the CTE boundary. The fix is always the same: rewrite the WHERE clause to use the partition key as a bare column with direct range comparison operators.
Q02 of 06SENIOR
You have a table with 800 million rows partitioned by month. How would you design a data retention policy that keeps 13 months of data and drops older data without impacting production traffic?
ANSWER
Use a sliding window pattern implemented as a monthly scheduled job with two operations:
First, create the partition for the month three to six periods out using CREATE TABLE IF NOT EXISTS. This ensures future data always has a destination and the job is idempotent — running it twice doesn't cause errors.
Second, drop the partition from 13 months ago. The correct sequence is: ALTER TABLE orders DETACH PARTITION orders_2025_03, then DROP TABLE orders_2025_03. Never drop directly without detaching first. DETACH removes the partition from the parent table's catalog entry as a fast metadata operation. Once detached, DROP TABLE removes the physical files. The combined operation completes in milliseconds with zero WAL generation, zero row-level locks on the parent table, and no autovacuum cleanup needed afterward.
For PostgreSQL 14+, use DETACH PARTITION CONCURRENTLY to avoid acquiring any lock on the parent table during the detach step — the parent remains fully accessible to reads and writes throughout.
Always monitor the DEFAULT partition. If rows are accumulating there, the scheduled job has fallen behind and future partitions are missing. Set up an alert on DEFAULT partition row count so you catch this before it causes insert failures.
Q03 of 06SENIOR
When would you choose hash partitioning over range partitioning, and what are the trade-offs?
ANSWER
Use hash partitioning when there is no natural date or discrete category dimension to partition on, but the table is large enough that you need to reduce per-partition size for operational reasons — specifically for index rebuilds, autovacuum efficiency, and parallel maintenance operations.
Hash partitioning provides perfectly even distribution across N partitions regardless of the key distribution in your data. That's its only benefit. It provides zero pruning benefit for range queries — the planner cannot determine which hash bucket contains rows for a given range without scanning all buckets.
Range partitioning provides strong pruning for time-filtered queries but creates increasingly uneven partitions as traffic grows — your current month's partition becomes much larger than historical months. It also enables the DROP PARTITION retention pattern, which hash partitioning does not support in any meaningful way because hash buckets have no semantic date meaning.
The right answer in practice is usually: use range partitioning when queries filter by date (which they almost always do in data-heavy systems), and if the current partition becomes a write hotspot due to traffic growth, add hash sub-partitions to create a composite scheme rather than switching to pure hash. Pure hash partitioning is the right call mainly for lookup tables and reference data where you need even distribution for join performance but have no time-based access pattern.
Q04 of 06SENIOR
Explain the difference between partitioning and sharding. When would you use one over the other?
ANSWER
Partitioning divides a single logical table into multiple physical segments within one database instance. All partitions share the same server process, the same connection pool, the same transaction coordinator, and the same WAL stream. From the application's perspective, the table behaves as one object. The database engine routes queries and inserts to the correct physical segment transparently.
Sharding distributes data across multiple independent database instances — each shard is a completely separate database server with its own storage, connections, WAL, and transaction context. The application (or a middleware layer) must be aware of sharding — it needs to route queries to the correct shard and handle cross-shard queries explicitly.
Use partitioning when: your data fits on one server's storage and I/O capacity but queries are slow because the table is too large for efficient index traversal; you need fast data retention via DROP PARTITION; you want the planner to skip irrelevant physical segments without any application-level routing logic.
Use sharding when: your data volume or write throughput exceeds what one server can handle regardless of partitioning and indexing optimizations; you need horizontal write scaling across multiple servers; your multi-tenant architecture requires strict data isolation at the infrastructure level with each tenant on a separate database instance.
Partitioning is operationally simpler — same ACID guarantees, same query interface, standard backup and replication tooling. Sharding introduces distributed transaction complexity, cross-shard aggregation overhead, rebalancing challenges when you add shards, and higher operational burden across the board. Start with partitioning and only consider sharding when you've genuinely hit the single-server ceiling.
Q05 of 06SENIOR
A query on a partitioned table is scanning all 48 partitions even though the WHERE clause filters on the partition key. Walk me through your debugging process.
ANSWER
Step one: run EXPLAIN (ANALYZE, BUFFERS) and confirm the partition scan count equals 48. This rules out the possibility that the query is already pruning and the slow performance has a different cause.
Step two: look at the WHERE clause carefully. Check for any function applied to the partition key column — DATE_TRUNC, EXTRACT, CAST, COALESCE, and any UDF are the most common culprits. If the partition key is inside a function, the planner cannot compare the function's output against partition boundaries at plan time and disables pruning.
Step three: check for implicit type casting. If the partition key is a TIMESTAMP column and you're comparing it to a VARCHAR literal without an explicit cast, some databases disable pruning rather than risk a type mismatch. Cast explicitly.
Step four: check for OR conditions. If the WHERE clause uses OR and one of the OR branches filters on a non-partition-key column, the planner cannot prune — it must consider all partitions to satisfy the OR.
Step five: check whether the partition key predicate is buried inside a CTE. In PostgreSQL pre-12, CTEs are optimization fences — the planner materializes the CTE result and cannot push predicates down through the CTE boundary. The inner partition scan sees no filter and scans everything.
Step six: rewrite the WHERE clause to use the partition key as a bare column with direct range operators and re-run EXPLAIN to confirm pruning activates. Step seven: add this query to your CI EXPLAIN check so a future regression surfaces in code review rather than as a production latency incident.
Q06 of 06SENIOR
How does composite partitioning (range + hash) solve the hot-partition problem, and what are the limits?
ANSWER
Pure range partitioning on a timestamp produces structurally uneven partitions over time. If your traffic has grown 10x over three years, your current month's partition receives 10x more writes per day than any historical partition. That single segment concentrates autovacuum work, index updates, and write I/O — it becomes a bottleneck that the rest of your partitioning scheme doesn't help with.
Composite partitioning addresses this by adding hash sub-partitions within each range window. Instead of one January 2026 partition receiving all January writes, you have four (or eight) hash buckets that each receive one-quarter (or one-eighth) of January writes based on a hash of the secondary key — typically user_id or a similar high-cardinality identifier. Each leaf partition stays smaller and more evenly loaded regardless of how much traffic has grown.
The limits are three. First, total partition count: 48 monthly range partitions with 8 hash sub-partitions each gives you 384 leaf partitions — approaching the ~500 threshold where PostgreSQL's planner evaluation overhead starts affecting fast queries. Keep the hash modulus small (4–8) unless you have a strong reason to go higher. Second, pruning requires both keys: a query filtering only on the range key scans all hash sub-partitions within the matching month. A query filtering only on the hash key gets no range pruning and scans every month's matching hash bucket — often worse than an unpartitioned table. Design your queries to include both keys in WHERE for full pruning. Third, maintenance complexity: when you create a new monthly range partition, you must also create all its hash sub-partitions. Your partition creation automation needs to handle both levels correctly.
01
What is partition pruning and how would you verify it's working in a production query?
SENIOR
02
You have a table with 800 million rows partitioned by month. How would you design a data retention policy that keeps 13 months of data and drops older data without impacting production traffic?
SENIOR
03
When would you choose hash partitioning over range partitioning, and what are the trade-offs?
SENIOR
04
Explain the difference between partitioning and sharding. When would you use one over the other?
SENIOR
05
A query on a partitioned table is scanning all 48 partitions even though the WHERE clause filters on the partition key. Walk me through your debugging process.
SENIOR
06
How does composite partitioning (range + hash) solve the hot-partition problem, and what are the limits?
SENIOR
FAQ · 5 QUESTIONS
Frequently Asked Questions
01
Does partitioning replace the need for indexes?
No — and conflating the two is a common mistake. Partitioning and indexes operate at different levels and solve different problems. Partition pruning reduces the physical segments the planner considers at the table level — it skips entire partitions that cannot contain matching rows. Indexes provide fast lookup within a partition — they let the database find specific rows within a segment without a sequential scan.
A well-partitioned table still needs indexes on columns used in WHERE, JOIN, and ORDER BY clauses. The advantage is that each partition's index covers only that partition's rows, so the index is smaller, faster to traverse, and faster to rebuild or maintain. A composite query that prunes to one monthly partition and then uses an index on customer_id within that partition is orders of magnitude faster than either technique alone. The production standard for large tables is partition pruning plus local indexes — not one or the other.
Was this helpful?
02
Can I partition an existing table without downtime?
PostgreSQL does not support converting an existing unpartitioned table to a partitioned table in-place — there is no ALTER TABLE ... PARTITION BY command that restructures an existing heap. The standard migration approach is:
Create a new partitioned table with identical schema and all required partitions.
Copy data from the old table in batches using INSERT INTO new_table SELECT ... FROM old_table WHERE ..., chunking by the partition key range to avoid a single massive transaction.
Use logical replication or application-level dual-write to keep the new table current during the copy phase.
Once the new table is in sync, swap names using ALTER TABLE RENAME inside a transaction — this requires a brief write lock.
Update application connection strings or views to reference the new table and drop the old one.
For a true zero-downtime migration, the logical replication approach during the copy phase is essential — it lets you cut over during a low-traffic window with a write lock measured in seconds rather than hours. MySQL supports online DDL for partitioning modifications in some versions and storage engines, but the specific behavior depends on the MySQL version and whether the table uses InnoDB.
Was this helpful?
03
How do I choose between monthly and quarterly partitions?
Monthly partitions are the right default for most time-series workloads. They balance partition count (48 for 4 years of data) with reasonable per-partition size for most traffic levels, they align naturally with billing cycles and reporting periods, and they make the DROP PARTITION retention pattern intuitive to reason about.
Switch to quarterly partitions when: your data volume per month is very low (under 1–2 million rows) and you want to reduce the total partition count; your retention policy is expressed in quarters rather than months; or your queries typically span 3+ months and rarely filter to a single month, making monthly pruning less effective.
Switch to weekly or daily partitions only when your monthly row count exceeds 100–200 million rows and you need finer-grained retention — but watch the total partition count ceiling. Daily partitions over 4 years gives you 1460 leaf partitions, well above the 500-partition threshold where planner overhead becomes a problem. In that scenario, use monthly range partitions with hash sub-partitions rather than daily range partitions.
Was this helpful?
04
What happens if I insert a row whose partition key value doesn't match any partition?
Without a DEFAULT partition, the INSERT fails immediately with an error along the lines of 'no partition of relation found for row'. The entire transaction rolls back. The row is not inserted and does not land anywhere — it is simply rejected.
This is a silent operational failure mode: if your application does not retry on this specific error class, the data is lost. The failure typically happens at the start of a new month when the application starts generating data for a period that the partition creation job hasn't covered yet.
The defense has two layers: first, always create a DEFAULT partition that absorbs any value that doesn't match a defined range or list. Rows landing in DEFAULT are not lost and can be moved to the correct partition once it's created. Second, monitor the DEFAULT partition's row count weekly — accumulation there is an early warning that your partition creation automation has failed or fallen behind. An alert on DEFAULT partition row count gives you time to fix the creation job before the backlog grows large.
Was this helpful?
05
How does partitioning interact with foreign keys and joins?
Foreign key constraints on partitioned tables have significant limitations that vary by database. In PostgreSQL, a foreign key from a partitioned table to another table requires the referenced table to be either unpartitioned or partitioned with matching boundaries and the same partition key. References from an unpartitioned table to a partitioned table are not supported at all in most PostgreSQL versions prior to 12, and support in later versions has caveats. Check your specific database version before relying on foreign key constraints across partitioned tables.
For joins, the key concept is partition-wise joins. When two tables are co-partitioned on the same key with identical boundaries — for example, both orders and order_items are partitioned by created_at with matching monthly boundaries — PostgreSQL can perform the join partition-by-partition in parallel, which is significantly faster than joining the full datasets. Enable partition_wise_join in PostgreSQL's planner settings to allow this optimization.
When tables are not co-partitioned and you're joining on a non-partition-key column, the database must probe every partition on the inner table for each outer row. In that scenario, ensure the join column is indexed on both tables. The cross-partition join overhead is unavoidable given the data model — you're paying a structural cost for the way the data is organized.