Senior 11 min · March 06, 2026

SQL Table Partitioning — Function on Key Disables Pruning

Q: Does partitioning replace the need for indexes?

No — and conflating the two is a common mistake. Partitioning and indexes operate at different levels and solve different problems. Partition pruning reduces the physical segments the planner considers at the table level — it skips entire partitions that cannot contain matching rows. Indexes provide fast lookup within a partition — they let the database find specific rows within a segment without a sequential scan. A well-partitioned table still needs indexes on columns used in WHERE, JOIN, and ORDER BY clauses. The advantage is that each partition's index covers only that partition's rows, so the index is smaller, faster to traverse, and faster to rebuild or maintain. A composite query that prunes to one monthly partition and then uses an index on customer_id within that partition is orders of magnitude faster than either technique alone. The production standard for large tables is partition pruning plus local indexes — not one or the other.

Q: Can I partition an existing table without downtime?

PostgreSQL does not support converting an existing unpartitioned table to a partitioned table in-place — there is no ALTER TABLE ... PARTITION BY command that restructures an existing heap. The standard migration approach is: 1. Create a new partitioned table with identical schema and all required partitions. 2. Copy data from the old table in batches using INSERT INTO new_table SELECT ... FROM old_table WHERE ..., chunking by the partition key range to avoid a single massive transaction. 3. Use logical replication or application-level dual-write to keep the new table current during the copy phase. 4. Once the new table is in sync, swap names using ALTER TABLE RENAME inside a transaction — this requires a brief write lock. 5. Update application connection strings or views to reference the new table and drop the old one. For a true zero-downtime migration, the logical replication approach during the copy phase is essential — it lets you cut over during a low-traffic window with a write lock measured in seconds rather than hours. MySQL supports online DDL for partitioning modifications in some versions and storage engines, but the specific behavior depends on the MySQL version and whether the table uses InnoDB.

Q: How do I choose between monthly and quarterly partitions?

Monthly partitions are the right default for most time-series workloads. They balance partition count (48 for 4 years of data) with reasonable per-partition size for most traffic levels, they align naturally with billing cycles and reporting periods, and they make the DROP PARTITION retention pattern intuitive to reason about. Switch to quarterly partitions when: your data volume per month is very low (under 1–2 million rows) and you want to reduce the total partition count; your retention policy is expressed in quarters rather than months; or your queries typically span 3+ months and rarely filter to a single month, making monthly pruning less effective. Switch to weekly or daily partitions only when your monthly row count exceeds 100–200 million rows and you need finer-grained retention — but watch the total partition count ceiling. Daily partitions over 4 years gives you 1460 leaf partitions, well above the 500-partition threshold where planner overhead becomes a problem. In that scenario, use monthly range partitions with hash sub-partitions rather than daily range partitions.

Q: What happens if I insert a row whose partition key value doesn't match any partition?

Without a DEFAULT partition, the INSERT fails immediately with an error along the lines of 'no partition of relation found for row'. The entire transaction rolls back. The row is not inserted and does not land anywhere — it is simply rejected. This is a silent operational failure mode: if your application does not retry on this specific error class, the data is lost. The failure typically happens at the start of a new month when the application starts generating data for a period that the partition creation job hasn't covered yet. The defense has two layers: first, always create a DEFAULT partition that absorbs any value that doesn't match a defined range or list. Rows landing in DEFAULT are not lost and can be moved to the correct partition once it's created. Second, monitor the DEFAULT partition's row count weekly — accumulation there is an early warning that your partition creation automation has failed or fallen behind. An alert on DEFAULT partition row count gives you time to fix the creation job before the backlog grows large.

Q: How does partitioning interact with foreign keys and joins?

Foreign key constraints on partitioned tables have significant limitations that vary by database. In PostgreSQL, a foreign key from a partitioned table to another table requires the referenced table to be either unpartitioned or partitioned with matching boundaries and the same partition key. References from an unpartitioned table to a partitioned table are not supported at all in most PostgreSQL versions prior to 12, and support in later versions has caveats. Check your specific database version before relying on foreign key constraints across partitioned tables. For joins, the key concept is partition-wise joins. When two tables are co-partitioned on the same key with identical boundaries — for example, both orders and order_items are partitioned by created_at with matching monthly boundaries — PostgreSQL can perform the join partition-by-partition in parallel, which is significantly faster than joining the full datasets. Enable partition_wise_join in PostgreSQL's planner settings to allow this optimization. When tables are not co-partitioned and you're joining on a non-partition-key column, the database must probe every partition on the inner table for each outer row. In that scenario, ensure the join column is indexed on both tables. The cross-partition join overhead is unavoidable given the data model — you're paying a structural cost for the way the data is organized.

A function on the partition key column in WHERE disables pruning, scanning all 800M rows instead of expected 2M.

Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Notes here come from systems that actually shipped.

✓ Production

production tested

May 23, 2026

last updated

1,554

articles · all by Naren

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

Partitioning splits one table into physical segments based on a rule — date, region, category, or hash
Partition pruning is the core benefit: the planner skips partitions that cannot contain matching rows
Four strategies: range (dates), list (categories), hash (even distribution), composite (range + sub-hash or list)
Dropping old data becomes a metadata operation (DROP PARTITION) instead of a row-by-row DELETE
Partition key must appear in WHERE clause for pruning — wrapping it in a function kills pruning silently
Biggest mistake: partitioning on a column not used in queries — adds overhead with zero pruning benefit

✦ Definition~90s read

What is Partitioning Tables in SQL?

SQL table partitioning is a database design technique that physically splits a large table into smaller, more manageable segments called partitions, based on a defined key column (e.g., date, region, or ID). The primary reason to partition is performance: queries that filter on the partition key can skip irrelevant partitions entirely—a process called partition pruning—dramatically reducing I/O and scan times.

★

Imagine a massive filing cabinet with 10 million folders — finding one folder means searching every drawer.

It also simplifies data lifecycle management: you can drop or archive entire partitions in milliseconds rather than running costly DELETE operations on millions of rows. Major databases like PostgreSQL (declarative partitioning since v10), MySQL (InnoDB), and SQL Server all support it, but the implementation details and pruning behavior vary significantly.

However, partitioning is not a silver bullet. The most common pitfall—and the focus of this article—is that wrapping the partition key in a function (e.g., WHERE DATE(created_at) = '2024-01-01' instead of WHERE created_at >= '2024-01-01' AND created_at < '2024-01-02') disables partition pruning entirely.

The database cannot statically determine which partitions to scan because the function obscures the underlying key value. This turns a targeted query into a full table scan across all partitions, silently destroying performance at scale. You should avoid partitioning on columns that are frequently transformed in WHERE clauses, or use generated columns or expression-based partitioning (where supported) to preserve pruning.

Alternatives to partitioning include indexing (B-tree, BRIN, or GiST) for smaller tables, or sharding across separate database instances for extreme write throughput. Partitioning is best suited for append-heavy workloads (e.g., time-series logs, event streams) where you query recent data and archive old data.

It is overkill for tables under 10 million rows or for OLTP systems with frequent cross-partition updates, where the overhead of partition routing and metadata management can outweigh the benefits. Always verify pruning behavior using EXPLAIN plans—never assume it works.

Plain-English First

Imagine a massive filing cabinet with 10 million folders — finding one folder means searching every drawer. Now imagine splitting those folders into 12 labelled drawers, one per month. You instantly know which drawer to open. SQL table partitioning does exactly that: it physically divides one enormous table into smaller, manageable chunks (called partitions) based on a rule you define — like date, region, or category. The table still looks like one table to your application, but the database engine is quietly routing queries to only the relevant chunk.

At some point in every database engineer's career, a query that used to run in 200ms starts taking 45 seconds. The table hasn't changed structurally — it's just grown from 2 million rows to 800 million. Indexes help, but even a perfectly tuned B-tree index on 800 million rows requires significant I/O just to traverse the tree and resolve heap pages. You start questioning your index design, your query shape, your statistics configuration — and most of the time the real answer is that you've hit the ceiling of what single-segment storage can do efficiently.

This is where table partitioning stops being an academic concept and becomes an operational lifeline.

The core mechanism partitioning exploits is called partition pruning: instead of scanning or even index-seeking across the full dataset, the query planner eliminates entire physical segments it knows cannot contain rows matching your WHERE clause. A query filtering on the last 30 days never touches the 7 years of historical data sitting in other partitions — those segments are excluded at plan time, before a single page is read from disk. Beyond query performance, partitioning unlocks fast bulk operations. Dropping a year's worth of old data becomes a near-instant metadata operation (DROP PARTITION) instead of a DELETE that locks the table for hours, generates hundreds of gigabytes of WAL, and leaves you running VACUUM for days.

I've seen teams partition a table and wonder why nothing improved — usually because they partitioned on the wrong column, or because a single function call in the WHERE clause was silently disabling pruning on every query. I've also seen partitioning done right turn a 45-second dashboard query into a 180ms one without touching a single index or rewriting application logic.

By the end of this article you'll understand how each partitioning strategy works at the storage level, write production-ready DDL for range, list, hash, and composite schemes, diagnose when partition pruning is silently failing, manage partition maintenance without downtime, and avoid the three mistakes that turn partitioning from a performance win into a support nightmare.

Why Partition Pruning Fails When the Partition Key Is Wrapped in a Function

SQL table partitioning splits a large table into smaller physical segments (partitions) based on a partition key — typically a date column or an ID range. The core mechanic is partition pruning: the query planner eliminates irrelevant partitions at plan time, scanning only the subset that could contain matching rows. This is not an optimization hint; it's a structural property of the storage layout.

Pruning works because the planner compares query predicates directly against the partition boundary metadata. When you write WHERE created_at >= '2024-01-01', the planner knows which partitions hold January data. But if you wrap the partition key in a function — WHERE DATE(created_at) >= '2024-01-01' — the planner sees an opaque expression, not a direct column comparison. It cannot infer the partition range, so it falls back to a full scan of all partitions. This is not a planner bug; it's a fundamental limitation of how partition metadata is stored and matched.

Use partitioning when you have time-series data exceeding 100M rows, or when you need to drop old data via DROP PARTITION instead of slow DELETE. But never assume the planner will be smart about function-wrapped keys. The performance cost of a missed prune is linear in the number of partitions — if you have 365 daily partitions, a missed prune means scanning 365× more data than necessary.

Function on Key = No Prune

Even an immutable function like DATE(ts) disables pruning in PostgreSQL, MySQL, and most engines. Only direct column comparisons use partition metadata.

Production Insight

A team partitioned a 2B-row events table by event_date but wrote queries with WHERE DATE(event_date) = CURRENT_DATE. The planner scanned all 365 partitions every night, causing 40-minute query times and I/O saturation on the SAN. Rule: always compare the raw partition column — use event_date = CURRENT_DATE or a range event_date >= '2024-01-01' AND event_date < '2024-01-02'.

Key Takeaway

Partition pruning requires direct column predicates — any function wrapping the key disables it.

Always compare the raw partition key, not a derived expression.

Test pruning with EXPLAIN (or EXPLAIN PARTITIONS) before deploying to production.

thecodeforge.io

SQL Partition Pruning Failure on Key Functions

Partitioning Tables Sql

Partitioning Strategies — Range, List, Hash, and Composite

SQL offers four partitioning strategies, each designed for a different access pattern. Choosing the wrong strategy is the most common partitioning mistake — it adds storage and planning overhead with zero pruning benefit on your actual queries.

Range partitioning splits data by a continuous value — typically a date or timestamp. Each partition covers a defined interval (e.g., January 2026, February 2026). This is the most common strategy for time-series data, event logs, and financial transactions. Queries filtering on date ranges prune efficiently because the planner can compare your WHERE clause bounds against partition boundaries directly at plan time — no row evaluation needed.

List partitioning splits data by discrete values — a region code, status, or category. Each partition contains rows matching a specific set of known values (e.g., region IN ('US', 'CA')). This works well when your queries filter on a fixed enumeration of categories and you want each category group to live in its own physical segment for both pruning and operational isolation.

Hash partitioning distributes rows evenly across N partitions using a hash function applied to the partition key. There is no semantic meaning to which partition a row lands in — the goal is purely even distribution. Hash partitioning is the right choice when you have no natural range or list dimension but need to reduce per-partition size for maintenance operations like index rebuilds and vacuum. It provides no pruning benefit for range queries.

Composite partitioning combines two strategies — typically range at the first level and hash (or list) at the second level. For example: partition by month (range), then sub-partition each month into 4 hash buckets. This gives you time-based pruning at the top level and even distribution within each time window at the second level. It's the right call when pure range partitioning creates hot partitions because recent months concentrate all writes.

partitioning_strategies.sqlSQL

-- ============================================================
-- RANGE PARTITIONING: split by date — most common for time-series
-- Each partition holds one month of data
-- ============================================================
CREATE TABLE orders (
    order_id     BIGINT NOT NULL,
    customer_id  INT NOT NULL,
    total_amount DECIMAL(10, 2),
    created_at   TIMESTAMP NOT NULL,
    status       VARCHAR(20)
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE orders_2026_01 PARTITION OF orders
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE orders_2026_02 PARTITION OF orders
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
CREATE TABLE orders_2026_03 PARTITION OF orders
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');
-- ... continue for each month

-- Catch-all partition for values outside defined ranges
-- Without this, out-of-range inserts fail with a hard error
CREATE TABLE orders_default PARTITION OF orders DEFAULT;

-- ============================================================
-- LIST PARTITIONING: split by discrete category values
-- Each partition holds a specific region's data
-- ============================================================
CREATE TABLE transactions (
    txn_id      BIGINT NOT NULL,
    amount      DECIMAL(12, 2),
    region      VARCHAR(10) NOT NULL,
    txn_date    DATE
) PARTITION BY LIST (region);

CREATE TABLE txn_north_america PARTITION OF transactions
    FOR VALUES IN ('US', 'CA', 'MX');
CREATE TABLE txn_europe PARTITION OF transactions
    FOR VALUES IN ('UK', 'DE', 'FR', 'ES', 'IT');
CREATE TABLE txn_asia_pacific PARTITION OF transactions
    FOR VALUES IN ('JP', 'KR', 'AU', 'IN', 'SG');
CREATE TABLE txn_default PARTITION OF transactions DEFAULT;

-- ============================================================
-- HASH PARTITIONING: even distribution — no semantic meaning
-- Use when you have no natural range or list dimension
-- ============================================================
CREATE TABLE user_events (
    event_id    BIGINT NOT NULL,
    user_id     INT NOT NULL,
    event_type  VARCHAR(50),
    event_time  TIMESTAMP
) PARTITION BY HASH (user_id);

CREATE TABLE user_events_p0 PARTITION OF user_events
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE user_events_p1 PARTITION OF user_events
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE user_events_p2 PARTITION OF user_events
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE user_events_p3 PARTITION OF user_events
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- ============================================================
-- COMPOSITE PARTITIONING: range (monthly) + hash sub-partitions
-- Time-based pruning at level 1, even distribution at level 2
-- Solves hot-partition problem when recent months are write-heavy
-- ============================================================
CREATE TABLE audit_log (
    log_id      BIGINT NOT NULL,
    actor_id    INT NOT NULL,
    action      VARCHAR(100),
    log_time    TIMESTAMP NOT NULL
) PARTITION BY RANGE (log_time);

-- Monthly partition, itself partitioned by hash on actor_id
CREATE TABLE audit_log_2026_01 PARTITION OF audit_log
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01')
    PARTITION BY HASH (actor_id);

CREATE TABLE audit_log_2026_01_p0 PARTITION OF audit_log_2026_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE audit_log_2026_01_p1 PARTITION OF audit_log_2026_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE audit_log_2026_01_p2 PARTITION OF audit_log_2026_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE audit_log_2026_01_p3 PARTITION OF audit_log_2026_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

Output

-- Range partitioning: queries on created_at prune to one monthly partition

-- List partitioning: queries on region prune to one partition group

-- Hash partitioning: inserts distribute evenly; equality queries on user_id hit one partition

-- Composite: time-range queries prune to one month, then hash distributes within that month

How Partition Pruning Works

Range: WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01' prunes to exactly one monthly partition
List: WHERE region = 'US' prunes to the North America partition group — all other regions are excluded
Hash: WHERE user_id = 12345 prunes to exactly one hash bucket — the planner computes hash(12345) % modulus at plan time
Pruning happens at plan time — zero I/O for skipped partitions, not just filtered rows
Any function applied to the partition key column in WHERE prevents the planner from comparing against boundaries — pruning is disabled entirely

Production Insight

I reviewed a schema where a team had partitioned their events table by HASH on a timestamp column. Their reasoning was that the table was large and needed to be split up. What they got was even distribution across 8 partitions and zero pruning on any of their date-filtered queries — every query scanned all 8 partitions regardless of the time range. Partitioning on the wrong dimension is worse than not partitioning at all because you pay the planner overhead without getting the pruning benefit. The fix was rebuilding with RANGE on the timestamp. Match the strategy to the column your queries actually filter on.

Key Takeaway

Range for dates, list for categories, hash for even distribution, composite for both. The strategy must match your dominant WHERE clause filter — otherwise pruning never activates and you have added complexity with no performance return. Run EXPLAIN on your top 20 queries before committing to any partitioning scheme and verify that pruning fires on the majority of them.

Partitioning Strategy Selection

IfQueries filter primarily by date or timestamp range

→

UseUse RANGE partitioning — the planner prunes to the exact matching time window using boundary comparison

IfQueries filter by a fixed set of discrete values (region, status, category)

→

UseUse LIST partitioning — one partition per value group, direct equality pruning

IfNo natural range or list dimension, but table is too large for efficient vacuum and index maintenance

→

UseUse HASH partitioning — even distribution reduces per-partition size, but accept that range queries scan all partitions

IfNeed time-based pruning AND even distribution within each period to prevent hot partitions

→

UseUse COMPOSITE (range + hash) — pruning at level 1 on time, distribution at level 2 via hash

IfTable has fewer than 10M rows

→

UseDo not partition — the planning overhead and operational complexity outweigh any benefit at this scale

Partition Pruning — The Silent Performance Killer When It Fails

Partition pruning is the entire reason partitioning exists. When it works, queries touch only the relevant physical segment — 1 month out of 48, 1 region out of 10. When it fails silently, every query scans all partitions, and partitioning actually degrades performance compared to the unpartitioned table because of the per-partition planning overhead.

The most common pruning killers, in order of how often I see them in the wild:

Function calls on the partition key: WHERE EXTRACT(month FROM created_at) = 1 or WHERE DATE_TRUNC('month', created_at) = '2026-01-01'. The planner cannot compare the function's output against partition boundaries without evaluating the function for every possible input value — which it cannot do at plan time. So it gives up and scans everything.
Implicit type casting: WHERE created_at = '2026-01-15' where created_at is a TIMESTAMP and the literal is a VARCHAR. Some databases resolve this gracefully; others disable pruning rather than risk a type mismatch. Always cast explicitly: WHERE created_at = TIMESTAMP '2026-01-15 00:00:00'.
OR conditions that include a non-partition-key column: WHERE created_at >= '2026-01-01' OR updated_at >= '2026-01-01' when the table is partitioned by created_at. The OR with updated_at forces the planner to consider every partition because it cannot prune based on a column that isn't the partition key.
Predicates buried inside subqueries or CTEs: Some planners cannot push WHERE clause predicates down through a CTE boundary, which means the partition scan at the inner level sees no filter and scans everything. Materializing CTEs in older PostgreSQL versions (pre-12) is a common cause of this.

The diagnostic command is always the same: EXPLAIN (ANALYZE, BUFFERS). Look for 'Partitions scanned: N' in the output. If N equals your total partition count, pruning is not working and you need to find why before the query touches production at scale.

pruning_diagnosis.sqlSQL

-- ============================================================
-- DIAGNOSIS: verify partition pruning is active
-- Run EXPLAIN before and after any WHERE clause change
-- ============================================================

-- GOOD: bare column with direct range comparison — pruning fires
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount
FROM orders
WHERE created_at >= '2026-01-01'
  AND created_at < '2026-02-01';
-- Expected output: Partitions scanned: 1
-- Execution time: ~200ms on 48-partition table

-- BAD: function call on partition key — pruning disabled
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount
FROM orders
WHERE DATE_TRUNC('month', created_at) = '2026-01-01';
-- Expected output: Partitions scanned: 48
-- Execution time: ~45s — 225x regression

-- BAD: EXTRACT on partition key — same problem, different syntax
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount
FROM orders
WHERE EXTRACT(year FROM created_at) = 2026
  AND EXTRACT(month FROM created_at) = 1;
-- Expected output: Partitions scanned: 48

-- BAD: OR with non-partition-key column — forces full scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount
FROM orders
WHERE created_at >= '2026-01-01'
   OR updated_at >= '2026-01-01';
-- Expected output: Partitions scanned: 48
-- Fix: split into two queries with UNION ALL if the OR is unavoidable

-- ============================================================
-- PARTITION SIZE AUDIT: identify hot or imbalanced partitions
-- Run monthly to catch retention and growth anomalies
-- ============================================================
SELECT
    c.relname AS partition_name,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
    pg_stat_user_tables.n_live_tup AS live_rows
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_stat_user_tables ON pg_stat_user_tables.relname = c.relname
WHERE p.relname = 'orders'
ORDER BY pg_total_relation_size(c.oid) DESC;

Output

-- Good query: Partitions scanned: 1, execution ~200ms

-- Bad query (DATE_TRUNC): Partitions scanned: 48, execution ~45s

-- Bad query (EXTRACT): Partitions scanned: 48 — same penalty, different syntax

-- The only difference between good and bad is the WHERE clause form

Functions on Partition Key Kill Pruning

DATE_TRUNC, EXTRACT, CAST, COALESCE, and any user-defined function applied to the partition key column in WHERE will disable partition pruning entirely. The planner cannot evaluate the function's output against partition boundaries at plan time — it would need to call the function for every possible input value to know which partitions to skip. It doesn't do that. It scans everything instead. Always use bare column comparisons: WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01'.

Production Insight

A single DATE_TRUNC call in the WHERE clause of a reporting query caused the planner to scan 48 partitions instead of 1 — 96M rows instead of 2M. Latency went from 200ms to 45 seconds, a 225x regression. The query looked correct syntactically and returned the right results. The only indication something was wrong was the EXPLAIN output showing all partitions scanned. The team had not run EXPLAIN after adding partitioning. That is the rule: always run EXPLAIN immediately after deploying a partitioned schema and after any WHERE clause change on a partitioned table. Do not trust that it works without verifying.

Key Takeaway

Partition pruning is the entire reason partitioning exists — without it, you have planning overhead and zero performance benefit. Functions on the partition key in WHERE are the number one silent pruning killer and the hardest to catch in code review because the query looks correct. Always verify pruning with EXPLAIN. If partitions scanned equals your total partition count, your partitioning is actively making performance worse.

Pruning Diagnosis Decision Tree

IfEXPLAIN shows 'Partitions scanned: 1' or a small subset

→

UsePruning is working — query is touching only the relevant partition(s)

IfEXPLAIN shows 'Partitions scanned: all' or count equals total partitions

→

UsePruning has failed — inspect WHERE clause for functions, implicit casts, or OR conditions involving non-partition columns

IfPartition key is wrapped in DATE_TRUNC, EXTRACT, CAST, or COALESCE

→

UseRewrite to direct range comparison: WHERE col >= X AND col < Y using typed literals

IfPartition key is compared with a string literal instead of a typed value

→

UseCast explicitly: WHERE created_at >= TIMESTAMP '2026-01-01 00:00:00' to avoid implicit cast ambiguity

IfWHERE clause uses OR with a column that is not the partition key

→

UseSplit into two separate queries combined with UNION ALL, or add an index on the non-partition column and accept the full scan as unavoidable given the OR logic

Partition Maintenance — Adding, Dropping, and Managing at Scale

Partitioning shifts database maintenance from row-level operations to metadata-level operations. This is the operational payoff that makes partitioning worth the schema complexity — but only if you manage it correctly. Get maintenance wrong and you end up with the worst of both worlds: the overhead of a partitioned schema and the pain of row-level operations on large datasets.

Dropping old data: Without partitioning, deleting 1 billion rows requires a DELETE statement that generates one WAL entry per deleted row, bloats the table with dead tuples, holds row-level locks during execution, and leaves VACUUM with an enormous cleanup job that takes hours and competes with production reads. With partitioning, ALTER TABLE ... DETACH PARTITION followed by DROP TABLE is a catalog metadata operation — it removes the partition's file pointers from the system catalog in milliseconds. No row-level locks on the parent table, no WAL bloat, no vacuum pressure afterward.

Adding new partitions: For range-partitioned tables, partitions must exist before data arrives. If a row's partition key value falls outside all defined partition ranges and no DEFAULT partition exists, the INSERT fails with a hard error and the transaction rolls back. The standard pattern is a scheduled job (cron or an internal scheduler) that creates partitions 3–6 months ahead using CREATE TABLE IF NOT EXISTS so the job is idempotent.

Sliding window pattern: The production standard for time-series retention is a sliding window: in one monthly maintenance job, create the partition for the month 3 months out and drop the partition from 13 months ago. The table stays at a constant size, the job is fast, and there's no manual intervention needed as long as the job runs.

Partition count limits: Each partition adds overhead to the planner's boundary evaluation step. PostgreSQL handles up to ~500 partitions well; beyond that, query planning time becomes measurable and can exceed execution time for fast queries. If you need more coverage than 500 monthly partitions (roughly 40 years), use composite partitioning — range at the top level with hash sub-partitions — instead of pushing the partition count higher.

partition_maintenance.sqlSQL

-- ============================================================
-- SLIDING WINDOW: automated partition management
-- Create future partition, drop expired one — zero downtime
-- Run monthly via cron or internal scheduler
-- ============================================================

-- Step 1: Create next month's partition before data arrives
CREATE TABLE IF NOT EXISTS orders_2026_04 PARTITION OF orders
    FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');

-- Step 2: Detach the partition from 13 months ago from the parent
-- DETACH removes catalog entry without locking the parent table
ALTER TABLE orders DETACH PARTITION orders_2025_03;

-- Step 3: Drop the physical files — instant once detached
DROP TABLE orders_2025_03;

-- PostgreSQL 14+: use CONCURRENTLY to avoid any parent table lock during detach
-- ALTER TABLE orders DETACH PARTITION orders_2025_03 CONCURRENTLY;

-- ============================================================
-- AUTOMATED PARTITION CREATION: stored procedure
-- Call monthly to stay 6 months ahead of data arrival
-- Schema: io.thecodeforge.maint for operational procedures
-- ============================================================
CREATE OR REPLACE PROCEDURE io.thecodeforge.maint.create_monthly_partitions(
    p_table_name TEXT,
    p_months_ahead INT DEFAULT 6
)
LANGUAGE plpgsql AS $$
DECLARE
    v_start_date DATE;
    v_end_date   DATE;
    v_partition_name TEXT;
    v_current    DATE := DATE_TRUNC('month', CURRENT_DATE);
BEGIN
    FOR i IN 0..p_months_ahead LOOP
        v_start_date     := v_current + (i || ' months')::INTERVAL;
        v_end_date       := v_start_date + INTERVAL '1 month';
        v_partition_name := p_table_name || '_' || TO_CHAR(v_start_date, 'YYYY_MM');

        EXECUTE format(
            'CREATE TABLE IF NOT EXISTS %I PARTITION OF %I
             FOR VALUES FROM (%L) TO (%L)',
            v_partition_name,
            p_table_name,
            v_start_date,
            v_end_date
        );

        RAISE NOTICE 'Ensured partition exists: % (% to %)',
            v_partition_name, v_start_date, v_end_date;
    END LOOP;
END;
$$;

-- Usage: call at the start of each month to stay ahead
CALL io.thecodeforge.maint.create_monthly_partitions('orders', 6);

-- ============================================================
-- PARTITION HEALTH CHECK: detect gaps and size anomalies
-- Run weekly — catches maintenance job failures before inserts fail
-- ============================================================
SELECT
    c.relname                                                    AS partition_name,
    pg_size_pretty(pg_total_relation_size(c.oid))               AS total_size,
    pg_stat_user_tables.n_live_tup                              AS live_rows,
    pg_stat_user_tables.last_autoanalyze                        AS last_analyzed
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
JOIN pg_class p ON p.oid = i.inhparent
JOIN pg_stat_user_tables ON pg_stat_user_tables.relname = c.relname
WHERE p.relname = 'orders'
ORDER BY c.relname;

-- Monitor DEFAULT partition for unexpected row accumulation
-- Rows landing here mean a future partition is missing
SELECT COUNT(*) AS default_partition_rows
FROM orders_default;

Output

-- DETACH + DROP removes a 13-month-old partition in milliseconds — no WAL, no locks

-- create_monthly_partitions is idempotent — safe to run repeatedly

-- Health check surfaces missing partitions and hot partition size anomalies

-- Default partition monitor catches gaps in future partition coverage before insert failures

Always DETACH Before DROP

Use ALTER TABLE ... DETACH PARTITION before DROP TABLE. DETACH removes the partition from the parent table's catalog entry as a quick metadata update. Dropping the partition directly without detaching first can briefly acquire a lock on the parent table, blocking concurrent queries during that window. For PostgreSQL 14+, DETACH PARTITION CONCURRENTLY avoids any parent table lock entirely — the detach happens in the background while the parent remains fully accessible.

Production Insight

A team was running DELETE FROM orders WHERE created_at < '2025-01-01' to implement their 13-month retention policy. The DELETE ran for 6 hours, generated roughly 200GB of WAL, triggered autovacuum on the entire table, and caused write latency to spike for concurrent inserts throughout the window. They added a maintenance window alert, which made the problem visible but didn't fix it. Switching to DETACH + DROP reduced the retention job from 6 hours to under 30 seconds. The WAL went from 200GB to zero. The maintenance window alert was removed because there was nothing left to protect against.

Key Takeaway

Partitioning turns row-level retention into a metadata operation — DROP instead of DELETE. That's the operational payoff that justifies the schema complexity. Automate partition creation with a job that runs at the start of each month and creates partitions 3–6 months ahead. Monitor the DEFAULT partition — rows accumulating there are an early warning that your creation job has fallen behind.

Composite Partitioning and Cross-Partition Query Patterns

When a single partitioning dimension doesn't fully address your access pattern or your write distribution, composite partitioning (also called sub-partitioning) combines two strategies. The most common production pattern is range at the first level for time-based pruning and hash at the second level for even distribution within each time window.

The problem composite partitioning solves: pure range partitioning on a timestamp creates structurally uneven partitions as traffic grows. Your January 2023 partition has 2M rows. Your January 2026 partition has 40M rows because traffic has grown 20x over three years. Writes concentrate on the current month's partition — that single segment becomes a hotspot for concurrent inserts, autovacuum, and index updates. Adding hash sub-partitions distributes writes across N buckets per time period, keeping per-leaf-partition size roughly constant regardless of when the data was written.

The trade-off is total partition count. Partitioning by month over 4 years gives you 48 partitions. Adding 4 hash sub-partitions per month gives you 192. Adding 8 hash sub-partitions gives you 384. That's still under the 500-partition threshold where planner overhead becomes noticeable. But if you go to 16 sub-partitions over 4 years, you're at 768 — and you'll start seeing planning time increase on fast point-lookup queries.

Cross-partition queries deserve specific attention. When a query cannot prune at either level — because it doesn't filter on the partition key or because the predicate is in a form the planner can't evaluate at plan time — the database must probe every leaf partition. On a composite-partitioned table with 192 partitions, that's 192 separate segment scans. That is materially slower than scanning one unpartitioned table with a good index because of the per-partition coordination overhead. Analytics workloads that routinely scan the entire dataset are not good candidates for row-level partitioning — columnar storage or pre-aggregated materialized views are better tools for that access pattern.

composite_partitioning.sqlSQL

-- ============================================================
-- COMPOSITE: range (monthly) + hash (4 buckets per month)
-- Prevents hot-partition problem when traffic grows over time
-- Total partitions: months × hash_modulus (e.g., 48 × 4 = 192)
-- ============================================================
CREATE TABLE event_stream (
    event_id    BIGINT NOT NULL,
    user_id     INT NOT NULL,
    event_type  VARCHAR(50),
    event_time  TIMESTAMP NOT NULL,
    payload     JSONB
) PARTITION BY RANGE (event_time);

-- Monthly range partition, itself partitioned by hash on user_id
CREATE TABLE event_stream_2026_01 PARTITION OF event_stream
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01')
    PARTITION BY HASH (user_id);

CREATE TABLE event_stream_2026_01_p0 PARTITION OF event_stream_2026_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE event_stream_2026_01_p1 PARTITION OF event_stream_2026_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE event_stream_2026_01_p2 PARTITION OF event_stream_2026_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE event_stream_2026_01_p3 PARTITION OF event_stream_2026_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- ============================================================
-- PRUNING BEHAVIOR: which queries prune and at which level
-- Understanding this is essential for query design on composite tables
-- ============================================================

-- FULL PRUNING: both range key and hash key in WHERE
-- Planner prunes level 1 (month) then level 2 (hash bucket)
-- Touches exactly 1 leaf partition
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM event_stream
WHERE event_time >= '2026-01-01' AND event_time < '2026-02-01'
  AND user_id = 12345;
-- Partitions scanned: 1 — the hash bucket for user 12345 within January

-- PARTIAL PRUNING: range key only
-- Level 1 prunes to one month; level 2 scans all 4 hash buckets
-- Still useful — touches 4 partitions instead of 192
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM event_stream
WHERE event_time >= '2026-01-01' AND event_time < '2026-02-01';
-- Partitions scanned: 4 (all hash sub-partitions of 2026_01)

-- NO PRUNING: hash key only, no range key
-- Level 1 cannot prune — hash key alone doesn't establish a time boundary
-- Scans every month's hash bucket for the matching remainder
-- Often slower than the unpartitioned equivalent due to per-partition overhead
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM event_stream
WHERE user_id = 12345;
-- Partitions scanned: all leaf partitions across all months

-- ============================================================
-- COMPOSITE PARTITION HEALTH CHECK
-- Verify all sub-partitions exist for each range parent
-- Missing sub-partitions silently drop inserts to the DEFAULT bucket
-- ============================================================
SELECT
    parent.relname  AS range_partition,
    child.relname   AS hash_sub_partition,
    pg_size_pretty(pg_total_relation_size(child.oid)) AS size
FROM pg_inherits i1
JOIN pg_class parent ON parent.oid = i1.inhrelid
JOIN pg_inherits i2  ON i2.inhparent = parent.oid
JOIN pg_class child  ON child.oid = i2.inhrelid
JOIN pg_class root   ON root.oid = i1.inhparent
WHERE root.relname = 'event_stream'
ORDER BY parent.relname, child.relname;

Output

-- Full pruning (both keys): 1 leaf partition scanned — sub-millisecond lookup

-- Partial pruning (range key only): 4 leaf partitions — still fast, month is isolated

-- No pruning (hash key only): all leaf partitions — worse than unpartitioned on small result sets

-- Health check: verifies all sub-partitions are created for each range parent

Composite Pruning Works Top-Down

Level 1 (range key): WHERE event_time >= X AND event_time < Y prunes to the matching monthly range partition
Level 2 (hash key): WHERE user_id = Z within a pruned range partition prunes to one hash bucket
Both keys in WHERE: full pruning — exactly 1 leaf partition scanned regardless of total partition count
Only range key in WHERE: partial pruning — all hash sub-partitions of the matching month are scanned
Only hash key in WHERE: no level-1 pruning — all months must be evaluated, then hash prunes within each month

Production Insight

A team needed 8 years of daily granularity for compliance reporting, so they created daily range partitions — 2920 partitions. Point-lookup queries that executed in 50ms were taking 800ms to plan because the planner evaluated all 2920 boundaries at plan time. The execution time was fine; the planning time dominated. The fix was restructuring as monthly range partitions with 8 hash sub-partitions each (96 × 8 = 768 total partitions, just under the 500-per-level threshold that starts causing problems). Planning time dropped to under 5ms. The daily granularity they needed for retention was implemented at the application layer, not the partition layer.

Key Takeaway

Composite partitioning solves the hot-partition problem but increases total leaf partition count — watch the 500-partition planning threshold. Pruning works top-down: you only reach one leaf partition when both the range key and the hash key appear as bare columns in WHERE. Design your composite scheme around your dominant query pattern, not around making the partition tree look balanced.

Composite vs Single-Level Partitioning

IfTable is time-partitioned and write throughput has grown significantly — recent partitions are visibly larger and hotter than old ones

→

UseAdd hash sub-partitions to distribute writes within each time window and prevent autovacuum hotspots

IfQueries almost always filter by date but rarely or never by the hash key column

→

UseSingle-level range partitioning is sufficient — hash sub-partitions add partition count and operational complexity without meaningful pruning benefit for your query pattern

IfTotal partition count is approaching or exceeding 500

→

UseConsolidate into larger range windows (monthly instead of weekly, quarterly instead of monthly) or switch to composite to keep each level under the planning overhead threshold

IfAnalytics queries routinely scan the entire dataset regardless of filters

→

UseRow-level partitioning is not the right tool — consider columnar storage extensions or pre-aggregated materialized views for full-scan analytics workloads

IfNeed both time-based DROP PARTITION for retention and even write distribution across the current window

→

UseComposite (range + hash) with a small hash modulus (4–8) — range enables clean date-based retention, hash prevents write hotspots on the current partition

Querying Partition Metadata — Stop Guessing, Start Verifying

You think your table is partitioned? The optimizer might disagree. After a 'quick' ALTER TABLE someone forgot to check, you'll be scanning every partition like it's 1999. SQL Server exposes partition metadata through system views — use them before a production fire drill.

The three views you need: sys.partitions, sys.partition_schemes, and sys.partition_functions. They tell you exactly which partitions exist, their boundary values, and how data is mapped. No guesswork.

Start with sys.partitions to verify partition counts. Then check sys.partition_range_values for boundary clarity. If a partition has zero rows, that's a red flag — maybe your maintenance script dropped something it shouldn't have. Always query metadata before and after schema changes. It's the difference between a calm 'yes' in a meeting and a panicked rollback.

CheckPartitionMetadata.sqlSQL

// io.thecodeforge — database tutorial

-- Is your SalesOrderHeader actually partitioned?
SELECT 
    OBJECT_NAME(p.object_id) AS TableName,
    p.partition_number,
    p.rows,
    prv.boundary_id,
    prv.value AS boundary_value
FROM sys.partitions p
LEFT JOIN sys.partition_range_values prv
    ON prv.function_id = (
        SELECT function_id 
        FROM sys.partition_schemes ps
        WHERE ps.data_space_id = p.data_space_id
    )
    AND prv.boundary_id = p.partition_number - 1
WHERE OBJECT_NAME(p.object_id) = 'SalesOrderHeader'
ORDER BY p.partition_number;

Output

TableName partition_number rows boundary_id boundary_value

SalesOrderHeader 1 1250 NULL NULL

SalesOrderHeader 2 3400 1 2023-01-01

SalesOrderHeader 3 4200 2 2024-01-01

SalesOrderHeader 4 0 3 2024-06-01

Production Trap:

A partition with zero rows can mean empty range OR dropped partition. Always correlate with boundary values before assuming it's safe. I've seen rollbacks triggered by an empty partition that was actually correct.

Key Takeaway

Query sys.partitions, sys.partition_schemes, and sys.partition_range_values after any partition operation. Trust nothing, verify everything.

Limitations That Will Bite You — Partitioning Isn't a Silver Bullet

Partitioning is a scalpel, not a chainsaw. Most devs learn this after their first failed ALTER TABLE SWITCH or a query that went parallel-psycho. Here's the reality: not all operations play nicely with partitions.

First, you cannot use ALTER TABLE SWITCH to move data between partitions with different columns, collations, or computed columns unless they're persisted. The metadata checks are strict — one mismatch and you're stuck with a manual data move.

Second, only SQL Server Enterprise Edition supports full partition-level operations like TRUNCATE on a single partition. Standard Edition treats partitioned tables as regular tables for many operations — you'll lose the performance benefit.

Third, IDENTITY columns don't reset per partition. If you truncate partition 2, the identity seed won't reset unless the entire table is truncated. That catches teams doing rolling window maintenance.

Fourth, indexes on partitioned tables have their own rules. Non-aligned indexes can kill partition pruning and make queries cross partitions unnecessarily.

PartitionLimitations.sqlSQL

// io.thecodeforge — database tutorial

-- This SWITCH will fail silently if indexes don't align
-- Stage table must have same schema, same clustered index
CREATE TABLE SalesOrder_Stage (
    OrderDate datetime,
    OrderID int,
    Amount decimal(10,2)
) ON [PRIMARY];

-- SWITCH to partition 3
ALTER TABLE SalesOrderHeader
SWITCH PARTITION 3 TO SalesOrder_Stage;
-- Fails: 'The index ... is not aligned with the switch partition'

-- Workaround: ensure stage table has identical partitioned index
-- Or use a non-partitioned swap table with no indexes
-- Then rebuild indexes on target

Output

Msg 4909, Level 16, State 1, Line 10

Cannot alter table 'SalesOrder_Stage' because the table is not partitioned and is not in the same filegroup as partition 3 of table 'SalesOrderHeader'.

Senior Shortcut:

For rolling window scenarios, always validate ALTER TABLE SWITCH with a dry run on a test replica. Use $PARTITION function to confirm row counts match between source and stage before the switch statement executes.

Key Takeaway

Know your edition's partition support. Non-aligned indexes and strict SWITCH requirements will ruin your day. Test every partition operation in a lower environment with production-sized data.

Partition Permissions — Who Can Alter the Schema Without Breaking Queries

Partitioning adds schema-level objects that demand specific permissions beyond basic DML. The partition function and scheme are owned by the schema owner, not the table owner. Granting ALTER on a table does not automatically grant rights to modify partitions. To split or merge a partition boundary, a user needs ALTER ANY DATASPACE or ownership of the partition function. Without it, ADD PARTITION or SWITCH PARTITION fails with permission errors. Even viewing partition metadata via sys.partition_functions requires VIEW DEFINITION. The silent failure occurs when a DBA adds a new filegroup but forgets to grant the partition function owner execute rights on the filegroup’s database. Production trap: granting db_ddladmin to an application service account exposes the entire partition structure to accidental boundary shifts. Lock down partition function changes to a separate role and audit their usage.

PartitionPermissions.sqlSQL

// io.thecodeforge — database tutorial

-- Grant rights to view partition scheme metadata
GRANT VIEW DEFINITION ON PARTITION SCHEME: MonthlyScheme TO ReportingUser;

-- Grant ability to alter partition structure
GRANT ALTER ANY DATASPACE TO PartitionAdmin;

-- Error: missing permission causes silent failure
-- User must own partition function or have ALTER ANY DATASPACE
ALTER PARTITION FUNCTION MonthlyPF()
    SPLIT RANGE ('2025-04-01');

-- Verify effective permissions
SELECT dp.name, p.permission_name
FROM sys.database_permissions p
JOIN sys.database_principals dp ON p.grantee_principal_id = dp.principal_id
WHERE p.major_id = OBJECT_ID('PartitionFunction: MonthlyPF');

Output

name | permission_name

PartitionAdmin | ALTER ANY DATASPACE

ReportingUser | VIEW DEFINITION

Production Trap:

Granting db_ddladmin to an app service account gives implicit rights to modify partition functions, risking accidental boundary shifts during routine operations.

Key Takeaway

Partition permissions are separate from table permissions; lock partition function changes behind a dedicated role.

New Filegroups for Partitions — When and Why to Create Them

Partitioning across a single filegroup negates I/O isolation, the primary performance benefit. When each partition resides on its own filegroup, you can place filegroups on separate physical drives or storage tiers. Create filegroups before the partition scheme references them. Use ALTER DATABASE ADD FILEGROUP, then add files. A common mistake is creating one filegroup for the entire table instead of per partition range. Hot partitions — like current month data — benefit from fast SSD filegroups; cold partitions can live on slower HDD filegroups. When you split a partition boundary, the new partition inherits the default filegroup unless you specify NEXT USED. Production trap: adding a filegroup after the scheme is built requires dropping and recreating the scheme, which forces a full data movement. Plan all filegroups upfront. For historical data, rotate filegroups quarterly and set them to READ_ONLY after the partition is no longer written to.

PartitionFilegroupSetup.sqlSQL

// io.thecodeforge — database tutorial

-- Create filegroups for each partition range
ALTER DATABASE SalesDB ADD FILEGROUP FG_Month2025_01;
ALTER DATABASE SalesDB ADD FILEGROUP FG_Month2025_02;

-- Add physical files to each filegroup
ALTER DATABASE SalesDB ADD FILE (
    NAME = 'Data_2025_01',
    FILENAME = 'D:\SSD\Data_2025_01.ndf'
) TO FILEGROUP FG_Month2025_01;

ALTER DATABASE SalesDB ADD FILE (
    NAME = 'Data_2025_02',
    FILENAME = 'E:\HDD\Data_2025_02.ndf'
) TO FILEGROUP FG_Month2025_02;

-- Reference filegroups in partition scheme
CREATE PARTITION SCHEME MonthlyScheme
AS PARTITION MonthlyPF
TO (FG_Month2025_01, FG_Month2025_02);

Output

Commands completed successfully.

Filegroup 'FG_Month2025_01' added.

Filegroup 'FG_Month2025_02' added.

Production Trap:

Splitting a partition boundary after the scheme is built requires dropping the scheme and reloading data unless NEXT USED is pre-assigned.

Key Takeaway

Create filegroups per partition range upfront to enable I/O isolation and tiered storage for hot vs. cold data.

● Production incidentPOST-MORTEMseverity: high

Partition pruning silently disabled — query scanned 800M rows instead of 2M

Symptom

Dashboard queries that previously ran in 200ms started taking 45 seconds after the table was migrated to a partitioned schema. EXPLAIN showed sequential scans across all 48 partitions. CPU on the database server spiked to 100% during business hours, and the on-call rotation started getting paged for every reporting window.

Assumption

The team assumed that partitioning would automatically speed up any query that touched the partition key in its WHERE clause, regardless of how the comparison was written. The logic was reasonable on the surface: the column is there, the partition boundaries are defined, the planner should figure it out. It doesn't.

Root cause

The application query used WHERE DATE_TRUNC('month', created_at) = '2026-01-01' instead of WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01'. DATE_TRUNC is a function call applied to the column — the planner cannot compare the function's output against partition boundaries at plan time because it doesn't know what values the function will return for each row without evaluating it. So it conservatively disables pruning and scans every partition. 48 partitions × roughly 2M rows each means the query was reading 96M rows to return results from 2M. The partitioning itself was correct. The query pattern was the problem.

Fix

Rewrote all reporting queries to use direct range comparisons on the bare partition key column: WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01'. Added a lint rule to the query review checklist that rejects any query wrapping a known partition key column in a function inside a WHERE clause. Also wired EXPLAIN output into the CI pipeline for the top 50 reporting queries, so any future pruning regression surfaces in a pull request before it reaches production.

Key lesson

Partition pruning requires the partition key to appear as a bare column in WHERE — no functions, no expressions, no casting through a function
Always verify pruning with EXPLAIN immediately after deploying a partitioned schema — do not assume it works
Add automated EXPLAIN checks in CI targeting your critical queries; pruning regressions are easy to introduce and hard to notice in production until latency spikes
DATE_TRUNC, EXTRACT, CAST, and COALESCE applied to the partition key column in WHERE all disable pruning — this is not a bug, it is a fundamental constraint of how the planner evaluates partition boundaries at plan time

Production debug guideSymptom → Action mapping for common partition-related production issues5 entries

Symptom · 01

Query scans all partitions instead of pruning

→

Fix

Run EXPLAIN (ANALYZE, BUFFERS) and check for 'Partitions scanned: all' or a count equal to your total partition count. Inspect the WHERE clause for function calls on the partition key — DATE_TRUNC, EXTRACT, CAST, and COALESCE are the usual suspects. Remove the function and rewrite to a direct range comparison using bare column comparisons with >= and < operators. Re-run EXPLAIN to confirm the partition scan count drops to 1 or the expected subset.

Symptom · 02

Partition maintenance job runs for hours and blocks writes

→

Fix

Check whether you're using DELETE instead of DROP PARTITION for data retention. DELETE is a row-level operation — it generates one WAL entry per row, bloats the table until VACUUM runs, and holds row locks that slow concurrent reads. Use ALTER TABLE ... DETACH PARTITION followed by DROP TABLE instead. DETACH is a catalog metadata operation; DROP removes the physical files. The combined operation takes milliseconds and generates no WAL bloat. Schedule this in a sliding window job that runs monthly.

Symptom · 03

Insert performance degrades as partition count grows

→

Fix

Count your active partitions: SELECT COUNT(*) FROM pg_partitions WHERE tablename = 'your_table'. PostgreSQL's planner evaluates every partition boundary at plan time — at 500+ partitions, this overhead becomes measurable. If you're above that threshold, consider consolidating to larger range windows (quarterly instead of monthly) or restructuring as composite partitioning (range at level one, hash sub-partitions at level two) to reduce total partition count while maintaining distribution.

Symptom · 04

Cross-partition joins are extremely slow

→

Fix

Check whether the join key aligns with the partition key. If you're joining two tables and the join column is not the partition key on both sides, the database must probe every partition on the inner table for each outer row. The fix is co-partitioning: partition both tables on the same column with identical boundaries so the planner can perform partition-wise joins. If co-partitioning is not feasible, ensure the join column is indexed on both tables and accept the cross-partition overhead as the cost of the data model.

Symptom · 05

Partition key value falls outside all defined ranges — insert fails

→

Fix

Check whether a DEFAULT partition exists. In PostgreSQL 11+, a DEFAULT partition absorbs any row whose partition key does not match a defined range or list. Without it, inserts with out-of-range values throw a hard error and the transaction rolls back. Add a DEFAULT partition as a safety net and monitor it — if rows are accumulating there, your partition creation job is behind and you need to investigate.

★ SQL Partition Quick Debug Cheat SheetFast diagnostics for partition pruning and maintenance issues

Partition pruning not working−

Immediate action

Check EXPLAIN output for partition scan count

Commands

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01';

SELECT schemaname, tablename, n_live_tup AS row_count FROM pg_stat_user_tables WHERE tablename LIKE 'orders_%' ORDER BY tablename;

Fix now

Remove function calls on partition key in WHERE clause; use bare column with direct range operators (>= and <)

Too many partitions causing slow planning+

Partition maintenance blocking production traffic+

Partitioning Strategies Compared

Strategy	Best For	Pruning Trigger	Data Distribution	DROP for Retention	Max Practical Partitions
Range	Time-series, event logs, financial transactions	Date/timestamp range in WHERE using direct >= and < comparisons	Uneven — recent partitions grow as traffic increases, historical partitions remain static	Yes — drop by date range, instant metadata operation	~500 (monthly over roughly 40 years)
List	Categorical data — region, status, department, product line	Exact value match or IN clause on the discrete partition key	Uneven — depends entirely on value distribution in your dataset	Yes — detach and drop by category value or group	50–100 (one partition per logical category group)
Hash	Large tables with no natural range or list dimension; workloads needing even distribution for maintenance	Equality on the hash key — the planner computes hash(value) % modulus at plan time	Even — hash function distributes uniformly across all buckets	No — hash buckets have no semantic meaning; you cannot drop 'old' hash partitions	32–128 (use powers of 2 to allow future modulus doubling)
Composite (Range + Hash)	Time-series tables where traffic growth creates hot partitions on the current range window	Range key prunes at level 1 (month); hash key prunes at level 2 (bucket within month)	Even within each range window — hash distributes writes across sub-partitions	Yes — drop the entire range partition (including all its hash sub-partitions) by date	~500 total (e.g., 48 months × 4 hash = 192; 48 × 8 = 384)
None (Unpartitioned)	Tables under 10M rows; analytics workloads that routinely scan the full dataset	N/A — rely on indexes for point lookups and range scans	N/A	N/A — use time-based DELETE or archive-and-truncate patterns	N/A

Key takeaways

Partition pruning is the entire reason partitioning works

without it, you have per-partition planning overhead and zero performance benefit. Always verify with EXPLAIN immediately after deploying a partitioned schema and after any WHERE clause change on a partitioned table.

The partition key must appear as a bare column in WHERE with direct comparison operators

DATE_TRUNC, EXTRACT, CAST, and COALESCE applied to the partition key column in WHERE silently disable pruning on every query they touch.

Range for dates, list for categories, hash for even distribution, composite for both

the strategy must match your dominant query filter. Partitioning on the wrong column is worse than not partitioning at all.

DROP PARTITION (DETACH then DROP) is a millisecond metadata operation vs DELETE which runs for hours, generates massive WAL, and triggers autovacuum for days. This is the operational payoff that justifies partitioning's complexity.

Keep total leaf partition count under 500

each partition adds to the planner's boundary evaluation work at plan time. Use composite (range + hash) to achieve fine-grained distribution without creating thousands of tiny range partitions.

Automate partition management

create future partitions 3–6 months ahead using a monthly scheduled job with idempotent CREATE TABLE IF NOT EXISTS; drop expired partitions in the same job via the sliding window pattern. Monitor the DEFAULT partition — rows accumulating there mean your creation job is behind.

Common mistakes to avoid

5 patterns

Partitioning on a column not used in WHERE clauses

Symptom

Partitioning adds per-partition planning overhead but provides zero pruning benefit. Every query scans all partitions. Query latency increases as partition count grows instead of staying constant. Performance is measurably worse than the original unpartitioned table for queries that don't filter on the partition key.

Fix

Before partitioning, run EXPLAIN on your top 20 queries and identify which column appears in WHERE on at least 80% of them. That column is your partition key candidate. Partition on the column your queries actually filter on — typically created_at for time-series workloads or region for categorical data. If no single column dominates your WHERE clauses, partitioning is probably not the right tool for your access pattern.

Wrapping the partition key in a function in WHERE

Symptom

Partition pruning silently fails. EXPLAIN shows 'Partitions scanned: all' even though the WHERE clause references the partition key. Query latency increases linearly with partition count instead of staying constant. The query returns correct results, which makes this failure mode particularly hard to notice without inspecting EXPLAIN output.

Fix

Use bare column comparisons with direct range operators: WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01'. Never apply DATE_TRUNC, EXTRACT, CAST, COALESCE, or any other function to the partition key column inside a WHERE clause. Add a lint rule or a pre-commit hook that flags any query wrapping a known partition key column in a function. Wire EXPLAIN output into CI for your critical reporting queries so pruning regressions surface in pull requests.

Using DELETE instead of DROP PARTITION for data retention

Symptom

The retention job runs for hours, generates massive WAL output (often hundreds of gigabytes), bloats the table with dead tuples, and triggers aggressive autovacuum that competes with production reads and writes. Concurrent insert latency spikes during the DELETE window because row-level locks are held throughout.

Fix

Replace DELETE-based retention with ALTER TABLE ... DETACH PARTITION followed by DROP TABLE. DETACH removes the partition from the parent table's catalog as a fast metadata update. DROP removes the physical files. The combined operation completes in milliseconds with zero WAL generation and no row-level locks. Automate this with a monthly sliding window job that creates the next period's partition and drops the expired one in the same run.

Creating too many fine-grained partitions (thousands of daily or hourly partitions)

Symptom

Query planning time exceeds query execution time. EXPLAIN ANALYZE shows the planner taking 500–800ms to produce a plan for a query that executes in 30–50ms. INSERT performance also degrades as the catalog grows because each insert requires the planner to evaluate partition boundaries to route the row correctly.

Fix

Keep total partition count under 500 leaf partitions. Use monthly or quarterly range windows instead of daily or hourly. If you need finer granularity within a month for distribution purposes, use composite partitioning (range + hash sub-partitions) — 48 monthly partitions with 4 hash sub-partitions each gives you 192 leaf partitions with both time-based pruning and even write distribution.

Not creating future partitions before data arrives

Symptom

INSERT fails with 'no partition of relation found for row' at the start of a new month. Application returns 500 errors until a DBA manually creates the missing partition. If the application does not retry on failure, rows written during the gap are permanently lost.

Fix

Automate partition creation with a scheduled job that creates partitions 3–6 months ahead using CREATE TABLE IF NOT EXISTS (making the job idempotent and safe to run repeatedly). Always include a DEFAULT partition as a safety net to absorb unexpected values instead of throwing a hard error. Monitor the DEFAULT partition's row count weekly — rows accumulating there signal that your creation job has fallen behind or your partition key is receiving values outside the expected range.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

What is partition pruning and how would you verify it's working in a pro...

Q02SENIOR

You have a table with 800 million rows partitioned by month. How would y...

Q03SENIOR

When would you choose hash partitioning over range partitioning, and wha...

Q04SENIOR

Explain the difference between partitioning and sharding. When would you...

Q05SENIOR

A query on a partitioned table is scanning all 48 partitions even though...

Q06SENIOR

How does composite partitioning (range + hash) solve the hot-partition p...

Q01 of 06SENIOR

What is partition pruning and how would you verify it's working in a production query?

ANSWER

Partition pruning is the process by which the query planner eliminates entire physical partitions from the scan plan at plan time, before any rows are read, because it can determine from the WHERE clause that those partitions cannot contain matching rows. The planner compares your WHERE clause bounds against the partition boundary definitions stored in the catalog — if the ranges don't overlap, that partition is skipped entirely with zero I/O. To verify pruning: run EXPLAIN (ANALYZE, BUFFERS) on the query and look for 'Partitions scanned: N' in the output. If N is 1 or a small subset of your total partition count, pruning is working. If N equals your total partition count, pruning has failed. Common failure causes in order of frequency: function calls on the partition key in WHERE (DATE_TRUNC, EXTRACT, CAST, COALESCE), implicit type casting between the partition key column type and a literal value, OR conditions that include a column other than the partition key, and predicates buried inside CTEs that the planner cannot push down through the CTE boundary. The fix is always the same: rewrite the WHERE clause to use the partition key as a bare column with direct range comparison operators.

FAQ · 5 QUESTIONS

Frequently Asked Questions

Does partitioning replace the need for indexes?

Can I partition an existing table without downtime?

How do I choose between monthly and quarterly partitions?

What happens if I insert a row whose partition key value doesn't match any partition?

How does partitioning interact with foreign keys and joins?

Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Notes here come from systems that actually shipped.

✓ Verified

production tested

May 23, 2026

last updated

1,554

articles · all by Naren

🔥

That's SQL Advanced. Mark it forged?

11 min read · try the examples if you haven't