Checkpoint in DBMS — Why 60s Timeout Killed Production I/O
A 60-second checkpoint_timeout spiked disk to 100% and latency from 5ms to 5s.
- Checkpoint writes dirty pages from buffer pool to disk and updates WAL with a redo start point
- Reduces crash recovery time from hours to seconds by avoiding full WAL replay
- Sharp checkpoints write all dirty pages synchronously (slow but clean); fuzzy checkpoints spread writes over time (faster but complex)
- In PostgreSQL, checkpoint_timeout and checkpoint_completion_target control frequency and I/O impact
- Production gotcha: too frequent checkpoints cause I/O storms; too infrequent cause long recovery
Imagine you're doing a 500-piece jigsaw puzzle and you stop every hour to take a photo of your progress. If the dog jumps on the table and scatters everything, you restart from the photo — not from the empty table. A database checkpoint is exactly that photo: a confirmed 'safe point' on disk so that after a crash, the database only has to redo work done after the last photo, not replay every single move since it was first installed.
Every production database — Postgres, MySQL InnoDB, Oracle, SQL Server — will crash at some point. Power cuts happen. Kernel panics happen. Someone trips over the wrong cable. The question isn't if your database will die mid-transaction; it's how fast it can get back on its feet with zero data loss. That answer lives almost entirely in one mechanism: the checkpoint.
Without checkpoints, crash recovery means replaying every single log record ever written — potentially years of transactions — before the database can accept a single query. That would make restarts take hours or days. Checkpoints solve this by periodically writing all dirty pages from memory to disk and recording a 'you can start recovery from here' marker in the write-ahead log. That single act turns a potential multi-hour recovery into seconds.
By the end of this article you'll understand exactly what happens inside a checkpoint cycle, the difference between sharp and fuzzy checkpoints and why fuzzy ones exist, how the WAL and buffer pool interact during checkpointing, what the performance trade-offs look like in PostgreSQL's real configuration knobs, and the production gotchas that bite engineers who treat checkpoints as a background detail.
What Is a Checkpoint and Why Does It Exist?
A checkpoint in a DBMS is a synchronization point where the buffer pool's dirty pages (modified but not yet on disk) are flushed to the data files, and the WAL is updated to mark that all changes up to that point are safely stored. This drastically shortens crash recovery: instead of replaying the entire WAL from the beginning, the database can start from the most recent checkpoint. The checkpoint record in the WAL contains the LSN (Log Sequence Number) that acts as the new recovery start point.
Checkpoints are not optional – every ACID-compliant database implements them. Without checkpoints, recovery time would grow linearly with the age of the database. A ten-year-old database would need to replay ten years of transactions after each restart. That simply doesn't work at scale.
- Dirty pages = unsaved progress; checkpoint = save to disk.
- WAL = a journal of every action; checkpoint = a bookmark in the journal saying 'everything before this is safe'.
- Recovery scans the WAL from the checkpoint forward, not from the beginning — that's the speed gain.
- Without checkpoints, recovery time equals database age. With them, it's bounded by the checkpoint interval.
WAL and Checkpoint Interaction: The Recovery Handshake
The Write-Ahead Log (WAL) records every change before it's applied to data files. A checkpoint ensures that all WAL records up to a certain LSN have been fully applied to the data files on disk. After checkpoint, the WAL can be truncated up to that LSN. During recovery, the database reads the last checkpoint location from the WAL and replays all subsequent records. This is called the redo phase. Some databases also have an undo phase to roll back incomplete transactions.
The key insight: the checkpoint record itself is written after all dirty pages are safely on disk. If the checkpoint record is missing, recovery falls back to the previous checkpoint. That's why checkpoint writes are synchronous and include a WAL flush. The checkpoint writes use direct I/O to ensure persistence.
Sharp vs Fuzzy Checkpoints: The Performance Trade-off
Sharp checkpoints flush all dirty pages at one moment. This is simple but causes a massive I/O spike. Fuzzy checkpoints (used by PostgreSQL, MySQL InnoDB) spread the flushing over a configurable window. They start writing dirty pages early in the interval and try to complete before the next checkpoint. The WAL checkpoint record marks the boundary, but data writes are asynchronous.
In PostgreSQL, checkpoint_completion_target controls how much of the interval is used for flushing. A value of 0.9 means 90% of checkpoint_timeout is spent writing dirty pages, with the final 10% reserved for the sync. This avoids a sudden burst but requires a steady I/O capacity.
Sharp checkpoints still exist in some systems (e.g., SQL Server simple recovery model), but modern systems prefer fuzzy for stability.
- Sharp = sprint: all writes at once, I/O spike, fast but painful.
- Fuzzy = steady jog: writes spread over minutes, smooth I/O, easier on disk.
- Completion target = your pace plan: how much of the interval you use.
- Database replay doesn't care about fuzziness — it only looks at the LSN boundary.
Checkpoint Tuning in PostgreSQL: Real Knobs and Constraints
PostgreSQL provides several knobs to control checkpoint behaviour. The most important are:
- checkpoint_timeout: Maximum time between automatic checkpoints (default 5 minutes).
- max_wal_size: Target for total WAL size. When exceeded, a checkpoint is forced.
- min_wal_size: Minimum WAL size to keep for recycling.
- checkpoint_completion_target: Fraction of checkpoint_timeout spent flushing dirty pages.
- checkpoint_flush_after: Number of pages after which to flush writes to OS.
Understanding the interaction: checkpoints can be triggered by time (timed checkpoints) or by WAL size (requested checkpoints). The pg_stat_bgwriter view shows how many of each occurred. A high ratio of requested to timed means the WAL size is often hitting max_wal_size — that can signal write bursts or insufficient max_wal_size.
Common Checkpoint Mistakes and How to Fix Them
Even experienced DBAs misconfigure checkpoints. The most common issues:
- Using default settings in production: Default checkpoint_timeout=5min may be fine for dev but causes I/O spikes in heavy-load production. Always baseline I/O and adjust.
- Ignoring full_page_writes: PostgreSQL writes full page images to WAL during first modification after a checkpoint. Disabling this saves space but risks page corruption on crash. Only disable on replicas with data checksums.
- Setting checkpoint_completion_target >0.95: Leaves no buffer for OS I/O delays. A checkpoint that doesn't complete before the next timeout triggers a forced sync, causing a sharp I/O spike.
- Not monitoring buffers_backend: If backend processes are writing dirty pages themselves, the checkpointer is overwhelmed. Increase checkpoint_timeout or add I/O capacity.
PostgreSQL Checkpoint Storm Wiped Out Production DB Performance
- Monitor checkpoint-related metrics in pg_stat_bgwriter (buffers_checkpoint, checkpoint_sync_time).
- Never use default checkpoint_timeout in production without baseline I/O capacity testing.
- Fuzzy checkpoints are a lifesaver – they smooth out I/O load but require careful tuning of checkpoint_completion_target.
Key takeaways
Common mistakes to avoid
5 patternsUsing default checkpoint_timeout in high-write production
Setting checkpoint_completion_target too high (>0.95)
Disabling full_page_writes to save WAL space
Forgetting to monitor pg_stat_bgwriter
Checking only total time, not the distribution
Interview Questions on This Topic
What is the purpose of a checkpoint in a DBMS?
Frequently Asked Questions
That's DBMS. Mark it forged?
4 min read · try the examples if you haven't