MySQL & PostgreSQL Backup Failures — Why Exit Code 0 Lies
A 2TB mysqldump ran 21 days before bit rot made it unrestorable.
20+ years shipping high-throughput database systems. Lessons pulled from things that broke in production.
- Logical backups (mysqldump, pg_dump) produce SQL text files — portable, cross-version, but slow for large datasets
- Physical backups (XtraBackup, pg_basebackup) copy raw data files — fast, full/incremental, but tied to storage engine
- Production backups must include binary logs (MySQL) or WAL archives (PostgreSQL) for point-in-time recovery
- Automated scripts belong in cron with healthcheck — silent backup failures are the most dangerous
- The biggest mistake: never testing a restore — a backup you haven't restored is a wish
Imagine you spend three months building the world's greatest LEGO castle. A backup is a photo of that castle taken every day — so if your little sibling kicks it over, you can rebuild from yesterday's photo instead of starting from scratch. A restore is the act of rebuilding from that photo. Database backup and restore is exactly that: saving a copy of your data at a point in time, and bringing it back when something goes wrong.
Every developer eventually faces The Moment — a production database gets corrupted, a junior engineer runs DELETE without a WHERE clause, or a cloud disk silently fails. The difference between a bad afternoon and a company-ending catastrophe is whether you had a solid backup strategy before it happened. Backup and restore isn't a nice-to-have; it's the insurance policy your entire application depends on.
The problem is that most tutorials show you a command and call it a day. They don't explain that there are fundamentally different types of backups — logical vs. physical — and that choosing the wrong one for your situation can mean either a 6-hour restore window when you need 20 minutes, or a backup file that's completely unusable. They don't explain why pg_dump and mysqldump exist as separate tools with different philosophies, or when you'd reach for something else entirely.
By the end of this article you'll be able to: create logical and physical backups for both MySQL and PostgreSQL, write an automated backup script that you can drop into a cron job today, restore from a backup under pressure, and know exactly which approach to use for which scenario. Let's build that safety net.
Why Database Backup Restore Is a Recovery Protocol, Not a Copy Job
Database backup restore is the process of reconstructing a database from a previously captured snapshot or transaction log sequence. The core mechanic is not merely copying files back into place — it's replaying a point-in-time state through a recovery engine that validates checksums, applies WAL segments (PostgreSQL) or binary logs (MySQL), and rolls forward or backward to a consistent LSN or transaction ID. A restore that exits with code 0 can still produce a logically corrupt database if the backup itself was taken without a consistent read lock or if partial writes were captured during the snapshot.
In practice, restore correctness depends on three properties: atomicity of the backup point, replayability of the transaction log, and verification of page-level integrity. MySQL's mysqldump with --single-transaction gives a consistent snapshot for InnoDB but misses ongoing DDL; XtraBackup captures physical pages but requires redo log apply. PostgreSQL's pg_basebackup with --wal-method=stream ensures a consistent cluster state, but a missing WAL segment makes the restore fail silently — exit code 0, data missing. The restore tool must validate that the backup covers a complete log sequence number range, or you get a running database with silent data loss.
You use backup restore when a primary fails, a schema migration corrupts data, or a logical error (e.g., DELETE without WHERE) needs point-in-time recovery. In production, the restore is the only test that matters — if you haven't restored a backup in the last 30 days, you don't have a backup. Teams that rely on exit code 0 from pg_restore or mysql without verifying row counts or checksums discover this the hard way during an incident. The restore must be automated, monitored, and periodically validated against a staging environment with production-scale data.
pg_start_backup() completing, so the WAL segment was missing. The symptom: queries returned stale data for 30 minutes before the database crashed with a missing WAL error. Rule of thumb: always run pg_verify_checksums and a SELECT COUNT(*) on a critical table immediately after restore.Logical vs Physical Backups — The First Decision
Before you write a single backup script, you need to decide which type of backup fits your workload. Logical backups produce SQL statements or delimited text. Physical backups copy the raw database files.
Logical backups (mysqldump, pg_dump) are the Swiss Army knife of backups. They produce portable archives that you can restore on a different MySQL/PostgreSQL version, different architecture, or even different database engine with minor tweaks. They're ideal for small databases (<50 GB), schema-only backups, or when you need to migrate between environments. The cost: they are slow. A 500 GB MySQL dump can take hours, and restore takes even longer because every row must go through the SQL parser and re-index.
Physical backups (XtraBackup, pg_basebackup) copy raw data files at the filesystem level. They're fast — a 2 TB database can be backed up in under 30 minutes. They support incremental backups and have minimal database impact during operation. The trade-off: they're tightly coupled to the database version, storage engine, and operating system. You cannot restore an InnoDB physical backup into MyISAM, or a PostgreSQL 16 physical backup into version 15 without pg_upgrade.
The reality: in production, you need both. Logical backups for granular restore (single table, specific rows), and physical backups for full database recovery speed. You also need continuous archiving of transaction logs (binary logs for MySQL, WAL for PostgreSQL) for point-in-time recovery — no backup is complete without them.
- Logical = human-readable, portable, slow, good for small databases and migration
- Physical = machine-optimised, fast, tied to version/storage engine, good for large databases
- Both need transaction log archiving for point-in-time recovery — don't skip this
mysqldump: Options That Matter in Production
mysqldump is ubiquitous, but most engineers use it with defaults that are dangerous in production. The default mode locks tables with a global read lock, blocking writes for the entire dump duration. On a busy database, that's a full write outage.
Critical options: - --single-transaction: For InnoDB tables, starts a transaction using MVCC to get a consistent snapshot without locking writes. This is the most important flag for production MySQL. Do not dump a live InnoDB database without it. - --routines --events --triggers: By default, mysqldump does NOT include stored procedures, events, or triggers unless you explicitly add these flags. Missing them means your restore will lose application logic. - --master-data=2: Records the binary log file and position at dump time, essential for replication setup or point-in-time recovery. Also enables --flush-logs implicitly. - --compress: Compresses the output on the fly, saving disk space and network bandwidth without needing a separate gzip step. - --opt: A shortcut for several optimizations (disable foreign key checks, extended inserts) — but it's enabled by default. Good for restore speed, but the extended inserts can cause memory issues on restricted test instances.
Performance consideration: A dump of a 200 GB InnoDB database will take about 30–60 minutes on modern hardware. During that time, the database buffer pool might be flushed by the full table scan. Monitor SHOW ENGINE INNODB STATUS for pending disk reads. For databases over 500 GB, use mydumper (parallel logical dump) or switch to physical backup.
SELECT ENGINE, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA NOT IN ('sys','performance_schema','information_schema') AND ENGINE != 'InnoDB';pg_dump: What Seniors Know About PostgreSQL Backups
PostgreSQL's logical backup tool pg_dump is more flexible than mysqldump but has its own sharp edges. The key difference: pg_dump runs in a READ COMMITTED isolation level by default, not REPEATABLE READ. That means if another transaction modifies data during the dump, you may get an inconsistent snapshot.
Critical flags: - -j N (parallel jobs): Speeds up dump by using N concurrent workers, each dumping a separate table. The default is 1. For databases with large tables, use -j 4 to cut dump time by 60–70%. But beware: parallel dumps put more load on the server. - --schema-only / --data-only: Useful when you need to preserve structure separately from data, e.g., for change management. - --format=directory: Outputs each table as a separate file inside a directory. This enables parallel restore with pg_restore -j. It also compresses each file individually, so you can restore single tables. - --no-owner / --no-privileges: Essential when restoring on a different server where the PostgreSQL user may not exist. - --snapshot: Use a previously exported snapshot ID for consistent backup across multiple dumps (e.g., schema + data).
Point-in-time recovery prerequisite: pg_dump alone cannot give you PITR. You must combine it with continuous WAL archiving via archive_mode = on and archive_command. The base backup (e.g., pg_basebackup) plus WAL segments allows restoring to any second.
- pg_dump is fine for databases up to 100 GB, beyond that use pg_basebackup
- Parallel jobs help but increase server load — monitor CPU and I/O
- For PITR, pg_dump is not enough; you need WAL archiving configured before the backup
-Fc (custom format) — good for small databases, single-file restore-Fd (directory format) — best for production, enables pg_restore -jAutomating Backups: The Cron Script That Won't Fail Silently
Most backup strategies fail not because of bad tools, but because the automation is brittle. A cron job that silently stops running, or a script that returns exit code 0 even when the backup file is empty, is worse than no backup — it gives false confidence.
Build a robust automated backup pipeline: 1. Pre-checks: Before starting a backup, verify disk space (df -h) is above threshold, database is reachable (mysqladmin ping or pg_isready), and no other backup is running (lock file). 2. Backup execution: Use the commands from earlier sections with proper flags. Redirect stdout/stderr to a log file. 3. Post-check: Immediately verify the backup by attempting a partial restore on a temporary database. For logical dumps, run the first 1000 lines through the parser. For physical backups, check the last modified timestamp and file size. 4. Rotation & retention: Keep daily backups for 7 days, weekly for 4 weeks, monthly for 12 months. Delete old backups automatically, but keep the latest full backup in a separate location. 5. Alerting: If the backup fails or the verification step fails, send an alert to PagerDuty, email, or Slack. Do not rely on cron's built-in mail — it's often not configured.
The cron schedule: For most workloads, run a full backup daily during off-peak hours (e.g., 2 AM). For MySQL with high transaction volume, stream binary logs every 5 minutes to a remote server. For PostgreSQL, configure archive_timeout = 60 in postgresql.conf to archive WAL segments at least every minute.
Pro tip: Use a tool like autobackup.py or a simple bash script with error handling. Avoid complex frameworks unless you have a dedicated team to maintain them.
pg_restore -l to list contents.Restore Under Pressure: The Procedure That Saves Your Career
Restores happen under pressure. The system is down, stakeholders are watching, and your hands are shaking. A well-documented restore procedure is your lifeline. The number one rule: never restore directly onto the production server unless you have no other choice. Always restore to a separate instance first to validate the backup.
Restore steps for logical backup: 1. Spin up a clean instance with the same version and configuration. 2. Transfer the backup file (if not already local). 3. For MySQL: mysql -u root -p < full_dump.sql or using a compressed file: zcat full_dump.sql.gz | mysql -u root -p. 4. For PostgreSQL: pg_restore -d new_db -j 4 -Fd backup_directory or for plain SQL: psql -d new_db -f backup.sql. 5. Verify the data: run select count(*) on critical tables, test application queries. 6. If success, point application to new database (or promote restored instance).
Restore steps for physical backup (XtraBackup): 1. Prepare the backup: xtrabackup --prepare --target-dir=/backup/xtra. 2. Copy the prepared backup to the MySQL data directory. 3. Start MySQL; check error log. 4. Apply binary logs from the dump position to the desired point in time.
Restore steps for physical backup (pg_basebackup): 1. Ensure the base backup directory is in place. 2. Edit postgresql.auto.conf or recovery.conf to set restore_command to fetch WAL from archive. 3. Create a recovery target if needed: recovery_target_time '2026-04-22 14:30:00'. 4. Start PostgreSQL; it will automatically replay WAL until the target. 5. Run SELECT to confirm it's up.pg_is_in_recovery();
Test your restores routinely. Schedule a monthly restore drill where you restore the latest backup to a test environment and run a script that validates row counts from a known checkpoint. Without testing, you're gambling.
- Document the restore procedure step by step, with exact commands
- Include environment variables, paths, and which backup file to use
- Practice once a month — rotate the person running the drill
- After each restore, write a one-paragraph retrospective
Point-in-Time Recovery: Your Undo Button for the Entire Database
Point-in-time recovery (PITR) isn't a feature. It's your career insurance. The difference between restoring a full backup from midnight and recovering to 2:47 PM — right before that DELETE without a WHERE clause ran — is a matter of minutes, not days.
Every major database supports it. MySQL binlogs, PostgreSQL WAL archives, SQL Server transaction logs. They all record every write operation in sequence. PITR replays those logs up to a specific timestamp or LSN. You don't need a separate backup for every minute. You need one full backup plus all the logs generated after it.
The gotcha: you must have the logs. No logs, no PITR. That means log archiving must be configured before the incident. Not after. If your binlog retention is 24 hours and someone drops a table on Friday afternoon, your Saturday morning restore will miss that window. Set retention to cover your worst-case recovery scenario, not your average Tuesday.
Senior move: test PITR quarterly. Restore a copy of production to a staging server and roll forward to five minutes before a known event. If it takes longer than your RTO allows, you either need faster storage or a smaller full backup frequency. Either way, you find out before the CEO asks why the dashboard is showing yesterday's numbers.
Restore to a Different Server: Why You Never Overwrite Production
Restoring over the original database is the cardinal sin of database recovery. It turns a single-point failure into a total loss. If the restore fails halfway, you've nuked the source and the target. Now you have no database at all.
The rule: always restore to a new server or database instance. Every major database supports this. PostgreSQL has createdb from template. MySQL lets you restore into a new database name. SQL Server has the WITH MOVE option. Azure SQL forces it — you can't overwrite an existing database. That's not a limitation, it's a safety feature.
Why this matters beyond safety: you can verify the restored data before cutting over. Run integrity checks. Compare row counts. Validate that your report queries still work. If something is broken, you don't cascade the corruption back into production. You fix the restore process, drop the bad copy, and try again.
Senior shortcut: script the rename-and-switch. Restore to 'payments_restored', run checks, then rename 'payments' to 'payments_old' and 'payments_restored' to 'payments'. Atomic. Reversible. If the checks fail, you just delete the restores copy and production stays untouched. This is the difference between a 15-minute recovery and a 3-hour firefight with your CTO on the bridge.
Compression and Encryption: Why Your Backup Pipeline Must Be a Safe, Slim Pipeline
You can dump all day, but if you're copying raw 50GB SQL files across the wire, you deserve the bandwidth bill. Compress before you transfer, encrypt before you store. Your backup script is not just a copy job—it's a data exfiltration risk if left unencrypted.
Use gzip on the fly with mysqldump: mysqldump ... | gzip > dump.sql.gz. For encryption, pipe through openssl or use GPG. PostgreSQL allows compressed dumps natively with pg_dump -Z 9. Don't skip integrity checks. Compute SHA256 checksums on the encrypted file and store them separately. Verifying checksums before restore is what separates you from the junior who restores a corrupted archive at 3 AM.
Test your decryption pipeline. If you can't decrypt your own backup in a dry run, it's not a backup—it's a paperweight.
Restore to a Staging Environment First: Your Safety Net Before Touching Production
The fastest way to get fired is restoring a backup directly onto production without verifying it first. I've seen a senior dev overwrite a live database with a week-old snapshot because he didn't check the backup timestamp. Don't be that guy.
Always restore to a staging or isolated environment first. Validate the data: row counts, recent timestamps, foreign key integrity, and any business-critical queries. Run a smoke test that exercises your application's core features against the restored data. Only after confirmation do you point production traffic at the fresh copy or migrate the validated snapshot.
Automate this validation. A cron job that restores last night's backup to a staging DB and runs a test suite every morning pays for itself the first time it catches corruption. If you can't restore and verify automatically, your backup strategy is a hope, not a plan.
The Silent Backup Blob: When 2 TB of Nothing Saves Your Database
--checksum and --single-transaction to mysqldump. 3) Implemented a post-backup verification step using mysql < backup.sql > /dev/null on a staging instance. 4) Enabled binary log retention and started shipping WAL-equivalent logs to an S3 bucket. 5) Instituted monthly full restore drills.- A backup that hasn't been restored is a wish, not a backup.
- Exit code 0 from a dump tool is necessary but not sufficient — verify the restore on a test instance.
- Always combine logical backups with continuous archiving of transaction logs for point-in-time recovery.
head -c 100 backup.sql | xxd. Use mysql -e "SELECT @@version" to verify server version compatibility. Test restore on a staging DB: mysql test_db < backup.sql and look for error line numbers.SELECT * FROM pg_stat_archiver. Use pg_dump -j 4 --format=directory for parallel dump. Monitor iotop and pg_stat_activity for blocking sessions. Consider using pg_basebackup for large databases.restore_command in recovery.conf points to the correct location and the WAL segments are present. Use pg_waldump on the first segment to check timeline. Ensure --wal-method=stream was used during base backup if continuous archiving is off.SHOW BINARY LOGS; — are old logs being purged? PostgreSQL: SELECT * FROM pg_stat_replication — is a stale replication slot preventing WAL cleanup? For logical dumps, check if binary logs are included (MySQL --flush-logs).systemctl stop mysql && mv /var/lib/mysql/ibdata1 /tmp/ibdata1.corruptmysqld --innodb_force_recovery=1 &Key takeaways
Common mistakes to avoid
5 patternsUsing mysqldump without --single-transaction on a live InnoDB database
--single-transaction to the mysqldump command to use a transaction snapshot instead of locking tables.Not testing backups by performing a restore
Assuming pg_dump includes WAL information for point-in-time recovery
archive_mode = on and archive_command BEFORE the backup.Backing up only the database, ignoring configuration and scripts
my.cnf / postgresql.conf, schema-only dumps, and application-side database scripts (migrations) in your backup rotation.Relying on a single backup destination (same disk, same datacenter)
Interview Questions on This Topic
Explain the difference between a logical backup and a physical backup. When would you use each for a MySQL database in production?
Frequently Asked Questions
20+ years shipping high-throughput database systems. Lessons pulled from things that broke in production.
That's MySQL & PostgreSQL. Mark it forged?
12 min read · try the examples if you haven't