MySQL Replication Setup: Master-Replica Configuration, Internals & Production Gotchas
Every production database running at scale eventually hits the same wall: a single server can only handle so many reads before query times creep up, and a single point of failure is a liability no on-call engineer wants at 2 AM. MySQL replication is the foundational answer to both problems — it lets you distribute reads across multiple servers and gives you a warm standby that can take over within seconds if the primary crashes. It's not optional infrastructure. At any company running MySQL seriously, replication is day one.
The core problem replication solves is deceptively simple: how do you keep two or more MySQL instances in sync without locking everything or doubling write traffic? The answer lives in the binary log — a sequential, append-only journal of every state-changing event on the primary. Replicas connect, stream that log, and apply events in order. But the devil is in the details — GTID vs file-position tracking, parallel apply workers, semi-synchronous acknowledgment, replication filters, and lag under heavy write load are all areas where teams get burned in production.
By the end of this article you'll be able to configure a production-grade primary/replica pair from scratch, understand the exact bytes flowing between them, switch between GTID and position-based replication with confidence, monitor and diagnose replication lag, and avoid the three most expensive mistakes engineers make when they first set this up. This is not a tutorial for spinning up a throwaway demo — it's the setup you'd be comfortable putting in front of real traffic.
How MySQL Replication Actually Works Under the Hood
Before touching a config file, you need a mental model of what's actually happening. MySQL replication is driven by three threads. On the primary, the binlog dump thread streams binary log events to any replica that connects. On the replica, the I/O thread maintains a persistent TCP connection to the primary, receives those events, and writes them into a local file called the relay log. Meanwhile, the SQL thread (or parallel apply workers in modern MySQL) reads the relay log and executes each event against the replica's local storage engine.
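All three threads are directly observable. A quick sketch, assuming MySQL 8.0 with the performance_schema enabled (it is by default):

```sql
-- On the PRIMARY: each connected replica's binlog dump thread shows up
-- as a normal connection running 'Binlog Dump' or 'Binlog Dump GTID'
SELECT id, user, host, command, state
FROM information_schema.processlist
WHERE command LIKE 'Binlog Dump%';

-- On the REPLICA: the I/O (connection) thread's health
SELECT channel_name, service_state, last_error_message
FROM performance_schema.replication_connection_status;

-- On the REPLICA: the SQL/apply side — one row per parallel worker
SELECT channel_name, worker_id, service_state, last_error_message
FROM performance_schema.replication_applier_status_by_worker;
```

These performance_schema tables are the structured equivalent of SHOW REPLICA STATUS and are much easier to scrape from monitoring agents.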
The binary log is not a simple text log — it's a structured binary format that records events at the statement level (STATEMENT mode), the row level (ROW mode), or a hybrid of both (MIXED mode). ROW-based replication is what you want in production. It records the actual before and after image of each changed row, which makes it deterministic — no ambiguity about what a non-deterministic function like UUID() or SYSDATE() returned. STATEMENT mode can cause silent data divergence when statements contain functions whose output differs between servers. (NOW() is a common red herring here: it replicates safely in STATEMENT mode because the event timestamp travels in the binlog, but UUID() and SYSDATE() do not.)
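To make the divergence risk concrete, here's a minimal sketch (the database and table names are illustrative):

```sql
-- Check the current format (ROW is the default in MySQL 8.0)
SHOW VARIABLES LIKE 'binlog_format';

-- Under STATEMENT mode, this statement is logged as text and re-executed
-- on the replica — UUID() returns a DIFFERENT value there, so the two
-- servers silently diverge with no error anywhere:
INSERT INTO app.sessions (id, created_at) VALUES (UUID(), NOW());

-- Under ROW mode, the binlog carries the actual row image — the concrete
-- UUID the primary generated — so the replica applies identical bytes.
-- You can inspect decoded row events with the mysqlbinlog utility:
--   $ mysqlbinlog --verbose --base64-output=decode-rows mysql-bin.000003
```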
GTID (Global Transaction Identifier) is a unique ID assigned to every committed transaction on the primary, formatted as source_uuid:transaction_sequence. GTIDs make failover and replica reconfiguration dramatically safer because a replica can tell any server 'here are the GTIDs I already have — send me everything else,' without needing to know a specific file name and byte offset. For any new setup in 2024, GTID mode is the right default.
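The GTID set algebra is exposed as SQL functions, which makes the "send me everything else" handshake easy to reason about. A small illustration (the UUID below is an example value):

```sql
-- Everything this server has committed, as a GTID set
SELECT @@GLOBAL.gtid_executed;

-- GTID_SUBTRACT(A, B) returns every GTID in A that is not in B.
-- This is exactly the computation the primary performs when a replica
-- connects with AUTO_POSITION: primary's executed minus replica's executed.
SELECT GTID_SUBTRACT(
  'a1b2c3d4-0000-0000-0000-000000000001:1-100',  -- primary has 1-100
  'a1b2c3d4-0000-0000-0000-000000000001:1-90'    -- replica has 1-90
) AS gtids_to_send;
-- Returns: a1b2c3d4-0000-0000-0000-000000000001:91-100
```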
-- ============================================================
-- Run these on the PRIMARY to inspect binlog state
-- ============================================================

-- Show all binary log files and their sizes on disk
SHOW BINARY LOGS;
-- Output tells you how many logs exist and how much disk they consume

-- Inspect the live stream of binary log events (human-readable)
-- Replace 'mysql-bin.000003' with your current binlog file
SHOW BINLOG EVENTS IN 'mysql-bin.000003' LIMIT 20;

-- Confirm GTID mode is enabled and see executed GTID sets
SHOW VARIABLES LIKE 'gtid_mode';                -- Should return ON
SHOW VARIABLES LIKE 'enforce_gtid_consistency'; -- Should return ON
SHOW MASTER STATUS\G
-- The Executed_Gtid_Set column shows every GTID this server has committed

-- ============================================================
-- Run these on the REPLICA to inspect the three replication threads
-- ============================================================

-- The single most important replication health command
SHOW REPLICA STATUS\G
-- Key fields to check:
--   Replica_IO_Running: Yes     -- I/O thread is connected to primary
--   Replica_SQL_Running: Yes    -- SQL thread is applying relay log
--   Seconds_Behind_Source: 0    -- 0 means replica is caught up
--   Last_IO_Error: (empty)      -- Any value here means connection trouble
--   Last_SQL_Error: (empty)     -- Any value here means apply-side failure
--   Retrieved_Gtid_Set          -- GTIDs received from primary
--   Executed_Gtid_Set           -- GTIDs applied to this replica's data

-- Check how many parallel SQL apply workers are configured
SHOW VARIABLES LIKE 'replica_parallel_workers'; -- 0 = single-threaded
SHOW VARIABLES LIKE 'replica_parallel_type';    -- LOGICAL_CLOCK is the modern choice
+------------------+-----------+-----------+
| Log_name         | File_size | Encrypted |
+------------------+-----------+-----------+
| mysql-bin.000001 |       177 | No        |
| mysql-bin.000002 |  52428800 | No        |
| mysql-bin.000003 |   1024312 | No        |
+------------------+-----------+-----------+
-- SHOW MASTER STATUS\G output (example with GTID):
*************************** 1. row ***************************
File: mysql-bin.000003
Position: 1024312
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set: a1b2c3d4-0000-0000-0000-000000000001:1-58291
-- SHOW REPLICA STATUS\G (healthy replica):
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 0
Last_IO_Error:
Last_SQL_Error:
Retrieved_Gtid_Set: a1b2c3d4-0000-0000-0000-000000000001:1-58291
Executed_Gtid_Set: a1b2c3d4-0000-0000-0000-000000000001:1-58291
Step-by-Step: Configuring a GTID-Based Primary/Replica Pair
Let's build a working replication setup. We'll use two servers: primary at 10.0.1.10 and replica at 10.0.1.11, both running MySQL 8.0+. The configuration changes require a restart, so plan your maintenance window accordingly — or use MySQL's SET PERSIST for variables that support online changes.
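Where online changes are possible, SET PERSIST both applies and saves the variable in one step. A hedged sketch of the online path (variable availability depends on your exact 8.0 version):

```sql
-- Dynamic variables can be changed and persisted across restarts in one step
SET PERSIST server_id = 2;
SET PERSIST replica_parallel_workers = 4;  -- applies when the SQL thread restarts

-- GTID mode can be enabled online, but only via the documented staged
-- transition — jumping straight from OFF to ON is rejected:
SET PERSIST enforce_gtid_consistency = WARN;  -- surface violations first
SET PERSIST enforce_gtid_consistency = ON;
SET PERSIST gtid_mode = OFF_PERMISSIVE;
SET PERSIST gtid_mode = ON_PERMISSIVE;
-- Wait for Ongoing_anonymous_transaction_count to reach 0, then:
SHOW STATUS LIKE 'Ongoing_anonymous_transaction_count';
SET PERSIST gtid_mode = ON;

-- log_bin, however, is NOT dynamic — enabling the binlog still needs a restart
```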
The primary needs the binary log enabled with a unique server ID, and it needs a dedicated replication user with minimal privileges. The replication user should only have REPLICATION SLAVE privilege — nothing else. Least privilege matters here because this account's credentials live in plaintext in older versions, and even with mysql.slave_master_info table encryption in 8.0, you don't want a compromised replica credential giving an attacker DML access.
On the replica side, you set a distinct server ID, point it at the primary using CHANGE REPLICATION SOURCE TO (the modern syntax replacing CHANGE MASTER TO), and start the replication threads. With GTID mode, you don't need to know a binlog file or position — you just tell the replica 'start from the beginning of GTIDs I don't have.' That's the magic of GTID: the protocol figures out the gap automatically.
For an initial data load, use mysqldump with --single-transaction --source-data=2 --set-gtid-purged=ON (--source-data is the MySQL 8.0 rename of the deprecated --master-data), or for large datasets, Percona XtraBackup, which takes a hot physical copy without locking tables. Never skip the initial consistent snapshot — if you do, the replica starts with wrong data and GTID tracking gives you a false sense of correctness.
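For the XtraBackup path, the flow looks roughly like this — paths and the GTID value are illustrative, and shell commands appear as comments following this article's convention:

```sql
-- $ xtrabackup --backup --target-dir=/backup/base -u root -p
-- $ xtrabackup --prepare --target-dir=/backup/base
-- $ rsync -a /backup/base/ replica:/var/lib/mysql/   # fix ownership, start mysqld
-- $ cat /backup/base/xtrabackup_binlog_info
--   mysql-bin.000003  1024312  a1b2c3d4-0000-0000-0000-000000000001:1-58291

-- On the REPLICA, tell GTID tracking what the snapshot already contains:
RESET MASTER;   -- clear the replica's own gtid_executed first
SET GLOBAL gtid_purged = 'a1b2c3d4-0000-0000-0000-000000000001:1-58291';
-- Then run CHANGE REPLICATION SOURCE TO ... SOURCE_AUTO_POSITION = 1
-- and START REPLICA as shown later in this article.
```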
# ============================================================
# PRIMARY SERVER: /etc/mysql/mysql.conf.d/mysqld.cnf
# Apply to server at 10.0.1.10, then restart MySQL
# ============================================================
[mysqld]
# Every server in a replication topology MUST have a unique ID
server-id = 1

# Enable the binary log — replication cannot work without this
log-bin          = mysql-bin
binlog_format    = ROW    # Deterministic row-level events
binlog_row_image = FULL   # Log complete before+after row images

# Retain binlogs for 7 days — long enough for replicas to catch up after downtime
binlog_expire_logs_seconds = 604800

# GTID settings — both must be ON together
gtid_mode                = ON
enforce_gtid_consistency = ON

# Crash safety: flush binlog to disk on every commit (safer, slightly slower)
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1

# ============================================================
# REPLICA SERVER: /etc/mysql/mysql.conf.d/mysqld.cnf
# Apply to server at 10.0.1.11, then restart MySQL
# ============================================================
[mysqld]
server-id = 2   # Must differ from primary's ID

# Relay log stores events received from primary before they're applied
relay-log          = relay-bin
relay_log_recovery = ON   # Auto-recover relay log on crash

# Write replica's own changes to its binlog — essential if this replica
# might ever become a primary (chained or cascaded replication)
log-bin             = mysql-bin
log_replica_updates = ON

# GTID — must match primary
gtid_mode                = ON
enforce_gtid_consistency = ON

# Parallel apply workers dramatically reduce lag under heavy write workloads
# LOGICAL_CLOCK preserves transaction ordering from the primary's commit order
replica_parallel_workers = 4
replica_parallel_type    = LOGICAL_CLOCK

# Make replica read-only — prevents accidental writes that cause divergence
read_only       = ON
super_read_only = ON   # Also blocks SUPER privilege users
# On PRIMARY:
mysql> SHOW VARIABLES LIKE 'server_id';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id     | 1     |
+---------------+-------+
mysql> SHOW VARIABLES LIKE 'gtid_mode';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| gtid_mode     | ON    |
+---------------+-------+

# On REPLICA:
mysql> SHOW VARIABLES LIKE 'server_id';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id     | 2     |
+---------------+-------+
mysql> SHOW VARIABLES LIKE 'read_only';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | ON    |
+---------------+-------+
Bootstrapping Replication, Monitoring Lag, and Handling Failover
Configuration files are just the foundation. Actually starting replication requires three steps: create the replication user on the primary, take a consistent snapshot of primary data and load it into the replica, then issue CHANGE REPLICATION SOURCE TO on the replica and START REPLICA.
Lag monitoring is where teams get caught off guard. Seconds_Behind_Source in SHOW REPLICA STATUS is calculated as the difference between the current clock time and the timestamp embedded in the relay log event being applied. This sounds fine until you realize it resets to NULL when the SQL thread stalls, reports 0 when the I/O thread is lagging (the replica thinks it's caught up because it hasn't received the new events yet), and can spike from 0 to thousands of seconds instantly during a large transaction. Production monitoring should combine Seconds_Behind_Source with the GTID gap and heartbeat tables.
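Alongside a heartbeat table, the GTID sets themselves give you an apply backlog you can alert on. A sketch for the replica side:

```sql
-- Transactions received from the primary but not yet applied locally.
-- A growing result here means the SQL thread is the bottleneck, even
-- while Seconds_Behind_Source may briefly read 0 or NULL.
SELECT GTID_SUBTRACT(
  (SELECT received_transaction_set
     FROM performance_schema.replication_connection_status
    WHERE channel_name = ''),      -- '' is the default replication channel
  @@GLOBAL.gtid_executed
) AS not_yet_applied;
```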
For failover, the clean approach with GTIDs is to run STOP REPLICA on the surviving replica, verify its Executed_Gtid_Set matches the primary's last known gtid_executed, then promote with RESET REPLICA ALL; SET GLOBAL read_only = OFF (disabling read_only also clears super_read_only). Tools like Orchestrator or MySQL Router automate this entire flow and handle edge cases like errant transactions that exist on a replica but not the primary — a situation that can permanently break GTID-based replication if you try to re-add that server.
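Errant transactions can be detected before they bite. A sketch, run on the candidate replica with the primary's gtid_executed pasted in (the set shown is a placeholder):

```sql
-- Anything this replica has executed that the primary never did is errant.
-- A non-empty result means do NOT promote this replica or re-point other
-- servers at it without first reconciling the extra transactions.
SELECT GTID_SUBTRACT(
  @@GLOBAL.gtid_executed,
  'a1b2c3d4-0000-0000-0000-000000000001:1-58291'  -- primary's gtid_executed
) AS errant_transactions;
```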
-- ============================================================
-- STEP 1: Run on PRIMARY — create the replication user
-- ============================================================
CREATE USER 'replication_user'@'10.0.1.11'
  IDENTIFIED WITH caching_sha2_password  -- Modern auth plugin for MySQL 8.0
  BY 'StrongR3plPass!2024';

-- Only grant the minimum required privilege
GRANT REPLICATION SLAVE ON *.* TO 'replication_user'@'10.0.1.11';
FLUSH PRIVILEGES;

-- Verify the user exists with the right privileges
SHOW GRANTS FOR 'replication_user'@'10.0.1.11';

-- ============================================================
-- STEP 2: Take a consistent snapshot on PRIMARY
-- Run this in a shell, NOT in mysql client
--   --single-transaction avoids table locks for InnoDB
--   --source-data=2 adds CHANGE REPLICATION SOURCE comment to dump
--   --triggers --routines --events captures full schema
-- ============================================================
-- $ mysqldump \
--     --single-transaction \
--     --source-data=2 \
--     --set-gtid-purged=ON \
--     --triggers \
--     --routines \
--     --events \
--     --all-databases \
--     -u root -p > /backup/primary_snapshot_$(date +%Y%m%d_%H%M%S).sql

-- ============================================================
-- STEP 3: Load snapshot on REPLICA (shell command)
-- ============================================================
-- $ mysql -u root -p < /backup/primary_snapshot_2024XXXX_XXXXXX.sql

-- ============================================================
-- STEP 4: Run on REPLICA — connect to primary and start replication
-- GTID mode means we don't need SOURCE_LOG_FILE or SOURCE_LOG_POS
-- ============================================================
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = '10.0.1.10',
  SOURCE_PORT = 3306,
  SOURCE_USER = 'replication_user',
  SOURCE_PASSWORD = 'StrongR3plPass!2024',
  SOURCE_AUTO_POSITION = 1,  -- This is the GTID magic: find my own gap automatically
  SOURCE_SSL = 1,            -- Always use SSL for replication in production
  SOURCE_SSL_CA = '/etc/mysql/ssl/ca.pem',
  SOURCE_SSL_CERT = '/etc/mysql/ssl/replica-cert.pem',
  SOURCE_SSL_KEY = '/etc/mysql/ssl/replica-key.pem',
  SOURCE_CONNECT_RETRY = 10,   -- Retry connection every 10 seconds on disconnect
  SOURCE_RETRY_COUNT = 86400;  -- Retry for up to 24 hours before giving up

-- Start both replication threads
START REPLICA;

-- Confirm everything is healthy
SHOW REPLICA STATUS\G

-- ============================================================
-- ONGOING MONITORING: Detect real lag using a heartbeat approach
-- Run on PRIMARY every 5 seconds (via cron or application job)
-- ============================================================
CREATE TABLE IF NOT EXISTS monitoring.replication_heartbeat (
  server_id    INT NOT NULL,
  heartbeat_ts DATETIME(6) NOT NULL,
  PRIMARY KEY (server_id)
) ENGINE=InnoDB;

-- Insert/update the heartbeat timestamp
INSERT INTO monitoring.replication_heartbeat (server_id, heartbeat_ts)
VALUES (@@server_id, NOW(6))
ON DUPLICATE KEY UPDATE heartbeat_ts = NOW(6);

-- Run this on REPLICA to calculate true lag in seconds
SELECT
  TIMESTAMPDIFF(MICROSECOND, heartbeat_ts, NOW(6)) / 1000000.0
    AS replication_lag_seconds,
  heartbeat_ts AS last_primary_write
FROM monitoring.replication_heartbeat
WHERE server_id = 1;  -- Primary's server_id

-- ============================================================
-- FAILOVER: Promote replica to primary (after primary failure)
-- ============================================================
-- Verify replica has all transactions before promoting
SHOW REPLICA STATUS\G
-- Confirm Seconds_Behind_Source: 0

-- Stop replica threads cleanly
STOP REPLICA;

-- Disconnect from old primary and clear replication config
RESET REPLICA ALL;

-- Lift read-only restrictions
SET GLOBAL super_read_only = OFF;
SET GLOBAL read_only = OFF;

-- Verify this server is now writable and has all GTIDs
SHOW MASTER STATUS\G
SHOW VARIABLES LIKE 'read_only';
+------------------------------------------------------------------+
| Grants for replication_user@10.0.1.11                            |
+------------------------------------------------------------------+
| GRANT REPLICATION SLAVE ON *.* TO `replication_user`@`10.0.1.11` |
+------------------------------------------------------------------+
-- SHOW REPLICA STATUS\G (after START REPLICA — healthy state):
Replica_IO_State: Waiting for source to send event
Source_Host: 10.0.1.10
Source_Port: 3306
Connect_Retry: 10
Source_Log_File: mysql-bin.000003
Read_Source_Log_Pos: 1024312
Relay_Log_File: relay-bin.000004
Relay_Log_Pos: 487
Relay_Source_Log_File: mysql-bin.000003
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 0
Last_IO_Error:
Last_SQL_Error:
Retrieved_Gtid_Set: a1b2c3d4-...:1-58291
Executed_Gtid_Set: a1b2c3d4-...:1-58291
Auto_Position: 1
-- Heartbeat lag query output (replica fully caught up):
+-------------------------+----------------------------+
| replication_lag_seconds | last_primary_write         |
+-------------------------+----------------------------+
|                0.000312 | 2024-03-15 14:32:01.887441 |
+-------------------------+----------------------------+
| Feature / Aspect | GTID-Based Replication | Position-Based (File+Offset) Replication |
|---|---|---|
| Failover complexity | Simple — CHANGE REPLICATION SOURCE with AUTO_POSITION=1 | Complex — must find exact binlog file + byte position on new primary |
| Errant transaction risk | High visibility — GTID gap is detectable and blocks re-add | Silent — divergence can go undetected until queries return wrong data |
| Multi-source replication | Straightforward — GTIDs are globally unique across all sources | Error-prone — file/position namespaces can collide |
| Replication filter compatibility | Requires care — filtered transactions still consume a GTID | Filters work without GTID bookkeeping complications |
| Initial setup complexity | Slightly more config (enforce_gtid_consistency restrictions) | Simpler initial config |
| Tooling support (Orchestrator, ProxySQL) | First-class support — topology-aware failover built around GTIDs | Supported but requires more manual coordination |
| mysqldump compatibility | Requires --set-gtid-purged=ON flag | Works with the --source-data=2 flag (formerly --master-data=2) |
| Crash recovery | Automatic GTID recovery — no manual position hunting | Manual: must check binlog position from InnoDB recovery |
| Production recommendation (2024) | ✅ Default choice for all new setups | Legacy setups only — migrate to GTID when possible |
🎯 Key Takeaways
- Always use ROW-based binary logging in production — STATEMENT mode causes silent data divergence with non-deterministic functions like UUID() and SYSDATE(), with zero error output to alert you.
- GTID mode with SOURCE_AUTO_POSITION=1 is the only replication setup worth building in 2024 — failover drops from a manual 20-step process to 3 commands, and errant transactions become detectable instead of silent.
- Seconds_Behind_Source is an unreliable lag metric on its own — pair it with a heartbeat table and GTID set comparison to get a lag number you can actually alert on in production.
- super_read_only=ON must live in my.cnf, not just SET GLOBAL at runtime — a replica restart without it in the config file silently drops read protection, and the next errant write breaks your GTID topology at the worst possible moment.
⚠ Common Mistakes to Avoid
- ✕ Mistake 1: Duplicate server-id across the topology — Symptom: Replication connects but events are silently skipped or looped; SHOW REPLICA STATUS shows the IO thread connecting and disconnecting in a loop with error 'Fatal error: The slave I/O thread stops because master and slave have equal MySQL server ids' — Fix: Every server in the entire topology (primaries, replicas, and intermediate replicas) must have a globally unique server-id. Audit all servers with SELECT @@server_id before adding a new node, and document IDs in your infrastructure config management system.
- ✕ Mistake 2: Writing directly to a replica — Symptom: Data exists on the replica that never existed on the primary; GTID replication breaks with 'Error 1236: The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires' on the next failover attempt — Fix: Set both read_only=ON AND super_read_only=ON in the replica's my.cnf, not just at runtime. Runtime-only settings are lost on restart. Errant transactions that already exist must be skipped by injecting an empty transaction with the errant GTID: SET GTID_NEXT='errant-gtid:N'; BEGIN; COMMIT; SET GTID_NEXT='AUTOMATIC';
- ✕ Mistake 3: Setting replica_parallel_workers without LOGICAL_CLOCK — Symptom: Replica lag is high under write-heavy workloads even with parallel workers enabled; or worse, replica applies transactions out of order causing intermittent constraint violations and SQL thread errors — Fix: If you enable replica_parallel_workers > 0, also set replica_parallel_type=LOGICAL_CLOCK and replica_preserve_commit_order=ON. LOGICAL_CLOCK uses the primary's binary log group commit timestamps to apply transactions in parallel only when it's safe to do so, preserving the original commit ordering. The default DATABASE parallelism mode only parallelizes across different schemas, which is nearly useless for single-database applications.
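The fix for Mistake 3 can be applied online. A sketch — the worker count is workload-dependent, and 4 is just a starting point:

```sql
-- Applier settings only take effect when the SQL thread restarts
STOP REPLICA SQL_THREAD;

SET PERSIST replica_parallel_type = 'LOGICAL_CLOCK';
SET PERSIST replica_parallel_workers = 4;
SET PERSIST replica_preserve_commit_order = ON;

START REPLICA SQL_THREAD;

-- Watch per-worker progress to confirm parallel apply is actually happening
SELECT worker_id, service_state, last_applied_transaction
FROM performance_schema.replication_applier_status_by_worker;
```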
Interview Questions on This Topic
- Q: A replica is showing Seconds_Behind_Source: 0 in SHOW REPLICA STATUS, but your application is reading stale data from it. How do you diagnose the true replication lag, and what could explain this discrepancy?
- Q: Explain the difference between a retrieved GTID set and an executed GTID set on a replica, and describe a scenario where they would differ significantly. What does that difference tell you operationally?
- Q: Your team needs to promote a replica to primary after an unexpected primary failure. You notice the replica's Executed_Gtid_Set is missing 50 transactions compared to the crashed primary's last known gtid_executed. The crashed primary is unrecoverable. How do you handle the missing transactions, and what are the tradeoffs of your decision?
Frequently Asked Questions
How do I add a replica to an existing MySQL primary without downtime?
Use Percona XtraBackup to take a hot physical backup of the primary — it streams a consistent snapshot without locking tables or interrupting writes. Restore the backup on the new replica server, note the GTID position embedded in the xtrabackup_binlog_info file, configure the replica's my.cnf with a unique server-id and GTID settings, then run CHANGE REPLICATION SOURCE TO with SOURCE_AUTO_POSITION=1 and START REPLICA. The replica will automatically request only the GTIDs it's missing.
What's the difference between MySQL semi-synchronous and asynchronous replication?
In standard asynchronous replication (the default), the primary commits a transaction and returns success to the client before confirming that any replica received the binlog events — so in a crash, you can lose committed data. Semi-synchronous replication (rpl_semi_sync_source_enabled=ON) makes the primary wait for at least one replica to acknowledge receipt of the binlog event before returning success to the client. With the default wait point (AFTER_SYNC, sometimes called lossless semi-sync), every acknowledged commit is guaranteed to exist on at least one replica — but note that semi-sync falls back to plain asynchronous mode if no replica responds within rpl_semi_sync_source_timeout, so it shrinks the data-loss window rather than eliminating it unconditionally. The tradeoff is added write latency equal to your network round-trip time to the nearest replica, typically 1-5ms in the same datacenter.
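Enabling semi-sync is a plugin install plus a couple of variables. A sketch assuming MySQL 8.0.26+ (older versions use the master/slave plugin and variable names instead):

```sql
-- On the PRIMARY
INSTALL PLUGIN rpl_semi_sync_source SONAME 'semisync_source.so';
SET PERSIST rpl_semi_sync_source_enabled = ON;
SET PERSIST rpl_semi_sync_source_timeout = 1000;  -- ms before falling back to async

-- On each REPLICA
INSTALL PLUGIN rpl_semi_sync_replica SONAME 'semisync_replica.so';
SET PERSIST rpl_semi_sync_replica_enabled = ON;
-- The I/O thread must reconnect for the replica-side change to take effect:
STOP REPLICA IO_THREAD;
START REPLICA IO_THREAD;

-- Verify on the primary: ON means at least one semi-sync replica is attached
SHOW STATUS LIKE 'Rpl_semi_sync_source_status';
```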
Can I replicate only specific tables or databases instead of everything?
Yes, using replication filters — set replicate-do-db, replicate-do-table, replicate-ignore-db, or replicate-ignore-table in the replica's my.cnf. However, these filters interact dangerously with GTID mode: when a replica filters out a transaction, it still records the transaction's GTID (committing it as an empty transaction), so its gtid_executed claims data it never actually applied. Promote that replica during a failover and you get a primary that is silently missing rows while its GTID set says it has everything. The safer pattern for partial replication is to use logical replication tools like Debezium or to restructure your schema so independent data lives on separate MySQL instances entirely.
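Filters can also be set online per replication channel rather than in my.cnf. A sketch — the database and table names are illustrative:

```sql
-- CHANGE REPLICATION FILTER must be issued while the applier is stopped
STOP REPLICA SQL_THREAD;

CHANGE REPLICATION FILTER
  REPLICATE_DO_DB = (app_db),
  REPLICATE_IGNORE_TABLE = (app_db.audit_log);

START REPLICA SQL_THREAD;

-- Audit active filters before any failover — a filtered replica's
-- gtid_executed overstates the data it actually holds
SELECT * FROM performance_schema.replication_applier_filters;
```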