
MySQL Replication Setup: Master-Replica Configuration, Internals & Production Gotchas

In Plain English 🔥
Imagine a chef in a busy restaurant writing every order on a notepad as it comes in. Now imagine a second chef in the back room who reads that same notepad in real time and prepares identical dishes. If the first chef gets sick, the second chef can step in immediately and nobody notices. MySQL replication works exactly like that — every write on your primary database is recorded in a log, and one or more replica servers read that log and replay the same changes. You get redundancy, read scaling, and a safety net, all from one elegant mechanism.

Every production database running at scale eventually hits the same wall: a single server can only handle so many reads before query times creep up, and a single point of failure is a liability no on-call engineer wants at 2 AM. MySQL replication is the foundational answer to both problems — it lets you distribute reads across multiple servers and gives you a warm standby that can take over within seconds if the primary crashes. It's not optional infrastructure. At any company running MySQL seriously, replication is day one.

The core problem replication solves is deceptively simple: how do you keep two or more MySQL instances in sync without locking everything or doubling write traffic? The answer lives in the binary log — a sequential, append-only journal of every state-changing event on the primary. Replicas connect, stream that log, and apply events in order. But the devil is in the details — GTID vs file-position tracking, parallel apply workers, semi-synchronous acknowledgment, replication filters, and lag under heavy write load are all areas where teams get burned in production.

By the end of this article you'll be able to configure a production-grade primary/replica pair from scratch, understand the exact bytes flowing between them, switch between GTID and position-based replication with confidence, monitor and diagnose replication lag, and avoid the three most expensive mistakes engineers make when they first set this up. This is not a tutorial for spinning up a throwaway demo — it's the setup you'd be comfortable putting in front of real traffic.

How MySQL Replication Actually Works Under the Hood

Before touching a config file, you need a mental model of what's actually happening. MySQL replication is driven by three threads. On the primary, the binlog dump thread streams binary log events to any replica that connects. On the replica, the I/O thread maintains a persistent TCP connection to the primary, receives those events, and writes them into a local file called the relay log. Meanwhile, the SQL thread (or parallel apply workers in modern MySQL) reads the relay log and executes each event against the replica's local storage engine.
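All three threads can be observed live. The queries below are a sketch using MySQL 8.0's performance_schema tables (replication_connection_status for the I/O thread, replication_applier_status_by_worker for the apply workers, and the processlist for the dump thread on the primary):

```sql
-- On the REPLICA: the I/O (connection) thread's state and last error
SELECT CHANNEL_NAME, SERVICE_STATE, LAST_ERROR_MESSAGE
FROM performance_schema.replication_connection_status;

-- On the REPLICA: one row per SQL apply worker
SELECT CHANNEL_NAME, WORKER_ID, SERVICE_STATE, LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_worker;

-- On the PRIMARY: each connected replica holds open a binlog dump thread
SELECT ID, USER, HOST, COMMAND, STATE
FROM performance_schema.processlist
WHERE COMMAND = 'Binlog Dump GTID';
```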

The binary log is not a simple text log — it's a structured binary format that records events at the statement level (STATEMENT mode), the row level (ROW mode), or a hybrid of both (MIXED mode). ROW-based replication is what you want in production. It records the actual before and after image of each changed row, which makes it deterministic — no ambiguity about what a non-deterministic function like UUID() or NOW() returned. STATEMENT mode can cause silent data divergence when queries contain functions whose output differs between servers.
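To make the STATEMENT-mode hazard concrete, here's an illustrative sketch (the orders table is hypothetical):

```sql
-- Hypothetical table, for illustration only
CREATE TABLE orders (
    id         INT PRIMARY KEY,
    token      CHAR(36),
    created_at DATETIME(6)
);

-- Under binlog_format=STATEMENT, the binlog stores this literal SQL text.
-- The replica re-evaluates UUID() and NOW(6) at apply time and can store
-- different values than the primary: silent divergence, no error raised.
INSERT INTO orders (id, token, created_at) VALUES (1, UUID(), NOW(6));

-- Under binlog_format=ROW, the binlog instead carries the actual row image
-- (id=1, the exact UUID string, the exact timestamp), so the replica
-- stores identical data.
```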

GTID (Global Transaction Identifier) is a unique ID assigned to every committed transaction on the primary, formatted as source_uuid:transaction_sequence. GTIDs make failover and replica reconfiguration dramatically safer because a replica can tell any server 'here are the GTIDs I already have — send me everything else,' without needing to know a specific file name and byte offset. For any new setup in 2024, GTID mode is the right default.
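You can see that gap computation directly with MySQL's built-in GTID set functions. The GTID values below are illustrative:

```sql
-- What is the replica missing? Subtract its Executed_Gtid_Set from the
-- primary's gtid_executed:
SELECT GTID_SUBTRACT(
    'a1b2c3d4-0000-0000-0000-000000000001:1-58291',   -- primary's gtid_executed
    'a1b2c3d4-0000-0000-0000-000000000001:1-58100'    -- replica's Executed_Gtid_Set
) AS missing_gtids;
-- Returns the interval 58101-58291: exactly the transactions left to stream

-- Is the replica's history fully contained in the primary's? (1 = yes)
SELECT GTID_SUBSET(
    'a1b2c3d4-0000-0000-0000-000000000001:1-58100',   -- candidate subset
    'a1b2c3d4-0000-0000-0000-000000000001:1-58291'    -- candidate superset
) AS replica_is_subset_of_primary;
```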

check_replication_internals.sql · SQL
-- ============================================================
-- Run these on the PRIMARY to inspect binlog state
-- ============================================================

-- Show all binary log files and their sizes on disk
SHOW BINARY LOGS;
-- Output tells you how many logs exist and how much disk they consume

-- Inspect the live stream of binary log events (human-readable)
-- Replace 'mysql-bin.000003' with your current binlog file
SHOW BINLOG EVENTS IN 'mysql-bin.000003' LIMIT 20;

-- Confirm GTID mode is enabled and see executed GTID sets
SHOW VARIABLES LIKE 'gtid_mode';          -- Should return ON
SHOW VARIABLES LIKE 'enforce_gtid_consistency'; -- Should return ON
SHOW MASTER STATUS\G
-- The Executed_Gtid_Set column shows every GTID this server has committed

-- ============================================================
-- Run these on the REPLICA to inspect the three replication threads
-- ============================================================

-- The single most important replication health command
SHOW REPLICA STATUS\G
-- Key fields to check:
--   Replica_IO_Running: Yes        -- I/O thread is connected to primary
--   Replica_SQL_Running: Yes       -- SQL thread is applying relay log
--   Seconds_Behind_Source: 0       -- 0 means replica is caught up
--   Last_IO_Error: (empty)         -- Any value here means connection trouble
--   Last_SQL_Error: (empty)        -- Any value here means apply-side failure
--   Retrieved_Gtid_Set            -- GTIDs received from primary
--   Executed_Gtid_Set             -- GTIDs applied to this replica's data

-- Check how many parallel SQL apply workers are configured
SHOW VARIABLES LIKE 'replica_parallel_workers'; -- 0 = single-threaded
SHOW VARIABLES LIKE 'replica_parallel_type';    -- LOGICAL_CLOCK is the modern choice
Output:
-- SHOW BINARY LOGS output (example):
+------------------+-----------+-----------+
| Log_name         | File_size | Encrypted |
+------------------+-----------+-----------+
| mysql-bin.000001 |       177 | No        |
| mysql-bin.000002 |  52428800 | No        |
| mysql-bin.000003 |   1024312 | No        |
+------------------+-----------+-----------+

-- SHOW MASTER STATUS\G output (example with GTID):
*************************** 1. row ***************************
File: mysql-bin.000003
Position: 1024312
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set: a1b2c3d4-0000-0000-0000-000000000001:1-58291

-- SHOW REPLICA STATUS\G (healthy replica):
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 0
Last_IO_Error:
Last_SQL_Error:
Retrieved_Gtid_Set: a1b2c3d4-0000-0000-0000-000000000001:1-58291
Executed_Gtid_Set: a1b2c3d4-0000-0000-0000-000000000001:1-58291
⚠️ Watch Out: STATEMENT vs ROW Replication
If your binlog_format is set to STATEMENT and any query calls UUID(), RAND(), NOW(), or a user-defined function, the replica will silently execute that function again and can get a different result. You'll have data drift with zero error messages. Set binlog_format=ROW in production, always. The extra disk usage is worth the determinism guarantee.

Step-by-Step: Configuring a GTID-Based Primary/Replica Pair

Let's build a working replication setup. We'll use two servers: primary at 10.0.1.10 and replica at 10.0.1.11, both running MySQL 8.0+. The configuration changes require a restart, so plan your maintenance window accordingly — or use MySQL's SET PERSIST for variables that support online changes.
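For reference, here's a sketch of what the online path looks like with SET PERSIST, which applies a dynamic variable immediately and writes it to mysqld-auto.cnf so it survives restarts. Note that gtid_mode can only be stepped one state at a time, following the documented online procedure:

```sql
-- Dynamic variables: change online and persist across restarts
SET PERSIST replica_parallel_workers   = 4;
SET PERSIST binlog_expire_logs_seconds = 604800;

-- gtid_mode must move through its states in order:
-- OFF -> OFF_PERMISSIVE -> ON_PERMISSIVE -> ON
SET PERSIST enforce_gtid_consistency = WARN;  -- run in WARN first, watch the error log
SET PERSIST enforce_gtid_consistency = ON;
SET PERSIST gtid_mode = OFF_PERMISSIVE;
SET PERSIST gtid_mode = ON_PERMISSIVE;
-- Wait for this counter to reach 0 on every server before the final step:
SHOW STATUS LIKE 'Ongoing_anonymous_transaction_count';
SET PERSIST gtid_mode = ON;
```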

The primary needs the binary log enabled with a unique server ID, and it needs a dedicated replication user with minimal privileges. The replication user should only have REPLICATION SLAVE privilege — nothing else. Least privilege matters here because this account's credentials live in plaintext in older versions, and even with mysql.slave_master_info table encryption in 8.0, you don't want a compromised replica credential giving an attacker DML access.

On the replica side, you set a distinct server ID, point it at the primary using CHANGE REPLICATION SOURCE TO (the modern syntax replacing CHANGE MASTER TO), and start the replication threads. With GTID mode, you don't need to know a binlog file or position — you just tell the replica 'start from the beginning of GTIDs I don't have.' That's the magic of GTID: the protocol figures out the gap automatically.

For an initial data load, use mysqldump with --single-transaction --source-data=2 --set-gtid-purged=ON, or for large datasets, Percona XtraBackup, which takes a hot physical copy without locking tables. Never skip the initial consistent snapshot. If you do, the replica starts with wrong data and GTID tracking gives you a false sense of correctness.

mysql_primary_replica_config.cnf · INI
# ============================================================
# PRIMARY SERVER: /etc/mysql/mysql.conf.d/mysqld.cnf
# Apply to server at 10.0.1.10, then restart MySQL
# ============================================================
[mysqld]
# Every server in a replication topology MUST have a unique ID
server-id                    = 1

# Enable the binary log — replication cannot work without this
log-bin                      = mysql-bin
binlog_format                = ROW          # Deterministic row-level events
binlog_row_image             = FULL         # Log complete before+after row images

# Retain binlogs for 7 days — long enough for replicas to catch up after downtime
binlog_expire_logs_seconds   = 604800

# GTID settings — both must be ON together
gtid_mode                    = ON
enforce_gtid_consistency     = ON

# Crash safety: flush binlog to disk on every commit (safer, slightly slower)
sync_binlog                  = 1
innodb_flush_log_at_trx_commit = 1

# ============================================================
# REPLICA SERVER: /etc/mysql/mysql.conf.d/mysqld.cnf
# Apply to server at 10.0.1.11, then restart MySQL
# ============================================================
[mysqld]
server-id                    = 2            # Must differ from primary's ID

# Relay log stores events received from primary before they're applied
relay-log                    = relay-bin
relay_log_recovery           = ON           # Auto-recover relay log on crash

# Write replica's own changes to its binlog — essential if this replica
# might ever become a primary (chained or cascaded replication)
log-bin                      = mysql-bin
log_replica_updates          = ON

# GTID — must match primary
gtid_mode                    = ON
enforce_gtid_consistency     = ON

# Parallel apply workers dramatically reduce lag under heavy write workloads
# LOGICAL_CLOCK preserves transaction ordering from the primary's commit order
replica_parallel_workers     = 4
replica_parallel_type        = LOGICAL_CLOCK

# Make replica read-only — prevents accidental writes that cause divergence
read_only                    = ON
super_read_only              = ON           # Also blocks SUPER privilege users
Output:
# After restarting MySQL on both servers, verify with:
# On PRIMARY:
mysql> SHOW VARIABLES LIKE 'server_id';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id     | 1     |
+---------------+-------+

mysql> SHOW VARIABLES LIKE 'gtid_mode';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| gtid_mode     | ON    |
+---------------+-------+

# On REPLICA:
mysql> SHOW VARIABLES LIKE 'server_id';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id     | 2     |
+---------------+-------+

mysql> SHOW VARIABLES LIKE 'read_only';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | ON    |
+---------------+-------+
⚠️ Pro Tip: super_read_only Is Your Safety Net
Setting only read_only=ON still allows users with the SUPER privilege (or CONNECTION_ADMIN in MySQL 8.0) to write to the replica, and over-privileged application accounts are more common than they should be. Set super_read_only=ON on every replica. It blocks writes from ALL users including root, and is the most reliable guard against accidental replica writes that cause silent data divergence. You can disable it temporarily in an emergency with SET GLOBAL super_read_only=OFF.

Bootstrapping Replication, Monitoring Lag, and Handling Failover

Configuration files are just the foundation. Actually starting replication requires three steps: create the replication user on the primary, take a consistent snapshot of primary data and load it into the replica, then issue CHANGE REPLICATION SOURCE TO on the replica and START REPLICA.

Lag monitoring is where teams get caught off guard. Seconds_Behind_Source in SHOW REPLICA STATUS is calculated as the difference between the current clock time and the timestamp embedded in the relay log event being applied. This sounds fine until you realize it resets to NULL when the SQL thread stalls, reports 0 when the I/O thread is lagging (the replica thinks it's caught up because it hasn't received the new events yet), and can spike from 0 to thousands of seconds instantly during a large transaction. Production monitoring should combine Seconds_Behind_Source with the GTID gap and heartbeat tables.

For failover, the clean approach with GTIDs is to verify the candidate replica's Executed_Gtid_Set covers the primary's last known gtid_executed, then promote with STOP REPLICA; RESET REPLICA ALL; SET GLOBAL super_read_only=OFF; SET GLOBAL read_only=OFF. Tools like Orchestrator or MySQL Router automate this entire flow and handle edge cases like errant transactions that exist on a replica but not the primary, a situation that can permanently break GTID-based replication if you try to re-add that server.
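Before re-adding or promoting a server, the errant-transaction check is one GTID_SUBTRACT away. The pasted-in GTID set below is illustrative; in practice you'd use the primary's actual gtid_executed:

```sql
-- Run on the candidate replica. Non-empty output means this replica has
-- committed transactions the primary never saw (errant transactions).
SELECT GTID_SUBTRACT(
    @@GLOBAL.gtid_executed,                          -- this replica's history
    'a1b2c3d4-0000-0000-0000-000000000001:1-58291'   -- primary's gtid_executed
) AS errant_transactions;
-- ''        -> safe: this replica's history is a subset of the primary's
-- non-empty -> resolve before re-adding: inject an empty transaction with
--              the errant GTID on the primary, or rebuild the replica
```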

bootstrap_and_monitor_replication.sql · SQL
-- ============================================================
-- STEP 1: Run on PRIMARY — create the replication user
-- ============================================================
CREATE USER 'replication_user'@'10.0.1.11'
    IDENTIFIED WITH caching_sha2_password  -- Modern auth plugin for MySQL 8.0
    BY 'StrongR3plPass!2024';

-- Only grant the minimum required privilege
GRANT REPLICATION SLAVE ON *.* TO 'replication_user'@'10.0.1.11';
FLUSH PRIVILEGES;

-- Verify the user exists with the right privileges
SHOW GRANTS FOR 'replication_user'@'10.0.1.11';

-- ============================================================
-- STEP 2: Take a consistent snapshot on PRIMARY
-- Run this in a shell, NOT in mysql client
-- --single-transaction avoids table locks for InnoDB
-- --source-data=2 adds CHANGE REPLICATION SOURCE comment to dump
-- --triggers --routines --events captures full schema
-- ============================================================
-- $ mysqldump \
--     --single-transaction \
--     --source-data=2 \
--     --set-gtid-purged=ON \
--     --triggers \
--     --routines \
--     --events \
--     --all-databases \
--     -u root -p > /backup/primary_snapshot_$(date +%Y%m%d_%H%M%S).sql

-- ============================================================
-- STEP 3: Load snapshot on REPLICA (shell command)
-- ============================================================
-- $ mysql -u root -p < /backup/primary_snapshot_2024XXXX_XXXXXX.sql

-- ============================================================
-- STEP 4: Run on REPLICA — connect to primary and start replication
-- GTID mode means we don't need SOURCE_LOG_FILE or SOURCE_LOG_POS
-- ============================================================
CHANGE REPLICATION SOURCE TO
    SOURCE_HOST      = '10.0.1.10',
    SOURCE_PORT      = 3306,
    SOURCE_USER      = 'replication_user',
    SOURCE_PASSWORD  = 'StrongR3plPass!2024',
    SOURCE_AUTO_POSITION = 1,   -- This is the GTID magic: find my own gap automatically
    SOURCE_SSL       = 1,       -- Always use SSL for replication in production
    SOURCE_SSL_CA    = '/etc/mysql/ssl/ca.pem',
    SOURCE_SSL_CERT  = '/etc/mysql/ssl/replica-cert.pem',
    SOURCE_SSL_KEY   = '/etc/mysql/ssl/replica-key.pem',
    SOURCE_CONNECT_RETRY = 10,  -- Retry connection every 10 seconds on disconnect
    SOURCE_RETRY_COUNT  = 86400; -- At 10-second intervals, keep retrying for roughly 10 days

-- Start both replication threads
START REPLICA;

-- Confirm everything is healthy
SHOW REPLICA STATUS\G

-- ============================================================
-- ONGOING MONITORING: Detect real lag using a heartbeat approach
-- Run on PRIMARY every 5 seconds (via cron or application job)
-- ============================================================
CREATE TABLE IF NOT EXISTS monitoring.replication_heartbeat (
    server_id        INT          NOT NULL,
    heartbeat_ts     DATETIME(6)  NOT NULL,
    PRIMARY KEY (server_id)
) ENGINE=InnoDB;

-- Insert/update the heartbeat timestamp
INSERT INTO monitoring.replication_heartbeat (server_id, heartbeat_ts)
    VALUES (@@server_id, NOW(6))
    ON DUPLICATE KEY UPDATE heartbeat_ts = NOW(6);

-- Run this on REPLICA to calculate true lag in seconds
SELECT
    TIMESTAMPDIFF(
        MICROSECOND,
        heartbeat_ts,
        NOW(6)
    ) / 1000000.0 AS replication_lag_seconds,
    heartbeat_ts   AS last_primary_write
FROM monitoring.replication_heartbeat
WHERE server_id = 1;  -- Primary's server_id

-- ============================================================
-- FAILOVER: Promote replica to primary (after primary failure)
-- ============================================================
-- Verify replica has all transactions before promoting
SHOW REPLICA STATUS\G  -- Confirm Seconds_Behind_Source: 0

-- Stop replica threads cleanly
STOP REPLICA;

-- Disconnect from old primary and clear replication config
RESET REPLICA ALL;

-- Lift read-only restrictions
SET GLOBAL super_read_only = OFF;
SET GLOBAL read_only       = OFF;

-- Verify this server is now writable and has all GTIDs
SHOW MASTER STATUS\G
SHOW VARIABLES LIKE 'read_only';
Output:
-- SHOW GRANTS output:
+------------------------------------------------------------------+
| Grants for replication_user@10.0.1.11                            |
+------------------------------------------------------------------+
| GRANT REPLICATION SLAVE ON *.* TO `replication_user`@`10.0.1.11` |
+------------------------------------------------------------------+

-- SHOW REPLICA STATUS\G (after START REPLICA — healthy state):
Replica_IO_State: Waiting for source to send event
Source_Host: 10.0.1.10
Source_Port: 3306
Connect_Retry: 10
Source_Log_File: mysql-bin.000003
Read_Source_Log_Pos: 1024312
Relay_Log_File: relay-bin.000004
Relay_Log_Pos: 487
Relay_Source_Log_File: mysql-bin.000003
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 0
Last_IO_Error:
Last_SQL_Error:
Retrieved_Gtid_Set: a1b2c3d4-...:1-58291
Executed_Gtid_Set: a1b2c3d4-...:1-58291
Auto_Position: 1

-- Heartbeat lag query output (replica fully caught up):
+-------------------------+----------------------------+
| replication_lag_seconds | last_primary_write         |
+-------------------------+----------------------------+
|                0.000312 | 2024-03-15 14:32:01.887441 |
+-------------------------+----------------------------+
🔥 Interview Gold: Why Seconds_Behind_Source Can Lie
Interviewers love this one. Seconds_Behind_Source measures the timestamp difference between now and the relay log event being applied. But if the I/O thread is lagging (the primary is far ahead, yet events haven't been received), the metric shows 0 while you're actually minutes behind. Always pair it with GTID set comparison or a heartbeat table for trustworthy lag detection in production alerting.
| Feature / Aspect | GTID-Based Replication | Position-Based (File+Offset) Replication |
|---|---|---|
| Failover complexity | Simple: CHANGE REPLICATION SOURCE with AUTO_POSITION=1 | Complex: must find the exact binlog file + byte position on the new primary |
| Errant transaction risk | High visibility: a GTID gap is detectable and blocks re-add | Silent: divergence can go undetected until queries return wrong data |
| Multi-source replication | Straightforward: GTIDs are globally unique across all sources | Error-prone: file/position namespaces can collide |
| Replication filter compatibility | Requires care: filtered transactions still consume a GTID | Filters work without GTID bookkeeping complications |
| Initial setup complexity | Slightly more config (enforce_gtid_consistency restrictions) | Simpler initial config |
| Tooling support (Orchestrator, ProxySQL) | First-class: topology-aware failover built around GTIDs | Supported but requires more manual coordination |
| mysqldump compatibility | Requires the --set-gtid-purged=ON flag | Works with --source-data=2 alone |
| Crash recovery | Automatic GTID recovery, no manual position hunting | Manual: must check the binlog position after InnoDB recovery |
| Production recommendation (2024) | ✅ Default choice for all new setups | Legacy setups only; migrate to GTID when possible |

🎯 Key Takeaways

  • Always use ROW-based binary logging in production — STATEMENT mode causes silent data divergence with non-deterministic functions like UUID() and NOW(), with zero error output to alert you.
  • GTID mode with SOURCE_AUTO_POSITION=1 is the only replication setup worth building in 2024 — failover drops from a manual 20-step process to 3 commands, and errant transactions become detectable instead of silent.
  • Seconds_Behind_Source is an unreliable lag metric on its own — pair it with a heartbeat table and GTID set comparison to get a lag number you can actually alert on in production.
  • super_read_only=ON must live in my.cnf, not just SET GLOBAL at runtime — a replica restart without it in the config file silently drops read protection, and the next errant write breaks your GTID topology at the worst possible moment.

⚠ Common Mistakes to Avoid

  • Mistake 1: Duplicate server-id across the topology — Symptom: Replication connects but events are silently skipped or looped; SHOW REPLICA STATUS shows IO thread connecting and disconnecting in a loop with error 'Fatal error: The slave I/O thread stops because master and slave have equal MySQL server ids' — Fix: Every server in the entire topology (primaries, replicas, and intermediate replicas) must have a globally unique server-id. Audit all servers with SELECT @@server_id before adding a new node, and document IDs in your infrastructure config management system.
  • Mistake 2: Writing directly to a replica — Symptom: Data exists on the replica that never existed on the primary; GTID replication breaks with 'Error 1236: The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires' on the next failover attempt — Fix: Set both read_only=ON AND super_read_only=ON in the replica's my.cnf, not just at runtime. Runtime-only settings are lost on restart. Errant transactions that already exist must be skipped by injecting an empty transaction with the errant GTID: SET GTID_NEXT='errant-gtid:N'; BEGIN; COMMIT; SET GTID_NEXT='AUTOMATIC';
  • Mistake 3: Setting replica_parallel_workers without LOGICAL_CLOCK — Symptom: Replica lag is high under write-heavy workloads even with parallel workers enabled; or worse, the replica applies transactions out of order, causing intermittent constraint violations and SQL thread errors — Fix: If you enable replica_parallel_workers > 0, also set replica_parallel_type=LOGICAL_CLOCK and replica_preserve_commit_order=ON. LOGICAL_CLOCK uses the primary's binary log group commit timestamps to apply transactions in parallel only when it's safe to do so, preserving the original commit ordering. The older DATABASE parallelism mode (the default before MySQL 8.0.27) only parallelizes across different schemas, which is nearly useless for single-database applications.

Interview Questions on This Topic

  • Q: A replica is showing Seconds_Behind_Source: 0 in SHOW REPLICA STATUS, but your application is reading stale data from it. How do you diagnose the true replication lag, and what could explain this discrepancy?
  • Q: Explain the difference between a retrieved GTID set and an executed GTID set on a replica, and describe a scenario where they would differ significantly. What does that difference tell you operationally?
  • Q: Your team needs to promote a replica to primary after an unexpected primary failure. You notice the replica's Executed_Gtid_Set is missing 50 transactions compared to the crashed primary's last known gtid_executed. The crashed primary is unrecoverable. How do you handle the missing transactions, and what are the tradeoffs of your decision?

Frequently Asked Questions

How do I add a replica to an existing MySQL primary without downtime?

Use Percona XtraBackup to take a hot physical backup of the primary — it streams a consistent snapshot without locking tables or interrupting writes. Restore the backup on the new replica server, note the GTID position embedded in the xtrabackup_binlog_info file, configure the replica's my.cnf with a unique server-id and GTID settings, then run CHANGE REPLICATION SOURCE TO with SOURCE_AUTO_POSITION=1 and START REPLICA. The replica will automatically request only the GTIDs it's missing.

What's the difference between MySQL semi-synchronous and asynchronous replication?

In standard asynchronous replication (the default), the primary commits a transaction and returns success to the client before confirming that any replica received the binlog events, so a primary crash can lose committed data. Semi-synchronous replication (rpl_semi_sync_source_enabled=ON) makes the primary wait for at least one replica to acknowledge receipt of the binlog event before returning success, so every acknowledged commit exists on at least one replica. Note that if no replica responds within rpl_semi_sync_source_timeout, the primary falls back to asynchronous mode rather than stalling writes indefinitely. The tradeoff is added write latency roughly equal to the network round-trip time to the nearest replica, typically 1-5 ms in the same datacenter.
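Enabling it is a plugin install plus a couple of variables per side. The names below are the MySQL 8.0.26+ spellings; older servers use the rpl_semi_sync_master/rpl_semi_sync_slave equivalents:

```sql
-- On the PRIMARY:
INSTALL PLUGIN rpl_semi_sync_source SONAME 'semisync_source.so';
SET PERSIST rpl_semi_sync_source_enabled = ON;
SET PERSIST rpl_semi_sync_source_timeout = 1000;  -- ms; fall back to async after 1s

-- On each REPLICA:
INSTALL PLUGIN rpl_semi_sync_replica SONAME 'semisync_replica.so';
SET PERSIST rpl_semi_sync_replica_enabled = ON;
STOP REPLICA IO_THREAD;
START REPLICA IO_THREAD;  -- reconnect so the replica registers as semi-sync

-- Verify on the primary (status ON means semi-sync is active):
SHOW STATUS LIKE 'Rpl_semi_sync_source_status';
SHOW STATUS LIKE 'Rpl_semi_sync_source_clients';
```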

Can I replicate only specific tables or databases instead of everything?

Yes, using replication filters — set replicate-do-db, replicate-do-table, replicate-ignore-db, or replicate-ignore-table in the replica's my.cnf. However, these filters interact dangerously with GTID mode: even filtered-out transactions consume a GTID on the primary but aren't applied on the replica, creating a permanent GTID gap that causes errors if you ever try to use that replica as a new primary. The safer pattern for partial replication is to use logical replication tools like Debezium or to restructure your schema so independent data lives on separate MySQL instances entirely.

By the TheCodeForge Editorial Team

