Database Replication
- Leader Based Replication
- Single-Leader
- aka Leader-Follower or Master-Slave or Active-Passive
- Multi-Leader
- aka Master-Master or Active-Active or Peer to Peer
- Consensus Algorithm
- Paxos
- Single-Leader
- Leaderless Replication
- Single Leader is useful to ensure no write conflicts
- Multi-Leader and Leaderless Replication increases write throughput but can cause write conflicts
- Uses
- Fault tolerance
- Improve latency with Regional strategy
- Helps in Scaling
- https://stackoverflow.com/questions/3736969/master-master-vs-master-slave-database-architecture
Redundancy vs Replication
- Redundancy is process of duplication
- Duplication of critical components or functions of a system to increase reliability
- Achieves
- Create Backup data
- Improve actual performance
- Replication is process of keeping the sync
- Involves sharing information to ensure consistency between redundant resources such as multiple databases
- Improve reliability, fault-tolerance, or accessibility
Synchronous vs Asynchronous Replication
- Sync
- Guarantees up-to-date copy in followers
- If Follower fails, Leader must block all writes and wait until it is back again
- Semi-Sync
- One of the follower is replicated in sync
- If the replica fails, one of the async replica is made a sync replica
- Async
- Writes are not durable
- If followers fail, it can continue processing writes
Leader-Follower Replication
- aka Master-Slave, Single Leader
- Writes go to Leader
- Sends data change to all of its followers
- Reads go to either Leader or one of its Followers
- aka Replica
- Each followers takes the data change and apply on its DB
- Ensures no write conflicts
- Examples
- PostgresQL WAL
- MySQL
binlog(Binary Log) - MSSQL Transaction Log
- MongoDB
- Kafka
- RabbitMQ
Adding a new Follower
- Take consistent snapshot of Leader without taking lock on DB
- Copy snapshot to the new follower node
- Follower connects with the leader
- Snapshot taken earlier is associated with a position in leader’s replication log
- Postgres: Log sequence numbers
- MySQL: binlog coordinates
- Requests all data changes from Leader since the snapshot was taken
- Snapshot taken earlier is associated with a position in leader’s replication log
- When Follower is done taking all data changes, we say follower has caught up
Failures
- Failing Follower
- Follower keeps log of all data changes
- When follower recovers, it connects with leader and request changes since the outage
- Failing Leader
- aka Failover
- Determine the Leader has died
- timeouts can be used, for example 30s
- Choosing a new Leader
- A new leader is elected (consensus)
- Generally the best candidate which is most up-to-date with the leader
- Reconfigure new Leader
- Route writes to new leader
- Determine the Leader has died
- The writes the only went to Leader but not replicated can cause conflict if old leader comes back online
- generally writes are discarded
- Avoid Split brain where two nodes believe that they are leaders
- can cause conflicts and data loss
- Choosing right timeout value is difficult
- aka Failover
Replication Mechanism
- Statement based
- Keep all write statements in the log
- Issues:
- If the statements are non-deterministic, like
NOW()(current time),RAND()(random number) - If using auto-incrementing column
- If side effects like: triggers, stored procedures
- If the statements are non-deterministic, like
- Uses
- VoltDB uses this, but requires Tx to be deterministic
- Write-Ahead Log (WAL)
- In case of LSM based index, log itself stores the data
- In case of B-Tree, every modification is written to WAL, so that index can be restored in case of crash
- Issues
- Contains data of which bytes changed in which disk block
- Typically not possible to use different storage engines/versions in Leader and Followers
- Uses
- PostgresQL WAL
- MySQL
binlog(Binary Log) - MSSQL Transaction Log
- MongoDB
oplog - Kafka
- https://www.architecture-weekly.com/p/the-write-ahead-log-a-foundation
- Row (Logical) based
- aka Logical Log
- Uses different log formats for replication and storage
- Sequence of records capturing modifications at row level
- Can be kept backward compatible
- Easier to parse by external systems using CDC (Change Data Capture)
- Data warehouses
- Custom Indexes and Caches
- Uses
- MySQL
binlogwith Logical log configured
- MySQL
- Trigger based
- Implemented at the application level
- Prone to bugs and overheads
- Implemented by
- Triggers
- Stored Procedures
- Oracle Golden Gate
Replication Lag Problems
- Reading your own writes
- aka read-after-write consistency
- If user modified, read from leader
- does not work if lot of modifications
- for example: modifying user’s profile
- Wait for some-time or monitor the replication lag
- read from leader for some time, then read from followers
- client can remember the timestamp of write, read only when replication is done
- timestamp can be logical timestamp
- cross browser/devices read-after-write consistency can be challenging
- possible that different devices are routed to different datacenters
- Monotonic Reads
- If user makes several reads of an object then he should not see time go backwards
- It can happen when several reads are across replicas
- Guarantee: Eventual Consistency < Monotonic Reads < Strong Consistency
- Make sure that user always read from same replica: hashing based on user ID
- Consistent Prefix Reads
- If write happens on multiple objects then he should not see them out of order when reading
- Make sure any writes that are causally related are written on same partition
Multi-Leader Replication
- aka Master-Master, Peer to Peer
- We keep a leader in each datacenter
- Each Write goes to local datacenter, and are replicated to local followers as well as other datacenters in async
- No failover needed, request can be routed to leader in a different datacenter
- Biggest issue is Write conflicts b/w different datacenters
- Use cases
- Clients with offline operation
- writes are saved locally in the device while offline
- It acts as a leader which syncs with the datacenter when internet is available
- Apache CouchDB is made for this purpose
- Realtime Collaborative Editing
- application can obtain lock, make changes and release lock = single leader
- Avoid locking, but handle conflicts = multi leader
- Clients with offline operation
- Communication is handled by Consensus algorithms
- Examples
- MySQL circular replication
- Apache CouchDB multi-master
- MSSQL Peer to Peer replication
- Used
- LDAP servers like Active Directory and OpenLDAP
- https://en.wikipedia.org/wiki/Multi-master_replication
Multi-Leader Write Conflicts
- Ensure all writes to particular object goes to same leader
- Each user writes to single datacenter, essentially making it into single leader
- If user moves to different location, or datacenter fails, conflicts resolution needed
- Convergent Conflict Resolution
- All datacenters should converge to same state
- Each write has unique ID, write with highest ID is winner
- prone to data loss
- Each replica has unique ID, writes on higher numbered replica wins
- prone to data loss
- Merge and concatenate values, example: “B/C”
- Handle at application level, and let user decide on the conflicts
- Custom Conflict Resolution
- When write conflicts happen, conflict handler stores them
- When data is read next time, user is presented with the option to handle conflict
- Example: Apache CouchDB
Automatic Conflict Resolution
- CRDT: Conflict free Replicated Datatypes
- family of data structures like sets, maps etc. which multiple users write to
- It automatically handles conflict resolution
- 2-way merge function
- Mergable Persistent Data structures
- tracks history of writes
- 3-way merge function
- Operational Transformation
- Algorithm used by Google Docs
Multi-Leader Topologies
- Circular, Star, All to All
- MySQL only supports circular
Leaderless Replication
- aka Dynamo-style
- Amazon DynamoDB repopularized the leaderless replication
- Writes go to all DBs and reads come from all DBs
- Failover does not exist
- It is application’s responsibility to recover from errors
- Read Repair
- Client makes read from several nodes in parallel
- It can detect stale responses, for example one replica gives version 5, another version 6
- If client sees the stale values it writes newer values back
- Works well for frequently read data
- Anti-entropy process
- background process that constantly look for differences in replicas
- It copies missing data in replicas
- Writes are not copied in particular order
- Without this, if values are rarely read, it will have reduced durability
- Useful for
- High Availability (Tolerate Network Interruptions)
- Low Latency (Tolerate Latency Spikes)
- Tolerate occasional stale reads
- Tolerate conflicting writes
- Tolerating Slow replicas
- Examples
- Amazon DynamoDB
- Cassandra
- Riak
- Voldemort
- ScyllaDB
Quorums
- Strict Quorum
- Assuming:
- = number of replicas
- = number of nodes to wait for write
- = number of nodes to wait for read
- Then
- , then we can expect up to date read
- Read and Writes which follow the above are called quorum reads and writes
- The parameters , and are configurable in the DB
- Generally
- is chosen to be odd
- we choose
- Max number of nodes we can tolerate to be unavailable =
-
- we can tolerate loss of 1 node
-
- we can tolerate loss of 2 nodes
- If more nodes become unavailable then, reads and writes throw error
- Assuming:
- Issues with Quorums
- Sloppy Quorum
- Concurrent Write causes conflicts
- Concurrent Write and Read may cause stale read because write is not done on all nodes
- If less than writes happen, then
- overall write fails
- writes are not rollback on nodes where it succeeded
- read may or may not return the that write
- If node with new value fails, and it recovers from node with old value, it will violate the writes, and can cause stale read
- No guarantees for:
- reading your write
- monotonic read
- consistent prefix read
- Sloppy Quorum
- During network interruption, writes and reads require and successful responses, but may include nodes that are not home nodes
- Once n/w interruption fixed, any writes to temporary node is sent to appropriate home nodes, this is called Hinted Handoff
- It is not actually a quorum, but an assurance of durability that data is written somewhere
- Examples
- Riak: enabled by default
- Cassandra, Voldemort: disabled by default
Leaderless: Multi-Datacenter
- Leaderless replication is suitable also
- Writes go to each replica in all datacenters
- Quorum is taken from only replicas from local datacenters
- = total number of replicas
- Example: Cassandra, Voldemort
- Writes only go to each replica in local datacenter
- Replicated to other datacenters in async
- Quorum is taken from local datacenter replicas
- Example: Riak
Concurrent Writes
- Conflicts can happen:
- Strict Quorums
- Keys are overwritten
- Sloppy Quorums
- Read Repair
- Hinted Handoff
- Strict Quorums
- Last Write Wins (LWW)
- Attach timestamp to each write
- In case of conflicts, always pick the latest write
- achieves eventual convergence
- It causes lost writes (less durable)
- If losing data is not acceptable, LWW is a poor choice
- Use cases
- Caching
- Cassandra only supports LWW
- we generally set UUID as a key instead
- Hence each write becomes unique
- Riak supports LWW but is optional
- Happens-Before Relationship
- An operation A happens-before another operation B, if B knows about A, or depends on A, or builds upon A in some way
- An operation A is concurrent with operation B, if neither knows about the other
- For any two operations A and B, one of the following must hold:
- A happened before B
- B happened before A
- A and B are concurrent
- Riak supports it
Happens-Before Relationship: Causal Context
- https://docs.riak.com/riak/kv/latest/developing/usage/conflict-resolution/index.html
- https://docs.riak.com/riak/kv/2.2.3/developing/usage/conflict-resolution/java.1.html
- Better than LWW since no lost data, but client has to resolve merge conflict between concurrent values
- In Riak concurrent values are called as siblings
- Resolving merge conflict is same as in Multi-Leader Replication
- This is error-prone, Hence Riak supports CRDTs datatypes: Automatic Conflict Resolution
- Algorithm
- Each key has a version associated with it
- Each write causes a generation of new version in the DB
- Each time Client writes, it attaches a version against which it wants to update
- For any key, If the client version >= server version, then all those keys are overwritten in a new key with the generated version
- For any key, If the client version < server version, it is considered concurrent, and is not touched
- Each writes give a response:
- List of all the keys with different versions
- Latest version generated in the DB
- In the next write, Client must merge all the keys into one key-value along with the last version read
- Example:
- Server Initially:
{} - Client 1: add
milk- set
cart = [milk], (implicit v0) - Server: generate a new version v1
- Server saves :
{ v1: [milk] }
- set
- Client 2: add
eggs- set
cart = [eggs](implicit v0) - Server: generate a new version v2
- Server saves:
{ v1: [milk], v2: [eggs] }
- set
- Client 1: add
flour- add to last response:
{ v1: [milk] } - set
cart = [milk, flour], v1 - Server: v1 already exists, overwrite and generate a new version: v3
- Server saves:
{ v3: [milk, flour], v2: [eggs] }
- add to last response:
- Client 2: add
butter- resolve merge conflict, in last response:
{ v1: [milk], v2: [eggs] } - set
cart = [milk, eggs, butter], v2 - Server: v2 already exists, overwrite and generate: v4
- Server saves:
{ v3: [milk, flour], v4: [milk, eggs, butter] }
- resolve merge conflict, in last response:
- Server Initially:
Version Vector
- The above Algorithm does not work for multiple replicas
- Instead of having version per key, We need to have version per replica per key
- Collection of version numbers from all replicas is called version vector
- Riak encodes it in a string called Causal Context
- In the Client response we will get version vector instead of single version
Monitoring Staleness
- For Leader-Based Replication, DB exposes metrics for Replication Lag
- Replication Lag = Leader current position - Follower current position
- For Leaderless Replication, Writes are not in particular order
- Hence, difficult to measure replica staleness
Comparison
| Writes go to | Tolerating Node Failure | Suitable for Multi-Datacenter | Performance | Benefits | Challenges | |
|---|---|---|---|---|---|---|
| Single Leader | Leader | Failover, Choose new leader (manual/auto) | No | Writes: High | Moderate Availability | Failover downtime, |
| Multi-Leader | Any Leader | Route to another leader | Yes | Writes: High | High Availability | Write Conflicts |
| Leaderless | Multiple Replicas | Strict Quorums, Sloppy Quorums and Hinted Handoff | Yes | Reads/Writes: Moderate??? | Very High Availability, Low Latency, Tolerate Slow Reads | Stale Reads |
Limits of Replica Numbers
- Read Replicas have limits, woaaah!!!
💡 Rule of thumb from real deployments (ChatGPT)
- High read, low write
- More replicas (e.g., 3–10), scale horizontally.
- High write, low read
- Keep fewer replicas (2–3) to reduce replication lag.
- Mission critical
- Minimum 3 total nodes for quorum (HA).