Database Replication

Leader Based Replication
- Single-Leader
  - aka Leader-Follower or Master-Slave or Active-Passive
- Multi-Leader
  - aka Master-Master or Active-Active or Peer to Peer
  - Consensus Algorithm
    - Paxos
Leaderless Replication
Single Leader is useful to ensure no write conflicts
Multi-Leader and Leaderless Replication increases write throughput but can cause write conflicts
Uses
- Fault tolerance
- Improve latency with Regional strategy
- Helps in Scaling
https://stackoverflow.com/questions/3736969/master-master-vs-master-slave-database-architecture

Redundancy vs Replication

Redundancy is process of duplication
- Duplication of critical components or functions of a system to increase reliability
- Achieves
  - Create Backup data
  - Improve actual performance
Replication is process of keeping the sync
- Involves sharing information to ensure consistency between redundant resources such as multiple databases
- Improve reliability, fault-tolerance, or accessibility

Synchronous vs Asynchronous Replication

Sync
- Guarantees up-to-date copy in followers
- If Follower fails, Leader must block all writes and wait until it is back again
Semi-Sync
- One of the follower is replicated in sync
- If the replica fails, one of the async replica is made a sync replica
Async
- Writes are not durable
- If followers fail, it can continue processing writes

Leader-Follower Replication

aka Master-Slave, Single Leader
Writes go to Leader
- Sends data change to all of its followers
Reads go to either Leader or one of its Followers
- aka Replica
- Each followers takes the data change and apply on its DB
Ensures no write conflicts
Examples
- PostgresQL WAL
- MySQL binlog (Binary Log)
- MSSQL Transaction Log
- MongoDB
- Kafka
- RabbitMQ

Adding a new Follower

Take consistent snapshot of Leader without taking lock on DB
Copy snapshot to the new follower node
Follower connects with the leader
- Snapshot taken earlier is associated with a position in leader’s replication log
  - Postgres: Log sequence numbers
  - MySQL: binlog coordinates
- Requests all data changes from Leader since the snapshot was taken
When Follower is done taking all data changes, we say follower has caught up

Failures

Failing Follower
- Follower keeps log of all data changes
- When follower recovers, it connects with leader and request changes since the outage
Failing Leader
- aka Failover
  - Determine the Leader has died
    - timeouts can be used, for example 30s
  - Choosing a new Leader
    - A new leader is elected (consensus)
    - Generally the best candidate which is most up-to-date with the leader
  - Reconfigure new Leader
    - Route writes to new leader
- The writes the only went to Leader but not replicated can cause conflict if old leader comes back online
  - generally writes are discarded
- Avoid Split brain where two nodes believe that they are leaders
  - can cause conflicts and data loss
- Choosing right timeout value is difficult

Replication Mechanism

Statement based
- Keep all write statements in the log
- Issues:
  - If the statements are non-deterministic, like NOW() (current time), RAND() (random number)
  - If using auto-incrementing column
  - If side effects like: triggers, stored procedures
- Uses
  - VoltDB uses this, but requires Tx to be deterministic
Write-Ahead Log (WAL)
- In case of LSM based index, log itself stores the data
- In case of B-Tree, every modification is written to WAL, so that index can be restored in case of crash
- Issues
  - Contains data of which bytes changed in which disk block
  - Typically not possible to use different storage engines/versions in Leader and Followers
- Uses
  - PostgresQL WAL
  - MySQL binlog (Binary Log)
  - MSSQL Transaction Log
  - MongoDB oplog
  - Kafka
- https://www.architecture-weekly.com/p/the-write-ahead-log-a-foundation
Row (Logical) based
- aka Logical Log
- Uses different log formats for replication and storage
- Sequence of records capturing modifications at row level
- Can be kept backward compatible
- Easier to parse by external systems using CDC (Change Data Capture)
  - Data warehouses
  - Custom Indexes and Caches
- Uses
  - MySQL binlog with Logical log configured
Trigger based
- Implemented at the application level
- Prone to bugs and overheads
- Implemented by
  - Triggers
  - Stored Procedures
  - Oracle Golden Gate

Replication Lag Problems

Reading your own writes
- aka read-after-write consistency
- If user modified, read from leader
  - does not work if lot of modifications
  - for example: modifying user’s profile
- Wait for some-time or monitor the replication lag
  - read from leader for some time, then read from followers
- client can remember the timestamp of write, read only when replication is done
  - timestamp can be logical timestamp
- cross browser/devices read-after-write consistency can be challenging
  - possible that different devices are routed to different datacenters
Monotonic Reads
- If user makes several reads of an object then he should not see time go backwards
- It can happen when several reads are across replicas
- Guarantee: Eventual Consistency < Monotonic Reads < Strong Consistency
- Make sure that user always read from same replica: hashing based on user ID
Consistent Prefix Reads
- If write happens on multiple objects then he should not see them out of order when reading
- Make sure any writes that are causally related are written on same partition

Multi-Leader Replication

aka Master-Master, Peer to Peer
We keep a leader in each datacenter
Each Write goes to local datacenter, and are replicated to local followers as well as other datacenters in async
No failover needed, request can be routed to leader in a different datacenter
Biggest issue is Write conflicts b/w different datacenters
Use cases
- Clients with offline operation
  - writes are saved locally in the device while offline
  - It acts as a leader which syncs with the datacenter when internet is available
  - Apache CouchDB is made for this purpose
- Realtime Collaborative Editing
  - application can obtain lock, make changes and release lock = single leader
  - Avoid locking, but handle conflicts = multi leader
Communication is handled by Consensus algorithms
Examples
- MySQL circular replication
- Apache CouchDB multi-master
- MSSQL Peer to Peer replication
Used
- LDAP servers like Active Directory and OpenLDAP
https://en.wikipedia.org/wiki/Multi-master_replication

Multi-Leader Write Conflicts

Ensure all writes to particular object goes to same leader
- Each user writes to single datacenter, essentially making it into single leader
- If user moves to different location, or datacenter fails, conflicts resolution needed
Convergent Conflict Resolution
- All datacenters should converge to same state
- Each write has unique ID, write with highest ID is winner
  - prone to data loss
- Each replica has unique ID, writes on higher numbered replica wins
  - prone to data loss
- Merge and concatenate values, example: “B/C”
- Handle at application level, and let user decide on the conflicts
Custom Conflict Resolution
- When write conflicts happen, conflict handler stores them
- When data is read next time, user is presented with the option to handle conflict
- Example: Apache CouchDB

Automatic Conflict Resolution

CRDT: Conflict free Replicated Datatypes
- family of data structures like sets, maps etc. which multiple users write to
- It automatically handles conflict resolution
- 2-way merge function
Mergable Persistent Data structures
- tracks history of writes
- 3-way merge function
Operational Transformation
- Algorithm used by Google Docs

Multi-Leader Topologies

Circular, Star, All to All
MySQL only supports circular

Leaderless Replication

aka Dynamo-style
Amazon DynamoDB repopularized the leaderless replication
Writes go to all DBs and reads come from all DBs
Failover does not exist
It is application’s responsibility to recover from errors
Read Repair
- Client makes read from several nodes in parallel
- It can detect stale responses, for example one replica gives version 5, another version 6
- If client sees the stale values it writes newer values back
- Works well for frequently read data
Anti-entropy process
- background process that constantly look for differences in replicas
- It copies missing data in replicas
- Writes are not copied in particular order
- Without this, if values are rarely read, it will have reduced durability
Useful for
- High Availability (Tolerate Network Interruptions)
- Low Latency (Tolerate Latency Spikes)
- Tolerate occasional stale reads
- Tolerate conflicting writes
- Tolerating Slow replicas
Examples
- Amazon DynamoDB
- Cassandra
- Riak
- Voldemort
- ScyllaDB

Quorums

Strict Quorum
- Assuming:
  - $n$ = number of replicas
  - $w$ = number of nodes to wait for write
  - $r$ = number of nodes to wait for read
- Then
  - $w + r > n$ , then we can expect up to date read
- Read and Writes which follow the above are called quorum reads and writes
- The parameters $n$ , $w$ and $r$ are configurable in the DB
- Generally
  - $n$ is chosen to be odd
  - we choose $w = r = ⌈ \frac{n + 1}{2} ⌉$
  - Max number of nodes we can tolerate to be unavailable = $n - max (w, r)$
  - $n = 3 ⟹ w = r = 2$
    - we can tolerate loss of 1 node
  - $n = 5 ⟹ w = r = 3$
    - we can tolerate loss of 2 nodes
  - If more nodes become unavailable then, reads and writes throw error
Issues with Quorums
- Sloppy Quorum
- Concurrent Write causes conflicts
- Concurrent Write and Read may cause stale read because write is not done on all nodes
- If less than $w$ writes happen, then
  - overall write fails
  - writes are not rollback on nodes where it succeeded
  - read may or may not return the that write
- If node with new value fails, and it recovers from node with old value, it will violate the $w$ writes, and can cause stale read
- No guarantees for:
  - reading your write
  - monotonic read
  - consistent prefix read
Sloppy Quorum
- During network interruption, writes and reads require $w$ and $r$ successful responses, but may include nodes that are not home nodes
- Once n/w interruption fixed, any writes to temporary node is sent to appropriate home nodes, this is called Hinted Handoff
- It is not actually a quorum, but an assurance of durability that data is written somewhere
- Examples
  - Riak: enabled by default
  - Cassandra, Voldemort: disabled by default

Leaderless: Multi-Datacenter

Leaderless replication is suitable also
Writes go to each replica in all datacenters
- Quorum is taken from only replicas from local datacenters
- $n$ = total number of replicas
- Example: Cassandra, Voldemort
Writes only go to each replica in local datacenter
- Replicated to other datacenters in async
- Quorum is taken from local datacenter replicas
- Example: Riak

Concurrent Writes

Conflicts can happen:
- Strict Quorums
  - Keys are overwritten
- Sloppy Quorums
  - Read Repair
  - Hinted Handoff
Last Write Wins (LWW)
- Attach timestamp to each write
- In case of conflicts, always pick the latest write
- achieves eventual convergence
- It causes lost writes (less durable)
- If losing data is not acceptable, LWW is a poor choice
- Use cases
  - Caching
- Cassandra only supports LWW
  - we generally set UUID as a key instead
  - Hence each write becomes unique
- Riak supports LWW but is optional
Happens-Before Relationship
- An operation A happens-before another operation B, if B knows about A, or depends on A, or builds upon A in some way
- An operation A is concurrent with operation B, if neither knows about the other
- For any two operations A and B, one of the following must hold:
  - A happened before B
  - B happened before A
  - A and B are concurrent
- Riak supports it

Happens-Before Relationship: Causal Context

https://docs.riak.com/riak/kv/latest/developing/usage/conflict-resolution/index.html
https://docs.riak.com/riak/kv/2.2.3/developing/usage/conflict-resolution/java.1.html
Better than LWW since no lost data, but client has to resolve merge conflict between concurrent values
- In Riak concurrent values are called as siblings
- Resolving merge conflict is same as in Multi-Leader Replication
- This is error-prone, Hence Riak supports CRDTs datatypes: Automatic Conflict Resolution
Algorithm
- Each key has a version associated with it
- Each write causes a generation of new version in the DB
- Each time Client writes, it attaches a version against which it wants to update
  - For any key, If the client version >= server version, then all those keys are overwritten in a new key with the generated version
  - For any key, If the client version < server version, it is considered concurrent, and is not touched
- Each writes give a response:
  - List of all the keys with different versions
  - Latest version generated in the DB
- In the next write, Client must merge all the keys into one key-value along with the last version read
Example:
- Server Initially: {}
- Client 1: add milk
  - set cart = [milk], (implicit v0)
  - Server: generate a new version v1
  - Server saves : { v1: [milk] }
- Client 2: add eggs
  - set cart = [eggs] (implicit v0)
  - Server: generate a new version v2
  - Server saves: { v1: [milk], v2: [eggs] }
- Client 1: add flour
  - add to last response: { v1: [milk] }
  - set cart = [milk, flour], v1
  - Server: v1 already exists, overwrite and generate a new version: v3
  - Server saves: { v3: [milk, flour], v2: [eggs] }
- Client 2: add butter
  - resolve merge conflict, in last response: { v1: [milk], v2: [eggs] }
  - set cart = [milk, eggs, butter], v2
  - Server: v2 already exists, overwrite and generate: v4
  - Server saves: { v3: [milk, flour], v4: [milk, eggs, butter] }

Version Vector

The above Algorithm does not work for multiple replicas
Instead of having version per key, We need to have version per replica per key
Collection of version numbers from all replicas is called version vector
- Riak encodes it in a string called Causal Context
In the Client response we will get version vector instead of single version

Monitoring Staleness

For Leader-Based Replication, DB exposes metrics for Replication Lag
- Replication Lag = Leader current position - Follower current position
For Leaderless Replication, Writes are not in particular order
- Hence, difficult to measure replica staleness

Comparison

	Writes go to	Tolerating Node Failure	Suitable for Multi-Datacenter	Performance	Benefits	Challenges
Single Leader	Leader	Failover, Choose new leader (manual/auto)	No	Writes: High	Moderate Availability	Failover downtime,
Multi-Leader	Any Leader	Route to another leader	Yes	Writes: High	High Availability	Write Conflicts
Leaderless	Multiple Replicas	Strict Quorums, Sloppy Quorums and Hinted Handoff	Yes	Reads/Writes: Moderate???	Very High Availability, Low Latency, Tolerate Slow Reads	Stale Reads

Experiments

Explorer

Database_Replication

Database Replication

Redundancy vs Replication

Synchronous vs Asynchronous Replication

Leader-Follower Replication

Adding a new Follower

Failures

Replication Mechanism

Replication Lag Problems

Multi-Leader Replication

Multi-Leader Write Conflicts

Automatic Conflict Resolution

Multi-Leader Topologies

Leaderless Replication

Quorums

Leaderless: Multi-Datacenter

Concurrent Writes

Happens-Before Relationship: Causal Context

Version Vector

Monitoring Staleness

Comparison

Limits of Replica Numbers

💡 Rule of thumb from real deployments (ChatGPT)

Articles

Table of Contents