Database Replication

  • Leader Based Replication
    • Single-Leader
      • aka Leader-Follower or Master-Slave or Active-Passive
    • Multi-Leader
      • aka Master-Master or Active-Active or Peer to Peer
      • Consensus Algorithm
        • Paxos
  • Leaderless Replication
  • Single Leader is useful to ensure no write conflicts
  • Multi-Leader and Leaderless Replication increases write throughput but can cause write conflicts
  • Uses
    • Fault tolerance
    • Improve latency with Regional strategy
    • Helps in Scaling
  • https://stackoverflow.com/questions/3736969/master-master-vs-master-slave-database-architecture

Redundancy vs Replication

  • Redundancy is process of duplication
    • Duplication of critical components or functions of a system to increase reliability
    • Achieves
      • Create Backup data
      • Improve actual performance
  • Replication is process of keeping the sync
    • Involves sharing information to ensure consistency between redundant resources such as multiple databases
    • Improve reliability, fault-tolerance, or accessibility

Synchronous vs Asynchronous Replication

  • Sync
    • Guarantees up-to-date copy in followers
    • If Follower fails, Leader must block all writes and wait until it is back again
  • Semi-Sync
    • One of the follower is replicated in sync
    • If the replica fails, one of the async replica is made a sync replica
  • Async
    • Writes are not durable
    • If followers fail, it can continue processing writes

Leader-Follower Replication

  • aka Master-Slave, Single Leader
  • Writes go to Leader
    • Sends data change to all of its followers
  • Reads go to either Leader or one of its Followers
    • aka Replica
    • Each followers takes the data change and apply on its DB
  • Ensures no write conflicts
  • Examples
    • PostgresQL WAL
    • MySQL binlog (Binary Log)
    • MSSQL Transaction Log
    • MongoDB
    • Kafka
    • RabbitMQ

Adding a new Follower

  • Take consistent snapshot of Leader without taking lock on DB
  • Copy snapshot to the new follower node
  • Follower connects with the leader
    • Snapshot taken earlier is associated with a position in leader’s replication log
      • Postgres: Log sequence numbers
      • MySQL: binlog coordinates
    • Requests all data changes from Leader since the snapshot was taken
  • When Follower is done taking all data changes, we say follower has caught up

Failures

  • Failing Follower
    • Follower keeps log of all data changes
    • When follower recovers, it connects with leader and request changes since the outage
  • Failing Leader
    • aka Failover
      • Determine the Leader has died
        • timeouts can be used, for example 30s
      • Choosing a new Leader
        • A new leader is elected (consensus)
        • Generally the best candidate which is most up-to-date with the leader
      • Reconfigure new Leader
        • Route writes to new leader
    • The writes the only went to Leader but not replicated can cause conflict if old leader comes back online
      • generally writes are discarded
    • Avoid Split brain where two nodes believe that they are leaders
      • can cause conflicts and data loss
    • Choosing right timeout value is difficult

Replication Mechanism

  • Statement based
    • Keep all write statements in the log
    • Issues:
      • If the statements are non-deterministic, like NOW() (current time), RAND() (random number)
      • If using auto-incrementing column
      • If side effects like: triggers, stored procedures
    • Uses
      • VoltDB uses this, but requires Tx to be deterministic
  • Write-Ahead Log (WAL)
    • In case of LSM based index, log itself stores the data
    • In case of B-Tree, every modification is written to WAL, so that index can be restored in case of crash
    • Issues
      • Contains data of which bytes changed in which disk block
      • Typically not possible to use different storage engines/versions in Leader and Followers
    • Uses
      • PostgresQL WAL
      • MySQL binlog (Binary Log)
      • MSSQL Transaction Log
      • MongoDB oplog
      • Kafka
    • https://www.architecture-weekly.com/p/the-write-ahead-log-a-foundation
  • Row (Logical) based
    • aka Logical Log
    • Uses different log formats for replication and storage
    • Sequence of records capturing modifications at row level
    • Can be kept backward compatible
    • Easier to parse by external systems using CDC (Change Data Capture)
      • Data warehouses
      • Custom Indexes and Caches
    • Uses
      • MySQL binlog with Logical log configured
  • Trigger based
    • Implemented at the application level
    • Prone to bugs and overheads
    • Implemented by
      • Triggers
      • Stored Procedures
      • Oracle Golden Gate

Replication Lag Problems

  • Reading your own writes
    • aka read-after-write consistency
    • If user modified, read from leader
      • does not work if lot of modifications
      • for example: modifying user’s profile
    • Wait for some-time or monitor the replication lag
      • read from leader for some time, then read from followers
    • client can remember the timestamp of write, read only when replication is done
      • timestamp can be logical timestamp
    • cross browser/devices read-after-write consistency can be challenging
      • possible that different devices are routed to different datacenters
  • Monotonic Reads
    • If user makes several reads of an object then he should not see time go backwards
    • It can happen when several reads are across replicas
    • Guarantee: Eventual Consistency < Monotonic Reads < Strong Consistency
    • Make sure that user always read from same replica: hashing based on user ID
  • Consistent Prefix Reads
    • If write happens on multiple objects then he should not see them out of order when reading
    • Make sure any writes that are causally related are written on same partition

Multi-Leader Replication

  • aka Master-Master, Peer to Peer
  • We keep a leader in each datacenter
  • Each Write goes to local datacenter, and are replicated to local followers as well as other datacenters in async
  • No failover needed, request can be routed to leader in a different datacenter
  • Biggest issue is Write conflicts b/w different datacenters
  • Use cases
    • Clients with offline operation
      • writes are saved locally in the device while offline
      • It acts as a leader which syncs with the datacenter when internet is available
      • Apache CouchDB is made for this purpose
    • Realtime Collaborative Editing
      • application can obtain lock, make changes and release lock = single leader
      • Avoid locking, but handle conflicts = multi leader
  • Communication is handled by Consensus algorithms
  • Examples
    • MySQL circular replication
    • Apache CouchDB multi-master
    • MSSQL Peer to Peer replication
  • Used
    • LDAP servers like Active Directory and OpenLDAP
  • https://en.wikipedia.org/wiki/Multi-master_replication

Multi-Leader Write Conflicts

  • Ensure all writes to particular object goes to same leader
    • Each user writes to single datacenter, essentially making it into single leader
    • If user moves to different location, or datacenter fails, conflicts resolution needed
  • Convergent Conflict Resolution
    • All datacenters should converge to same state
    • Each write has unique ID, write with highest ID is winner
      • prone to data loss
    • Each replica has unique ID, writes on higher numbered replica wins
      • prone to data loss
    • Merge and concatenate values, example: “B/C”
    • Handle at application level, and let user decide on the conflicts
  • Custom Conflict Resolution
    • When write conflicts happen, conflict handler stores them
    • When data is read next time, user is presented with the option to handle conflict
    • Example: Apache CouchDB

Automatic Conflict Resolution

  • CRDT: Conflict free Replicated Datatypes
    • family of data structures like sets, maps etc. which multiple users write to
    • It automatically handles conflict resolution
    • 2-way merge function
  • Mergable Persistent Data structures
    • tracks history of writes
    • 3-way merge function
  • Operational Transformation
    • Algorithm used by Google Docs

Multi-Leader Topologies

  • Circular, Star, All to All
  • MySQL only supports circular

Leaderless Replication

  • aka Dynamo-style
  • Amazon DynamoDB repopularized the leaderless replication
  • Writes go to all DBs and reads come from all DBs
  • Failover does not exist
  • It is application’s responsibility to recover from errors
  • Read Repair
    • Client makes read from several nodes in parallel
    • It can detect stale responses, for example one replica gives version 5, another version 6
    • If client sees the stale values it writes newer values back
    • Works well for frequently read data
  • Anti-entropy process
    • background process that constantly look for differences in replicas
    • It copies missing data in replicas
    • Writes are not copied in particular order
    • Without this, if values are rarely read, it will have reduced durability
  • Useful for
    • High Availability (Tolerate Network Interruptions)
    • Low Latency (Tolerate Latency Spikes)
    • Tolerate occasional stale reads
    • Tolerate conflicting writes
    • Tolerating Slow replicas
  • Examples
    • Amazon DynamoDB
    • Cassandra
    • Riak
    • Voldemort
    • ScyllaDB

Quorums

  • Strict Quorum
    • Assuming:
      • = number of replicas
      • = number of nodes to wait for write
      • = number of nodes to wait for read
    • Then
      • , then we can expect up to date read
    • Read and Writes which follow the above are called quorum reads and writes
    • The parameters , and are configurable in the DB
    • Generally
      • is chosen to be odd
      • we choose
      • Max number of nodes we can tolerate to be unavailable =
        • we can tolerate loss of 1 node
        • we can tolerate loss of 2 nodes
      • If more nodes become unavailable then, reads and writes throw error
  • Issues with Quorums
    • Sloppy Quorum
    • Concurrent Write causes conflicts
    • Concurrent Write and Read may cause stale read because write is not done on all nodes
    • If less than writes happen, then
      • overall write fails
      • writes are not rollback on nodes where it succeeded
      • read may or may not return the that write
    • If node with new value fails, and it recovers from node with old value, it will violate the writes, and can cause stale read
    • No guarantees for:
      • reading your write
      • monotonic read
      • consistent prefix read
  • Sloppy Quorum
    • During network interruption, writes and reads require and successful responses, but may include nodes that are not home nodes
    • Once n/w interruption fixed, any writes to temporary node is sent to appropriate home nodes, this is called Hinted Handoff
    • It is not actually a quorum, but an assurance of durability that data is written somewhere
    • Examples
      • Riak: enabled by default
      • Cassandra, Voldemort: disabled by default

Leaderless: Multi-Datacenter

  • Leaderless replication is suitable also
  • Writes go to each replica in all datacenters
    • Quorum is taken from only replicas from local datacenters
    • = total number of replicas
    • Example: Cassandra, Voldemort
  • Writes only go to each replica in local datacenter
    • Replicated to other datacenters in async
    • Quorum is taken from local datacenter replicas
    • Example: Riak

Concurrent Writes

  • Conflicts can happen:
    • Strict Quorums
      • Keys are overwritten
    • Sloppy Quorums
      • Read Repair
      • Hinted Handoff
  • Last Write Wins (LWW)
    • Attach timestamp to each write
    • In case of conflicts, always pick the latest write
    • achieves eventual convergence
    • It causes lost writes (less durable)
    • If losing data is not acceptable, LWW is a poor choice
    • Use cases
      • Caching
    • Cassandra only supports LWW
      • we generally set UUID as a key instead
      • Hence each write becomes unique
    • Riak supports LWW but is optional
  • Happens-Before Relationship
    • An operation A happens-before another operation B, if B knows about A, or depends on A, or builds upon A in some way
    • An operation A is concurrent with operation B, if neither knows about the other
    • For any two operations A and B, one of the following must hold:
      • A happened before B
      • B happened before A
      • A and B are concurrent
    • Riak supports it

Happens-Before Relationship: Causal Context

  • https://docs.riak.com/riak/kv/latest/developing/usage/conflict-resolution/index.html
  • https://docs.riak.com/riak/kv/2.2.3/developing/usage/conflict-resolution/java.1.html
  • Better than LWW since no lost data, but client has to resolve merge conflict between concurrent values
    • In Riak concurrent values are called as siblings
    • Resolving merge conflict is same as in Multi-Leader Replication
    • This is error-prone, Hence Riak supports CRDTs datatypes: Automatic Conflict Resolution
  • Algorithm
    • Each key has a version associated with it
    • Each write causes a generation of new version in the DB
    • Each time Client writes, it attaches a version against which it wants to update
      • For any key, If the client version >= server version, then all those keys are overwritten in a new key with the generated version
      • For any key, If the client version < server version, it is considered concurrent, and is not touched
    • Each writes give a response:
      • List of all the keys with different versions
      • Latest version generated in the DB
    • In the next write, Client must merge all the keys into one key-value along with the last version read
  • Example:
    • Server Initially: {}
    • Client 1: add milk
      • set cart = [milk], (implicit v0)
      • Server: generate a new version v1
      • Server saves : { v1: [milk] }
    • Client 2: add eggs
      • set cart = [eggs] (implicit v0)
      • Server: generate a new version v2
      • Server saves: { v1: [milk], v2: [eggs] }
    • Client 1: add flour
      • add to last response: { v1: [milk] }
      • set cart = [milk, flour], v1
      • Server: v1 already exists, overwrite and generate a new version: v3
      • Server saves: { v3: [milk, flour], v2: [eggs] }
    • Client 2: add butter
      • resolve merge conflict, in last response: { v1: [milk], v2: [eggs] }
      • set cart = [milk, eggs, butter], v2
      • Server: v2 already exists, overwrite and generate: v4
      • Server saves: { v3: [milk, flour], v4: [milk, eggs, butter] }

Version Vector

  • The above Algorithm does not work for multiple replicas
  • Instead of having version per key, We need to have version per replica per key
  • Collection of version numbers from all replicas is called version vector
    • Riak encodes it in a string called Causal Context
  • In the Client response we will get version vector instead of single version

Monitoring Staleness

  • For Leader-Based Replication, DB exposes metrics for Replication Lag
    • Replication Lag = Leader current position - Follower current position
  • For Leaderless Replication, Writes are not in particular order
    • Hence, difficult to measure replica staleness

Comparison

Writes go toTolerating Node FailureSuitable for Multi-DatacenterPerformanceBenefitsChallenges
Single LeaderLeaderFailover, Choose new leader
(manual/auto)
NoWrites: HighModerate AvailabilityFailover downtime,
Multi-LeaderAny LeaderRoute to another leaderYesWrites: HighHigh AvailabilityWrite Conflicts
LeaderlessMultiple ReplicasStrict Quorums,

Sloppy Quorums and Hinted Handoff
YesReads/Writes: Moderate???Very High Availability,

Low Latency,

Tolerate Slow Reads
Stale Reads

Limits of Replica Numbers

  • Read Replicas have limits, woaaah!!!

💡 Rule of thumb from real deployments (ChatGPT)

  • High read, low write
    • More replicas (e.g., 3–10), scale horizontally.
  • High write, low read
    • Keep fewer replicas (2–3) to reduce replication lag.
  • Mission critical
    • Minimum 3 total nodes for quorum (HA).

Articles