Partitioning

  • aka Sharding
  • Sharding is a horizontal partition of data in a database
  • Sharding needs to be based on something like userId
  • We can shard a shard if shard grows big, aka Hierarchical Sharding
  • We can perform indexing on each shard based on different column to make query faster
  • We use master-slave architecture for shards. Writes go to master, reads are distributed to all the slaves. If master fails, one of the slave becomes a master
  • Uses:
    • Improves Scalability
  • Notes:
    • NoSQL already implements this concept from behind, we don’t need to worry about it
    • Sharding should be considered if really necessary, otherwise improving query and indexing is good optimization methods
  • Topics:
    • How do you perform JOINS on different shards?
    • Having dynamic number of shards is challenging, we can use hierarchical sharding to mitigate this limitation
    • It can be done by consistent hashing?
    • how about transactions??

Types of database partitioning

  • Horizontal Partitioning: Partitioning based on rows
  • Vertical Partitioning: Partitioning based on columns

Partition and Replication

  • Partition is generally combined with Replication
  • A node may store more than one partition
  • An example could be where each partition’s leader is assigned to one node
    • partition’s followers are assigned other nodes

Partitioning Strategies

  • If partitioning is unfair, partitions become skewed
    • When a partition has too much data/load it becomes hotspot
  • DBs cannot automatically fix skewed workload
  • Responsibility of the application to reduce the skewed load
  • Partitioning by random number
    • Not useful since we can’t figure out which key maps to partition during read

Partitioning by Key Range

  • Similar to Volumes of Encyclopedia
  • Good for range queries
  • Partition boundary is chosen manually or automatically
  • Certain access patterns can cause hotspots
  • Used by
    • Google Bigtable
    • Apache HBase (Bigtable Open Source)
    • RethinkDB
    • MongoDB

Partitioning by Hash of Key

  • Good Hash Function makes skewed data uniformly distributed
  • Each partition = range of hashes
    • Should not use hash mod N
    • Since if N changes most of keys will be redistributed
  • Partitioning boundaries can be evenly spaced
  • Range Queries are inefficient and may hit all the partitions
    • Cassandra uses compound primary key
      • First part of key = choose partition
      • Other part of key = use as index in the partition
      • Example: (userId, update_timestamp)
        • efficient range scan for timestamp of given user
  • Examples:
    • MongoDB MD5
    • Cassandra Murmur3
    • Voldemort Fowler-Noll-Vo

Partitioning Secondary Indexes

  • Difficult to partition secondary indexes
  • Secondary Indexes are useful for Search DBs like Solr and Elasticsearch
  • Types:
    • Document based
    • Term based

Document Based

  • aka local index
  • Each Partition maintains its own secondary index
  • Covers only the documents in that partition
  • If your query needs to fetch and retrieve from all partitions then it is called scatter/gather
    • Makes queries expensive
    • Prone to tail latency amplification
  • Examples:
    • Cassandra
    • Elasticsearch
    • SolrCloud
    • VoltDB

Term Based

  • aka global index
  • The index itself needs to be partitioned
  • The term itself determines the partition of the index
  • Can help avoid scatter/gather
  • Disadvantages
    • Writes are slow and complicated
    • Update to secondary indexes are often async
  • Partitioning of Index can be done
    • using key range
    • using hash function
  • Examples
    • Amazon DynamoDB
    • Riak’s search
    • Oracle Data Warehouse

Rebalancing Partitions

  • The process of moving load from one node to another node in the cluster
  • Requirements
    • Fairness: after rebalancing, load shared evenly
    • Availability: DB keeps serving reads/writes during process
    • Efficiency: move only necessary data between nodes

Rebalancing Strategies

Fixed Number of partitions

  • Create many Number of partitions >> Number of Nodes
    • Number of nodes are limited to number of partitions, Hence high number of partitions is chosen
    • 10 nodes maybe split into 1000 partitions
      • ~ 100 partitions assigned to each node
  • If new node added, it can steal partitions from each of the nodes
    • Only partitions are moved between nodes
    • No Change: Number of partitions, assignment of keys to partitions
    • Operationally simpler
  • Size of each partition is proportional to the size of dataset
  • Node with better hardware can store more partitions
  • Challenges
    • If dataset size is highly variable, then choosing right Number of partitions and right size of partitions is difficult
    • If high number chosen, it incurs overhead of management
    • If very large partitions chosen, then rebalancing and recovery is expensive
  • Used in:
    • Riak
    • Elasticsearch
    • Couchbase
    • Voldemort

Dynamic Partitioning

  • If partition grows to exceed configured size, it is split
  • If partition shrinks below configured size, partitions get merged
  • Number of partitions is proportional to size of dataset
  • Empty Data causes single partition/single node to be utilized until it exceeds threshold
    • HBase & MongoDB allows initial set of partitions on empty database
    • aka pre-splitting
  • In HBase the default configured threshold for splitting is 10GB
  • Used in Key-range & Hash partitioned data
  • Used in:
    • HBase
    • MongoDB

Partitioning proportional to nodes

  • Number of partitions proportional to the number of nodes
  • Hence, Keep Fixed Number of partitions per node
    • If you keep same number of nodes, partition grows proportionally to the dataset size
    • If you increase number of nodes, partition becomes smaller again
  • When node is added, randomly partitions are chosen and split and owned
  • Cassandra has 256 partitions per node by default
  • Used in Hash partitioned data because of randomization
  • Used in:
    • Cassandra
    • Ketama

Automatic vs Manual Rebalancing

  • Fully automated Rebalancing can be unpredictable
    • Can be dangerous with automatic failure detection
    • Example: overloaded node can be considered dead, and trigger rebalancing to cause cascading failures (chain reaction of failures)
  • Semi-Manual Rebalancing
    • System suggests rebalancing
    • Admin approves and takes action
    • Examples
      • Couchbase
      • Riak
      • Voldemort

Request Routing

  • Separate coordination service may be required like Zookeeper to keep track of cluster metadata
  • Zookeeper keeps map of partitions to nodes
  • Routing Tier subscribes to Zookeeper
  • Zookeeper keep track of node updates and send it to routing tier
  • HBase, SolrCloud and Kafka use Zookeeper
  • MongoDb has its own config server and mongos daemons
  • Cassandra and Riak uses gossip protocol
  • Couchbase does not rebalance automatically
    • routing tier: moxi