Partitioning

aka Sharding

Sharding is a horizontal partition of data in a database
Sharding needs to be based on something like userId
We can shard a shard if shard grows big, aka Hierarchical Sharding
We can perform indexing on each shard based on different column to make query faster
We use master-slave architecture for shards. Writes go to master, reads are distributed to all the slaves. If master fails, one of the slave becomes a master
Uses:
- Improves Scalability
Notes:
- NoSQL already implements this concept from behind, we don’t need to worry about it
- Sharding should be considered if really necessary, otherwise improving query and indexing is good optimization methods
Topics:
- How do you perform JOINS on different shards?
- Having dynamic number of shards is challenging, we can use hierarchical sharding to mitigate this limitation
- It can be done by consistent hashing?
- how about transactions??

Types of database partitioning

Horizontal Partitioning: Partitioning based on rows
Vertical Partitioning: Partitioning based on columns

Partition and Replication

Partition is generally combined with Replication
A node may store more than one partition
An example could be where each partition’s leader is assigned to one node
- partition’s followers are assigned other nodes

Partitioning Strategies

If partitioning is unfair, partitions become skewed
- When a partition has too much data/load it becomes hotspot
DBs cannot automatically fix skewed workload
Responsibility of the application to reduce the skewed load
Partitioning by random number
- Not useful since we can’t figure out which key maps to partition during read

Partitioning by Key Range

Similar to Volumes of Encyclopedia
Good for range queries
Partition boundary is chosen manually or automatically
Certain access patterns can cause hotspots
Used by
- Google Bigtable
- Apache HBase (Bigtable Open Source)
- RethinkDB
- MongoDB

Partitioning by Hash of Key

Good Hash Function makes skewed data uniformly distributed
Each partition = range of hashes
- Should not use hash mod N
- Since if N changes most of keys will be redistributed
Partitioning boundaries can be evenly spaced
- If chosen randomly, it is called Consistent Hashing
Range Queries are inefficient and may hit all the partitions
- Cassandra uses compound primary key
  - First part of key = choose partition
  - Other part of key = use as index in the partition
  - Example: (userId, update_timestamp)
    - efficient range scan for timestamp of given user
Examples:
- MongoDB ⇒ MD5
- Cassandra ⇒ Murmur3
- Voldemort ⇒ Fowler-Noll-Vo

Partitioning Secondary Indexes

Difficult to partition secondary indexes
Secondary Indexes are useful for Search DBs like Solr and Elasticsearch
Types:
- Document based
- Term based

Document Based

aka local index
Each Partition maintains its own secondary index
Covers only the documents in that partition
If your query needs to fetch and retrieve from all partitions then it is called scatter/gather
- Makes queries expensive
- Prone to tail latency amplification
Examples:
- Cassandra
- Elasticsearch
- SolrCloud
- VoltDB

Term Based

aka global index
The index itself needs to be partitioned
The term itself determines the partition of the index
Can help avoid scatter/gather
Disadvantages
- Writes are slow and complicated
- Update to secondary indexes are often async
Partitioning of Index can be done
- using key range
- using hash function
Examples
- Amazon DynamoDB
- Riak’s search
- Oracle Data Warehouse

Rebalancing Partitions

The process of moving load from one node to another node in the cluster
Requirements
- Fairness: after rebalancing, load shared evenly
- Availability: DB keeps serving reads/writes during process
- Efficiency: move only necessary data between nodes

Rebalancing Strategies

Fixed Number of partitions

Create many Number of partitions >> Number of Nodes
- Number of nodes are limited to number of partitions, Hence high number of partitions is chosen
- 10 nodes maybe split into 1000 partitions
  - ~ 100 partitions assigned to each node
If new node added, it can steal partitions from each of the nodes
- Only partitions are moved between nodes
- No Change: Number of partitions, assignment of keys to partitions
- Operationally simpler
Size of each partition is proportional to the size of dataset
Node with better hardware can store more partitions
Challenges
- If dataset size is highly variable, then choosing right Number of partitions and right size of partitions is difficult
- If high number chosen, it incurs overhead of management
- If very large partitions chosen, then rebalancing and recovery is expensive
Used in:
- Riak
- Elasticsearch
- Couchbase
- Voldemort

Dynamic Partitioning

If partition grows to exceed configured size, it is split
If partition shrinks below configured size, partitions get merged
Number of partitions is proportional to size of dataset
Empty Data causes single partition/single node to be utilized until it exceeds threshold
- HBase & MongoDB allows initial set of partitions on empty database
- aka pre-splitting
In HBase the default configured threshold for splitting is 10GB
Used in Key-range & Hash partitioned data
Used in:
- HBase
- MongoDB

Partitioning proportional to nodes

Number of partitions proportional to the number of nodes
Hence, Keep Fixed Number of partitions per node
- If you keep same number of nodes, partition grows proportionally to the dataset size
- If you increase number of nodes, partition becomes smaller again
When node is added, randomly partitions are chosen and split and owned
Cassandra has 256 partitions per node by default
Used in Hash partitioned data because of randomization
Used in:
- Cassandra
- Ketama

Automatic vs Manual Rebalancing

Fully automated Rebalancing can be unpredictable
- Can be dangerous with automatic failure detection
- Example: overloaded node can be considered dead, and trigger rebalancing to cause cascading failures (chain reaction of failures)
Semi-Manual Rebalancing
- System suggests rebalancing
- Admin approves and takes action
- Examples
  - Couchbase
  - Riak
  - Voldemort

Request Routing

Separate coordination service may be required like Zookeeper to keep track of cluster metadata
Zookeeper keeps map of partitions to nodes
Routing Tier subscribes to Zookeeper
Zookeeper keep track of node updates and send it to routing tier
HBase, SolrCloud and Kafka use Zookeeper
MongoDb has its own config server and mongos daemons
Cassandra and Riak uses gossip protocol
Couchbase does not rebalance automatically
- routing tier: moxi

Experiments

Explorer

Database_Partitioning

Partitioning

Types of database partitioning

Partition and Replication

Partitioning Strategies

Partitioning by Key Range

Partitioning by Hash of Key

Partitioning Secondary Indexes

Document Based

Term Based

Rebalancing Partitions

Rebalancing Strategies

Fixed Number of partitions

Dynamic Partitioning

Partitioning proportional to nodes

Automatic vs Manual Rebalancing

Request Routing

Table of Contents