Distributed System Partitioning
Partitioning is the act of breaking up a large dataset into multiple partitions. Different systems use different names for a partition:
- Shard - MongoDB, Elasticsearch, SolrCloud
- Region - HBase
- Tablet - Bigtable
- Vnode - Cassandra, Riak
Partitioning is often combined with replication, so that copies of each partition are stored on multiple nodes. The choice of partitioning scheme is usually independent of the replication scheme.
Terminology
- Skew - when some partitions have more data or queries compared to others
- Hot spot - a partition with disproportionately high load
Key Value Data Partitioning
Assumptions:
- Always access a record by its primary key
Partitioning by Key Range
- Assign a contiguous range of keys to each partition by choosing boundaries between the ranges (for example, for a string key, the ranges could be based on the first character: A - C, D - H, …)
- Combined with knowledge of which partition is assigned to which node, a request can be sent directly to the right node (see the lookup sketch below)
Notes:
- Key ranges are not necessarily evenly spaced, because the data itself may be unevenly distributed; the boundaries need to adapt to the data
- Boundaries can be chosen manually or automatically
- Keys can be sorted within the partition (SSTables, LSM Trees)
Concerns:
- Certain access patterns can lead to hot spots
- Example: partitioning by a timestamp prefix (yyyy-mm) makes the latest month a hot spot, because most reads and writes go to that partition
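A minimal sketch of the key-range lookup described above, assuming the partition boundaries are already known; the boundary values, partition count, and node names are illustrative rather than taken from any particular system:

```python
# Key-range partitioning: each partition owns a contiguous range of keys, and
# a lookup finds the owning partition by binary search over the sorted upper
# boundaries. Boundaries and node names below are illustrative.
import bisect

BOUNDARIES = ["d", "h", "p"]  # partition 0: < "d", 1: ["d","h"), 2: ["h","p"), 3: >= "p"
PARTITION_TO_NODE = {0: "node-a", 1: "node-b", 2: "node-c", 3: "node-a"}

def partition_for_key(key: str) -> int:
    # bisect_right returns the index of the first boundary greater than key,
    # which is exactly the partition whose range contains the key.
    return bisect.bisect_right(BOUNDARIES, key)

def node_for_key(key: str) -> str:
    return PARTITION_TO_NODE[partition_for_key(key)]

assert node_for_key("apple") == "node-a"   # partition 0
assert node_for_key("melon") == "node-c"   # partition 2
assert node_for_key("zebra") == "node-a"   # partition 3
```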
Partitioning by Hash of Key
- A good hash function takes skewed data and makes it roughly uniformly distributed
- Similar inputs can lead to very different hash values
- The hash function does not need to be cryptographically strong, because collisions are expected anyway (multiple keys must end up in the same partition)
- Each partition is then assigned a range of hash values instead of a range of keys (see the sketch below)
Concerns:
- Range queries on the key are no longer efficient; a compound primary key (a hashed first part plus an unhashed remainder) still allows range queries over the remainder for a fixed value of the hashed part
- Such a compound key enables a natural data model for one-to-many relationships
- Sort ordering is lost
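A minimal sketch of hash partitioning, assuming each partition owns an equal slice of a 32-bit hash space; the use of MD5 and the partition count are illustrative choices, not any specific database's scheme:

```python
# Partitioning by hash of key: a stable hash maps each key into a 32-bit
# space, and each partition owns a contiguous slice of that space.
# Python's built-in hash() is randomised per process, so a stable hash
# (here MD5) is used instead.
import hashlib

NUM_PARTITIONS = 8
HASH_SPACE = 2 ** 32

def stable_hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def partition_for_key(key: str) -> int:
    # The key's position in the hash space decides the partition,
    # not the key itself.
    return stable_hash(key) * NUM_PARTITIONS // HASH_SPACE

# Adjacent keys land on unrelated partitions: good for evening out skew,
# but key ordering (and efficient range scans) is lost.
print(partition_for_key("2024-01"), partition_for_key("2024-02"))
```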
Skew and Relieving Hot Spots
- Even with hashing, hot spots cannot be completely avoided: in the extreme case, all activity targets a single key
- Today, most database systems cannot automatically compensate for a workload skewed towards a single key, so the application has to handle the skew
- A common technique is to append a small random suffix to the hot key so that writes spread over many keys; the trade-off is that reads must then fetch and combine data from all of those keys, as sketched below
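A rough sketch of this random-suffix ("key salting") technique, assuming a known hot key and an in-memory dictionary standing in for the database; the suffix count of 16 is an arbitrary illustrative choice:

```python
# Spread writes for a known hot key by appending a small random suffix;
# reads must then gather and merge all suffixed keys.
import random

SUFFIXES = 16
store: dict[str, list[str]] = {}   # stand-in for the partitioned datastore

def write_hot_key(key: str, value: str) -> None:
    # Writes are spread across key#0 .. key#15, landing on many partitions.
    salted = f"{key}#{random.randrange(SUFFIXES)}"
    store.setdefault(salted, []).append(value)

def read_hot_key(key: str) -> list[str]:
    # The price: reads must check every possible suffix and combine results.
    values: list[str] = []
    for i in range(SUFFIXES):
        values.extend(store.get(f"{key}#{i}", []))
    return values

write_hot_key("celebrity-user", "like-1")
write_hot_key("celebrity-user", "like-2")
print(len(read_hot_key("celebrity-user")))  # 2
```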
Secondary Indexes
- A secondary index usually does not identify a record uniquely; it is a way to search for occurrences of a particular value
- Secondary indexes add complexity, but they are very useful for data modelling
- Secondary indexes need to be partitioned as well
Partitioning Secondary Index by Document
Each partition is completely separate and maintains its own secondary indexes, covering only the documents in that partition. A document-partitioned secondary index is therefore also called a local index.
However, to search by such an index the query has to be sent to all partitions and the results combined (scatter/gather). This approach is prone to tail latency amplification.
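A toy sketch of a scatter/gather read over local (document-partitioned) indexes; the two partitions and the "color" index are made-up illustrative data, not any real system's layout:

```python
# Scatter/gather over local secondary indexes: every partition indexes only
# its own documents, so a query on the index must visit all partitions and
# merge the results.

partitions = [
    {"docs": {1: {"color": "red"}, 2: {"color": "green"}},
     "index": {"red": [1], "green": [2]}},       # local index of partition 0
    {"docs": {3: {"color": "red"}, 4: {"color": "blue"}},
     "index": {"red": [3], "blue": [4]}},        # local index of partition 1
]

def query_by_color(color: str) -> list[int]:
    doc_ids: list[int] = []
    for partition in partitions:                 # scatter: ask every partition
        doc_ids.extend(partition["index"].get(color, []))
    return sorted(doc_ids)                       # gather: merge the answers

print(query_by_color("red"))  # [1, 3] -- both partitions contributed
```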
Partitioning Secondary Index by Term
Also called a global index, it stores all document references for a term in the same partition. The global index can be partitioned differently from the primary key index.
The term being looked up determines which partition of the index holds the relevant entry.
Benefits:
- Possibility to perform range scans (if index is partitioned by range)
- Reads are more efficient: only the partition containing the term needs to be queried, instead of scatter/gather over all partitions
Drawbacks:
- Writes are slower and more complicated, writing a document can result in writes to multiple partitions / nodes
Updates to a secondary index partitioned this way are often performed asynchronously, because synchronous updates would require a distributed transaction across multiple partitions and nodes. As a result, changes to the index are usually not reflected immediately after a write.
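A rough sketch of why a single document write fans out under a term-partitioned index; the partition counts, the hash rules for documents and terms, and the queue standing in for an asynchronous update log are all illustrative assumptions:

```python
# With a global (term-partitioned) index, one document write touches its own
# data partition plus one index partition per indexed term; the index updates
# are queued rather than applied via a distributed transaction.
from collections import deque

NUM_DOC_PARTITIONS = 4
NUM_INDEX_PARTITIONS = 4
index_update_queue: deque = deque()   # stand-in for an async update log

def doc_partition(doc_id: int) -> int:
    return doc_id % NUM_DOC_PARTITIONS

def index_partition(term: str) -> int:
    # The term itself (here just its first character) picks the index partition.
    return ord(term[0]) % NUM_INDEX_PARTITIONS

def write_document(doc_id: int, terms: list[str]) -> None:
    touched = {("doc", doc_partition(doc_id))}
    for term in terms:
        # Each term may live on a different index partition; enqueue the
        # update instead of waiting for all partitions synchronously.
        index_update_queue.append((term, doc_id))
        touched.add(("index", index_partition(term)))
    print(f"write of doc {doc_id} touched partitions: {sorted(touched)}")

write_document(42, ["red", "sale", "shoes"])
```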
Rebalancing Partitions
Some minimum requirements for rebalancing:
- After rebalancing, the load (storage, read, write) should be shared fairly between nodes in the cluster
- While rebalancing, the database should continue accepting reads and writes
- No more data than necessary should be moved between nodes, to make rebalancing fast and minimize network and disk I/O
Hash Mod N
This is a bad idea: when the number of nodes N changes, hash(key) mod N changes for most keys, so far more data than necessary has to be moved between nodes.
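A small illustration of the problem, assuming an MD5-based stable hash and an illustrative 100,000 keys: growing the cluster from 10 to 11 nodes reassigns roughly 10 out of every 11 keys.

```python
# Demonstration of why "hash mod N" rebalances badly: adding one node to a
# 10-node cluster changes the assignment of almost every key.
import hashlib

def stable_hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if stable_hash(k) % 10 != stable_hash(k) % 11)
print(f"{moved / len(keys):.0%} of keys change node")  # roughly 90%
```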
Fixed Number of Partitions
- Over partition and assign multiple partitions to each node
- When the node count changes, the new node steals some partitions from existing nodes, or partitions from removed nodes are reassigned to the remaining nodes (see the sketch after this list)
- Each partition has management overhead, so choosing a number that is too high can be counterproductive
- The size of each partition grows in proportion to the total dataset; because the number of partitions is fixed and partitions are never split, skew and hot spots can still occur
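A simplified sketch of this rebalancing scheme, assuming 256 partitions and illustrative node names: only whole partitions move, and the key-to-partition mapping never changes.

```python
# Fixed number of partitions: create far more partitions than nodes, and when
# a node joins it steals whole partitions until the assignment is roughly even.

NUM_PARTITIONS = 256

def initial_assignment(nodes: list[str]) -> dict[int, str]:
    # Round-robin the fixed set of partitions over the initial nodes.
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

def add_node(assignment: dict[int, str], new_node: str) -> dict[int, str]:
    nodes = sorted(set(assignment.values())) + [new_node]
    target = NUM_PARTITIONS // len(nodes)          # fair share per node
    per_node: dict[str, int] = {}
    for node in assignment.values():
        per_node[node] = per_node.get(node, 0) + 1
    moved = 0
    for partition, node in sorted(assignment.items()):
        if moved >= target:
            break
        if per_node[node] > target:                # steal only from overloaded nodes
            assignment[partition] = new_node
            per_node[node] -= 1
            moved += 1
    return assignment

assignment = initial_assignment(["node-a", "node-b", "node-c"])
assignment = add_node(assignment, "node-d")
counts = {n: list(assignment.values()).count(n) for n in set(assignment.values())}
print(counts)  # roughly 64 partitions per node after the new node joins
```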
Dynamic Partitioning
- Key-range partitioning can be impractical with a fixed number of partitions, because the key ranges need to be adjusted as the dataset changes
- When a partition grows above a threshold, it can be split into two partitions of roughly equal size (see the split sketch below)
- When a partition shrinks below a threshold, it can be merged with an adjacent partition
- This scheme can also be applied to range of hashed primary key
Benefits:
- Number of partitions adapts to the total data volume
Caveats:
- The initial partitioning needs to be defined manually when the dataset is empty (pre-splitting with no a priori information about where the boundaries should be)
- Otherwise it can start with a single partition
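A toy sketch of size-based splitting, assuming an in-memory list of (lower bound, keys) pairs and an illustrative threshold of 4 records per partition (real systems split at much larger sizes, e.g. around 10 GB); merging of small partitions is omitted for brevity.

```python
# Dynamic partitioning: when a partition grows past a threshold it is split at
# its median key, so the number of partitions adapts to the data volume.

SPLIT_THRESHOLD = 4   # max records per partition before splitting (illustrative)

# Each partition is (inclusive lower bound, sorted list of keys).
partitions: list[tuple[str, list[str]]] = [("", [])]   # start with one partition

def insert(key: str) -> None:
    # Find the partition with the largest lower bound <= key.
    idx = max(i for i, (low, _) in enumerate(partitions) if low <= key)
    low, keys = partitions[idx]
    keys.append(key)
    keys.sort()
    if len(keys) > SPLIT_THRESHOLD:
        mid = len(keys) // 2                    # split roughly in half
        partitions[idx] = (low, keys[:mid])
        partitions.insert(idx + 1, (keys[mid], keys[mid:]))

for k in ["apple", "banana", "cherry", "date", "elder", "fig", "grape"]:
    insert(k)
print([(low, len(keys)) for low, keys in partitions])
```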
Partitioning Proportionally to Nodes
- Fixed number of partitions per node
- Size of the partition grows proportionally to the dataset size
- When a new node joins, it splits a fixed number of randomly chosen existing partitions and takes over one half of each, which can result in unfair splits
- When averaged over large number of partitions, the load should be roughly fairly split
- Requires hash-based partitioning
Automatic vs Manual Rebalancing
- There is an intermediate approach where the database system generates a proposed partition assignment automatically, but it only takes effect once an administrator commits it (similar to cruise control)
- Reduces operational surprises
- Fully automated rebalancing can be convenient but unpredictable. We do not want rebalancing to happen during peak period for instance
- Combined with automatic failure detection, automatic rebalancing can be dangerous: if an overloaded node is slow to respond, other nodes may conclude it is dead and rebalance away from it, adding even more load and potentially causing a cascading failure
Request Routing
Request routing is an instance of a more general problem called service discovery: the client needs to find out which node holds the partition for a given key.
Approaches:
- Allow clients to contact any node; that node handles the request if it owns the partition, otherwise it forwards the request to the appropriate node. This requires the partition assignment metadata to be present on all nodes.
- Use a routing tier that all clients connect to. It does not handle requests itself, but forwards each request to the right node (a partition-aware load balancer; see the routing sketch below). This requires the partition assignment metadata to be present in the routing tier.
- Use a partition-aware client. This requires the partition assignment metadata to be present in the client.
How to distribute partition assignment metadata?
- A centralised service (for example ZooKeeper, or the MongoDB config servers) holds the authoritative assignment; components that need the metadata cache it and subscribe to changes
- Gossip protocol, where all changes are disseminated to all nodes
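A sketch of the routing-tier approach, assuming the partition assignment metadata is simply a cached dictionary kept fresh by change notifications (the kind of state a ZooKeeper watch or config-server subscription would provide); the hashing scheme and node names are illustrative:

```python
# A partition-aware routing tier: it caches the partition -> node assignment
# and forwards each request to the node that currently owns the key's partition.
import hashlib

NUM_PARTITIONS = 16

class RoutingTier:
    def __init__(self, assignment: dict[int, str]):
        # Cached partition -> node map; in practice kept up to date via a
        # change subscription on the metadata service.
        self.assignment = dict(assignment)

    def on_assignment_change(self, partition: int, node: str) -> None:
        self.assignment[partition] = node      # notification after a rebalance

    def route(self, key: str) -> str:
        digest = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
        partition = digest % NUM_PARTITIONS
        return self.assignment[partition]      # node to forward the request to

router = RoutingTier({p: f"node-{p % 3}" for p in range(NUM_PARTITIONS)})
print(router.route("user:42"))
router.on_assignment_change(5, "node-3")       # rebalancing moved partition 5
```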
Parallel Query Execution
Query optimisers (e.g. in Spanner or Trino) can break a complex query down into execution stages and per-partition tasks, which can be executed in parallel across nodes.
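A toy illustration of the idea, assuming an aggregate query that can be split into independent per-partition stages; the data and the "query" are made up, and a thread pool stands in for the cluster's executors:

```python
# Parallel query execution: an aggregate query is broken into one stage per
# partition, the stages run in parallel, and their partial results are combined.
from concurrent.futures import ThreadPoolExecutor

partitions = [
    [{"country": "NZ", "amount": 10}, {"country": "AU", "amount": 25}],
    [{"country": "NZ", "amount": 5}],
    [{"country": "AU", "amount": 40}, {"country": "NZ", "amount": 15}],
]

def partial_sum(rows: list[dict]) -> int:
    # Per-partition stage: filter and aggregate locally.
    return sum(r["amount"] for r in rows if r["country"] == "NZ")

with ThreadPoolExecutor() as pool:
    # Final stage: combine the partial aggregates from all partitions.
    total = sum(pool.map(partial_sum, partitions))

print(total)  # 30
```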
References
- Designing Data-Intensive Applications, Martin Kleppmann