Distributed System Partitioning
Partitioning is the act of breaking up a large dataset into multiple partitions. Different systems use different names for a partition:
- Shard - MongoDB, Elasticsearch, SolrCloud
- Region - HBase
- Tablet - Bigtable
- Vnode - Cassandra, Riak
Partitioning is often combined with replication, so that copies of each partition are stored on multiple nodes. The choice of partitioning scheme is usually independent of the replication scheme.
Terminology
- Skew - when some partitions have more data or queries compared to others
- Hot spot - a partition with disproportionately high load
Key Value Data Partitioning
Assumptions:
- Always access a record by its primary key
Partitioning by Key Range
- Assign a contiguous range of keys to each partition by choosing boundaries between the ranges (for example, for a string key, the ranges could be based on the first character: A - C, D - H, …)
- Combined with knowledge of which partition is assigned to which node, a request can be sent directly to the right node (see the lookup sketch below)
Notes:
- Key ranges are not necessarily evenly spaced, because the data itself may be unevenly distributed; the boundaries need to adapt to the data
- Boundaries can be chosen manually or automatically
- Keys can be sorted within the partition (SSTables, LSM Trees)
Concerns:
- Certain access patterns can lead to hot spots
- Example: partitioning by a timestamp prefix (yyyy-mm) makes the latest month a hot spot, because most reads and writes go to that partition
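A minimal sketch of the key-range lookup described above, assuming the partition boundaries are already known; the boundary values, partition count, and node names are illustrative rather than taken from any particular system:

```python
# Key-range partitioning: each partition owns a contiguous range of keys, and
# a lookup finds the owning partition by binary search over the sorted upper
# boundaries. Boundaries and node names below are illustrative.
import bisect

BOUNDARIES = ["d", "h", "p"]  # partition 0: < "d", 1: ["d","h"), 2: ["h","p"), 3: >= "p"
PARTITION_TO_NODE = {0: "node-a", 1: "node-b", 2: "node-c", 3: "node-a"}

def partition_for_key(key: str) -> int:
    # bisect_right returns the index of the first boundary greater than key,
    # which is exactly the partition whose range contains the key.
    return bisect.bisect_right(BOUNDARIES, key)

def node_for_key(key: str) -> str:
    return PARTITION_TO_NODE[partition_for_key(key)]

assert node_for_key("apple") == "node-a"   # partition 0
assert node_for_key("melon") == "node-c"   # partition 2
assert node_for_key("zebra") == "node-a"   # partition 3
```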
Partitioning by Hash of Key
- A good hash function takes skewed data and makes it roughly uniformly distributed
- Similar inputs can lead to very different hash values
- The hash function does not need to be cryptographically strong, because collisions are expected anyway (multiple keys must end up in the same partition)
- Each partition is then assigned a range of hash values instead of a range of keys (see the sketch below)
Concerns:
- Range queries on the key are no longer efficient; a compound primary key (a hashed first part plus an unhashed remainder) still allows range queries over the remainder for a fixed value of the hashed part
- Such a compound key enables a natural data model for one-to-many relationships
- Sort ordering is lost
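A minimal sketch of hash partitioning, assuming each partition owns an equal slice of a 32-bit hash space; the use of MD5 and the partition count are illustrative choices, not any specific database's scheme:

```python
# Partitioning by hash of key: a stable hash maps each key into a 32-bit
# space, and each partition owns a contiguous slice of that space.
# Python's built-in hash() is randomised per process, so a stable hash
# (here MD5) is used instead.
import hashlib

NUM_PARTITIONS = 8
HASH_SPACE = 2 ** 32

def stable_hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def partition_for_key(key: str) -> int:
    # The key's position in the hash space decides the partition,
    # not the key itself.
    return stable_hash(key) * NUM_PARTITIONS // HASH_SPACE

# Adjacent keys land on unrelated partitions: good for evening out skew,
# but key ordering (and efficient range scans) is lost.
print(partition_for_key("2024-01"), partition_for_key("2024-02"))
```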
Skew and Relieving Hot Spots
- Even with hashing, hot spots cannot be completely avoided: in the extreme case, all activity targets a single key
- Today, most database systems cannot automatically compensate for a workload skewed towards a single key, so the application has to handle the skew
- A common technique is to append a small random suffix to the hot key so that writes spread over many keys; the trade-off is that reads must then fetch and combine data from all of those keys, as sketched below
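A rough sketch of this random-suffix ("key salting") technique, assuming a known hot key and an in-memory dictionary standing in for the database; the suffix count of 16 is an arbitrary illustrative choice:

```python
# Spread writes for a known hot key by appending a small random suffix;
# reads must then gather and merge all suffixed keys.
import random

SUFFIXES = 16
store: dict[str, list[str]] = {}   # stand-in for the partitioned datastore

def write_hot_key(key: str, value: str) -> None:
    # Writes are spread across key#0 .. key#15, landing on many partitions.
    salted = f"{key}#{random.randrange(SUFFIXES)}"
    store.setdefault(salted, []).append(value)

def read_hot_key(key: str) -> list[str]:
    # The price: reads must check every possible suffix and combine results.
    values: list[str] = []
    for i in range(SUFFIXES):
        values.extend(store.get(f"{key}#{i}", []))
    return values

write_hot_key("celebrity-user", "like-1")
write_hot_key("celebrity-user", "like-2")
print(len(read_hot_key("celebrity-user")))  # 2
```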
Secondary Indexes
- A secondary index usually does not identify a record uniquely; it is a way to search for occurrences of a particular value
- Secondary indexes add complexity, but they are very useful for data modelling
- Secondary indexes need to be partitioned as well
Partitioning Secondary Index by Document
Each partition is completely separate and maintains its own secondary indexes, covering only the documents in that partition. A document-partitioned secondary index is therefore also called a local index.
However, to search by such an index the query has to be sent to all partitions and the results combined (scatter/gather). This approach is prone to tail latency amplification.
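A toy sketch of a scatter/gather read over local (document-partitioned) indexes; the two partitions and the "color" index are made-up illustrative data, not any real system's layout:

```python
# Scatter/gather over local secondary indexes: every partition indexes only
# its own documents, so a query on the index must visit all partitions and
# merge the results.

partitions = [
    {"docs": {1: {"color": "red"}, 2: {"color": "green"}},
     "index": {"red": [1], "green": [2]}},       # local index of partition 0
    {"docs": {3: {"color": "red"}, 4: {"color": "blue"}},
     "index": {"red": [3], "blue": [4]}},        # local index of partition 1
]

def query_by_color(color: str) -> list[int]:
    doc_ids: list[int] = []
    for partition in partitions:                 # scatter: ask every partition
        doc_ids.extend(partition["index"].get(color, []))
    return sorted(doc_ids)                       # gather: merge the answers

print(query_by_color("red"))  # [1, 3] -- both partitions contributed
```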
Partitioning Secondary Index by Term
Also called a global index, it stores all document references for a term in the same partition. The global index can be partitioned differently from the primary key index.
The term being looked up determines which partition of the index holds the relevant entry.
Benefits:
- Possibility to perform range scans (if index is partitioned by range)
- Reads are more efficient: only the partition containing the term needs to be queried, instead of scatter/gather over all partitions
Drawbacks:
- Writes are slower and more complicated, writing a document can result in writes to multiple partitions / nodes
Updates to a secondary index partitioned this way are often performed asynchronously, because synchronous updates would require a distributed transaction across multiple partitions and nodes. As a result, changes to the index are usually not reflected immediately after a write.
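A rough sketch of why a single document write fans out under a term-partitioned index; the partition counts, the hash rules for documents and terms, and the queue standing in for an asynchronous update log are all illustrative assumptions:

```python
# With a global (term-partitioned) index, one document write touches its own
# data partition plus one index partition per indexed term; the index updates
# are queued rather than applied via a distributed transaction.
from collections import deque

NUM_DOC_PARTITIONS = 4
NUM_INDEX_PARTITIONS = 4
index_update_queue: deque = deque()   # stand-in for an async update log

def doc_partition(doc_id: int) -> int:
    return doc_id % NUM_DOC_PARTITIONS

def index_partition(term: str) -> int:
    # The term itself (here just its first character) picks the index partition.
    return ord(term[0]) % NUM_INDEX_PARTITIONS

def write_document(doc_id: int, terms: list[str]) -> None:
    touched = {("doc", doc_partition(doc_id))}
    for term in terms:
        # Each term may live on a different index partition; enqueue the
        # update instead of waiting for all partitions synchronously.
        index_update_queue.append((term, doc_id))
        touched.add(("index", index_partition(term)))
    print(f"write of doc {doc_id} touched partitions: {sorted(touched)}")

write_document(42, ["red", "sale", "shoes"])
```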
Rebalancing Partitions
Some minimum requirements for rebalancing:
- After rebalancing, the load (storage, read, write) should be shared fairly between nodes in the cluster
- While rebalancing, the database should continue accepting reads and writes
- No more data than necessary should be moved between nodes, to make rebalancing fast and minimize network and disk I/O
Hash Mod N
This is a bad idea: when the number of nodes N changes, hash(key) mod N changes for most keys, so far more data than necessary has to be moved between nodes.
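A small illustration of the problem, assuming an MD5-based stable hash and an illustrative 100,000 keys: growing the cluster from 10 to 11 nodes reassigns roughly 10 out of every 11 keys.

```python
# Demonstration of why "hash mod N" rebalances badly: adding one node to a
# 10-node cluster changes the assignment of almost every key.
import hashlib

def stable_hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if stable_hash(k) % 10 != stable_hash(k) % 11)
print(f"{moved / len(keys):.0%} of keys change node")  # roughly 90%
```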
Fixed Number of Partitions
- Over partition and assign multiple partitions to each node
- When the node count changes, the new node steals some partitions from existing nodes, or partitions from removed nodes are reassigned to the remaining nodes (see the sketch after this list)
- Each partition has management overhead, so choosing a number that is too high can be counterproductive
- The size of each partition grows in proportion to the total dataset; because the number of partitions is fixed and partitions are never split, skew and hot spots can still occur
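A simplified sketch of this rebalancing scheme, assuming 256 partitions and illustrative node names: only whole partitions move, and the key-to-partition mapping never changes.

```python
# Fixed number of partitions: create far more partitions than nodes, and when
# a node joins it steals whole partitions until the assignment is roughly even.

NUM_PARTITIONS = 256

def initial_assignment(nodes: list[str]) -> dict[int, str]:
    # Round-robin the fixed set of partitions over the initial nodes.
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

def add_node(assignment: dict[int, str], new_node: str) -> dict[int, str]:
    nodes = sorted(set(assignment.values())) + [new_node]
    target = NUM_PARTITIONS // len(nodes)          # fair share per node
    per_node: dict[str, int] = {}
    for node in assignment.values():
        per_node[node] = per_node.get(node, 0) + 1
    moved = 0
    for partition, node in sorted(assignment.items()):
        if moved >= target:
            break
        if per_node[node] > target:                # steal only from overloaded nodes
            assignment[partition] = new_node
            per_node[node] -= 1
            moved += 1
    return assignment

assignment = initial_assignment(["node-a", "node-b", "node-c"])
assignment = add_node(assignment, "node-d")
counts = {n: list(assignment.values()).count(n) for n in set(assignment.values())}
print(counts)  # roughly 64 partitions per node after the new node joins
```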
Dynamic Partitioning
- Key-range partitioning can be impractical with a fixed number of partitions, because the key ranges need to be adjusted as the dataset changes
- When a partition grows above a threshold, it can be split into two partitions of roughly equal size (see the split sketch below)
- When a partition shrinks below a threshold, it can be merged with an adjacent partition
- This scheme can also be applied to range of hashed primary key
Benefits:
- Number of partitions adapts to the total data volume
Caveats:
- The initial partitioning needs to be defined manually when the dataset is empty (pre-splitting with no a priori information about where the boundaries should be)
- Otherwise it can start with a single partition
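A toy sketch of size-based splitting, assuming an in-memory list of (lower bound, keys) pairs and an illustrative threshold of 4 records per partition (real systems split at much larger sizes, e.g. around 10 GB); merging of small partitions is omitted for brevity.

```python
# Dynamic partitioning: when a partition grows past a threshold it is split at
# its median key, so the number of partitions adapts to the data volume.

SPLIT_THRESHOLD = 4   # max records per partition before splitting (illustrative)

# Each partition is (inclusive lower bound, sorted list of keys).
partitions: list[tuple[str, list[str]]] = [("", [])]   # start with one partition

def insert(key: str) -> None:
    # Find the partition with the largest lower bound <= key.
    idx = max(i for i, (low, _) in enumerate(partitions) if low <= key)
    low, keys = partitions[idx]
    keys.append(key)
    keys.sort()
    if len(keys) > SPLIT_THRESHOLD:
        mid = len(keys) // 2                    # split roughly in half
        partitions[idx] = (low, keys[:mid])
        partitions.insert(idx + 1, (keys[mid], keys[mid:]))

for k in ["apple", "banana", "cherry", "date", "elder", "fig", "grape"]:
    insert(k)
print([(low, len(keys)) for low, keys in partitions])
```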
Partitioning Proportionally to Nodes
- Fixed number of partitions per node
- Size of the partition grows proportionally to the dataset size
- When a new node joins, it splits a fixed number of randomly chosen existing partitions and takes over one half of each, which can result in unfair splits
- When averaged over large number of partitions, the load should be roughly fairly split
- Requires hash-based partitioning
Automatic vs Manual Rebalancing
- There is an intermediate approach where the database system generates a proposed partition assignment automatically, but it only takes effect once an administrator commits it (similar to cruise control)
- Reduces operational surprises
- Fully automated rebalancing can be convenient but unpredictable. We do not want rebalancing to happen during peak period for instance
- Combined with automatic failure detection, automatic rebalancing can be dangerous: if an overloaded node is slow to respond, other nodes may conclude it is dead and rebalance away from it, adding even more load and potentially causing a cascading failure
Request Routing
Request routing is an instance of a more general problem called service discovery: the client needs to find out which node holds the partition for a given key.
Approaches:
- Allow clients to contact any node; that node handles the request if it owns the partition, otherwise it forwards the request to the appropriate node. This requires the partition assignment metadata to be present on all nodes.
- Use a routing tier that all clients connect to. It does not handle requests itself, but forwards each request to the right node (a partition-aware load balancer; see the routing sketch below). This requires the partition assignment metadata to be present in the routing tier.
- Use a partition-aware client. This requires the partition assignment metadata to be present in the client.
How to distribute partition assignment metadata?
- A centralised service (for example ZooKeeper, or the MongoDB config servers) holds the authoritative assignment; components that need the metadata cache it and subscribe to changes
- Gossip protocol, where all changes are disseminated to all nodes
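A sketch of the routing-tier approach, assuming the partition assignment metadata is simply a cached dictionary kept fresh by change notifications (the kind of state a ZooKeeper watch or config-server subscription would provide); the hashing scheme and node names are illustrative:

```python
# A partition-aware routing tier: it caches the partition -> node assignment
# and forwards each request to the node that currently owns the key's partition.
import hashlib

NUM_PARTITIONS = 16

class RoutingTier:
    def __init__(self, assignment: dict[int, str]):
        # Cached partition -> node map; in practice kept up to date via a
        # change subscription on the metadata service.
        self.assignment = dict(assignment)

    def on_assignment_change(self, partition: int, node: str) -> None:
        self.assignment[partition] = node      # notification after a rebalance

    def route(self, key: str) -> str:
        digest = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
        partition = digest % NUM_PARTITIONS
        return self.assignment[partition]      # node to forward the request to

router = RoutingTier({p: f"node-{p % 3}" for p in range(NUM_PARTITIONS)})
print(router.route("user:42"))
router.on_assignment_change(5, "node-3")       # rebalancing moved partition 5
```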
Parallel Query Execution
Query optimisers (e.g. in Spanner or Trino) can break a complex query down into execution stages and per-partition tasks, which can be executed in parallel across nodes.
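A toy illustration of the idea, assuming an aggregate query that can be split into independent per-partition stages; the data and the "query" are made up, and a thread pool stands in for the cluster's executors:

```python
# Parallel query execution: an aggregate query is broken into one stage per
# partition, the stages run in parallel, and their partial results are combined.
from concurrent.futures import ThreadPoolExecutor

partitions = [
    [{"country": "NZ", "amount": 10}, {"country": "AU", "amount": 25}],
    [{"country": "NZ", "amount": 5}],
    [{"country": "AU", "amount": 40}, {"country": "NZ", "amount": 15}],
]

def partial_sum(rows: list[dict]) -> int:
    # Per-partition stage: filter and aggregate locally.
    return sum(r["amount"] for r in rows if r["country"] == "NZ")

with ThreadPoolExecutor() as pool:
    # Final stage: combine the partial aggregates from all partitions.
    total = sum(pool.map(partial_sum, partitions))

print(total)  # 30
```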
References
- Designing Data-Intensive Applications, Martin Kleppmann