Database Selection & Design (Part V)
— Partition / Sharding —
Audience: The article covers the some of the common properties (Partitioning & Sharding) and its implementation techniques. These concepts will be referred in the later parts of this series during the database selection and design procedures.
Proficiency Level: Intermediate to Expert
In the previous part, we saw how distributed database computing is implemented with scaling of infrastructure resources. In this section, we will review partitioning and why is it important in the distributed computing world.
What is Partitioning and Why do we need partitioning?
With distributed computing, data needs to be split across different nodes evenly and efficiently. This can be achieved using a technique called Data Partition. With partition, the data can be broken down, split and stored in multiple small databases.
What are the two types of partitioning?
- Vertical partitioning: Splitting database by columns or features. For example: Tweets can be in one database, followers in another and favorites in another. Companies like Airbnb and Twitter uses these type of partitioning for managing their set of data
- Horizontal partitioning: Splitting database by rows or organizations (data specific to a certain set of customers). It can also be location specific data that can be stored in one database and few region in another. Horizontal partitioning is also called as Sharding.
While partitioning improves the performance of the database, joins are expensive as it happens across the network along with other issues like ACID compliance issues, Sharding imbalance, data redistribution etc.
What are the different algorithms that can be used for sharding?
Hash based: Hash-based sharding takes a row key value and generates a hash value from it using a hash function. A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes.
- For example, hash function could be as simple as f(x) = x or f(x) = (x2+1) or it could be a complicated algorithm like Fibonacci hashing or Radix conversion hashing. Please refer here for other types of hashing algorithms.
The hash value is then divided with the total number of servers to determine the modulus, which provides the shard the data should reside in.
- For example, if the hash value of a key H(K1) returns a value of 97, and if you have 4 servers to work with, 97 is divided with 4 to find the modulus 97%4 = 1. Therefore, this row will be stored in the 1st shard.
The example below shows the hash functions being applied to the keys, and modulus (with the number of serves) being applied to the hash function, thereby determining which server the partition has to go to.
A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability. With this uniform hashing algorithm, the hash function can evenly distribute data across servers, reducing the risk of hotspots.
The disadvantage of this type of sharding is the redistribution required in case of any server/node addition or removal from the cluster. When it happens, the entire set needs to be redistributed based on the new modulus calculation. In this example, if the number of server is increased to 5, then the H(K)%5 needs to be calculated for the entire dataset and the data needs to be redistributed again, which may be a costly operation for your organization.
Consistent Hashing: With consistent hash sharding, data is evenly and randomly distributed across shards using a partitioning algorithm. In this type of sharding, everything is placed in a logical ring, including the servers/nodes based on their hash. Each row of the table is placed into a shard determined by computing a consistent hash on the partition column values of that row. When the row hash is calculated to identify its position in the ring, the node clockwise (or anti-clock wise, configurable) to that position will be chosen to store the record. This is how consistent hashing works.
In the below example, the min hash is 0 and the max hash is 100. If the hash value of the Key #1 is 80, then it lands in the ring at the spot A. The node clockwise to that spot A is S5 and so the row with key#1 gets stored in that server.
This gives you a greater flexibility to add or remove servers at any time, and the records hash will just find the next available node to store its data. When adding or removing a node, only the adjacent node has to go thru the redistribution process unlike uniform hash described above, where data from all nodes has to be redistributed.
This sharding strategy is ideal for massively scalable workloads because it distributes data evenly across all the nodes in the cluster, while retaining ease of adding nodes into the cluster. With consistent hash sharding, the possibility of hot spots are further optimized by virtualizing the nodes with different hash keys for each node. In the example above, S1 can be further hashed into S1a, S1b and more and can be placed in the ring in different hash spots. This will improve the distribution efficiency even further.
Range partitioning: Range-based sharding divides data based on ranges of the data value. Unlike consistent hashing, this method doesn’t involve any hashing techniques, rather just uses the values in the key to determine the ranges. Shard keys with nearby values are more likely to fall into the same range and onto the same shards. Each shard essentially preserves the same schema from the original database. Sharding becomes as easy as identifying the data’s appropriate range and placing it on the corresponding shard. Removing a node means deleting the data in that partition. Adding a node will require defining the new range to the new node. It could also be an interim range of the existing series.
The main benefit of range based sharding is that it’s relatively simple to implement. On the other hand, range based sharding doesn’t protect data from being unevenly distributed, leading to hotspots.
Round Robin: In round-robin sharding, server assigns rows in a round-robin manner to each partition so that each partition contains a more or less equal number of rows and load balancing is achieved. Assume there are ’n’ database servers, S0, S1, S2… Sn. The ‘i’th row is stored simply by calculating “i mod n”. Example: If there are 3 servers, 1st row will be stored in 1%3 = 1st server, 9th row will be stored in 9%3=0th (S0) server. This is how writes happens. Read and updates will perform based on the ROW-ID and reverse calculate the partition to go to for completing the operations.
Because there is no partition key, rows are distributed randomly across all shards. In addition, round-robin partitioning offers: Multiple insertion points for future inserts; A way to enhance performance using parallelism; A way to perform administrative tasks, such as updating statistics and truncating data on individual shards. While this method provides almost equal distribution, point queries and range queries will be very inefficient as they have to scan thru all the disk shards.
Directory Based: To implement directory based sharding, one must create and maintain a lookup table that uses a shard key to keep track of which shard holds which data. In a nutshell, a lookup table is a table that holds a static set of information about where specific data can be found. The following diagram shows a simplistic example of directory based sharding:
The main appeal of directory based sharding is its flexibility. Range based sharding architectures limit you to specifying ranges of values, while key based ones limit you to using a fixed hash function which, as mentioned previously, can be exceedingly difficult to change later on. Directory based sharding, on the other hand, allows you to use whatever system or algorithm you want to assign data entries to shards, and it’s relatively easy to dynamically add shards using this approach.
While directory based sharding is the most flexible of the sharding methods discussed here, the need to connect to the lookup table before every query or write can have a detrimental impact on an application’s performance. Furthermore, the lookup table can become a single point of failure: if it becomes corrupted or otherwise fails, it can impact one’s ability to write new data or access their existing data.
Geo-based: In geo-based (aka list partitioning or location-aware) sharding, data is partitioned according to a user-specified column that maps range shards to specific regions and the nodes in those regions. This is used by location-based apps that key off geography. For example, applications like Lyft and Instacart have an important tie to location. You’re not going to live in Alabama and order grocery delivery from California. And if you were to order a Lyft pick-up from California to Alabama you’ll be waiting a good little while for your pickup. Tinder is also a company that uses geo based sharding heavily.
This type of sharding is best fit for those use cases which uses geo location heavily for its operation. A good approach should ensure that the loads of these shards are balanced right. If there are locations where the traffic is expected to spike, have more shards in those locations.
Hierarchical Sharding: This is also called Sub Sharding. The problem it addresses is the issue we have with the fixed number of shards and when one of the nodes becomes overloaded and it becomes too costlier to redistribute. This process of take an overloaded node and creates a partition underneath it to accommodate the overflow.
The above algorithms helps in storing the data (write requests) based on the partition they are supposed to be in. When write happens, the partition will be selected based on the algorithm we choose from above. For example, if you choose Hash based, the incoming write request will be evaluated (applying the hash function on the key) to identify which node the data needs to be stored in. The row is then placed in that partition.
How is the read request handled in this partition situation?
In case of read, the client sends the request. Different databases handled this requests differently. Some databases will have Mater — Slave configuration, while some other databases have Masterless configuration. Master-Slave configuration is an asymmetric communication or control where one node controls other nodes as their communication hub. Masterless configuration is where a master is selected from all or group of eligible nodes, with the other nodes acting in the role of slaves
- In case of Mater-Slave configuration, there could be many configurations in the industry and based on the type of the database you chose. In some, Master takes all writes and reads are distributed, while in others Master takes all reads and writes. In case of read distributed to all slaves, the read request goes to the master node first. Master node has the details of all the slave partitions and the token ranges it serves. Based on that, the read request will be send to that node and the response will be gathered by the master and will be send back to the client.
- In case of Masterless configuration, the read request will be send to the nearest node, which then becomes the coordinator node (The coordinator is selected by the driver based on the policy you have set. Common policies are DCAwareRoundRobinPolicy and TokenAwarePolicy. Every node in the masterless configuration will have all information about the partition and the token ranges each node within the partition serves. This coordinator node works with the relevant partition/node(s) to get the data and send it back to the client.
There are more variations to this read operations and how the data will be retrieved. This again depends on the type of databases as different databases handles these requests differently.
What are benefits of partitioning / sharding?
- Scalability: With Distributed database computing, the scalability is achieved when you divide data across multiple partitions, each hosted on a separate server.
- Performance: When data is partitioned into smaller chunks and stored in multiple databases, the data access operations is much faster on this smaller volume.
- Parallelism: Operations affecting multiple partitions can execute in parallel. Parallel implementation offers detailed benefits to optimize resource utilization and lessens the implementation time too
- Security: Partitioned an be applied based on the sensitivity of your data as well. You can separate sensitive and nonsensitive data into different partitions and apply different security controls to the sensitive data.
- Availability: Separating data across multiple servers avoids a single-point-of-failure. If any one instance fails, only the data in that partition is unavailable. Operations on other partitions can continue.
What are some disadvantages of Sharding?
- Complexity & Redistribution: Implementing sharding is very complex and if not done with the right algorithm can cause performance and data integrity issues. If you encounter hotspot or performance situation, you may have to revise your sharding technique, which means data in the shards needs to be redistributed (with some algorithms). Moving data from one shard to another shard requires lot of downtime.
- Joins across shards: Often times, application / business logic may need data from different shards, which will require programmer to build complex queries. While pulling data from multiple shards, performance and availability concerns needs to be kept in mind.
- No Native Support: Sharding is not natively supported by every database engine. With many databases, the algorithms and nuances of sharding needs to be managed explicitly by the programmer.
There are pros and cons for different types of partition techniques we reviewed above. Consistent hash sharding is better for scalability and preventing hot spots, while range sharding is better for range based queries. A simple comparison of algorithm mentioned below.
There are many topics closely related to Sharding like Replication, Consistency etc. The details behind data redistribution are very important. Everything follows logically once you consider how the data is stored and retrieved. Cross-partition queries are inefficient, and many Sharding schemes attempt to minimize the number of cross-partition operations. On the other hand, partitions need to be granular enough to evenly distribute the load amongst nodes. Finding the right balance can be tricky. Of course, the best solution in software engineering is avoiding the problem altogether. There are many successful websites operating without Sharding. See if you can defer or avoid the problem altogether.
Happy Reading !
Next part in this series → Still working on it…
Previous part in this series → here