Database Selection & Design (Part VI)
— Replication Properties —
Replication Techniques and Types
Replication: Storing the same data on additional nodes to improve availability is called replication. Databases implement it under different models, such as Master-Slave, Primary-Secondary and Coordinator-Participant. Replication can also support backup and restore. When it is built into the database, its biggest advantage is that data moves across data centers without application-side involvement and without additional replication tools. The key items to watch are latency and consistency, which are related: the level of consistency you choose dictates the latency you pay. There are two types of replication, asynchronous and synchronous.
Asynchronous replication: In this type, replication happens in the background. As soon as the write to the master / primary node succeeds, the client gets an OK notification. The data is copied to the replica nodes internally and the client doesn’t have to wait for that. This comes with read consistency and durability issues: a replica may serve stale data, and a write that was acknowledged but not yet replicated can be lost if the primary fails. Many databases handle these issues with tight write and read consistency protocols.
Synchronous replication: In this type, replication happens in the foreground. After the write to the master / primary node succeeds, the primary waits for the replica nodes to receive the update before it responds to the client with an OK. This guarantees the client that the write has reached every replica before it is acknowledged. But it comes at a cost in performance and availability. Performance suffers from the added latency, because the system has to wait for all nodes to respond. Availability suffers because a write cannot succeed while any replica is down and unable to take the request.
There’s some middle ground. Some databases and replication tools allow us to define a number of followers to replicate synchronously, and the others just use the asynchronous approach. This is sometimes called semi-synchronous replication.
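The three modes above differ only in how many followers the leader waits for before answering the client. A minimal in-memory sketch of that idea follows; the `Leader` and `Follower` classes and the `sync_count` parameter are hypothetical names for illustration, not any database's API, and a real system would also roll back or retry a failed synchronous write.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)

    def apply(self, key, value) -> bool:
        """Apply a replicated write; return False if the node is unreachable."""
        if not self.up:
            return False
        self.data[key] = value
        return True


@dataclass
class Leader:
    followers: list
    sync_count: int = 0  # followers that must acknowledge before the client gets OK
    data: dict = field(default_factory=dict)
    backlog: list = field(default_factory=list)  # writes still to be shipped in the background

    def write(self, key, value) -> str:
        self.data[key] = value
        sync = self.followers[:self.sync_count]
        rest = self.followers[self.sync_count:]
        # Foreground part: the client waits for these acknowledgements.
        if not all(f.apply(key, value) for f in sync):
            return "FAILED"  # simplification: no rollback or retry here
        # Background part: queued, the client does not wait for it.
        self.backlog.extend((f, key, value) for f in rest)
        return "OK"

    def flush(self) -> None:
        """Simulate background replication catching up."""
        for follower, key, value in self.backlog:
            follower.apply(key, value)
        self.backlog.clear()


# sync_count=0 is asynchronous, sync_count=len(followers) is synchronous,
# anything in between is semi-synchronous.
a, b = Follower("a"), Follower("b")
leader = Leader(followers=[a, b], sync_count=1)
status = leader.write("curr", "USD")
lagging_before_flush = dict(b.data)  # b has not received the write yet
leader.flush()
```

With `sync_count=1`, follower `a` holds the value as soon as `write` returns, while `b` only catches up after `flush()`. Setting `sync_count=2` with one follower down makes the write fail, which is the availability cost described above.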
Different ways of implementation:
Internally, replication can be handled with a single-leader, multi-leader or leaderless model.
- Single-leader replication is the most commonly used topology, but it has its cons. Because all writes go through one leader, that node adds latency and carries the entire write load. Failure handling also matters for your selection: what happens to in-flight writes when the leader fails, and how does the system recover? Many databases offer automatic failover, but it brings its own complications.
- Multi-leader replication addresses some of the above concerns. Multiple leaders accept write requests from clients, which is very useful when the application has more writes than one leader can handle. Where the leaders are placed depends on the application and the organization: by region, by transactional hot spot, or all in one data center. Write conflicts need to be managed very closely. For example, if two clients update the same tuple on two different leaders that are not yet in sync, both writes succeed on their respective leaders, but the conflict surfaces when the leaders try to replicate that data to each other.
- Leaderless replication (popularized by Amazon’s Dynamo) has no leaders at all: every replica can accept writes. The basic idea is that clients send each write not to one replica but to several (or, in some cases, to all of them). The big advantage of not having a designated leader is better fault tolerance. No leaders, no failovers. To work efficiently, this model relies on additional processes such as read repair, hinted handoff and quorums.
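The multi-leader write conflict described above can be sketched with a per-row version counter. This is a deliberately naive scheme chosen for illustration (production systems typically use version vectors or timestamps), and `LeaderNode`, `write` and `receive` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class LeaderNode:
    name: str
    # key -> (value, version); version counts the writes this row has seen
    rows: dict = field(default_factory=dict)

    def write(self, key, value) -> None:
        """Accept a client write locally; it always succeeds on this leader."""
        _, version = self.rows.get(key, (None, 0))
        self.rows[key] = (value, version + 1)

    def receive(self, key, value, version) -> str:
        """Apply a change replicated from another leader, or report a conflict."""
        local_value, local_version = self.rows.get(key, (None, 0))
        if version > local_version:
            self.rows[key] = (value, version)
            return "applied"
        if version == local_version and value == local_value:
            return "already in sync"
        # Same or older version with different data: the histories diverged.
        return "conflict"


east, west = LeaderNode("east"), LeaderNode("west")

# Start in sync: one write on east, replicated to west.
east.write("tuple-1", "USD")
first_sync = west.receive("tuple-1", *east.rows["tuple-1"])

# Two clients update the same tuple on different leaders before they sync.
east.write("tuple-1", "EUR")
west.write("tuple-1", "GBP")

# Both local writes succeeded; the problem only shows up during replication.
outcome = west.receive("tuple-1", *east.rows["tuple-1"])
```

Here `outcome` is `"conflict"`: the database (or the application) now has to pick a winner, merge the values, or surface the conflict to the user.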
How does replication happen internally:
Let's go one level deeper and see how one node actually sends its data to another. After all, replication is all about copying bytes from one place to another.
- Statement-based replication: In this method, the coordinator sends each statement it executes to all its replicas, and every replica runs the same statement to arrive at the same data. E.g. if it receives a statement such as UPDATE ... SET CURR = 'USD', the same statement is sent to all its replicas. One thing to watch here is that not all statements are deterministic. If your statement uses a nondeterministic function like RANDOM(), each replica will end up with a different value.
- Log shipping replication: Most databases write to a log before committing a transaction. This method ships those logs to the replicas, which replay them to apply the same updates. Where logs are shipped on a schedule, a common default interval is 15 minutes, and it can be tailored to your application. Because the log describes data at a very low level that only the database engine understands, a replica running a different database version may not be able to read it. Also, with multi-leader replication, multiple log files need to be merged before the updates are applied.
- Row-based replication: It sits between the two methods above. Instead of using the internal log, the database writes a replication-specific log that records the changes to each row, which overcomes the limitations mentioned above. The trade-off is volume: for a statement that affects many rows, statement-based replication ships one statement, whereas row-based replication ships every affected row.
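The difference between statement-based and row-based replication can be seen with two in-memory SQLite databases standing in for a leader and a replica. This is only a sketch: the `rates` table, its `token` column and the helper functions are made up for the example, and real databases do this through their own replication logs rather than application code.

```python
import sqlite3


def new_node():
    """Create an in-memory database with two identical starting rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE rates (id INTEGER PRIMARY KEY, token INTEGER)")
    db.executemany("INSERT INTO rates (id, token) VALUES (?, 0)", [(1,), (2,)])
    return db


def rows(db):
    return db.execute("SELECT id, token FROM rates ORDER BY id").fetchall()


statement = "UPDATE rates SET token = RANDOM()"

# Statement-based: every node re-executes the same SQL text.
leader, replica = new_node(), new_node()
leader.execute(statement)
replica.execute(statement)  # RANDOM() is evaluated again on the replica
statement_based_in_sync = rows(leader) == rows(replica)

# Row-based: the leader executes the statement once and ships the resulting rows.
leader, replica = new_node(), new_node()
leader.execute(statement)
changed_rows = rows(leader)  # one entry per affected row, not one per statement
replica.executemany(
    "UPDATE rates SET token = ? WHERE id = ?",
    [(token, row_id) for row_id, token in changed_rows],
)
row_based_in_sync = rows(leader) == rows(replica)
```

The statement-based pair diverges because each node draws its own random values, while the row-based pair stays identical at the price of shipping one change per affected row.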
While replication is a dependable way to get data redundancy, it has cost implications. Non-relational databases offer tunable consistency, where quorum reads and writes improve consistency across replicas. This costs performance, since waiting for more replicas adds latency, and how much depends on the quorum level you choose.
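The trade-off behind tunable consistency comes down to simple arithmetic: with N replicas, a read is guaranteed to overlap the latest acknowledged write when the write and read counts satisfy W + R > N. A small sketch follows; the level names `ONE`, `QUORUM` and `ALL` are illustrative labels in the style some databases use, and the function names are hypothetical.

```python
def acks_required(level: str, n: int) -> int:
    """Number of replicas that must respond for a given consistency level."""
    return {"ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}[level]


def reads_see_latest_write(write_level: str, read_level: str, n: int) -> bool:
    """True when every read set overlaps every write set, i.e. W + R > N."""
    return acks_required(write_level, n) + acks_required(read_level, n) > n


def tolerated_failures(level: str, n: int) -> int:
    """How many replicas can be down while requests at this level still succeed."""
    return n - acks_required(level, n)


n = 3
for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
    print(
        write_level,
        read_level,
        "overlap guaranteed:", reads_see_latest_write(write_level, read_level, n),
        "write tolerates failures:", tolerated_failures(write_level, n),
    )
```

Higher levels wait for more replicas, so they add latency and tolerate fewer failed nodes, which is the performance cost mentioned above.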
Link to the next part in this series: