Database Selection & Design (Part IV)

— Non-Relational Properties —

Audience: This article covers the non-relational properties (distributed computing, the CAP theorem, and the PACELC theorem) in detail. Later parts of this series refer back to these concepts during the database selection and design procedures.

Proficiency Level: Intermediate to Expert

In the previous parts, we reviewed the relational properties and the internals of the ACID implementation. In this part, let us take a look at the non-relational properties and how they are different from the relational properties. Let’s start with reviewing what distributed database computing is and its taxonomy.

What is distributed computing?

Distributed database computing is the practice of managing data across multiple interconnected servers (nodes), often spread over many data centers and locations, that communicate over a computer network. It improves performance for end users by allowing transactions to be processed on many machines instead of being limited to one.

Why do we need distributed database computing?

As we saw in the earlier parts, the explosion of data in recent years created the need to expand data storage. That expansion, in turn, drove the need to scale the database infrastructure to accommodate the increase in volume.

How can we implement distributed database computing?

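There are two ways to scale with your storage needs: horizontally and vertically. Vertical scaling adds resources (CPU, memory, disk) to a single server, while horizontal scaling adds more servers and spreads the data across them.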

While both of these scaling techniques apply to relational databases, non-relational databases are built from the ground up with horizontal scaling in mind, which enables distributed database computing.
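As a minimal sketch of what horizontal scaling means for data placement, the snippet below spreads keys across several nodes by hashing each key. The node names, the `node_for` helper, and the modulo scheme are illustrative assumptions, not how any particular database assigns data.

```python
import hashlib

def node_for(key: str, nodes: list[str]) -> str:
    """Pick the node that stores `key` (hypothetical helper)."""
    # Use a stable hash so every client maps the same key to the same node.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

nodes = ["node-1", "node-2", "node-3", "node-4"]  # assumed cluster
placement = {user: node_for(user, nodes) for user in ["alice", "bob", "carol", "dave"]}
for user, node in placement.items():
    print(user, "->", node)
```

With this naive scheme, adding a fifth node changes `len(nodes)` and remaps most keys, which is one reason real systems tend to use more careful placement strategies.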

When many servers come together for distributed database computing, connecting them with a suitable network topology plays a key role in performance and efficiency. There are many network topologies to choose from (Mesh, Star, Bus, Ring, Hybrid, etc.). A topology commonly used in distributed database computing is the ring topology, in which each node is connected to two other nodes so that the nodes form a ring.
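To make the ring idea concrete, here is a small sketch that computes the two neighbors of each node in a ring. The node names and the `ring_neighbors` helper are hypothetical and only illustrate the shape of the topology.

```python
def ring_neighbors(nodes: list[str]) -> dict[str, tuple[str, str]]:
    """Map each node to its (previous, next) neighbor in the ring."""
    n = len(nodes)
    return {
        # The modulo wraps the last node back around to the first one.
        node: (nodes[(i - 1) % n], nodes[(i + 1) % n])
        for i, node in enumerate(nodes)
    }

ring = ring_neighbors(["node-1", "node-2", "node-3", "node-4"])
print(ring["node-1"])  # ('node-4', 'node-2')
```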

Distributed computing makes it challenging to adhere to the ACID properties. With data on multiple nodes and a network connecting them, there is always a chance of delays or failures across the network, which can prevent the ACID properties from staying intact. In the diagram above, assume a user's information is stored on nodes 3 and 4. If the network fails during an operation, these two nodes can end up with different information, thereby losing the consistency property within ACID. Because these properties are hard to achieve in a distributed database computing environment, a relaxed set of rules emerged that accepts these compromises: the BaSE properties (Basically Available, Soft state, Eventual consistency).
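The node 3 / node 4 scenario can be sketched with two toy in-memory replicas. The `Replica` class, the `replicated_write` helper, and the key and values are assumptions made up for illustration; a real database replicates data very differently.

```python
class Replica:
    """A toy in-memory node holding a key-value copy of the data."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.data: dict[str, str] = {}

def replicated_write(key, value, replicas, reachable):
    """Apply the write only on replicas the network can currently reach."""
    for replica in replicas:
        if replica.name in reachable:
            replica.data[key] = value

node3, node4 = Replica("node-3"), Replica("node-4")

# Healthy network: both copies receive the write.
replicated_write("user:42:email", "old@example.com", [node3, node4], {"node-3", "node-4"})

# Network failure: node-4 is unreachable, so only node-3 sees the update.
replicated_write("user:42:email", "new@example.com", [node3, node4], {"node-3"})

print(node3.data["user:42:email"])  # new@example.com
print(node4.data["user:42:email"])  # old@example.com
```

After the second write the two copies disagree, which is the inconsistency described above.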

Transactional Management (BaSE) Requirements

CAP Theorem Properties

CAP Theorem: Fundamental theorem for any distributed storage system pertaining to achievable properties

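Before looking at the definition of the CAP theorem, let’s see what CAP (Consistency, Availability & Partition Tolerance) stands for. Consistency means every read returns the most recent write, so all nodes show the same data at the same time. Availability means every request to a working node receives a response, even if that response does not reflect the latest write. Partition Tolerance means the system keeps operating even when network failures cut off communication between groups of nodes.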

With this understanding, let us look at the definition of CAP theorem:

It states that a sequentially consistent read/write register that eventually responds to every request cannot be realized in an asynchronous system that is prone to network partitions. In other words, a distributed system can guarantee at most two of the above three properties at the same time. Remember, when there is a partition, failures and delays are inevitable. The sections below explain the possible combinations and how each can or cannot be achieved in relational and non-relational databases.

Availability — Partition Tolerant (AP) Systems:

If you set your system to be Available and Partition Tolerant: when you write to a certain set of nodes (red) during a network partition, the write is not propagated to the set of nodes (blue) on the other side of the partition. Reads served by the blue nodes still succeed but may return stale data. The replicas converge only after the partition heals, which is why such systems are described as eventually consistent. These kinds of systems are called AP (Available and Partition Tolerant) systems. E.g.: Cassandra, CouchDB, DynamoDB, Riak, etc.
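The AP behavior can be sketched as follows. The `APNode` class, the integer timestamps, and the last-write-wins merge in `heal` are assumptions chosen to keep the example short; they stand in for whatever reconciliation a real AP database performs.

```python
class APNode:
    """Toy node that always answers, even when cut off from its peers."""
    def __init__(self) -> None:
        self.store: dict[str, tuple[int, str]] = {}  # key -> (timestamp, value)

    def write(self, key: str, value: str, ts: int) -> None:
        current = self.store.get(key)
        # Keep the value with the newest timestamp (assumed last-write-wins rule).
        if current is None or ts > current[0]:
            self.store[key] = (ts, value)

    def read(self, key: str):
        entry = self.store.get(key)
        return entry[1] if entry else None

def heal(a: APNode, b: APNode) -> None:
    """Exchange state after the partition ends; the newest timestamp wins."""
    for src, dst in ((a, b), (b, a)):
        for key, (ts, value) in list(src.store.items()):
            dst.write(key, value, ts)

red, blue = APNode(), APNode()
red.write("cart", "1 item", ts=1)
blue.write("cart", "1 item", ts=1)   # both sides agree before the partition

red.write("cart", "2 items", ts=2)   # partition: only the red side sees this
stale = blue.read("cart")            # blue still answers, with old data
heal(red, blue)                      # partition ends, replicas converge
print(stale, "->", blue.read("cart"))
```

The blue node never refuses a read; it simply serves the older value until the two sides exchange state.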

Consistency — Partition Tolerant (CP) Systems:

If you set your system to be Consistent and Partition Tolerant: when you write to a certain set of nodes (red) during a network partition, the write is not propagated to the set of nodes (blue) on the other side of the partition. In this case, the blue nodes do not respond to reads (they become unavailable) rather than risk returning inconsistent data. These kinds of systems are called CP (Consistent and Partition Tolerant) systems. E.g.: MongoDB, HBase, Redis, etc.
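A CP system can be sketched in the same toy style. Here a node answers only while it can reach a majority of the cluster; that majority rule, the `CPNode` class, and the `Unavailable` exception are assumptions for illustration, not the mechanism of any specific database listed above.

```python
class Unavailable(Exception):
    """Raised when a node refuses to answer rather than risk stale data."""

class CPNode:
    def __init__(self, name: str, cluster_size: int) -> None:
        self.name = name
        self.cluster_size = cluster_size
        self.reachable_peers = cluster_size - 1  # peers this node can contact
        self.store: dict[str, str] = {}

    def _require_majority(self) -> None:
        # Count this node plus the peers it can reach.
        if self.reachable_peers + 1 <= self.cluster_size // 2:
            raise Unavailable(f"{self.name} cannot reach a majority")

    def write(self, key: str, value: str) -> None:
        self._require_majority()
        self.store[key] = value

    def read(self, key: str):
        self._require_majority()
        return self.store.get(key)

red = CPNode("red-1", cluster_size=5)
blue = CPNode("blue-1", cluster_size=5)

# Partition: the red side keeps 3 of 5 nodes, the blue side keeps 2 of 5.
red.reachable_peers = 2
blue.reachable_peers = 1

red.write("balance", "100")      # majority side still accepts writes
try:
    blue.read("balance")
    blue_answered = True
except Unavailable:
    blue_answered = False        # minority side refuses to answer
print("blue answered:", blue_answered)
```

The blue node gives up availability during the partition so that it never returns data that might be out of date.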

Consistency — Availability (CA) Systems:

If you set your system to be Consistent and Available: when you write to a certain set of nodes, the other nodes can serve that data consistently and remain available. The minute you throw a partition into the mix, you are forced into one of the above situations. So this type is possible only when no partition occurs, and in real-world distributed systems partitions are inevitable. These kinds of systems are called CA (Consistent and Available) systems. E.g.: RDBMS (Oracle, MySQL, Vertica, Postgres, etc.).

We listed the CA type last for a reason — in a distributed system, partitions can’t be avoided. So, while we can discuss a CA distributed database in theory, for all practical purposes, a CA distributed database can’t exist. However, this doesn’t mean you can’t have a CA database for your distributed application if you need one. Many relational databases, such as PostgreSQL, deliver consistency and availability and can be deployed to multiple nodes using replication.

The CAP theorem explains how the system behaves when there is a partition. What happens when there is no partition?

Please note that the CAP Theorem (explained above) fails to capture the trade-off between latency and consistency during normal operation; it merely tells us whether a system favors availability or consistency in the face of a network partition. On the other hand, the PACELC theorem provides options to see how the system performs during failures and also during normal conditions.

PACELC (passELK) Requirements — An alternate CAP formulation

The X-axis defines the options available when there is network partition. The Y-axis defines parameters when there is no network partition. In case of a Partition, there is an Availability-Consistency trade-off; Else, in normal operation, there is a Latency-Consistency trade-off.
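The two-branch rule can be sketched as a small helper that builds a PACELC label from the two choices and reports which property is favored in each situation. The function names `pacelc_label` and `favored` are hypothetical and exist only for this illustration.

```python
def pacelc_label(on_partition: str, otherwise: str) -> str:
    """Build a PACELC label such as 'PA/EL' from the two trade-off choices."""
    if on_partition not in {"A", "C"}:
        raise ValueError("on_partition must be 'A' (availability) or 'C' (consistency)")
    if otherwise not in {"L", "C"}:
        raise ValueError("otherwise must be 'L' (latency) or 'C' (consistency)")
    return f"P{on_partition}/E{otherwise}"

def favored(label: str, partitioned: bool) -> str:
    """Return which property a system with this label favors right now."""
    names = {"A": "availability", "C": "consistency", "L": "latency"}
    partition_part, else_part = label.split("/")
    # The second character of each half carries the chosen property.
    return names[partition_part[1]] if partitioned else names[else_part[1]]

for p in "AC":
    for e in "LC":
        label = pacelc_label(p, e)
        print(label, "| partition:", favored(label, True), "| normal:", favored(label, False))
```

The loop enumerates all four combinations of the two choices.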

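PA/EC systems:

During a partition, the system favors availability over consistency; else, in normal operation, it favors consistency over latency.

PA/EL systems:

During a partition, the system favors availability over consistency; else, in normal operation, it favors latency over consistency.

PC/EC systems:

During a partition, the system favors consistency over availability; else, in normal operation, it favors consistency over latency.

PC/EL systems:

During a partition, the system favors consistency over availability; else, in normal operation, it favors latency over consistency.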

Both the CAP and PACELC theorems state that Consistency, Availability and Partition Tolerance cannot all be achieved at the same time. So, your selection process has to trade off one of these properties to fit your use case and your ecosystem.

Conclusion:

BaSE properties are not as stringent as the ACID properties. It is all about the trade-offs you have to make between the related properties. These trade-offs come from your business requirements and your organizational policies. While these theoretical concepts provide the direction, many of these properties are baked into different databases in different ways. When it comes to making a decision, any database evaluation means analyzing the above properties and seeing how they can help you achieve your business goals.

Happy Reading!

Next part in this series → here

Previous part in this series → here