Distributed systems. We live in a world of distributed systems. After all, if your solution resides on more than a single machine, you are already dealing with a distributed system. Even if you have this “mothership” in the form of a central, single, ACID-compliant SQL database, you probably have a cache somewhere in between.
And even if you don’t, there are probably some external systems you need to integrate with over, e.g., a SOAP or JSON API. Not to mention the so-called NoSQL databases, where partitioning is often the main reason they came into existence in the first place.
All of this communication happens over the network, so, like it or not, most of us are already building distributed systems. These kinds of systems have constraints we need to be aware of when designing them; otherwise we will face anomalies that are really hard to troubleshoot. This is what the CAP theorem helps us with - reasoning about a distributed system and knowing its limitations.
What’s it about?
In 2000, UC Berkeley professor Eric Brewer, in an attempt to describe distributed systems and their fallacies in a more formal way, gave a keynote speech at the ACM Symposium on Principles of Distributed Computing (PODC), giving birth to the conjecture we now call the CAP theorem. Two years later, two MIT researchers named Seth Gilbert and Nancy Lynch formally proved it correct.
The CAP theorem describes three properties possible to achieve in a given distributed system. These are: Consistency, Availability and Partition tolerance - this is where the acronym comes from. The theorem states that any such system can guarantee only two out of the three (btw. you better choose wisely!). This is often illustrated with a triangle, where the properties form the vertices. Why is it so? Why can’t you have all three of them?
To understand this, we need to consider a situation where a network partition happens, sometimes also called split brain. This is a really important concept in distributed systems theory.
What’s a network partition?
Formally speaking, a network partition happens when one subnet gets split into multiple distinct ones because of an error. However, I think we can better understand the concept with a more concrete example.
Let’s imagine a system consisting of three machines, connected by a network, living in the same subnet. This could be the topology of, for example, a distributed database like MongoDB or Cassandra.
Under normal operation, the clients of this DB can talk freely with any of its nodes, and the nodes can talk with each other. So, if one of the nodes receives a write, it can tell its fellow nodes there has been a write, in order to maintain consistency. Everything is fine and everybody is happy… for now.
Now, somebody pulls the plug on a network device or messes up the routing tables, effectively breaking communication between one machine and the other two. So, we have a situation where machines [A, B] cannot communicate with [C], and [C] cannot communicate with either A or B. What happened is that two new subnets were just born, introducing a network partition. There are probably other services/clients living in C’s subnet still able to talk to it, and there are processes living in the [A, B] subnet still able to talk to the [A, B] database nodes, but not to C anymore. That’s what we call a network partition: our network topology literally got split, which is why some people refer to it as split brain, borrowing a medical analogy.
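To make the split concrete, here is a tiny Python sketch (the node names and the `links` structure are invented for illustration) that computes which nodes can still see each other before and after the plug gets pulled:

```python
def reachable(start, links):
    """Return the set of nodes reachable from `start` over working links."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbour in links.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen

# Healthy network: every node can talk to every other node.
healthy = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"}}
print(sorted(reachable("A", healthy)))   # ['A', 'B', 'C'] - one group

# Someone pulls the plug: C loses its links to A and B.
split = {"A": {"B"}, "B": {"A"}, "C": set()}
print(sorted(reachable("A", split)))     # ['A', 'B']
print(sorted(reachable("C", split)))     # ['C'] - split brain
```

Each distinct set returned by `reachable` after the failure is one side of the partition.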
What happens if we try to write now?
Following the theorem, when faced with a network partition, the system can only sustain two out of the three properties it describes. Let’s take a closer look at each of them and consider our options in the context of the situation I just described:
When subject to a network partition, the system could figure out that some nodes in the cluster cannot be reached and therefore reject the write in an attempt to maintain data consistency. This ensures that no client of our distributed database is going to receive stale values by reading from a node in another partition. In our example, this means that if there is a write to node [C], it needs to reject it, since it cannot propagate it to [A, B]. Therefore, we are sacrificing Availability in favour of Consistency.
In other words, Consistency could be described as “every request is served with up-to-date values”
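The consistency-first choice can be sketched in a few lines of Python. `Peer`, `CPNode` and `PartitionError` are invented names for illustration; a real CP system would use quorum protocols rather than this all-or-nothing check.

```python
class PartitionError(Exception):
    """Raised when a write cannot be safely replicated."""

class Peer:
    def __init__(self, name):
        self.name = name
        self.reachable = True   # flips to False when the network splits
        self.data = {}

class CPNode:
    def __init__(self, name, peers):
        self.name = name
        self.peers = peers
        self.data = {}

    def write(self, key, value):
        # Reject the write unless every replica can be updated, so that
        # no client can ever read a stale value from any node.
        if any(not p.reachable for p in self.peers):
            raise PartitionError("write rejected: cannot reach all replicas")
        self.data[key] = value
        for p in self.peers:
            p.data[key] = value

a, b = Peer("A"), Peer("B")
c = CPNode("C", [a, b])
c.write("v", "V1")                   # fine: replicated everywhere

a.reachable = b.reachable = False    # the partition from the example
try:
    c.write("v", "V2")
except PartitionError:
    print("write rejected")          # consistency kept, availability lost
```

Note that after the failed write every node still agrees on V1; the price is that C stopped serving writes.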
Under the same circumstances, we might as well accept the write and deal with the fact that some clients might receive stale values. In our example, this means that if we update value V1 to V2 on node C, nodes [A, B] have no way of knowing that value V2 exists and will continue to serve value V1 to their clients. Therefore, we just sacrificed Consistency in favour of Availability.
In other words, Availability could be described as “every request receives a response”
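The availability-first choice looks like this in the same toy style (again, `APNode` and its fields are made-up names, not a real database API): every node accepts writes and replicates on a best-effort basis to whoever it can still reach.

```python
class APNode:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.reachable_peers = []    # links that are currently up

    def write(self, key, value):
        # Always accept the write (stay available) and replicate
        # on a best-effort basis to the peers we can still reach.
        self.data[key] = value
        for peer in self.reachable_peers:
            peer.data[key] = value

    def read(self, key):
        return self.data.get(key)

a, b, c = APNode("A"), APNode("B"), APNode("C")

# Before the partition, every node holds V1.
for node in (a, b, c):
    node.data["v"] = "V1"

# The partition from the example: C is cut off from A and B.
a.reachable_peers = [b]
b.reachable_peers = [a]
c.reachable_peers = []

c.write("v", "V2")                   # C accepts the write anyway
print(a.read("v"), c.read("v"))      # V1 V2 - stale read vs. fresh read
```

Every node keeps answering requests, but clients on different sides of the split now see different values for the same key.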
If we would like our data to be both consistent and available, we might as well store it all on a single machine. This way, there is no network communication happening on writes, and our values are guaranteed to be consistent and available. However, since all the data lives on a single machine now, there is no network left to split: we sacrificed Partition tolerance in favour of Consistency and Availability.
In other words, Partition tolerance could be described as “the system keeps operating even when its nodes, spread across more than one machine, can no longer all communicate”
Distributing our system can bring advantages in terms of scalability and data partitioning, and sometimes it is even the only option for dealing with huge data sets. It can also help scale engineering work and enable independent deployments, as with microservices. However, we need to be really careful, keeping the fallacies of distributed computing in mind and carefully choosing which properties we would like to achieve, and why. I hope this post gave you some intuition about what the CAP theorem is about. If you are interested in the topic, I really recommend reading “NoSQL Distilled” by Pramod J. Sadalage and Martin Fowler.