Introduction
Licensed under the Creative Commons Attribution-Share Alike 4.0 International license.
Understanding the basics: distributed systems
A distributed system is a set of nodes, two or more of which work in coordination to achieve a common outcome when carrying out a given task. Each node is capable of autonomous behavior, i.e. each node has its own memory and processor, and ideally each one can fail individually without affecting the overall performance of the network. The nodes also operate in a way that lets end users perceive them as a single logical unit.
For example, we perceive the Ethereum network as a single entity on which smart contracts are coded and executed harmoniously, but it is the sum of the individual work of each node that makes it possible for smart contracts to operate.
Distributed Systems have the following features:
- Resource sharing: components within a distributed system share both hardware and software resources to meet the needs of the network's tasks.
- Horizontal scaling: instead of upgrading the system by investing in hardware with greater capabilities, operators can simply add another computer to improve the network's performance.
Main drawbacks of distributed systems
The main challenges of running a distributed system are synchronization and fault tolerance. For example, nodes within the network may be unable to send and receive messages to and from each other, so they cannot update their state machines to stay in step with the global state of the network. Even worse, nodes can behave erratically due to malfunction or malicious intent. Be that as it may, the system must tolerate these faults and continue to work correctly.
The CAP theorem
The last statement needs further clarification: what does it mean for a network to keep functioning in spite of several nodes behaving in a faulty or malicious way? The CAP theorem provides the basic foundation for understanding how distributed systems operate. Before reviewing the CAP theorem, some basic concepts must be introduced:
- Consistency: a distributed network is consistent if every response it gives back fits the expected service specification.
- Availability: the network is available if each non-faulty node eventually returns a response. In real systems, however, a response delayed for too long is equivalent to the network not responding at all.
- Partition tolerance: partition tolerance is a characteristic of the system itself; it addresses the situation where communication between elements of the network (not including users) is unreliable. For example, when two geographically separated nodes are unable to communicate with each other to synchronize the shared state, we say the system is partitioned. If the system keeps functioning correctly in that scenario, it is said to be partition tolerant.
The CAP theorem states that in a network where communication failures occur (i.e. the system is partitioned) it is impossible to have both consistency and availability at the same time. Consider a system of two nodes running a distributed database service with read/write operations allowed for users. If the connection between the nodes dies while they are receiving user requests, neither node has any way of knowing what the other is doing. In this situation, if they attempt to respond to every request, they may start working with stale information (loss of consistency). But if a node stops responding to requests in order to avoid giving a wrong answer, then the network loses availability.
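To make the two-node scenario concrete, here is a minimal Python sketch of one replica of a key-value store that must decide how to answer reads while it cannot reach its peer. All names (`Replica`, `prefer_availability`, `replicate_to_peer`) are hypothetical and the network call is omitted; the sketch only shows the choice between answering with possibly stale data and refusing to answer.

```python
# Minimal sketch (hypothetical names): one replica of a key-value store that
# must choose how to answer reads while it cannot reach its peer.

class Replica:
    def __init__(self, prefer_availability: bool):
        self.store = {}                  # local copy of the shared state
        self.peer_reachable = True       # becomes False when the system is partitioned
        self.prefer_availability = prefer_availability

    def write(self, key, value):
        self.store[key] = value
        if self.peer_reachable:
            self.replicate_to_peer(key, value)   # keep the other node in sync
        # if the peer is unreachable, this write is only known locally

    def read(self, key):
        if self.peer_reachable:
            return self.store.get(key)   # normal case: state is synchronized
        if self.prefer_availability:
            return self.store.get(key)   # stay available, possibly returning stale data
        raise RuntimeError("partitioned: refusing to answer")  # stay consistent, lose availability

    def replicate_to_peer(self, key, value):
        pass  # network call omitted in this sketch
```

During a partition, a replica built with `prefer_availability=True` keeps answering (and may serve stale values), while one built with `prefer_availability=False` raises an error instead of risking a wrong response.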
How are the CAP theorem's limitations handled?
Trading consistency for availability
Trading availability for consistency
This approach is taken when network communication is usually reliable; data servers located in a single data center are a good example of this situation. When prioritizing consistency, a master-slave structure is often chosen for the network architecture: a master node synchronizes with the shared replicas in order to update the global state when receiving a request. The network updates the shared state before responding to the subsequent request, and thus high levels of consistency are achieved.
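The following Python sketch illustrates that consistency-first idea under the stated assumptions: the master acknowledges a write only after every replica has applied it, and rejects the write when a replica cannot be reached. The class and method names (`Master`, `handle_write`, `apply`) are hypothetical.

```python
# Minimal sketch of consistency-first (master-slave) replication, hypothetical names:
# the master only acknowledges a write after every replica has applied it.

class Replica:
    def __init__(self):
        self.state = {}

    def apply(self, key, value):
        self.state[key] = value
        return True        # in a real system this would be a network acknowledgment

class Master:
    def __init__(self, replicas):
        self.state = {}
        self.replicas = replicas                 # follower nodes holding shared copies

    def handle_write(self, key, value):
        for replica in self.replicas:
            if not replica.apply(key, value):    # synchronous update of each copy
                raise RuntimeError("replica unreachable: write rejected")
        self.state[key] = value                  # commit only after all copies agree
        return "ok"                              # subsequent reads will see this value
```

The design choice is visible in `handle_write`: if any replica is unreachable the write is rejected, which sacrifices availability to keep all copies consistent.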
Gluing pieces with different trade-offs
As stated earlier, some system designers choose to build a network by combining pieces with different trade-offs: some guarantee high levels of consistency while others offer high levels of availability. The system functions with several partitions; here we summarize some of the most common ones:
- Data partitioning: in some cases it is acceptable for a data update to be delayed, for example stock availability in an e-commerce service. In other cases data must always be consistent; the clearest examples are the balance of a bank account or bill payouts.
- Operation partitioning: an example of operation partitioning is a database where read-only operations are prioritized over the bulk of the traffic to raise availability, while write/append operations may stop functioning when the network experiences poor internal communication (see the sketch after this list).
- Functional partitioning: this strategy is adopted when a service is subdivided into several sub-services, so that each sub-service can have the particular trade-off that best fits its requirements.
- User partitioning: designers tend to choose this approach when a service is provided across large geographic areas or when the network must bear a massive workload, as in social media services. Users in different locations rely on servers located close to them; this guarantees high availability, and at the same time consistency is readily achieved if the communication between the servers is highly reliable.
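As a rough illustration of operation partitioning, the Python sketch below keeps read-only requests available from the local copy while rejecting writes whenever the node cannot synchronize with the rest of the network. The names (`PartitionedStore`, `network_healthy`) are hypothetical and replication is left out.

```python
# Minimal sketch of operation partitioning (hypothetical names): read-only requests
# keep being served from the local copy, while writes are rejected whenever the
# node cannot synchronize with the rest of the network.

class PartitionedStore:
    def __init__(self):
        self.local_copy = {}
        self.network_healthy = True    # flipped to False when inner communication degrades

    def read(self, key):
        # reads stay available, at the cost of possibly returning stale data
        return self.local_copy.get(key)

    def write(self, key, value):
        if not self.network_healthy:
            raise RuntimeError("writes disabled during partition")
        self.local_copy[key] = value   # in a real system this would also be replicated
```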
Final thoughts
It is important to have a clear understanding of the properties of distributed systems. When designing such a system, its intended use case will determine the appropriate trade-off between consistency and availability in each component of the distributed system.
In the next post we will consider the more general and theoretical case: we will see that the consistency-availability trade-off is a particular instance of the more general trade-off between safety and liveness in an unreliable system. We will also elaborate further on the characteristics of distributed systems.
References
- Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, June 2002.
- Seth Gilbert and Nancy Lynch. Perspectives on the CAP Theorem. Computer, IEEE Computer Society, February 2012.