I was going through one of the official introduction videos of Google Spanner. It mentions "Google Spanner is a mission-critical relational database service built from the ground up and battle-tested at Google for Strong Consistency and High Availability at a global scale".
A few questions came to mind after hearing this statement:
How does a database guarantee high Availability and Strong Consistency on a global scale?
Ensuring Partition Tolerance is necessary in building Distributed Systems. On top of that, how does Spanner provide High Availability and Strong Consistency simultaneously?
If it provides all three guarantees, does this break the CAP theorem?
The short answer is that Google Spanner does not break the CAP theorem. Before going deeper, let us revisit the CAP theorem. As per Wikipedia, any distributed data store can provide only two of the following three guarantees:
Consistency - Every read receives the most recent write or an error.
Availability - Every request receives a (non-error) response, without the guarantee that it contains the most recent write. The system is said to be available if it sends any non-error response. The response might or might not be the most recent one.
Partition Tolerance - The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes. The system consists of multiple nodes or servers that communicate with each other through a specified protocol, forming a network. A partition occurs when some of these nodes are unable to communicate with each other, whether due to network disruptions or software or hardware issues. This splits the network into two or more disjoint subsets that cannot communicate with each other. Partition Tolerance denotes the system's capacity to withstand and operate effectively despite these network partitions.
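To make the idea of "disjoint subsets" concrete, here is a minimal sketch that groups nodes by which of them can still reach each other. The node names and the `partitions` helper are made up for illustration and have nothing to do with how Spanner itself detects partitions.

```python
from collections import deque

def partitions(nodes, links):
    """Group nodes into sets that can still reach each other over working links."""
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first walk over every node reachable from `start`.
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        groups.append(group)
    return groups

nodes = ["us", "eu", "asia"]                      # hypothetical regions
healthy = partitions(nodes, [("us", "eu"), ("eu", "asia")])
split = partitions(nodes, [("us", "eu")])         # the eu-asia link is lost
```

With all links up there is a single group. Once the eu-asia link is lost, "asia" ends up in a group of its own, which is exactly the situation the CAP theorem is concerned with.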
When a network partition occurs, the system is left with two options, Consistency or Availability, and must choose one of the following:
When choosing consistency over availability, the system will return an error or a time-out if particular information cannot be guaranteed to be the most recent. The system is aware that it has been split into multiple subsets (due to the network partition) that cannot communicate with each other, so it cannot guarantee that its data is up to date.
When choosing availability over consistency, the system will always process the query and try to return the most recent available version of the information, even if it cannot guarantee up-to-date data due to network partitioning.
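The two choices above can be sketched with a toy replica. This is only an illustration: the `Replica` class, its `mode` flag and the `try_read` helper are hypothetical, and real databases make this decision with far more machinery.

```python
class Replica:
    """Hypothetical replica that holds a local copy of one value."""

    def __init__(self, mode, value):
        self.mode = mode            # "CP" or "AP"
        self.value = value          # local copy, possibly stale
        self.partitioned = False    # True when cut off from its peers

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse rather than risk a stale answer.
            raise TimeoutError("cannot confirm the latest write")
        # Availability first (or no partition): answer from the local copy.
        return self.value


def try_read(replica):
    """Return the value read, or a short error string if the read is refused."""
    try:
        return replica.read()
    except TimeoutError as exc:
        return f"error: {exc}"


cp, ap = Replica("CP", "v1"), Replica("AP", "v1")
# Assume a newer value "v2" was written on the other side of the partition.
cp.partitioned = ap.partitioned = True
cp_result = try_read(cp)   # the CP replica refuses to answer
ap_result = try_read(ap)   # the AP replica answers with its stale "v1"
```

The CP replica gives up availability by returning an error, while the AP replica stays available but may serve the older "v1".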
In CAP terms, an AP system provides 100% availability for reads and writes, which is practically impossible to achieve. However, if a system's availability is so high that outages are rare enough for most users not to worry about them, it can be treated as highly available in practice. Spanner meets this standard, boasting Availability exceeding five nines (less than one failure in 10^5 requests).
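To get a feel for what "five nines" means, here is a quick back-of-the-envelope calculation. Translating a request failure rate into downtime per year is my own simplifying assumption (failures spread evenly over time); it is not how Google reports Spanner's availability.

```python
def max_failure_fraction(nines):
    """Largest allowed fraction of failed requests, e.g. 5 nines -> 1 in 10^5."""
    return 10 ** -nines


def downtime_minutes_per_year(nines):
    """Equivalent downtime, assuming failures are spread evenly over a 365-day year."""
    minutes_per_year = 365 * 24 * 60
    return minutes_per_year * max_failure_fraction(nines)


for nines in (3, 4, 5):
    print(f"{nines} nines: {downtime_minutes_per_year(nines):.2f} minutes per year")
```

Under that assumption, five nines works out to a little over five minutes of unavailability in a whole year.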
How does Spanner attain such a high level of Availability? Spanner runs on Google's private network. Unlike most wide-area networks, and especially the public internet, this network is fully controlled by Google, which allows for hardware and path redundancy as well as tight management of upgrades and overall operations. For example, even if one of the undersea optical fibers breaks, redundant paths ensure that communication between nodes remains unaffected. Spanner is also deployed on exceptionally high-quality specialized hardware, complemented by redundancy and fallbacks at the hardware level. While occasional disruptions such as optical fiber cuts and equipment failures may occur, the overall system remains highly robust.
Using infrastructure refined through years of operational improvements, Google ensures that network partitions are exceedingly rare, which leads to very high Availability. When there is no network partition, the system is both highly available and strongly consistent. That is how Spanner provides High Availability, Strong Consistency, and Partition Tolerance in practice.
However, in the rare event of a Network Partition, Spanner gives up its Availability to ensure Consistency of data. So, technically Spanner is a CP System. However, for all practical purposes, it appears to defy the CAP theorem by delivering Strong Consistency, High Availability, and Partition Tolerance.
References:
https://en.wikipedia.org/wiki/CAP_theorem
https://cloud.google.com/blog/products/databases/inside-cloud-spanner-and-the-cap-theorem
https://www.youtube.com/watch?v=amcf6W2Xv6M