
I can have multiple replicas of my database to ensure high availability. Then I can have a quorum such that R+W > N to ensure consistency.
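For concreteness, here is a minimal sketch of my understanding (illustrative only; the node set and configuration values are made up) that brute-forces why R + W > N guarantees every read quorum overlaps every write quorum, so a read always contacts at least one node that saw the latest acknowledged write:

    from itertools import combinations

    # Illustrative configuration; any R + W > N behaves the same way.
    N, R, W = 5, 3, 3
    nodes = set(range(N))

    # Exhaustively check that every possible read quorum overlaps
    # every possible write quorum (pigeonhole: R + W > N).
    for write_set in combinations(nodes, W):
        for read_set in combinations(nodes, R):
            assert set(write_set) & set(read_set), "quorums failed to overlap"

    print("every R-node read quorum intersects every W-node write quorum")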

So do strategies like quorum-based reads/writes or consensus algorithms circumvent the fundamental limitations posed by the CAP theorem? I am sure the answer is no, but somehow the claim that "consensus algorithms enable systems to maintain strong consistency while still offering fault tolerance and availability" is not clicking for me.

Rather, I believe it would be more accurate to say that these strategies help you choose the right spot on a spectrum between availability and consistency in a distributed system.

Can someone please help me understand this better with an example?

  • Consensus protocols are used to prioritize Consistency and Partition-Tolerance while sacrificing Availability of some nodes. When a partition or fault occurs, the majority/leader remains available and consistent, which is what you care about in high-availability scenarios. However, the minority/followers become unavailable to avoid entering an inconsistent state. If there is no remaining partition eligible to meet the quorum, the entire cluster becomes unavailable; this is where the CAP theorem becomes unavoidable. Commented Nov 24, 2024 at 11:59

3 Answers


I think not.

A quorum protocol can't guarantee progress. This falls under the P in CAP: when a partition occurs, the quorum protocol activates to try to decide what to do, but it can fall into a loop where no progress is made. The system then either stops running normally or is forced to exit early, sacrificing Consistency or Availability.

If you think about it, though, even a super simple failover system would "..enable systems to maintain strong consistency while still offering fault tolerance and availability.."

Quorum protocols just try to optimise failover to have fewer error conditions, perhaps to the point where the theoretical failures are not practically going to happen unless you are planning for the Cold War.

https://en.m.wikipedia.org/wiki/Consensus_(computer_science)#The_FLP_impossibility_result_for_asynchronous_deterministic_consensus


...strong consistency while still offering fault tolerance and availability...

Those are not the three items of the CAP theorem.

"Fault tolerance" in the context of "consensus algorithms" does not mean what "partition tolerance whilst also preserving full consistency and full availability" would mean in the context of the CAP theorem.

Consensus algorithms are designed to tolerate faults in the processing of inputs that are available in principle to all the processes participating in the quorum. Specifically, they tolerate wrong processing occurring in a minority of processes, all of which are intended to duplicate the same processing of the same inputs.

A key point in this is that the same input must (by some means) already be available to each process at the outset, and the outputs must (by some means) be available to some single deciding process at the conclusion.

A consensus algorithm can tolerate the complete "partition" of a minority of the processes, so that either this minority does not start processing for lack of inputs, or its outputs do not feed into the final deciding process, but only because the work it is doing is already fully duplicated by other processes.

Consensus algorithms are used in cases where the risk of errors can be mitigated by doing some of the same activities more than once.
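As a hedged illustration of this duplication point (my sketch, not from the answer; the function name is made up): fault masking by consensus amounts to running the same computation on the same shared inputs at several replicas and letting the majority outvote a faulty minority. Once a majority of outputs can no longer reach the decider, no decision is possible.

    from collections import Counter

    def majority_result(results, n_replicas):
        """Return the value a strict majority of n_replicas agree on, or None."""
        if not results:
            return None
        value, count = Counter(results).most_common(1)[0]
        return value if count > n_replicas // 2 else None

    # Three replicas process the same shared input; one misbehaves.
    print(majority_result([42, 42, 17], 3))  # -> 42, the faulty minority is outvoted

    # A partition cut two of the three replicas off from the decider.
    print(majority_result([42], 3))          # -> None, one output is not a majority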

The CAP theorem concerns something completely different. It concerns the terms on which non-duplicated inputs and outputs must work.

Specifically, CAP says that if you accept inputs and/or outputs in more than one place in a system, and inputs and/or outputs at different places should influence each other, then that system must either be up, be down, or have an inconsistent state (i.e. an influence at one place has not propagated to all the other places it should). This is very closely related to basic physics; it's not unique to computing.

A consensus algorithm presumes the inputs are already in one shared place, and that the majority of outputs can get to one shared place again. A consensus algorithm won't correctly deal with any faults that undermine those two assumptions.

  • This is a textbook example of "A proof proves exactly what the proof says it proves, no more, no less". The CAP theorem has very specific definitions of the terms "consistency", "availability", and "partition tolerance" that may or may not be congruent with your intuitive understanding of those terms, and the "C" is definitely not congruent with the "C" from "ACID". Commented Nov 24, 2024 at 11:14
  • surely the point here is that consensus algorithms are used to decide which replicas are in load, not to directly process the SQL Commented Nov 24, 2024 at 11:52
  • @JörgWMittag, yes CAP and ACID ultimately proceed from very different assumptions about the context. ACID is an older concept, perhaps dating to the 70s, concerning the properties of a single central data store. CAP dates from the 90s and concerns the properties of a system consisting of multiple distributed data stores which nevertheless need to synchronise. Implicitly, ACID database engines always consist of internal parts configured to guarantee "consistency" in the CAP sense, and so either in an available or partitioned state. Commented Nov 24, 2024 at 13:51
  • @Ewan, to be honest I wasn't even thinking of SQL engines; my mind turned to the Apollo space computers and their resilience standards, and I drew my answer from thinking about that. The essential point is still the same, though: any kind of "consensus" process relies on the work being done repeatedly from shared inputs, with the minority result being assumed to be in error when the results are re-concentrated. But when a system is suffering partition, the point is that the inputs available in any one place cannot be assumed to be shared, nor the outputs reconcentrable (as it were). Commented Nov 24, 2024 at 14:11

https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed/

A critical point from Brewer's follow-up article is:

high availability (A) of that data (for updates);

Hence, from a CAP perspective, a quorum (R+W > N) is always CP.

If you were to have N=5, R=2, W=4, it's possible that you get a 3-2 network partition. In this mode the database is effectively read-only: all the nodes are able to read the last value, but writes are not possible.
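To make this concrete, here is a small sketch (the function and parameter names are mine, purely illustrative) of which operations each side of a partition can still serve for a given quorum configuration:

    def quorum_ops(n, r, w, partition_sizes):
        """Print read/write availability for each side of a partition."""
        for size in partition_sizes:
            reads = "ok" if size >= r else "blocked"
            writes = "ok" if size >= w else "blocked"
            print(f"{size}/{n} nodes reachable: reads {reads}, writes {writes}")

    # N=5, R=2, W=4 under a 3-2 split: both sides can read, neither can write.
    quorum_ops(5, 2, 4, [3, 2])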

Your system is NOT available from a pure CAP perspective (because writes are not possible). From a practical standpoint, however, it may be enough that your website stays up in read-only mode, which some people might call "available", "fault tolerant", etc.

You may be able to add queues to batch-process updates later, but when you take a broader view encompassing your database and the queue, you find you now have an AP system, as there is no longer a single consistent "latest" value. However, this may be the optimal solution for providing the best overall experience to end users, given the various potential failure modes.

In the real world you might choose a different trade-off, for example N=5, R=3, W=3. That would perform really well with respect to a hardware failure of one (or two) servers, but might have more impact on end users if a network partition occurred.
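Running the earlier sketch with quorum_ops(5, 3, 3, [3, 2]) shows the difference: under the same 3-2 split, the three-node side remains fully available for both reads and writes, while the two-node side can serve neither, so the impact is concentrated on whichever clients can only reach the minority side.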

CAP is a tool for quickly evaluating what is possible, but at the end of the day, if an asteroid takes out all life on Earth, your website doesn't need to stay up; the real world always has corner cases/loopholes you can use.
