On Liveness
Often there are misconceptions about the concept of “liveness”, especially when first studying Distributed Systems. Here are a few quick clarifications and intuitions.
Let us address the following:
- What exactly is liveness?
- In asynchronous systems, there is no upper bound on message delays. So isn’t liveness trivially impossible? Just keep delaying the messages, no node talks to any other node → no consensus possible (safety, no safety doesn’t matter). But this is NOT correct. Else why would FLP (Fischer-Lynch-Paterson theorem) be that big a deal, right?
- Why does Paxos not satisfy liveness?
- Is there something like “more live” and “less live”? Like, does smaller quorums mean “more live” because the algorithm will converge faster?
- How to prove or disprove liveness of a system?
Answers
-
First, broadly liveness means “something good will eventually happen” and applies to a number of problems.
- In consensus specifically, a protocol being “live” means that the “protocol will end within a finite amount of time”. Note that the time may be UNBOUNDED — i.e. you can’t say for sure that it will end within 5 min or 1 hour or 10 days etc., but it WILL end at some time.
- So what is an example of a protocol that is live? Consider a modified consensus algorithm: The nodes attempt Paxos, but if a node doesn’t hear from the leader after a timeout T after receiving the election-phase message, it will output 0.
- This is live because every node is GUARANTEED to output a value and end the protocol. The time it takes to end is (time for election-phase message to reach) + T.
- But obviously, this protocol is not safe because some nodes may output 0 and some may output 1 (i.e. if they heard from the leader and the leader was proposing ‘1’).
-
The “un-achievability” of liveness isn’t as trivial as saying “just delay all messages infinitely and no work gets done, protocol never ends”. Because the definition of an asynchronous system is that message delays are “unbounded but FINITE”.
- “Finite” is the important word here. Async systems do not guarantee a time limit on when messages reach, but they do allow messages to eventually reach.
- Now we see why FLP is so powerful. It says that even in this reasonable definition of async systems, where messages DO eventually reach, it is impossible to design a protocol that will end 100% of the time and is safe.
-
Why is Paxos not live?
- Consider this: Some node A starts the election phase, but its propose-phase messages get so delayed that some other node B decides to start the election phase (because it thinks A might have failed). All of node A’s messages are useless even if they reach after node B’s election phase has started. And now after node B’s election phase, all of node B’s messages get delayed too and node C starts election.
- You can keep doing this forever — i.e. you can construct a sequence of message delays such that Paxos keeps running forever in an attempt to reach consensus. This makes Paxos not live.
- But obviously, the probability of this happening forever is supremely low if your timeouts are configured well. So it works out in practice. But there is no guarantee and hence no provable liveness.
- More generally, FLP has to counter any claims of liveness, and not just in Paxos (e.g. someone may claim they have a consensus algorithm that doesn’t use timeouts like we did above), and that is why it is so elaborate.
(Another common adage is consensus can’t have liveness because “you cannot distinguish a failed node from arbitrary message delays.”)
- As you can see, liveness isn’t defined by the “speed” of progress of the algorithm. It is decided by the eventual result. So, there is nothing like “more live” or “less live”.
- Proving liveness is very hard in general. More common is disproving liveness by showing some sequence of events such that the algorithm never converges. So any question of liveness, ask yourself “Can I construct such a sequence?”. That event may have infinitesimal probability, but if you can show one exists, you can disprove the liveness property of an algorithm.
On “Eventual” Liveness
Question: Sometimes we refer to protocols like Paxos as “eventually live”. What exactly does it mean? How is “eventual liveness” different from simply “liveness”? After all, isn’t liveness already defined informally as “eventually something good happens”?
Short answer: Unless otherwise stated, for liveness, use the strict definition below. If asked for “eventual liveness”, append the definition with “If bad things stop happening, …”
Strict Definition: Liveness means that the algorithm in question is guaranteed to end.
Long answer:
The key operational word here is “guaranteed” (note: no mention of the word “eventual”). It takes just one counterexample, within given assumptions of the system, to disprove liveness. (Assumptions are things like: are messages guaranteed to reach, is it FIFO delivery, what about node failures, etc. These may not all be explicitly spelled out, so we should know what algorithms were designed to work in what settings.)
Now about the heavily abused word “eventual”. What’s the point of mentioning “Paxos is eventually live”?
Firstly, Paxos is not live because when bad things like failures/message delays happen, Paxos can keep going on forever. One can construct an adversarial sequence of failures/delays such that a Paxos run never results in a consensus and it keeps trying. But, by God’s grace, if bad things stop happening and the system behaves well, is Paxos guaranteed to converge?
Well, it is not so obvious. Badly designed algorithms can livelock/deadlock because of deficiencies in the algorithm’s design itself, for some specific inputs or message patterns, EVEN when the system behaves well.
In saying “Paxos is eventually live”, we are saying Paxos is NOT so. It is well-designed enough that when bad things stop, good things are guaranteed to happen. We can’t call Paxos strictly live, but we still need a name for this slightly diminished guarantee. So we call it “eventually live”. This “eventually” is different from the one in “eventually good things happen”. This “eventually” encompasses “we hope that eventually bad things stop happening”.
The “eventually” in the informal liveness definition of “eventually good things happen” is more raw. It asks: no matter what happens, will an algorithm converge if I am willing to wait until that “eventuality”? For Paxos, and any consensus algorithm in asynchronous systems, the answer is No (FLP). Thus Paxos is not live.
(It can be very confusing if we overthink it.)
However, if you like to overthink:
Liveness can be violated generally in two ways: Livelocks and Deadlocks.
But there’s a third way a system/algorithm can be not live. Bad configurations: e.g. you set timeouts much less than message delays and the algorithm is just timing out and restarting, doing no useful work even when the system is well-behaved. Do you call this a liveness violation? I don’t know.
Other resource: dinhtta.github.io/flpcap — the first half has very nice intuitions.
Liveness / Safety Summary
- When a decision is made, all nodes make the same decision → Safety.
- But will a decision always be reached or will the nodes be debating forever? If it is guaranteed to be reached, then you have Liveness. But we know this is not possible, because FLP.
- Will a decision be eventually reached with very high probability → Eventual Liveness. Although strict liveness is not possible, with very high probability nodes will agree on a value.
“Liveness” is a very overloaded term in Distributed Systems and can be confusing. People say liveness means “eventually good things happen”. But then, does it mean liveness automatically implies safety? What about program termination with safety violated? etc.
So, one has to judge based on the given context what liveness means. Theoretical papers usually define liveness formally to mention what exact meaning of liveness they are referring to.