A Proof of Correctness for CASPaxos

10 Apr 2018

This post was co-written with Ilia Chtcherbakov.

CASPaxos is a recent and interesting consensus algorithm invented by Denis Rystsov.

The paper describing CASPaxos provides a proof of correctness in its appendix, provided here is an alternative proof.

First, let’s review the protocol. There are several participants.

Clients

A client submits a transformation function to a proposer.

The proposer might reply with a success message consisting of the output of the function applied to the previous value of the register, along with the claim that the value returned was placed into the register.

Alternatively, the client might not receive a response, meaning the outcome is indeterminate.

Proposers

Proposers receive messages from clients. When a proposer receives a request containing a transformation function $f$ from a client, it:

Generates a new ballot number $b$ distinct from any other ballot number. Ballot numbers are natural numbers which are unique across replicas. The method by which unique ballot numbers are generated is left undefined.
Submits a PREPARE message containing $b$ to each of the $2F+1$ acceptors.
Waits until it receives $F+1$ PREPARE OK responses from acceptors. Each PREPARE OK comes with
- a value $v_i$ and
- a ballot number $b_i$.
Determines the proposed new value of the register by applying $f$ to the $v_i$ associated with the largest of all the $b_i$s.
Submits a ACCEPT message containing the proposed new value and the largest $b_i$ to each of the $2F+1$ acceptors.
Waits until it receives $F+1$ ACCEPT OK responses from acceptors.
Submits an acknowledgement to the client along with proposed new value.

Acceptors

Acceptors receive messages from proposers. An acceptor maintains three pieces of state:

its current value (initialized to the special value $\varnothing$),
its current ballot (initialized to 0), and
its current promise (also initialized to 0).

When an acceptor receives a PREPARE message from a proposer containing ballot number $b$:

If $b$ is less than either of ballot or promise, do nothing.
Otherwise, set promise to $b$ and reply to the message with a PREPARE OK containing value and ballot.

When an acceptor receives an ACCEPT message from a proposer containing ballot number $b$ and value $v$:

If $b$ is less than either of ballot or promise, do nothing.
Otherwise, set ballot to $b$, set value to $v$, and reply to the message with ACCEPT OK.

Correctness

Before we can show this protocol is correct, we need to establish what “correct” means. A CASPaxos cluster implements a linearizable compare-and-swap register.

“Linearizable” has a precise meaning that we don’t need to get into yet. We will prove a slightly weaker fact about the register which is still fairly convincing as far as correctness goes.

We use the same setting described by Leslie Lamport in Paxos Made Simple:

Agents operate at arbitrary speed, may fail by stopping, and may restart. Since all agents may fail after a value is chosen and then restart, a solution is impossible unless some information can be remembered by an agent that has failed and restarted. Messages can take arbitrarily long to be delivered, can be duplicated, and can be lost, but they are not corrupted.

Before we define a condition which we will consider “correct”, we need to set our expectations of what we can hope to achieve.

It’s important to note that even in the absence of any concurrent requests, the state of the cluster does not imply a state of the register. Consider a cluster of three nodes in the following (legal) state:

A: (value = x, ballot = 3, promise = 3)
B: (value = y, ballot = 2, promise = 3)
C: (value = y, ballot = 2, promise = 3)

If a proposer successfully executes a prepare cycle while only talking to B and C, the value chosen at the end of said cycle will be y, while if A is part of the prepare, the value chosen will be x. Thus, we can’t define correctness in terms of any kind of canonical sequence of transitions of the cluster from one state to another.

Even worse, we can have arbitrarily long chains of values which succeed in a prepare phase and partially fail in an accept phase, only to get picked up in a subsequent prepare phase, despite eventually getting thrown away in lieu of a different value which did manage to complete an accept phase.

Therefore, we must define correctness in terms of the only externally visible output of the system: acks to clients.

We would like the states the system passes through to be reflected in a sequence of state transitions, each transition defined by the application of some function submitted by a client.

Unfortunately, not all state transitions will be reflected in an ack to the client. Consider the most obvious case, where a proposer fails immediately before sending its ack to the client. The state of the system (must) be the same as if that ack had been sent before the failure occurred, but the client has no confirmation that their message went through.

This brings us to the condition which I claim implies “correctness” (modulo linearizability, which we will get to later):

Let $B$ be the nonempty set of ballot numbers which are associated with an ack to a client in the operation of a CASPaxos cluster (if no values were ever acked, we make no guarantees).

Then there exists \[ C = (0, b_1, …, b_n) \] a sequence of ballot numbers such that

$b_n$ is in $B$,
the value associated with the ballot $b_i$ was the result of $b_{i+1}$’s prepare phase for $i < n$, and
every $b \in B$ appears in $C$.

Intuitively, the sequence $C$ is intended to represent the sequence of states the value of the register passes through, each state represented by the corresponding ballot number of the proposal which led to it. Condition 1 says that we can only commit to a sequence which ends on an acked ballot. Until a ballot or one which descends from it has been acked, we make no guarantees. Condition 2 describes what it means for one state of the register to succeed another. Condition 3 says that if we acked a value, the register had that value at some point in time. Remember that while we require every acked ballot to have been a value of the register at some point, the converse is not true; the register can have taken on values which were not acked.

We will show there exists $C$ which satisfies conditions 1, 2, and 3.

Define \[ C = (b_0, b_1, …, b_n) \] as follows:

$b_0$ = 0. Let $b_n$ be the highest ballot in $B$ (this satisfies condition 1). For $i < n$, let $P$ be the proposer which submitted $b_{i+1}$ in an accept phase. Then define $b_i$ to be the largest ballot $P$ saw during the associated prepare phase (this satisfies condition 2).

To see $C$ is well-defined, we make two observations:

For every ballot ever accepted there was exactly one prepare phase, thus, there is exactly one possible value of $b_i$ given a $b_{i+1}$.
Since there is a finite number of ballots, and ballots are natural numbers, the process of defining successively earlier $b_i$s must end.

We now show that $C$ satisfies condition 3.

First, note that $C$ is increasing. This follows from the fact that for $b_{i+1}$ to have succeeded its prepare phase, no acceptor who replied PREPARE OK could have seen a value greater than $b_{i+1}$, but the highest value returned from any acceptor in that prepare phase was $b_i$, so $b_i < b_{i+1}$.

If $b = b_n$, then $b$ is in $C$ by definition, so say $b < b_n$. Pick the least $i$ such that $b < b_i$, so that $b_{i-1} \le b < b_i$.

Let $M$ be the set of acceptors which participated in $b$’s accept phase. Since $b$ was acked, $M$ contains at least $F+1$ acceptors.

Let $M^\prime$ be the set of acceptors which participated in $b_i$’s prepare phase. Since by definition $b_i$ was either the result of its successor’s prepare phase or was $b_n$, it must have been accepted by at least one acceptor, and thus had a successful prepare phase, so $M'$ contains at least $F+1$ acceptors.

As $M$ and $M^\prime$ each form a majority of replicas, there is some acceptor common to both. Let $A$ be such an acceptor. Since $b < b_i$, $A$ must have accepted $b$ before it prepared $b_i$. By definition, the largest value seen during $b_i$’s prepare phase was $b_{i-1}$. This gives $b \le b_{i-1}$, and so $b = b_{i-1}$, which appears in $C$.

Thus, $C$ as we have defined it satisfies the three conditions and the proof is complete.

Linearizability

It’s my opinion that the previous section is sufficiently convincing that CASPaxos is correct (read: feel free to stop reading here). If we want to show that the resulting register is linearizable, we need to get a bit into the weeds.

Informally, a history is linearizable if it respects causality relationships. That is, if, in “real-time”, some operation concludes before another begins, the second operation observes the effects of the first. If the two operations are concurrent, meaning neither concludes before the other begins, no guarantees are made about their relative ordering.

We now have some reasonable notion of the set of states passed through by the register ($C$). It’s reasonable to define the history (in the sense used by Herlihy and Wing (HW)) of the register as the sequence $$ e_1, e_2, \ldots, e_n $$ where each $e_i$ is either

an invocation $I_x$ (a request sent by a client to a proposer) being sent from a client,
a successful response $A_x$ (ack from a proposer to a client) to some earlier invocation being received by a client, or
an indeterminate response $\mathord{?}_x$,

ordered by “wall-clock”, or “real” time.¹

Note that since the clients never technically receive an indeterminate response (we only insert them at the end of the history for the sake of the model, as you will see in a moment), no client will have an invocation following an unacked invocation.

To show that $H$ is linearizable, we must show that it can be extended to $H^\prime$ such that:

L1: $\text{complete}(H^\prime)$ is equivalent to some legal sequential history $S$, and
L2: ${<_H} \subseteq {<_S}$.

First, get $H^\prime$ by appending to $H$ a $\mathord{?}_x$ event for each invocation $I_x$ in $H$ with no matching $A_x$. Let $B$, $C$, and the $b_i$s be as in the previous section.

$\text{complete}(H^\prime)$ is $H^\prime$ with invocations without a response removed, however, by our construction of $H^\prime$ it contains no such invocations, so $\text{complete}(H^\prime) = H^\prime$.

${<_H} \subseteq {<_S}$ means that $S$’s ordering respects $H$’s (partial) ordering. The way to think about this is that if two operations $a$ and $b$ have the “$a$ ends before $b$ begins” relationship in $H$, they have it in $S$.

Let $$ U = \{u_1, \ldots, u_n\} $$ be the set of all client messages not appearing in $C$.

Next, define $$ S = (I_{b_1}, R_{b_1}, \ldots, I_{b_n}, R_{b_n}, I_{u_1}, ?_{u_1},\ldots,I_{u_k},\mathord{?}_{u_k}) $$

With $R_{b_i} = A_{b_i}$ if $b_i$ was acked, and $\mathord{?}_{b_i}$ otherwise.

Now $S$ is a sequential history constructed by taking the “logical” order we derived in the previous section, and appending to it all unacked messages.

To show that $H$ is linearizable, we now must show that

$S$ is equivalent to $H^\prime$. “Equivalent” in this context means that any individual process cannot distinguish between $S$ and $H^\prime$, even though $S$ and $H^\prime$ are (probably) not identical.
$S$ is legal with respect to the register (running through $S$ should result in the same sequence of states as running through $H$).
${<_H} \subseteq {<_S}$.

$S$’s equivalence to $H^\prime$ follows from the facts that

every event in $H^\prime$ is in $S$ and
since every process $P$ is well-formed, none of their requests precede in $H$ the preceding ack according to $C$, which is how $S$ is ordered.

That $S$ is legal follows from the previous section and how $C$ was constructed (that is, more or less by definition).

That $S$’s ordering respects $H$’s ordering is slightly trickier.

First, since every operation not appearing in $C$ was unacked, it is concurrent with every operation occuring after its invocation, and hence can be safely placed anywhere in $S$ after those operations. Since we place them all at the end, these relationships are satisfied.

So consider two events occurring in $C$, $c_1$ and $c_2$, with $c_1 <_H c_2$. Since $c_1 <_H c_2$, $c_1$’s accept precedes $c_2$’s prepare on the acceptor common to both associated quorums, and thus $c_2$’s prepare involved a higher ballot number than $c_1$’s, and so $c_1$ comes before $c_2$ in $C$ and thus in $S$.

This concludes the proof that the history is linearizable.

Thanks to Tobias Schottdorf for reading over this post.

^1. There's nothing inherent to the definition of linearizability that relates to any notion of "real" or "wall-clock" time. It's just that when we choose to look at histories that we define in terms of our understanding of "real-time", linearizability is a useful property for such histories to have. Herlihy and Wing allude to this in their original paper, but this subtlety is elided from many discussions.

Justin Jaffray