Justin Jaffray

blog notes

Why Consensus?

15 May 2018

Depending on which computer scientists and greek lawmakers you listen to, Paxos could either be taught to a 3-year old, or requires a team of PhDs “almost a year” to fully grasp. I think the discrepancy here comes from a poorly motivated setting. I couldn’t have provided a coherent explanation of what consensus was a year ago, but in the time since then I’ve slowly pieced together an understanding of it thanks to coworkers who put up with my questions. As one does, I’ve decided in the interim that society is the problem and thus the solution to the confusion is for me to write up my newfound understanding.

Most importantly, I think the thing that makes consensus unintuitive in the first place is that it naively doesn’t even seem like a problem that needs solving. The problem to which Paxos is a solution is quite poorly communicated in most literature. Who are these servers? Why do they all seem to have differing opinions about this value in the first place? I’m not going to explain any actual consensus algorithms in this post, but will just try to give some intuition for the setting in which they operate.

Let’s get some scary-sounding concepts out of the way. For the purposes of this post:

I think it’s most coherent to think of the problem universe as having some object which various servers [replicas] all share. In single-degree Paxos, it’s a slot which can be filled exactly one time. For CockroachDB, it’s a slice of the keyspace (called a range). For a simpler system, it might be an entire small key-value database. Functionally, we would like this object to behave as if it is not replicated across several different replicas (we’ll get into what this actually means later).

The reason we might find ourselves in this situation is that we want to have this object stored on a server somehow, but we don’t want to have a single point of failure. If we replicate an object across \(2n+1\) servers in vanilla Paxos, for instance, the system can keep operating when up to \(n\) of those servers fail. This is the only reason to use consensus - if you don’t need to tolerate failures, it’s simpler and more efficient to just have one server be the single source of truth.

What makes this difficult is that in a distributed setting, we can make very few assumptions about how the replicas can communicate. Messages can be lost, reordered, delayed indefinitely, and replicas can go offline for arbitrary periods of time. Importantly, though, we assume messages are never “tampered with”. This is what is meant by “fail-stop”: the only method by which replicas fail is by stopping. Modelling more mischievous failure modes opens up a deep can of worms, which is not of much practical interest unless your aim is to raise the temperature of the planet via cryptographic investment vehicles.

This shared “object” is the “replicated state machine” that people speak of. I’m going to call it a “[replicated] object” because I think it’s a more intuitive way to think about it, but I will also refer to possible values that the object can take on as “states” (an integer can take on states 0, 1, -1, 2, -2 …, a map can take on states {"a": 1, "b": 4}, {"a": 1, "b": 7}, …), etc. This is because I think it’s also valuable to think of this object as passing through various “states” throughout the operation of the system.

Already we have an unspoken ambiguity that we need to address. We speak of this “object” which is shared among several replicas, but if each replica has its own version of the object stored locally, how can there be a single canonical instance of the object? If A somehow believes the value (state of the object) is 1 and B somehow believes the value is 2, how do we even define in an absolute sense what the value is? I’m going to call this single canonical version of the object The One True Object (TOTO).

The answer to how we define TOTO is this: it doesn’t really matter, as long as all clients see a TOTO which is consistent. In the previous example, the value could turn out the be 1 or 2, as long as there’s no client which sees it as 1 when another client sees it as 2 (this is sometimes called a “split-brain scenario”). Clients should be able to always reason about the system only as a TOTO, despite these other, possibly conflicting versions physically existing on various machines at the same time. The existence of a TOTO is what people mean when they say that a distributed system should appear as a single machine, rather than a collection of machines.

This brings us to Desirable Property #1: there should be a TOTO. Clients should never see two incompatible versions of the object (where compatible roughly means that we passed through one state to get to the other state).

Desirable Property #1 is generally obtained via “quorums”, which usually means that a majority of replicas need to acknowledge a change to the TOTO for it to be accepted. The core idea here is that quorums are picked such that any two of them will intersect, which prevents a situation where two groups of replicas are operating independently of one another (the simplest way to design such a quorum is to require a majority). Another implication of Desirable Property #1 is that replicas can’t even serve reads to clients without performing some kind of coordination, or else latent conflicting versions might be exposed.

Naturally, clients need to interact with and possibly modify the TOTO. Consider a client who wants to perform an operation on such an object. A slot (register) can be filled with a value, or a key-value pair can be inserted, updated, or deleted. The client does this by submitting messages to one of the replicas, which then attempts to carry out the requested operation. An operation, of course, can fail. In the case of the slot that can only be filled once, perhaps the slot has already been filled. The client doesn’t know this when they submit their message. The server they talk to might not even know this when they receive the message! Thus, here are the possible outcomes of a message:

Now we have Desirable Property #2: the system should never lie about the effect of a client’s operation. If the system positively acknowledges a client operation, any future state of the TOTO should reflect that operation, and likewise for negative acknowledgement (if my flippant use of the word “future” bothers you, please speak to my lawyer).

As an example, a system could provide Desirable Property #1 but not #2 by immediately acknowledging client operations before acquiring a quorum. It’s possible that the replica the client is communicating with will accept the operation, but there is no quorum of other replicas which will, perhaps because they’ve already applied some incompatible operation unknown to the gateway replica. This gives better latency (because the gateway replica need not eat a network round-trip to a quorum of replicas) at the cost of possibly lying to the client.

Conversely, a system could provide Desirable Property #2 but not #1 by always being able to accept operations (say, via CRDTs) at the cost of having temporarily divergent TOTOs (ideally in a system designed around this they should re-converge quickly).

If you want to get more precise about the Desirable Properties and the model they provide for thinking about consensus, I advise you to look at my earlier post A Proof of Correctness for CASPaxos which addresses correctness for consensus more formally.

This gives us a rough idea of what a consensus algorithm is: loosely, a consensus algorithm is an algorithm for replicating an object, which provides the two Desirable Properties.

There’s another important aspect to consensus I didn’t understand initially: how come consensus algorithms seem so interested in logs specifically (“log”, to be clear, as in “log file”, or “ordered sequence of events”, not as in “logarithm”). Multi-Paxos and Raft both exclusively provide a log as their replicated object. Why not provide something else? CASPaxos is an example of a consensus algorithm which provides a replicated object which is not a log, but it’s the exception, rather than the rule, here.

The answer is that a log is sufficient (though possibly not necessary), for every kind of object one could ever want to replicate. Since a log is precisely just an ordered sequence of events, we make our log entries describe whatever operations we would like to perform on some other object. For instance, say I want to replicate some string → integer map. Rather than having to deal with replication logic specific to such a data structure, I can just replicate log entries as operations:

and since this log has been replicated faithfully, replicas must simply apply the sequence of operations in order and they will end up replicating the value we are representing the same way. When I asked about this, Ben Darnell compared a replicated log to something of a universal Turing machine for replicated objects, which I think is a particularly evocative description of the core idea at play here.

It’s also convenient that using a log gives us a convenient way to “compact” the updates - once we’re sure that every replica has applied all the entries up to some point, we can just allow them all to truncate those entries, since they’re not needed anymore.

Obviously this is not the full story around consensus, but in this post I’ve tried to outline some of the assumptions that Everybody Knows But You (but took me roughly a year of asking questions to piece together).

Further reading: