Justin Jaffray


Merge Join is Weird

27 Oct 2019

Since I’ve made myself a pariah at work over this, I figured I should explain in more detail what I mean.

I’m going to start with a very brief explanation of what I mean by “join.” An “inner join” (henceforth “join”) is the most fundamental query primitive in relational databases. It’s not an exaggeration to say that SQL derives most of its benefits from the fact that it allows users to write joins and have them executed relatively fast.

A “relation”, or “table” (depending on whether you’re a nerd or a jock) is a set or multiset (again, depending on whether you’re a nerd or a jock) of “tuples” or “rows” (again, …).

A “join” of two relations \(A\) and \(B\) on a predicate \(p\) (written \(A \bowtie_p B\)) is a new relation whose rows are the concatenations of each pair of rows in \(A\) and \(B\) that satisfy \(p\).

Symbolically: \[ A \bowtie_p B = \sigma_p(A \times B) = \left\{ ab \:|\: a \in A, b \in B, p(ab) \right\} \]

Where \(\times\) is the “cross product” operator that just gives you every pair of rows, and \(\sigma_p\) is the “select” or “filter” operator.

This definition has some useful properties.

Commutativity (note that we don’t care about the order of entries within the various n-tuples, usually we just care about their names):

\[ A \bowtie B = B \bowtie A \]

Associativity (don’t worry about fiddling around with the predicates here, it works):

\[ A \bowtie (B \bowtie C) = (A \bowtie B) \bowtie C \]

And less often talked about, distributivity:

\[ A \bowtie (B + C) = A \bowtie B + A \bowtie C \]

(where \(+\) is the union operator where you smash two tables together). It doesn’t look like it, but this one is actually extra spicy and justifies a lot of non-trivial algorithms.

Anyone with a cursory knowledge of query processing will tell you that the three musketeers of binary join algorithms are the hash join, nested loops join, and merge join. But I’m going to argue that there’s really only one join algorithm: the one that encompasses hash join and nested loops join (and some other cool stuff). And then also merge join happens to exist for some reason.

So what’s the one true join algorithm? Well, the definition suggests an algorithm:

for a in A:
  for b in B:
    ab = a + b  # concatenate the two rows
    if p(ab):
      emit(ab)

A cute way of interpreting this algorithm is that

ab = a + b
if p(ab):
  emit(ab)

is a join algorithm that only works on relations when each has a single row. We then rewrite \(A\) to \(a_1 + a_2 + \ldots\) and \(B\) to \(b_1 + b_2 + \ldots\) (where each \(a_i\) and \(b_j\) are a single row from \(A\) and \(B\) respectively).

Then \(A \bowtie B\) is \((a_1 + a_2 + \ldots) \bowtie (b_1 + b_2 + \ldots)\) and by the distributive property, \[ A \bowtie B = a_1 \bowtie b_1 + a_1 \bowtie b_2 + \ldots = \sum_{a \in A, b \in B} a \bowtie b \] and we’ve successfully bootstrapped a real join algorithm from our baby one that only works on singletons.

This is the “nested loops join” and it can be mildly generalized by replacing the inner loop with a lookup into an index or something (I’ve heard this variation called “lookup join”).
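A minimal sketch of that variation, assuming rows are tuples we can concatenate with +, plus a hypothetical key_a function standing in for A.x and a hypothetical index dict mapping each value of B.y to the rows of B carrying it:

def lookup_join(A, index, key_a):
  for a in A:
    for b in index.get(key_a(a), []):  # probe the index instead of scanning B
      yield a + b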

We can pull this trick to derive hash join too. If \(p\) is an equality predicate, say \(A.x = B.y\), and \(V\) is the domain of \(A.x\) and \(B.y\), partition \(A\) as \[ A = A_{v_1} + A_{v_2} + \ldots = \sum_{v \in V}A_v \] where each \(A_{v}\) contains all the tuples in \(A\) having \(x = v\). Then we do the same for \(B\): \[ B = B_{v_1} + B_{v_2} + \ldots = \sum_{v \in V}B_v \] with each \(B_{v}\) containing all the tuples in \(B\) having \(y = v\).

Then rewrite: \[ A \bowtie_{x=y} B = \left(\sum_{v \in V}A_v\right) \bowtie_{x = y} \left(\sum_{v \in V}B_v\right) \]

and use the distributive property:

\[ \sum_{v_x \in V} \sum_{v_y \in V} A_{v_x} \bowtie_{x=y} B_{v_y} \]

Iterations of this sum where \(v_x \not= v_y\) are just empty, and ones where they are equal don’t need to be filtered, so

\[ \sum_{v \in V} A_{v} \times B_{v} \]

Which pretty directly gives the algorithm for hash join:

m = {}
for a in A:
  m.setdefault(a.x, []).append(a)  # build: bucket the rows of A by their join key
for b in B:
  for a in m.get(b.y, []):         # probe: only the bucket that can match b
    emit(a + b)                    # a + b is the concatenated row ab

This strategy of exploiting distributivity can get us even more esoteric join algorithms.

Consider a setup where a table \(A\) is sharded among a number of machines and we’d like to join it with a small table \(B\). Naively, to join these two tables we’d have to bring them both to a single machine and use a traditional algorithm, but distributivity justifies the following algorithm: send each machine that has rows from \(A\) its own copy of \(B\). Have it perform the join locally and then emit the results to be collected elsewhere (this is often called “broadcast join”).
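Here’s a toy, single-process sketch of that idea, where the shards list stands in for the machines holding pieces of \(A\) and nothing about the network is modeled:

def broadcast_join(shards, B, p):
  for shard in shards:    # conceptually runs on the machine holding `shard`
    for a in shard:
      for b in B:         # every machine got its own full copy of B
        ab = a + b
        if p(ab):
          yield ab        # in a real system the results get collected elsewhere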

Grace hash join is also easily justified by distributivity: hash-partition both relations on the join key, and the join of the wholes is just the sum of the joins of matching partitions (rows landing in different partitions can never join).
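A minimal in-memory sketch, assuming rows are tuples and hypothetical key_a/key_b functions standing in for the join columns (a real implementation spills each partition to disk and joins the pairs one at a time):

NUM_PARTITIONS = 8  # a real system sizes this so each partition fits in memory

def grace_hash_join(A, B, key_a, key_b):
  parts_a = [[] for _ in range(NUM_PARTITIONS)]
  parts_b = [[] for _ in range(NUM_PARTITIONS)]
  for a in A:
    parts_a[hash(key_a(a)) % NUM_PARTITIONS].append(a)
  for b in B:
    parts_b[hash(key_b(b)) % NUM_PARTITIONS].append(b)
  # Distributivity: rows whose keys hash to different partitions can never
  # join, so the whole join is the sum of the per-partition joins.
  for pa, pb in zip(parts_a, parts_b):
    m = {}
    for a in pa:
      m.setdefault(key_a(a), []).append(a)
    for b in pb:
      for a in m.get(key_b(b), []):
        yield a + b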

I’m not going to go into the even more esoteric modern “worst-case-optimal” join algorithms, but we can do basically the same thing to justify their correctness (with some notable exceptions).

So where does merge join fit into all this?

As a refresher, merge join works in the following way: we have an equality condition, \(A.x = B.y\), and we have \(A\) stored ordered on \(x\) and \(B\) stored ordered on \(y\). We can then iterate over the two tables in lockstep, taking the cross product between contiguous chunks of the tables where the equality columns match.
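Concretely, a minimal sketch, assuming both inputs are in-memory lists already sorted on their join columns, with hypothetical key_a and key_b functions standing in for \(x\) and \(y\):

def merge_join(A, B, key_a, key_b):
  i, j = 0, 0
  while i < len(A) and j < len(B):
    ka, kb = key_a(A[i]), key_b(B[j])
    if ka < kb:
      i += 1
    elif ka > kb:
      j += 1
    else:
      # Find the contiguous run of rows with this key on each side...
      i_end = i
      while i_end < len(A) and key_a(A[i_end]) == ka:
        i_end += 1
      j_end = j
      while j_end < len(B) and key_b(B[j_end]) == kb:
        j_end += 1
      # ...and emit the cross product of the two runs.
      for a in A[i:i_end]:
        for b in B[j:j_end]:
          yield a + b
      i, j = i_end, j_end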

None of these strategies (which, IMO, are very natural and follow pretty directly from the definitions) really work to justify it. I think this alone suggests that there’s something unique going on here.

Look at all the concepts we had to add to our vocabulary in order to even define this! We now have a notion of how our relations are “stored”, iteration order is important, even “ordering” at all was nowhere to be seen before this. To me, exploiting the structure of our predicate feels like an optimization in hash join; here it’s central to the structure of the algorithm.

You could argue that merge sort is very natural and merge join directly descends from it, but I think that’s kind of suspect too: in merge sort, ordering is inherent to the problem. The total order you’ve defined on the things you’re sorting is an input, and without it not only does the algorithm not make sense, the problem doesn’t either. This isn’t true of merge join: nowhere in the definition of the join do we define an “order” over the things we’re joining on; it’s purely an implementation detail.

It’s tempting to attribute the effectiveness here to “ordering” functionally acting as a “partitioning,” but that doesn’t really work. Consider that the two relations being joined need to “agree” on the total order they’re using for the algorithm to work. It feels to me as though there’s some kind of implicit communication going on between the two tables being joined in that they both decide ahead of time what “ordered” means to them. Also note that any total ordering suffices, but generally we’re limited to the “natural” ordering of a given data type.

Anyway, thanks for reading my email Larry. I trust merge join will be removed from the 2020 release of Oracle.