JOIN: The Ultimate Projection

13 Jun 2022

Or: the core idea of how database query planners decorrelate subqueries.

Joins

In a relational database, to compute the join of \(R\) and \(S\) on \(p\), for each row \(r\) in \(R\), find all the rows \(s\) in \(S\) where \(p(r, s)\) is true, and emit the concatenation of \(r\) and \(s\). Readers of this blog might note this is a different definition than we’ve used in the past, but this one is well-suited to the way we’ll be using joins today.

Joins are an important operation when working with normalized data. If I have a table of users along with region IDs, plus a table which maps those IDs to the names of the regions:

> SELECT u_name, u_region_id FROM users;

  u_name | u_region_id
---------+--------------
  Bilal  |           1
  Irfan  |           2
  Jordan |           3
  Justin |           1
  Paul   |           3

> SELECT r_id, r_name FROM regions;

  r_id |   r_name
-------+--------------
     1 | Mississauga
     2 | Toronto
     3 | New York

To match people with the name of their region, we can join the two tables on having matching region ids:

> SELECT u_name, r_name FROM users JOIN regions ON u_region_id = r_id;

  u_name |   r_name
---------+--------------
  Bilal  | Mississauga
  Irfan  | Toronto
  Jordan | New York
  Justin | Mississauga
  Paul   | New York

From our definition, this is “for each user, find all regions such that u_region_id = r_id” (in this particular case, there’s only one, since r_id is a key for the regions table).

Projections

A projection is an operation on a relation which either trims away or adds columns based on existing ones. In SQL, we can compute a new expression derived from other columns by simply writing out that expression in our SELECT list:

> SELECT a FROM ab;

  a
-----
  1
  2
  3

> SELECT a, a*a AS aa FROM ab;

  a | aa
----+-----
  1 |  1
  2 |  4
  3 |  9

This adds a new column aa which is the square of the a column.

Correlated Subqueries

SQL allows a very special kind of projection: an entire query.

Rather than writing out the join explicitly above, one might prefer to think of that computation how they would in a normal programming language: for each user, I want to look up their region:

> SELECT u_name, (SELECT r_name FROM regions WHERE r_id = u_region_id)
    FROM users;

  u_name |   r_name
---------+--------------
  Bilal  | Mississauga
  Irfan  | Toronto
  Jordan | New York
  Justin | Mississauga
  Paul   | New York

This is called a subquery. Since the subquery refers to columns from its outer scope (that is: it has free variables) it’s called a correlated subquery.

Such queries are notoriously tricky to implement in relational databases. This is unfortunate, as they’re often very useful. In particular, it’s often much easier to construct queries of this form programmatically than it is to explicitly construct a join programmatically.

Remember, databases like joins a lot. Not just because they’ve been optimized very aggressively, but the symmetry of joins gives the query optimizer a lot of freedom to pick the best possible way to execute them. The subquery form of the query only really implies one implementation: run the subquery for each row. Can we do better?

While this thing is not obviously a join, we might suspect it’s pretty close, given that it has the same output as our join from before. Which raises the question: can we teach a query planner to transform a correlated subquery like this into a join?

Representing Functions

In high school you might have learned that a good way to represent a function is with a table of values, or function table. We can represent the function \[ f(x) = x^2 \]

with the (infinite) table

\(x\)	\(x^2\)
…	…
-1	1
0	0
1	1
2	4
3	9
…	…

Where the input to the function is the left column, and the corresponding output is the right column. The thing that makes this a function and not merely a relation is that it passes the “vertical line test:” no value occurs in the left side of the table more than once.

However, it’s also a relation, so we can represent some finite subset of it in a relational database:

> CREATE TABLE squares AS
    (SELECT x, x*x AS xx FROM generate_series(-1000, 1000) x);
> SELECT * FROM squares WHERE x > -10 AND x < 10;

  x  | xx
-----+-----
  -9 | 81
  -8 | 64
  -7 | 49
  -6 | 36
  -5 | 25
  -4 | 16
  -3 |  9
  -2 |  4
  -1 |  1
   0 |  0
   1 |  1
   2 |  4
   3 |  9
   4 | 16
   5 | 25
   6 | 36
   7 | 49
   8 | 64
   9 | 81

These are cool for us because they give us a relational way of thinking about applying a function: to find \(f(x)\) you just find \(x\) in the left column and see what the value of \(f(x)\) in the right column is.

Projections are Joins

Recall our earlier query to project the square of a number:

> SELECT a, a*a AS aa FROM ab;

  a | aa
----+-----
  1 |  1
  2 |  4
  3 |  9

One way to think about joins is that they “look stuff up” in a relation. Rather than compute a*a explicitly, we could also just look it up in squares:

> SELECT a, xx AS aa FROM ab JOIN squares ON a = x;

  a | aa
----+-----
  1 |  1
  2 |  4
  3 |  9

Since we have a function table (squares), we can always do this rewrite:

SELECT a, a*a AS aa FROM ab;

\[ \rightarrow \]

SELECT a, xx AS aa FROM ab JOIN squares ON a = x;

In general, if we are trying to project f(a_1, ..., a_n) and we have a function table for f with columns x_1, ..., x_n, o (where the xs are the inputs to the function and o is the output), the following transformation is valid:

SELECT ..., f(a_1, ..., a_n), ... FROM a

\[ \rightarrow \]

SELECT ..., o, ... FROM
    a JOIN f ON a_1 = x_1 AND ... AND a_n = x_n

Decorrelation

Decorrelation is the act of taking a correlated subquery and turning it into a flat join.

Recall our correlated subquery from earlier:

SELECT u_name, (SELECT r_name FROM regions WHERE r_id = u_region_id)
    FROM users;

The projection

(SELECT r_name FROM regions WHERE r_id = u_region_id)

has one free variable: the column u_region_id, so let’s pull this out into a function:

f(u_region_id) = (SELECT r_name FROM regions WHERE r_id = u_region_id)

Now our query is

SELECT u_name, f(u_region_id) FROM users;

Now, do we have a function table for f? Yes: it’s just regions! So we can do our rewrite:

SELECT u_name, r_name FROM users JOIN regions ON u_region_id = r_id;

Which gives us a completely decorrelated query that our database is free to evaluate however it sees fit.

Conclusion

This was a very simplified example. There are a handful of caveats:

We only handled projections. Of course, it’s also necessary to handle subqueries in other positions (projections do get you a decent amount of the way there, though).
If our function table isn’t total, we need to use a left join rather than an inner join so we don’t throw away inputs with no match. The right way to do this is to start with a left join that might decay into an inner join if the optimizer can prove that f is indeed total, which might be possible if, say, we had a foreign key constraint.
In general you likely need to do quite a bit more work than this to find a function table, since our correlated subquery can be arbitrarily complex and eliminating the correlation even after rewriting as a join can be difficult.
It’s an error in SQL if a subquery returns more than one row. This can be problematic once it becomes a join, and sometimes requires inserting assertion operators into the plan to make sure that errors don’t get elided by the transformation.

This is the core of the idea though, and I would argue, the conceptual “hard part.” Once you understand this, you more-or-less get the major idea of how decorrelation works.

In a follow up I’ll actually get into some details about we can do this for more complex subqueries. Here we could kind of just spot that regions was a valid function table, but that approach doesn’t scale and isn’t obvious how to automate.

References

Orthogonal Optimization of Subqueries and Aggregation: the canonical source as far as I’m aware. See here for a full discussion of Apply and the various rewrite rules required to eliminate it.
Unnesting Arbitrary Queries: a generalization of the above, discusses some issues with the Microsoft paper and how to fix them.
How Materialize and other databases optimize SQL subqueries: an overview of how various systems, and another discussion of the flaws with the Microsoft approach.

Justin Jaffray