JOIN: The Ultimate Projection
13 Jun 2022Or: the core idea of how database query planners decorrelate subqueries.
Joins
In a relational database, to compute the join of \(R\) and \(S\) on \(p\), for each row \(r\) in \(R\), find all the rows \(s\) in \(S\) where \(p(r, s)\) is true, and emit the concatenation of \(r\) and \(s\). Readers of this blog might note this is a different definition than we’ve used in the past, but this one is well-suited to the way we’ll be using joins today.
Joins are an important operation when working with normalized data. If I have a table of users along with region IDs, plus a table which maps those IDs to the names of the regions:
> SELECT u_name, u_region_id FROM users;
u_name | u_region_id
---------+--------------
Bilal | 1
Irfan | 2
Jordan | 3
Justin | 1
Paul | 3
> SELECT r_id, r_name FROM regions;
r_id | r_name
-------+--------------
1 | Mississauga
2 | Toronto
3 | New York
To match people with the name of their region, we can join the two tables on having matching region ids:
> SELECT u_name, r_name FROM users JOIN regions ON u_region_id = r_id;
u_name | r_name
---------+--------------
Bilal | Mississauga
Irfan | Toronto
Jordan | New York
Justin | Mississauga
Paul | New York
From our definition, this is “for each user
, find all region
s such that u_region_id = r_id
” (in this particular case, there’s only one, since
r_id
is a key for the regions
table).
Projections
A projection is an operation on a relation which either trims away or adds
columns based on existing ones.
In SQL, we can compute a new expression derived from other columns by simply writing out that expression in our SELECT
list:
> SELECT a FROM ab;
a
-----
1
2
3
> SELECT a, a*a AS aa FROM ab;
a | aa
----+-----
1 | 1
2 | 4
3 | 9
This adds a new column aa
which is the square of the a
column.
Correlated Subqueries
SQL allows a very special kind of projection: an entire query.
Rather than writing out the join explicitly above, one might prefer to
think of that computation how they would in a normal programming language:
for each user
, I want to look up their region
:
> SELECT u_name, (SELECT r_name FROM regions WHERE r_id = u_region_id)
FROM users;
u_name | r_name
---------+--------------
Bilal | Mississauga
Irfan | Toronto
Jordan | New York
Justin | Mississauga
Paul | New York
This is called a subquery. Since the subquery refers to columns from its outer scope (that is: it has free variables) it’s called a correlated subquery.
Such queries are notoriously tricky to implement in relational databases. This is unfortunate, as they’re often very useful. In particular, it’s often much easier to construct queries of this form programmatically than it is to explicitly construct a join programmatically.
Remember, databases like joins a lot. Not just because they’ve been optimized very aggressively, but the symmetry of joins gives the query optimizer a lot of freedom to pick the best possible way to execute them. The subquery form of the query only really implies one implementation: run the subquery for each row. Can we do better?
While this thing is not obviously a join, we might suspect it’s pretty close, given that it has the same output as our join from before. Which raises the question: can we teach a query planner to transform a correlated subquery like this into a join?
Representing Functions
In high school you might have learned that a good way to represent a function is with a table of values, or function table. We can represent the function \[ f(x) = x^2 \]
with the (infinite) table
\(x\) | \(x^2\) |
---|---|
… | … |
-1 | 1 |
0 | 0 |
1 | 1 |
2 | 4 |
3 | 9 |
… | … |
Where the input to the function is the left column, and the corresponding output is the right column. The thing that makes this a function and not merely a relation is that it passes the “vertical line test:” no value occurs in the left side of the table more than once.
However, it’s also a relation, so we can represent some finite subset of it in a relational database:
> CREATE TABLE squares AS
(SELECT x, x*x AS xx FROM generate_series(-1000, 1000) x);
> SELECT * FROM squares WHERE x > -10 AND x < 10;
x | xx
-----+-----
-9 | 81
-8 | 64
-7 | 49
-6 | 36
-5 | 25
-4 | 16
-3 | 9
-2 | 4
-1 | 1
0 | 0
1 | 1
2 | 4
3 | 9
4 | 16
5 | 25
6 | 36
7 | 49
8 | 64
9 | 81
These are cool for us because they give us a relational way of thinking about applying a function: to find \(f(x)\) you just find \(x\) in the left column and see what the value of \(f(x)\) in the right column is.
Projections are Joins
Recall our earlier query to project the square of a number:
> SELECT a, a*a AS aa FROM ab;
a | aa
----+-----
1 | 1
2 | 4
3 | 9
One way to think about joins is that they “look stuff up” in a relation.
Rather than compute a*a
explicitly, we could also just look it up in squares
:
> SELECT a, xx AS aa FROM ab JOIN squares ON a = x;
a | aa
----+-----
1 | 1
2 | 4
3 | 9
Since we have a function table (squares
), we can always do this rewrite:
SELECT a, a*a AS aa FROM ab;
\[ \rightarrow \]
SELECT a, xx AS aa FROM ab JOIN squares ON a = x;
In general, if we are trying to project f(a_1, ..., a_n)
and we have a function table for f
with columns x_1, ..., x_n, o
(where the x
s are the inputs to
the function and o
is the output), the following
transformation is valid:
SELECT ..., f(a_1, ..., a_n), ... FROM a
\[ \rightarrow \]
SELECT ..., o, ... FROM
a JOIN f ON a_1 = x_1 AND ... AND a_n = x_n
Decorrelation
Decorrelation is the act of taking a correlated subquery and turning it into a flat join.
Recall our correlated subquery from earlier:
SELECT u_name, (SELECT r_name FROM regions WHERE r_id = u_region_id)
FROM users;
The projection
(SELECT r_name FROM regions WHERE r_id = u_region_id)
has one free variable: the column u_region_id
, so let’s pull this out into a function:
f(u_region_id) = (SELECT r_name FROM regions WHERE r_id = u_region_id)
Now our query is
SELECT u_name, f(u_region_id) FROM users;
Now, do we have a function table for f
?
Yes: it’s just regions
!
So we can do our rewrite:
SELECT u_name, r_name FROM users JOIN regions ON u_region_id = r_id;
Which gives us a completely decorrelated query that our database is free to evaluate however it sees fit.
Conclusion
This was a very simplified example. There are a handful of caveats:
- We only handled projections. Of course, it’s also necessary to handle subqueries in other positions (projections do get you a decent amount of the way there, though).
- If our function table isn’t total, we need to use a left join rather than an inner join so we don’t
throw away inputs with no match. The right way to do this is to start with a left join that might decay into
an inner join if the optimizer can prove that
f
is indeed total, which might be possible if, say, we had a foreign key constraint. - In general you likely need to do quite a bit more work than this to find a function table, since our correlated subquery can be arbitrarily complex and eliminating the correlation even after rewriting as a join can be difficult.
- It’s an error in SQL if a subquery returns more than one row. This can be problematic once it becomes a join, and sometimes requires inserting assertion operators into the plan to make sure that errors don’t get elided by the transformation.
This is the core of the idea though, and I would argue, the conceptual “hard part.” Once you understand this, you more-or-less get the major idea of how decorrelation works.
In a follow up I’ll actually get into some details about we can do this for more
complex subqueries.
Here we could kind of just spot that regions
was a valid function table, but that approach doesn’t scale
and isn’t obvious how to automate.
References
- Orthogonal Optimization of Subqueries and Aggregation:
the canonical source as far as I’m aware. See here for a full discussion of
Apply
and the various rewrite rules required to eliminate it. - Unnesting Arbitrary Queries: a generalization of the above, discusses some issues with the Microsoft paper and how to fix them.
- How Materialize and other databases optimize SQL subqueries: an overview of how various systems, and another discussion of the flaws with the Microsoft approach.