Justin Jaffray


Functional Dependencies and Armstrong's Axioms For Query Optimization

28 Jan 2022

Consider this SQL query:

SELECT a, sum(b) FROM (SELECT a, a+1 AS a2, b FROM ab) GROUP BY a, a2

In English: for each row in ab, compute a2 as a + 1. Then, for every set of rows having the same value of (a, a2), output a along with the sum of their b values.

With a little thought, you might be able to convince yourself that we can simplify this query to

SELECT a, sum(b) FROM ab GROUP BY a

because any two rows that share a value of (a) will also share a value of (a, a2) and vice versa, so changing the grouping columns has no effect on the computation.

The art of query planning is making these observations and then figuring out how to mechanize them so a database can apply them automatically. The way we automate this particular kind of observation is via a surprisingly powerful tool called Functional Dependencies (FDs).

Functional Dependencies

Let’s play a game! You and I are going to agree on a relation. Say, this one, which tells us where various people live.

name region town
Forte CA Pasadena
Paul NY New York
Ilia ON Oakville
Bilal ON Mississauga
Justin ON Mississauga
Irfan ON Toronto

I’m going to pick a row and tell you the value of name, and you have to tell me the value of town. This is a pretty easy game for you: you always have a winning strategy since you can always tell exactly which row I was looking at. If I tell you “Ilia,” you know the value of town: it has to be “Oakville.” Likewise, if I tell you “Justin,” you know the value of town: it has to be “Mississauga.”

Let’s flip the game on its head. I pick a row and tell you the value of town, and you have to tell me the value of name for that row. This one is harder: if I tell you the value of town is Mississauga, you don’t know if I’m looking at the “Bilal” row or the “Justin” row.

In the first case, you could win because the value of name determines the value of town; in the second case, you couldn't, because town does not determine name.

Let’s try a third case. I tell you the value of town, and you tell me the value of region. You can actually win this one too. Even though you can’t tell me which row I’m looking at, it doesn’t matter, since every row I could be looking at has the same value of region. town determines region.

In some relation \(R\), if by knowing the values of some set of columns \(X\), we can determine the values of some set of columns \(Y\), we say the following:

\[ X \rightarrow Y \]

The expression \(X \rightarrow Y\) is called a functional dependency, read "\(X\) determines \(Y\)."
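To make the definition concrete, here is a minimal sketch in Python (the encoding and names are mine, not from any particular system) that checks whether a relation satisfies an FD by comparing every pair of rows, applied to the table above:

from itertools import combinations

# The example relation from above, as a list of dicts.
people = [
    {"name": "Forte",  "region": "CA", "town": "Pasadena"},
    {"name": "Paul",   "region": "NY", "town": "New York"},
    {"name": "Ilia",   "region": "ON", "town": "Oakville"},
    {"name": "Bilal",  "region": "ON", "town": "Mississauga"},
    {"name": "Justin", "region": "ON", "town": "Mississauga"},
    {"name": "Irfan",  "region": "ON", "town": "Toronto"},
]

def satisfies_fd(rows, xs, ys):
    """True if any two rows that agree on the columns xs also agree on ys."""
    def proj(row, cols):
        return tuple(row[c] for c in cols)
    return all(
        proj(r1, ys) == proj(r2, ys)
        for r1, r2 in combinations(rows, 2)
        if proj(r1, xs) == proj(r2, xs)
    )

print(satisfies_fd(people, ["name"], ["town"]))    # True: {name} -> {town}
print(satisfies_fd(people, ["town"], ["name"]))    # False: town does not determine name
print(satisfies_fd(people, ["town"], ["region"]))  # True: {town} -> {region}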

Let’s return to the query we opened on:

SELECT a, sum(b) FROM (SELECT a, a+1 AS a2, b FROM ab) GROUP BY a, a2

Once we turn this query into an initial query plan, it will look something like this:

Group By {Output: [a, sum(b)], GroupingCols: [a, a2]}
                        ^
                        |
                Map {[a + 1 AS a2]}
                        ^
                        |
                     Scan ab

Our query planner will then annotate each operator with the FDs it is able to infer for its output. Since a + 1 AS a2 is computed deterministically from a, we can infer the FD \(\{\texttt{a}\} \rightarrow \{\texttt{a2}\}\) and annotate our Map operator with it (if we were clever we might be able to infer the reverse direction as well):

Group By {Output: [a, sum(b)], GroupingCols: [a, a2]}
                        ^
                        |
                Map {[a + 1 AS a2]}
                  - FDs:[{a} -> {a2}]
                        ^
                        |
                     Scan ab

We then have an optimization rule on our GroupBy operator that looks at the FDs satisfied by the input relation and checks whether our grouping columns have any redundancies: given a set \(X\), we might be able to find a smaller set \(X'\) such that \(X \rightarrow X'\) and \(X' \rightarrow X\). This means that for any pair of rows, if they match on one of the sets, they match on both, which is exactly the condition we need to change our grouping columns. Once we determine that \(\{\texttt{a}\} \rightarrow \{\texttt{a}, \texttt{a2}\}\) and \(\{\texttt{a}, \texttt{a2}\} \rightarrow \{\texttt{a}\}\), we can simplify our query:

Group By {Output: [a, sum(b)], GroupingCols: [a]}
                        ^
                        |
                Map {[a + 1 AS a2]}
                  - FDs:[{a} -> {a2}]
                        ^
                        |
                     Scan ab

and as one final step, we will recognize that as a2 is no longer needed, we can avoid computing it altogether:

Group By {Output: [a, sum(b)], GroupingCols: [a]}
                        ^
                        |
                     Scan ab

This is just one application of FDs but there are many others. We will see more about how to manipulate FDs mechanically a little later.

Keys

A key of a relation is a set of columns which uniquely identifies a row. The term “key” is meant to evoke the idea that, given a set of values for one, we could, in theory, “look up” a row by them.

More concretely, a set \(K\) of columns is a key in \(R\) if no two rows in \(R\) have the same values for \(K\).

name region town
Forte CA Pasadena
Paul NY New York
Ilia ON Oakville
Bilal ON Mississauga
Justin ON Mississauga
Irfan ON Toronto

In this relation, \(\{\texttt{name}\}\) is a key, as are \(\{\texttt{name}, \texttt{region}\}\), \(\{\texttt{name}, \texttt{town}\}\), and \(\{\texttt{name}, \texttt{region}, \texttt{town}\}\). \(\{\texttt{region}\}\) is not a key, because several rows share the same value of \(\texttt{region}\).

In SQL, keys are enforced via integrity constraints like UNIQUE or PRIMARY KEY, but keys can easily arise in the middle of computation in a variety of ways, for instance, the grouping columns of a Group By operation will always form a key in its output.
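As a quick sketch of the definition (reusing the people relation from the earlier snippet), checking keyness is just a uniqueness test on the projection onto the candidate columns:

def is_key(rows, cols):
    """True if no two rows agree on all of the columns in cols."""
    seen = set()
    for row in rows:
        values = tuple(row[c] for c in cols)
        if values in seen:
            return False
        seen.add(values)
    return True

print(is_key(people, ["name"]))            # True
print(is_key(people, ["region"]))          # False: several rows share a region
print(is_key(people, ["name", "region"]))  # True: supersets of keys are keys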

Mechanically Manipulating FDs

If we’re going to use these constructions to mechanically optimize our queries, we need tools to manipulate them in a mechanical way. The tools used to manipulate FDs are a set of rules called Armstrong’s Axioms, and they are as follows (set union, normally written \(A \cup B\), in this field is written \(AB\), for some reason):

  1. Reflexivity (Ax1): if \(Y \subseteq X\), then \(X \rightarrow Y\).
  2. Augmentation (Ax2): if \(X \rightarrow Y\), then \(XZ \rightarrow YZ\).
  3. Transitivity (Ax3): if \(X \rightarrow Y\) and \(Y \rightarrow Z\), then \(X \rightarrow Z\).

The Juice

Armstrong’s Axioms give us the ability to generate new FDs from ones we are already aware of. This manifests in a handful of ways, but the most useful is an algorithm to determine if a given functional dependency holds.

The \(\mathcal F\)-closure of a set of columns \(X\) is written \(X^\mathcal F\) and is the set of columns which are functionally dependent on \(X\). Put another way, “if you know the values of \(X\), what other values do you also know?”

There is a very simple algorithm to compute \(X^\mathcal F\): you start with \(X\) and just keep adding columns implied by it until you can do so no longer:

  1. Let \(Y = X\).
  2. For each FD in \(\mathcal F\), if the left side is contained in \(Y\) and the right side is not, then add the right side to \(Y\) and start over.
  3. Output \(X^\mathcal F = Y\).

To check if \(X \rightarrow Y\) is implied by \(\mathcal F\), check if \(Y \subseteq X^{\mathcal F}\).

If you are building a query planner and looking for the actionable, useful thing to take away here, this is it: this is how you manipulate FDs in your code.
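A minimal sketch of that algorithm in Python (the names and FD encoding are mine, not from any particular planner): FDs are pairs of column sets, fd_closure runs the fixpoint loop above, and both the implication check and the grouping-column observation from the start of the post fall out of it directly:

def fd_closure(fds, xs):
    """Compute the closure of the column set xs under the FDs in fds.

    fds is a list of (left, right) pairs of column sets; the result is
    the set of all columns functionally determined by xs.
    """
    closure = set(xs)
    changed = True
    while changed:
        changed = False
        for left, right in fds:
            if left <= closure and not right <= closure:
                closure |= right
                changed = True
    return closure

def implies(fds, xs, ys):
    """True if fds imply xs -> ys, i.e. ys is contained in xs's closure."""
    return set(ys) <= fd_closure(fds, xs)

# The GroupBy example from the start of the post: given {a} -> {a2},
# grouping by [a, a2] is equivalent to grouping by [a].
fds = [({"a"}, {"a2"})]
print(implies(fds, {"a"}, {"a", "a2"}))  # True
print(implies(fds, {"a", "a2"}, {"a"}))  # True (trivially, by reflexivity)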

We’ll define this more precisely and also see that this algorithm is indeed correct later on.

The Squeeze

Now, let’s figure out if Armstrong’s Axioms are indeed appropriate for reasoning about FDs. When we have a set of axioms for manipulating statements like this, we care that they satisfy two important properties:

  1. Soundness: any statement we can derive from true statements is itself true.
  2. Completeness: any statement which is true can be derived.

It’s very easy to satisfy either one of these in isolation. If we say “you are not allowed to derive any new statements,” we are guaranteed to be sound, because we certainly will not derive something false if we can derive nothing at all. If we say “you can derive anything you want,” we are guaranteed to be complete, because if we can prove anything, we certainly can prove anything which happens to be true.

So clearly, we care about satisfying them both at the same time.

It should not be too hard to convince yourself that Armstrong’s Axioms are sound: they’re all just true facts about relations. That Armstrong’s Axioms are complete is less obvious.

Completeness of Armstrong’s Axioms

To prove Armstrong’s Axioms are “good,” our general approach is going to be:

  1. define what functional dependencies mean, that is, when one actually holds in a relation (semantics),
  2. define how functional dependencies can be derived from one another using Armstrong’s Axioms (syntax), and
  3. show that these two notions coincide: everything derivable is true (soundness), and everything true is derivable (completeness).

Defining Semantics

We say a relation \(R\) satisfies a functional dependency \(X \rightarrow Y\), written \(R \models X \rightarrow Y\), if every pair of rows in \(R\) that agree on the columns of \(X\) also agree on the columns of \(Y\).

If \(\mathcal{F}\) is a set of functional dependencies, we write \(R \models \mathcal F\) ("\(R\) satisfies \(\mathcal F\)") if \(R \models f\) for all \(f \in \mathcal F\) (note that we’re overloading \(\models\) here).

Even though \(R \models \mathcal F\) only requires, specifically, the things in \(\mathcal F\), there might be some additional consequences. For instance, say \(\mathcal F = \{\{\texttt{a}\} \rightarrow \{\texttt{b}\}, \{\texttt{b}\} \rightarrow \{\texttt{c}\} \}\). If \(R \models \mathcal F\), then it’s actually also the case that \(R \models \{\texttt{a}\} \rightarrow \{\texttt{c}\} \) (knowing \(\texttt{a}\) means you know \(\texttt{b}\), knowing \(\texttt{b}\) means you know \(\texttt{c}\), hence, knowing \(\texttt{a}\) means you know \(\texttt{c}\)).

If every relation \(R\) that satisfies \(\mathcal F\) also satisfies \(X \rightarrow Y\), we say \(\mathcal F \models X \rightarrow Y\).

Defining Syntax

A functional dependency is a string \(X \rightarrow Y\), where \(X\) and \(Y\) are subsets of some attribute set \(\mathcal A\).

A set \(\mathcal F\) of functional dependencies is axiom-closed if Armstrong’s Axioms hold within it:

  1. if \(Y \subseteq X\), then \(X \rightarrow Y \in \mathcal F\),
  2. if \(X \rightarrow Y \in \mathcal F\), then \(XZ \rightarrow YZ \in \mathcal F\), and
  3. if \(X \rightarrow Y \in \mathcal F\) and \(Y \rightarrow Z \in \mathcal F\), then \(X \rightarrow Z \in \mathcal F\).

The closure of a set \(\mathcal F\), written \(\mathcal F^+\), is the smallest axiom-closed set that contains \(\mathcal F\). We write \(\mathcal F \vdash X \rightarrow Y\) ("\(\mathcal F\) derives \(X \rightarrow Y\)") if \(X \rightarrow Y \in \mathcal F^+\).

We can now define \(X^{\mathcal{F}}\) precisely: it is the largest \(Y\) such that \(X \rightarrow Y \in \mathcal F^+\).

Completeness of Armstrong’s Axioms

This post came to be because I was lamenting to Forte that I didn’t understand why Armstrong’s Axioms were complete, and he came up with and showed me about four different proofs.

This is my attempt at paraphrasing the one that made the most sense to me.

Claim 1:

\[ \mathcal{F} \models X \rightarrow Y \text{ implies } Y \subseteq X^\mathcal{F} \]

Proof of claim 1

Contrapositive: \[ Y \not\subseteq X^\mathcal{F} \text{ implies } \mathcal{F} \not\models X \rightarrow Y \]

We must show there is some relation \(R\) such that \(R \models \mathcal F\) and \(R \not\models X \rightarrow Y\).

The following is such a relation:

        X^F         |       not(X^F)
-------------------------------------------
1 | 1 | ... | 1 | 1 | 1 | 1 | ... | 1 | 1 |
1 | 1 | ... | 1 | 1 | 0 | 0 | ... | 0 | 0 |

This satisfies \(\mathcal{F}\): if \(A \rightarrow B \in \mathcal F\) and the two rows agree on \(A\), then \(A \subseteq X^\mathcal F\), so by the construction of the closure \(B \subseteq X^\mathcal F\) as well, and the rows agree on \(B\) too. But it does not satisfy \(X \rightarrow Y\): since \(Y \not\subseteq X^\mathcal F\), \(Y\) intersects \(\overline{X^\mathcal F}\), where the two rows differ, even though they agree on \(X\).
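This construction is mechanical enough to write down, too. A sketch reusing the fd_closure and satisfies_fd functions from the earlier snippets: the two rows agree exactly on the columns in \(X^\mathcal F\):

def counterexample(attrs, fds, xs):
    """Two rows agreeing exactly on the columns in the closure of xs."""
    closure = fd_closure(fds, xs)
    row1 = {a: 1 for a in attrs}
    row2 = {a: (1 if a in closure else 0) for a in attrs}
    return [row1, row2]

# F = {a -> b}, X = {a}: the closure is {a, b}, and the relation below
# satisfies F but violates {a} -> {c}.
rel = counterexample(["a", "b", "c"], [({"a"}, {"b"})], {"a"})
print(satisfies_fd(rel, ["a"], ["b"]))  # True
print(satisfies_fd(rel, ["a"], ["c"]))  # False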

Claim 2:

\[ Y \subseteq X^\mathcal{F} \text{ implies } \mathcal{F} \vdash X \rightarrow Y \]

Proof of claim 2

Since \(Y \subseteq X^\mathcal{F}\), \(\mathcal{F} \vdash X^\mathcal{F} \rightarrow Y\) by Ax1 (reflexivity). By the definition of \(X^\mathcal{F}\), \(\mathcal{F} \vdash X \rightarrow X^\mathcal{F}\), and then by transitivity \(\mathcal{F} \vdash X \rightarrow Y\).

Mechanically Manipulating Keys

We can extend Armstrong’s Axioms to handle keys by adding the following two axioms:

  1. if \(K\) is a key, then \(K \rightarrow Y\) for any set of columns \(Y\), and
  2. if \(K\) is a key and \(X \rightarrow K\), then \(X\) is a key.

I’m sure it’s not novel, but I haven’t seen this particular presentation of keys before.

Extending our FD language with these two new rules is also sound and complete; this is not too hard to show using the fact that Armstrong’s Axioms themselves are complete.
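As a hedged sketch of how these rules might be mechanized (assuming the two axioms above, and reusing fd_closure from earlier): track the known keys as a list of column sets, fold rule 1 into the FD set by letting each known key determine all columns, and then rule 2 says a set \(X\) is derivably a key exactly when its closure contains some known key:

def implies_key(fds, keys, attrs, xs):
    """True if xs is derivably a key, per the two rules above.

    Rule 1 is folded in by letting each known key determine all columns;
    rule 2 then says xs is a key if its closure contains some known key.
    """
    extended = list(fds) + [(k, set(attrs)) for k in keys]
    closure = fd_closure(extended, xs)
    return any(k <= closure for k in keys)

# Hypothetical example: if {id} is a key and {email} -> {id},
# then {email} is also a key.
attrs = {"id", "email", "name"}
fds = [({"email"}, {"id"})]
keys = [{"id"}]
print(implies_key(fds, keys, attrs, {"email"}))  # True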

Conclusion

There are two steps to any optimization we want to apply:

  1. Observe some kind of redundancy or unnecessary work according to our understanding of how relations work.
  2. Figure out how to translate that into some kind of mechanical system that can be reasoned about by a computer.

These two steps correspond very roughly to semantics and syntax from logic: understanding what something means, knowing how we represent it, and being confident that those two ideas are in alignment are all important, and we need a wide arsenal of these tricks for a query optimizer to do its best work.

Thanks Forte, Madhav, Ilia, Paul, and Jordan for looking over this post.