# Functional Dependencies and Armstrong's Axioms For Query Optimization
28 Jan 2022

Consider this SQL query:
```sql
SELECT a, sum(b) FROM (SELECT a, a+1 AS a2, b FROM ab) GROUP BY a, a2
```
In English, for each row in `ab`, compute `a2` as `a + 1`. Then, for every set of rows having the same value of `(a, a2)`, compute `a` and the sum of their `b` values.
With a little thought, you might be able to convince yourself that we can simplify this query to

```sql
SELECT a, sum(b) FROM ab GROUP BY a
```

because any two rows that share a value of `(a)` will also share a value of `(a, a2)` and vice versa, so changing the grouping columns has no effect on the computation.
The art of query planning is making these observations and then figuring out how to mechanize them so a database can apply them automatically. The way we automate this particular kind of observation is via a surprisingly powerful tool called Functional Dependencies (FDs).
## Functional Dependencies
Let’s play a game! You and I are going to agree on a relation. Say, this one, which tells us where various people live.
name | region | town
---|---|---
Forte | CA | Pasadena
Paul | NY | New York
Ilia | ON | Oakville
Bilal | ON | Mississauga
Justin | ON | Mississauga
Irfan | ON | Toronto
I’m going to pick a row and tell you the value of `name`, and you have to tell me the value of `town`.
This is a pretty easy game for you: you always have a winning strategy, since you can always tell exactly which row I was looking at.
If I tell you “Ilia,” you know the value of `town`: it has to be “Oakville.”
Likewise, if I tell you “Justin,” you know the value of `town`: it has to be “Mississauga.”
Let’s flip the game on its head. I pick a row and tell you the value of `town`, and you have to tell me the value of `name` for that row.
This one is harder: if I tell you the value of `town` is “Mississauga,” you don’t know if I’m looking at the “Bilal” row or the “Justin” row.
In the first case, you could win because the value of `name` determines the value of `town`; in the second case, you couldn’t, because `town` does not determine `name`.
Let’s try a third case. I tell you the value of `town`, and you tell me the value of `region`.
You can actually win this one too. Even though you can’t tell me which row I’m looking at, it doesn’t matter, since every row I could be looking at has the same value of `region`: `town` determines `region`.
In some relation \(R\), if by knowing the values of some set of columns \(X\), we can determine the values of some set of columns \(Y\), we say the following:
- \(X\) functionally determines \(Y\) in \(R\),
- \(Y\) functionally depends on \(X\) in \(R\), or
- \(R \models X \rightarrow Y\) (read “\(R\) satisfies \(X \rightarrow Y\)”).
The expression \(X \rightarrow Y\) is called a functional dependency.
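To make this concrete, here is a minimal Python sketch of checking an FD against a relation represented as a list of dicts (the function name `satisfies` and the representation are my own, not from any particular database):

```python
def satisfies(rows, X, Y):
    """Check R |= X -> Y: any two rows that agree on X must agree on Y."""
    seen = {}  # projection of a row onto X -> its projection onto Y
    for row in rows:
        x_vals = tuple(row[col] for col in sorted(X))
        y_vals = tuple(row[col] for col in sorted(Y))
        if seen.setdefault(x_vals, y_vals) != y_vals:
            return False  # two rows agree on X but disagree on Y
    return True

people = [
    {"name": "Forte", "region": "CA", "town": "Pasadena"},
    {"name": "Paul", "region": "NY", "town": "New York"},
    {"name": "Ilia", "region": "ON", "town": "Oakville"},
    {"name": "Bilal", "region": "ON", "town": "Mississauga"},
    {"name": "Justin", "region": "ON", "town": "Mississauga"},
    {"name": "Irfan", "region": "ON", "town": "Toronto"},
]

assert satisfies(people, {"name"}, {"town"})      # name determines town
assert satisfies(people, {"town"}, {"region"})    # town determines region
assert not satisfies(people, {"town"}, {"name"})  # but town does not determine name
```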
Let’s return to the query we opened on:
```sql
SELECT a, sum(b) FROM (SELECT a, a+1 AS a2, b FROM ab) GROUP BY a, a2
```
Once we turn this query into an initial query plan, it will look something like this:
```
Group By {Output: [a, sum(b)], GroupingCols: [a, a2]}
  ^
  |
Map {[a + 1 AS a2]}
  ^
  |
Scan ab
```
Our query planner will then annotate each operator with the FDs it is able to infer for its output. Since `a + 1 AS a2` is computed deterministically from `a`, we can infer the FD \(\{\texttt{a}\} \rightarrow \{\texttt{a2}\}\) and annotate our `Map` operator with it (if we were clever we might be able to infer the reverse direction as well):
```
Group By {Output: [a, sum(b)], GroupingCols: [a, a2]}
  ^
  |
Map {[a + 1 AS a2]}
  - FDs:[{a} -> {a2}]
  ^
  |
Scan ab
```
We then have an optimization rule on our `GroupBy` operator that looks at the FDs satisfied by the input relation and checks if our grouping columns have any redundancies: given a set \(X\), we might be able to find a smaller set \(X'\) such that \(X \rightarrow X'\) and \(X' \rightarrow X\), meaning for any pair of rows, if they match on one, they match on both, which is exactly the condition we need to change our grouping columns.
Once we determine that \(\{\texttt{a}\} \rightarrow \{\texttt{a}, \texttt{a2}\}\) and \(\{\texttt{a}, \texttt{a2}\} \rightarrow \{\texttt{a}\}\), we can simplify our query:
```
Group By {Output: [a, sum(b)], GroupingCols: [a]}
  ^
  |
Map {[a + 1 AS a2]}
  - FDs:[{a} -> {a2}]
  ^
  |
Scan ab
```
and, as one final step, we will recognize that since `a2` is no longer needed, we can avoid computing it altogether:
```
Group By {Output: [a, sum(b)], GroupingCols: [a]}
  ^
  |
Scan ab
```
This is just one application of FDs, but there are many others. We will see more about how to manipulate FDs mechanically a little later.
## Keys
A key of a relation is a set of columns which uniquely identifies a row. The term “key” is meant to evoke the idea that, given a set of values for one, we could, in theory, “look up” a row by them.
More concretely, a set \(K\) of columns is a key in \(R\) if no two rows in \(R\) have the same values for \(K\).
name | region | town
---|---|---
Forte | CA | Pasadena
Paul | NY | New York
Ilia | ON | Oakville
Bilal | ON | Mississauga
Justin | ON | Mississauga
Irfan | ON | Toronto
In this relation, \(\{\texttt{name}\}\) is a key, as are \(\{\texttt{name}, \texttt{region}\}\), \(\{\texttt{name}, \texttt{town}\}\), and \(\{\texttt{name}, \texttt{region}, \texttt{town}\}\). \(\{\texttt{region}\}\) is not a key, because several rows share the same value of \(\texttt{region}\).
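Continuing the Python sketch from earlier and reusing the `people` relation (again, the name `is_key` is mine), the key condition is just a duplicate check on each row’s projection onto \(K\):

```python
def is_key(rows, K):
    """Check that no two rows agree on all of the columns in K."""
    seen = set()
    for row in rows:
        k_vals = tuple(row[col] for col in sorted(K))
        if k_vals in seen:
            return False  # two rows share the same values for K
        seen.add(k_vals)
    return True

assert is_key(people, {"name"})
assert is_key(people, {"name", "region"})
assert not is_key(people, {"region"})  # several rows share a region
assert not is_key(people, {"town"})    # Bilal and Justin share a town
```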
In SQL, keys are enforced via integrity constraints like `UNIQUE` or `PRIMARY KEY`, but keys can easily arise in the middle of a computation in a variety of ways. For instance, the grouping columns of a `Group By` operation will always form a key in its output.
## Mechanically Manipulating FDs
If we’re going to use these constructions to mechanically optimize our queries, we need tools to manipulate them in a mechanical way. The tools used to manipulate FDs are a set of rules called Armstrong’s Axioms, and they are as follows (set union, normally written \(A \cup B\), in this field is written \(AB\), for some reason):
- If \(Y \subseteq X\), then \(X \rightarrow Y\) [Ref].
- If \(X \rightarrow Y\), then \(XZ \rightarrow YZ\) [Aug].
- If \(X \rightarrow Y\) and \(Y \rightarrow Z\), then \(X \rightarrow Z\) [Tra].
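As a quick example of the axioms in action, they are enough to derive the handy “union” rule, that \(X \rightarrow Y\) and \(X \rightarrow Z\) together imply \(X \rightarrow YZ\):

- from \(X \rightarrow Y\), [Aug] (adding \(X\) to both sides, and noting \(XX = X\)) gives \(X \rightarrow XY\),
- from \(X \rightarrow Z\), [Aug] (adding \(Y\) to both sides) gives \(XY \rightarrow YZ\),
- chaining the two with [Tra] gives \(X \rightarrow YZ\).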
## The Juice
Armstrong’s Axioms give us the ability to generate new FDs from ones we are already aware of. This manifests in a handful of ways, but the most useful is an algorithm to determine if a given functional dependency holds.
The \(\mathcal F\)-closure of a set of columns \(X\) is written \(X^\mathcal F\) and is the set of columns which are functionally dependent on \(X\). Put another way, “if you know the values of \(X\), what other values do you also know?”
There is a very simple algorithm to compute \(X^\mathcal F\): you start with \(X\) and just keep adding columns implied by it until you can do so no longer:
- Let \(Y = X\).
- For each FD in \(\mathcal F\), if the left side is contained in \(Y\) and the right side is not, then add the right side to \(Y\) and start over.
- Output \(X^\mathcal F = Y\).
To check if \(X \rightarrow Y\) is implied by \(\mathcal F\), check if \(Y \subseteq X^{\mathcal F}\).
If you are building a query planner and looking for the actionable, useful thing to take away here, this is it: this is how you manipulate FDs in your code.
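Here is a minimal Python sketch of that loop (representing \(\mathcal F\) as a list of `(lhs, rhs)` pairs of column sets; the names `closure` and `implies` are mine):

```python
def closure(F, X):
    """Compute the closure X^F: every column functionally determined by X.

    F is a collection of (lhs, rhs) pairs of column sets, read lhs -> rhs.
    """
    Y = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in F:
            # If we know all of lhs, we also know all of rhs.
            if set(lhs) <= Y and not set(rhs) <= Y:
                Y |= set(rhs)
                changed = True
    return Y

def implies(F, X, Y):
    """Check whether X -> Y is implied by F: is Y contained in X^F?"""
    return set(Y) <= closure(F, X)
```

On the query-planning example from the start: with `F = [({"a"}, {"a2"})]`, `closure(F, {"a"})` is `{"a", "a2"}`, so `implies` confirms both \(\{\texttt{a}\} \rightarrow \{\texttt{a}, \texttt{a2}\}\) and \(\{\texttt{a}, \texttt{a2}\} \rightarrow \{\texttt{a}\}\), which is exactly the check that let us shrink the grouping columns.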
We’ll define this more precisely and also see that this algorithm is indeed correct later on.
## The Squeeze
Now, let’s figure out if Armstrong’s Axioms are indeed appropriate for reasoning about FDs. When we have a set of axioms for manipulating statements like this, we care that they satisfy two important properties:
- They should not let us derive any statements which are false. This is called soundness.
- They should let us derive any statements which are true. This is called completeness.
It’s very easy to satisfy either one of these in isolation. If we say “you are not allowed to derive any new statements,” we are guaranteed to be sound, because we certainly will not derive something false if we can derive nothing at all. If we say “you can derive anything you want,” we are guaranteed to be complete, because if we can prove anything, we certainly can prove anything which happens to be true.
So clearly, we care about satisfying them both at the same time.
It should not be too hard to convince yourself that Armstrong’s Axioms are sound: they’re all just true facts about relations. That Armstrong’s Axioms are complete is less obvious.
## Completeness of Armstrong’s Axioms
To prove Armstrong’s Axioms are “good,” our general approach is going to be:
- Define what it means for an FD to “hold” (logicians call this semantics).
  - If \(X \rightarrow Y\) holds in \(\mathcal F\), we write \(\mathcal F \models X \rightarrow Y\).
- Define what it means for an FD to be “provable” from Armstrong’s Axioms (logicians call this syntax).
  - If \(X \rightarrow Y\) is provable from \(\mathcal F\), we write \(\mathcal F \vdash X \rightarrow Y\).
- Prove “soundness” (we won’t bother showing this explicitly), meaning \(\mathcal F \vdash X \rightarrow Y\) implies \(\mathcal F \models X \rightarrow Y\) (“if it’s provable, it’s true”).
- Prove “completeness,” meaning \(\mathcal F \models X \rightarrow Y\) implies \(\mathcal F \vdash X \rightarrow Y\) (“if it’s true, it’s provable”).

Once we’ve shown that \(\models\) and \(\vdash\) always agree, we can conclude that we’re on the same page as our computer when it comes to FDs.
### Defining Semantics
If \(\mathcal{F}\) is a set of functional dependencies, we write \(R \models \mathcal F\) (“\(R\) satisfies \(\mathcal F\)”) if \(R \models f\) for all \(f \in \mathcal F\) (note that we’re overloading \(\models\) here).
Even though \(R \models \mathcal F\) only requires, specifically, the things in \(\mathcal F\), there might be some additional consequences. For instance, say \(\mathcal F = \{\{\texttt{a}\} \rightarrow \{\texttt{b}\}, \{\texttt{b}\} \rightarrow \{\texttt{c}\} \}\). If \(R \models \mathcal F\), then it’s actually also the case that \(R \models \{\texttt{a}\} \rightarrow \{\texttt{c}\}\) (knowing \(\texttt{a}\) means you know \(\texttt{b}\), knowing \(\texttt{b}\) means you know \(\texttt{c}\), hence, knowing \(\texttt{a}\) means you know \(\texttt{c}\)).
If every relation \(R\) that satisfies \(\mathcal F\) also satisfies \(X \rightarrow Y\), we say \(\mathcal F \models X \rightarrow Y\).
### Defining Syntax
A functional dependency is a string \(X \rightarrow Y\), where \(X\) and \(Y\) are subsets of some attribute set \(\mathcal A\).
A set \(\mathcal F\) of functional dependencies is axiom-closed if Armstrong’s Axioms hold within it:
- \(Y \subseteq X\) implies \(X \rightarrow Y \in \mathcal F\),
- \(X \rightarrow Y \in \mathcal F\) implies \(XZ \rightarrow YZ \in \mathcal F\),
- \(X \rightarrow Y \in \mathcal F\) and \(Y \rightarrow Z \in \mathcal F\) together imply \(X \rightarrow Z \in \mathcal F\).
The closure of a set \(\mathcal F\), written \(\mathcal F^+\), is the smallest axiom-closed set that contains \(\mathcal F\).
Now \(X^{\mathcal{F}}\) is the largest \(Y\) such that \(X \rightarrow Y \in \mathcal F^+\).
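To make \(\mathcal F^+\) concrete, here is a deliberately naive Python sketch that applies the three axioms to a fixpoint over a tiny attribute set (exponential in \(|\mathcal A|\), so for illustration only; the names are mine):

```python
from itertools import combinations

def powerset(attrs):
    s = sorted(attrs)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def fd_closure_set(F, attrs):
    """Brute-force F+: close F under [Ref], [Aug], and [Tra]."""
    subsets = powerset(attrs)
    closed = {(frozenset(x), frozenset(y)) for x, y in F}
    for X in subsets:  # [Ref]: Y subset of X implies X -> Y
        for Y in subsets:
            if Y <= X:
                closed.add((X, Y))
    changed = True
    while changed:
        changed = False
        new = set()
        for X, Y in closed:
            for Z in subsets:  # [Aug]: X -> Y implies XZ -> YZ
                new.add((X | Z, Y | Z))
            for X2, Z in closed:  # [Tra]: X -> Y and Y -> Z imply X -> Z
                if X2 == Y:
                    new.add((X, Z))
        if not new <= closed:
            closed |= new
            changed = True
    return closed
```

On a schema small enough for this to finish, you can check that `implies(F, X, Y)` from earlier agrees with `(frozenset(X), frozenset(Y)) in fd_closure_set(F, attrs)` for every pair of subsets, which is the soundness-and-completeness claim below made empirical.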
### Completeness of Armstrong’s Axioms
This post came to be because I was lamenting to Forte that I didn’t understand why Armstrong’s Axioms were complete, and he came up with and showed me about four different proofs.
This is my attempt at paraphrasing the one that made the most sense to me.
**Claim 1:**
\[ \mathcal{F} \models X \rightarrow Y \text{ implies } Y \subseteq X^\mathcal{F} \]
**Proof of claim 1:**
Contrapositive: \[ Y \not\subseteq X^\mathcal{F} \text{ implies } \mathcal{F} \not\models X \rightarrow Y \]
We must show there is some relation \(R\) such that \(R \models \mathcal F\) and \(R \not\models X \rightarrow Y\).
The following is such a relation:
```
        X^F         |       not(X^F)
--------------------+--------------------
1 | 1 | ... | 1 | 1 | 1 | 1 | ... | 1 | 1
1 | 1 | ... | 1 | 1 | 0 | 0 | ... | 0 | 0
```
This satisfies \(\mathcal{F}\) (for any \(U \rightarrow V \in \mathcal F\), if the two rows agree on \(U\) then \(U \subseteq X^\mathcal F\), hence \(V \subseteq X^\mathcal F\) and they agree on \(V\) too), but it does not satisfy \(X \rightarrow Y\): since \(Y \not\subseteq X^\mathcal F\), \(Y\) intersects \(\overline{X^\mathcal F}\), where the two rows differ, while they agree on \(X\).
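The construction is easy to replay with the earlier Python sketches (`closure` and `satisfies` from above; `counterexample` is my name for it):

```python
def counterexample(attrs, F, X):
    """The two-row relation from the proof: rows agree exactly on X^F."""
    xf = closure(F, X)
    row_ones = {a: 1 for a in attrs}
    row_mixed = {a: 1 if a in xf else 0 for a in attrs}
    return [row_ones, row_mixed]

attrs = {"a", "b", "c", "d"}
F = [({"a"}, {"b"}), ({"b"}, {"c"})]
R = counterexample(attrs, F, {"a"})  # rows agree on {a, b, c}, differ on d
assert all(satisfies(R, lhs, rhs) for lhs, rhs in F)  # R |= F ...
assert not satisfies(R, {"a"}, {"d"})  # ... but R does not satisfy a -> d
```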
**Claim 2:**
\[ Y \subseteq X^\mathcal{F} \text{ implies } \mathcal{F} \vdash X \rightarrow Y \]
**Proof of claim 2:**
Since \(Y \subseteq X^\mathcal{F}\), \(\mathcal{F} \vdash X^\mathcal{F} \rightarrow Y\) by [Ref]. By the definition of \(X^\mathcal{F}\), \(\mathcal{F} \vdash X \rightarrow X^\mathcal{F}\), and then by [Tra], \(\mathcal{F} \vdash X \rightarrow Y\).
Chaining the two claims gives completeness: if \(\mathcal F \models X \rightarrow Y\), then \(Y \subseteq X^\mathcal F\) by Claim 1, and so \(\mathcal F \vdash X \rightarrow Y\) by Claim 2. (This is also the promised argument that the closure algorithm’s membership check is correct.)
## Mechanically Manipulating Keys
We can extend Armstrong’s Axioms to handle keys by adding the following two axioms:
- If \(\text{key}(K)\), then \(K \rightarrow \mathcal A\) (where \(\mathcal A\) is the set of all columns) [K1].
- If \(\text{key}(A)\) and \(B \rightarrow A\), then \(\text{key}(B)\) [K2].
I’m sure it’s not novel, but I haven’t seen this particular presentation of keys before.
Extending our FD language with these two new rules is also sound and complete. It’s not too hard to show this using the fact that Armstrong’s Axioms themselves are complete.
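Operationally (a sketch on top of the earlier `closure` function; the name and encoding are mine): each known key \(K\) contributes the FD \(K \rightarrow \mathcal A\) per [K1], and then \(B\) is provably a key exactly when its closure under the augmented FDs covers every column, since that gives \(B \rightarrow A\) for a known key \(A\) and [K2] finishes the job.

```python
def is_derivable_key(F, keys, attrs, B):
    """Is B provably a key, given the FDs F and some known keys?"""
    # [K1]: each known key determines the whole attribute set.
    augmented = list(F) + [(K, set(attrs)) for K in keys]
    # [K2]: B is a key iff it determines a known key, i.e. its closure is everything.
    return closure(augmented, B) == set(attrs)

# The grouping-columns example: after GROUP BY a, {a} is a key of the output
# (column names here, like b_sum for sum(b), are illustrative).
attrs = {"a", "a2", "b_sum"}
F = [({"a"}, {"a2"})]
assert is_derivable_key(F, [{"a"}], attrs, {"a", "a2"})
assert not is_derivable_key(F, [{"a"}], attrs, {"a2"})
```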
## Conclusion
There are two steps to any optimization we want to apply:
- Observe some kind of redundancy or unnecessary work according to our understanding of how relations work.
- Figure out how to translate that into some kind of mechanical system that can be reasoned about by a computer.
These two steps correspond very roughly to semantics and syntax from logic. Understanding what something means, deciding how we represent it, and being confident those two ideas are in alignment are all important, and we need a wide arsenal of these tricks for a query optimizer to do its best work.
Thanks Forte, Madhav, Ilia, Paul, and Jordan for looking over this post.