Justin Jaffray


Hard and Soft Statistics

28 Aug 2023

This post first appeared in my new newsletter: Null Bitmap. If you’re interested in getting smaller, more rambly posts directly in your inbox, please consider signing up.

One lens I’ve found satisfying for “what is a query planner” is that it’s a tool for “approximating domain knowledge.”

By that I mean, someone who knows Python could probably write a pretty solid program to generate an end-of-month report for an airline company. But writing a good version of that program probably requires knowledge of the distribution of flights and passengers, and of the internal IDs used by the company. Such things (how many of each there are, how they’re related to each other, etc.) influence lots of programming decisions, like how to order lookups or what data structures to use in various scenarios.

To put it simply, I’d say writing a program for this kind of problem requires:

  1. general-purpose programming knowledge, and

  2. domain-specific knowledge for the task at hand.

One of the big wins we get when we move queries from an imperative language like Python to a declarative language like SQL is that we give the database a chance to express its opinion on things in category (2).

One of my favourite specific examples of the ambiguity in planning is the following query:

SELECT * FROM t WHERE x = 3 ORDER BY y

If we have an index on x and an index on y (where an “index” here specifically means we have fast sequential access and fast random access to that column), there are two obvious ways to serve this query:

  1. We can look up in the x index to find rows that match our x = 3 condition, and then sort the results by y. This way avoids a full table scan, but requires a sort. Or,

  2. we can scan the y index and filter down to the rows where x = 3. This way avoids a sort, but requires a full table scan.

Which way is better depends on the density of x = 3 values in the table: if only a handful of rows match, looking them up in the x index and sorting that small result is cheap, but if a large fraction of the table matches, we’d rather just read the y index in order and filter as we go.
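
To make that concrete, here’s a rough sketch in Python of the comparison a planner might make once it has some estimate of how selective x = 3 is. Nothing here is any real database’s cost model, and the constants are entirely made up:

import math

# Invented cost constants, purely for illustration.
RANDOM_IO_COST = 4.0   # fetching one row via an index lookup
SEQ_IO_COST = 1.0      # reading one row during a sequential scan
SORT_FACTOR = 0.01     # per-comparison cost when sorting

def cost_index_then_sort(total_rows: int, selectivity: float) -> float:
    """Plan 1: look up matching rows in the x index, then sort them by y."""
    matching = total_rows * selectivity
    return matching * RANDOM_IO_COST + SORT_FACTOR * matching * math.log2(matching + 2)

def cost_scan_then_filter(total_rows: int, selectivity: float) -> float:
    """Plan 2: scan the y index in order, keeping only rows where x = 3."""
    return total_rows * SEQ_IO_COST

def choose_plan(total_rows: int, selectivity: float) -> str:
    if cost_index_then_sort(total_rows, selectivity) < cost_scan_then_filter(total_rows, selectivity):
        return "look up in the x index, then sort by y"
    return "scan the y index in order, then filter"

print(choose_plan(1_000_000, 0.001))  # few matching rows: the index lookup wins
print(choose_plan(1_000_000, 0.5))    # half the table matches: the ordered scan wins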

How do we know? Like before, if you’re a programmer writing a program to answer this query, you might just kind of like, know. “Oh yeah, almost all of our users have this particular configuration option set, so doing the lookup like this won’t be useful.” If you’re a computer, you need a more automated way to reason about this kind of thing.

What we want to know is, what proportion of the rows in the table satisfy our predicate x = 3. We could answer this by just scanning over the entire table, but that doesn’t really help us in our endeavour to avoid, uh, scanning over the whole table.

The common solution to this problem is to compute some kind of derived information, called a statistic, that we can use to estimate the answer to this question cheaply. There are a lot of variations on this: databases often use tools like sketches to estimate cardinality, histograms to estimate distributions, and samples to estimate the selectivity of arbitrary predicates.
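
Here’s a toy sketch of the histogram flavour, just to make the idea concrete. This isn’t any particular database’s implementation; it’s a tiny equi-width histogram built from a sample of the column, used to guess what fraction of rows satisfy x = 3:

import random
from collections import Counter

# A toy equi-width histogram over a column, built from a sample of the table.
# Purely an illustrative sketch, not any particular database's implementation.

class Histogram:
    def __init__(self, values, num_buckets=10):
        self.lo, self.hi = min(values), max(values)
        self.num_buckets = num_buckets
        self.width = (self.hi - self.lo) / num_buckets or 1
        self.counts = Counter(self._bucket(v) for v in values)
        self.total = len(values)

    def _bucket(self, v):
        return min(int((v - self.lo) / self.width), self.num_buckets - 1)

    def estimate_eq_selectivity(self, v):
        # Crude assumption: integer values spread evenly within each bucket.
        rows_in_bucket = self.counts[self._bucket(v)]
        distinct_in_bucket = max(self.width, 1)
        return (rows_in_bucket / distinct_in_bucket) / self.total

# Build the histogram once from a sample, then answer estimates without
# touching the table at all.
x_sample = [random.randint(0, 99) for _ in range(10_000)]
hist = Histogram(x_sample)
print(f"estimated fraction of rows with x = 3: {hist.estimate_eq_selectivity(3):.4f}")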

In many databases, we can see these in action by running EXPLAIN. Here’s a shell for a personal app I run on CockroachDB:

defaultdb=> explain select * from entities where consumed > 0;
                        info
----------------------------------------------------------
 • filter
 │ estimated row count: 632
 │ filter: consumed > 0.0
 │
 └── • scan
       estimated row count: 2,119
        (100% of the table;
         stats collected 2 days ago;
         using stats forecast for 28 days in the future)
       table: entities@entities_pkey
       spans: FULL SCAN

You can see that CockroachDB is using its statistics to estimate the number of rows in the table, along with the estimated selectivity of the filter (it’s pretty close: the actual number is 615).

But an interesting thing to note is that these stats were collected “2 days ago.”

So any statistics we have on this database are just approximate—we can’t rely on them for any correctness-critical determinations. If our statistics say “0 rows have consumed > 0,” that doesn’t mean we can replace the condition with false. But we can still use them to guide our decisions on the assumption that they’re close to correct.

I believe this is more-or-less the industry standard for how statistics work: you do a periodic survey of the database, build statistics from it, and then use those statistics to take a crack at estimating how big the various tables are, which in turn informs planning decisions. I want to highlight a notable exception to that standard practice: Snowflake.

Snowflake

I learned from this talk a couple of years ago that Snowflake doesn’t actually adhere to the same model as above, where statistics are allowed to be out of date. Its statistics are always kept up to date, and so the query planner can rely on them for much stronger decision-making than a planner with weaker guarantees could. The whole talk is great, but I want to highlight that I think this is a really interesting design decision. They know, for sure, at plan time, a lot of information about the database, and specifically about particular shards: for example, the largest and smallest values of a particular column. I’ve been calling these “hard” statistics, contrasted with the more common “soft” statistics.

When I first heard about this, it seemed a strange decision to me—you’re basically turning your statistics into a materialized view, of sorts, which requires a lot of effort and computational resources to maintain. With some more thought, it seems to me like a really good decision for this kind of database specifically, for a couple of reasons.

Reason 1: it’s not that expensive

Since Snowflake serves exclusively analytical queries, its datasets don’t see frequent updates, and so the work of keeping statistics up to date doesn’t actually have to happen all that often.

Reason 2: table scans need to be fast

Snowflake doesn’t support indexes, which means they rely really heavily on their full table scans being efficient. This would be a problem for a lot of databases (it surprised me when I first heard it, coming from a more OLTP background) but I think it makes sense for a database that’s going to be supporting a lot of ad-hoc analytical querying. You’re going to have to do full-table scans often, so you might as well just focus on making those fast. To that end, if you want to be able to elide sections of your full table scan, you need up-to-date statistics on those sections, or else you can’t safely skip them (and Snowflake does exactly this).
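
Here’s a sketch of the kind of pruning this enables. The names are invented for illustration (Snowflake’s real metadata is kept per micro-partition and is much richer than this), but the idea is that if you keep exact min/max values for each chunk of the table, you can safely skip chunks that can’t possibly contain a match:

from dataclasses import dataclass

# Invented structure: exact per-chunk min/max for the column in the predicate.
# Skipping a chunk is only safe because the metadata is guaranteed current.
@dataclass
class ChunkMeta:
    chunk_id: int
    min_x: int
    max_x: int

def chunks_to_scan(chunks: list[ChunkMeta], x_value: int) -> list[int]:
    """Keep only chunks whose [min_x, max_x] range could contain x_value."""
    return [c.chunk_id for c in chunks if c.min_x <= x_value <= c.max_x]

catalog = [
    ChunkMeta(0, min_x=1, max_x=2),
    ChunkMeta(1, min_x=3, max_x=7),
    ChunkMeta(2, min_x=8, max_x=12),
]
print(chunks_to_scan(catalog, 3))  # [1] -- two of the three chunks are skipped entirely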

Reason 3: you’re going to answer a lot of crappy queries

Since Snowflake powers a lot of dashboards and other analytical tools, that means a lot of its queries are going to be auto-generated. These queries often look like:

SELECT * FROM
  (SELECT * FROM table1 WHERE condition_1) UNION ALL
  (SELECT * FROM table2 where condition_2) UNION ALL
   ...

With various parameters templated in. But in many instantiations of these queries, lots of these subtrees are going to be empty, because condition_n will wind up not matching anything. Soft statistics can sometimes let you prune these subtrees, when you wind up generating conditions like WHERE x = 2 AND x > 45, but they do less well when you have conditions like WHERE x = 'turnip' and it turns out there are no turnips in the database. Hard statistics let you do this! If there are no turnips, you can know that and rely on it! This is obviously less valuable if your queries are going to be human-written and won’t contain pointless subtrees, but if you’re a data warehouse, it makes a lot of sense.
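
Here’s a small sketch of the difference, again with invented names: a hard statistic, like an exact per-table min and max that’s guaranteed to be current, lets the planner prove a branch of the UNION ALL can’t match anything and drop it from the plan outright, where a soft statistic only lets it guess that the branch is probably small:

# Invented sketch: a UNION ALL branch whose predicate value falls outside the
# table's exact min/max provably produces no rows, so it can be dropped before
# execution. This is only safe because the min/max are "hard" (never stale).

def branch_is_provably_empty(table_min: str, table_max: str, value: str) -> bool:
    """Safe to conclude 'empty' only when the metadata is exact, not an estimate."""
    return not (table_min <= value <= table_max)

branches = [
    ("table1", "apple", "plum"),      # its condition looks for 'turnip'
    ("table2", "apple", "zucchini"),  # so does this one
]
kept = [name for name, lo, hi in branches
        if not branch_is_provably_empty(lo, hi, "turnip")]
print(kept)  # ['table2'] -- the table1 branch is gone before execution even starts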

I’ll admit I have a bit of a taxonomizing brain and the idea of being able to categorize query planners based on their choices for various aspects like this is really appealing to me.

(Apologies to Snowflake if any of my information is out of date or wrong, if you work at Snowflake and want to comment on any of the above please reach out!)

Note: some of this post is adapted from some older posts of mine, Understanding Cost Models and What is a Query Optimizer For?. If you are interested in this topic you might also enjoy those!