Storing Small Sets with Special Strings
23 Jun 2018A common trick when representing sets of small integers is to use a bitmap, with the \(i\)-th bit set if \(i\) is in the set. For example, we can represent the set \(\{0,1,5,7\}\) as the integer 163, or \(10100011\) in binary. When we can get away with this it’s great - it’s compact, can be stack-allocated, and it makes operations like union, intersection, and difference very fast, since they’re just bitwise operations. For this to be possible, the size of your integers must be bounded (and likely quite small) if you want to avoid heap allocations.
One concern is that in comparison to something like a sorted list or tree structure, it’s not obvious how to iterate over such a set efficiently. The most obvious way is to iterate from 0 up to the largest supported value (31, if we’re using a 32-bit integer as our bitmap), and checking individually if each value is contained, but this can do a lot of unnecessary work in cases where we have a small number of integers at the higher end of the supported range, requiring looping through a lot of 0s to find our 1s. How can we quickly find the 1s in our sea of 0s?
The \(\text{ntz}\) (number of trailing zeroes) problem is this: given an integer (say a 32-bit integer), how long is the sequence of 0s at the end? For example:
\(x\) | \(\text{ntz}(x)\) |
---|---|
1 | 0 |
101 | 0 |
100 | 2 |
1011000 | 3 |
0 | 32 (in a 32-bit integer) |
Having access to \(\text{ntz}\) solves our problem of iteration: compute \(\text{ntz}\) of the bitmap, that’s the smallest element of the set. Then remove it from the set and repeat until our set is empty (equals 0).
There’s a lovely, useful algorithm to compute \(\text{ntz}\) more efficiently than a loop, which I love largely because it looks so absolutely out of left field upon first inspection.
Here’s the implementation from the Go standard library:
const deBruijn32 = 0x077CB531
var deBruijn32tab = [32]byte{
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9,
}
func TrailingZeros32(x uint32) int {
if x == 0 {
return 32
}
return int(deBruijn32tab[(x&-x)*deBruijn32>>(32-5)])
}
I think this is great because of the two magic constants which upon first glance look entirely unmotivated (even if you’ve encountered De Bruijn sequences before), and then the symbol-soup of an index into deBruijn32tab
is so wacky. Thankfully, the justification for the algorithm is quite understandable, but it involves a handful of disparate ideas that happen to fit together perfectly, so we have to build them up independently.
First, let’s look at deBruijn32
. The significance of 0x077CB531
is (very slightly) clearer if we look at it in binary:
\[
00000111011111001011010100110001
\]
It’s the 5-digit binary De Bruijn sequence: every 5-digit binary string is a
substring of deBruijn32
’s binary representation. For instance, you can find
\(11111\) in there, \(10010\), etc. This assumes we allow it to extend off the
right side (with four zeroes, in effect looping around), which, as we’ll see,
is justified for our purposes. Also important is that every 5-digit substring
(including ones like \(10000\), which uses the imaginary zeroes off the right) is
unique.
The key takeaway is this: taking the substring of length 5 starting at position \(i\) (counting from the left), for \(i\) in the range \([0, 31]\), gives a unique 5-digit bitstring (and hence, a unique number from 0-31). For instance, the substring starting at position 2 gives the sequence \(00011\), which represents the number 3: \[ 00[00011]1011111001011010100110001 \] We can figure out exactly what this mapping is by computing it by hand:
Index | Substring | Value |
---|---|---|
0 | 00000111011111001011010100110001 |
0 |
1 | 00000111011111001011010100110001 |
1 |
2 | 00000111011111001011010100110001 |
3 |
3 | 00000111011111001011010100110001 |
7 |
4 | 00000111011111001011010100110001 |
14 |
5 | 00000111011111001011010100110001 |
29 |
6 | 00000111011111001011010100110001 |
27 |
7 | 00000111011111001011010100110001 |
23 |
8 | 00000111011111001011010100110001 |
15 |
9 | 00000111011111001011010100110001 |
31 |
10 | 00000111011111001011010100110001 |
30 |
11 | 00000111011111001011010100110001 |
28 |
12 | 00000111011111001011010100110001 |
25 |
13 | 00000111011111001011010100110001 |
18 |
14 | 00000111011111001011010100110001 |
5 |
15 | 00000111011111001011010100110001 |
11 |
16 | 00000111011111001011010100110001 |
22 |
17 | 00000111011111001011010100110001 |
13 |
18 | 00000111011111001011010100110001 |
26 |
19 | 00000111011111001011010100110001 |
21 |
20 | 00000111011111001011010100110001 |
10 |
21 | 00000111011111001011010100110001 |
20 |
22 | 00000111011111001011010100110001 |
9 |
23 | 00000111011111001011010100110001 |
19 |
24 | 00000111011111001011010100110001 |
6 |
25 | 00000111011111001011010100110001 |
12 |
26 | 00000111011111001011010100110001 |
24 |
27 | 00000111011111001011010100110001 |
17 |
28 | 00000111011111001011010100110001|0 |
2 |
29 | 00000111011111001011010100110001|00 |
4 |
30 | 00000111011111001011010100110001|000 |
8 |
31 | 00000111011111001011010100110001|0000 |
16 |
The Index → Value mapping answers the question “if I extract this substring, what value do I get?” For our purposes, the opposite question will be more important: “If I extracted a substring and got this value, what’s the index of the substring I extracted?” To this end, we can just invert the above mapping (an inverse exists, since every value in the rightmost column appears exactly once). This is the same table as above, just reordered:
Value | Index |
---|---|
0 | 0 |
1 | 1 |
2 | 28 |
3 | 2 |
4 | 29 |
5 | 14 |
6 | 24 |
7 | 3 |
8 | 30 |
9 | 22 |
10 | 20 |
11 | 15 |
12 | 25 |
13 | 17 |
14 | 4 |
15 | 8 |
16 | 31 |
17 | 27 |
18 | 13 |
19 | 23 |
20 | 21 |
21 | 19 |
22 | 16 |
23 | 7 |
24 | 26 |
25 | 12 |
26 | 18 |
27 | 6 |
28 | 11 |
29 | 5 |
30 | 10 |
31 | 9 |
This table is precisely the value of deBruijn32tab
. So we can easily answer the question “If I extracted a substring and got this value, what’s the index of the substring I extracted?” We’ll get back to why this is useful in a minute.
Let’s look at
\(x\text{\&}\mathord-x\).
Recall that in a two’s complement integer, \(\mathord-x = \overline x+1\) (where \(\overline x\) is \(x\) with all the 0s replaced with 1s and vice-versa).
Think about a string of trailing zeroes in \(x\). In \(\overline x\), that turns into a
string of 1s. Adding 1 then cascades up that string of 1s, resulting in a
1 in the same position as the rightmost 1 in \(x\) and all 0s to the right
of it. Every other bit will be the inverse of what it was in \(x\), and so
\(x\text{\&}\mathord-x\)
will only contain the rightmost 1 in \(x\). Thus, all
\(x\text{\&}\mathord-x\)
is doing is
isolating that rightmost 1. Further,
\(\text{ntz}(x) = \text{ntz}(x\text{\&}\mathord-x)\),
but
\(x\text{\&}\mathord-x\)
has the
additional benefit that it’s canonical: it has exactly one 1 (unless x
is 0,
which we handle as a special case). Here’s an example:
\[ \begin{aligned} x &= 110110110000000 \cr \overline x &= 001001001111111 \cr \overline x + 1 &= 001001010000000 \cr \mathord -x &= 001001010000000 \cr x \& \mathord -x &= 000000010000000 \end{aligned} \]
So \(x\text{\&}\mathord-x\) is a power of 2. In fact, it’s actually \(2^{\text{ntz}(x)}\).
Here’s another useful fact: multiplication by \(2^n\) is the same as a bit shift left by \(n\). So when we multiply deBruijn32
by
\(x\text{\&}\mathord-x\)
, we’re actually shifting it to the left by \(\text{ntz}(x)\). In the full expression
deBruijn32tab[(x&-x)*deBruijn32>>(32-5)]
deBruijn32
. Here’s a complete example (where \(d = \) deBruijn32
):
\[ \begin{aligned} x &= 11011010001000011000001001100000 \cr x\&-x &= 00000000000000000000000000100000 \cr d &= 00000111011111001011010100110001 \cr (x\&-x)*d &= 11101111100101101010011000100000 \cr (x\&-x)*d>>27 &= 00000000000000000000000000011101 = 29 \cr \end{aligned} \]
But as we saw already, deBruijn32tab
maps those substrings to the index (shift) that produced them, thus, the end value of this expression is precisely \(\text{ntz}(x)\). If we look at position 29, we find the value 5, which is indeed \(\text{ntz}(x)\). Pretty cool.
I actually write code that uses this algorithm every day - in the SQL layer of CockroachDB, we use integers to refer to columns when we express transformations on expressions or mappings between operators. It’s often very important for us to be able to represent a set of columns efficiently, and ideally without performing a heap allocation. One of my favourite mini-libraries in CockroachDB is FastIntSet. For sets involving large integers it spills over to a map, but for sets involving only values smaller than 64, it represents them as a bitmap on a 64-bit integer. This is great since we have to perform unions/intersections/differences on these sets quite frequently. See ForEach for its use of \(\text{ntz}\) (here called bits.TrailingZeros64
).
As a caveat to my example here, in Go, in some cases this function actually defers to processor-specific instructions which implement \(\text{ntz}\) as these tests demonstrate.
For more cute tricks like this, I recommend checking out the book Hacker’s Delight, which is where I learned of it.