Justin Jaffray

blog notes

Storing Small Sets With Special Strings

23 Jun 2018

A common trick when representing sets of small integers is to use a bitmap, with the \(i\)-th bit set if \(i\) is in the set. For example, we can represent the set \(\{0,1,5,7\}\) as the integer 163, or \(10100011\) in binary. When we can get away with this it’s great - it’s compact, can be stack-allocated, and it makes operations like union, intersection, and difference very fast, since they’re just bitwise operations. For this to be possible, the size of your integers must be bounded (and likely quite small) if you want to avoid heap allocations.

One concern is that in comparison to something like a sorted list or tree structure, it’s not obvious how to iterate over such a set efficiently. The most obvious way is to iterate from 0 up to the largest supported value (31, if we’re using a 32-bit integer as our bitmap), and checking individually if each value is contained, but this can do a lot of unnecessary work in cases where we have a small number of integers at the higher end of the supported range, requiring looping through a lot of 0s to find our 1s. How can we quickly find the 1s in our sea of 0s?

The \(\text{ntz}\) (number of trailing zeroes) problem is this: given an integer (say a 32-bit integer), how long is the sequence of 0s at the end? For example:

\(x\) \(\text{ntz}(x)\)
1 0
101 0
100 2
1011000 3
0 32 (in a 32-bit integer)

Having access to \(\text{ntz}\) solves our problem of iteration: compute \(\text{ntz}\) of the bitmap, that’s the smallest element of the set. Then remove it from the set and repeat until our set is empty (equals 0).

There’s a lovely, useful algorithm to compute \(\text{ntz}\) more efficiently than a loop, which I love largely because it looks so absolutely out of left field upon first inspection.

Here’s the implementation from the Go standard library:

const deBruijn32 = 0x077CB531

var deBruijn32tab = [32]byte{
  0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
  31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9,
}

func TrailingZeros32(x uint32) int {
  if x == 0 {
    return 32
  }
  return int(deBruijn32tab[(x&-x)*deBruijn32>>(32-5)])
}

I think this is great because of the two magic constants which upon first glance look entirely unmotivated (even if you’ve encountered De Bruijn sequences before), and then the symbol-soup of an index into deBruijn32tab is so wacky. Thankfully, the justification for the algorithm is quite understandable, but it involves a handful of disparate ideas that happen to fit together perfectly, so we have to build them up independently.

First, let’s look at deBruijn32. The significance of 0x077CB531 is (very slightly) clearer if we look at it in binary: \[ 00000111011111001011010100110001 \] It’s the 5-digit binary De Bruijn sequence: every 5-digit binary string is a substring of deBruijn32’s binary representation. For instance, you can find \(11111\) in there, \(10010\), etc. This assumes we allow it to extend off the right side (with four zeroes, in effect looping around), which, as we’ll see, is justified for our purposes. Also important is that every 5-digit substring (including ones like \(10000\), which uses the imaginary zeroes off the right) is unique.

The key takeaway is this: taking the substring of length 5 starting at position \(i\) (counting from the left), for \(i\) in the range \([0, 31]\), gives a unique 5-digit bitstring (and hence, a unique number from 0-31). For instance, the substring starting at position 2 gives the sequence \(00011\), which represents the number 3: \[ 00[00011]1011111001011010100110001 \] We can figure out exactly what this mapping is by computing it by hand:

Index Substring Value
0 00000111011111001011010100110001 0
1 00000111011111001011010100110001 1
2 00000111011111001011010100110001 3
3 00000111011111001011010100110001 7
4 00000111011111001011010100110001 14
5 00000111011111001011010100110001 29
6 00000111011111001011010100110001 27
7 00000111011111001011010100110001 23
8 00000111011111001011010100110001 15
9 00000111011111001011010100110001 31
10 00000111011111001011010100110001 30
11 00000111011111001011010100110001 28
12 00000111011111001011010100110001 25
13 00000111011111001011010100110001 18
14 00000111011111001011010100110001 5
15 00000111011111001011010100110001 11
16 00000111011111001011010100110001 22
17 00000111011111001011010100110001 13
18 00000111011111001011010100110001 26
19 00000111011111001011010100110001 21
20 00000111011111001011010100110001 10
21 00000111011111001011010100110001 20
22 00000111011111001011010100110001 9
23 00000111011111001011010100110001 19
24 00000111011111001011010100110001 6
25 00000111011111001011010100110001 12
26 00000111011111001011010100110001 24
27 00000111011111001011010100110001 17
28 00000111011111001011010100110001|0 2
29 00000111011111001011010100110001|00 4
30 00000111011111001011010100110001|000 8
31 00000111011111001011010100110001|0000 16

The Index → Value mapping answers the question “if I extract this substring, what value do I get?” For our purposes, the opposite question will be more important: “If I extracted a substring and got this value, what’s the index of the substring I extracted?” To this end, we can just invert the above mapping (an inverse exists, since every value in the rightmost column appears exactly once). This is the same table as above, just reordered:

Value Index
0 0
1 1
2 28
3 2
4 29
5 14
6 24
7 3
8 30
9 22
10 20
11 15
12 25
13 17
14 4
15 8
16 31
17 27
18 13
19 23
20 21
21 19
22 16
23 7
24 26
25 12
26 18
27 6
28 11
29 5
30 10
31 9

This table is precisely the value of deBruijn32tab. So we can easily answer the question “If I extracted a substring and got this value, what’s the index of the substring I extracted?” We’ll get back to why this is useful in a minute.

Let’s look at \(x\text{\&}\mathord-x\). Recall that in a two’s complement integer, \(\mathord-x = \overline x+1\) (where \(\overline x\) is \(x\) with all the 0s replaced with 1s and vice-versa). Think about a string of trailing zeroes in \(x\). In \(\overline x\), that turns into a string of 1s. Adding 1 then cascades up that string of 1s, resulting in a 1 in the same position as the rightmost 1 in \(x\) and all 0s to the right of it. Every other bit will be the inverse of what it was in \(x\), and so \(x\text{\&}\mathord-x\) will only contain the rightmost 1 in \(x\). Thus, all \(x\text{\&}\mathord-x\) is doing is isolating that rightmost 1. Further, \(\text{ntz}(x) = \text{ntz}(x\text{\&}\mathord-x)\), but \(x\text{\&}\mathord-x\) has the additional benefit that it’s canonical: it has exactly one 1 (unless x is 0, which we handle as a special case). Here’s an example:

\[ \begin{aligned} x &= 110110110000000 \cr \overline x &= 001001001111111 \cr \overline x + 1 &= 001001010000000 \cr \mathord -x &= 001001010000000 \cr x \& \mathord -x &= 000000010000000 \end{aligned} \]

So \(x\text{\&}\mathord-x\) is a power of 2. In fact, it’s actually \(2^{\text{ntz}(x)}\).

Here’s another useful fact: multiplication by \(2^n\) is the same as a bit shift left by \(n\). So when we multiply deBruijn32 by \(x\text{\&}\mathord-x\) , we’re actually shifting it to the left by \(\text{ntz}(x)\). In the full expression

deBruijn32tab[(x&-x)*deBruijn32>>(32-5)]

we shift left by \(\text{ntz}(x)\) (truncating anything to the left) and then shift back right by 27 to realign the result. This has the end effect of extracting a 5-digit substring from deBruijn32. Here’s a complete example (where \(d = \) deBruijn32):

\[ \begin{aligned} x &= 11011010001000011000001001100000 \cr x\&-x &= 00000000000000000000000000100000 \cr d &= 00000111011111001011010100110001 \cr (x\&-x)*d &= 11101111100101101010011000100000 \cr (x\&-x)*d>>27 &= 00000000000000000000000000011101 = 29 \cr \end{aligned} \]

But as we saw already, deBruijn32tab maps those substrings to the index (shift) that produced them, thus, the end value of this expression is precisely \(\text{ntz}(x)\). If we look at position 29, we find the value 5, which is indeed \(\text{ntz}(x)\). Pretty cool.

I actually write code that uses this algorithm every day - in the SQL layer of CockroachDB, we use integers to refer to columns when we express transformations on expressions or mappings between operators. It’s often very important for us to be able to represent a set of columns efficiently, and ideally without performing a heap allocation. One of my favourite mini-libraries in CockroachDB is FastIntSet. For sets involving large integers it spills over to a map, but for sets involving only values smaller than 64, it represents them as a bitmap on a 64-bit integer. This is great since we have to perform unions/intersections/differences on these sets quite frequently. See ForEach for its use of \(\text{ntz}\) (here called bits.TrailingZeros64).

As a caveat to my example here, in Go, in some cases this function actually defers to processor-specific instructions which implement \(\text{ntz}\) as these tests demonstrate.

For more cute tricks like this, I recommend checking out the book Hacker’s Delight, which is where I learned of it.