Fast Permutation Compression

All code variations can be found at https://github.com/egonelbre/exp/blob/master/permutation/code.go.

Given a permutation what is the best way to convert it into a smaller data-structure?

Let’s say you have a permutation of 12 items. A trivial approach would be to store each item in a separate byte. This of course works, but can we do better?

1

type Permutation [12]byte

It’s easy to see that the maximum number we need to store is 12, but the byte can store 256 entries. By fitting it into 4 bits, it would be possible to store it in 12*4 = 48 bits.

1
2
3
4
5
6
7


func PackNybble(perm Permutation) uint64 {
    r := uint64(0)
    for _, v := range perm {
        r = r<<4 + uint64(v)
    }
    return r
}

We can actually do better, by encoding using base 12 directly, instead of rounding it to 16.

1
2
3
4
5
6
7


func PackBase12(perm Permutation) uint64 {
    r := uint64(0)
    for _, v := range perm {
        r = r*12 + uint64(v)
    }
    return r
}

This will end up using log2(12^12) = 43.0195 bits . We know that there are exactly 12! permutations. This would mean that in the perfect world, we could pack all of this into log2(12!)=28.835bits.

Some basic permutation theory gives us that we can assign a unique index to each permutation and convert it to and back. One of such numbering is Lehmer code.

I found one description of the algorithm for converting a permutation to code from here https://www.researchgate.net/figure/The-Lehmer-code-A-complete-translation-from-permutation-to-decimal-by-way-of-the_fig1_230831447

We can imagine this as reducing the problem at each stage to a smaller problem. At first there are 12! possibilities for using the index. We know that for permutation of size 11, there are 11! of them. This means when we partition the permutations into 12 equal chunks, based on some criteria, figure out which chunk it belongs to and then multiply that by 11! then we get our first reduction. Each next step won’t spill over to the next partition, because they are smaller than 11!.

The implementation required a little tinkering, but wasn’t anything too complicated.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


func Code(perm [12]byte) uint64 {
    r := uint64(0)
    for min := byte(0); min < 11; min++ {
        z := 0
        n := base - min
        for i, v := range perm[:n] {
            if v == min {
                z = i
                break
            }
        }
        copy(perm[z:], perm[z+1:n])
        r += uint64(z) * fact(11 - min)
    }
    return r
}

It really gives the smallest encoding possible, but compared to the just packing we have worse performance. I implemented few variations, to see whether it would be possible to improve upon it. However the first variant was pretty good already.

1
2
3
4
5
6


BenchmarkCopy                         199 ns/op
BenchmarkCount                        312 ns/op
BenchmarkTable                        284 ns/op
BenchmarkTable2                       524 ns/op
BenchmarkPackNybble                   17.3 ns/op
BenchmarkPackNybbleUnroll2            4.94 ns/op

So, it’s ~40 times slower than just packing. Can we do better?

Notice that we are able to pack the whole permutation into a uint64 with very low cost.

The core of the loop is the following:

1
2
3


z := find the index of min in the permutation
remove z from the permutation
r += z * min!

Bit-parallel search is a good way to find that index. Bit-parallel is easier to show in an example, let’s take a permutation [1,2,3,0] and lets search for number 2.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


decimal               1    2    3    0
                      ---- ---- ---- ----
packed                0001 0010 0011 0000
mask                  1101 1101 1101 1101

f1 := packed ^ mask   1101 1111 1110 1101
f2 := f1 >> 2           11 0111 1111 1111
f3 := f1 & f2         0001 0111 1110 1101
f4 := f3 >> 1          000 1011 1111 0110
f5 := f4 & f3         0000 0001 1110 0100

f5 := f4 & f3         0000 0001 1110 0100
onemask                  1    1    1    1
s := f5 & onemask     0000 0001 0000 0000
                                .... ....

At step 03 we see the bit representation of that packed sequence. First we need a mask to detect number 2 at all positions at the same time. We can use the complement of number 2 and use xor for that. This leaves us with a result where only the matching nybble contains all ones. This can be seen as the result f1.

Now we need to find the index, where all the numbers are one. For this we can and them all together using steps 7 to 10. Finally we need to remove all the other bits that aren’t relevant with Step 14.

Our result is the number of zeros after the number. Note that this result still needs to be divided by 4 to get the final result. Once we translate this idea into code, it looks roughly like this:

1
2
3
4
5
6


mask := masks[min]
filtered := perm ^ mask
filtered &= filtered >> 2
filtered &= filtered >> 1
filtered &= onemask
z4 := byte(bits.TrailingZeros64(filtered))

We can now need a way to remove that element from the sequence:

1
2
3


upper := (perm >> (z4 + 4)) << z4
lower := perm &^ (^uint64(0) << z4)
perm = upper | lower

Here we construct the upper part, by cutting off the bottom part. And then get the lower part by masking out all the upper parts. Then we combine them back together.

Benchmarking the code gives us a speed up from ~200ns to ~120ns.

1
2


BenchmarkCopy            199 ns/op
BenchmarkCopyBit         123 ns/op

The final (uncensored) code was this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19


func CodeCopyBit(perm [base]byte) int {
    return codebit(uint64(PackNybbleUnroll(perm)))
}

func codebit(perm uint64) int {
    r := 0
    for min, mask := range masks {
        filtered := perm ^ mask
        filtered &= filtered >> 2
        filtered &= filtered >> 1
        filtered &= maskstarts
        z := byte(bits.TrailingZeros64(filtered))
        upper := (perm >> (z + 4)) << z
        lower := perm &^ (fullmask << z)
        perm = upper | lower
        r += int(z/4) * reverse_factorial[min]
    }
    return r
}

When there aren’t many elements then using bit-parallel approaches can get rid of branches and reduce cycles. Of course, when we have much larger sequences, then this approach becomes more cumbersome to write and might also end up slower.

This probably can be made even faster by unrolling the whole loop. However, it’s sufficient for this time.

PS: When there are very few elements then this whole thing can be replaced by a lookup table.

Robert Clausecker suggested even better approaches in Reddit. His descriptions in StackOverflow are quite nice. His benchmarks are here https://github.com/fuzxxl/permcode.

1
2
3
4


BenchmarkCopy          205.0 ns/op
BenchmarkCopyBit       126.0 ns/op
BenchmarkShuffle        69.2 ns/op
BenchmarkShuffleUnroll  39.2 ns/op

His best approach is doing a Fisher-Yates shuffle in reverse, which is brilliant. We can think of permuting an array [0, 1, 2, 3 .. 11] as picking random number and eliminating them. By finding the “random” numbers used to generate that we get a sequence, where each “maximum random” is smaller than the previous. So maximally we can only get a sequence [11, 10, 9, 8 .. 0]. We can convert that into a number using factoriadics (each element multiplied by the factorial).