Compress-a-Palooza: Unpacking 5B Varints in Only 4B CPU Cycles

by g0xA52A2A

inopinatus

> "Use the correct intrinsics" ... the Rust library contains retrofit implementations that are used when the target CPU doesn’t support certain instructions. As you might guess, these retrofit implementations are not nearly as fast.

I noticed something similar with clang today, which emitted code 10x slower than gcc or icc for what looked like an instantly vectorizable masked add. It turned out that clang ships without __builtin_ia32_paddw256_mask, and we timed clang's unnecessarily synthetic _mm256_mask_add_epi16 at 30% slower on some targets. So we rolled our own (https://godbolt.org/z/84GT7eo14). I'd heard that the state of AVX-512 was sketchy, but I was still surprised to see this much trouble with an apparently straightforward usage.
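For readers unfamiliar with the operation, here is a minimal scalar sketch of what a 16-bit masked add over a 256-bit register computes, lane by lane. It is not the code behind the godbolt link above. The function name `masked_add_epi16_ref` and the `LANES` constant are made up for illustration, and the semantics assumed are those documented for `_mm256_mask_add_epi16`: lanes whose mask bit is clear are copied from `src`.

```c
#include <stdint.h>

/* A 256-bit register holds sixteen 16-bit lanes. */
#define LANES 16

/* Hypothetical scalar reference for a masked 16-bit add.
 * For each lane i: if bit i of k is set, dst[i] = a[i] + b[i]
 * (wrapping modulo 2^16); otherwise dst[i] = src[i]. */
void masked_add_epi16_ref(uint16_t dst[LANES],
                          const uint16_t src[LANES],
                          uint16_t k,
                          const uint16_t a[LANES],
                          const uint16_t b[LANES])
{
    for (int i = 0; i < LANES; i++) {
        /* The cast makes the 16-bit wraparound explicit. */
        uint16_t sum = (uint16_t)(a[i] + b[i]);
        dst[i] = ((k >> i) & 1u) ? sum : src[i];
    }
}
```

A loop like this is the kind of thing one would expect a compiler to turn into a single masked vector add when AVX-512BW and AVX-512VL are enabled. Whether a given compiler actually does so is exactly what the comment above is questioning.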

Lk7Of3vfJS2n

[dead]
