hckrnws
inopinatus
11m
> "Use the correct intrinsics" ... the Rust library contains retrofit implementations that are used when the target CPU doesn’t support certain instructions. As you might guess, these retrofit implementations are not nearly as fast.
I noticed something similar with clang today, which emitted code 10x slower than gcc or icc for what looked like an instantly vectorizable masked add. It turned out that clang ships without __builtin_ia32_paddw256_mask, and we timed clang's unnecessarily synthetic _mm256_mask_add_epi16 at 30% slower on some targets. So we rolled our own (https://godbolt.org/z/84GT7eo14). I'd heard that the state of AVX-512 was sketchy, but I'm still surprised to see this much trouble with such an apparently straightforward use.
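For readers unfamiliar with the operation under discussion, here is a minimal scalar sketch of what a merge-masked 16-bit add over a 256-bit vector computes: each of the 16 lanes takes a + b (wrapping) when its mask bit is set, and otherwise keeps the value from src. This is not the code behind the godbolt link; the function name mask_add_epi16_ref and the array-based signature are hypothetical, chosen only to mirror the argument order of _mm256_mask_add_epi16.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 16 }; /* 256 bits divided into 16-bit lanes */

/* Hypothetical scalar reference for a merge-masked 16-bit add.
 * For each lane i: if bit i of k is set, dst[i] = a[i] + b[i]
 * (wrapping modulo 2^16); otherwise dst[i] = src[i]. */
static void mask_add_epi16_ref(uint16_t dst[LANES],
                               const uint16_t src[LANES],
                               uint16_t k,
                               const uint16_t a[LANES],
                               const uint16_t b[LANES])
{
    assert(dst && src && a && b);
    for (int i = 0; i < LANES; i++) {
        /* Operands promote to int; the cast back truncates to 16 bits. */
        uint16_t sum = (uint16_t)(a[i] + b[i]);
        dst[i] = ((k >> i) & 1u) ? sum : src[i];
    }
}
```

A loop like this is useful as a correctness baseline when comparing a hand-rolled vector version against whatever the compiler emits for the intrinsic.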