Using SIMD acceleration in rust to create the world’s fastest tac

Featuring dynamically-enabled AVX2-accelerated line-ending search where supported

NeoSmart Technologies’ open source (MIT-licensed), cross-platform tac (written in rust) has been updated to version 2.0 (GitHub link). This is a major release with enormous performance improvements under the hood, primarily featuring handwritten SIMD acceleration of the core logic that allows it to process input up to three times faster than its GNU Coreutils counterpart. (For those who aren’t familiar with it, tac is a command-line utility to reverse the contents of a file.)

This release has been a long time in the making. I had the initial idea of utilizing vectorized instructions to speed up the core line ending search logic during the initial development of tac in 2017, but was put off by the complete lack of stabilized SIMD support in rust at the time. In 2019, after attempting to process a few log files – each of which was over 100 GiB – I decided to revisit the subject and implemented a proof-of-concept shortly thereafter… and that’s when things stalled for a number of reasons.

In all honesty, the biggest reason a SIMD release was put on hold was that I had what I needed: my local copy of tac that used AVX2 acceleration worked just fine to help speed up my processing of reversed server logs. Having completed the challenging part of the puzzle (vectorizing every possible operation in the core logic), I was not particularly interested in the remaining 20% that needed to be done – and the state of rust’s support for intrinsics was considerably lacking at the time.

Apart from requiring compiler intrinsics that were gated behind the rust nightly toolchain, I wasn’t particularly sure how to tackle “graceful degradation” when executing on AVX2-incapable hardware. As mentioned earlier, the vectorization of tac was entirely done by hand – as in, by explicitly using the AVX2 assembly instructions to carry out vectorized operations, rather than by writing vectorization-friendly high-level rust code and then praying that the compiler would correctly vectorize the logic. Three years ago, the state of rust’s support for dynamic CPU instruction support detection was very different from what it is today, and I wasn’t looking forward to releasing a tac 2.0 that required a nightly build (aka “bleeding edge” or “unstable”) of the rust compiler. For better or for worse, the SIMD-accelerated tac builds lay dormant (in a branch of the original code, no less) on GitHub until a couple of months ago when a number of other rust rewrites of common greybeard utilities were making the rounds and I was inspired to revisit the codebase.

Fortunately, in 2021 it was considerably easier to dynamically enable AVX2 support in tac, first by taking advantage of the is_x86_feature_detected!() macro (available since mid-2018) to detect early on whether or not AVX2 intrinsics are supported at runtime, and then converting a few manual invocations of nightly-only intrinsics to use SIMD-enabled equivalents exposed by the standard library instead. It helped that I had split off the core search logic into two blackbox functions, one with the original naive search logic, and the other with the AVX2-accelerated search logic, so all I needed to do was call the right function. What remained at that point was to verify that the generated dynamic-detection binary actually performed as I expected it to.
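For reference, the naive half of that split is conceptually just a backwards scan of the buffer for line endings. The following is a simplified, illustrative sketch of that idea – not the actual tac source, which operates on raw buffers and streams its output:

```rust
/// Naive core of a `tac`-style reversal: walk the byte buffer backwards,
/// emitting each line (including its trailing newline) in reverse order.
/// This is an illustrative sketch, not the actual tac implementation.
fn reverse_lines(input: &[u8], out: &mut Vec<u8>) {
    let mut end = input.len(); // exclusive end of the current line
    for i in (0..input.len()).rev() {
        if input[i] == b'\n' && i + 1 < end {
            // Everything in (i, end) is one complete line.
            out.extend_from_slice(&input[i + 1..end]);
            end = i + 1;
        }
    }
    // Whatever remains at the front of the buffer is the last line to emit.
    out.extend_from_slice(&input[..end]);
}

fn main() {
    let mut out = Vec::new();
    reverse_lines(b"one\ntwo\nthree\n", &mut out);
    // Lines come out in reverse order: "three\ntwo\none\n"
    assert_eq!(out, b"three\ntwo\none\n");
    print!("{}", String::from_utf8(out).unwrap());
}
```

The AVX2 codepath replaces the byte-at-a-time newline search with 32-bytes-at-a-time comparisons, but produces the same output.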

Until this point, I’d been compiling my local tac builds with my local environment defaults, in particular with the RUSTFLAGS environment variable set to -C target-cpu=native which would always result in the rustc compiler’s LLVM backend generating AVX2-compatible instructions. In addition to inspecting the generated assembly to ensure the expected AVX2 instructions were present in the AVX2 codepath (and temporarily adding a panic!("Wrong codepath!") to the non-AVX2 branch), I compiled a copy with target-cpu=native and set it aside, then compiled another copy with RUSTFLAGS unset, and proceeded to benchmark the two on some large (~40 GiB) payloads. I also benchmarked against my earlier proof-of-concept code that didn’t use dynamic detection at all.1

Unfortunately, I found that although the AVX2 codepath was being correctly taken, the “dynamically accelerated” version of tac compiled without target-cpu=native consistently benchmarked up to 15% slower than the same codebase with rust’s equivalent of -march=native enabled (which was indistinguishable in benchmarks from the “always-AVX2” proof-of-concept). Given the non-negligible magnitude of the slowdown, it was fairly obvious that any initial analysis of the problem should focus on the core processing loop rather than any of the startup/cleanup code, so with the code region isolated the only thing left to do was to diff the generated assembly for the core logic between the two binaries and see where they differed.

While rustc makes it easy to generate the assembly (just set RUSTFLAGS to --emit asm -Cllvm-args=--x86-asm-syntax=intel), I like to use godbolt.org for a quicker feedback loop, primarily because the highlighting makes it immediately obvious which instructions pertain to which functions and vice versa – although it makes diffing the results a bit harder.2 It was a bit hard to spot at first, but the line-by-line comparison ultimately made it possible to track down the problem.

The way dynamically enabling SIMD-accelerated codepaths works in rust is a little “manual” in the sense that the detection of the supported CPU instructions is quite separate from emitting the differently-compiled code in question; i.e. the following doesn’t actually accomplish anything (and you wouldn’t intuitively expect it to):

fn foo() {
    // Perform some operation here
}

fn foo_simd() {
    // Perform the same operation here
}

pub fn main() {
    if is_x86_feature_detected!("avx2") {
        foo_simd()
    } else {
        foo()
    }
}

But what this lets you do is apply different compilation options to one branch of code, and then dynamically select/execute it. So instead of compiling the entire binary with -C target-feature=+avx2 or not at all, we can invoke cargo or rustc with no extra target features specified, but then instruct the compiler to enable additional features when compiling a particular function (but note that if it calls out to any other functions, those calls won’t be duplicated and recompiled with the optimizations enabled):

fn foo() {
    // Perform some operation here
}

#[target_feature(enable = "avx2")]
unsafe fn foo_simd() {
    // Perform the same operation here
}

pub fn main() {
    if is_x86_feature_detected!("avx2") {
        unsafe { foo_simd() }
    } else {
        foo()
    }
}

Note that we had to specify the avx2 feature bit twice and that there’s no “type checking” between the two – we could have checked for is_x86_feature_detected!("foo") and then called a function decorated with #[target_feature(enable = "bar")]. Since the compiler can’t verify that the detected feature actually matches the enabled one, rust forces us to mark the target function unsafe regardless of whether or not it actually uses any unsafe code. (This is a problem, since it can mask inadvertent usage of unsafe code.) Anyway, this is all just background to what the problem actually turned out to be, which is in turn a confluence of two issues:

  1. rust/llvm does not correctly recognize “supersets” of instructions,
  2. rust/llvm converts direct calls to intrinsics not supported by the specified target_feature(...) to non-inlined functions that end up containing just the requisite asm call itself, adding a massive amount of function call overhead.3
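As an aside, the missing cross-checking described above is easy to demonstrate – the following compiles without complaint even though the detected and enabled features have nothing to do with one another (a contrived sketch with placeholder function names, assuming an x86_64 target):

```rust
fn foo() -> u32 {
    42
}

// Nothing ties "bmi2" here to the "sse4.2" check in main() below –
// the compiler accepts any combination without complaint.
#[target_feature(enable = "bmi2")]
unsafe fn foo_simd() -> u32 {
    42
}

pub fn main() {
    // Detecting one feature while enabling a completely different one
    // compiles just fine; any mismatch only surfaces at runtime.
    let result = if is_x86_feature_detected!("sse4.2") {
        unsafe { foo_simd() }
    } else {
        foo()
    };
    assert_eq!(result, 42);
    println!("{result}");
}
```
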

In this case, the AVX2 search function was decorated with #[target_feature(enable = "avx2")] but in addition to its direct usage of AVX2 functions and types, it also made a call to the BZHI and LZCNT intrinsics/asm instructions – which rustc/llvm do not recognize as being supported via the avx2 feature! So although (to the best of this developer’s knowledge) there does not exist a processor on the face of this planet that supports AVX2 but doesn’t support BZHI and LZCNT, attempting to use those two in a function that only enabled support for avx2 resulted in codegen that turned something like this:

#[target_feature(enable = "avx2")]
pub unsafe fn bzhi(num: u32) -> u32 {
    core::arch::x86_64::_bzhi_u32(num, 31)
}

into something like this (godbolt link):

core::core_arch::x86::bmi2::_bzhi_u32:
        sub     rsp, 4
        bzhi    eax, edi, esi
        mov     dword ptr [rsp], eax
        mov     eax, dword ptr [rsp]
        add     rsp, 4
        ret

example::bzhi:
        push    rax
        mov     esi, 31
        call    core::core_arch::x86::bmi2::_bzhi_u32
        mov     dword ptr [rsp + 4], eax
        mov     eax, dword ptr [rsp + 4]
        pop     rcx
        ret

…which is obviously not what we want.

Technically, BZHI (zero the high bits starting from the specified bit) is not part of the avx2 instruction set but rather part of the bmi2 (bit manipulation instructions v2) instruction set. However, AMD introduced support for the BMI2 instructions at the same time as they first introduced support for AVX2 (as part of the Excavator microarchitecture, in 2015). Intel likewise introduced BMI2 support (along with BMI1 support, as a matter of fact) as part of the Haswell microarchitecture in 2013, also at the same time they debuted support for AVX2. No AVX2 CPU has shipped from either company without BMI2 since then, and it’s pretty unfathomable that any would in the future.

Likewise, LZCNT (count the number of leading zero bits) is not officially part of the AVX2 instruction set but rather part of the bmi1 instruction set – which, as we saw earlier, was supported by AMD prior to their debuting AVX2 support but only supported by Intel starting with Haswell, alongside avx2 and bmi2 support. In other words, you can view avx2 support as a strict superset of support for both bmi1 and bmi2 – though not the other way around, as there could be x86 CPUs out there with bmi2 but no avx2. Long story short, rust’s codegen (here provided by llvm) would certainly be within its rights to expand avx2 to avx2,bmi1,bmi2 (and then some).

Confusingly, while the poor codegen for lzcnt expands the intrinsic inline to the fallback logic used on platforms that don’t support the bmi1 instruction set:

#![feature(core_intrinsics)]

#[target_feature(enable = "avx2")]
pub unsafe fn ctlz(num: i32) -> i32 {
    core::intrinsics::ctlz(num)
}

example::ctlz:
        push    rax
        bsr     eax, edi
        mov     ecx, 63
        cmove   eax, ecx
        xor     eax, 31
        mov     dword ptr [rsp + 4], eax
        mov     eax, dword ptr [rsp + 4]
        mov     dword ptr [rsp], eax
        mov     eax, dword ptr [rsp]
        pop     rcx
        ret

… the bzhi intrinsic, as you can see in the first assembly output from earlier, instead generates a non-inlined call to a new function that does nothing but execute bzhi directly, meaning you gain virtually nothing but function call overhead. Presumably there is a reason for this (unless it’s just bad codegen); my best guess is that it’s an attempt to guard against microarchitectures that would generate a #UD exception if the instruction is merely seen, even if never executed? That seems a bit unlikely, given that we needed to implement W^X as its own thing, but I could be wrong.

In all cases, decorating the AVX2-enabled method with

#[target_feature(enable = "avx2")]
#[target_feature(enable = "bmi1")]
#[target_feature(enable = "bmi2")]
unsafe fn search_avx2() {
    // …
}

worked, and the dynamically enabled AVX2 acceleration of the core tac logic benchmarked identical to the other two builds (which is to say, extremely fast).
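Put together, the shape of the fixed-up kernel looks something like the following – a simplified, illustrative AVX2 newline scan assuming an x86_64 target, not the actual tac source:

```rust
use core::arch::x86_64::*;

/// Count newlines in `haystack` 32 bytes at a time; a scalar loop handles
/// whatever doesn't fit in a full 32-byte block. Illustrative sketch only.
#[target_feature(enable = "avx2")]
#[target_feature(enable = "bmi1")]
#[target_feature(enable = "bmi2")]
unsafe fn count_newlines_avx2(haystack: &[u8]) -> usize {
    let needle = _mm256_set1_epi8(b'\n' as i8);
    let mut count = 0usize;
    let mut chunks = haystack.chunks_exact(32);
    for chunk in &mut chunks {
        // Compare all 32 bytes against '\n' at once, then collapse the
        // per-byte results into a 32-bit mask (one bit per byte).
        let block = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        let matches = _mm256_cmpeq_epi8(block, needle);
        let mask = _mm256_movemask_epi8(matches) as u32;
        count += mask.count_ones() as usize;
    }
    count + chunks.remainder().iter().filter(|&&b| b == b'\n').count()
}

fn count_newlines(haystack: &[u8]) -> usize {
    if is_x86_feature_detected!("avx2") {
        unsafe { count_newlines_avx2(haystack) }
    } else {
        haystack.iter().filter(|&&b| b == b'\n').count()
    }
}

fn main() {
    let text = b"one\ntwo\nthree\n".repeat(10);
    assert_eq!(count_newlines(&text), 30);
    println!("{}", count_newlines(&text));
}
```

(The stacked attributes can also be written as a single comma-separated list, i.e. #[target_feature(enable = "avx2,bmi1,bmi2")].)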

What’s the impact of these changes? Our tac 2.0 release is up to 3x faster than the GNU Coreutils version of tac when benchmarked with speculative execution mitigations disabled. With the mitigations enabled, our tac pulls even further ahead, as we use memory-mapped files (with specifically mmap-friendly access patterns) to avoid repeatedly issuing syscalls to refill the buffers – syscalls being considerably slower with mitigations enabled.

You can explore the source code yourself to see how this all comes together or inspect the SIMD instructions used to speed up the search. You can grab pre-compiled binaries for the major platforms from the releases page, and I’m working on getting tac into the FreeBSD ports repo and into either choco or scoop for Windows users. As with all our other open source projects, help getting the binaries into platform/distribution-specific package managers is always welcome.

Join the discussion on Hacker News or on reddit’s /r/rust. You can follow me on twitter @mqudsi and/or subscribe below to receive emails when I post specifically about rust (and never anything else).



  1. I used hyperfine for this – excellent tool to keep in your arsenal if you are into writing high-performance code. 

  2. I recommend opening two side-by-side browser windows and copying-and-pasting the assembly into the terminal to diff -U the two. 

  3. Especially if combined with llvm inefficiencies when calling avx2 functions, although I didn’t run into that. 
