{"id":4801,"date":"2021-08-30T13:57:07","date_gmt":"2021-08-30T18:57:07","guid":{"rendered":"https:\/\/neosmart.net\/blog\/?p=4801"},"modified":"2024-02-09T20:06:35","modified_gmt":"2024-02-10T02:06:35","slug":"using-simd-acceleration-in-rust-to-create-the-worlds-fastest-tac","status":"publish","type":"post","link":"https:\/\/neosmart.net\/blog\/using-simd-acceleration-in-rust-to-create-the-worlds-fastest-tac\/","title":{"rendered":"Using SIMD acceleration in rust to create the world&#8217;s fastest <code>tac<\/code>"},"content":{"rendered":"<p><a href=\"https:\/\/neosmart.net\/blog\/wp-content\/uploads\/tac.png\" rel=\"follow\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-4148 alignright colorbox-4801\" src=\"https:\/\/neosmart.net\/blog\/wp-content\/uploads\/tac.png\" alt=\"\" width=\"134\" height=\"134\" srcset=\"https:\/\/neosmart.net\/blog\/wp-content\/uploads\/tac.png 512w, https:\/\/neosmart.net\/blog\/wp-content\/uploads\/tac-150x150.png 150w, https:\/\/neosmart.net\/blog\/wp-content\/uploads\/tac-300x300.png 300w\" sizes=\"auto, (max-width: 134px) 100vw, 134px\" \/><\/a>NeoSmart Technologies&#8217; open source (MIT-licensed), cross-platform <code>tac<\/code> (written in rust) has been updated to version 2.0 (<a href=\"https:\/\/github.com\/neosmart\/tac\/\" rel=\"nofollow\">GitHub link<\/a>). This is a major release with enormous performance improvements under-the-hood, primarily featuring handwritten SIMD acceleration of the core logic that allows it to process input up to three times faster than its GNU Coreutils counterpart. (For those that aren&#8217;t familiar with it, <code>tac<\/code> is a command-line utility to reverse the contents of a file.)<\/p>\n<p>This release has been a long time in the making. 
I first had the idea of utilizing vectorized instructions to speed up the core line ending search logic during the initial development of <code>tac<\/code> in 2017, but was put off by the complete lack of stabilized SIMD support in rust at the time. In 2019, after attempting to process a few log files &#8211; each of which was over 100 GiB &#8211; I decided to revisit the subject and implemented a proof-of-concept shortly thereafter&#8230; and that&#8217;s when things stalled for a number of reasons.<\/p>\n<p><!--more--><\/p>\n<p>In perfect honesty, the biggest reason a SIMD release was put on hold was that I had what I needed: my local copy of <code>tac<\/code> that used AVX2 acceleration worked just fine to help speed up my processing of reversed server logs, and having completed the challenging part of the puzzle (trying to vectorize every possible operation in the core logic), I was not particularly interested in the remaining 20% that needed to be done &#8211; and the state of rust&#8217;s support for intrinsics was considerably lacking at the time.<\/p>\n<p>Apart from the compiler intrinsics being gated behind the rust nightly toolchain, I wasn&#8217;t particularly sure how to tackle &#8220;graceful degradation&#8221; when executing on AVX2-incapable hardware. As mentioned earlier, the vectorization of <code>tac<\/code> was entirely done by hand &#8211; as in, by explicitly using the AVX2 assembly instructions to carry out vectorized operations, rather than by writing vectorization-friendly high-level rust code and then praying that the compiler would correctly vectorize the logic. Three years ago, the state of rust&#8217;s support for dynamic CPU instruction support detection looked very much different than it does today, and I wasn&#8217;t looking forward to releasing a <code>tac<\/code> 2.0 that required a nightly build (aka &#8220;bleeding edge&#8221; or &#8220;unstable&#8221;) of the rust compiler. 
For better or for worse, the SIMD-accelerated <code>tac<\/code> builds lay dormant (in a branch of the original code, no less) on GitHub until a couple of months ago when a number of other rust rewrites of common greybeard utilities were making the rounds and I was inspired to revisit the codebase.<\/p>\n<p>Fortunately, in 2021 it was considerably easier to dynamically enable AVX2 support in <code>tac<\/code>, first by taking advantage of the <code><a href=\"https:\/\/doc.rust-lang.org\/std\/macro.is_x86_feature_detected.html\">is_x86_feature_detected!()<\/a><\/code> macro (available since mid-2018) to detect early on whether or not AVX2 intrinsics are supported at runtime, and then <a href=\"https:\/\/github.com\/neosmart\/tac\/commit\/f840c3e2ce86bfbffca37fdd3812552a6053e03a\" rel=\"nofollow\">converting a few manual invocations<\/a> of nightly-only intrinsics to use SIMD-enabled equivalents exposed by the standard library instead. It helped that I had split off the core search logic into two blackbox functions, one with the original naive search logic, and the other with the AVX2-accelerated search logic, so all I needed to do was <a href=\"https:\/\/github.com\/neosmart\/tac\/commit\/0d6899e8d4cebeefb9679ae6c7a7636b54b6557b#diff-42cb6807ad74b3e201c5a7ca98b911c5fa08380e942be6e4ac5807f8377f87fcL287-R302\" rel=\"nofollow\">call the right function<\/a>. What remained at that point was to verify that the generated dynamic-detection binary actually performed as I expected it to.<\/p>\n<p>Until this point, I&#8217;d been compiling my local <code>tac<\/code> builds with my local environment defaults, in particular with the <code>RUSTFLAGS<\/code> environment variable set to <code>-C target-cpu=native<\/code> which would always result in the <code>rustc<\/code> compiler&#8217;s LLVM backend generating AVX2-compatible instructions. 
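For reference, the two build configurations compared in the benchmarks that follow can be produced with something like this (a sketch, assuming a normal cargo project layout):

```shell
# Dynamic-detection build: no CPU-specific flags; AVX2 is only reached
# through the runtime-detected codepath.
cargo build --release

# Native build: let LLVM target the host CPU, which implies AVX2
# (and bmi1/bmi2) on any reasonably recent x86_64 machine.
RUSTFLAGS="-C target-cpu=native" cargo build --release
```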
In addition to inspecting the generated assembly to ensure the expected AVX2 instructions were present in the AVX2 codepath (and temporarily adding a <code>panic!(\"Wrong codepath!\")<\/code> to the non-AVX2 branch), I compiled a copy with <code>target-cpu=native<\/code> and set it aside, then compiled another copy with <code>RUSTFLAGS<\/code> unset, and proceeded to benchmark the two on some large (~40 GiB) payloads. I also benchmarked against my earlier proof-of-concept code that didn&#8217;t use dynamic detection at all.<sup id=\"rf1-4801\"><a href=\"#fn1-4801\" title=\"I used hyperfine for this &ndash; excellent tool to keep in your arsenal if you are into writing high-performance code.\" rel=\"footnote\">1<\/a><\/sup><\/p>\n<p>Unfortunately I found that although the AVX2 codepath was being correctly taken, the &#8220;dynamically accelerated&#8221; version of <code>tac<\/code> compiled <em>without<\/em> <code>target-cpu=native<\/code> was always benchmarking up to 15% slower than the same codebase with that equivalent of <code>-march=native<\/code> enabled (which was indistinguishable in benchmarks from the &#8220;always-AVX2&#8221; proof-of-concept). 
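As the footnote mentions, hyperfine is well suited to this kind of comparison; a hypothetical invocation (the binary names and input file here are made up for illustration) might look like:

```shell
# Compare the native-targeted build against the dynamic-detection build
# on a large input, discarding output so terminal I/O isn't measured.
hyperfine --warmup 1 \
    './tac-native big.log > /dev/null' \
    './tac-dynamic big.log > /dev/null'
```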
Given the non-negligible magnitude of the slowdown, it was fairly obvious that any initial analysis of the problem should focus on the core processing loop rather than any of the startup\/cleanup code, so with the code region isolated the only thing left to do was to diff the generated assembly for the core logic between the two binaries and see where they differed.<\/p>\n<p>While <code>rustc<\/code> makes it easy to generate the assembly (just set <code>RUSTFLAGS<\/code> to <code>--emit asm -Cllvm-args=--x86-asm-syntax=intel<\/code>), I like to use godbolt.org for a quicker feedback loop, primarily because the highlighting makes it immediately obvious what instructions pertain to which functions and vice-versa &#8211; although it makes diffing the results a bit harder.<sup id=\"rf2-4801\"><a href=\"#fn2-4801\" title=\"I recommend opening two side-by-side browser windows and copy-and-pasting to the terminal to diff -U\" rel=\"footnote\">2<\/a><\/sup> It was a bit hard to spot at first, but the line-by-line comparison ultimately made it possible to track down the problem.<\/p>\n<p>The way dynamically enabling SIMD-accelerated codepaths works in rust is a little &#8220;manual&#8221; in the sense that the detection of the supported CPU instructions is quite separate from emitting the differently-compiled code in question; i.e. the following doesn&#8217;t actually accomplish anything (and you wouldn&#8217;t intuitively expect it to):<\/p>\n<pre><code class=\"language-rust\">fn foo() {\r\n    \/\/ Perform some operation here\r\n}\r\n\r\nfn foo_simd() {\r\n    \/\/ Perform the same operation here\r\n}\r\n\r\npub fn main() {\r\n    if is_x86_feature_detected!(\"avx2\") {\r\n        foo_simd()\r\n    } else {\r\n        foo()\r\n    }\r\n}\r\n<\/code><\/pre>\n<p>But what this lets you do is apply different compilation options to one branch of code, and then dynamically select\/execute it. 
So instead of compiling the entire binary with <code>-C target-feature=+avx2<\/code> or not at all, we can invoke <code>cargo<\/code> or <code>rustc<\/code> with no such flag specified, but then instruct the compiler to enable the additional target features when compiling a particular function (but note that if it calls out to any other functions, those calls won&#8217;t be duplicated and recompiled with the optimizations enabled):<\/p>\n<pre><code class=\"language-rust\">fn foo() {\r\n    \/\/ Perform some operation here\r\n}\r\n\r\n#[target_feature(enable = \"avx2\")]\r\nunsafe fn foo_simd() {\r\n    \/\/ Perform the same operation here\r\n}\r\n\r\npub fn main() {\r\n    if is_x86_feature_detected!(\"avx2\") {\r\n        unsafe { foo_simd() }\r\n    } else {\r\n        foo()\r\n    }\r\n}\r\n<\/code><\/pre>\n<p>Note that we had to specify the <code>avx2<\/code> feature bit twice and that there&#8217;s no &#8220;type checking&#8221; between the two &#8211; we could have checked for <code>is_x86_feature_detected!(\"foo\")<\/code> and then called a function decorated with <code>#[target_feature(enable = \"bar\")]<\/code> &#8211; which is why rust forces us to make the target function <code>unsafe<\/code> regardless of whether or not it actually uses any unsafe code. (This is a problem, since it can mask inadvertent usage of unsafe code.) 
Anyway, this is all just background to what the problem actually turned out to be, which is in turn a confluence of two issues:<\/p>\n<ol>\n<li>rust\/llvm does not correctly recognize &#8220;supersets&#8221; of instructions,<\/li>\n<li>rust\/llvm converts direct calls to intrinsics not supported by the specified <code>target_feature(...)<\/code> to non-inlined functions that end up containing just the requisite asm call itself, adding a massive amount of function call overhead.<sup id=\"rf3-4801\"><a href=\"#fn3-4801\" title=\"Especially if combined with llvm inefficiencies when calling avx2 functions, although I didn&rsquo;t run into that.\" rel=\"footnote\">3<\/a><\/sup><\/li>\n<\/ol>\n<p>In this case, the AVX2 search function was decorated with <code>#[target_feature(enable = \"avx2\")]<\/code> but in addition to its direct usage of AVX2 functions and types, it also made a call to the <code>BZHI<\/code> and <code>LZCNT<\/code> intrinsics\/asm instructions &#8211;\u00a0<em>which rustc\/llvm do not recognize as being supported via the <code>avx2<\/code> feature!<\/em> So although (to the best of this developer&#8217;s knowledge) there does not exist a processor on the face of this planet that supports AVX2 but\u00a0<em>doesn&#8217;t<\/em> support <code>BZHI<\/code> and <code>LZCNT<\/code>, attempting to use those two in a function that only enabled support for <code>avx2<\/code> resulted in codegen that turned something like this:<\/p>\n<pre><code class=\"language-rust\">#[target_feature(enable = \"avx2\")]\r\npub unsafe fn bzhi(num: u32) -&gt; u32 {\r\n    core::arch::x86_64::_bzhi_u32(num, 31)\r\n}\r\n<\/code><\/pre>\n<p>into something like this (<a href=\"https:\/\/godbolt.org\/z\/Wex69MazT\" rel=\"follow\">godbolt link<\/a>):<\/p>\n<pre><code class=\"language-asm\">core::core_arch::x86::bmi2::_bzhi_u32:\r\n        sub     rsp, 4\r\n        bzhi    eax, edi, esi\r\n        mov     dword ptr [rsp], eax\r\n        mov     eax, dword ptr [rsp]\r\n        add     
rsp, 4\r\n        ret\r\n\r\nexample::bzhi:\r\n        push    rax\r\n        mov     esi, 31\r\n        call    core::core_arch::x86::bmi2::_bzhi_u32\r\n        mov     dword ptr [rsp + 4], eax\r\n        mov     eax, dword ptr [rsp + 4]\r\n        pop     rcx\r\n        ret\r\n<\/code><\/pre>\n<p>\u2026which is obviously not what we want.<\/p>\n<p><em>Technically<\/em>, <code>BZHI<\/code> (zero the high bits starting from the specified bit) is not part of the <code>avx2<\/code> instruction set but rather part of the <code>bmi2<\/code> (bit manipulation instructions v2) instruction set. However, AMD introduced support for the BMI2 instructions at the same time as they first introduced support for AVX2 (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Excavator_(microarchitecture)\" rel=\"follow\">as part of the Excavator microarchitecture<\/a>, in 2015). Intel likewise introduced BMI2 support (along with BMI1 support, as a matter of fact) <a href=\"https:\/\/en.wikipedia.org\/wiki\/Haswell_(microarchitecture)\" rel=\"follow\">as part of the Haswell microarchitecture<\/a> in 2013, also at the same time they debuted support for AVX2. No AVX2 CPU has shipped from either company without BMI2 since then, and it&#8217;s pretty unfathomable that any would in the future.<\/p>\n<p>Likewise, <code>LZCNT<\/code> (count number of leading zero bits) is not officially part of the AVX2 instruction set but rather part of the <code>bmi1<\/code> instruction set &#8211; which, as we saw earlier, was supported by AMD prior to their debuting AVX2 support but only supported by Intel in Haswell alongside <code>avx2<\/code> and <code>bmi2<\/code> support. 
In other words, you can view <code>avx2<\/code> support as a strict superset of support for both <code>bmi1<\/code> and <code>bmi2<\/code>, and, to a lesser extent, support for <code>bmi2<\/code> as nearly identical to support for <code>avx2<\/code> (though there could be x86 CPUs out there with <code>bmi2<\/code> but no <code>avx2<\/code>). Long story short, rust&#8217;s codegen (here provided by llvm) would certainly be within its rights to expand <code>avx2<\/code> to <code>avx2,bmi1,bmi2<\/code> (and then some).<\/p>\n<p>Confusingly, the poor codegen for <code>lzcnt<\/code> at least causes the intrinsic to be expanded inline into the fallback logic for platforms that don&#8217;t support the <code>bmi1<\/code> instruction set:<\/p>\n<pre><code class=\"language-rust\">#![feature(core_intrinsics)]\r\n\r\n#[target_feature(enable = \"avx2\")]\r\npub unsafe fn ctlz(num: i32) -&gt; i32 {\r\n    core::intrinsics::ctlz(num)\r\n}\r\n<\/code><\/pre>\n<pre><code class=\"language-asm\">example::ctlz:\r\n        push    rax\r\n        bsr     eax, edi\r\n        mov     ecx, 63\r\n        cmove   eax, ecx\r\n        xor     eax, 31\r\n        mov     dword ptr [rsp + 4], eax\r\n        mov     eax, dword ptr [rsp + 4]\r\n        mov     dword ptr [rsp], eax\r\n        mov     eax, dword ptr [rsp]\r\n        pop     rcx\r\n        ret\r\n<\/code><\/pre>\n<p>However, as you can see in the first assembly output from earlier, the <code>bzhi<\/code> intrinsic generates a non-inlined call to a new function that just executes <code>bzhi<\/code> directly, meaning you gain virtually nothing but function call overhead. Presumably there is a reason for this (unless it&#8217;s just bad codegen); my best guess is that this is an attempt to guard against microarchitectures that would generate a <code>#UD<\/code> exception if the instruction is merely seen, even if never executed? 
That seems a bit unlikely, given that we needed to implement <code>W^X<\/code> as its own thing, but I could be wrong.<\/p>\n<p>In any case, <a href=\"https:\/\/github.com\/neosmart\/tac\/commit\/956ee58511880bf3b4064234da8733f1a6589522\" rel=\"nofollow\">decorating the AVX2-enabled method<\/a> with<\/p>\n<pre><code class=\"language-rust\">#[target_feature(enable = \"avx2\")]\r\n#[target_feature(enable = \"bmi1\")]\r\n#[target_feature(enable = \"bmi2\")]\r\nunsafe fn search_avx2() {\r\n    \/\/ \u2026\r\n}\r\n<\/code><\/pre>\n<p>worked, and the dynamically enabled AVX2 acceleration of the core <code>tac<\/code> logic benchmarked identically to the other two builds (which is to say, extremely fast).<\/p>\n<p>What&#8217;s the impact of these changes? Our <code>tac<\/code> 2.0 release is up to 3x faster than the GNU Coreutils version of <code>tac<\/code> when benchmarking with speculative execution mitigations disabled. If the mitigations <em>are<\/em> enabled, then our <code>tac<\/code> will be even faster than that, as we are using memory-mapped files (with specifically mmap-friendly access patterns) to avoid the repeated syscalls to refill the buffers, which are considerably slower with mitigations enabled.<\/p>\n<p>You can <a href=\"https:\/\/github.com\/neosmart\/tac\/\" rel=\"nofollow\">explore the source code yourself<\/a> to see how this all comes together or inspect the SIMD instructions used to speed up the search. You can grab pre-compiled binaries for the major platforms <a href=\"https:\/\/github.com\/neosmart\/tac\/releases\/tag\/2.0\" rel=\"nofollow\">from the releases page<\/a>, and I&#8217;m working on getting <code>tac<\/code> into the FreeBSD ports repo and into either <code>choco<\/code> or <code>scoop<\/code> for Windows users. 
As with all our other open source projects, help getting the binaries into platform\/distribution-specific package managers is always welcome.<\/p>\n<p>Join the discussion <a href=\"https:\/\/news.ycombinator.com\/item?id=28359680\" rel=\"follow\">on Hacker News<\/a> or on <a href=\"https:\/\/old.reddit.com\/r\/rust\/comments\/peovws\/using_simd_acceleration_in_rust_to_create_the\/?\" rel=\"follow\">reddit&#8217;s \/r\/rust<\/a>. You can follow me <a href=\"https:\/\/twitter.com\/mqudsi\" rel=\"follow\">on twitter @mqudsi<\/a> and\/or subscribe below to receive emails when I post specifically about rust (and never anything else):<\/p>\n<hr class=\"footnotes\"><ol class=\"footnotes\"><li id=\"fn1-4801\"><p>I used <a href=\"https:\/\/github.com\/sharkdp\/hyperfine\" rel=\"nofollow\">hyperfine<\/a> for this &#8211; excellent tool to keep in your arsenal if you are into writing high-performance code.&nbsp;<a href=\"#rf1-4801\" class=\"backlink\" title=\"Jump back to footnote 1 in the text.\">&#8617;<\/a><\/p><\/li><li id=\"fn2-4801\"><p>I recommend opening two side-by-side browser windows and copy-and-pasting to the 
terminal to <code>diff -U<\/code>&nbsp;<a href=\"#rf2-4801\" class=\"backlink\" title=\"Jump back to footnote 2 in the text.\">&#8617;<\/a><\/p><\/li><li id=\"fn3-4801\"><p>Especially if combined with llvm inefficiencies <a href=\"https:\/\/github.com\/rust-lang\/rust\/issues\/71025\" rel=\"nofollow\">when calling avx2 functions<\/a>, although I didn&#8217;t run into that.&nbsp;<a href=\"#rf3-4801\" class=\"backlink\" title=\"Jump back to footnote 3 in the text.\">&#8617;<\/a><\/p><\/li><\/ol>","protected":false},"excerpt":{"rendered":"<p>NeoSmart Technologies&#8217; open source (MIT-licensed), cross-platform tac (written in rust) has been updated to version 2.0 (GitHub link). This is a major release with enormous performance improvements under-the-hood, primarily featuring handwritten SIMD acceleration of the core logic that allows it &hellip; <a href=\"https:\/\/neosmart.net\/blog\/using-simd-acceleration-in-rust-to-create-the-worlds-fastest-tac\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":505,"featured_media":4148,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[999,1],"tags":[1012,564,936,1011,968],"class_list":["post-4801","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-programming","category-software","tag-avx2","tag-performance","tag-rust","tag-simd","tag-tac"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/neosmart.net\/blog\/wp-content\/uploads\/tac.png","jetpack_shortlink":"https:\/\/wp.me\/p4xDa-1fr","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/posts\/
4801","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/users\/505"}],"replies":[{"embeddable":true,"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/comments?post=4801"}],"version-history":[{"count":17,"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/posts\/4801\/revisions"}],"predecessor-version":[{"id":5133,"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/posts\/4801\/revisions\/5133"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/media\/4148"}],"wp:attachment":[{"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/media?parent=4801"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/categories?post=4801"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/tags?post=4801"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}