{"id":5166,"date":"2024-07-01T13:50:08","date_gmt":"2024-07-01T18:50:08","guid":{"rendered":"https:\/\/neosmart.net\/blog\/?p=5166"},"modified":"2024-07-10T15:13:45","modified_gmt":"2024-07-10T20:13:45","slug":"benchmarking-rust-compilation-speedups-and-slowdowns-from-sccache-and-zthreads","status":"publish","type":"post","link":"https:\/\/neosmart.net\/blog\/benchmarking-rust-compilation-speedups-and-slowdowns-from-sccache-and-zthreads\/","title":{"rendered":"Benchmarking rust compilation speedups and slowdowns from <code>sccache<\/code> and <code>-Zthreads<\/code>"},"content":{"rendered":"<p>Just a PSA from one rust developer to another: if you use <code>sccache<\/code>, take a moment to benchmark a clean build<sup id=\"rf1-5166\"><a href=\"#fn1-5166\" title=\"sccache does not cache nor speed up incremental builds, and recent versions try to more or less bypass the caching pipeline altogether in an attempt to avoid slowing down incremental builds.\" rel=\"footnote\">1<\/a><\/sup> of your favorite or current project and verify whether or not having <code>RUSTC_WRAPPER=sccache<\/code> is doing you any favors.<\/p>\n<p>I&#8217;ve been an sccache user almost from the very start, when the Mozilla team first introduced it in the rust discussions on GitHub maybe seven years back or so, probably because of my die-hard belief in the one-and-only <code>ccache<\/code> earned over long years of saving my considerable time and effort on C++ development projects (life pro-tip: <code>git bisect<\/code> on large C or C++ projects is a night-and-day difference with versus without <code>ccache<\/code>). 
At the same time, I was always painfully aware of just how little <code>sccache<\/code> actually cached compared to its C++ progenitor, and I had felt perpetually discontented ever since learning to query its hit rate stats with <code>sccache -s<\/code> (something I never needed to do for <code>ccache<\/code>).<\/p>\n<p>But my blind belief in the value of build accelerators led me to complacency, and I confess that with <code>sccache<\/code> <em>mostly<\/em> chugging away without issue in the background, I kind of forgot that I had <code>RUSTC_WRAPPER<\/code> set at all. But I recently remembered it and, in a bout of procrastination, decided to benchmark how much time <code>sccache<\/code> was <em>actually<\/em> saving me&#8230; and the results were decidedly not great.<\/p>\n<p><!--more--><\/p>\n<p>The bulk of my rust use at $work is done on a workstation under WSLv1, while a lot of my open source rust work is done on a &#8220;mobile workstation&#8221; running a Debian derivative. 
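<\/p>\n<p>Reproducing this kind of comparison boils down to timing clean builds across a small flag matrix, along these lines (a dry-run sketch, not any official tooling; it assumes a nightly toolchain for <code>-Zthreads<\/code> and prints each command instead of running it):<\/p>

```shell
# Dry-run sketch of the benchmark matrix: prints one clean-build command per
# configuration. Remove the echos to actually run it (requires nightly rustc
# for -Zthreads and sccache on $PATH; GNU time provides the wall-clock format).
for threads in "" 2 8 16 32; do
  for wrapper in "" sccache; do
    echo cargo clean
    echo env RUSTC_WRAPPER="$wrapper" RUSTFLAGS="${threads:+-Zthreads=$threads}" \
      /usr/bin/time -f "%E" cargo +nightly build
  done
done
# Afterwards, sccache -s reports the cache hit/miss statistics for the runs.
```

<p>Each timed run is a from-scratch build (<code>cargo clean<\/code> first); running the whole matrix a second time from the same commit produces the &#8220;2nd run&#8221; numbers.<\/p>\n<p>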
I began my investigation on the WSL desktop, originally spurred by a <del>need<\/del> desire to see what kind of speedups different values for <a href=\"https:\/\/blog.rust-lang.org\/2023\/11\/09\/parallel-rustc.html\" rel=\"follow\">the ~recently added <code>-Z threads<\/code> flag<\/a> to parallelize the <code>rustc<\/code> frontend would net me, then remembered that I had <code>sccache<\/code> enabled and should disable that for deterministic results&#8230; leading me to include the impact of <code>sccache<\/code> in my benchmarks.<\/p>\n<p>On a (well-cooled, not thermally throttled) 16-core (32-thread) AMD Ryzen Threadripper 1950X machine with 64 GB of DDR4 RAM and NVMe boot and (code) storage disks, using rustc 1.81.0-nightly to compile a relatively modest-sized rust project from scratch in debug mode (followed by exactly one clean rebuild in the case of sccache to see how much of a speedup it gives), I obtained the following results:<\/p>\n<table border=\"1\">\n<thead class=\"bold\">\n<tr>\n<th>-Zthreads<\/th>\n<th>sccache<\/th>\n<th>Time<\/th>\n<th>Time (2nd run)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><em>not set<\/em><\/td>\n<td>no<\/td>\n<td>33.08s<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><em>not set<\/em><\/td>\n<td>yes<\/td>\n<td>1m 18s<\/td>\n<td>56.20s<\/td>\n<\/tr>\n<tr>\n<td>8<\/td>\n<td>no<\/td>\n<td>33.27s<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>8<\/td>\n<td>yes<\/td>\n<td>1m 32s<\/td>\n<td>1m 00s<\/td>\n<\/tr>\n<tr>\n<td>16<\/td>\n<td>no<\/td>\n<td>34.43s<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>16<\/td>\n<td>yes<\/td>\n<td>40.78s<\/td>\n<td>56.06s<\/td>\n<\/tr>\n<tr>\n<td>32<\/td>\n<td>no<\/td>\n<td>37.25s<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>32<\/td>\n<td>yes<\/td>\n<td>1m 14s<\/td>\n<td>52.99s<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Shockingly enough, there was not a single configuration where using <code>sccache<\/code> sped up compilation over the baseline without it \u2014 not even on a subsequent clean build of the same code 
with the same <code>RUSTFLAGS<\/code> value.<\/p>\n<p>After inspecting the results of <code>sccache -s<\/code> and with some experimentation, it turned out that this pathological performance appeared to be caused \u2013 in part \u2013 by having a full sccache cache folder.<sup id=\"rf2-5166\"><a href=\"#fn2-5166\" title=\"A similar (but not identical) issue was reported in the project&rsquo;s GitHub repo in December of 2019.\" rel=\"footnote\">2<\/a><\/sup> Running <code>rm -rf ~\/.cache\/sccache\/*<\/code> then re-running the benchmark (a few project commits later, so as not to be directly compared to the above) revealed a significant improvement to the baseline build times with sccache\u2026 but probably not enough to justify its use:<\/p>\n<div class=\"table-scroll\">\n<table border=\"1\">\n<thead class=\"bold\">\n<tr>\n<th>-Zthreads<\/th>\n<th>sccache<\/th>\n<th>Time<\/th>\n<th>Time<br \/>\n(2nd Run)<\/th>\n<th>Relative<br \/>\n(no sccache)<\/th>\n<th>sccache<br \/>\n(initial)<\/th>\n<th>sccache<br \/>\n(subsequent)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><em>not set<\/em><\/td>\n<td>no<\/td>\n<td>31.41s<\/td>\n<td><\/td>\n<td>1.000<\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td><em>not set<\/em><\/td>\n<td>yes<\/td>\n<td>42.57s<\/td>\n<td>29.47s<\/td>\n<td><\/td>\n<td>1.355<\/td>\n<td>0.938<\/td>\n<\/tr>\n<tr>\n<td>0<\/td>\n<td>no<\/td>\n<td>35.94s<\/td>\n<td><\/td>\n<td>1.144<\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>0<\/td>\n<td>yes<\/td>\n<td>44.26s<\/td>\n<td>28.95s<\/td>\n<td><\/td>\n<td>1.231<\/td>\n<td>0.806<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>no<\/td>\n<td>30.53s<\/td>\n<td><\/td>\n<td>0.972<\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>yes<\/td>\n<td>58.94s<\/td>\n<td>38.36s<\/td>\n<td><\/td>\n<td>1.931<\/td>\n<td>1.256<\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>no<\/td>\n<td>29.69s<\/td>\n<td><\/td>\n<td>0.945<\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>yes<\/td>\n<td>1m 
10s<\/td>\n<td>43.57s<\/td>\n<td><\/td>\n<td>2.358<\/td>\n<td>1.467<\/td>\n<\/tr>\n<tr>\n<td>8<\/td>\n<td>no<\/td>\n<td>30.43s<\/td>\n<td><\/td>\n<td>0.969<\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>8<\/td>\n<td>yes<\/td>\n<td>1m 17s<\/td>\n<td>47.90s<\/td>\n<td><\/td>\n<td>2.530<\/td>\n<td>1.574<\/td>\n<\/tr>\n<tr>\n<td>16<\/td>\n<td>no<\/td>\n<td>32.39s<\/td>\n<td><\/td>\n<td>1.031<\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>16<\/td>\n<td>yes<\/td>\n<td>1m 22s<\/td>\n<td>52.67s<\/td>\n<td><\/td>\n<td>2.532<\/td>\n<td>1.626<\/td>\n<\/tr>\n<tr>\n<td>32<\/td>\n<td>no<\/td>\n<td>35.17s<\/td>\n<td><\/td>\n<td>1.120<\/td>\n<td><\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>32<\/td>\n<td>yes<\/td>\n<td>1m 26s<\/td>\n<td>53.27s<\/td>\n<td><\/td>\n<td>2.445<\/td>\n<td>1.515<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>Looking at the table above, using sccache slowed down the clean build by anywhere from 23% (with <code>-Zthreads=0<\/code>) to 153% (with <code>-Zthreads<\/code> set to 8 or 16), while providing a speed-up to subsequent builds with identical flags and unchanged code\/toolchain only in the case of no <code>-Zthreads<\/code> (6% speedup) and by 19% in the case of <code>-Zthreads<\/code> set to <code>0<\/code> (its default value),\u00a0<strong>but it still managed to slow down even subsequent, clean builds with a fully primed cache as compared to the no-sccache baseline <\/strong>by anywhere from 25% (<code>-Zthreads=2<\/code>) to 63% (<code>-Zthreads=16<\/code>).<\/p>\n<p>Analyzing the benefits of <code>-Zthreads<\/code> is much harder. 
Looking at the cases with <code>-Zthreads<\/code> but no <code>sccache<\/code>, it appears that with a heavily saturated pipeline (building all project dependencies from scratch in non-incremental mode affords a lot of opportunities for parallelized codegen), the use of <code>-Zthreads<\/code> can provide <em>at best<\/em> a very modest 5% speed-up in build times (in the case of <code>-Zthreads=4<\/code>) while actually <em>slowing down<\/em> compile times by 14% (the soon-to-be-default <code>-Zthreads=0<\/code>) and 12% (with <code>-Zthreads=16<\/code>).<sup id=\"rf3-5166\"><a href=\"#fn3-5166\" title=\"I assumed -Zthreads=0 would mean &ldquo;default to the available concurrency&rdquo; (i.e. 32 threads, in my case) but that doesn&rsquo;t appear to be the case just by looking at the numbers.\" rel=\"footnote\">3<\/a><\/sup><\/p>\n<p>It&#8217;s interesting to note that the rust developers were rather more ebullient when <a href=\"https:\/\/blog.rust-lang.org\/2023\/11\/09\/parallel-rustc.html\" rel=\"follow\">introducing this feature<\/a> to the world, claiming wins of up to 50% with <code>-Zthreads=8<\/code> and suggesting that lower levels of parallelism would see lower speedups (the opposite of what I saw, where using 8 threads provided about half the benefit of using 4, and going any higher caused slow-downs rather than speed-ups). Note that I was compiling in the default dev\/debug profile above, so maybe I should try and see what happens in release mode, though I would think the architectural limitations would persist.<\/p>\n<p>Back to sccache, though.<\/p>\n<p>One of the open source projects I contribute to most is <a href=\"https:\/\/github.com\/fish-shell\/fish-shell\/\" rel=\"nofollow\">fish-shell<\/a>, which recently underwent a complete port from C++ to Rust (piece-by-piece while still passing all tests every step of the way). 
Some day I want to write at length about that experience, but the reason I&#8217;m bringing this up is that my fish experience has taught me that some things that are normally very fast under Linux can run much slower than expected under WSL, primarily due to I\/O constraints caused by the filesystem virtualization layer. I haven&#8217;t dug into it yet, but going with the working theory that reading\/writing some 550-800 MiB<sup id=\"rf4-5166\"><a href=\"#fn4-5166\" title=\"The size of the target\/ folder varies depending on the RUSTFLAGS used.\" rel=\"footnote\">4<\/a><\/sup> was slowing down my builds (even with the code and the cache located on two separate NVMe drives), I moved on to my other machine (where I&#8217;m running Linux natively).<\/p>\n<p>Running the same benchmark with the same versions of <code>rustc<\/code> and <code>sccache<\/code> on the 4-core\/8-thread Intel Xeon E3-1545M v5 (Skylake) with 32 GiB of DDR3 RAM gave the following results, which were <em>much<\/em> more in line with my expectations for sccache (though even more disappointing when it came to the frontend parallelization flag):<\/p>\n<table border=\"1\">\n<thead class=\"bold\">\n<tr>\n<th>-Zthreads<\/th>\n<th>sccache<\/th>\n<th>Time<\/th>\n<th>Time (2nd Run)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>unset<\/td>\n<td>no<\/td>\n<td>1m 10s<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>unset<\/td>\n<td>yes<\/td>\n<td>1m 13s<\/td>\n<td>14.12s<\/td>\n<\/tr>\n<tr>\n<td>0<\/td>\n<td>no<\/td>\n<td>1m 12s<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>0<\/td>\n<td>yes<\/td>\n<td>1m 18s<\/td>\n<td>14.26s<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>no<\/td>\n<td>1m 10s<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>yes<\/td>\n<td>1m 17s<\/td>\n<td>14.20s<\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>no<\/td>\n<td>1m 11s<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>yes<\/td>\n<td>1m 15s<\/td>\n<td>14.44s<\/td>\n<\/tr>\n<tr>\n<td>6<\/td>\n<td>no<\/td>\n<td>1m 
12s<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>6<\/td>\n<td>yes<\/td>\n<td>1m 16s<\/td>\n<td>14.90s<\/td>\n<\/tr>\n<tr>\n<td>8<\/td>\n<td>no<\/td>\n<td>1m 14s<\/td>\n<td><\/td>\n<\/tr>\n<tr>\n<td>8<\/td>\n<td>yes<\/td>\n<td>1m 20s<\/td>\n<td>15.04s<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Here at last are the <code>sccache<\/code> results I was expecting! A maximum slowdown of about 8% for uncached builds and speedups of about 80% across the board for a (clean) rebuild of the same project immediately after caching the artifacts.<sup id=\"rf5-5166\"><a href=\"#fn5-5166\" title=\"This was, of course, the same version of sccache that was tested under WSL above.\" rel=\"footnote\">5<\/a><\/sup><\/p>\n<p>As for <code>-Zthreads<\/code>, the results here are consistent with what we saw above, at least once you take into account the fact that there are significantly fewer cores\/threads to distribute the parallelized frontend work across. But we&#8217;re left with the same conclusion: when the CPU is already handling a high degree of concurrency and the jobserver is already saturated with work from multiple independent compilation units, adding further threads to the mix ends up hurting overall performance (to the tune of 14% in the worst case, when <code>-Zthreads<\/code> is set to the total number of cores). The good news is that adding just a slight degree of parallelization with <code>-Zthreads=2<\/code> doesn&#8217;t hurt build times in this worst-case scenario and likely helps when the available threads aren&#8217;t already saturated with more work than they can handle, so that at least seems to be a safe value for the option for now.<\/p>\n<p>I would have <em>expected<\/em> that <code>-Zthreads<\/code> wouldn&#8217;t &#8220;unilaterally&#8221; dictate the number of chunks frontend work was being broken up into. 
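<\/p>\n<p>As a practical aside: if you want to pin the apparently-safe <code>-Zthreads=2<\/code> for a project rather than exporting <code>RUSTFLAGS<\/code> in every shell, cargo can inject the flag from a config file (a sketch using standard cargo configuration; the flag itself still requires a nightly toolchain):<\/p>

```toml
# .cargo/config.toml in the project root (or ~/.cargo/config.toml globally):
# every cargo invocation will pass -Zthreads=2 through to rustc.
[build]
rustflags = ["-Zthreads=2"]
```

<p>Note that changing <code>rustflags<\/code> invalidates existing build artifacts, so expect one full rebuild after adding it.<\/p>\n<p>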
While I&#8217;m sure it integrates nicely with jobserver to prevent an insane number of threads from being spawned and overwhelming the machine, it would seem that dividing the frontend work into <em>n<\/em> chunks when there aren&#8217;t <em>n<\/em> threads immediately available ends up hurting overall build performance. So in that sense, I suppose it would be better if <code>-Zthreads<\/code> were a hint of sorts, treated as a &#8220;max chunks&#8221; limit, and if introspection of available threads happened <em>before<\/em> the decision was made to chunk the available work (and to what extent) so that the behavior of <code>-Zthreads<\/code>, even with a hard-coded number, would hopefully only ever improve build times and never hurt.<\/p>\n<hr class=\"footnotes\"><ol class=\"footnotes\"><li id=\"fn1-5166\"><p>sccache does not cache nor speed up incremental builds, and recent versions try to more or less bypass the caching pipeline altogether in an attempt to avoid slowing down incremental builds.&nbsp;<a href=\"#rf1-5166\" class=\"backlink\" title=\"Jump back to footnote 1 in the text.\">&#8617;<\/a><\/p><\/li><li id=\"fn2-5166\"><p>A <a href=\"https:\/\/github.com\/mozilla\/sccache\/issues\/629\" rel=\"nofollow\">similar (but not identical) issue<\/a> was reported in the project&#8217;s GitHub repo in December of 2019.&nbsp;<a href=\"#rf2-5166\" class=\"backlink\" title=\"Jump back to footnote 2 in the text.\">&#8617;<\/a><\/p><\/li><li id=\"fn3-5166\"><p>I <em>assumed<\/em> <code>-Zthreads=0<\/code> would mean &#8220;default to the available concurrency&#8221; (i.e. 
32 threads, in my case) but that doesn&#8217;t appear to be the case just by looking at the numbers.&nbsp;<a href=\"#rf3-5166\" class=\"backlink\" title=\"Jump back to footnote 3 in the text.\">&#8617;<\/a><\/p><\/li><li id=\"fn4-5166\"><p>The size of the <code>target\/<\/code> folder varies depending on the <code>RUSTFLAGS<\/code> used.&nbsp;<a href=\"#rf4-5166\" class=\"backlink\" title=\"Jump back to footnote 4 in the text.\">&#8617;<\/a><\/p><\/li><li id=\"fn5-5166\"><p>This was, of course, the same version of <code>sccache<\/code> that was tested under WSL above.&nbsp;<a href=\"#rf5-5166\" class=\"backlink\" title=\"Jump back to footnote 5 in the text.\">&#8617;<\/a><\/p><\/li><\/ol>","protected":false},"excerpt":{"rendered":"<p>Just a PSA from one rust developer to another: if you use sccache, take a moment to benchmark a clean build1 of your favorite or current project and verify whether or not having RUSTC_WRAPPER=sccache is doing you any favors. I&#8217;ve &hellip; <a href=\"https:\/\/neosmart.net\/blog\/benchmarking-rust-compilation-speedups-and-slowdowns-from-sccache-and-zthreads\/\">Continue reading <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":505,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[999,1],"tags":[564,936,1034,1033],"class_list":["post-5166","post","type-post","status-publish","format-standard","hentry","category-programming","category-software","tag-performance","tag-rust","tag-rustc","tag-sccache"],"aioseo_notices":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p4xDa-1lk","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/posts\/5166","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/users\/505"}],"replies":[{"embeddable":true,"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/comments?post=5166"}],"version-history":[{"count":28,"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/posts\/5166\/revisions"}],"predecessor-version":[{"id":5195,"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/posts\/5166\/revisions\/5195"}],"wp:attachment":[{"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/media?parent=5166"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/categories?post=5166"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/neosmart.net\/blog\/wp-json\/wp\/v2\/tags?post=5166"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}