Benchmarking rust compilation speedups and slowdowns from sccache and -Zthreads

Known bugs, platform issues, and caching overhead could make it slower to compile with sccache than without

Just a PSA from one rust developer to another: if you use sccache, take a moment to benchmark a clean build[1] of your favorite or current project and verify whether or not having RUSTC_WRAPPER=sccache is doing you any favors.

I’ve been an sccache user almost from the very start, when the Mozilla team first introduced it in the rust discussions on GitHub maybe seven years back or so, probably because of my die-hard belief in the one-and-only ccache, earned over long years of it saving me considerable time and effort on C++ development projects (life pro-tip: git bisect on large C or C++ projects is a night-and-day difference with versus without ccache). At the same time, I was always painfully aware of just how little sccache actually cached compared to its C++ progenitor, and I have felt perpetually discontented ever since learning to query its hit rate stats with sccache -s (something I never needed to do for ccache).

But my blind belief in the value of build accelerators led me to complacency, and I confess that with sccache mostly chugging away without issue in the background, I kind of forgot that I had RUSTC_WRAPPER set at all. But I recently remembered it and in a bout of procrastination, decided to benchmark how much time sccache was actually saving me… and the results were decidedly not great.

The bulk of my rust use at $work is done on a workstation under WSLv1, while a lot of my open source rust work is done on a “mobile workstation” running a Debian derivative. I began my investigation on the WSL desktop, originally spurred by a desire to see what kind of speedups different values for the recently added -Z threads flag (which parallelizes the rustc frontend) would net me. I then remembered that I had sccache enabled and should disable it for deterministic results, which led me to include the impact of sccache in my benchmarks.

On a (well-cooled, not thermally throttled) 16-core (32-thread) AMD Ryzen Threadripper 1950X machine with 64 GB of DDR4 RAM and NVMe boot and (code) storage disks, using rustc 1.81.0-nightly to compile a relatively modest-sized rust project from scratch in debug mode (followed by exactly one clean rebuild in the case of sccache to see how much of a speedup it gives), I obtained the following results:

-Zthreads   sccache   Time      Time (2nd run)
not set     no        33.08s
not set     yes       1m 18s    56.20s
8           no        33.27s
8           yes       1m 32s    1m 00s
16          no        34.43s
16          yes       40.78s    56.06s
32          no        37.25s
32          yes       1m 14s    52.99s

Shockingly enough, there was not a single configuration where the usage of sccache ended up speeding up compilation over the baseline without it — not even on a subsequent clean build of the same code with the same RUSTFLAGS value.
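For anyone who wants to repeat this kind of measurement on their own project, the procedure can be sketched roughly as follows. This is a minimal sketch and not the harness used for the numbers above: the helper names (`rustflags_for`, `timed_clean_build`) and the command-line handling are made up for illustration, and it assumes that `cargo` resolves to a nightly toolchain (needed for -Zthreads) and that `sccache` is on the PATH.

```rust
use std::io;
use std::path::Path;
use std::process::Command;
use std::time::{Duration, Instant};

/// RUSTFLAGS value for a given -Zthreads setting (None = flag not passed).
fn rustflags_for(threads: Option<u32>) -> Option<String> {
    threads.map(|n| format!("-Zthreads={n}"))
}

/// Runs `cargo clean`, then times a from-scratch `cargo build` in `project`.
fn timed_clean_build(
    project: &Path,
    threads: Option<u32>,
    use_sccache: bool,
) -> io::Result<Duration> {
    // Wipe target/ so that every measurement is a clean build.
    let cleaned = Command::new("cargo").arg("clean").current_dir(project).status()?;
    if !cleaned.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "cargo clean failed"));
    }

    let mut build = Command::new("cargo");
    build.arg("build").current_dir(project);
    // An empty RUSTC_WRAPPER tells cargo not to use any wrapper.
    build.env("RUSTC_WRAPPER", if use_sccache { "sccache" } else { "" });
    match rustflags_for(threads) {
        Some(flags) => build.env("RUSTFLAGS", flags),
        None => build.env_remove("RUSTFLAGS"),
    };

    let start = Instant::now();
    let built = build.status()?;
    let elapsed = start.elapsed();
    if !built.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "cargo build failed"));
    }
    Ok(elapsed)
}

fn main() -> io::Result<()> {
    // Hypothetical CLI: the project directory is the only argument.
    let project = match std::env::args().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("usage: bench <path-to-cargo-project>");
            return Ok(());
        }
    };
    let project = Path::new(&project);

    for threads in [None, Some(8), Some(16), Some(32)] {
        for use_sccache in [false, true] {
            let first = timed_clean_build(project, threads, use_sccache)?;
            println!("-Zthreads={threads:?} sccache={use_sccache} first: {:.2}s", first.as_secs_f64());
            if use_sccache {
                // Second clean build with identical flags: the cache is now primed.
                let second = timed_clean_build(project, threads, use_sccache)?;
                println!("-Zthreads={threads:?} sccache={use_sccache} second: {:.2}s", second.as_secs_f64());
            }
        }
    }
    Ok(())
}
```

Timing the whole `cargo build` invocation from the outside keeps the comparison honest, since it includes whatever overhead the wrapper adds on top of rustc itself.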

After inspecting the results of sccache -s and with some experimentation, it turned out that this pathological performance appeared to be caused, in part, by having a full sccache cache folder.[2] Running rm -rf ~/.cache/sccache/* and then re-running the benchmark (a few project commits later, so the numbers are not directly comparable to the above) revealed a significant improvement to the baseline build times with sccache… but probably not enough to justify its use:

-Zthreads   sccache   Time      Time (2nd run)   Relative   Relative (2nd run)
not set     no        31.41s                     1.000
not set     yes       42.57s    29.47s           1.355      0.938
0           no        35.94s                     1.144
0           yes       44.26s    28.95s           1.231      0.806
2           no        30.53s                     0.972
2           yes       58.94s    38.36s           1.931      1.256
4           no        29.69s                     0.945
4           yes       1m 10s    43.57s           2.358      1.467
8           no        30.43s                     0.969
8           yes       1m 17s    47.90s           2.530      1.574
16          no        32.39s                     1.031
16          yes       1m 22s    52.67s           2.532      1.626
32          no        35.17s                     1.120
32          yes       1m 26s    53.27s           2.445      1.515

Relative times for the no-sccache rows are measured against the unset -Zthreads baseline (31.41s); for the sccache rows, they are measured against the no-sccache row with the same -Zthreads value.

Looking at the table above, using sccache slowed down the clean build by anywhere from 23% (with -Zthreads=0) to 153% (with -Zthreads set to 8 or 16). It sped up subsequent builds with identical flags and unchanged code/toolchain in only two cases: by 6% with no -Zthreads and by 19% with -Zthreads set to 0 (its default value). In every other configuration it still managed to slow down even subsequent clean builds with a fully primed cache, as compared to the no-sccache baseline, by anywhere from 25% (-Zthreads=2) to 63% (-Zthreads=16).
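The percentages here are simple ratios of build times. As a sketch of the arithmetic (the function names are mine, and the `1m ..s` entries from the table are converted to whole seconds by hand), each sccache time is divided by the no-sccache time for the same -Zthreads value:

```rust
/// Build time relative to a baseline: 1.0 = same, above 1.0 = slower.
fn relative_time(secs: f64, baseline_secs: f64) -> f64 {
    secs / baseline_secs
}

/// Signed percentage change against the baseline (positive = slower).
fn percent_slower(secs: f64, baseline_secs: f64) -> f64 {
    (relative_time(secs, baseline_secs) - 1.0) * 100.0
}

fn main() {
    // (-Zthreads, no sccache, sccache 1st run, sccache 2nd run), in seconds,
    // copied from the second WSL table.
    let rows = [
        ("not set", 31.41, 42.57, 29.47),
        ("0", 35.94, 44.26, 28.95),
        ("2", 30.53, 58.94, 38.36),
        ("4", 29.69, 70.0, 43.57),
        ("8", 30.43, 77.0, 47.90),
        ("16", 32.39, 82.0, 52.67),
        ("32", 35.17, 86.0, 53.27),
    ];

    for (threads, plain, first, second) in rows {
        println!(
            "-Zthreads {threads:>7}: first run {:+.0}%, second run {:+.0}%",
            percent_slower(first, plain),
            percent_slower(second, plain),
        );
    }
}
```

A negative percentage in the output means sccache made that build faster than the matching no-sccache build.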

Analyzing the benefits of -Zthreads is much harder. Looking at the cases with -Zthreads but no sccache, it appears that with a heavily saturated pipeline (building all project dependencies from scratch in non-incremental mode affords a lot of opportunities for parallelized codegen), the use of -Zthreads can provide at best a very modest 5% speedup to build times (in the case of -Zthreads=4) while actually slowing down compile times by 14% (the soon-to-be-default -Zthreads=0) and 12% (with -Zthreads=32).[3]

It’s interesting to note that the rust developers were rather more ebullient when introducing this feature to the world, claiming wins of up to 50% with -Zthreads=8 and suggesting that lower levels of parallelism would see lower speedups (the opposite of what I saw, where using 8 threads provided about half the benefit of using 4, and going any higher caused slow-downs rather than speed-ups). Note that I was compiling in the default dev/debug profile above, so maybe I should try and see what happens in release mode, though I would think the architectural limitations would persist.

Back to sccache, though.

One of the open source projects I contribute to most is fish-shell, which recently underwent a complete port from C++ to Rust (piece-by-piece while still passing all tests every step of the way). Some day I want to write at length about that experience, but the reason I’m bringing this up is because my fish experience has taught me that some things that are normally very fast under Linux can run much slower than expected under WSL, primarily due to I/O constraints caused by the filesystem virtualization layer. I haven’t dug into it yet, but going with the working theory that reading/writing some 550-800 MiB[4] was slowing down my builds (even with the code and the cache located on two separate NVMe drives), I moved on to my other machine (where I’m running Linux natively).
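To get a feel for how much data a build actually leaves on disk, a small directory walk is enough. The following is only a sketch using the standard library: `dir_size` is a name I made up, and the default of `./target` is an assumption about where you run it from. Other directories, such as the ~/.cache/sccache folder mentioned earlier, can be passed as arguments.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively sums the sizes (in bytes) of all regular files under `dir`.
fn dir_size(dir: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // DirEntry::metadata does not follow symlinks, so links are not counted.
        let meta = entry.metadata()?;
        if meta.is_dir() {
            total += dir_size(&entry.path())?;
        } else if meta.is_file() {
            total += meta.len();
        }
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    // Defaults to ./target; any other directories can be given as arguments.
    let mut dirs: Vec<String> = std::env::args().skip(1).collect();
    if dirs.is_empty() {
        dirs.push("target".to_string());
    }

    for dir in &dirs {
        let path = Path::new(dir);
        if !path.is_dir() {
            println!("{dir}: not a directory, skipping");
            continue;
        }
        let mib = dir_size(path)? as f64 / (1024.0 * 1024.0);
        println!("{dir}: {mib:.1} MiB");
    }
    Ok(())
}
```

This only counts file contents, not filesystem overhead, so it will read a little lower than what a tool reporting allocated blocks shows.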

Running the same benchmark with the same versions of rustc and sccache on the 4-core/8-thread Intel Xeon E3-1545M v5 (Skylake) with 32GiB of DDR3 RAM gave the following results, which were much more in line with my expectations for sccache (though even more disappointing when it came to the frontend parallelization flag):

-Zthreads   sccache   Time      Time (2nd run)
unset       no        1m 10s
unset       yes       1m 13s    14.12s
0           no        1m 12s
0           yes       1m 18s    14.26s
2           no        1m 10s
2           yes       1m 17s    14.20s
4           no        1m 11s
4           yes       1m 15s    14.44s
6           no        1m 12s
6           yes       1m 16s    14.90s
8           no        1m 14s
8           yes       1m 20s    15.04s

Here at last are the sccache results I was expecting! A maximum slowdown of about 8% for uncached builds and speedups of about 80% across the board for a (clean) rebuild of the same project immediately after caching the artifacts.[5]
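As a quick sanity check of the "about 80%" figure, the table entries can be converted to seconds and compared directly. This is a small sketch with made-up helper names (`parse_time`, `speedup_percent`) that only understands the two time formats used in the tables, such as `14.12s` and `1m 10s`:

```rust
/// Parses the time format used in the tables ("33.08s", "1m 18s") into seconds.
fn parse_time(s: &str) -> Option<f64> {
    if s.trim().is_empty() {
        return None;
    }
    let mut total = 0.0;
    for part in s.split_whitespace() {
        if let Some(mins) = part.strip_suffix('m') {
            total += mins.parse::<f64>().ok()? * 60.0;
        } else if let Some(secs) = part.strip_suffix('s') {
            total += secs.parse::<f64>().ok()?;
        } else {
            return None;
        }
    }
    Some(total)
}

/// Percentage of build time saved by the cached build relative to the uncached one.
fn speedup_percent(cached_secs: f64, uncached_secs: f64) -> f64 {
    (1.0 - cached_secs / uncached_secs) * 100.0
}

fn main() {
    // First row pair of the native Linux table: no sccache vs. primed sccache rebuild.
    let uncached = parse_time("1m 10s").expect("valid time");
    let cached = parse_time("14.12s").expect("valid time");
    println!(
        "cached rebuild saves {:.1}% of the {uncached:.0}s uncached build",
        speedup_percent(cached, uncached)
    );
}
```

Note that the times of a minute or more are only given to the nearest second in the tables, so percentages derived from them are approximate.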

As for -Zthreads, the results here are consistent with what we saw above, at least once you take into account the fact that there are significantly fewer cores/threads to distribute the parallelized frontend work across. But we’re left with the same conclusion: when the CPU is already handling a high degree of concurrency, with jobserver already saturated with work from the compilation pipeline across multiple compilation units from independent crates, adding further threads to the mix ends up hurting overall performance (to the tune of 14% in the worst case, when -Zthreads is set to the total number of cores). The good news is that adding just a slight degree of parallelization with -Zthreads=2 doesn’t hurt build times in this worst-case scenario and likely helps when the available threads aren’t already saturated with more work than they can handle, so that at least seems to be a safe value for the option for now.

I would have expected that -Zthreads wouldn’t “unilaterally” dictate the number of chunks frontend work was being broken up into. While I’m sure it integrates nicely with jobserver to prevent an insane number of threads from being spawned and overwhelming the machine, it would seem that dividing the frontend work into n chunks when there aren’t n threads immediately available ends up hurting overall build performance. So in that sense, I suppose it would be better if -Zthreads were a hint of sorts, treated as a “max chunks” limit, and if introspection of available threads happened before the decision was made to chunk the available work (and to what extent) so that the behavior of -Zthreads, even with a hard-coded number, would hopefully only ever improve build times and never hurt.
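To make the suggestion concrete, here is a toy sketch of the "max chunks" idea. It is purely hypothetical: it is not how rustc or jobserver work today, `effective_chunks` is a name I invented, and the number of idle threads is simply passed in rather than queried from anything real.

```rust
/// Hypothetical policy: treat the requested -Zthreads value as an upper bound
/// and never split the frontend work into more chunks than there are idle
/// threads, while always keeping at least one chunk so work can proceed.
fn effective_chunks(requested: usize, idle_threads: usize) -> usize {
    requested.min(idle_threads).max(1)
}

fn main() {
    // With -Zthreads=8 requested, the chunk count would follow the idle capacity.
    for idle in [0, 1, 3, 8, 32] {
        println!(
            "requested 8, {idle} idle thread(s) -> {} chunk(s)",
            effective_chunks(8, idle)
        );
    }
}
```

Under a policy like this, a hard-coded -Zthreads value would degrade to serial frontend work on a saturated machine instead of adding contention.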


  1. sccache does not cache nor speed up incremental builds, and recent versions try to more or less bypass the caching pipeline altogether in an attempt to avoid slowing down incremental builds. 

  2. A similar (but not identical) issue was reported in the project’s GitHub repo in December of 2019. 

  3. I assumed -Zthreads=0 would mean “default to the available concurrency” (i.e. 32 threads, in my case) but that doesn’t appear to be the case just by looking at the numbers. 

  4. The size of the target/ folder varies depending on the RUSTFLAGS used. 

  5. This was, of course, the same version of sccache that was tested under WSL above. 

  • One thought on “Benchmarking rust compilation speedups and slowdowns from sccache and -Zthreads”

    1. My own conclusion is that the poor performance of Windows’s WSL I/O destroys any good cache effects that sccache intends to offer.
