Just a PSA from one rust developer to another: if you use sccache, take a moment to benchmark a clean build[^1] of your favorite or current project and verify whether or not having `RUSTC_WRAPPER=sccache` is doing you any favors.
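If you've never done that comparison, a rough A/B like the following is enough to get a ballpark number. This is just a sketch: it assumes a Linux-ish shell and that dependencies have already been downloaded, so only compilation is being measured.

```sh
# Baseline: a clean build with no compiler wrapper.
unset RUSTC_WRAPPER
cargo clean && time cargo build

# The same clean build routed through sccache. Run this one twice:
# the second clean build is where the cache is supposed to pay off.
export RUSTC_WRAPPER=sccache
cargo clean && time cargo build
```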
I’ve been an `sccache` user almost from the very start, when the Mozilla team first introduced it in the rust discussions on GitHub maybe seven years back or so, probably because of my die-hard belief in the one-and-only `ccache`, earned over long years of it saving me considerable time and effort on C++ development projects (life pro-tip: `git bisect` on large C or C++ projects is a night-and-day difference with versus without `ccache`). At the same time, I was always painfully aware of just how little `sccache` actually cached compared to its C++ progenitor, and I’ve been left feeling perpetually discontented ever since learning to query its hit rate stats with `sccache -s` (something I never needed to do for `ccache`).
But my blind belief in the value of build accelerators led me to complacency, and I confess that with `sccache` mostly chugging away without issue in the background, I kind of forgot that I had `RUSTC_WRAPPER` set at all. I recently remembered it, though, and in a bout of procrastination decided to benchmark how much time `sccache` was actually saving me… and the results were decidedly not great.
The bulk of my rust use at $work is done on a workstation under WSLv1, while a lot of my open source rust work is done on a “mobile workstation” running a Debian derivative. I began my investigation on the WSL desktop, originally spurred by a desire to see what kind of speedups different values for the recently added `-Zthreads` flag to parallelize the `rustc` frontend would net me. Then I remembered that I had `sccache` enabled and should disable it for deterministic results… which led me to include the impact of `sccache` in my benchmarks.
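For reference, each configuration in the tables below boils down to a variation on the following. This is a sketch rather than the exact harness I used; `-Zthreads` is unstable, so it has to go through `RUSTFLAGS` on a nightly toolchain.

```sh
# One cell of the benchmark matrix: pick a -Zthreads value (or leave
# RUSTFLAGS unset) and toggle the sccache wrapper on or off.
export RUSTFLAGS="-Zthreads=8"
export RUSTC_WRAPPER=sccache     # omit this line for the no-sccache rows
cargo clean && time cargo +nightly build
```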
On a (well-cooled, not thermally throttled) 16-core (32-thread) AMD Ryzen ThreadRipper 1950X machine with 64 GB of DDR4 RAM and NVMe boot and (code) storage disks, using rustc 1.81.0-nightly to compile a relatively modest-sized rust project from scratch in debug mode (followed by exactly one clean rebuild in the case of sccache to see how much of a speedup it gives), I obtained the following results:
| `-Zthreads` | sccache | Time | Time (2nd run) |
|---|---|---|---|
| not set | no | 33.08s | |
| not set | yes | 1m 18s | 56.20s |
| 8 | no | 33.27s | |
| 8 | yes | 1m 32s | 1m 00s |
| 16 | no | 34.43s | |
| 16 | yes | 40.78s | 56.06s |
| 32 | no | 37.25s | |
| 32 | yes | 1m 14s | 52.99s |
Shockingly enough, there was not a single configuration where the usage of `sccache` ended up speeding up compilation over the baseline without it — not even on a subsequent clean build of the same code with the same `RUSTFLAGS` value.
After inspecting the output of `sccache -s` and doing some experimentation, it turned out that this pathological performance appeared to be caused, in part, by a full sccache cache folder.[^2] Running `rm -rf ~/.cache/sccache/*` and then re-running the benchmark (a few project commits later, so the numbers aren’t directly comparable to those above) revealed a significant improvement to the initial, cold-cache build times with sccache… but probably not enough to justify its use:
| `-Zthreads` | sccache | Time | Time (2nd Run) | Relative (`-Zthreads`) | Relative (sccache, initial) | Relative (sccache, 2nd run) |
|---|---|---|---|---|---|---|
| not set | no | 31.41s | | 1.000 | | |
| not set | yes | 42.57s | 29.47s | | 1.355 | 0.938 |
| 0 | no | 35.94s | | 1.144 | | |
| 0 | yes | 44.26s | 28.95s | | 1.231 | 0.806 |
| 2 | no | 30.53s | | 0.972 | | |
| 2 | yes | 58.94s | 38.36s | | 1.931 | 1.256 |
| 4 | no | 29.69s | | 0.945 | | |
| 4 | yes | 1m 10s | 43.57s | | 2.358 | 1.467 |
| 8 | no | 30.43s | | 0.969 | | |
| 8 | yes | 1m 17s | 47.90s | | 2.530 | 1.574 |
| 16 | no | 32.39s | | 1.031 | | |
| 16 | yes | 1m 22s | 52.67s | | 2.532 | 1.626 |
| 32 | no | 35.17s | | 1.120 | | |
| 32 | yes | 1m 26s | 53.27s | | 2.445 | 1.515 |

(The relative values in the `-Zthreads` column are normalized against the 31.41s no-flags, no-sccache baseline; the two sccache columns are normalized against the matching no-sccache row with the same `-Zthreads` value.)
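As an aside before digging into the numbers: if you want to reproduce that cache reset in a slightly more careful way than the bare `rm -rf` above, sccache’s own maintenance subcommands help. This assumes the default `~/.cache/sccache` location, i.e. no custom `SCCACHE_DIR`.

```sh
sccache --show-stats      # check the hit/miss counters before wiping anything
sccache --stop-server     # stop the background server before touching its cache
rm -rf ~/.cache/sccache   # default local disk cache location on Linux
sccache --start-server    # comes back up with an empty cache and fresh stats
```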
Looking at the chart above, using sccache slowed down the initial clean build by anywhere from 23% (with `-Zthreads=0`) to 153% (with `-Zthreads` set to 8 or 16). It sped up subsequent clean builds with identical flags and unchanged code/toolchain only when `-Zthreads` was left unset (a 6% speedup) or set to `0` (its default value, a 19% speedup); with every other value it managed to slow down even subsequent clean builds with a fully primed cache, by anywhere from 25% (`-Zthreads=2`) to 63% (`-Zthreads=16`) compared to the no-sccache baseline.
Analyzing the benefits of `-Zthreads` is much harder. Looking at the cases with `-Zthreads` but no `sccache`, it appears that with a heavily saturated pipeline (building all project dependencies from scratch in non-incremental mode affords a lot of opportunities for parallelized codegen), the use of `-Zthreads` provides at best a very modest 5% speedup to build times (with `-Zthreads=4`), while actually slowing down compile times by 14% (with the soon-to-be-default `-Zthreads=0`) and 12% (with `-Zthreads=32`).[^3]
It’s interesting to note that the rust developers were rather more ebullient when introducing this feature to the world, claiming wins of up to 50% with `-Zthreads=8` and suggesting that lower levels of parallelism would see smaller speedups (the opposite of what I saw, where using 8 threads provided about half the benefit of using 4, and going any higher caused slowdowns rather than speedups). Note that I was compiling in the default dev/debug profile above, so maybe I should try release mode and see what happens there, though I would expect the same architectural limitations to persist.
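If I do get around to it, the release-mode comparison is just the same commands with the release profile; a sketch, assuming nothing about the profile itself is customized:

```sh
# The same comparison, but with the release profile instead of dev/debug.
export RUSTFLAGS="-Zthreads=8"     # or whichever value is being tested
cargo clean && time cargo +nightly build --release
```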
Back to sccache, though.
One of the open source projects I contribute to most is fish-shell, which recently underwent a complete port from C++ to Rust (piece-by-piece, while still passing all tests every step of the way). Some day I want to write at length about that experience, but the reason I bring it up here is that my fish experience has taught me that some things that are normally very fast under Linux can run much slower than expected under WSL, primarily due to I/O constraints caused by the filesystem virtualization layer. I haven’t dug into it yet, but going with the working theory that reading/writing some 550-800 MiB[^4] was slowing down my builds (even with the code and the cache located on two separate NVMe drives), I moved on to my other machine, where I’m running Linux natively.
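For anyone who wants to sanity-check that theory on their own project, the amount of data a clean build shuffles around is easy to eyeball. The cache path below is sccache’s default; yours may differ.

```sh
# How big are the build artifacts and the sccache disk cache, respectively?
du -sh target/ ~/.cache/sccache
```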
Running the same benchmark with the same versions of `rustc` and `sccache` on the 4-core/8-thread Intel Xeon E3-1545M v5 (Skylake) with 32 GiB of DDR3 RAM gave the following results, which were much more in line with my expectations for sccache (though even more disappointing when it came to the frontend parallelization flag):
| `-Zthreads` | sccache | Time | Time (2nd Run) |
|---|---|---|---|
| unset | no | 1m 10s | |
| unset | yes | 1m 13s | 14.12s |
| 0 | no | 1m 12s | |
| 0 | yes | 1m 18s | 14.26s |
| 2 | no | 1m 10s | |
| 2 | yes | 1m 17s | 14.20s |
| 4 | no | 1m 11s | |
| 4 | yes | 1m 15s | 14.44s |
| 6 | no | 1m 12s | |
| 6 | yes | 1m 16s | 14.90s |
| 8 | no | 1m 14s | |
| 8 | yes | 1m 20s | 15.04s |
Here at last are the `sccache` results I was expecting! A maximum slowdown of about 8% for uncached builds and speedups of about 80% across the board for a (clean) rebuild of the same project immediately after caching the artifacts.[^5]
As for `-Zthreads`, the results here are consistent with what we saw above, at least once you take into account that there are significantly fewer cores/threads to distribute the parallelized frontend work across. But we’re left with the same conclusion: when the CPU is already handling a high degree of concurrency and the jobserver is already saturated with work from the compilation pipeline across multiple compilation units from independent crates, adding further threads to the mix ends up hurting overall performance (to the tune of 14% in the worst case, when `-Zthreads` is set to the total number of hardware threads). The good news is that adding just a slight degree of parallelization with `-Zthreads=2` doesn’t hurt build times even in this worst-case scenario, and it likely helps when the available threads aren’t already saturated with more work than they can handle, so that at least seems to be a safe value for the option for now.
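If `-Zthreads=2` does turn out to be a win on your hardware, one way to make it stick for a single project is a local cargo config. It’s written as a shell snippet here, though the file itself is plain TOML; keep in mind the flag is nightly-only, so this will break builds on a stable toolchain.

```sh
# Persist the flag for this project only (requires a nightly toolchain).
mkdir -p .cargo
cat >> .cargo/config.toml <<'EOF'
[build]
rustflags = ["-Zthreads=2"]
EOF
```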
I would have expected that `-Zthreads` wouldn’t “unilaterally” dictate the number of chunks the frontend work is broken up into. While I’m sure it integrates nicely with the jobserver to prevent an insane number of threads from being spawned and overwhelming the machine, it seems that dividing the frontend work into n chunks when there aren’t n threads immediately available ends up hurting overall build performance. In that sense, I suppose it would be better if `-Zthreads` were a hint of sorts, treated as a “max chunks” limit, with introspection of the available threads happening before the decision is made to chunk the available work (and to what extent), so that `-Zthreads`, even with a hard-coded number, would hopefully only ever improve build times and never hurt.
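To illustrate what I mean, a crude external approximation of that “hint” behavior might look like the following. It only caps the requested value at the machine’s total hardware threads (via `nproc`), not at what is actually idle at the moment, so it’s a very loose stand-in for the introspection I’d like rustc to do itself.

```sh
# Hypothetical wrapper: never ask the parallel frontend for more threads
# than the machine has, whatever the requested "hint" is.
requested=8                      # the value you would otherwise hard-code
available=$(nproc)
threads=$(( requested < available ? requested : available ))
RUSTFLAGS="-Zthreads=${threads}" cargo +nightly build
```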
[^1]: sccache does not cache nor speed up incremental builds, and recent versions try to more or less bypass the caching pipeline altogether in an attempt to avoid slowing down incremental builds.

[^2]: A similar (but not identical) issue was reported in the project’s GitHub repo in December of 2019.

[^3]: I assumed `-Zthreads=0` would mean “default to the available concurrency” (i.e. 32 threads, in my case), but that doesn’t appear to be the case just by looking at the numbers.

[^4]: The size of the `target/` folder varies depending on the `RUSTFLAGS` used.

[^5]: This was, of course, the same version of `sccache` that was tested under WSL above.