Home

Awesome

Comparison of Rust async and Linux thread context switch time and memory use

These are a few programs that try to measure context switch time and task memory use in various ways. In summary:

These are probably not the limiting factors in your application, but it's nice to know that the headroom is there.

Measuring thread context switch time

The programs thread-brigade and async-brigade each create 500 tasks connected by pipes (like a “bucket brigade”) and measure how long it takes to propagate a single byte from the first to the last. One is implemented with threads, and the other is implemented with the Tokio crate's async I/O.

$ cd async-brigade/
$ /bin/time cargo run --release
    Finished release [optimized] target(s) in 0.02s
     Running `/home/jimb/rust/context-switch/target/release/async-brigade`
500 tasks, 10000 iterations:
mean 1.795ms per iteration, stddev 82.016µs (3.589µs per task per iter)
9.83user 8.33system 0:18.19elapsed 99%CPU (0avgtext+0avgdata 17144maxresident)k
0inputs+0outputs (0major+2283minor)pagefaults 0swaps
$

$ cd ../thread-brigade
$ /bin/time cargo run --release
    Finished release [optimized] target(s) in 0.02s
     Running `/home/jimb/rust/context-switch/target/release/thread-brigade`
500 tasks, 10000 iterations:
mean 2.657ms per iteration, stddev 231.822µs (5.313µs per task per iter)
9.14user 27.88system 0:26.91elapsed 137%CPU (0avgtext+0avgdata 16784maxresident)k
0inputs+0outputs (0major+3381minor)pagefaults 0swaps
$

In these runs, I'm seeing 18.19s / 26.91s ≅ 0.68 or a 30% speedup from going async. However, if I pin the threaded version to a single core, the speed advantage of async disappears:

$ taskset --cpu-list 1 /bin/time cargo run --release
    Finished release [optimized] target(s) in 0.02s
     Running `/home/jimb/rust/context-switch/target/release/thread-brigade`
500 tasks, 10000 iterations:
mean 1.709ms per iteration, stddev 102.926µs (3.417µs per task per iter)
4.81user 12.50system 0:17.37elapsed 99%CPU (0avgtext+0avgdata 16744maxresident)k
0inputs+0outputs (0major+3610minor)pagefaults 0swaps
$

I don't know why.

It would be interesting to see whether/how the number of tasks in the brigade affects these numbers.

Per-thread resident memory use in thread-brigade is about 9.5KiB, whereas per-async-task memory use in async-brigade is around 0.4KiB, a factor of ~20. See 'Measuring memory use', below.

There are differences in the system calls performed by the two versions:

The async-brigade performance isn't affected much if we switch from Tokio's default multi-thread executor to a single-threaded executor, so it's not spending much time in kernel context switches. thread-brigade does a kernel context switch from each task to the next. I think this means that context switches are more expensive than a recvfrom and epoll system call.

If we run the test with 50000 tasks (and reduce the number of iterations to 100), the speedup doesn't change much, but thread-brigade requires a 466MiB resident set, whereas async-brigade runs in around 21MiB. That's 10kiB of memory being actively touched by each task, versus 0.4kiB, about a twentieth. This isn't just the effect of pessimistically-sized thread stacks: we're looking at the resident set size, which shouldn't include pages allocated to the stack that the thread never actually touches. So the way Rust right-sizes futures seems really effective.

This microbenchmark doesn't do much, but a real application would add to each task's working set, and that difference might become less significant. But I was able to run async-brigade with 250,000 tasks; I wasn't able to get my laptop to run 250,000 threads at all.

The other programs are minor variations, or make other measurements:

Measuring memory use

The scripts thread-brigade/rss-per-thread.sh and async-brigade/rss-per-task.sh run their respective brigade microbenchmarks with varying numbers of tasks, and measure the virtual and resident memory consumption at each count. You can then do a linear regression to see the memory use of a single task. Note that async-brigade/rss-per-task.sh runs 10x as many tasks, to keep the noise down.

As mentioned above, in my measurements, each thread costs around 9.5KiB, and each async task costs around 0.4KiB, so the async version uses about 1/20th as much memory as the threaded version.

To run this script, you'll need to have the Linux pmap utility installed; this gives an accurate measurement of resident set size. On Fedora, this is included in the procps-ng package. (Pull requests for info about other major distributions welcome.)

Running tests with large numbers of threads

It's interesting to play with the number of tasks to see how that affects the relative speed of the async and threaded bucket brigades. But in order to test large numbers of threads, you may need to remove some of your system's guardrails.

On Linux:

With these changes made, I was able to run thread-brigade with 80000 tasks.

Does any of this matter?

In GitHub issue #1, @spacejam raised a good point:

overall, there are a lot of things here that really fade into insignificance when you consider the simple effort required to deserialize JSON or handle TLS. People often see that there's some theoretical benefit of async and then they accept far less ergonomic coding styles and the additional bug classes that only happen on async due to accidental blocking etc... despite the fact that when you consider a real-world deployed application, those "benefits" become indistinguishable from noise. However, due to the additional bug classes and worse ergonomics, there is now less energy for actually optimizing the business logic, which is where all of the cycles and resource use are anyway, so in-practice async implementations tend to be buggier and slower.

Below is my reply to them, lightly edited:

I have a few responses to this.

First of all, the reason I carried out the experiments in this repo in the first place was that I basically agreed with all of your points here. I think async is wildly oversold as "faster" without any real investigation into why that would be. It is hard to pin down exactly how the alleged advantages would arise. The same I/O operations have to be carried out either way (or worse); kernel context switches have been heavily optimized over the years (although the Spectre mitigations made them worse); and the whole story of the creation of NPTL was about it beating IBM's competing M-on-N thread implementation (which I see as analogous to async task systems) in the very microbenchmarks in which the M-on-N thread library was expected to have an advantage.

However, in conversations that I sought out with people with experience implementing high-volume servers, both with threads and with async designs, my async skepticism met a lot of pushback. They consistently reported struggling with threaded designs and not being able to get performance under control until they went async. Big caveat: they were not using Rust - these were older designs in C++ and even C. But it jibes well with the other successful designs you see out there, like nginx and Elixir (which is used by WhatsApp, among others), which are all essentially async.

So the purpose of these experiments was to see if I could isolate some of the sources of async's apparent advantages. It came down to memory consumption, creation time, and context switch time each having best-case order-of-magnitude advantages. Taken together, those advantages are beyond the point that I'm willing to call negligible. How often the best case actually arises is unclear, but one can argue that that, at least, is under the programmer's control, so the ceiling on how far implementation effort can get you is higher, in an async design.

Ultimately, as far as this repo is concerned, you need to decide whether you trust your readers to understand both the value and the limitations of microbenchmarks. If you assume your readers are in Twitter mode---they're just going to glance at the headlines and come away with a binary, "async good, two legs bad" kind of conclusion---then maybe it's better not to publish microbenchmarks at all, because they're misleading. Reality is more sensitive to details. But I think the benefit of offering these microbenchmarks and the README's analysis to careful readers might(?) outweigh the harm done by the noise from careless readers, because I think the careful readers are more likely to use the material in a way that has lasting impact. The wind changes; the forest does not.

The 2nd edition of Programming Rust (due out in June 2021) has a chapter on async that ends with a discussion of the rationale for async programming. It tries to dismiss some of the commonly heard bogus arguments, and present the advantages that async does have with the appropriate qualifications. It mentions tooling disadvantages. Generally, the chapter describes Rust's async implementation in a decent amount of detail, because we want our readers to be able to anticipate how it will perform and where it might help; the summary attempts to make clear what all that machinery can and cannot accomplish.

The only thing I'd add is that the measurements reported here for asynchronous performance were taken of an implementation that uses epoll-style system calls. The newer io_uring-style APIs seem radically different, and I'm curious to see whether these might change the story here.