
bench: add remote_spawn benchmark for inject queue contention #7944

Merged
Darksonn merged 2 commits into tokio-rs:master from alex:bench-remote-spawn-contention on Mar 11, 2026

Conversation

@alex (Contributor) commented Feb 28, 2026

Add a benchmark that measures contention on the scheduler's inject queue mutex (push_remote_task) when multiple external threads spawn tasks into the runtime simultaneously. Every rt.spawn() from a non-worker thread unconditionally goes through push_remote_task, so this directly measures the scalability of the inject queue.

Results on an M1 Max MacBook Pro (10 cores), spawning 12,800 total tasks:

threads/1: 3.39 ms (265 ns/task, 1.00x)
threads/2: 4.74 ms (370 ns/task, 1.40x)
threads/4: 5.89 ms (460 ns/task, 1.74x)
threads/8: 8.10 ms (633 ns/task, 2.39x)

Wall-clock time increases with more threads despite constant total work, confirming the single mutex serializes producers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alex force-pushed the bench-remote-spawn-contention branch from 8449476 to ce47f1a on February 28, 2026 18:04
@alex (Author) commented Feb 28, 2026

I also tried a version of this that had remote threads pushing onto an unbounded mpsc queue -- it displayed much worse scaling behavior, but it wasn't clear how much of that was due to the mpsc queue itself vs. the inject queue.

@martin-g (Member) left a comment

By using .iter_custom() you could measure only the time spent spawning and ignore the time spent awaiting the handles and cleaning up.

diff --git i/benches/remote_spawn.rs w/benches/remote_spawn.rs
index 16cfa0b9..d0aa4741 100644
--- i/benches/remote_spawn.rs
+++ w/benches/remote_spawn.rs
@@ -32,38 +32,46 @@ fn remote_spawn_contention(c: &mut Criterion) {
             |b, &num_threads| {
                 let rt = rt();
                 let tasks_per_thread = TOTAL_TASKS / num_threads;
+                let barrier = Barrier::new(num_threads);

-                b.iter(|| {
-                    let barrier = Barrier::new(num_threads);
-
-                    std::thread::scope(|s| {
-                        let handles: Vec<_> = (0..num_threads)
-                            .map(|_| {
-                                let barrier = &barrier;
-                                let rt = &rt;
-                                s.spawn(move || {
-                                    let mut join_handles = Vec::with_capacity(tasks_per_thread);
-                                    barrier.wait();
-
-                                    for _ in 0..tasks_per_thread {
-                                        join_handles.push(rt.spawn(async {}));
-                                    }
-                                    join_handles
+                b.iter_custom(|iters| {
+                    let mut total_duration = std::time::Duration::ZERO;
+                    for _ in 0..iters {
+
+                        let start = std::time::Instant::now();
+
+                        let all_handles = std::thread::scope(|s| {
+                            let handles: Vec<_> = (0..num_threads)
+                                .map(|_| {
+                                    let barrier = &barrier;
+                                    let rt = &rt;
+                                    s.spawn(move || {
+                                        let mut join_handles = Vec::with_capacity(tasks_per_thread);
+                                        barrier.wait();
+
+                                        for _ in 0..tasks_per_thread {
+                                            join_handles.push(rt.spawn(async {}));
+                                        }
+                                        join_handles
+                                    })
                                 })
-                            })
-                            .collect();
+                                .collect();
+
+                            handles
+                                .into_iter()
+                                .flat_map(|h| h.join().unwrap())
+                                .collect::<Vec<_>>()
+                        });

-                        let all_handles: Vec<_> = handles
-                            .into_iter()
-                            .flat_map(|h| h.join().unwrap())
-                            .collect();
+                        total_duration += start.elapsed();

                         rt.block_on(async {
                             for h in all_handles {
                                 h.await.unwrap();
                             }
                         });
-                    });
+                    }
+                    total_duration
                 });
             },
         );
@@ -85,7 +93,6 @@ fn parallelism_levels() -> Vec<usize> {

 fn rt() -> Runtime {
     runtime::Builder::new_multi_thread()
-        .enable_all()
         .build()
         .unwrap()
 }

@alex (Author) commented Mar 3, 2026

Thanks, will implement this feedback later this afternoon!

Do you agree that this is measuring the right thing? (That's my biggest concern, that I might accidentally be measuring the wrong contention!)

@alex (Author) commented Mar 4, 2026

Feedback incorporated, thanks!

Use iter_custom to time only the spawn phase (push_remote_task
contention), excluding the await/cleanup of join handles. Also remove
enable_all() since the benchmark doesn't need IO or timers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alex force-pushed the bench-remote-spawn-contention branch from b9d5eb0 to 073e72d on March 4, 2026 00:03
@martin-g (Member) commented Mar 4, 2026

Do you agree that this is measuring the right thing?

I think so, yes!

@alex (Author) commented Mar 6, 2026

FWIW, once this lands I've got an implementation of a sharded queue (similar to #7757) that improves scalability considerably (tested up to 64 cores).
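A sharded design of that general kind can be sketched as follows (a hypothetical illustration only, not the PR's or tokio's actual implementation; `ShardedQueue` and its round-robin shard selection are assumptions for the sketch):

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical sharded inject queue: producers pick a shard round-robin,
/// so concurrent pushes mostly land on different mutexes instead of
/// serializing on one.
struct ShardedQueue<T> {
    shards: Vec<Mutex<VecDeque<T>>>,
    next: AtomicUsize,
}

impl<T> ShardedQueue<T> {
    fn new(num_shards: usize) -> Self {
        Self {
            shards: (0..num_shards)
                .map(|_| Mutex::new(VecDeque::new()))
                .collect(),
            next: AtomicUsize::new(0),
        }
    }

    fn push(&self, value: T) {
        // Round-robin shard selection spreads producers across mutexes.
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.shards.len();
        self.shards[i].lock().unwrap().push_back(value);
    }

    fn pop(&self) -> Option<T> {
        // A consumer scans shards in order; real designs are fancier.
        self.shards.iter().find_map(|s| s.lock().unwrap().pop_front())
    }

    fn len(&self) -> usize {
        self.shards.iter().map(|s| s.lock().unwrap().len()).sum()
    }
}

fn main() {
    let q = ShardedQueue::new(4);
    for i in 0..10 {
        q.push(i);
    }
    assert_eq!(q.len(), 10);
    while q.pop().is_some() {}
    assert_eq!(q.len(), 0);
}
```

The trade-off is that pops must search across shards and FIFO ordering holds only per shard, which is why sharding helps producer-side scalability at some cost in consumer complexity.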

@martin-g added the I-slow (Problems and improvements related to performance) and A-benches (Area: Benchmarks) labels on Mar 7, 2026
@alex (Author) commented Mar 10, 2026

Is anything else needed to merge this? Happy to take on any remaining work.

@Darksonn merged commit ee04abe into tokio-rs:master on Mar 11, 2026
102 checks passed
@alex alex deleted the bench-remote-spawn-contention branch March 11, 2026 12:03