bench: add remote_spawn benchmark for inject queue contention #7944
Darksonn merged 2 commits into tokio-rs:master
Conversation
Add a benchmark that measures contention on the scheduler's inject queue mutex (push_remote_task) when multiple external threads spawn tasks into the runtime simultaneously. Every rt.spawn() from a non-worker thread unconditionally goes through push_remote_task, so this directly measures the scalability of the inject queue.

Results on an M1 Max MacBook Pro (10 cores), spawning 12,800 total tasks:

threads/1: 3.39 ms (265 ns/task, 1.00x)
threads/2: 4.74 ms (370 ns/task, 1.40x)
threads/4: 5.89 ms (460 ns/task, 1.74x)
threads/8: 8.10 ms (633 ns/task, 2.39x)

Wall-clock time increases with more threads despite constant total work, confirming the single mutex serializes producers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
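The bottleneck being measured can be modeled outside tokio entirely. The sketch below is a hypothetical std-only stand-in, not the actual benchmark: N producer threads funnel pushes through a single Mutex-protected queue, which is the same serialization pattern as the inject queue's push_remote_task path.

```rust
use std::collections::VecDeque;
use std::sync::{Barrier, Mutex};
use std::time::{Duration, Instant};

const TOTAL_TASKS: usize = 12_800;

// All producers push through one Mutex-guarded queue, mirroring the single
// inject-queue lock; returns (wall time, total items pushed).
fn contended_push(num_threads: usize) -> (Duration, usize) {
    let queue = Mutex::new(VecDeque::new());
    let barrier = Barrier::new(num_threads);
    let per_thread = TOTAL_TASKS / num_threads;

    let start = Instant::now();
    std::thread::scope(|s| {
        for _ in 0..num_threads {
            let queue = &queue;
            let barrier = &barrier;
            s.spawn(move || {
                barrier.wait(); // release all producers at once
                for i in 0..per_thread {
                    // Every push takes the same lock, so producers serialize.
                    queue.lock().unwrap().push_back(i);
                }
            });
        }
    });
    let elapsed = start.elapsed();
    (elapsed, queue.into_inner().unwrap().len())
}

fn main() {
    for threads in [1, 2, 4, 8] {
        let (elapsed, pushed) = contended_push(threads);
        println!("threads/{threads}: {elapsed:?} for {pushed} pushes");
    }
}
```

As in the real benchmark, total work is constant across thread counts, so any wall-clock growth is attributable to contention on the one lock.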
I also tried a version of this that had remote threads pushing onto an mpsc unbounded queue -- it displayed much worse scaling behavior, but it wasn't clear how much of that was due to the mpsc queue itself vs. the inject queue.
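For illustration, the mpsc variant described above can be sketched with std::sync::mpsc alone. This is a hypothetical reconstruction, not the code that was actually tried; it keeps the same producer pattern but swaps the mutex-guarded queue for an unbounded channel.

```rust
use std::sync::{mpsc, Barrier};
use std::time::{Duration, Instant};

const TOTAL_TASKS: usize = 12_800;

// Same producer pattern, but pushing through an unbounded mpsc channel
// instead of one shared mutex; returns (wall time, items received).
fn mpsc_push(num_threads: usize) -> (Duration, usize) {
    let (tx, rx) = mpsc::channel();
    let barrier = Barrier::new(num_threads);
    let per_thread = TOTAL_TASKS / num_threads;

    let start = Instant::now();
    std::thread::scope(|s| {
        for _ in 0..num_threads {
            let tx = tx.clone(); // each producer gets its own sender
            let barrier = &barrier;
            s.spawn(move || {
                barrier.wait();
                for i in 0..per_thread {
                    tx.send(i).unwrap();
                }
            });
        }
    });
    let elapsed = start.elapsed();
    drop(tx); // close the channel so the drain below terminates
    (elapsed, rx.into_iter().count())
}

fn main() {
    for threads in [1, 2, 4, 8] {
        let (elapsed, received) = mpsc_push(threads);
        println!("threads/{threads}: {elapsed:?} for {received} sends");
    }
}
```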
martin-g left a comment
By using .iter_custom() you could measure only the time spent spawning and exclude the time spent awaiting the join handles and cleaning up.
diff --git i/benches/remote_spawn.rs w/benches/remote_spawn.rs
index 16cfa0b9..d0aa4741 100644
--- i/benches/remote_spawn.rs
+++ w/benches/remote_spawn.rs
@@ -32,38 +32,46 @@ fn remote_spawn_contention(c: &mut Criterion) {
|b, &num_threads| {
let rt = rt();
let tasks_per_thread = TOTAL_TASKS / num_threads;
+ let barrier = Barrier::new(num_threads);
- b.iter(|| {
- let barrier = Barrier::new(num_threads);
-
- std::thread::scope(|s| {
- let handles: Vec<_> = (0..num_threads)
- .map(|_| {
- let barrier = &barrier;
- let rt = &rt;
- s.spawn(move || {
- let mut join_handles = Vec::with_capacity(tasks_per_thread);
- barrier.wait();
-
- for _ in 0..tasks_per_thread {
- join_handles.push(rt.spawn(async {}));
- }
- join_handles
+ b.iter_custom(|iters| {
+ let mut total_duration = std::time::Duration::ZERO;
+ for _ in 0..iters {
+
+ let start = std::time::Instant::now();
+
+ let all_handles = std::thread::scope(|s| {
+ let handles: Vec<_> = (0..num_threads)
+ .map(|_| {
+ let barrier = &barrier;
+ let rt = &rt;
+ s.spawn(move || {
+ let mut join_handles = Vec::with_capacity(tasks_per_thread);
+ barrier.wait();
+
+ for _ in 0..tasks_per_thread {
+ join_handles.push(rt.spawn(async {}));
+ }
+ join_handles
+ })
})
- })
- .collect();
+ .collect();
+
+ handles
+ .into_iter()
+ .flat_map(|h| h.join().unwrap())
+ .collect::<Vec<_>>()
+ });
- let all_handles: Vec<_> = handles
- .into_iter()
- .flat_map(|h| h.join().unwrap())
- .collect();
+ total_duration += start.elapsed();
rt.block_on(async {
for h in all_handles {
h.await.unwrap();
}
});
- });
+ }
+ total_duration
});
},
);
@@ -85,7 +93,6 @@ fn parallelism_levels() -> Vec<usize> {
fn rt() -> Runtime {
runtime::Builder::new_multi_thread()
- .enable_all()
.build()
.unwrap()
}
Thanks, will implement this feedback later this afternoon! Do you agree that this is measuring the right thing? (That's my biggest concern, that I might accidentally be measuring the wrong contention!)
Feedback incorporated, thanks!
Use iter_custom to time only the spawn phase (push_remote_task contention), excluding the await/cleanup of join handles. Also remove enable_all() since the benchmark doesn't need IO or timers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
I think so, yes!
FWIW, once this lands I've got an implementation of a sharded queue (similar to #7757) that improves scalability considerably (tested up to 64 cores).
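The general sharding idea behind that remark can be sketched as follows. This is a hedged illustration of the technique, not tokio's actual implementation or the code in the linked PR: split the one inject mutex into several independently locked shards and spread producers across them so concurrent pushes rarely collide on the same lock.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// One possible shape of a sharded inject queue (hypothetical): producers
// are spread round-robin over several shards, each behind its own mutex.
pub struct ShardedInject<T> {
    shards: Vec<Mutex<VecDeque<T>>>,
    next: AtomicUsize, // round-robin shard selector
}

impl<T> ShardedInject<T> {
    pub fn new(num_shards: usize) -> Self {
        Self {
            shards: (0..num_shards)
                .map(|_| Mutex::new(VecDeque::new()))
                .collect(),
            next: AtomicUsize::new(0),
        }
    }

    /// Producer side: pick a shard and lock only that shard,
    /// so two concurrent producers usually touch different locks.
    pub fn push(&self, value: T) {
        let idx = self.next.fetch_add(1, Ordering::Relaxed) % self.shards.len();
        self.shards[idx].lock().unwrap().push_back(value);
    }

    /// Consumer side: scan shards until a queued item is found.
    pub fn pop(&self) -> Option<T> {
        self.shards
            .iter()
            .find_map(|shard| shard.lock().unwrap().pop_front())
    }
}

fn main() {
    let q = ShardedInject::new(4);
    for i in 0..100 {
        q.push(i);
    }
    let mut drained = 0;
    while q.pop().is_some() {
        drained += 1;
    }
    println!("drained {drained} tasks");
}
```

The trade-off is that consumers may have to scan multiple shards, and global FIFO ordering is lost, which is why a change like this needs exactly the kind of contention benchmark this PR adds.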
Is anything else needed to merge this? Happy to take on any remaining work.