Add 'Why Not the Networking TS?' motivation section to paper
New Section 2 addresses common objections:
- The Template Tax: N×M instantiations, compile times
- The Encapsulation Problem: leaky APIs vs clean task types
- Translation Unit Benefits: ABI stability, compile time, encapsulation
- Coroutine-Compatible vs Coroutine-First: adapter vs native design
- When to Use the Networking TS Instead: honest trade-offs
Renumbered all sections (now 1-14).
## 2. Motivation: Why Not the Networking TS?
The Networking TS (and its progenitor Boost.Asio) is the de facto standard for asynchronous I/O in C++. It is mature, well-tested, and supports coroutines through completion tokens. Why build something new?
### 2.1 The Template Tax
The Networking TS design philosophy—zero-overhead abstraction through templates—incurs costs that compound in large codebases:
**Every async operation signature includes executor and handler types:**
Platform types (`OVERLAPPED`, `io_uring_sqe`) are hidden, but the *structure* of the async machinery leaks through every API boundary.
**Our approach:**
```cpp
class http_client
{
public:
    task async_get(std::string url); // That's it.
};
```
The implementation—sockets, buffers, executors—lives in the translation unit. The interface is stable. The ABI is stable. Compilation is fast.
### 2.3 Translation Unit Benefits
Hiding implementation in `.cpp` files provides:
1. **ABI Stability**: Library interfaces don't change when implementation details change. Users don't recompile when you switch from `epoll` to `io_uring`.
2. **Compile Time**: Incremental builds recompile only changed translation units. Header changes don't cascade through the dependency graph.
3. **Encapsulation**: Platform-specific types (`OVERLAPPED`, `HANDLE`, `io_uring_sqe`) never appear in headers. Mocking and testing become trivial.
4. **Binary Size**: One instantiation per function, not one per template argument combination.
### 2.4 Coroutine-Compatible vs Coroutine-First
The Networking TS added coroutine support through completion tokens like `use_awaitable`:

- **Executor handling is manual**: `co_spawn` and `bind_executor` required
- **Error handling diverges**: Exceptions vs `error_code` vs `expected`
- **Mental model mismatch**: Writing coroutines that think in callbacks
Our design is **coroutine-first**: the suspension/resumption model is the foundation, not an adapter. Executor propagation is automatic. Type erasure is structural. The callback path (`dispatch().resume()`) is the compatibility layer, not the other way around.
### 2.5 When to Use the Networking TS Instead
The Networking TS remains the right choice when:

- You need callback-based APIs for C compatibility
- Template instantiation cost is acceptable
- You're already invested in the Asio ecosystem
- Maximum performance with zero abstraction is required
- Standardization timeline matters for your project
Our framework is better suited when:

- Coroutines are the primary programming model
- Public APIs must hide implementation details
- Compile time and binary size matter
- ABI stability is required across library boundaries
- Clean, simple interfaces are prioritized
---
## 3. The Insight: Coroutines as Natural Type Erasure
C++20 coroutines provide type erasure *by construction*—but not through the handle type. `std::coroutine_handle<void>` and `std::coroutine_handle<promise_type>` are both just pointers with identical overhead. The erasure that matters is *structural*:
The key insight is that a coroutine's promise can store execution context *by reference*.
---
## 4. The Executor Model
We define an executor as any type satisfying the `is_executor` concept:
The distinction between `dispatch` and `post` is fundamental.
Executors must be equality-comparable to enable optimizations when source and target executors are identical.
### 4.1 Type-Erased Executor Reference
To store executors without encoding their type, we introduce `executor_ref`—a non-owning, type-erased reference:
A central goal is encapsulation: platform-specific types (`OVERLAPPED`, `io_uring_sqe`, file descriptors) should not appear in public headers. We achieve this through *preallocated, type-erased operation state*.
### 5.1 The Socket Abstraction
```cpp
struct socket
```

The `state` structure:

3. Contains platform-specific members (OVERLAPPED, handles) invisible to callers
4. Is allocated *once* at socket construction, not per-operation
### 5.2 Intrusive Work Queue
Submitted work uses an intrusive singly-linked list:
This design eliminates container allocations—each work item carries its own link.
---
## 6. The Affine Awaitable Protocol
The core innovation is how execution context flows through coroutine chains. We extend the standard awaitable protocol with an *affine* overload of `await_suspend` that returns a coroutine handle for symmetric transfer:
This mechanism achieves implicit executor propagation: child coroutines inherit their parent's executor without explicit parameter passing.
### 6.1 Symmetric Transfer
A critical design decision is that `await_suspend` returns `std::coroutine_handle<>` rather than `void`. When `await_suspend` returns a handle, the runtime resumes that coroutine *without growing the stack*—effectively a tail call. This prevents stack overflow in deep coroutine chains.
If the executor must post rather than dispatch (cross-thread), it returns `std::noop_coroutine()`.
### 6.2 Sender/Receiver Compatibility
The design is compatible with P3352R3 and `std::execution`. The `dispatch()` method returns a `std::coroutine_handle<>` that can be used in two ways:
Any coroutine can switch executors mid-operation using `run_on`:
This separation of `ex_` (where I run) and `caller_ex_` (where my caller resumes) is what makes mid-operation executor switching safe.
---
## 8. Launching the Root: Executor Lifetime
Top-level coroutines present a lifetime challenge: the executor must outlive all operations, but the coroutine owns only references. We solve this with a wrapper coroutine that *owns* the executor:
This design imposes overhead only at the root—intermediate tasks pay nothing for it.
---
## 9. Frame Allocator Customization
The default coroutine allocation strategy—one heap allocation per frame—is suboptimal for repeated operations. We introduce a *frame allocator protocol* that allows I/O objects to provide custom allocation strategies.
The `has_frame_allocator` concept allows I/O objects to opt-in explicitly by providing a `get_frame_allocator()` member function.
### 9.2 Task Integration
The `task::promise_type` overloads `operator new` to detect frame allocator providers:
When a coroutine's first or second parameter satisfies `has_frame_allocator`, the frame is allocated from that object's allocator. Otherwise, the global heap is used.
### 9.3 Allocation Tagging
To enable unified deallocation, we prepend a header to each frame:
The header stores a deallocation function pointer and context. When `operator delete` is called, it reads the header to determine whether to use the custom allocator or the global heap.
### 9.4 Thread-Local Frame Pool
I/O objects implement `get_frame_allocator()` returning a pool with thread-local caching:
After the first iteration, frames are recycled without syscalls. The global pool covers cross-thread scenarios.
---
## 10. Allocation Analysis
With recycling enabled for both models, we achieve zero steady-state allocations:
| Operation | Callback | Coroutine |
|-----------|----------|-----------|
| async_read (3 levels) | 0 | 0 |
| async_request (100 iterations) | 0 | 0 |
### 10.1 The Critical Insight: Recycling Matters for Both
A naive implementation of either model performs poorly. Without recycling:
- **Callbacks**: Each I/O operation allocates and deallocates operation state
- **Coroutines**: Each coroutine frame is heap-allocated and freed
The key optimization for *both* models is **thread-local recycling**: caching recently freed memory for immediate reuse by the next operation.
### 10.2 Callback Recycling
For callbacks, we implement a single-block thread-local cache:
The critical pattern is: **delete before dispatch**. When an I/O operation completes, it deallocates its state *before* invoking the completion handler. If that handler immediately starts another operation, the allocation finds the just-freed memory in the cache.
### 10.3 Coroutine Frame Pooling
For coroutines, we use a global frame pool that all coroutines share, regardless of whether they have explicit frame allocator parameters:
This ensures that *all* coroutines—including lambdas, wrappers, and tasks without I/O object parameters—benefit from frame recycling. The pool uses thread-local caching with a global overflow pool for cross-thread scenarios.
### 10.4 Amortized Cost
Both models achieve **zero steady-state allocations** after warmup. The first iteration populates the caches; all subsequent operations recycle memory without syscalls.
---
## 11. Performance Comparison
### 11.1 Clang with Frame Elision
Benchmarks compiled with Clang 20.1, `-O3`, Windows x64, with `[[clang::coro_await_elidable]]`:
The same benchmarks compiled with MSVC 19.x, RelWithDebInfo, Windows x64:
MSVC's coroutine implementation is approximately 2× slower than Clang's for this workload:
- No support for `[[clang::coro_await_elidable]]`
- Different code generation for symmetric transfer
### 11.3 Analysis
**Overhead Ratio Improves with Depth**: The key observation is that the callback/coroutine ratio *improves* as operation complexity increases:
This is because coroutine overhead is *fixed per suspension*, while the useful work grows with operation depth:
- MSVC: 1749 ns / 100 ops = 17.5 ns per I/O
- Callback: ~5 ns per I/O
### 11.4 Real-World Context
For I/O-bound workloads:
- Network RTT: 100,000+ ns
Even with MSVC, the coroutine overhead is **0.02%** of a typical network operation.
---
## 12. The Unavoidable Cost: `resume()` Opacity
Coroutine performance is fundamentally limited by the opacity of `std::coroutine_handle<>::resume()`. The compiler cannot inline across resume boundaries because:
Note: This overhead is unrelated to handle typing. Whether you hold `coroutine_handle<void>` or a typed `coroutine_handle<promise_type>`, `resume()` is equally opaque to the optimizer.
This prevents optimizations that callbacks enable: register allocation across async boundaries, constant propagation through handlers, and dead code elimination of unused paths.
### 12.1 HALO and Coroutine Elision
HALO (Heap Allocation eLision Optimization) can inline coroutine frames when the compiler can prove:
1. The coroutine is immediately awaited
With this attribute, Clang can elide nested coroutine frames into the parent's frame.
This optimization is Clang-specific. MSVC does not currently support coroutine await elision, contributing to its 2× slower coroutine performance.
### 12.2 Compiler Differences
| Feature | Clang 20.x | MSVC 19.x |
|---------|-----------|-----------|
For performance-critical coroutine code, Clang currently provides superior optimization. MSVC's coroutine implementation continues to improve, but production code should account for this difference.
### 12.3 Implemented Mitigations
1. **Frame pooling** (Section 9): Custom `operator new/delete` with thread-local caching eliminates allocation overhead after warmup
2. **`[[clang::coro_await_elidable]]`**: Enables frame elision for nested coroutines on Clang