Commit 37b150a

committed
Add 'Why Not the Networking TS?' motivation section to paper
New Section 2 addresses common objections:

- The Template Tax: N×M instantiations, compile times
- The Encapsulation Problem: leaky APIs vs clean task types
- Translation Unit Benefits: ABI stability, compile time, encapsulation
- Coroutine-Compatible vs Coroutine-First: adapter vs native design
- When to Use the Networking TS Instead: honest trade-offs

Renumbered all sections (now 1-14).
1 parent a68be4e commit 37b150a

1 file changed: +144 −36 lines changed

doc/research/corosio.md

Lines changed: 144 additions & 36 deletions
@@ -23,7 +23,115 @@ We sought a third path.

---

-## 2. The Insight: Coroutines as Natural Type Erasure
+## 2. Motivation: Why Not the Networking TS?

The Networking TS (and its progenitor, Boost.Asio) is the de facto standard for asynchronous I/O in C++. It is mature, well tested, and supports coroutines through completion tokens. Why build something new?

### 2.1 The Template Tax

The Networking TS design philosophy—zero-overhead abstraction through templates—incurs costs that compound in large codebases:

**Every async operation signature includes executor and handler types:**

```cpp
template<class Protocol, class Executor,
         class MutableBufferSequence, class ReadHandler>
void async_read(basic_socket<Protocol, Executor>& s,
                MutableBufferSequence const& buffers,
                ReadHandler&& handler);
```

**Composed operations propagate these types:**

```cpp
template<class Executor, class Handler>
void async_http_request(basic_socket<tcp, Executor>& sock,
                        http::request const& req,
                        Handler&& handler);
```

This creates:

- **N×M template instantiations** for N operations × M executor/handler combinations
- **Header-only dependencies** that must be recompiled on every change
- **Binary size growth** that can reach megabytes in complex applications
- **Compile times measured in minutes** for moderately sized codebases
### 2.2 The Encapsulation Problem

The Networking TS templates expose implementation details through public interfaces:

```cpp
// User sees:
class http_client
{
public:
    template<class Executor, class Handler>
    void async_get(Executor ex, std::string url, Handler&& h);
};
```

The user is forced to know:

- What executor types are valid
- What handler signatures are expected
- That the implementation uses sockets at all

Platform types (`OVERLAPPED`, `io_uring_sqe`) are hidden, but the *structure* of the async machinery leaks through every API boundary.

**Our approach:**

```cpp
class http_client
{
public:
    task async_get(std::string url); // That's it.
};
```

The implementation—sockets, buffers, executors—lives in the translation unit. The interface is stable. The ABI is stable. Compilation is fast.
### 2.3 Translation Unit Benefits

Hiding implementation in `.cpp` files provides:

1. **ABI Stability**: Library interfaces don't change when implementation details change. Users don't recompile when you switch from `epoll` to `io_uring`.
2. **Compile Time**: Incremental builds recompile only changed translation units. Header changes don't cascade through the dependency graph.
3. **Encapsulation**: Platform-specific types (`OVERLAPPED`, `HANDLE`, `io_uring_sqe`) never appear in headers. Mocking and testing become trivial.
4. **Binary Size**: One instantiation per function, not one per template argument combination.
### 2.4 Coroutine-Compatible vs Coroutine-First

The Networking TS added coroutine support through completion tokens like `use_awaitable`:

```cpp
co_await async_read(socket, buffer, use_awaitable);
```

This adapts callback-based operations for coroutines. It works, but:

- **Double indirection**: Callback machinery wraps coroutine machinery
- **Executor handling is manual**: `co_spawn` and `bind_executor` are required
- **Error handling diverges**: Exceptions vs `error_code` vs `expected`
- **Mental model mismatch**: Writing coroutines that think in callbacks

Our design is **coroutine-first**: the suspension/resumption model is the foundation, not an adapter. Executor propagation is automatic. Type erasure is structural. The callback path (`dispatch().resume()`) is the compatibility layer, not the other way around.
### 2.5 When to Use the Networking TS Instead

The Networking TS remains the right choice when:

- You need callback-based APIs for C compatibility
- Template instantiation cost is acceptable
- You're already invested in the Asio ecosystem
- Maximum performance with zero abstraction is required
- The standardization timeline matters for your project

Our framework is better suited when:

- Coroutines are the primary programming model
- Public APIs must hide implementation details
- Compile time and binary size matter
- ABI stability is required across library boundaries
- Clean, simple interfaces are prioritized

---
## 3. The Insight: Coroutines as Natural Type Erasure

C++20 coroutines provide type erasure *by construction*—but not through the handle type. `std::coroutine_handle<void>` and `std::coroutine_handle<promise_type>` are both just pointers with identical overhead. The erasure that matters is *structural*:
@@ -37,7 +145,7 @@ The key insight is that a coroutine's promise can store execution context *by re

---

-## 3. The Executor Model
+## 4. The Executor Model

We define an executor as any type satisfying the `is_executor` concept:
@@ -56,7 +164,7 @@ The distinction between `dispatch` and `post` is fundamental:

Executors must be equality-comparable to enable optimizations when source and target executors are identical.

-### 3.1 Type-Erased Executor Reference
+### 4.1 Type-Erased Executor Reference

To store executors without encoding their type, we introduce `executor_ref`—a non-owning, type-erased reference:
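The `executor_ref` definition itself is collapsed out of this diff excerpt. The general shape such a non-owning erased reference can take is sketched below (hypothetical and reduced to a single `post` entry point; the paper's real type carries `dispatch` as well):

```cpp
#include <cassert>
#include <functional>

// Sketch of a non-owning, type-erased executor reference: a void* to the
// concrete executor plus a function pointer that knows how to call post()
// on it. No allocation, no virtual table inside the executor itself.
class executor_ref {
    void* ex_;
    void (*post_)(void*, std::function<void()>);
public:
    template<class Executor>
    executor_ref(Executor& ex) noexcept
        : ex_(&ex),
          post_(+[](void* p, std::function<void()> f) {
              static_cast<Executor*>(p)->post(std::move(f));
          }) {}

    void post(std::function<void()> f) const { post_(ex_, std::move(f)); }

    // Identity comparison: same object, same type-erased entry point.
    bool operator==(executor_ref const& o) const noexcept {
        return ex_ == o.ex_ && post_ == o.post_;
    }
};

// A toy inline executor, just to exercise the reference.
struct inline_executor {
    void post(std::function<void()> f) { f(); }
};
```

The equality operator is what enables the dispatch-vs-post optimization mentioned above: when source and target compare equal, the work can run inline.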

@@ -91,11 +199,11 @@ bool operator==(executor_ref const& other) const noexcept

---

-## 4. Platform I/O: Hiding the Machinery
+## 5. Platform I/O: Hiding the Machinery

A central goal is encapsulation: platform-specific types (`OVERLAPPED`, `io_uring_sqe`, file descriptors) should not appear in public headers. We achieve this through *preallocated, type-erased operation state*.

-### 4.1 The Socket Abstraction
+### 5.1 The Socket Abstraction

```cpp
struct socket
@@ -127,7 +235,7 @@ The `state` structure:

3. Contains platform-specific members (`OVERLAPPED`, handles) invisible to callers
4. Is allocated *once* at socket construction, not per-operation

-### 4.2 Intrusive Work Queue
+### 5.2 Intrusive Work Queue

Submitted work uses an intrusive singly-linked list:
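The queue definition is collapsed out of this diff. An illustrative sketch of the technique (hypothetical names, not the paper's exact types):

```cpp
#include <cassert>

// Each work item embeds its own link pointer, so the queue needs no
// container storage: push and pop are pointer swaps, never allocations.
struct work_item {
    work_item* next = nullptr;
    void (*invoke)(work_item*) = nullptr;  // completion to run later
};

struct work_queue {
    work_item* head = nullptr;
    work_item* tail = nullptr;

    void push(work_item* w) {  // O(1), no allocation
        w->next = nullptr;
        if (tail) tail->next = w;
        else head = w;
        tail = w;
    }
    work_item* pop() {         // O(1), no deallocation
        work_item* w = head;
        if (w) {
            head = w->next;
            if (!head) tail = nullptr;
        }
        return w;
    }
};
```

Because the link lives inside the item, enqueueing preallocated operation state never touches the heap, which is the point of the design.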

@@ -144,7 +252,7 @@ This design eliminates container allocations—each work item carries its own li

---

-## 5. The Affine Awaitable Protocol
+## 6. The Affine Awaitable Protocol

The core innovation is how execution context flows through coroutine chains. We extend the standard awaitable protocol with an *affine* overload of `await_suspend` that returns a coroutine handle for symmetric transfer:

@@ -175,7 +283,7 @@ auto await_transform(Awaitable&& a)

This mechanism achieves implicit executor propagation: child coroutines inherit their parent's executor without explicit parameter passing.

-### 5.1 Symmetric Transfer
+### 6.1 Symmetric Transfer

A critical design decision is that `await_suspend` returns `std::coroutine_handle<>` rather than `void`. When `await_suspend` returns a handle, the runtime resumes that coroutine *without growing the stack*—effectively a tail call. This prevents stack overflow in deep coroutine chains.

@@ -203,7 +311,7 @@ coro dispatch(coro h) const

If the executor must post rather than dispatch (cross-thread), it returns `std::noop_coroutine()`.

-### 5.2 Sender/Receiver Compatibility
+### 6.2 Sender/Receiver Compatibility

The design is compatible with P3352R3 and `std::execution`. The `dispatch()` method returns a `std::coroutine_handle<>` that can be used in two ways:
@@ -232,7 +340,7 @@ ex.dispatch(h).resume(); // Explicit resume, noop if posted

This means one executor interface serves both coroutines and senders with no conditional code paths.

-### 5.3 The Task Type
+### 6.3 The Task Type

The `task` type represents a lazy, composable coroutine:
@@ -272,7 +380,7 @@ std::coroutine_handle<> await_suspend(coro h) const noexcept
272380
273381
---
274382
-## 6. Executor Switching: The `run_on` Primitive
+## 7. Executor Switching: The `run_on` Primitive
276384
277385
Any coroutine can switch executors mid-operation using `run_on`:
278386
@@ -294,7 +402,7 @@ This separation of `ex_` (where I run) and `caller_ex_` (where my caller resumes
294402

295403
---
296404

-## 7. Launching the Root: Executor Lifetime
+## 8. Launching the Root: Executor Lifetime
298406

299407
Top-level coroutines present a lifetime challenge: the executor must outlive all operations, but the coroutine owns only references. We solve this with a wrapper coroutine that *owns* the executor:
300408

@@ -328,11 +436,11 @@ This design imposes overhead only at the root—intermediate tasks pay nothing f
328436
329437
---
330438
-## 8. Frame Allocator Customization
+## 9. Frame Allocator Customization
332440
333441
The default coroutine allocation strategy—one heap allocation per frame—is suboptimal for repeated operations. We introduce a *frame allocator protocol* that allows I/O objects to provide custom allocation strategies.
334442
-### 8.1 The Frame Allocator Concepts
+### 9.1 The Frame Allocator Concepts
336444
337445
```cpp
338446
template<class A>
@@ -349,7 +457,7 @@ concept has_frame_allocator = requires(T& t) {
349457

350458
The `has_frame_allocator` concept allows I/O objects to opt in explicitly by providing a `get_frame_allocator()` member function.
351459

-### 8.2 Task Integration
+### 9.2 Task Integration
353461

354462
The `task::promise_type` overloads `operator new` to detect frame allocator providers:
355463

@@ -374,7 +482,7 @@ struct promise_type
374482
375483
When a coroutine's first or second parameter satisfies `has_frame_allocator`, the frame is allocated from that object's allocator. Otherwise, the global heap is used.
376484
-### 8.3 Allocation Tagging
+### 9.3 Allocation Tagging
378486
379487
To enable unified deallocation, we prepend a header to each frame:
380488
@@ -388,7 +496,7 @@ struct alloc_header
388496

389497
The header stores a deallocation function pointer and context. When `operator delete` is called, it reads the header to determine whether to use the custom allocator or the global heap.
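The tagging scheme can be sketched in a few lines (hypothetical helper names; the paper's actual header layout is only partially shown in this diff):

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Every frame is prefixed with a header naming its deallocation routine,
// so one operator delete can route frames back to whichever allocator
// produced them.
struct alloc_header {
    void (*dealloc)(void* ctx, void* block, std::size_t bytes);
    void* ctx;
};

void* tagged_allocate(std::size_t frame_bytes,
                      void (*dealloc)(void*, void*, std::size_t),
                      void* ctx) {
    auto* h = static_cast<alloc_header*>(
        ::operator new(sizeof(alloc_header) + frame_bytes));
    h->dealloc = dealloc;
    h->ctx = ctx;
    return h + 1;  // the frame begins just past the header
}

void tagged_deallocate(void* frame, std::size_t frame_bytes) {
    auto* h = static_cast<alloc_header*>(frame) - 1;  // recover the header
    h->dealloc(h->ctx, h, sizeof(alloc_header) + frame_bytes);
}
```

The cost is one pointer-sized step backward at deallocation time, in exchange for a single `operator delete` that works for pooled and heap frames alike.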
390498

-### 8.4 Thread-Local Frame Pool
+### 9.4 Thread-Local Frame Pool
392500

393501
I/O objects implement `get_frame_allocator()` returning a pool with thread-local caching:
394502

@@ -413,7 +521,7 @@ After the first iteration, frames are recycled without syscalls. The global pool
413521

414522
---
415523

-## 9. Allocation Analysis
+## 10. Allocation Analysis
417525

418526
With recycling enabled for both models, we achieve zero steady-state allocations:
419527

@@ -424,15 +532,15 @@ With recycling enabled for both models, we achieve zero steady-state allocations
424532
| async_read (3 levels) | 0 | 0 |
425533
| async_request (100 iterations) | 0 | 0 |
426534

-### 9.1 The Critical Insight: Recycling Matters for Both
+### 10.1 The Critical Insight: Recycling Matters for Both
428536

429537
A naive implementation of either model performs poorly. Without recycling:
430538
- **Callbacks**: Each I/O operation allocates and deallocates operation state
431539
- **Coroutines**: Each coroutine frame is heap-allocated and freed
432540

433541
The key optimization for *both* models is **thread-local recycling**: caching recently freed memory for immediate reuse by the next operation.
434542

-### 9.2 Callback Recycling
+### 10.2 Callback Recycling
436544

437545
For callbacks, we implement a single-block thread-local cache:
438546

@@ -468,7 +576,7 @@ struct op_cache
468576

469577
The critical pattern is: **delete before dispatch**. When an I/O operation completes, it deallocates its state *before* invoking the completion handler. If that handler immediately starts another operation, the allocation finds the just-freed memory in the cache.
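A single-block cache of the kind described is small enough to show whole. This is a sketch with hypothetical names, not the paper's exact `op_cache`:

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// One cached block per thread: delete-before-dispatch means the block
// freed by a completing operation is immediately reused by the next one.
struct op_cache {
    void* block = nullptr;
    std::size_t capacity = 0;

    void* allocate(std::size_t n) {
        if (block && capacity >= n) {  // hit: reuse without a syscall
            void* p = block;
            block = nullptr;
            return p;
        }
        return ::operator new(n);      // miss: fall back to the heap
    }
    void deallocate(void* p, std::size_t n) {
        if (!block) {                  // keep exactly one block cached
            block = p;
            capacity = n;
            return;
        }
        ::operator delete(p);
    }
};

thread_local op_cache tls_op_cache;  // per-thread, so no locking needed
```

After the first operation warms the cache, an allocate immediately following a deallocate returns the same block, which is the zero-steady-state-allocation behavior the tables below measure.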
470578

471-
### 9.3 Coroutine Frame Pooling
579+
### 10.3 Coroutine Frame Pooling
472580

473581
For coroutines, we use a global frame pool that all coroutines share, regardless of whether they have explicit frame allocator parameters:
474582

@@ -483,15 +591,15 @@ static void* operator new(std::size_t size)
483591

484592
This ensures that *all* coroutines—including lambdas, wrappers, and tasks without I/O object parameters—benefit from frame recycling. The pool uses thread-local caching with a global overflow pool for cross-thread scenarios.
485593

-### 9.4 Amortized Cost
+### 10.4 Amortized Cost
487595

488596
Both models achieve **zero steady-state allocations** after warmup. The first iteration populates the caches; all subsequent operations recycle memory without syscalls.
489597

490598
---
491599

-## 10. Performance Comparison
+## 11. Performance Comparison
493601

-### 10.1 Clang with Frame Elision
+### 11.1 Clang with Frame Elision
495603

496604
Benchmarks compiled with Clang 20.1, `-O3`, Windows x64, with `[[clang::coro_await_elidable]]`:
497605

@@ -502,7 +610,7 @@ Benchmarks compiled with Clang 20.1, `-O3`, Windows x64, with `[[clang::coro_awa
502610
| async_read (10×) | 44 ns | 99 ns | 2.3× |
503611
| async_request (100×) | 498 ns | 750 ns | **1.5×** |
504612

-### 10.2 MSVC Comparison
+### 11.2 MSVC Comparison
506614

507615
The same benchmarks compiled with MSVC 19.x, RelWithDebInfo, Windows x64:
508616

@@ -518,7 +626,7 @@ MSVC's coroutine implementation is approximately 2× slower than Clang's for thi
518626
- No support for `[[clang::coro_await_elidable]]`
519627
- Different code generation for symmetric transfer
520628

-### 10.3 Analysis
+### 11.3 Analysis
522630

523631
**Overhead Ratio Improves with Depth**: The key observation is that the callback/coroutine ratio *improves* as operation complexity increases:
524632

@@ -533,7 +641,7 @@ This is because coroutine overhead is *fixed per suspension*, while the useful w
533641
- MSVC: 1749 ns / 100 ops = 17.5 ns per I/O
534642
- Callback: ~5 ns per I/O
535643

-### 10.4 Real-World Context
+### 11.4 Real-World Context
537645

538646
For I/O-bound workloads:
539647
- Network RTT: 100,000+ ns
@@ -544,7 +652,7 @@ Even with MSVC, the coroutine overhead is **0.02%** of a typical network operati
544652

545653
---
546654

-## 11. The Unavoidable Cost: `resume()` Opacity
+## 12. The Unavoidable Cost: `resume()` Opacity
548656

549657
Coroutine performance is fundamentally limited by the opacity of `std::coroutine_handle<>::resume()`. The compiler cannot inline across resume boundaries because:
550658

@@ -556,7 +664,7 @@ Note: This overhead is unrelated to handle typing. Whether you hold `coroutine_h
556664

557665
This prevents optimizations that callbacks enable: register allocation across async boundaries, constant propagation through handlers, and dead code elimination of unused paths.
558666

-### 11.1 HALO and Coroutine Elision
+### 12.1 HALO and Coroutine Elision
560668

561669
HALO (Heap Allocation eLision Optimization) can inline coroutine frames when the compiler can prove:
562670
1. The coroutine is immediately awaited
@@ -586,7 +694,7 @@ With this attribute, Clang can elide nested coroutine frames into the parent's f
586694
587695
This optimization is Clang-specific. MSVC does not currently support coroutine await elision, contributing to its 2× slower coroutine performance.
588696
-### 11.2 Compiler Differences
+### 12.2 Compiler Differences
590698
591699
| Feature | Clang 20.x | MSVC 19.x |
592700
|---------|-----------|-----------|
@@ -597,17 +705,17 @@ This optimization is Clang-specific. MSVC does not currently support coroutine a
597705
598706
For performance-critical coroutine code, Clang currently provides superior optimization. MSVC's coroutine implementation continues to improve, but production code should account for this difference.
599707
-### 11.3 Implemented Mitigations
+### 12.3 Implemented Mitigations
601709
-1. **Frame pooling** (Section 8): Custom `operator new/delete` with thread-local caching eliminates allocation overhead after warmup
+1. **Frame pooling** (Section 9): Custom `operator new/delete` with thread-local caching eliminates allocation overhead after warmup
2. **`[[clang::coro_await_elidable]]`**: Enables frame elision for nested coroutines on Clang
-3. **Symmetric transfer** (Section 5.1): Returning handles from `await_suspend` prevents stack growth
-4. **Preallocated I/O state** (Section 4.1): Socket operation state is allocated once, not per-operation
+3. **Symmetric transfer** (Section 6.1): Returning handles from `await_suspend` prevents stack growth
+4. **Preallocated I/O state** (Section 5.1): Socket operation state is allocated once, not per-operation
606714
5. **Global frame pool fallback**: Coroutines without explicit frame allocator parameters still benefit from pooling
607715
608716
---
609717
-## 12. Design Trade-offs
+## 13. Design Trade-offs
611719
612720
| Aspect | Callback Model | Coroutine Model |
613721
|--------|---------------|-----------------|
@@ -627,7 +735,7 @@ For performance-critical coroutine code, Clang currently provides superior optim
627735
628736
---
629737
-## 13. Conclusion
+## 14. Conclusion
631739
632740
We have demonstrated a coroutine-first asynchronous I/O framework that achieves:
633741
