Add 'Why Not the Networking TS?' motivation section to paper
New Section 2 addresses common objections:
- The Template Tax: N×M instantiations, compile times
- The Encapsulation Problem: leaky APIs vs clean task types
- Translation Unit Benefits: ABI stability, compile time, encapsulation
- Coroutine-Compatible vs Coroutine-First: adapter vs native design
- When to Use the Networking TS Instead: honest trade-offs
Renumbered all sections (now 1-14).
## 2. Motivation: Why Not the Networking TS?
The Networking TS (and its progenitor Boost.Asio) is the de facto standard for asynchronous I/O in C++. It is mature, well-tested, and supports coroutines through completion tokens. Why build something new?
### 2.1 The Template Tax
The Networking TS design philosophy—zero-overhead abstraction through templates—incurs costs that compound in large codebases:
**Every async operation signature includes executor and handler types:**
Platform types (`OVERLAPPED`, `io_uring_sqe`) are hidden, but the *structure* of the async machinery leaks through every API boundary.
**Our approach:**
```cpp
class http_client
{
public:
    task async_get(std::string url); // That's it.
};
```
The implementation—sockets, buffers, executors—lives in the translation unit. The interface is stable. The ABI is stable. Compilation is fast.
### 2.3 Translation Unit Benefits
Hiding implementation in `.cpp` files provides:
1. **ABI Stability**: Library interfaces don't change when implementation details change. Users don't recompile when you switch from `epoll` to `io_uring`.
2. **Compile Time**: Incremental builds recompile only changed translation units. Header changes don't cascade through the dependency graph.
3. **Encapsulation**: Platform-specific types (`OVERLAPPED`, `HANDLE`, `io_uring_sqe`) never appear in headers. Mocking and testing become trivial.
4. **Binary Size**: One instantiation per function, not one per template argument combination.
### 2.4 Coroutine-Compatible vs Coroutine-First
The Networking TS added coroutine support through completion tokens like `use_awaitable`:

- **Executor handling is manual**: `co_spawn` and `bind_executor` required
- **Error handling diverges**: Exceptions vs `error_code` vs `expected`
- **Mental model mismatch**: Writing coroutines that think in callbacks
Our design is **coroutine-first**: the suspension/resumption model is the foundation, not an adapter. Executor propagation is automatic. Type erasure is structural. The callback path (`dispatch().resume()`) is the compatibility layer, not the other way around.
### 2.5 When to Use the Networking TS Instead
The Networking TS remains the right choice when:

- You need callback-based APIs for C compatibility
- Template instantiation cost is acceptable
- You're already invested in the Asio ecosystem
- Maximum performance with zero abstraction is required
- Standardization timeline matters for your project
Our framework is better suited when:

- Coroutines are the primary programming model
- Public APIs must hide implementation details
- Compile time and binary size matter
- ABI stability is required across library boundaries
- Clean, simple interfaces are prioritized
---
## 3. The Insight: Coroutines as Natural Type Erasure
C++20 coroutines provide type erasure *by construction*—but not through the handle type. `std::coroutine_handle<void>` and `std::coroutine_handle<promise_type>` are both just pointers with identical overhead. The erasure that matters is *structural*:
The key insight is that a coroutine's promise can store execution context *by reference*.
---
## 4. The Executor Model
We define an executor as any type satisfying the `is_executor` concept:
The distinction between `dispatch` and `post` is fundamental.
Executors must be equality-comparable to enable optimizations when source and target executors are identical.
### 4.1 Type-Erased Executor Reference
To store executors without encoding their type, we introduce `executor_ref`—a non-owning, type-erased reference:
A central goal is encapsulation: platform-specific types (`OVERLAPPED`, `io_uring_sqe`, file descriptors) should not appear in public headers. We achieve this through *preallocated, type-erased operation state*.
### 5.1 The Socket Abstraction
```cpp
struct socket
```

The `state` structure:

3. Contains platform-specific members (OVERLAPPED, handles) invisible to callers
4. Is allocated *once* at socket construction, not per-operation
### 5.2 Intrusive Work Queue
Submitted work uses an intrusive singly-linked list:
This design eliminates container allocations—each work item carries its own link.
---
## 6. The Affine Awaitable Protocol
The core innovation is how execution context flows through coroutine chains. We extend the standard awaitable protocol with an *affine* overload of `await_suspend` that returns a coroutine handle for symmetric transfer:
This mechanism achieves implicit executor propagation: child coroutines inherit their parent's executor without explicit parameter passing.
### 6.1 Symmetric Transfer
A critical design decision is that `await_suspend` returns `std::coroutine_handle<>` rather than `void`. When `await_suspend` returns a handle, the runtime resumes that coroutine *without growing the stack*—effectively a tail call. This prevents stack overflow in deep coroutine chains.
If the executor must post rather than dispatch (cross-thread), it returns `std::noop_coroutine()`.
### 6.2 Sender/Receiver Compatibility
The design is compatible with P3352R3 and `std::execution`. The `dispatch()` method returns a `std::coroutine_handle<>` that can be used in two ways:
Any coroutine can switch executors mid-operation using `run_on`:
This separation of `ex_` (where I run) and `caller_ex_` (where my caller resumes) is what makes mid-operation executor switching safe.
---
## 8. Launching the Root: Executor Lifetime
Top-level coroutines present a lifetime challenge: the executor must outlive all operations, but the coroutine owns only references. We solve this with a wrapper coroutine that *owns* the executor:
This design imposes overhead only at the root—intermediate tasks pay nothing for it.
---
## 9. Frame Allocator Customization
The default coroutine allocation strategy—one heap allocation per frame—is suboptimal for repeated operations. We introduce a *frame allocator protocol* that allows I/O objects to provide custom allocation strategies.
The `has_frame_allocator` concept allows I/O objects to opt-in explicitly by providing a `get_frame_allocator()` member function.
### 9.2 Task Integration
The `task::promise_type` overloads `operator new` to detect frame allocator providers:
When a coroutine's first or second parameter satisfies `has_frame_allocator`, the frame is allocated from that object's allocator. Otherwise, the global heap is used.
### 9.3 Allocation Tagging
To enable unified deallocation, we prepend a header to each frame:
The header stores a deallocation function pointer and context. When `operator delete` is called, it reads the header to determine whether to use the custom allocator or the global heap.
### 9.4 Thread-Local Frame Pool
I/O objects implement `get_frame_allocator()` returning a pool with thread-local caching:
After the first iteration, frames are recycled without syscalls. The global pool covers cross-thread scenarios.
---
## 10. Allocation Analysis
With recycling enabled for both models, we achieve zero steady-state allocations:
| Operation | Callback | Coroutine |
|-----------|----------|-----------|
| async_read (3 levels) | 0 | 0 |
| async_request (100 iterations) | 0 | 0 |
### 10.1 The Critical Insight: Recycling Matters for Both
A naive implementation of either model performs poorly. Without recycling:
- **Callbacks**: Each I/O operation allocates and deallocates operation state
- **Coroutines**: Each coroutine frame is heap-allocated and freed
The key optimization for *both* models is **thread-local recycling**: caching recently freed memory for immediate reuse by the next operation.
### 10.2 Callback Recycling
For callbacks, we implement a single-block thread-local cache:
The critical pattern is: **delete before dispatch**. When an I/O operation completes, it deallocates its state *before* invoking the completion handler. If that handler immediately starts another operation, the allocation finds the just-freed memory in the cache.
### 10.3 Coroutine Frame Pooling
For coroutines, we use a global frame pool that all coroutines share, regardless of whether they have explicit frame allocator parameters:
This ensures that *all* coroutines—including lambdas, wrappers, and tasks without I/O object parameters—benefit from frame recycling. The pool uses thread-local caching with a global overflow pool for cross-thread scenarios.
### 10.4 Amortized Cost
Both models achieve **zero steady-state allocations** after warmup. The first iteration populates the caches; all subsequent operations recycle memory without syscalls.
---
## 11. Performance Comparison
### 11.1 Clang with Frame Elision
Benchmarks compiled with Clang 20.1, `-O3`, Windows x64, with `[[clang::coro_await_elidable]]`:
The same benchmarks compiled with MSVC 19.x, RelWithDebInfo, Windows x64:
MSVC's coroutine implementation is approximately 2× slower than Clang's for this workload:
- No support for `[[clang::coro_await_elidable]]`
- Different code generation for symmetric transfer
### 11.3 Analysis
**Overhead Ratio Improves with Depth**: The key observation is that the callback/coroutine ratio *improves* as operation complexity increases:
This is because coroutine overhead is *fixed per suspension*, while the useful work grows with operation depth:
- MSVC: 1749 ns / 100 ops = 17.5 ns per I/O
- Callback: ~5 ns per I/O
### 11.4 Real-World Context
For I/O-bound workloads:
- Network RTT: 100,000+ ns
Even with MSVC, the coroutine overhead is **0.02%** of a typical network operation.
---
## 12. The Unavoidable Cost: `resume()` Opacity
Coroutine performance is fundamentally limited by the opacity of `std::coroutine_handle<>::resume()`. The compiler cannot inline across resume boundaries because:
Note: This overhead is unrelated to handle typing. Whether you hold `coroutine_handle<void>` or a typed `coroutine_handle<promise_type>`, `resume()` is equally opaque to the optimizer.
This prevents optimizations that callbacks enable: register allocation across async boundaries, constant propagation through handlers, and dead code elimination of unused paths.
### 12.1 HALO and Coroutine Elision
HALO (Heap Allocation eLision Optimization) can inline coroutine frames when the compiler can prove:
1. The coroutine is immediately awaited
With this attribute, Clang can elide nested coroutine frames into the parent's frame.
This optimization is Clang-specific. MSVC does not currently support coroutine await elision, contributing to its 2× slower coroutine performance.
### 12.2 Compiler Differences
| Feature | Clang 20.x | MSVC 19.x |
|---------|-----------|-----------|
For performance-critical coroutine code, Clang currently provides superior optimization. MSVC's coroutine implementation continues to improve, but production code should account for this difference.
### 12.3 Implemented Mitigations
1. **Frame pooling** (Section 9): Custom `operator new/delete` with thread-local caching eliminates allocation overhead after warmup
2. **`[[clang::coro_await_elidable]]`**: Enables frame elision for nested coroutines on Clang