Fix RDMA re-entrancy assertion and lost wakeup deadlocks with I/O threads #3335
Ada-Church-Closure wants to merge 4 commits into valkey-io:unstable
Conversation
When using RDMA with I/O threads and pipelining (e.g., `-P 32`), the server crashes due to a `c->cmd_queue.len == 0` assertion or hangs indefinitely (lost wakeup).

Root causes:
1. Crash: `connUpdateState(c->conn)` was called before the parsed commands were executed. For RDMA, this synchronously triggers the CQ, re-entering parsing while `cmd_queue` is heavily populated.
2. Hang: I/O threads leave unparsed residual data in `querybuf` due to batch limits. Because the RDMA CQ is edge-triggered, returning to the event loop without draining the buffer causes a lost wakeup.

Fix:
- Track handled clients in `processIOThreadsReadDone` using a list.
- After draining the command queue via `processClientsCommandsBatch`, synchronously drain any residual `querybuf` using `processPendingCommandAndInputBuffer`.
- Safely call `connUpdateState` only after the command queue is fully empty to prevent re-entrancy.

This change strictly respects the edge-triggered nature of RDMA while remaining 100% safe for standard TCP/TLS connections.

Fixes valkey-io#3112

Signed-off-by: Ada-Church-Closure <[email protected]>
Force-pushed from 1cab189 to 3fc6395 (compare)
Codecov Report: ❌ Patch coverage is
Additional details and impacted files:

```
@@           Coverage Diff            @@
##           unstable    #3335   +/-  ##
============================================
- Coverage     75.05%   74.26%   -0.79%
============================================
  Files           129      130       +1
  Lines         71553    72730    +1177
============================================
+ Hits          53705    54016     +311
- Misses        17848    18714     +866
```
**pizhenwei** left a comment:
Thank you for your contribution and for discovering the root cause of this issue. Overall, the solution LGTM. I have only a few minor suggestions, and welcome further discussion.
Hi @zuiderkwast please correct me if I misunderstand anything.
…cture Signed-off-by: Ada-Church-Closure <[email protected]>
**pizhenwei** left a comment:
LGTM, thanks!
Co-authored-by: zhenwei pi [email protected]
Signed-off-by: zhenwei pi [email protected]
**zuiderkwast** left a comment:
Thanks for finding the root cause of the issue and solving it!
I don't know if the extra linked list affects performance in any way, with or without pipelining, so I kicked off benchmark tests. If it turns out to matter, we could try a fifo.{c,h} queue instead of a linked list to make it faster. Let's wait for the benchmark and see.
```c
 * for edge-triggered transports like RDMA). */
list *handled_clients = listCreate();

listNode *next = listFirst(server.clients_pending_io_read);
```
This communication between IO threads and main thread is being changed completely in #3324. Lock-free queues are used instead of looping over the clients and checking each one if their IO is finished.
These lines are being deleted: https://github.com/valkey-io/valkey/pull/3324/changes#diff-252bce0cc340542712f0c1adf62e9035ea47a4a064321fbf40ec3dd4b814aaf2L6378
I think we can merge this fix first, so we can backport it to 9.0 and 8.x. (When was RDMA introduced?) But it might need to be rewritten later when #3324 gets merged.
**src/networking.c** (outdated)

```c
/* Track the client so we can drain any leftover buffered data even if no
 * further network events arrive. */
listAddNodeTail(handled_clients, c);
```
I have some UAF concerns with storing direct client pointers. Is it possible that the client is already dead/freed but still gets added to the list?
On line 6448 we call processPendingCommandAndInputBuffer, which calls processCommandAndResetClient if the client is ready for execution.
The description of this function says:

```c
/* This function calls processCommand(), but also performs a few sub tasks
 * for the client that are useful in that context:
 *
 * 1. It sets the current client to the client 'c'.
 * 2. calls commandProcessed() if the command was handled.
 *
 * The function returns C_ERR in case the client was freed as a side effect
 * of processing the command, otherwise C_OK is returned. */
```
So is it possible that, in case of an error, the client is already freed and we still add the dead client to the list?
Please correct me if I am missing something.
Thanks! It's a problem.
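The hazard discussed above can be sketched with a toy id-indirection registry (all names here are hypothetical stand-ins; the real code resolves client ids via `lookupClientByID`): instead of keeping raw `client *` pointers that may dangle after the client is freed mid-batch, store the numeric id and re-resolve it afterwards, so a freed client simply fails the lookup instead of producing a use-after-free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy registry standing in for the server's client index:
 * maps a client id to a live client, or NULL once the client is freed. */
#define MAX_CLIENTS 8

typedef struct client {
    uint64_t id;
} client;

static client *registry[MAX_CLIENTS];

static void register_client(client *c) { registry[c->id] = c; }
static void free_client(client *c) { registry[c->id] = NULL; }

/* Re-resolving by id instead of dereferencing a stored pointer:
 * returns NULL (instead of a dangling pointer) if the client died. */
static client *lookup_client_by_id(uint64_t id) {
    return (id < MAX_CLIENTS) ? registry[id] : NULL;
}
```

The trade-off, raised later in this thread, is the extra lookup cost per handled client.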
Benchmark ran on this commit. Benchmark Comparison: HEAD vs HEAD (averaged), rps metrics. Run Summary:

Statistical Notes:
Note: Values with (n=X, σ=Y, CV=Z%, CI99%=±W%, PI99%=±V%) indicate averages from X runs with standard deviation Y, coefficient of variation Z%, 99% confidence interval margin of error ±W% of the mean, and 99% prediction interval margin of error ±V% of the mean. CI bounds [A, B] and PI bounds [C, D] show the actual interval ranges.

Configuration:
There is a crash in this job: https://github.com/valkey-io/valkey/actions/runs/22915719396/job/66500408197?pr=3335#step:9:25952 Stack trace (main thread):
Signed-off-by: Ada-Church-Closure <[email protected]>
Force-pushed from 5642a80 to 0dabb3d (compare)
**src/networking.c** (outdated)

```c
listRewind(handled_clients_ids, &handled_li);
while ((handled_ln = listNext(&handled_li))) {
    uint64_t id = (uint64_t)(uintptr_t)listNodeValue(handled_ln);
    client *c = lookupClientByID(id);
```
`client *c = lookupClientByID(id);` brings more overhead; I'm not sure about the performance impact.
I have another choice:
```c
int processIOThreadsReadDone(void) {
    ...
    int ret = addCommandToBatchAndProcessIfFull(c);
    /* If the command was not added to the commands batch, process it immediately */
    if (ret == C_ERR) {
        if (processPendingCommandAndInputBuffer(c) == C_OK) {
            beforeNextClient(c);
            listAddNodeTail(handled_clients, c); /* NEW CODE */
        }
    }
}
```
And modify processClientsCommandsBatch:

```c
void processClientsCommandsBatch(list *handled_clients) {
    ...
    for (size_t i = 0; i < batch->client_count; i++) {
        client *c = batch->clients[i];
        if (c == NULL) continue;
        /* Set the client to NULL immediately to avoid accessing it again
         * recursively when ProcessingEventsWhileBlocked. */
        batch->clients[i] = NULL;
        batch->executed_commands++;
        if (processPendingCommandAndInputBuffer(c) != C_ERR) {
            beforeNextClient(c);
            if (handled_clients) listAddNodeTail(handled_clients, c); /* NEW CODE */
        }
    }
}
```
Then we can handle the healthy clients as in the previous version, which avoids lookupClientByID.
Force-pushed from d38811e to 3e6aa49 (compare)
Signed-off-by: Ada-Church-Closure <[email protected]>
Force-pushed from 3e6aa49 to df67b5d (compare)
Hi @sarthakaggarwal97, would you please take a look at the latest version? Hi @zuiderkwast, would you please retry the test job to reproduce the crash?
I used a shell script to test this:

```bash
#!/bin/bash
# ==============================================================================
# Valkey Chaos Testing Script: Mixed TCP & RDMA workload with random disconnections
# Purpose: Verify the robustness of the I/O threads state machine and ensure
#          no Use-After-Free (UAF) or Lost Wakeups occur during heavy batching.
# ==============================================================================

# 1: Unlock memory limits for RDMA Memory Registration (MR)
echo "[INFO] Unlocking memory limits for RDMA..."
sudo prlimit --memlock=unlimited --pid $$

HOST="10.0.0.1"
PORT="6379"

# Aggressive parameters to maximize pipeline depth and force the batching logic.
# Adjust -P (pipeline) and -c (connections) to push your specific hardware to the limit.
COMMON_ARGS="-h $HOST -p $PORT -d 256 --threads 16 -c 200 -P 64 -n 50000000 -t get"

echo "[INFO] Starting 4 TCP benchmark processes in the background..."
TCP_PIDS=()
for i in {1..4}; do
    ./src/valkey-benchmark $COMMON_ARGS > /dev/null 2>&1 &
    TCP_PIDS+=($!)
done
echo "[INFO] TCP benchmarks started. PIDs: ${TCP_PIDS[*]}"

echo "[INFO] Starting 4 RDMA benchmark processes in the background..."
RDMA_PIDS=()
for i in {1..4}; do
    ./src/valkey-benchmark $COMMON_ARGS --rdma > /dev/null 2>&1 &
    RDMA_PIDS+=($!)
done
echo "[INFO] RDMA benchmarks started. PIDs: ${RDMA_PIDS[*]}"

echo "[INFO] Letting the workloads saturate CPU and I/O threads (waiting 4 seconds)..."
sleep 4

echo "[INFO] Chaos initiated: sending SIGINT (Ctrl+C) to random clients..."
kill -INT "${RDMA_PIDS[0]}" "${RDMA_PIDS[1]}"
kill -INT "${TCP_PIDS[0]}"
echo "[INFO] Killed RDMA PIDs: ${RDMA_PIDS[0]}, ${RDMA_PIDS[1]}"
echo "[INFO] Killed TCP PID: ${TCP_PIDS[0]}"
echo "[INFO] OS is now sending RST packets. Valkey server should survive without Segfault/UAF."

echo "[INFO] Waiting for the surviving benchmark processes to complete naturally..."
wait
echo "[SUCCESS] Chaos test completed! If the Valkey server is still running and responsive, the fix is rock solid."
```

and the server works fine:

```
~/Project/valkey fix-rdma-io-threads* 1m 23s
❯ sudo ./chaos_test.sh
[INFO] Unlocking memory limits for RDMA...
[INFO] Starting 4 TCP benchmark processes in the background...
[INFO] TCP benchmarks started. PIDs: 19370 19371 19372 19373
[INFO] Starting 4 RDMA benchmark processes in the background...
[INFO] RDMA benchmarks started. PIDs: 19374 19375 19376 19377
[INFO] Letting the workloads saturate CPU and I/O threads (waiting 4 seconds)...
[INFO] Chaos initiated: sending SIGINT (Ctrl+C) to random clients...
[INFO] Killed RDMA PIDs: 19374, 19375
[INFO] Killed TCP PID: 19370
[INFO] OS is now sending RST packets. Valkey server should survive without Segfault/UAF.
[INFO] Waiting for the surviving benchmark processes to complete naturally...
[SUCCESS] Chaos test completed! If the Valkey server is still running and responsive, the fix is rock solid.
```
Fixes #3112
When using RDMA with I/O threads and pipelining (e.g., `-P 32`), the server crashes due to a `c->cmd_queue.len == 0` assertion or hangs indefinitely (lost wakeup).

1. The Problem

When using RDMA (`--rdma`) with I/O threads and high pipelining (e.g., `-P 32`), the server suffers from two critical issues:

- Crash: Assertion `c->cmd_queue.len == 0` failed in networking.c.
- Hang (Lost Wakeup): The server completely stalls under heavy pipeline loads because leftover data in `querybuf` is not parsed, and RDMA's edge-triggered CQ does not fire again for already-buffered data.
2. Root Cause & The Fix
In `processIOThreadsReadDone`, the execution sequence conflicted with RDMA's synchronous and edge-triggered nature:

Fixing the Crash: Previously, `connUpdateState(c->conn)` was called before the parsed commands were executed by the batch. For RDMA, this synchronously triggers the CQ, re-entering the parse logic while `cmd_queue` is heavily populated, breaking the assert. Fix: we now defer `connUpdateState` until after `processClientsCommandsBatch()`.

Fixing the Hang: I/O threads may leave unparsed residual data in `querybuf` due to pipeline limits. If we just go back to the event loop, RDMA will not wake us up again. Fix: we track handled clients in a temporary list. After batch execution, we synchronously drain any residual `querybuf` using `processPendingCommandAndInputBuffer(c)` before calling `connUpdateState`.
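The lost-wakeup half of this is the classic edge-triggered contract: an edge-triggered source notifies once per arrival, never for data that is already buffered. A toy model (all names hypothetical, not Valkey code) showing why a handler that stops at its batch limit strands data unless it drains synchronously before returning:

```c
#include <assert.h>

/* Toy edge-triggered source: an event fires only when new data arrives,
 * never for data that is already buffered. */
typedef struct {
    int buffered;   /* items sitting in the query buffer */
    int processed;  /* total work completed */
} toy_conn;

/* Deliver new data: this is the only moment an edge event fires. */
static void arrive(toy_conn *c, int n, void (*handler)(toy_conn *)) {
    c->buffered += n;
    handler(c); /* the single edge notification */
}

enum { BATCH_LIMIT = 4 };

/* Buggy handler: respects the batch limit, then returns to the event
 * loop, silently stranding whatever is still buffered (lost wakeup). */
static void handle_partial(toy_conn *c) {
    int done = 0;
    while (c->buffered > 0 && done < BATCH_LIMIT) {
        c->buffered--; c->processed++; done++;
    }
}

/* Fixed handler: after the batch, keep draining synchronously until the
 * buffer is empty, since no further edge will fire for buffered data. */
static void handle_drain(toy_conn *c) {
    handle_partial(c);
    while (c->buffered > 0) { c->buffered--; c->processed++; }
}
```

With 10 items arriving at once, the buggy handler leaves 6 buffered forever, which is the stall the fix's residual-`querybuf` drain prevents.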
3. Impact on TCP / TLS
This change is 100% safe for standard TCP/TLS connections.
For TCP, connUpdateState simply modifies the epoll registration asynchronously, so deferring it past the batch execution is a functional no-op. Eagerly draining the querybuf synchronously also slightly reduces epoll round-trips for TCP pipelines.
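The re-entrancy half of the ordering argument can be modelled the same way (a toy sketch with hypothetical names, not the real functions): for an RDMA-like connection the state update synchronously re-enters the parse path, so it must run only after the command queue is drained.

```c
#include <assert.h>

/* Toy model of the re-entrancy crash: parsing asserts the command
 * queue is empty, mirroring the real `c->cmd_queue.len == 0` check. */
typedef struct {
    int cmd_queue_len;
    int assert_failed;
} toy_client;

static void parse_more(toy_client *c) {
    if (c->cmd_queue_len != 0) { c->assert_failed = 1; return; }
    c->cmd_queue_len += 2; /* parse some new commands */
}

/* RDMA-like state update: polls the CQ synchronously, re-entering parsing. */
static void conn_update_state(toy_client *c) { parse_more(c); }

static void drain_queue(toy_client *c) { c->cmd_queue_len = 0; }

/* Buggy order: update connection state while the queue is populated. */
static int read_done_buggy(toy_client *c) {
    conn_update_state(c);
    drain_queue(c);
    return c->assert_failed;
}

/* Fixed order: drain the queue first, then update the state. */
static int read_done_fixed(toy_client *c) {
    drain_queue(c);
    conn_update_state(c);
    return c->assert_failed;
}
```

For a TCP-like connection, `conn_update_state` would merely adjust poll registration with no synchronous callback, which is why deferring it is a functional no-op there.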
4. How to Reproduce & Verify (Linux Soft-RoCE)
You can easily verify this using a dummy interface and RXE, without needing physical RDMA hardware:
```
sudo prlimit --memlock=unlimited --pid $$
./src/valkey-benchmark -h 10.0.0.1 -p 6379 -d 256 --threads 16 -c 100 -P 32 -n 20000000 -t get --rdma
```

Before this patch: the benchmark crashes or hangs immediately.
After this patch: it completes 20,000,000 requests flawlessly at peak QPS.