
Fix RDMA re-entrancy assertion and lost wakeup deadlocks with I/O threads#3335

Open
Ada-Church-Closure wants to merge 4 commits into valkey-io:unstable from Ada-Church-Closure:fix-rdma-io-threads

Conversation

@Ada-Church-Closure
Contributor

Fixes #3112
When using RDMA with I/O threads and pipelining (e.g., `-P 32`), the server crashes due to a `c->cmd_queue.len == 0` assertion or hangs indefinitely (lost wakeup).

1. The Problem

When using RDMA (--rdma) with I/O threads and high pipelining (e.g., -P 32), the server suffers from two critical issues:

  • Crash: Assertion `c->cmd_queue.len == 0` failed in `networking.c`.

  • Hang (lost wakeup): The server stalls completely under heavy pipeline load because leftover data in `querybuf` is never parsed, and RDMA's edge-triggered CQ does not fire again for data that is already buffered.

2. Root Cause & The Fix

In `processIOThreadsReadDone`, the execution sequence conflicted with RDMA's synchronous, edge-triggered nature:

  • Fixing the crash: Previously, `connUpdateState(c->conn)` was called before the parsed commands were executed by the batch. For RDMA this synchronously triggers the CQ, re-entering the parse logic while `cmd_queue` is still populated, which breaks the assertion. Fix: defer `connUpdateState` until after `processClientsCommandsBatch()`.

  • Fixing the hang: I/O threads may leave unparsed residual data in `querybuf` due to pipeline limits. If we simply return to the event loop, RDMA will not wake us up again. Fix: track handled clients in a temporary list; after batch execution, synchronously drain any residual `querybuf` with `processPendingCommandAndInputBuffer(c)` before calling `connUpdateState`.

3. Impact on TCP / TLS

This change is 100% safe for standard TCP/TLS connections.
For TCP, `connUpdateState` simply modifies the epoll registration asynchronously, so deferring it past batch execution is a functional no-op. Eagerly draining `querybuf` synchronously also slightly reduces epoll round-trips for TCP pipelines.

4. How to Reproduce & Verify (Linux Soft-RoCE)

You can easily verify this using a dummy interface and RXE, without needing physical RDMA hardware:

  • Server:
sudo prlimit --memlock=unlimited --pid $$
sudo modprobe dummy
sudo ip link add dummy0 type dummy
sudo ip link set dummy0 up
sudo ip addr add 10.0.0.1/24 dev dummy0
sudo rdma link add rxe_dummy type rxe netdev dummy0
# compile and run
make BUILD_RDMA=yes USE_FAST_FLOAT=yes
sudo ./src/valkey-server valkey.conf --rdma-bind 10.0.0.1 --rdma-port 6379
  • Client:
sudo prlimit --memlock=unlimited --pid $$
./src/valkey-benchmark -h 10.0.0.1 -p 6379 -d 256 --threads 16 -c 100 -P 32 -n 20000000 -t get --rdma

Before this patch, the benchmark crashes or hangs immediately.
After this patch, it completes 20,000,000 requests flawlessly at peak QPS:

~/Project/valkey fix-rdma-io-threads* 15s ❯ ./src/valkey-benchmark -h 10.0.0.1 -p 6379 -d 256 --threads 5 -r 100000 -n 20000000 -c 30 -P 32 -t get --rdma
====== GET ======                                                                         
  20000000 requests completed in 10.28 seconds
  30 parallel clients
  256 bytes payload
  keep alive: 1
  host configuration "save": 
  host configuration "appendonly": no
  multi-thread: yes
  threads: 5

Latency by percentile distribution:
0.000% <= 0.015 milliseconds (cumulative count 160)
50.000% <= 0.463 milliseconds (cumulative count 10938848)
75.000% <= 0.487 milliseconds (cumulative count 15378240)
87.500% <= 0.519 milliseconds (cumulative count 17661824)
93.750% <= 0.583 milliseconds (cumulative count 18812384)
96.875% <= 0.695 milliseconds (cumulative count 19375456)
98.438% <= 0.863 milliseconds (cumulative count 19698176)
99.219% <= 1.015 milliseconds (cumulative count 19846016)
99.609% <= 1.167 milliseconds (cumulative count 19924960)
99.805% <= 1.255 milliseconds (cumulative count 19962496)
99.902% <= 1.359 milliseconds (cumulative count 19981312)
99.951% <= 1.479 milliseconds (cumulative count 19990336)
99.976% <= 1.687 milliseconds (cumulative count 19995136)
99.988% <= 1.959 milliseconds (cumulative count 19997568)
99.994% <= 2.215 milliseconds (cumulative count 19998816)
99.997% <= 3.007 milliseconds (cumulative count 19999488)
99.998% <= 3.167 milliseconds (cumulative count 19999712)
99.999% <= 3.711 milliseconds (cumulative count 19999872)
100.000% <= 7.807 milliseconds (cumulative count 19999936)
100.000% <= 8.879 milliseconds (cumulative count 19999968)
100.000% <= 9.911 milliseconds (cumulative count 20000000)
100.000% <= 9.911 milliseconds (cumulative count 20000000)

Cumulative distribution of latencies:
0.119% <= 0.103 milliseconds (cumulative count 23712)
0.475% <= 0.207 milliseconds (cumulative count 94976)
2.051% <= 0.303 milliseconds (cumulative count 410112)
12.146% <= 0.407 milliseconds (cumulative count 2429280)
84.272% <= 0.503 milliseconds (cumulative count 16854368)
94.958% <= 0.607 milliseconds (cumulative count 18991552)
96.972% <= 0.703 milliseconds (cumulative count 19394304)
98.019% <= 0.807 milliseconds (cumulative count 19603808)
98.768% <= 0.903 milliseconds (cumulative count 19753600)
99.201% <= 1.007 milliseconds (cumulative count 19840160)
99.464% <= 1.103 milliseconds (cumulative count 19892832)
99.718% <= 1.207 milliseconds (cumulative count 19943552)
99.868% <= 1.303 milliseconds (cumulative count 19973536)
99.929% <= 1.407 milliseconds (cumulative count 19985888)
99.956% <= 1.503 milliseconds (cumulative count 19991264)
99.970% <= 1.607 milliseconds (cumulative count 19994080)
99.977% <= 1.703 milliseconds (cumulative count 19995392)
99.983% <= 1.807 milliseconds (cumulative count 19996544)
99.986% <= 1.903 milliseconds (cumulative count 19997248)
99.990% <= 2.007 milliseconds (cumulative count 19997952)
99.991% <= 2.103 milliseconds (cumulative count 19998176)
99.998% <= 3.103 milliseconds (cumulative count 19999584)
99.999% <= 4.103 milliseconds (cumulative count 19999872)
100.000% <= 5.103 milliseconds (cumulative count 19999904)
100.000% <= 8.103 milliseconds (cumulative count 19999936)
100.000% <= 9.103 milliseconds (cumulative count 19999968)
100.000% <= 10.103 milliseconds (cumulative count 20000000)

Summary:
  throughput summary: 1945903.88 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.472     0.008     0.463     0.615     0.959     9.911

~/Project/valkey fix-rdma-io-threads* 10s ❯ ./src/valkey-benchmark -h 10.0.0.1 -p 6379 -d 256 --threads 5 -r 100000 -n 20000000 -c 30 -P 256 -t get --rdma
====== GET ======                                                                         
  20000000 requests completed in 8.03 seconds
  30 parallel clients
  256 bytes payload
  keep alive: 1
  host configuration "save": 
  host configuration "appendonly": no
  multi-thread: yes
  threads: 5

Latency by percentile distribution:
0.000% <= 0.031 milliseconds (cumulative count 768)
50.000% <= 2.687 milliseconds (cumulative count 10050816)
75.000% <= 2.991 milliseconds (cumulative count 15007232)
87.500% <= 4.287 milliseconds (cumulative count 17501440)
93.750% <= 5.231 milliseconds (cumulative count 18751744)
96.875% <= 5.759 milliseconds (cumulative count 19377664)
98.438% <= 6.367 milliseconds (cumulative count 19688448)
99.219% <= 7.159 milliseconds (cumulative count 19844864)
99.609% <= 7.951 milliseconds (cumulative count 19922432)
99.805% <= 8.775 milliseconds (cumulative count 19961088)
99.902% <= 9.431 milliseconds (cumulative count 19980800)
99.951% <= 10.303 milliseconds (cumulative count 19990528)
99.976% <= 10.663 milliseconds (cumulative count 19995136)
99.988% <= 11.591 milliseconds (cumulative count 19997696)
99.994% <= 12.647 milliseconds (cumulative count 19998976)
99.997% <= 12.751 milliseconds (cumulative count 19999488)
99.998% <= 12.839 milliseconds (cumulative count 20000000)
100.000% <= 12.839 milliseconds (cumulative count 20000000)

Cumulative distribution of latencies:
0.032% <= 0.103 milliseconds (cumulative count 6400)
0.050% <= 0.207 milliseconds (cumulative count 9984)
0.073% <= 0.303 milliseconds (cumulative count 14592)
0.087% <= 0.407 milliseconds (cumulative count 17408)
0.113% <= 0.503 milliseconds (cumulative count 22528)
0.134% <= 0.607 milliseconds (cumulative count 26880)
0.160% <= 0.703 milliseconds (cumulative count 32000)
0.188% <= 0.807 milliseconds (cumulative count 37632)
0.219% <= 0.903 milliseconds (cumulative count 43776)
0.302% <= 1.007 milliseconds (cumulative count 60416)
0.411% <= 1.103 milliseconds (cumulative count 82176)
0.580% <= 1.207 milliseconds (cumulative count 115968)
0.774% <= 1.303 milliseconds (cumulative count 154880)
1.076% <= 1.407 milliseconds (cumulative count 215296)
1.509% <= 1.503 milliseconds (cumulative count 301824)
2.172% <= 1.607 milliseconds (cumulative count 434432)
2.935% <= 1.703 milliseconds (cumulative count 587008)
4.189% <= 1.807 milliseconds (cumulative count 837888)
5.833% <= 1.903 milliseconds (cumulative count 1166592)
8.218% <= 2.007 milliseconds (cumulative count 1643520)
11.154% <= 2.103 milliseconds (cumulative count 2230784)
78.685% <= 3.103 milliseconds (cumulative count 15737088)
86.468% <= 4.103 milliseconds (cumulative count 17293568)
92.955% <= 5.103 milliseconds (cumulative count 18590976)
98.003% <= 6.103 milliseconds (cumulative count 19600640)
99.178% <= 7.103 milliseconds (cumulative count 19835648)
99.656% <= 8.103 milliseconds (cumulative count 19931136)
99.859% <= 9.103 milliseconds (cumulative count 19971840)
99.946% <= 10.103 milliseconds (cumulative count 19989248)
99.985% <= 11.103 milliseconds (cumulative count 19996928)
99.990% <= 12.103 milliseconds (cumulative count 19997952)
100.000% <= 13.103 milliseconds (cumulative count 20000000)

Summary:
  throughput summary: 2490349.75 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.957     0.024     2.687     5.439     6.895    12.839

When using RDMA with I/O threads and pipelining (e.g., -P 32), the
server crashes due to a `c->cmd_queue.len == 0` assertion or hangs
indefinitely (lost wakeup).

Root causes:
1. Crash: `connUpdateState(c->conn)` was called before the parsed
   commands were executed. For RDMA, this synchronously triggers the
   CQ, re-entering parsing while `cmd_queue` is heavily populated.
2. Hang: I/O threads leave unparsed residual data in `querybuf` due
   to batch limits. Because RDMA CQ is edge-triggered, returning to
   the event loop without draining the buffer causes a lost wakeup.

Fix:
- Track handled clients in `processIOThreadsReadDone` using a list.
- After draining the command queue via `processClientsCommandsBatch`,
  synchronously drain any residual `querybuf` using
  `processPendingCommandAndInputBuffer`.
- Safely call `connUpdateState` only after the command queue is fully
  empty to prevent re-entrancy.

This change strictly respects the edge-triggered nature of RDMA while
remaining 100% safe for standard TCP/TLS connections.

Fixes valkey-io#3112

Signed-off-by: Ada-Church-Closure <[email protected]>
@codecov

codecov bot commented Mar 9, 2026

Codecov Report

❌ Patch coverage is 85.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.26%. Comparing base (fc217fc) to head (df67b5d).
⚠️ Report is 32 commits behind head on unstable.

Files with missing lines Patch % Lines
src/networking.c 78.57% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3335      +/-   ##
============================================
- Coverage     75.05%   74.26%   -0.79%     
============================================
  Files           129      130       +1     
  Lines         71553    72730    +1177     
============================================
+ Hits          53705    54016     +311     
- Misses        17848    18714     +866     
Files with missing lines Coverage Δ
src/memory_prefetch.c 80.88% <100.00%> (+4.01%) ⬆️
src/networking.c 91.23% <78.57%> (-0.55%) ⬇️

... and 39 files with indirect coverage changes


@zuiderkwast zuiderkwast requested a review from pizhenwei March 9, 2026 23:12
Contributor

@pizhenwei pizhenwei left a comment

Thank you for your contribution and for discovering the root cause of this issue. Overall, the solution LGTM. I have only a few minor suggestions, and welcome further discussion.

Hi @zuiderkwast please correct me if I misunderstand anything.

Contributor

@pizhenwei pizhenwei left a comment

LGTM, thanks!

Co-authored-by: zhenwei pi [email protected]
Signed-off-by: zhenwei pi [email protected]

Contributor

@zuiderkwast zuiderkwast left a comment

Thanks for finding the root cause of the issue and solving it!

I don't know if the extra linked list affects performance in any way, with or without pipelining. I kicked off benchmark tests; let's wait and see. We could perhaps switch to a fifo.{c,h} queue instead of a linked list if that turns out faster.

* for edge-triggered transports like RDMA). */
list *handled_clients = listCreate();

listNode *next = listFirst(server.clients_pending_io_read);
Contributor

This communication between IO threads and the main thread is being changed completely in #3324. Lock-free queues are used instead of looping over the clients and checking whether each one's IO is finished.

These lines are being deleted: https://github.com/valkey-io/valkey/pull/3324/changes#diff-252bce0cc340542712f0c1adf62e9035ea47a4a064321fbf40ec3dd4b814aaf2L6378

I think we can merge this fix first, so we can backport it to 9.0 and 8.x. (When was RDMA introduced?) But it might need to be rewritten later when #3324 gets merged.

src/networking.c Outdated

/* Track the client so we can drain any leftover buffered data even if no
* further network events arrive. */
listAddNodeTail(handled_clients, c);
Contributor

@sarthakaggarwal97 sarthakaggarwal97 Mar 10, 2026

I have some UAF concerns with storing direct client pointers. Is it possible that the client is already dead/freed but still gets added to the list?

On line 6448 we call processPendingCommandAndInputBuffer, which calls processCommandAndResetClient if the client is ready for execution.

The description of this function says -

/* This function calls processCommand(), but also performs a few sub tasks
 * for the client that are useful in that context:
 *
 * 1. It sets the current client to the client 'c'.
 * 2. calls commandProcessed() if the command was handled.
 *
 * The function returns C_ERR in case the client was freed as a side effect
 * of processing the command, otherwise C_OK is returned. */

So is it possible that, in case of an error, the client is already freed, yet we still add the dead client to the list?

Please correct me if I am missing something.

Contributor Author

Thanks! It's a problem.

@zuiderkwast zuiderkwast added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Mar 10, 2026
@github-actions

Benchmark ran on this commit: d8f8b39

Benchmark Comparison: HEAD vs HEAD (averaged) - rps metrics

Run Summary:

  • HEAD: 80 total runs, 16 configurations (avg 5.00 runs per config)
  • HEAD: 80 total runs, 16 configurations (avg 5.00 runs per config)

Statistical Notes:

  • CI99%: 99% Confidence Interval - range where the true population mean is likely to fall
  • PI99%: 99% Prediction Interval - range where a single future observation is likely to fall
  • CV: Coefficient of Variation - relative variability (σ/μ × 100%)

Note: Values with (n=X, σ=Y, CV=Z%, CI99%=±W%, PI99%=±V%) indicate averages from X runs with standard deviation Y, coefficient of variation Z%, 99% confidence interval margin of error ±W% of the mean, and 99% prediction interval margin of error ±V% of the mean. CI bounds [A, B] and PI bounds [C, D] show the actual interval ranges.

Configuration:

  • architecture: aarch64
  • benchmark_mode: duration
  • clients: 1600
  • cluster_mode: False
  • data_size: 16
  • duration: 180
  • tls: False
  • valkey_benchmark_threads: 90
  • warmup: 30
Command Metric Pipeline io_threads HEAD HEAD Diff % Change
GET rps 1 1 229462.798 (n=5, σ=918.117, CV=0.40%, CI99%=±0.824%, PI99%=±2.018%, CI[227572.383, 231353.213], PI[224832.246, 234093.350]) 227546.482 (n=5, σ=2442.552, CV=1.07%, CI99%=±2.210%, PI99%=±5.414%, CI[222517.235, 232575.729], PI[215227.392, 239865.572]) -1916.316 -0.835%
GET rps 1 9 1509044.650 (n=5, σ=12333.961, CV=0.82%, CI99%=±1.683%, PI99%=±4.122%, CI[1483648.854, 1534440.446], PI[1446837.908, 1571251.392]) 1508721.674 (n=5, σ=16980.628, CV=1.13%, CI99%=±2.317%, PI99%=±5.676%, CI[1473758.326, 1543685.022], PI[1423079.311, 1594364.037]) -322.976 -0.021%
GET rps 10 1 1223575.124 (n=5, σ=2088.586, CV=0.17%, CI99%=±0.351%, PI99%=±0.861%, CI[1219274.697, 1227875.551], PI[1213041.273, 1234108.975]) 1216917.498 (n=5, σ=7694.875, CV=0.63%, CI99%=±1.302%, PI99%=±3.189%, CI[1201073.645, 1232761.351], PI[1178108.142, 1255726.854]) -6657.626 -0.544%
GET rps 10 9 2777632.950 (n=5, σ=26161.368, CV=0.94%, CI99%=±1.939%, PI99%=±4.750%, CI[2723766.332, 2831499.568], PI[2645687.222, 2909578.678]) 2789579.700 (n=5, σ=26379.201, CV=0.95%, CI99%=±1.947%, PI99%=±4.769%, CI[2735264.560, 2843894.840], PI[2656535.323, 2922624.077]) 11946.750 +0.430%
SET rps 1 1 221214.284 (n=5, σ=559.814, CV=0.25%, CI99%=±0.521%, PI99%=±1.276%, CI[220061.619, 222366.949], PI[218390.842, 224037.726]) 221052.092 (n=5, σ=1332.809, CV=0.60%, CI99%=±1.241%, PI99%=±3.041%, CI[218307.819, 223796.365], PI[214330.023, 227774.161]) -162.192 -0.073%
SET rps 1 9 1440573.700 (n=5, σ=12614.608, CV=0.88%, CI99%=±1.803%, PI99%=±4.416%, CI[1414600.047, 1466547.353], PI[1376951.503, 1504195.897]) 1449371.024 (n=5, σ=8749.826, CV=0.60%, CI99%=±1.243%, PI99%=±3.045%, CI[1431355.011, 1467387.037], PI[1405240.985, 1493501.063]) 8797.324 +0.611%
SET rps 10 1 1040125.364 (n=5, σ=6583.199, CV=0.63%, CI99%=±1.303%, PI99%=±3.192%, CI[1026570.467, 1053680.261], PI[1006922.783, 1073327.945]) 1044384.926 (n=5, σ=5529.125, CV=0.53%, CI99%=±1.090%, PI99%=±2.670%, CI[1033000.382, 1055769.470], PI[1016498.602, 1072271.250]) 4259.562 +0.410%
SET rps 10 9 1950205.100 (n=5, σ=23290.046, CV=1.19%, CI99%=±2.459%, PI99%=±6.023%, CI[1902250.573, 1998159.627], PI[1832740.978, 2067669.222]) 1946232.524 (n=5, σ=14534.463, CV=0.75%, CI99%=±1.538%, PI99%=±3.767%, CI[1916305.865, 1976159.183], PI[1872927.479, 2019537.569]) -3972.576 -0.204%

Configuration:

  • architecture: aarch64
  • benchmark_mode: duration
  • clients: 1600
  • cluster_mode: False
  • data_size: 96
  • duration: 180
  • tls: False
  • valkey_benchmark_threads: 90
  • warmup: 30
Command Metric Pipeline io_threads HEAD HEAD Diff % Change
GET rps 1 1 219057.912 (n=5, σ=2581.004, CV=1.18%, CI99%=±2.426%, PI99%=±5.942%, CI[213743.589, 224372.235], PI[206040.532, 232075.292]) 220683.866 (n=5, σ=494.918, CV=0.22%, CI99%=±0.462%, PI99%=±1.131%, CI[219664.822, 221702.910], PI[218187.728, 223180.004]) 1625.954 +0.742%
GET rps 1 9 1469495.452 (n=5, σ=9457.135, CV=0.64%, CI99%=±1.325%, PI99%=±3.246%, CI[1450023.081, 1488967.823], PI[1421798.078, 1517192.826]) 1456182.826 (n=5, σ=8965.871, CV=0.62%, CI99%=±1.268%, PI99%=±3.105%, CI[1437721.974, 1474643.678], PI[1410963.158, 1501402.494]) -13312.626 -0.906%
GET rps 10 1 1159126.172 (n=5, σ=5735.214, CV=0.49%, CI99%=±1.019%, PI99%=±2.495%, CI[1147317.287, 1170935.057], PI[1130200.429, 1188051.915]) 1154969.226 (n=5, σ=9868.334, CV=0.85%, CI99%=±1.759%, PI99%=±4.309%, CI[1134650.191, 1175288.261], PI[1105197.957, 1204740.495]) -4156.946 -0.359%
GET rps 10 9 2232633.750 (n=5, σ=20302.271, CV=0.91%, CI99%=±1.872%, PI99%=±4.586%, CI[2190831.094, 2274436.406], PI[2130238.573, 2335028.927]) 2237195.650 (n=5, σ=17012.725, CV=0.76%, CI99%=±1.566%, PI99%=±3.835%, CI[2202166.215, 2272225.085], PI[2151391.407, 2322999.893]) 4561.900 +0.204%
SET rps 1 1 211289.888 (n=5, σ=1706.436, CV=0.81%, CI99%=±1.663%, PI99%=±4.073%, CI[207776.313, 214803.463], PI[202683.422, 219896.354]) 212113.028 (n=5, σ=1026.276, CV=0.48%, CI99%=±0.996%, PI99%=±2.440%, CI[209999.912, 214226.144], PI[206936.971, 217289.085]) 823.140 +0.390%
SET rps 1 9 1435656.950 (n=5, σ=18259.852, CV=1.27%, CI99%=±2.619%, PI99%=±6.415%, CI[1398059.662, 1473254.238], PI[1343562.779, 1527751.121]) 1431164.102 (n=5, σ=5554.419, CV=0.39%, CI99%=±0.799%, PI99%=±1.957%, CI[1419727.477, 1442600.727], PI[1403150.206, 1459177.998]) -4492.848 -0.313%
SET rps 10 1 1032731.478 (n=5, σ=3556.056, CV=0.34%, CI99%=±0.709%, PI99%=±1.737%, CI[1025409.510, 1040053.446], PI[1014796.391, 1050666.565]) 1033732.300 (n=5, σ=7318.891, CV=0.71%, CI99%=±1.458%, PI99%=±3.571%, CI[1018662.603, 1048801.997], PI[996819.232, 1070645.368]) 1000.822 +0.097%
SET rps 10 9 1813571.476 (n=5, σ=19437.983, CV=1.07%, CI99%=±2.207%, PI99%=±5.406%, CI[1773548.400, 1853594.552], PI[1715535.361, 1911607.591]) 1811598.374 (n=5, σ=36285.281, CV=2.00%, CI99%=±4.124%, PI99%=±10.102%, CI[1736886.479, 1886310.269], PI[1628592.353, 1994604.395]) -1973.102 -0.109%

@zuiderkwast
Contributor

There is a crash in this job: https://github.com/valkey-io/valkey/actions/runs/22915719396/job/66500408197?pr=3335#step:9:25952

13008:S 10 Mar 2026 18:34:27.909 # valkey 255.255.255 crashed by signal: 11, si_code: 1
13008:S 10 Mar 2026 18:34:27.909 # Accessing address: (nil)
13008:S 10 Mar 2026 18:34:27.909 # Crashed running the instruction at: 0x55e732c92065

Stack trace main thread:

13008 valkey-server *
#0 0x7f2907e45330 <unknown>
#1 0x55e732c92065 connUpdateState at /home/runner/work/valkey/valkey/src/connection.h:516
#2 0x55e732c92065 processIOThreadsReadDone at /home/runner/work/valkey/valkey/src/networking.c:6471
#3 0x55e732ced56e beforeSleep at /home/runner/work/valkey/valkey/src/server.c:1946
#4 0x55e732bafe6f aeProcessEvents at /home/runner/work/valkey/valkey/src/ae.c:428
#5 0x55e732bafe6f aeProcessEvents at /home/runner/work/valkey/valkey/src/ae.c:411
#6 0x55e732bafe6f aeMain at /home/runner/work/valkey/valkey/src/ae.c:543
#7 0x55e732b9e807 main at /home/runner/work/valkey/valkey/src/server.c:7689
#8 0x7f2907e2a1ca <unknown>
#9 0x7f2907e2a28b <unknown>
#10 0x55e732ba0095 <unknown>

Signed-off-by: Ada-Church-Closure <[email protected]>
src/networking.c Outdated
listRewind(handled_clients_ids, &handled_li);
while ((handled_ln = listNext(&handled_li))) {
uint64_t id = (uint64_t)(uintptr_t)listNodeValue(handled_ln);
client *c = lookupClientByID(id);
Contributor

`client *c = lookupClientByID(id);` adds overhead; I'm not sure about the performance impact.

I have another option:

int processIOThreadsReadDone(void) {
        ...
        int ret = addCommandToBatchAndProcessIfFull(c);
        /* If the command was not added to the commands batch, process it immediately */
        if (ret == C_ERR) {
            if (processPendingCommandAndInputBuffer(c) == C_OK) {
                 beforeNextClient(c);
                 listAddNodeTail(handled_clients, c);  /* NEW CODES */
            }
        }
}

And modify processClientsCommandsBatch:

void processClientsCommandsBatch(list *handled_clients) {
    ...

    for (size_t i = 0; i < batch->client_count; i++) {
        client *c = batch->clients[i];
        if (c == NULL) continue;

        /* Set the client to null immediately to avoid accessing it again recursively when ProcessingEventsWhileBlocked */
        batch->clients[i] = NULL;
        batch->executed_commands++;
        if (processPendingCommandAndInputBuffer(c) != C_ERR) {
               beforeNextClient(c);
               if (handled_clients) listAddNodeTail(handled_clients, c);  /* NEW CODES */
        }
    }
}

Then we can handle the healthy clients as in the previous version; this avoids the lookupClientByID call.

Signed-off-by: Ada-Church-Closure <[email protected]>
@pizhenwei
Contributor

Hi @sarthakaggarwal97, would you please take a look at the latest version?

Hi @zuiderkwast, would you please retry the test job to see whether the crash still reproduces?

@Ada-Church-Closure
Contributor Author

I used a shell script to test this:

#!/bin/bash

# ==============================================================================
# Valkey Chaos Testing Script: Mixed TCP & RDMA workload with random disconnections
# Purpose: Verify the robustness of the I/O threads state machine and ensure 
#          no Use-After-Free (UAF) or Lost Wakeups occur during heavy batching.
# ==============================================================================

# 1: Unlock memory limits for RDMA Memory Registration (MR)
echo "[INFO] Unlocking memory limits for RDMA..."
sudo prlimit --memlock=unlimited --pid $$

HOST="10.0.0.1"
PORT="6379"

# Aggressive parameters to maximize pipeline depth and force batching logic
# Adjust -P (pipeline) and -c (connections) to push your specific hardware to the limit.
COMMON_ARGS="-h $HOST -p $PORT -d 256 --threads 16 -c 200 -P 64 -n 50000000 -t get"

echo "[INFO] Starting 4 TCP benchmark processes in the background..."
TCP_PIDS=()
for i in {1..4}; do
    ./src/valkey-benchmark $COMMON_ARGS > /dev/null 2>&1 &
    TCP_PIDS+=($!)
done
echo "[INFO] TCP benchmarks started. PIDs: ${TCP_PIDS[*]}"

echo "[INFO] Starting 4 RDMA benchmark processes in the background..."
RDMA_PIDS=()
for i in {1..4}; do
    ./src/valkey-benchmark $COMMON_ARGS --rdma > /dev/null 2>&1 &
    RDMA_PIDS+=($!)
done
echo "[INFO] RDMA benchmarks started. PIDs: ${RDMA_PIDS[*]}"

echo "[INFO] Letting the workloads saturate CPU and I/O threads (waiting 4 seconds)..."
sleep 4

echo "[INFO]  Chaos initiated Sending SIGINT (Ctrl+C) to random clients..."
kill -INT "${RDMA_PIDS[0]}" "${RDMA_PIDS[1]}"
kill -INT "${TCP_PIDS[0]}"

echo "[INFO] Killed RDMA PIDs: ${RDMA_PIDS[0]}, ${RDMA_PIDS[1]}"
echo "[INFO] Killed TCP PID: ${TCP_PIDS[0]}"
echo "[INFO] OS is now sending RST packets. Valkey server should survive without Segfault/UAF."

echo "[INFO] Waiting for the surviving benchmark processes to complete naturally..."
wait

echo "[SUCCESS] Chaos test completed! If the Valkey server is still running and responsive, the fix is rock solid."

and the server works fine:

~/Project/valkey fix-rdma-io-threads* 1m 23s ❯ sudo ./chaos_test.sh
[INFO] Unlocking memory limits for RDMA...
[INFO] Starting 4 TCP benchmark processes in the background...
[INFO] TCP benchmarks started. PIDs: 19370 19371 19372 19373
[INFO] Starting 4 RDMA benchmark processes in the background...
[INFO] RDMA benchmarks started. PIDs: 19374 19375 19376 19377
[INFO] Letting the workloads saturate CPU and I/O threads (waiting 4 seconds)...
[INFO]  Chaos initiated Sending SIGINT (Ctrl+C) to random clients...
[INFO] Killed RDMA PIDs: 19374, 19375
[INFO] Killed TCP PID: 19370
[INFO] OS is now sending RST packets. Valkey server should survive without Segfault/UAF.
[INFO] Waiting for the surviving benchmark processes to complete naturally...
[SUCCESS] Chaos test completed! If the Valkey server is still running and responsive, the fix is rock solid.

Contributor

@pizhenwei pizhenwei left a comment

LGTM, thanks!


Labels

run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Successfully merging this pull request may close these issues.

[CRASH] Does Valkey over RDMA not support I/O threads?
