Fix EFA SRD robustness: selective signaling, ENOMEM retry, WriteWithImm warning by dmvevents · Pull Request #2721 · meta-pytorch/monarch

dmvevents · 2026-02-21T00:36:28Z

Summary

Three hardware-verified robustness fixes for the EFA SRD path introduced in #2638, based on testing across 7 transfer sizes (64B–1MB) on P4d instances with 4x EFA NICs:

Forward signaled parameter to rdmaxcel_efa_post_write/read instead of hardcoding IBV_SEND_SIGNALED on every operation. Prevents CQ overflow under burst workloads and allows callers to control signaling frequency.
Add ENOMEM retry with CQ drain in post_write and post_read. When the send queue fills faster than HW completes, ibv_wr_complete() returns ENOMEM. Drains CQ to free slots and retries. We hit this consistently with 256KB+ transfers at sustained throughput.
Emit warning when WriteWithImm is requested on EFA (op_type=3). EFA SRD does not support RDMA Write with Immediate Data — the previous code silently mapped it to plain Write with no indication to callers.

Depends on #2638.

Test plan

Validated ENOMEM retry on P4d with 256KB and 1MB transfers (4 channels, 50 iterations per size)
Verified selective signaling (every 16th + last) produces correct completions
Confirmed WriteWithImm warning prints to stderr on EFA hardware
RC/mlx5 path unchanged (only EFA functions modified)

…rch#2638) Summary: - Add EFA SRD queue pair support alongside existing mlx5/RC transport - EFA device auto-detection via efadv_query_device, with RdmaQpType::Auto selecting EFA vs mlx5dv based on hardware - EFA connect (INIT→RTR→RTS + address handle creation) and post operations (write/read via extended verbs, recv via ibv_post_recv) implemented in C - EFA-specific defaults: gid_index=0, max_sge=1, no RDMA atomics - Static linking of libefa.a from rdma-core to avoid conflicts with dynamic libibverbs - RC/mlx5 path is unchanged — EFA uses early returns and separate helper methods (efa_connect, post_op_efa) to keep it isolated - Explicit `is_efa` field on RdmaQueuePair instead of inferring from dv_qp==0 - Replace float-sum checksum in rdma_pingpong with xxhash for exact byte-level verification - Add --buffer_type flag (tensor, bytearray, memoryview) to rdma_pingpong to exercise non-tensor RDMA paths - Consolidate small utility tests in test_rdma.py into focused test functions Test Plan: - Built from clean clone on EFA-equipped AWS H100 nodes - rdma_pingpong.py passes all 4 sizes (10/100/500/1000 MB) × 3 iterations - All data and pong verifications PASS - Throughput: 2.5–8.0 GB/s across sizes - RC/mlx5 path unaffected (existing tests continue to work) - rdma_pingpong --buffer_type=bytearray and --buffer_type=memoryview both PASS - buck build fbcode//monarch/python/tests:test_rdma succeeds Using [moodist](https://github.com/facebookresearch/moodist)'s approach to EFA via ibverbs (instead of libfabric) significantly reduced the complexity of this. Reviewed By: zdevito Differential Revision: D93543408 Pulled By: cpuhrsch

…mm warning Three hardware-verified fixes for EFA SRD operations: 1. Forward `signaled` parameter to EFA post functions instead of hardcoding IBV_SEND_SIGNALED on every operation. This prevents CQ overflow under burst workloads and allows callers to control signaling frequency (e.g., signal every Nth op + last). 2. Add ENOMEM retry with CQ drain in post_write and post_read. When the send queue fills faster than HW completes, ibv_wr_complete() returns ENOMEM. Drain CQ to free slots and retry up to 1000 times before failing. Validated on P4d with 256KB+ transfers. 3. Emit warning when WriteWithImm is requested on EFA (op_type=3). EFA SRD does not support RDMA Write with Immediate Data; the previous code silently mapped it to plain Write.

meta-cla · 2026-02-21T00:36:33Z

Hi @dmvevents!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

cpuhrsch and others added 2 commits February 19, 2026 18:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Fix EFA SRD robustness: selective signaling, ENOMEM retry, WriteWithImm warning#2721

Fix EFA SRD robustness: selective signaling, ENOMEM retry, WriteWithImm warning#2721
dmvevents wants to merge 2 commits intometa-pytorch:mainfrom
dmvevents:efa-srd-robustness

dmvevents commented Feb 21, 2026

Uh oh!

meta-cla bot commented Feb 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

dmvevents commented Feb 21, 2026

Summary

Test plan

Uh oh!

meta-cla bot commented Feb 21, 2026

Action Required

Process

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants