Skip to content

Comments

Fix EFA SRD robustness: selective signaling, ENOMEM retry, WriteWithImm warning#2721

Open
dmvevents wants to merge 2 commits intometa-pytorch:mainfrom
dmvevents:efa-srd-robustness
Open

Fix EFA SRD robustness: selective signaling, ENOMEM retry, WriteWithImm warning#2721
dmvevents wants to merge 2 commits intometa-pytorch:mainfrom
dmvevents:efa-srd-robustness

Conversation

@dmvevents
Copy link

Summary

Three hardware-verified robustness fixes for the EFA SRD path introduced in #2638, based on testing across 7 transfer sizes (64B–1MB) on P4d instances with 4x EFA NICs:

  • Forward signaled parameter to rdmaxcel_efa_post_write/read instead of hardcoding IBV_SEND_SIGNALED on every operation. Prevents CQ overflow under burst workloads and allows callers to control signaling frequency.
  • Add ENOMEM retry with CQ drain in post_write and post_read. When the send queue fills faster than HW completes, ibv_wr_complete() returns ENOMEM. Drains CQ to free slots and retries. We hit this consistently with 256KB+ transfers at sustained throughput.
  • Emit warning when WriteWithImm is requested on EFA (op_type=3). EFA SRD does not support RDMA Write with Immediate Data — the previous code silently mapped it to plain Write with no indication to callers.

Depends on #2638.

Test plan

  • Validated ENOMEM retry on P4d with 256KB and 1MB transfers (4 channels, 50 iterations per size)
  • Verified selective signaling (every 16th + last) produces correct completions
  • Confirmed WriteWithImm warning prints to stderr on EFA hardware
  • RC/mlx5 path unchanged (only EFA functions modified)

cpuhrsch and others added 2 commits February 19, 2026 18:05
…rch#2638)

Summary:
- Add EFA SRD queue pair support alongside existing mlx5/RC transport
  - EFA device auto-detection via efadv_query_device, with RdmaQpType::Auto selecting EFA vs mlx5dv based on hardware
  - EFA connect (INIT→RTR→RTS + address handle creation) and post operations (write/read via extended verbs, recv via ibv_post_recv) implemented in C
  - EFA-specific defaults: gid_index=0, max_sge=1, no RDMA atomics
  - Static linking of libefa.a from rdma-core to avoid conflicts with dynamic libibverbs
  - RC/mlx5 path is unchanged — EFA uses early returns and separate helper methods (efa_connect, post_op_efa) to keep it isolated
  - Explicit `is_efa` field on RdmaQueuePair instead of inferring from dv_qp==0
  - Replace float-sum checksum in rdma_pingpong with xxhash for exact byte-level verification
  - Add --buffer_type flag (tensor, bytearray, memoryview) to rdma_pingpong to exercise non-tensor RDMA paths
  - Consolidate small utility tests in test_rdma.py into focused test functions


Test Plan:
- Built from clean clone on EFA-equipped AWS H100 nodes
  - rdma_pingpong.py passes all 4 sizes (10/100/500/1000 MB) × 3 iterations
  - All data and pong verifications PASS
  - Throughput: 2.5–8.0 GB/s across sizes
  - RC/mlx5 path unaffected (existing tests continue to work)
  - rdma_pingpong --buffer_type=bytearray and --buffer_type=memoryview both PASS
  - buck build fbcode//monarch/python/tests:test_rdma succeeds

Using [moodist](https://github.com/facebookresearch/moodist)'s approach to EFA via ibverbs (instead of libfabric) significantly reduced the complexity of this.

Reviewed By: zdevito

Differential Revision: D93543408

Pulled By: cpuhrsch
…mm warning

Three hardware-verified fixes for EFA SRD operations:

1. Forward `signaled` parameter to EFA post functions instead of
   hardcoding IBV_SEND_SIGNALED on every operation. This prevents CQ
   overflow under burst workloads and allows callers to control
   signaling frequency (e.g., signal every Nth op + last).

2. Add ENOMEM retry with CQ drain in post_write and post_read. When
   the send queue fills faster than HW completes, ibv_wr_complete()
   returns ENOMEM. Drain CQ to free slots and retry up to 1000 times
   before failing. Validated on P4d with 256KB+ transfers.

3. Emit warning when WriteWithImm is requested on EFA (op_type=3).
   EFA SRD does not support RDMA Write with Immediate Data; the
   previous code silently mapped it to plain Write.
@meta-cla
Copy link

meta-cla bot commented Feb 21, 2026

Hi @dmvevents!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants