Fix EFA SRD robustness: selective signaling, ENOMEM retry, WriteWithImm warning #2721
dmvevents wants to merge 2 commits into meta-pytorch:main
…rch#2638)

Summary:
- Add EFA SRD queue pair support alongside existing mlx5/RC transport
- EFA device auto-detection via efadv_query_device, with RdmaQpType::Auto selecting EFA vs mlx5dv based on hardware
- EFA connect (INIT→RTR→RTS + address handle creation) and post operations (write/read via extended verbs, recv via ibv_post_recv) implemented in C
- EFA-specific defaults: gid_index=0, max_sge=1, no RDMA atomics
- Static linking of libefa.a from rdma-core to avoid conflicts with dynamic libibverbs
- RC/mlx5 path is unchanged: EFA uses early returns and separate helper methods (efa_connect, post_op_efa) to keep it isolated
- Explicit `is_efa` field on RdmaQueuePair instead of inferring from dv_qp==0
- Replace float-sum checksum in rdma_pingpong with xxhash for exact byte-level verification
- Add --buffer_type flag (tensor, bytearray, memoryview) to rdma_pingpong to exercise non-tensor RDMA paths
- Consolidate small utility tests in test_rdma.py into focused test functions

Test Plan:
- Built from clean clone on EFA-equipped AWS H100 nodes
- rdma_pingpong.py passes all 4 sizes (10/100/500/1000 MB) × 3 iterations
- All data and pong verifications PASS
- Throughput: 2.5–8.0 GB/s across sizes
- RC/mlx5 path unaffected (existing tests continue to work)
- rdma_pingpong --buffer_type=bytearray and --buffer_type=memoryview both PASS
- buck build fbcode//monarch/python/tests:test_rdma succeeds

Using [moodist](https://github.com/facebookresearch/moodist)'s approach to EFA via ibverbs (instead of libfabric) significantly reduced the complexity of this.

Reviewed By: zdevito

Differential Revision: D93543408

Pulled By: cpuhrsch
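To illustrate why the commit above swaps the float-sum checksum for a byte-level hash, here is a minimal sketch in C. It is not the rdma_pingpong code: `float_sum` and `fnv1a64` are hypothetical helpers, and FNV-1a is used only as a dependency-free stand-in for xxhash. The point it shows is that a sum does not depend on element order, so a transfer that lands elements in the wrong positions can still pass, while a hash over the raw bytes does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Float-sum checksum (hypothetical helper): the result does not depend on
 * element order, so reordered data can produce the same value. */
static double float_sum(const float *data, size_t n) {
    assert(data != NULL || n == 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += data[i];
    }
    return acc;
}

/* FNV-1a 64-bit hash over raw bytes. Used here as a stand-in for xxhash
 * (an assumption for this sketch); every byte and its position affects
 * the result. */
static uint64_t fnv1a64(const void *buf, size_t len) {
    const unsigned char *p = (const unsigned char *)buf;
    uint64_t h = 14695981039346656037ULL; /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;            /* FNV prime */
    }
    return h;
}
```

Comparing `{1, 2, 3, 4}` with `{2, 1, 3, 4}` gives equal sums, whereas the byte-level hashes of the two buffers differ.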
…mm warning

Three hardware-verified fixes for EFA SRD operations:

1. Forward `signaled` parameter to EFA post functions instead of hardcoding IBV_SEND_SIGNALED on every operation. This prevents CQ overflow under burst workloads and allows callers to control signaling frequency (e.g., signal every Nth op + last).
2. Add ENOMEM retry with CQ drain in post_write and post_read. When the send queue fills faster than HW completes, ibv_wr_complete() returns ENOMEM. Drain CQ to free slots and retry up to 1000 times before failing. Validated on P4d with 256KB+ transfers.
3. Emit warning when WriteWithImm is requested on EFA (op_type=3). EFA SRD does not support RDMA Write with Immediate Data; the previous code silently mapped it to plain Write.
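The "signal every Nth op + last" policy mentioned in item 1 could be sketched on the caller side as follows. This is an illustration only: `should_signal` and `count_signaled` are hypothetical names that do not appear in the PR, and the boolean is assumed to be what a caller would pass as the `signaled` argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy helper (not part of the PR): request a completion
 * for every `interval`-th work request and always for the last one, so
 * the caller can still detect when the whole batch has finished. */
static bool should_signal(size_t index, size_t total, size_t interval) {
    assert(interval > 0 && index < total);
    bool is_last = (index + 1 == total);
    bool on_interval = ((index + 1) % interval == 0);
    return is_last || on_interval;
}

/* Number of completion entries a batch of `total` operations would
 * generate under the policy above. */
static size_t count_signaled(size_t total, size_t interval) {
    size_t n = 0;
    for (size_t i = 0; i < total; i++) {
        if (should_signal(i, total, interval)) {
            n++;
        }
    }
    return n;
}
```

With a batch of 10 operations and an interval of 4, only operations 4, 8, and 10 are signaled, so the CQ receives 3 entries rather than 10.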
Hi @dmvevents! Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Summary
Three hardware-verified robustness fixes for the EFA SRD path introduced in #2638, based on testing across 7 transfer sizes (64B–1MB) on P4d instances with 4x EFA NICs:
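1. Forward the `signaled` parameter to `rdmaxcel_efa_post_write/read` instead of hardcoding `IBV_SEND_SIGNALED` on every operation. Prevents CQ overflow under burst workloads and allows callers to control signaling frequency.
2. Retry on ENOMEM with a CQ drain. When the send queue fills faster than the hardware completes, `ibv_wr_complete()` returns ENOMEM. Drains CQ to free slots and retries. We hit this consistently with 256KB+ transfers at sustained throughput.
3. Emit a warning when WriteWithImm is requested on EFA. EFA SRD does not support RDMA Write with Immediate Data; the previous code silently mapped it to plain Write.

Depends on #2638.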
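The ENOMEM retry described above follows a bounded post, drain, retry loop. The sketch below models it with a toy queue rather than real verbs calls, so it runs without EFA hardware. `sim_queue`, `sim_post`, `sim_drain`, and `post_with_retry` are hypothetical names; the bound of 1000 retries is the one quoted in the commit message. In the real code path the post would end in `ibv_wr_complete()` and the drain would poll the completion queue.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_RETRIES 1000 /* retry bound quoted in the commit message */

/* Toy send queue used only for this sketch. */
struct sim_queue {
    size_t depth;       /* total send-queue slots */
    size_t in_flight;   /* slots currently occupied */
    size_t drain_batch; /* completions reaped per drain call */
};

/* Stand-in for posting a work request: ENOMEM when the queue is full. */
static int sim_post(struct sim_queue *q) {
    if (q->in_flight >= q->depth) {
        return ENOMEM;
    }
    q->in_flight++;
    return 0;
}

/* Stand-in for draining the CQ: frees up to `drain_batch` slots and
 * returns how many were freed. */
static size_t sim_drain(struct sim_queue *q) {
    size_t n = q->in_flight < q->drain_batch ? q->in_flight : q->drain_batch;
    q->in_flight -= n;
    return n;
}

/* Post once; on ENOMEM drain and retry, giving up after MAX_RETRIES
 * retries. Reports the number of retries through `retries_out`. */
static int post_with_retry(struct sim_queue *q, size_t *retries_out) {
    assert(q != NULL && retries_out != NULL);
    size_t retries = 0;
    for (;;) {
        int rc = sim_post(q);
        if (rc != ENOMEM || retries == MAX_RETRIES) {
            *retries_out = retries;
            return rc; /* success, a non-ENOMEM error, or retries exhausted */
        }
        sim_drain(q);
        retries++;
    }
}
```

Bounding the loop matters: if the drain never frees a slot (for example, because no outstanding operation was signaled), the caller gets ENOMEM back instead of spinning forever.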
Test plan