Fix transport_errors unit test for process global errors #746

Closed
zhiyongww wants to merge 1 commit into meta-pytorch:main from zhiyongww:export-D93790397
@zhiyongww
Contributor

Summary:
The `transport_errors` unit test was crashing (SIGSEGV) and failing assertions because:

  1. `ctran::ibvwrap::ibvSymbols` was not initialized: `ibvInit()` only initializes `ibverbx::ibvSymbols`, so `wrap_ibv_event_type_str`, called from `triageIbAsyncEvents`, dereferenced a null function pointer. Fixed by calling `wrap_ibv_symbols()` in test setup.

  2. The `AsyncEventFound` and `AsyncEventLinkFlap` tests asserted that `ncclCommGetAsyncError` would return `ncclSystemError`, but the production code now records errors via `ProcessGlobalErrorsUtil::setNic` instead of `ncclCommSetAsyncError`. Updated both tests to verify errors through `ProcessGlobalErrorsUtil::getAllState()`.
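
For readers without access to the diff, the two failure modes can be sketched in isolation. This is a self-contained illustration only, not the code in this PR: `FakeIbvSymbols`, `initWrapperSymbols`, `FakeProcessGlobalErrors`, `triageAsyncEvent`, and the device name `mlx5_0` are hypothetical stand-ins, and the real ctran types and signatures may differ.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a lazily populated symbol table.
// Every entry starts out null until an explicit init call fills it in.
struct FakeIbvSymbols {
  const char* (*event_type_str)(int) = nullptr;
};

FakeIbvSymbols& wrapperSymbols() {
  static FakeIbvSymbols symbols;
  return symbols;
}

static const char* fakeEventTypeStr(int event) {
  return event == 1 ? "port error" : "unknown";
}

// Plays the role of the init call that was missing from test setup.
void initWrapperSymbols() {
  wrapperSymbols().event_type_str = &fakeEventTypeStr;
}

// Hypothetical process-wide error registry keyed by NIC name.
class FakeProcessGlobalErrors {
 public:
  static void setNic(const std::string& nic, const std::string& message) {
    state()[nic] = message;
  }
  static const std::map<std::string, std::string>& getAllState() {
    return state();
  }
  static void clear() { state().clear(); }

 private:
  static std::map<std::string, std::string>& state() {
    static std::map<std::string, std::string> errors;
    return errors;
  }
};

// Simplified triage path: looks up the event name through the symbol table,
// then records the error process-wide rather than on a communicator.
void triageAsyncEvent(const std::string& nic, int event) {
  // Calling through a null function pointer is undefined behavior and
  // typically shows up as a SIGSEGV, hence the guard.
  assert(wrapperSymbols().event_type_str != nullptr);
  FakeProcessGlobalErrors::setNic(nic, wrapperSymbols().event_type_str(event));
}

// Shape of the updated test: initialize symbols in setup, trigger the event,
// then inspect the process-global state instead of a per-communicator error.
bool asyncEventTestSketch() {
  FakeProcessGlobalErrors::clear();
  initWrapperSymbols();
  triageAsyncEvent("mlx5_0", 1);
  const auto& all = FakeProcessGlobalErrors::getAllState();
  auto it = all.find("mlx5_0");
  return it != all.end() && it->second == "port error";
}
```

In this sketch, skipping `initWrapperSymbols()` reproduces the first failure, and asserting on a per-communicator error code instead of the registry reproduces the second.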

Differential Revision: D93790397

meta-cla bot added the CLA Signed label Feb 19, 2026
meta-codesync bot commented Feb 19, 2026

@zhiyongww has exported this pull request. If you are a Meta employee, you can view the originating Diff in D93790397.

meta-codesync bot commented Feb 20, 2026

This pull request has been merged in f0051d2.

Labels: CLA Signed, fb-exported, Merged, meta-exported
