Add ibError logging logic #630

Open
SuhitK wants to merge 1 commit into meta-pytorch:main from SuhitK:export-D92873340

SuhitK commented Feb 11, 2026

Summary:
This diff adds structured IB completion error reporting from the transport layer (net_ib.cc) to the NCCLX comms analyzer thrift service (GetCommsResponse). Errors are stored in the process-global ProcessGlobalErrorsUtil singleton and read directly by the handler. Each error is reported only once: the handler flushes the IB error state after populating the response.

IB errors are per-rank, not per-communicator, so ibErrors is a top-level field on GetCommsResponse. The handler reads from ProcessGlobalErrorsUtil::getAllState() directly, bypassing the per-comm commDump.cc serialization path.
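
To make the report-once behavior concrete, here is a minimal sketch of the pattern described above. It is not the code from this diff: `IbErrorStore`, `IbCompletionError`, `CommInfo`, `CommsResponse`, and `handleGetComms` are hypothetical stand-ins for `ProcessGlobalErrorsUtil`, the thrift types, and the handler, and all field names are assumed. The sketch also reads and clears the errors in a single locked step, whereas the summary describes reading the state and then flushing it; the two only differ if an error arrives in between.

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record for one IB completion error; field names are assumed.
struct IbCompletionError {
  std::string device;
  int port;
  uint32_t qpNum;
  std::string status;
};

// Hypothetical stand-in for a process-global error store.
class IbErrorStore {
 public:
  static IbErrorStore& instance() {
    static IbErrorStore store;  // one instance per process, i.e. per rank
    return store;
  }

  // Called by the transport layer when a work completion reports an error.
  void record(IbCompletionError err) {
    std::lock_guard<std::mutex> lock(mutex_);
    errors_.push_back(std::move(err));
  }

  // Returns every pending error and clears the store under the same lock,
  // so each error is handed out at most once.
  std::vector<IbCompletionError> drain() {
    std::lock_guard<std::mutex> lock(mutex_);
    std::vector<IbCompletionError> out;
    out.swap(errors_);
    return out;
  }

 private:
  IbErrorStore() = default;
  std::mutex mutex_;
  std::vector<IbCompletionError> errors_;
};

// Hypothetical per-communicator entry.
struct CommInfo {
  std::string commHash;
};

// Hypothetical response shape: IB errors sit next to the per-communicator
// list rather than inside it, because they belong to the rank.
struct CommsResponse {
  std::vector<CommInfo> comms;
  std::vector<IbCompletionError> ibErrors;
};

// Hypothetical handler: fills the top-level field straight from the store.
CommsResponse handleGetComms() {
  CommsResponse resp;
  resp.ibErrors = IbErrorStore::instance().drain();
  return resp;
}
```

In this sketch, a second call to `handleGetComms()` with no new errors in between returns an empty `ibErrors` list, which is the "reported only once" behavior the summary describes.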

Differential Revision: D92873340

meta-cla bot added the CLA Signed label Feb 11, 2026
meta-codesync bot commented Feb 11, 2026

@SuhitK has exported this pull request. If you are a Meta employee, you can view the originating Diff in D92873340.

SuhitK added a commit to SuhitK/torchcomms that referenced this pull request Feb 11, 2026
SuhitK force-pushed the export-D92873340 branch 2 times, most recently from 196881f to b461090 on February 18, 2026 21:11
SuhitK added a commit to SuhitK/torchcomms that referenced this pull request Feb 18, 2026
Reviewed By: YulunW

Differential Revision: D92873340
SuhitK added a commit to SuhitK/torchcomms that referenced this pull request Feb 18, 2026
SuhitK added a commit to SuhitK/torchcomms that referenced this pull request Feb 19, 2026
SuhitK added a commit to SuhitK/torchcomms that referenced this pull request Feb 20, 2026
SuhitK added a commit to SuhitK/torchcomms that referenced this pull request Feb 20, 2026
SuhitK added a commit to SuhitK/torchcomms that referenced this pull request Feb 20, 2026
SuhitK added a commit to SuhitK/torchcomms that referenced this pull request Feb 21, 2026
SuhitK added a commit to SuhitK/torchcomms that referenced this pull request Feb 21, 2026
Summary:
Pull Request resolved: meta-pytorch#630

Labels

CLA Signed (managed by the Meta Open Source bot), fb-exported, meta-exported
