[misc] fix: use dedicated Gloo process group for HF safetensor save to avoid NCCL timeouts #492

Open
Ziyi-Wang wants to merge 12 commits into main from ziyi218

Ziyi-Wang (Collaborator) commented Feb 18, 2026

Summary

  • Use a dedicated Gloo process group (2h timeout) instead of the default NCCL process group for dcp.save() coordination, preventing NCCL timeouts when I/O is slow during HuggingFace safetensor checkpoint writes.
  • Add per-rank timing logs for the shard write and consolidation phases via a _TimedHuggingFaceStorageWriter mixin, improving checkpoint I/O observability.
  • Add overall wall-clock timing for save_hf_safetensor and propagate is_rank_0 to the distributed save path for consistent rank-0 gating.
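
The first bullet can be illustrated with a minimal sketch. This is not the code from the PR: the helper name `save_with_gloo_group` and its signature are hypothetical, and the sketch assumes `torch.distributed` has already been initialized on every rank. It only uses `torch.distributed.new_group` and the `process_group` argument of `dcp.save()`.

```python
from datetime import timedelta

# Two-hour timeout, matching the "2h timeout" quoted in the summary.
SAVE_TIMEOUT = timedelta(hours=2)


def save_with_gloo_group(state_dict, storage_writer, timeout=SAVE_TIMEOUT):
    """Run dcp.save() with a dedicated Gloo group for coordination.

    Hypothetical helper for illustration; assumes the default process
    group is already initialized on every participating rank.
    """
    # Imported lazily so this module can be loaded without PyTorch installed.
    import torch.distributed as dist
    import torch.distributed.checkpoint as dcp

    # new_group is itself a collective: every rank must call it.
    group = dist.new_group(backend="gloo", timeout=timeout)
    try:
        # Coordination for the save now goes through the Gloo group,
        # so slow storage is bounded by `timeout` rather than the NCCL one.
        dcp.save(state_dict, storage_writer=storage_writer, process_group=group)
    finally:
        # Release the temporary group once the save has finished or failed.
        dist.destroy_process_group(group)
```

Creating the group per save keeps the sketch short; a real implementation might instead create the group once and reuse it across checkpoints.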

Example log from a job saving a Qwen3-4B model to a mounted HDFS path:

[save_safetensor_utils.py:70] 03/16/2026 06:36:32 >> Skipping weight not in HF weight_map: lm_head.weight
[save_safetensor_utils.py:147] 03/16/2026 06:36:32 >> Starting dcp.save()...
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 3] Shard write took 2.79s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 1] Shard write took 2.81s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 4] Shard write took 2.95s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 7] Shard write took 2.99s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 5] Shard write took 3.06s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 0] Shard write took 3.16s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 2] Shard write took 3.22s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 6] Shard write took 3.38s
[save_safetensor_utils.py:99] 03/16/2026 06:36:35 >> [Rank 0] Start consolidation
[save_safetensor_utils.py:103] 03/16/2026 06:38:33 >> [Rank 0] finish (consolidation) took 117.67s
[save_safetensor_utils.py:162] 03/16/2026 06:38:34 >> dcp.save() save took 121.97s
[save_safetensor_utils.py:278] 03/16/2026 06:38:34 >> save_hf_safetensor total time: 123.06s
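
In the log above, consolidation on rank 0 (117.67s) accounts for most of the 121.97s spent in dcp.save(). Timing lines of this shape could be produced by a mixin along the following lines. This is a sketch under stated assumptions, not the PR's `_TimedHuggingFaceStorageWriter`: the class names, the `rank` attribute, and the `FakeWriter` stand-in are all hypothetical, and the stand-in exists only so the example runs without PyTorch.

```python
import logging
import time

logger = logging.getLogger(__name__)


class TimedWriterMixin:
    """Hypothetical mixin that logs how long write_data() and finish() take."""

    rank = 0  # illustrative; a real writer would query the process rank

    def write_data(self, *args, **kwargs):
        start = time.perf_counter()
        result = super().write_data(*args, **kwargs)
        elapsed = time.perf_counter() - start
        logger.info("[Rank %d] Shard write took %.2fs", self.rank, elapsed)
        return result

    def finish(self, *args, **kwargs):
        logger.info("[Rank %d] Start consolidation", self.rank)
        start = time.perf_counter()
        result = super().finish(*args, **kwargs)
        elapsed = time.perf_counter() - start
        logger.info("[Rank %d] finish (consolidation) took %.2fs", self.rank, elapsed)
        return result


class FakeWriter:
    """Stand-in for a real storage writer (assumption, for illustration only)."""

    def write_data(self, plan, planner):
        return "written"

    def finish(self, metadata, results):
        return None


class TimedFakeWriter(TimedWriterMixin, FakeWriter):
    """The mixin must come first so its methods wrap the writer's."""
```

Placing the mixin before the writer in the base-class list lets `super()` reach the real methods. In PyTorch's storage writer interface `write_data` returns a future, so a real mixin may need to time the future's completion rather than the call itself.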

github-actions bot added the ckpt (Checkpoint related.) and fix labels Feb 18, 2026

gemini-code-assist bot (Contributor) left a comment

Code Review

The pull request introduces changes to support saving HuggingFace safetensors to FUSE paths by using a local temporary directory for consolidation and then copying the files to the final destination. This addresses potential EOPNOTSUPP errors on mounted filesystems. The changes involve importing necessary modules, modifying the _save_hf_safetensor_distributed function to handle the temporary directory, and updating the save_hf_safetensor function to pass the is_rank_0 flag. The approach seems sound for handling the filesystem limitations.
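
The write-locally-then-copy approach described in this review can be sketched with the standard library alone. The helper name `write_via_local_tmp` and the `produce_files` callback are hypothetical and do not come from the PR; the sketch only shows the general pattern of building files in a local temporary directory and copying them to the final destination.

```python
import shutil
import tempfile
from pathlib import Path


def write_via_local_tmp(final_dir, produce_files):
    """Build files in a local temp dir, then copy them into final_dir.

    Hypothetical helper: produce_files(tmp_dir) is a caller-supplied
    callback that writes its outputs into tmp_dir.
    """
    final_path = Path(final_dir)
    final_path.mkdir(parents=True, exist_ok=True)
    copied = []
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        produce_files(tmp_path)
        for src in sorted(tmp_path.iterdir()):
            if src.is_file():
                # copyfile copies file contents only, without metadata.
                shutil.copyfile(src, final_path / src.name)
                copied.append(src.name)
    # The temporary directory is removed when the with-block exits.
    return copied
```

A caller would pass the mounted destination as `final_dir` and a callback that runs the consolidation step against the temporary directory.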



class _TimedHuggingFaceStorageWriter:
"""Mixin that adds per-rank timing logs to HuggingFaceStorageWriter.

Ziyi-Wang changed the title from "[ckpt] fix: support fuse path when save hf safetensor" to "[misc] fix: use dedicated Gloo process group for HF safetensor save to avoid NCCL timeouts" Mar 6, 2026
github-actions bot added the misc (Every misc) label Mar 6, 2026

2 participants