[misc] fix: use dedicated Gloo process group for HF safetensor save to avoid NCCL timeouts #492
Code Review
The pull request introduces changes to support saving HuggingFace safetensors to FUSE paths by using a local temporary directory for consolidation and then copying the files to the final destination. This addresses potential EOPNOTSUPP errors on mounted filesystems. The changes involve importing necessary modules, modifying the _save_hf_safetensor_distributed function to handle the temporary directory, and updating the save_hf_safetensor function to pass the is_rank_0 flag. The approach seems sound for handling the filesystem limitations.
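The consolidate-locally-then-copy approach described in this review can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `consolidate_via_local_tmp` and its `consolidate_fn` callback are hypothetical names, and the sketch assumes only rank 0 consolidates and copies.

```python
import os
import shutil
import tempfile
from typing import Callable


def consolidate_via_local_tmp(
    final_dir: str,
    consolidate_fn: Callable[[str], None],
    is_rank_0: bool,
) -> list[str]:
    """Run consolidate_fn against a local temp dir, then copy results to final_dir.

    Hypothetical helper: consolidate_fn stands in for whatever writes the
    consolidated safetensors files into the directory it is given.
    Returns the sorted names of the copied files (empty on non-zero ranks).
    """
    if not is_rank_0:
        return []  # in this sketch only rank 0 consolidates and copies

    os.makedirs(final_dir, exist_ok=True)
    copied = []
    # The temp dir lives on local disk, so file operations that a mounted
    # filesystem may reject (e.g. with EOPNOTSUPP) never touch the mount.
    with tempfile.TemporaryDirectory() as tmp_dir:
        consolidate_fn(tmp_dir)
        for name in sorted(os.listdir(tmp_dir)):
            src = os.path.join(tmp_dir, name)
            if os.path.isfile(src):
                # copyfile copies file contents only, not metadata.
                shutil.copyfile(src, os.path.join(final_dir, name))
                copied.append(name)
    return copied
```

The temporary directory is removed when the `with` block exits, so only the copied files remain at the destination.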
Luosuu reviewed Feb 27, 2026
class _TimedHuggingFaceStorageWriter:
    """Mixin that adds per-rank timing logs to HuggingFaceStorageWriter.
Summary
- Use a dedicated Gloo process group for `dcp.save()` coordination, preventing NCCL timeouts on slow I/O during HuggingFace safetensor checkpoint writes.
- Add per-rank timing logs via the `_TimedHuggingFaceStorageWriter` mixin, improving checkpoint I/O observability.
- [...] `save_hf_safetensor` and propagate `is_rank_0` to the distributed save path for consistent rank-0 gating.

Example log of the job saving a Qwen3-4B model to mounted HDFS:
[save_safetensor_utils.py:70] 03/16/2026 06:36:32 >> Skipping weight not in HF weight_map: lm_head.weight
[save_safetensor_utils.py:147] 03/16/2026 06:36:32 >> Starting dcp.save()...
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 3] Shard write took 2.79s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 1] Shard write took 2.81s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 4] Shard write took 2.95s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 7] Shard write took 2.99s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 5] Shard write took 3.06s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 0] Shard write took 3.16s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 2] Shard write took 3.22s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 6] Shard write took 3.38s
[save_safetensor_utils.py:99] 03/16/2026 06:36:35 >> [Rank 0] Start consolidation
[save_safetensor_utils.py:103] 03/16/2026 06:38:33 >> [Rank 0] finish (consolidation) took 117.67s
[save_safetensor_utils.py:162] 03/16/2026 06:38:34 >> dcp.save() save took 121.97s
[save_safetensor_utils.py:278] 03/16/2026 06:38:34 >> save_hf_safetensor total time: 123.06s
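
For orientation, the two ideas in the summary (per-rank timing and a dedicated Gloo group for `dcp.save()`) might look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: `TimedWriteMixin`, `FakeWriter`, `write_shard`, and `save_with_gloo_group` are hypothetical names. The PyTorch calls used are `torch.distributed.new_group(backend="gloo")`, `torch.distributed.destroy_process_group`, and the `process_group` argument of `dcp.save()`. The torch imports are deferred so the timing part runs without PyTorch installed.

```python
import logging
import time

logger = logging.getLogger(__name__)


class TimedWriteMixin:
    """Hypothetical mixin: logs how long the wrapped class's write_shard() takes.

    Assumes the class it is mixed into defines write_shard(); the hooks on the
    real HuggingFaceStorageWriter may be named differently.
    """

    rank = 0  # would be set per process in real use

    def write_shard(self, *args, **kwargs):
        start = time.perf_counter()
        result = super().write_shard(*args, **kwargs)
        self.last_elapsed = time.perf_counter() - start
        logger.info("[Rank %d] Shard write took %.2fs", self.rank, self.last_elapsed)
        return result


class FakeWriter:
    """Stand-in for a storage writer, used only to exercise the mixin."""

    def write_shard(self, payload):
        return len(payload)


class TimedFakeWriter(TimedWriteMixin, FakeWriter):
    """The mixin comes first in the bases so its write_shard() wraps FakeWriter's."""


def save_with_gloo_group(state_dict, storage_writer):
    """Coordinate dcp.save() over a dedicated Gloo group rather than the default group.

    Assumes torch.distributed is already initialized on every rank.
    """
    import torch.distributed as dist
    import torch.distributed.checkpoint as dcp

    # new_group is a collective call: every rank must reach it in the same order.
    gloo_group = dist.new_group(backend="gloo")
    try:
        dcp.save(state_dict, storage_writer=storage_writer, process_group=gloo_group)
    finally:
        dist.destroy_process_group(gloo_group)
```

In the log above, rank 0's consolidation accounts for 117.67s of the 121.97s `dcp.save()` time, while every shard write finishes in under 4s. That long wait on the other ranks is the slow-I/O case the summary describes. A real implementation would probably create the Gloo group once and reuse it across saves instead of rebuilding it per call as this sketch does.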