[misc] fix: use dedicated Gloo process group for HF safetensor save to avoid NCCL timeouts by Ziyi-Wang · Pull Request #492 · ByteDance-Seed/VeOmni

Ziyi-Wang · 2026-02-18T23:37:30Z

Summary

Replace the default NCCL process group with a dedicated Gloo group (2h timeout) during dcp.save() coordination, preventing NCCL timeouts on slow I/O during HuggingFace safetensor checkpoint writes.
Add per-rank timing logs for shard write and consolidation phases via _TimedHuggingFaceStorageWriter mixin, improving checkpoint I/O observability.
Add overall wall-clock timing for save_hf_safetensor and propagate is_rank_0 to the distributed save path for consistent rank-0 gating.
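
As a rough illustration of the first point, the sketch below shows one way a dedicated Gloo group with a long timeout could be handed to dcp.save(). It is not the code from this PR: the helper names gloo_group_kwargs and save_with_gloo_group are hypothetical, and the sketch assumes the default process group is already initialized and that every rank calls the function, since new_group is a collective call.

```python
from datetime import timedelta

# Assumption: 2 hours, matching the timeout quoted in the summary.
GLOO_SAVE_TIMEOUT = timedelta(hours=2)


def gloo_group_kwargs(timeout=GLOO_SAVE_TIMEOUT):
    """Keyword arguments for a dedicated Gloo group (hypothetical helper)."""
    return {"backend": "gloo", "timeout": timeout}


def save_with_gloo_group(state_dict, storage_writer):
    """Run dcp.save() with its coordination on a separate Gloo group.

    Hypothetical helper: torch is imported lazily so the module can be
    loaded on machines without it.
    """
    import torch.distributed as dist
    import torch.distributed.checkpoint as dcp

    # Every rank must call new_group, even though only the save uses it.
    group = dist.new_group(**gloo_group_kwargs())
    try:
        dcp.save(state_dict, storage_writer=storage_writer, process_group=group)
    finally:
        # Release the temporary group; the default group is left untouched.
        dist.destroy_process_group(group)
```

Keeping the save-time coordination on its own group means a long consolidation step only has to fit within the Gloo timeout, while the NCCL timeout used for training collectives stays unchanged.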

Example log of the job saving a Qwen3-4B model to mounted HDFS

[save_safetensor_utils.py:70] 03/16/2026 06:36:32 >> Skipping weight not in HF weight_map: lm_head.weight
[save_safetensor_utils.py:147] 03/16/2026 06:36:32 >> Starting dcp.save()...
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 3] Shard write took 2.79s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 1] Shard write took 2.81s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 4] Shard write took 2.95s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 7] Shard write took 2.99s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 5] Shard write took 3.06s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 0] Shard write took 3.16s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 2] Shard write took 3.22s
[save_safetensor_utils.py:94] 03/16/2026 06:36:35 >> [Rank 6] Shard write took 3.38s
[save_safetensor_utils.py:99] 03/16/2026 06:36:35 >> [Rank 0] Start consolidation
[save_safetensor_utils.py:103] 03/16/2026 06:38:33 >> [Rank 0] finish (consolidation) took 117.67s
[save_safetensor_utils.py:162] 03/16/2026 06:38:34 >> dcp.save() save took 121.97s
[save_safetensor_utils.py:278] 03/16/2026 06:38:34 >> save_hf_safetensor total time: 123.06s
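
The per-rank "Shard write took" and "finish (consolidation) took" lines above can be produced by a mixin that wraps the writer's methods with timers. The sketch below is a torch-free illustration of that pattern, not the PR's _TimedHuggingFaceStorageWriter: FakeShardWriter, TimedWriterMixin, and their simplified write_data/finish signatures are hypothetical stand-ins for the real storage writer.

```python
import time


class FakeShardWriter:
    """Hypothetical stand-in for a storage writer (not the torch class)."""

    def __init__(self, rank):
        self.rank = rank

    def write_data(self, plan):
        time.sleep(0.01)  # pretend to write this rank's shard
        return [f"shard-{self.rank}"]

    def finish(self, results):
        time.sleep(0.01)  # pretend to consolidate the shards


class TimedWriterMixin:
    """Times write_data and finish, then delegates to the next class."""

    def write_data(self, plan):
        start = time.perf_counter()
        out = super().write_data(plan)
        self._log("Shard write", start)
        return out

    def finish(self, results):
        start = time.perf_counter()
        super().finish(results)
        self._log("finish (consolidation)", start)

    def _log(self, phase, start):
        elapsed = time.perf_counter() - start
        line = f"[Rank {self.rank}] {phase} took {elapsed:.2f}s"
        print(line)
        # Keep the lines so callers (and tests) can inspect them.
        self.__dict__.setdefault("timing_lines", []).append(line)


class TimedFakeWriter(TimedWriterMixin, FakeShardWriter):
    """The mixin comes first so its methods wrap the base writer's."""


writer = TimedFakeWriter(rank=0)
shards = writer.write_data(plan=None)
writer.finish(shards)
```

Because the mixin only calls super(), it can be placed in front of any writer exposing the same method names, without changing the writer itself.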

gemini-code-assist

Code Review

The pull request introduces changes to support saving HuggingFace safetensors to FUSE paths by using a local temporary directory for consolidation and then copying the files to the final destination. This addresses potential EOPNOTSUPP errors on mounted filesystems. The changes involve importing necessary modules, modifying the _save_hf_safetensor_distributed function to handle the temporary directory, and updating the save_hf_safetensor function to pass the is_rank_0 flag. The approach seems sound for handling the filesystem limitations.

veomni/utils/save_safetensor_utils.py

Luosuu · 2026-02-27T19:22:37Z

veomni/utils/save_safetensor_utils.py



+class _TimedHuggingFaceStorageWriter:
+    """Mixin that adds per-rank timing logs to HuggingFaceStorageWriter.


@piyifan123

github-actions bot added the ckpt (Checkpoint related.) and fix labels Feb 18, 2026

gemini-code-assist bot reviewed Feb 18, 2026


Ziyi-Wang requested a review from piyifan123 February 19, 2026 00:00

Ziyi-Wang force-pushed the ziyi218 branch from 8b87b38 to f8b041a Compare February 27, 2026 18:12

Luosuu reviewed Feb 27, 2026


Ziyi-Wang changed the title ~~[ckpt] fix: support fuse path when save hf safetensor~~ [misc] fix: use dedicated Gloo process group for HF safetensor save to avoid NCCL timeouts Mar 6, 2026

github-actions bot added the misc (Every misc) label Mar 6, 2026

Ziyi-Wang force-pushed the ziyi218 branch from 48133d5 to 0c41b9f Compare March 6, 2026 04:01

Ziyi-Wang added 12 commits March 16, 2026 06:05

[ckpt] fix: support fuse path when save hf safetensor

1748d2b

mod

ffddab2

mod

7f522aa

mod

f5b62e3

safetensor latest

4f18180

mod

544da28

increase NCCL timeout

889bca9

mod

135e44d

gloo timeout

5458818

mod

4acfc07

remove fuse part

8ef4695

mod

ada6021

Ziyi-Wang force-pushed the ziyi218 branch from 0c41b9f to ada6021 Compare March 15, 2026 22:12

Ziyi-Wang wants to merge 12 commits into main from ziyi218

2 participants


