Checklist
Bug Description
The SFT script is as follows:
```shell
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
IMAGE_MAX_TOKEN_NUM=1024 \
VIDEO_MAX_TOKEN_NUM=128 \
FPS_MAX_FRAMES=16 \
MASTER_PORT=29500 \
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --model /mnt/public_data/modelscope/Qwen/Qwen3-VL-30B-A3B-Instruct_ \
    --load_safetensors true \
    --save_safetensors true \
    --model_type qwen3_moe_vl \
    --dataset "data/traindata/jsonl/data.jsonl" \
    --split_dataset_ratio 0.01 \
    --loss_scale default \
    --loss_scale ignore_empty_think \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --tensor_model_parallel_size 4 \
    --expert_model_parallel_size 8 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-6 \
    --micro_batch_size 8 \
    --global_batch_size 32 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --max_epochs 10 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --freeze_vit false \
    --freeze_aligner false \
    --lr 1e-4 \
    --vit_lr 1e-5 \
    --aligner_lr 2e-5 \
    --min_lr 1e-5 \
    --lr_warmup_fraction 0.05 \
    --save checkpoints/finetune \
    --eval_interval 300 \
    --save_interval 300 \
    --max_length 16384 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --megatron_extra_kwargs '{"log-energy": true}' \
    --attention-backend flash
```
After training, the model save directory contains two entries, "checkpoint-xxx" and "checkpoint-xxx-merged". I tried to merge the weights with:
```shell
swift export \
    --adapters checkpoints/finetune/v0-xxx/checkpoint-xxx \
    --merge_lora true
```

but it raised:

```
Traceback (most recent call last):
  File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/cli/export.py", line 5, in <module>
    export_main()
  File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/llm/export/export.py", line 53, in export_main
    return SwiftExport(args).main()
  File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/llm/base.py", line 19, in __init__
    self.args = self._parse_args(args)
  File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/llm/base.py", line 31, in _parse_args
    args, remaining_argv = parse_args(self.args_class, args)
  File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/utils/utils.py", line 190, in parse_args
    args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
  File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/transformers/hf_argparser.py", line 345, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 120, in __init__
  File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/llm/argument/export_args.py", line 144, in __post_init__
    BaseArguments.__post_init__(self)
  File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/llm/argument/base_args/base_args.py", line 194, in __post_init__
    assert self._check_is_adapter(adapter), (
AssertionError: /home/jovyan/user/llama-factory/sase_vl_train/models/qingming_festive_finetune/Qwen3-VL-30B-A3B-Instruct-0404-finetune-merged-stable-test3/v0-20260406-091352/checkpoint-900 is not an adapter, please try using --model to pass it.
```

Adding `--model` as the message suggests did not help either.
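From the traceback, `swift export --adapters` runs a `_check_is_adapter` test on the directory before merging, and the Megatron-style checkpoint fails it. As a quick way to inspect what kind of checkpoint a directory holds, here is a minimal stand-alone sketch of such a test. It assumes the standard PEFT layout, where an adapter directory carries an `adapter_config.json`; swift's real check may look at additional files, and `looks_like_adapter` is a hypothetical helper, not swift's API:

```python
import os
import tempfile

def looks_like_adapter(ckpt_dir: str) -> bool:
    # Illustrative heuristic only: a PEFT-style adapter directory normally
    # contains adapter_config.json (plus adapter weights), while a full HF
    # model checkpoint has config.json, and a Megatron dist checkpoint has
    # neither at the top level. swift's actual check may differ.
    return os.path.isfile(os.path.join(ckpt_dir, "adapter_config.json"))

# Demo with temporary directories standing in for the two checkpoint layouts.
with tempfile.TemporaryDirectory() as root:
    adapter_dir = os.path.join(root, "checkpoint-900-adapter")
    megatron_dir = os.path.join(root, "checkpoint-900")
    os.makedirs(adapter_dir)
    os.makedirs(megatron_dir)
    open(os.path.join(adapter_dir, "adapter_config.json"), "w").close()

    print(looks_like_adapter(adapter_dir))   # True
    print(looks_like_adapter(megatron_dir))  # False
```

Running something like this against `checkpoint-xxx` would show whether the saved directory is a PEFT-style adapter at all, or a Megatron distributed checkpoint that `swift export` cannot treat as one.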
Another attempt:
```shell
CUDA_VISIBLE_DEVICES=6,7 \
NPROC_PER_NODE=2 \
megatron export \
    --adapter_load checkpoints/v0-xxxx/checkpoint-xxx \
    --tensor_model_parallel_size 2 \
    --to_hf true \
    --merge_lora true \
    --torch_dtype bfloat16 \
    --save checkpoints/v0-xxxx/checkpoint-xxx-merged \
    --test_convert_precision true
```
It raised:
```
[rank4]: Traceback (most recent call last):
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/cli/_megatron/export.py", line 7, in <module>
[rank4]:     megatron_export_main()
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/megatron/export/export.py", line 142, in megatron_export_main
[rank4]:     return MegatronExport(args).main()
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/llm/base.py", line 49, in main
[rank4]:     result = self.run()
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/megatron/export/export.py", line 30, in run
[rank4]:     self.convert_mcore2hf()
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/megatron/export/export.py", line 58, in convert_mcore2hf
[rank4]:     load_checkpoint([mg_model], None, None, load_arg='adapter_load', strict=False)
[rank4]:   File "/home/jovyan/user/code/Megatron-LM-r0.15.0/megatron/training/checkpointing.py", line 1533, in load_checkpoint
[rank4]:     state_dict, checkpoint_name, release, ckpt_type = _load_base_checkpoint(
[rank4]:   File "/home/jovyan/user/code/Megatron-LM-r0.15.0/megatron/training/checkpointing.py", line 1106, in _load_base_checkpoint
[rank4]:     return _load_global_dist_base_checkpoint(
[rank4]:   File "/home/jovyan/user/code/Megatron-LM-r0.15.0/megatron/training/checkpointing.py", line 1000, in _load_global_dist_base_checkpoint
[rank4]:     state_dict = dist_checkpointing.load(sharded_state_dict, checkpoint_name, load_strategy, strict=args.dist_ckpt_strictness)
[rank4]:   File "/home/jovyan/user/code/Megatron-LM-r0.15.0/megatron/core/dist_checkpointing/serialization.py", line 161, in load
[rank4]:     loaded_state_dict = sharded_strategy.load(sharded_state_dict, checkpoint_dir)
[rank4]:   File "/home/jovyan/user/code/Megatron-LM-r0.15.0/megatron/core/dist_checkpointing/strategies/torch.py", line 956, in load
[rank4]:     checkpoint.load_state_dict(
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/typing_extensions.py", line 3004, in wrapper
[rank4]:     return arg(*args, **kwargs)
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_loader.py", line 41, in load_state_dict
[rank4]:     return _load_state_dict(
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_loader.py", line 229, in _load_state_dict
[rank4]:     central_plan: LoadPlan = distW.reduce_scatter("plan", local_step, global_step)
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 191, in reduce_scatter
[rank4]:     raise result
[rank4]: torch.distributed.checkpoint.api.CheckpointException: CheckpointException ranks:dict_keys([0, 1, 2, 3, 4, 5, 6, 7])
```
A few questions:
1. Does checkpoint-xxx contain the LoRA weights? Why is there no lora_adapter in it, and can these weights still be merged into the base model?
2. Is there any way to merge "checkpoint-xxx"?
How to Reproduce
Run the scripts above to reproduce.
Additional Information
Environment:
ms-swift: 3.12.3
megatron: r0.15.0
flash_attn: 2.8.1
torch: 2.6.0 + cu124
transformers: 4.57.3