Checklist
Bug Description
The SFT script is as follows:
```shell
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
IMAGE_MAX_TOKEN_NUM=1024 \
VIDEO_MAX_TOKEN_NUM=128 \
FPS_MAX_FRAMES=16 \
MASTER_PORT=29500 \
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --model /mnt/public_data/modelscope/Qwen/Qwen3-VL-30B-A3B-Instruct_ \
    --load_safetensors true \
    --save_safetensors true \
    --model_type qwen3_moe_vl \
    --dataset "data/traindata/jsonl/data.jsonl" \
    --split_dataset_ratio 0.01 \
    --loss_scale default \
    --loss_scale ignore_empty_think \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --tensor_model_parallel_size 4 \
    --expert_model_parallel_size 8 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-6 \
    --micro_batch_size 8 \
    --global_batch_size 32 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --max_epochs 10 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --freeze_vit false \
    --freeze_aligner false \
    --lr 1e-4 \
    --vit_lr 1e-5 \
    --aligner_lr 2e-5 \
    --min_lr 1e-5 \
    --lr_warmup_fraction 0.05 \
    --save checkpoints/finetune \
    --eval_interval 300 \
    --save_interval 300 \
    --max_length 16384 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --megatron_extra_kwargs '{"log-energy": true}' \
    --attention-backend flash
```
After training, the model save directory contains two entries, "checkpoint-xxx" and "checkpoint-xxx-merged". I tried to merge the weights with:
```shell
swift export \
    --adapters checkpoints/finetune/v0-xxx/checkpoint-xxx \
    --merge_lora true
```

but it raised:

```
Traceback (most recent call last):
  File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/cli/export.py", line 5, in <module>
    export_main()
  File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/llm/export/export.py", line 53, in export_main
    return SwiftExport(args).main()
  File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/llm/base.py", line 19, in __init__
    self.args = self._parse_args(args)
  File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/llm/base.py", line 31, in _parse_args
    args, remaining_argv = parse_args(self.args_class, args)
  File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/utils/utils.py", line 190, in parse_args
    args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
  File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/transformers/hf_argparser.py", line 345, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 120, in __init__
  File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/llm/argument/export_args.py", line 144, in __post_init__
    BaseArguments.__post_init__(self)
  File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/llm/argument/base_args/base_args.py", line 194, in __post_init__
    assert self._check_is_adapter(adapter), (
AssertionError: /home/jovyan/user/llama-factory/sase_vl_train/models/qingming_festive_finetune/Qwen3-VL-30B-A3B-Instruct-0404-finetune-merged-stable-test3/v0-20260406-091352/checkpoint-900 is not an adapter, please try using --model to pass it.
```

Adding `--model` as the message suggests did not help either.
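From the traceback, `swift export --adapters` runs a `_check_is_adapter` test on the directory before merging, and the Megatron-style checkpoint fails it. As a quick way to inspect what kind of checkpoint a directory holds, here is a minimal stand-alone sketch of such a test. It assumes the standard PEFT layout, where an adapter directory carries an `adapter_config.json`; swift's real check may look at additional files, and `looks_like_adapter` is a hypothetical helper, not swift's API:

```python
import os
import tempfile

def looks_like_adapter(ckpt_dir: str) -> bool:
    # Illustrative heuristic only: a PEFT-style adapter directory normally
    # contains adapter_config.json (plus adapter weights), while a full HF
    # model checkpoint has config.json, and a Megatron dist checkpoint has
    # neither at the top level. swift's actual check may differ.
    return os.path.isfile(os.path.join(ckpt_dir, "adapter_config.json"))

# Demo with temporary directories standing in for the two checkpoint layouts.
with tempfile.TemporaryDirectory() as root:
    adapter_dir = os.path.join(root, "checkpoint-900-adapter")
    megatron_dir = os.path.join(root, "checkpoint-900")
    os.makedirs(adapter_dir)
    os.makedirs(megatron_dir)
    open(os.path.join(adapter_dir, "adapter_config.json"), "w").close()

    print(looks_like_adapter(adapter_dir))   # True
    print(looks_like_adapter(megatron_dir))  # False
```

Running something like this against `checkpoint-xxx` would show whether the saved directory is a PEFT-style adapter at all, or a Megatron distributed checkpoint that `swift export` cannot treat as one.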
Another attempt:
```shell
CUDA_VISIBLE_DEVICES=6,7 \
NPROC_PER_NODE=2 \
megatron export \
    --adapter_load checkpoints/v0-xxxx/checkpoint-xxx \
    --tensor_model_parallel_size 2 \
    --to_hf true \
    --merge_lora true \
    --torch_dtype bfloat16 \
    --save checkpoints/v0-xxxx/checkpoint-xxx-merged \
    --test_convert_precision true
```
It raised:
```
[rank4]: Traceback (most recent call last):
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/cli/_megatron/export.py", line 7, in <module>
[rank4]:     megatron_export_main()
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/megatron/export/export.py", line 142, in megatron_export_main
[rank4]:     return MegatronExport(args).main()
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/llm/base.py", line 49, in main
[rank4]:     result = self.run()
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/megatron/export/export.py", line 30, in run
[rank4]:     self.convert_mcore2hf()
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/swift/megatron/export/export.py", line 58, in convert_mcore2hf
[rank4]:     load_checkpoint([mg_model], None, None, load_arg='adapter_load', strict=False)
[rank4]:   File "/home/jovyan/user/code/Megatron-LM-r0.15.0/megatron/training/checkpointing.py", line 1533, in load_checkpoint
[rank4]:     state_dict, checkpoint_name, release, ckpt_type = _load_base_checkpoint(
[rank4]:   File "/home/jovyan/user/code/Megatron-LM-r0.15.0/megatron/training/checkpointing.py", line 1106, in _load_base_checkpoint
[rank4]:     return _load_global_dist_base_checkpoint(
[rank4]:   File "/home/jovyan/user/code/Megatron-LM-r0.15.0/megatron/training/checkpointing.py", line 1000, in _load_global_dist_base_checkpoint
[rank4]:     state_dict = dist_checkpointing.load(sharded_state_dict, checkpoint_name, load_strategy, strict=args.dist_ckpt_strictness)
[rank4]:   File "/home/jovyan/user/code/Megatron-LM-r0.15.0/megatron/core/dist_checkpointing/serialization.py", line 161, in load
[rank4]:     loaded_state_dict = sharded_strategy.load(sharded_state_dict, checkpoint_dir)
[rank4]:   File "/home/jovyan/user/code/Megatron-LM-r0.15.0/megatron/core/dist_checkpointing/strategies/torch.py", line 956, in load
[rank4]:     checkpoint.load_state_dict(
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/typing_extensions.py", line 3004, in wrapper
[rank4]:     return arg(*args, **kwargs)
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_loader.py", line 41, in load_state_dict
[rank4]:     return _load_state_dict(
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_loader.py", line 229, in _load_state_dict
[rank4]:     central_plan: LoadPlan = distW.reduce_scatter("plan", local_step, global_step)
[rank4]:   File "/home/jovyan/user/env/wdh-swift312-torch26-cu124/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 191, in reduce_scatter
[rank4]:     raise result
[rank4]: torch.distributed.checkpoint.api.CheckpointException: CheckpointException ranks:dict_keys([0, 1, 2, 3, 4, 5, 6, 7])
```
A few questions:
1. Does checkpoint-xxx contain the LoRA weights? Why is there no lora_adapter in it, and can these weights still be merged into the base model?
2. Is there any way to merge "checkpoint-xxx"?
How to Reproduce
Run the scripts above to reproduce.
Additional Information
Environment:
ms-swift: 3.12.3
megatron: r0.15.0
flash_attn: 2.8.1
torch: 2.6.0 + cu124
transformers: 4.57.3