[Ascend]: EP training of the Qwen3 multimodal MoE model fails with an error #643

@lxjhunan

Description

The model loads normally, but after about 30 minutes a timeout occurs with the following error:
c10_npu::acl::AclrtSynchronizeStreamWithTimeout(copy_stream), error code is 107020

Training parameters:
--train.micro_batch_size 1
--train.global_batch_size 32
--train.profile.with_stack false
--train.accelerator.ep_size 4
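
As a quick sanity check on the settings above, here is a minimal sketch of the batch and expert-parallel arithmetic. It does not diagnose the timeout. The helper name `plan_parallel_layout` is hypothetical and the world size of 8 is a placeholder, since the number of NPUs is not stated in this report. It also assumes every rank consumes its own micro-batch (data-parallel size equals world size), which may not match what the framework actually does.

```python
def plan_parallel_layout(world_size, ep_size, micro_batch_size, global_batch_size):
    """Derive EP group count and gradient-accumulation steps (hypothetical helper).

    Assumption: every rank processes its own micro-batch, so one forward/backward
    pass across all ranks consumes micro_batch_size * world_size samples.
    """
    if world_size % ep_size != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by ep_size={ep_size}"
        )
    samples_per_pass = micro_batch_size * world_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not a multiple of "
            f"micro_batch_size * world_size = {samples_per_pass}"
        )
    return {
        "ep_groups": world_size // ep_size,  # independent expert-parallel groups
        "grad_accum_steps": global_batch_size // samples_per_pass,
    }


if __name__ == "__main__":
    # world_size=8 is a placeholder; substitute the real number of NPUs.
    print(plan_parallel_layout(world_size=8, ep_size=4,
                               micro_batch_size=1, global_batch_size=32))
```

If either check raises for the real device count, the settings are inconsistent under these assumptions. If both pass, the settings are at least consistent under them, and the cause of the timeout lies elsewhere.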

Labels: ascend (everything about Ascend support)
