[model] feat: add NPU support for Qwen3.5 #628
yanghw116 wants to merge 21 commits into ByteDance-Seed:main from
# Conflicts: # veomni/ops/__init__.py
Resolve conflicts in qwen3_5_moe patch gen config by combining OpSlot-based dispatch (branch) with new imports, model init, and dummy vars from main (ByteDance-Seed#602). Regenerated generated files from merged config. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code Review
This pull request introduces a unified kernel registry and dispatch mechanism using 'OpSlot' to replace fragmented kernel selection methods (like environment variables and monkey-patching) across VeOmni. It adds support for Qwen3.5 MoE models on NPU, implements sequence parallel (SP) optimizations for multimodal inputs, and provides a structured way to register and resolve kernel implementations. My feedback identifies a critical issue in the NPU cross-entropy kernel registration, documentation errors in configuration help strings, a performance concern regarding MoE expert linear projections on NPU, and an inconsistency in the load-balancing loss dispatch mechanism between GPU and NPU paths.
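To make the registry-and-dispatch idea in this summary concrete, here is a minimal sketch of a slot that maps implementation names to callables. The class shape, method names, and the toy `add` operator are all assumptions for illustration; the thread does not show VeOmni's actual `OpSlot` API.

```python
from typing import Callable, Dict


class OpSlot:
    """Hypothetical slot holding the named implementations of one operator."""

    def __init__(self, op_name: str) -> None:
        self.op_name = op_name
        self._impls: Dict[str, Callable] = {}

    def register(self, impl_name: str, fn: Callable) -> None:
        # Refuse silent overwrites so two backends cannot claim the same name.
        if impl_name in self._impls:
            raise ValueError(f"{self.op_name}: '{impl_name}' is already registered")
        self._impls[impl_name] = fn

    def resolve(self, impl_name: str) -> Callable:
        # Fail loudly on unknown names instead of falling back implicitly.
        try:
            return self._impls[impl_name]
        except KeyError:
            known = ", ".join(sorted(self._impls)) or "<none>"
            raise KeyError(
                f"{self.op_name}: unknown implementation '{impl_name}' (known: {known})"
            ) from None


def eager_add(a, b):
    """Reference implementation."""
    return a + b


def fused_add(a, b):
    """Stand-in for an optimized kernel; same result as the reference."""
    return a + b


add_slot = OpSlot("add")
add_slot.register("eager", eager_add)
add_slot.register("fused", fused_add)
```

With a structure like this, a config string such as `"fused"` selects the kernel through `add_slot.resolve("fused")`, rather than through an environment variable or a monkey-patch.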
if os.environ.get("VEOMNI_ENABLE_CHUNK_LOSS", "0") == "1":
    LOSS_MAPPING["ForCausalLM"] = chunk_loss_function
_cross_entropy = eager_cross_entropy
register_npu_kernel_cross_entropy(eager_cross_entropy)
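The NPU fused cross-entropy kernel is registered incorrectly. apply_veomni_loss_patch currently calls register_npu_kernel_cross_entropy(eager_cross_entropy), so the kernel named npu_fused actually runs the unoptimized eager version, which contradicts both the kernel's name and its design goal.
As a result, users who set cross_entropy_loss_implementation: npu_fused in their config will not get the expected performance improvement.
The optimized loss function on NPU appears to be chunk_loss_function. Please change the registration logic so that chunk_loss_function (or another correct NPU fused function) is registered as the npu_fused kernel.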
- register_npu_kernel_cross_entropy(eager_cross_entropy)
+ register_npu_kernel_cross_entropy(chunk_loss_function)
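A small identity check could catch this kind of mismatch. The sketch below assumes a plain dict from implementation name to callable. The two loss functions are stand-ins that only borrow their names from the diff, and `find_mislabeled` is a hypothetical helper, not part of VeOmni.

```python
from typing import Callable, Dict, List


def eager_cross_entropy(*args, **kwargs):
    """Stand-in for the unoptimized reference implementation."""
    return "eager"


def chunk_loss_function(*args, **kwargs):
    """Stand-in for the optimized NPU loss the review comment points to."""
    return "chunked"


def find_mislabeled(registry: Dict[str, Callable], fallback: Callable) -> List[str]:
    """Return the names other than 'eager' that resolve to the eager fallback."""
    return sorted(
        name for name, fn in registry.items()
        if name != "eager" and fn is fallback
    )


# Registration as in the reviewed code: 'npu_fused' silently points at eager.
before = {"eager": eager_cross_entropy, "npu_fused": eager_cross_entropy}
# Registration as in the suggested change.
after = {"eager": eager_cross_entropy, "npu_fused": chunk_loss_function}
```

Here `find_mislabeled(before, eager_cross_entropy)` flags `npu_fused`, while the same call on `after` returns an empty list.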
veomni/models/transformers/qwen3_5_moe/qwen3_5_moe_npu_patch_gen_config.py
fd33fd6 to d2acbf8
What does this PR do?
Adapts the Qwen3.5 model to Ascend NPU cards (also includes the changes from piyifan/kernel-patch-exp).
Checklist Before Starting
[{modules}] {type}: {description} format (see check_pr_title.yml for the full list of allowed modules and types)
[BREAKING], e.g. [BREAKING][parallel, model] feat: dynamic batching
Test
API and Usage Example
Design & Code Changes
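Adds NPU patch code, following the GPU patch approach.
Main changes:
1. Custom operators are registered through OpSlot.
2. fused_moe_forward is registered by checking the hardware type (registering it through the patch approach used for the other operators raises an error); all other operators are registered by executing the patch.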
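Item 2 separates a hardware-check path from the patch path. The sketch below shows roughly what the hardware-check path might look like, assuming NPU presence can be approximated by whether the `torch_npu` package is importable. Both forward functions and `select_fused_moe_forward` are placeholders for illustration, not code from this PR.

```python
import importlib.util
from typing import Callable, Optional


def npu_available() -> bool:
    # Assumption: an importable torch_npu package means we are on Ascend NPU.
    return importlib.util.find_spec("torch_npu") is not None


def gpu_fused_moe_forward(*args, **kwargs):
    """Placeholder for the GPU implementation."""
    return "gpu"


def npu_fused_moe_forward(*args, **kwargs):
    """Placeholder for the NPU implementation."""
    return "npu"


def select_fused_moe_forward(on_npu: Optional[bool] = None) -> Callable:
    """Choose the implementation once, at registration time, from the hardware."""
    if on_npu is None:
        on_npu = npu_available()
    return npu_fused_moe_forward if on_npu else gpu_fused_moe_forward


# Resolved once at import, so no later patch step has to replace it.
fused_moe_forward = select_fused_moe_forward()
```

The optional `on_npu` argument lets a test exercise the selection on either branch without the hardware present.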
Checklist Before Submitting
tasks/training scripts were moved or renamed: updated docs/examples and verified python3 scripts/ci/check_doc_task_paths.py passes (also enforced by the Check doc task paths CI workflow)