[model]feat: add NPU support for Qwen3.5#628

Open
yanghw116 wants to merge 21 commits into ByteDance-Seed:main from yanghw116:main

Conversation

@yanghw116

What does this PR do?

Adapts the Qwen3.5 model to Ascend NPU cards (includes the changes from piyifan/kernel-patch-exp).

Checklist Before Starting

  • Search for related PRs/issues and link here: ...
  • PR title follows [{modules}] {type}: {description} format (see check_pr_title.yml for the full list of allowed modules and types)
    • Breaking changes: prepend [BREAKING] — e.g. [BREAKING][parallel, model] feat: dynamic batching

Test

Validation results (training curves, eval metrics) for changes not covered by CI.

API and Usage Example

Show API changes and usage examples if applicable.

Design & Code Changes

Following the GPU patch approach, this PR adds NPU patch code.
Main changes:
1. Custom operators are registered via OpSlot.
2. fused_moe_forward is registered by detecting the hardware (registering it by executing the patch raises an error); the other operators are registered by executing the patch.

Checklist Before Submitting

  • Read the Contribute Guide
  • Applied pre-commit checks
  • Added/updated documentation
  • If tasks/ training scripts were moved or renamed: updated docs/ examples and verified python3 scripts/ci/check_doc_task_paths.py passes (also enforced by the Check doc task paths CI workflow)
  • Added tests to CI workflow (or explained why not feasible)

piyifan123 and others added 17 commits March 16, 2026 06:40
Resolve conflicts in qwen3_5_moe patch gen config by combining OpSlot-based
dispatch (branch) with new imports, model init, and dummy vars from main (ByteDance-Seed#602).
Regenerated generated files from merged config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the ascend everything about Ascend support label Apr 2, 2026
@CLAassistant

CLAassistant commented Apr 2, 2026

CLA assistant check
All committers have signed the CLA.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a unified kernel registry and dispatch mechanism using 'OpSlot' to replace fragmented kernel selection methods (like environment variables and monkey-patching) across VeOmni. It adds support for Qwen3.5 MoE models on NPU, implements sequence parallel (SP) optimizations for multimodal inputs, and provides a structured way to register and resolve kernel implementations. My feedback identifies a critical issue in the NPU cross-entropy kernel registration, documentation errors in configuration help strings, a performance concern regarding MoE expert linear projections on NPU, and an inconsistency in the load-balancing loss dispatch mechanism between GPU and NPU paths.

```python
if os.environ.get("VEOMNI_ENABLE_CHUNK_LOSS", "0") == "1":
    LOSS_MAPPING["ForCausalLM"] = chunk_loss_function
    _cross_entropy = eager_cross_entropy
register_npu_kernel_cross_entropy(eager_cross_entropy)
```

critical

The NPU fused cross-entropy kernel registration is incorrect. The current code calls register_npu_kernel_cross_entropy(eager_cross_entropy) inside apply_veomni_loss_patch, so the kernel named npu_fused actually uses the unoptimized eager implementation, contradicting both the kernel's name and the design intent.

As a result, users who set cross_entropy_loss_implementation: npu_fused in their configuration will not get the expected performance gain.

The NPU-optimized loss function appears to be chunk_loss_function. Please change the registration logic to register chunk_loss_function (or whichever NPU fused function is correct) as the npu_fused kernel.

Suggested change

```diff
- register_npu_kernel_cross_entropy(eager_cross_entropy)
+ register_npu_kernel_cross_entropy(chunk_loss_function)
```
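A minimal sketch of the corrected flow, assuming the module layout implied by the snippet under review; the registry dict and function bodies here are illustrative stand-ins, not VeOmni's real implementations.

```python
import os

# Illustrative stand-ins for VeOmni's actual structures.
LOSS_MAPPING = {}
NPU_KERNELS = {}

def eager_cross_entropy(logits, labels):
    return "eager"  # placeholder body

def chunk_loss_function(logits, labels):
    return "chunked"  # placeholder body

def register_npu_kernel_cross_entropy(fn):
    # Records `fn` as the `npu_fused` cross-entropy implementation.
    NPU_KERNELS["npu_fused"] = fn

def apply_veomni_loss_patch():
    if os.environ.get("VEOMNI_ENABLE_CHUNK_LOSS", "0") == "1":
        LOSS_MAPPING["ForCausalLM"] = chunk_loss_function
    # Register the optimized chunked loss, not the eager fallback, so that
    # `cross_entropy_loss_implementation: npu_fused` yields a real speedup.
    register_npu_kernel_cross_entropy(chunk_loss_function)
```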

@yanghw116 yanghw116 force-pushed the main branch 3 times, most recently from fd33fd6 to d2acbf8 Compare April 9, 2026 05:01