[model]feat: add NPU support for Qwen3.5 by yanghw116 · Pull Request #628 · ByteDance-Seed/VeOmni

yanghw116 · 2026-04-02T06:36:18Z

What does this PR do?

Adapts the Qwen3.5 model to Ascend NPU cards (includes the changes from piyifan/kernel-patch-exp).

Checklist Before Starting

Search for related PRs/issues and link here: ...
PR title follows [{modules}] {type}: {description} format (see check_pr_title.yml for the full list of allowed modules and types)
- Breaking changes: prepend [BREAKING] — e.g. [BREAKING][parallel, model] feat: dynamic batching

Test

Validation results (training curves, eval metrics) for changes not covered by CI.

API and Usage Example

Show API changes and usage examples if applicable.

Design & Code Changes

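Following the GPU patch approach, this PR adds NPU patch code.
Main changes:
1. Custom operators are registered through OpSlot.
2. fused_moe_forward is registered by checking the hardware type (registering it through the patch path, as the other operators do, raises an error); all other operators are registered when the patch is executed.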
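The two registration styles described above can be sketched as follows. This is a minimal illustration only: KernelSlot, detect_device, apply_npu_patch, scale_op and the dummy kernels are hypothetical names invented for this example, and the real OpSlot interface in VeOmni may look quite different.

```python
from typing import Callable, Dict, Optional


class KernelSlot:
    """Hypothetical stand-in for an OpSlot: one named op, several implementations."""

    def __init__(self, op_name: str) -> None:
        self.op_name = op_name
        self._impls: Dict[str, Callable] = {}
        self._active: Optional[str] = None

    def register(self, impl_name: str, fn: Callable) -> None:
        self._impls[impl_name] = fn

    def select(self, impl_name: str) -> None:
        if impl_name not in self._impls:
            raise KeyError(f"{self.op_name}: no implementation named {impl_name!r}")
        self._active = impl_name

    def __call__(self, *args, **kwargs):
        if self._active is None:
            raise RuntimeError(f"{self.op_name}: no implementation selected")
        return self._impls[self._active](*args, **kwargs)


def detect_device() -> str:
    """Placeholder hardware probe (assumption); a real one would query the runtime."""
    return "npu"


def moe_forward_gpu(tokens):
    return [("gpu", t) for t in tokens]  # dummy kernel


def moe_forward_npu(tokens):
    return [("npu", t) for t in tokens]  # dummy kernel


# Style 1: pick the implementation by checking the hardware type up front.
fused_moe_forward = KernelSlot("fused_moe_forward")
fused_moe_forward.register("gpu", moe_forward_gpu)
fused_moe_forward.register("npu", moe_forward_npu)
fused_moe_forward.select(detect_device())

# Style 2: the implementation is registered only when a patch function runs.
scale_op = KernelSlot("scale_op")


def apply_npu_patch() -> None:
    scale_op.register("npu", lambda xs: [x * 2 for x in xs])  # dummy kernel
    scale_op.select("npu")


apply_npu_patch()
print(fused_moe_forward([1, 2]))
print(scale_op([1, 2]))
```

In this sketch both styles end up going through the same slot object; the only difference is when the implementation is chosen, which is the distinction the description draws between fused_moe_forward and the other operators.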

Checklist Before Submitting

Read the Contribute Guide
Applied pre-commit checks
Added/updated documentation
If tasks/ training scripts were moved or renamed: updated docs/ examples and verified python3 scripts/ci/check_doc_task_paths.py passes (also enforced by the Check doc task paths CI workflow)
Added tests to CI workflow (or explained why not feasible)

# Conflicts: # veomni/ops/__init__.py

Resolve conflicts in qwen3_5_moe patch gen config by combining OpSlot-based dispatch (branch) with new imports, model init, and dummy vars from main (ByteDance-Seed#602). Regenerated generated files from merged config. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CLAassistant · 2026-04-02T06:36:33Z

All committers have signed the CLA.

gemini-code-assist

Code Review

This pull request introduces a unified kernel registry and dispatch mechanism using 'OpSlot' to replace fragmented kernel selection methods (like environment variables and monkey-patching) across VeOmni. It adds support for Qwen3.5 MoE models on NPU, implements sequence parallel (SP) optimizations for multimodal inputs, and provides a structured way to register and resolve kernel implementations. My feedback identifies a critical issue in the NPU cross-entropy kernel registration, documentation errors in configuration help strings, a performance concern regarding MoE expert linear projections on NPU, and an inconsistency in the load-balancing loss dispatch mechanism between GPU and NPU paths.

gemini-code-assist · 2026-04-02T06:41:14Z

veomni/ops/fused_cross_entropy/__init__.py

        if os.environ.get("VEOMNI_ENABLE_CHUNK_LOSS", "0") == "1":
            LOSS_MAPPING["ForCausalLM"] = chunk_loss_function
        _cross_entropy = eager_cross_entropy
+        register_npu_kernel_cross_entropy(eager_cross_entropy)


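The NPU fused cross-entropy kernel is registered incorrectly. The current code calls register_npu_kernel_cross_entropy(eager_cross_entropy) in apply_veomni_loss_patch, so the kernel named npu_fused actually runs the unoptimized eager version, which contradicts both the kernel's name and its design goal.

As a result, users who set cross_entropy_loss_implementation: npu_fused in their config will not get the expected performance improvement.

The optimized loss function on NPU appears to be chunk_loss_function. Please change the registration logic so that chunk_loss_function (or another correct NPU fused function) is registered as the npu_fused kernel.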

Suggested change

- register_npu_kernel_cross_entropy(eager_cross_entropy)
+ register_npu_kernel_cross_entropy(chunk_loss_function)
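The mismatch the review describes (a kernel name that promises a fused implementation but resolves to the eager one) can be caught mechanically. The sketch below is a toy check under stated assumptions: the dictionary registries, the stub functions and find_mislabelled are all hypothetical and do not reflect VeOmni's actual registry or the signature of register_npu_kernel_cross_entropy.

```python
from typing import Callable, Dict, List, Set


def eager_cross_entropy_stub() -> str:
    return "eager"  # stands in for an unoptimized loss


def fused_cross_entropy_stub() -> str:
    return "fused"  # stands in for an optimized NPU loss


def find_mislabelled(registry: Dict[str, Callable], eager_fns: Set[Callable]) -> List[str]:
    """Return kernel names containing 'fused' that map to a known eager function."""
    return sorted(
        name for name, fn in registry.items() if "fused" in name and fn in eager_fns
    )


# Registry as the review describes it: 'npu_fused' points at the eager function.
buggy = {"eager": eager_cross_entropy_stub, "npu_fused": eager_cross_entropy_stub}
# Registry after a fix along the lines of the suggested change.
fixed = {"eager": eager_cross_entropy_stub, "npu_fused": fused_cross_entropy_stub}

print(find_mislabelled(buggy, {eager_cross_entropy_stub}))  # ['npu_fused']
print(find_mislabelled(fixed, {eager_cross_entropy_stub}))  # []
```

A check of this kind could run in a unit test so that a config value such as npu_fused cannot silently fall back to an eager path.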

veomni/arguments/arguments_types.py

veomni/models/transformers/qwen3_5_moe/qwen3_5_moe_npu_patch_gen_config.py

piyifan123 and others added 17 commits March 16, 2026 06:40

suvey transformers v5

66afd79

rmsnorm gated

30a0422

init design

7ad6f8a

refine design

3d17fc5

interface update

2ab40a4

v1 impl

b280974

update for qwen3

b70ee7a

test_models_patch

08bc134

fix e2e tests

ccc606d

Merge remote-tracking branch 'origin/main' into piyifan/kernel-patch-exp

d1cecbf

# Conflicts: # veomni/ops/__init__.py

fix ci

53db75b

ruff format

e7ab4fb

simplify moe arg 1

60e33d0

new register

6ba40d2

fix original loss

f532e2d

fix patch gen

343af1f

github-actions bot added the ascend label (everything about Ascend support) on Apr 2, 2026

gemini-code-assist bot reviewed Apr 2, 2026


yanghw116 force-pushed the main branch from d3584f1 to 0c8e33c Compare April 3, 2026 02:30

yanghw116 added 2 commits April 4, 2026 12:33

Merge remote-tracking branch 'origin/main' into qwen3.5-npu

7722edf

add NPU support for Qwen3.5

a2ba551

yanghw116 force-pushed the main branch 3 times, most recently from fd33fd6 to d2acbf8 Compare April 9, 2026 05:01

yanghw116 added 2 commits April 9, 2026 13:10

revise codes

e103262

add apply_rotary_pos_emb to OpSlot

508306c

yanghw116 force-pushed the main branch from d2acbf8 to 508306c Compare April 9, 2026 05:11

yanghw116 wants to merge 21 commits into ByteDance-Seed:main from yanghw116:main

3 participants
