
add afmoe model support #1395

Closed
CodeMan62 wants to merge 11 commits into PrimeIntellect-ai:main from CodeMan62:afmoe-support

Conversation

@CodeMan62
CodeMan62 commented Dec 7, 2025

This PR adds support for the afmoe model to prime-rl.

GitHub Issue: #1343
Linear Issue: Resolves N/A


Note

Introduce afmoe CausalLM model (config, modeling, MoE routing) with HF↔Prime state dict converters, register it in auto-mapping, and add unit tests.

  • Models:
    • Add afmoe package with AfMoeConfig, AfMoeModel, AfMoeForCausalLM, and AfMoePreTrainedModel implementing MoE (token-choice routing, shared experts), rotary embeddings, and attention backends.
    • Implement state dict converters convert_hf_to_tt_moe/convert_tt_to_hf_moe and per-layer variants to translate HF ↔ Prime formats.
    • Register "afmoe" with AutoConfig and map AfMoeConfigAfMoeForCausalLM in AutoModelForCausalLMPrimeRL.
  • Tests:
    • Add GPU unit tests validating attention-only, MLP/MoE-only, full forward/grad parity, HF↔Prime conversion round-trip, and dense vs MoE layer placement.
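
For context, the registration step above roughly amounts to the following. This is a hedged sketch: the AfMoe class names follow the PR, but the import paths and the AutoModelForCausalLMPrimeRL.register call are assumptions modeled on the transformers auto-class pattern rather than the exact prime-rl API.

from transformers import AutoConfig

# Hypothetical import paths for the classes added in this PR.
from prime_rl.models.afmoe import AfMoeConfig, AfMoeForCausalLM
from prime_rl.models.auto import AutoModelForCausalLMPrimeRL

# Make model_type="afmoe" resolvable via AutoConfig, then map the config class
# to the CausalLM implementation in the prime-rl auto-mapping.
AutoConfig.register("afmoe", AfMoeConfig)
AutoModelForCausalLMPrimeRL.register(AfMoeConfig, AfMoeForCausalLM)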

Written by Cursor Bugbot for commit 4941b3b. This will update automatically on new commits.

self.layer_types = layer_types
if num_key_value_heads is None:
    self.num_key_value_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads


Bug: Unconditional assignment overwrites conditional default value

The num_key_value_heads assignment logic is broken. Line 201 correctly sets self.num_key_value_heads = num_attention_heads when num_key_value_heads is None, but line 202 unconditionally overwrites it with the original num_key_value_heads parameter (which is still None). This results in config.num_key_value_heads being None instead of defaulting to num_attention_heads, which will cause errors when building the attention layers.
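
A minimal sketch of the fix the comment describes, assuming the standard HF config pattern (not the exact merged code):

if num_key_value_heads is None:
    # Fall back to multi-head attention when no GQA head count is provided.
    num_key_value_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads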


        model_config=config,
    )
    self.rotary_emb = RotaryEmbedding(rotary_config)
    self.gradient_checkpointing = False


Bug: Rotary embedding only initialized in else branch

The self.rotary_emb and self.gradient_checkpointing are only initialized inside the else block when rope_scaling is not a dict. When config.rope_scaling is a dictionary (specifying custom rope parameters), these attributes won't be created, causing an AttributeError when forward() calls self.rotary_emb() on line 228. Comparing with glm4_moe, llama, and qwen3_moe implementations shows the RotaryEmbeddingConfig and subsequent initialization should be outside the else block.
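
A minimal sketch of the fix, assuming the structure described above; the RotaryEmbeddingConfig fields are illustrative assumptions, not the exact prime-rl signature:

# Build the rotary config in either branch, but set the shared attributes
# unconditionally so forward() can always call self.rotary_emb().
if isinstance(config.rope_scaling, dict):
    rotary_config = RotaryEmbeddingConfig(model_config=config, **config.rope_scaling)
else:
    rotary_config = RotaryEmbeddingConfig(model_config=config)

self.rotary_emb = RotaryEmbedding(rotary_config)
self.gradient_checkpointing = False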


@mikasenghaas
Member

Nice, thanks! Did you do any small SFT/RL sanity checks?

@CodeMan62
Author

I have tested all changes on Colab's T4 GPU.

@CodeMan62
Author

@mikasenghaas can you tell me if there is anything else I have to do here?

@samsja
Member

samsja commented Dec 10, 2025

@mikasenghaas can you tell me if there is anything else I have to do here?

The PR looks good; we will do some testing internally before merging it. Highly appreciate the work, and we will try to merge it ASAP.

Member

@Jackmin801 left a comment


Thanks for the PR! The modeling code LGTM. If it passes the test against the HF one, I think it should be good to merge.

CodeMan62 requested a review from Jackmin801, December 11, 2025 09:50
@Jackmin801
Member

Ah, you need the custom config from AutoConfig.from_pretrained('arcee-ai/Trinity-Mini', trust_remote_code=True) so it loads the custom model implementation. One way to get it to work right now is to load the config from arcee-ai/Trinity-Mini and change the n_layers and num_experts to make it smaller for the unit tests.

We can also wait for the next transformers release and bump the transformers version.
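
A hedged sketch of that workaround; the n_layers and num_experts attribute names come from the comment above, and the shrunken values are arbitrary:

from transformers import AutoConfig

# Load the custom AfMoe config shipped with the checkpoint, then shrink it for unit tests.
config = AutoConfig.from_pretrained("arcee-ai/Trinity-Mini", trust_remote_code=True)
config.n_layers = 2     # fewer layers keeps the test model small
config.num_experts = 4  # fewer experts keeps the MoE layers small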

@CodeMan62
Author

CodeMan62 commented Dec 12, 2025

Let me do this; we can change it in another patch when the next transformers release is out.

@CodeMan62
Author

@Jackmin801 please take a look

@Jackmin801
Member

@CodeMan62 Can you make sure that uv run pytest -vs tests/unit/train/models/test_afmoe.py works? Right now there are some config attribute mismatch issues, and I believe if you solve those, there will then be state dict issues, as the models have different norms and MoE params.

@CodeMan62
Author

Let's wait for the next transformers release. Thanks for the review, @Jackmin801.

CodeMan62 closed this Jan 12, 2026
CodeMan62 deleted the afmoe-support branch January 31, 2026 05:00