add afmoe model support #1395
Conversation
```python
self.layer_types = layer_types
if num_key_value_heads is None:
    self.num_key_value_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
```
Bug: Unconditional assignment overwrites conditional default value
The num_key_value_heads assignment logic is broken. Line 201 correctly sets self.num_key_value_heads = num_attention_heads when num_key_value_heads is None, but line 202 unconditionally overwrites it with the original num_key_value_heads parameter (which is still None). This results in config.num_key_value_heads being None instead of defaulting to num_attention_heads, which will cause errors when building the attention layers.
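A minimal sketch of the corrected default handling, assuming a transformers-style config `__init__` (the signature and default head count below are illustrative, not the PR's actual values):

```python
from transformers import PretrainedConfig


class AfMoeConfig(PretrainedConfig):
    """Sketch of the relevant part of the config; other fields omitted."""

    model_type = "afmoe"

    def __init__(self, num_attention_heads=32, num_key_value_heads=None, **kwargs):
        super().__init__(**kwargs)
        self.num_attention_heads = num_attention_heads
        # Fall back to one KV head per attention head only when the caller
        # did not pass num_key_value_heads; do not overwrite it afterwards.
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads
        self.num_key_value_heads = num_key_value_heads
```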
```python
    model_config=config,
)
self.rotary_emb = RotaryEmbedding(rotary_config)
self.gradient_checkpointing = False
```
Bug: Rotary embedding only initialized in else branch
self.rotary_emb and self.gradient_checkpointing are only initialized inside the else block, i.e. when rope_scaling is not a dict. When config.rope_scaling is a dictionary (specifying custom rope parameters), these attributes are never created, causing an AttributeError when forward() calls self.rotary_emb() on line 228. Comparing with the glm4_moe, llama, and qwen3_moe implementations shows that the RotaryEmbeddingConfig construction and the subsequent initialization should be outside the else block.
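A minimal sketch of the restructured `__init__` fragment, assuming the RotaryEmbeddingConfig/RotaryEmbedding helpers from the quoted diff; how the rope_scaling dict is threaded into the config is an assumption here — the point is that the embedding is built in both branches:

```python
# Sketch of the fix inside the model's __init__ (fragment, not a full class).
if isinstance(config.rope_scaling, dict):
    # Custom rope parameters from the config dict (exact handling is model-specific).
    rotary_config = RotaryEmbeddingConfig(model_config=config, **config.rope_scaling)
else:
    rotary_config = RotaryEmbeddingConfig(model_config=config)

# These must exist regardless of rope_scaling, since forward() calls self.rotary_emb().
self.rotary_emb = RotaryEmbedding(rotary_config)
self.gradient_checkpointing = False
```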
nice, thanks! did you do any small sft/ rl sanity checks?
I have tested all changes on Colab's T4 GPU.
@mikasenghaas can you tell me if there is anything else I have to do here?
The PR looks good; we will do some testing internally before merging it. Highly appreciate the work and we will try to merge it ASAP.
Jackmin801 left a comment
Thanks for the PR! The modeling code LGTM. If it passes the test against the HF one, I think it should be good to merge.
Force-pushed from 1ffa8b5 to f9810f2
Ah, you need the custom config from … We can also wait for the next transformers release and bump the transformers version.
Let me do this; we can change it in another patch when we get the next transformers release.
@Jackmin801 please take a look
@CodeMan62 Can you make sure that …
Let us wait for the next transformers release. Thanks for the review, @Jackmin801.
This PR adds support for the afmoe model to prime-rl.
GitHub Issue: #1343
Linear Issue: Resolves N/A
Note
Introduce afmoe CausalLM model (config, modeling, MoE routing) with HF↔Prime state dict converters, register it in auto-mapping, and add unit tests.
- Add an afmoe package with AfMoeConfig, AfMoeModel, AfMoeForCausalLM, and AfMoePreTrainedModel implementing MoE (token-choice routing, shared experts), rotary embeddings, and attention backends.
- Add convert_hf_to_tt_moe/convert_tt_to_hf_moe and per-layer variants to translate HF ↔ Prime formats.
- Register "afmoe" with AutoConfig and map AfMoeConfig → AfMoeForCausalLM in AutoModelForCausalLM (PrimeRL); see the registration sketch below.

Written by Cursor Bugbot for commit 4941b3b.
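For the auto-mapping bullet above, a minimal sketch of what such a registration typically looks like with the transformers auto classes; the import path prime_rl.models.afmoe is hypothetical and PrimeRL's own registry may differ:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Hypothetical import path; the real module layout is whatever the PR defines.
from prime_rl.models.afmoe import AfMoeConfig, AfMoeForCausalLM

# Make "afmoe" checkpoints resolvable through the standard auto classes.
AutoConfig.register("afmoe", AfMoeConfig)
AutoModelForCausalLM.register(AfMoeConfig, AfMoeForCausalLM)

# Usage after registration:
# config = AutoConfig.from_pretrained("path/to/afmoe-checkpoint")
# model = AutoModelForCausalLM.from_config(config)
```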