ROCm: default GPT-OSS to BF16 and disable AITER #4021
danielhanchen wants to merge 3380 commits into main from
Conversation
[pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update _utils.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* [FIX] [Transformers] VLM input embeds fix for gradients (#3715)
* Fix get_input_embeds call for VLMs
* patch input_require_grads instead
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* cleanup old patch
* cleanup old patch
* cleanup
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Apply suggestion from @danielhanchen
* use logger instead of prints
* Move unsloth present set
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <[email protected]>
* Update rope_embedding.py
* Fixes
* Update _utils.py
* Update import_fixes.py
* Update rl_replacements.py
* fix_openenv_no_vllm
* Fix
* Update __init__.py
* Update __init__.py
* Update __init__.py
* Update import_fixes.py
* Update import_fixes.py
* Update import_fixes.py
* logger
* Update __init__.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Update __init__.py
* Update import_fixes.py
* Update __init__.py
* Update import_fixes.py
* Update import_fixes.py
* Update import_fixes.py
* Update import_fixes.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Update import_fixes.py
* Update unsloth/import_fixes.py (Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>)
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Datta Nimmaturi <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Silence fbgemm TMA print (also safer .push_to_hub)
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
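The "patch input_require_grads instead" commit above refers to a common fix for training VLMs with frozen embeddings (e.g. LoRA plus gradient checkpointing): instead of editing each model's get_input_embeds call site, hook the input embedding layer so its output requires grad. A minimal sketch, assuming a model that exposes get_input_embeddings(); the helper names here are illustrative, not the exact unsloth code:

```python
import torch

def make_inputs_require_grad(module, inputs, output):
    # Forward hook: flip requires_grad on the embedding OUTPUT so there
    # is an autograd path even when the embedding weights are frozen.
    output.requires_grad_(True)

def patch_input_require_grads(model):
    # Hook once on the input embedding layer; fires on every forward.
    embed = model.get_input_embeddings()
    return embed.register_forward_hook(make_inputs_require_grad)
```

This mirrors the behavior of transformers' enable_input_require_grads, which registers an equivalent hook.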
Validation update for ROCm:
NVIDIA compatibility notes:
Recheck update for PR #4021 against current ROCm notebook reruns. Re-evaluated items
Findings
Action
Follow-up adjustment: HIP GPT-OSS routing is now capability-based, not blanket BF16.

What changed
Logic
Why
Validation (ROCm)
Note
Commit
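The capability-based routing described above can be sketched roughly as follows. The helper names and the mapping from gfx arch to MXFP4 support are my assumptions for illustration, not the actual unsloth logic:

```python
def supports_mxfp4(gcn_arch: str) -> bool:
    """Assumed rule: only CDNA3-class accelerators (MI300X gfx942,
    MI355X gfx950) can use the prequantized MXFP4 GPT-OSS checkpoints;
    everything else falls back to BF16."""
    cdna3_plus = ("gfx942", "gfx950")
    return gcn_arch.startswith(cdna3_plus)

def pick_gpt_oss_dtype(gcn_arch: str) -> str:
    # Capability-gated instead of a blanket BF16 default on HIP:
    # BF16 only where MXFP4 is unsupported.
    return "mxfp4" if supports_mxfp4(gcn_arch) else "bf16"
```

On ROCm builds of PyTorch the arch string would typically come from `torch.cuda.get_device_properties(0).gcnArchName`.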
Recheck complete. I cleaned up the PR to remove unintended fallback behavior that made the diff look botched.

What I removed
- HIP-specific fallback in that silently downgraded when is missing.
- HIP-specific behavior in that auto-dropped / and auto-tokenized string inputs.

What remains intentionally in this PR
- : ROCm AITER defaults via so explicit user env overrides still win.
- : capability-gated GPT-OSS BF16 fallback using + (Radeon can still use prequantized checkpoints where supported).
- : defensive Trainer init patch guard + TorchAO with TypeError fallback for older TorchAO.

Commit pushed
- 

I will continue the AMD notebook reruns and keep logging outcomes in and .
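The "defaults ... so explicit user env overrides still win" item above suggests a setdefault-style pattern. A sketch of that pattern, assuming environment variables are the mechanism; the variable name below is purely illustrative, not the one used in the PR:

```python
import os

def apply_rocm_aiter_defaults(env=os.environ):
    # Defaults the library wants on ROCm; illustrative name only.
    defaults = {
        "AITER_ENABLED": "0",  # assumed knob: disable AITER by default
    }
    for key, value in defaults.items():
        # setdefault is a no-op when the user already exported the
        # variable, so explicit user overrides always win.
        env.setdefault(key, value)
    return env
```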
Applied a fresh cleanup/refactor for the duplicated HIP GPT-OSS routing block.

What changed
Behavior
Validation
Commit
|
Follow-up cleanup applied:
Safety note
Validation
Commit
Quick status update from the latest ROCm notebook cycle.
I will continue the remaining notebook sweep and report any
Added follow-up AMD notebook stability fixes:

- Commit: 
- Files:
  - 
  - 
  - 

What changed:
- Added definitions for HIP/XPU branches so downstream checks have a stable callable.
- Hooked Deepseek OCR patch invocation in with a guarded import/call path.
- Added an offline GGUF guard path in () so Ollama notebook flows do not fail hard when GGUF conversion dependencies are unavailable.

Validation evidence (ROCm runs):
- Deepseek OCR fix path: (fail) -> (success).
- Ollama flow: (success).
- Full notebook latest status remains green in after this cycle (all 25 tracked notebooks currently latest=SUCCESS).
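The "guarded import/call path" and offline GGUF guard described above follow a standard shape: attempt the optional import, and degrade to a warning instead of a hard failure when it is unavailable. A sketch under those assumptions; `save_with_gguf_guard` and the `gguf` dependency check are hypothetical stand-ins, not the PR's actual function names:

```python
import logging

logger = logging.getLogger("unsloth")  # logger name is illustrative

def save_with_gguf_guard(save_fn, *args, **kwargs):
    # The normal (non-GGUF) save always runs first.
    save_fn(*args, **kwargs)
    try:
        # Hypothetical dependency probe: offline environments or missing
        # packages land in the except branch instead of crashing.
        from gguf import GGUFWriter  # noqa: F401
    except ImportError:
        logger.warning("GGUF dependencies unavailable; skipping GGUF export.")
        return False
    # ... GGUF conversion would happen here ...
    return True
```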
Additional follow-up pushed:
Change:
Why this was added:
Note on PR #4029:
Superseding my prior malformed CLI comment with corrected content. Re-review update with fresh A/B checks on ROCm:
Additional note from sanity run:
MI355X (gfx950) has the same 1024-thread workgroup limit as MI300X (gfx942), but was missing from is_cdna(), causing all Triton kernels to use num_warps=32 (2048 threads) instead of 16 (1024 threads), resulting in an OutOfResources crash. Also includes the ROCm GPT-OSS BF16 routing and dequant buffer dtype fix from PR unslothai#4021 by @danielhanchen, cherry-picked for MI355X validation.

Tested on: 8x AMD Instinct MI355X (gfx950), ROCm 7.1
- Vision RL GRPO (Qwen2.5-VL-7B): 5/5 steps
- Code RL GRPO (gpt-oss-20b BF16): 20/20 steps
- gpt-oss-120b GRPO: 5/5 steps (B200 OOM'd on this)
- MoE expert LoRA + save_pretrained_merged: success
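The arithmetic in the comment above is worth making explicit: with AMD's 64-lane wavefronts, num_warps=32 requests 32 × 64 = 2048 threads, which exceeds the 1024-thread workgroup limit on gfx942/gfx950. A minimal sketch of the fix; the arch list and helper names are assumptions based on this comment, not the exact repository code:

```python
# gfx950 (MI355X) was the entry missing from the CDNA check.
CDNA_ARCHS = ("gfx908", "gfx90a", "gfx942", "gfx950")

def is_cdna(gcn_arch: str) -> bool:
    return gcn_arch.startswith(CDNA_ARCHS)

def max_num_warps(gcn_arch: str, wavefront_size: int = 64) -> int:
    # CDNA accelerators cap workgroups at 1024 threads, so with 64-lane
    # wavefronts num_warps must not exceed 1024 // 64 = 16; requesting
    # 32 (2048 threads) triggers Triton's OutOfResources error.
    workgroup_limit = 1024 if is_cdna(gcn_arch) else 2048
    return workgroup_limit // wavefront_size
```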
W7900 (gfx1100, RDNA3) Validation

Environment: ROCm 7.1 | PyTorch 2.8.0+rocm7.1 | Triton 3.4.0

@danielhanchen Tested the AITER disable logic on W7900 — confirmed AITER is not available on RDNA3.

Re: the GPT-OSS BF16 default — on RDNA3, MXFP4 is not supported (no HW matrix engine), so defaulting to BF16 is the correct choice.

Also note: we have submitted #4109 (compile disable for Gemma3 on HIP) and #4110 (RMS LayerNorm eps float32 fix), which complement this PR for broader RDNA stability.
Force-pushed: 734649e to a147b56
Summary
Testing