Releases: pytorch/torchtitan
v0.2.2
Dependency
PyTorch Version: torch-2.12.0.dev20260220+cu126
TorchAO Version: torchao-0.17.0.dev20260220+cu126
What's Changed
🚀 Features
- [CP] Refactor Context Parallel to use new PyTorch CP APIs (#2144) by @fegin
- [CP] Enable FlexCP for llama3 (#2145) by @fegin
- [Compiler Toolkit] Add option for full inductor (#2150) by @aditvenk
- Add docs to explain `COMM_MODE` (#2162) by @fegin
- Disable dynamo LRU cache when AC is enabled (#2204) by @soulitzer
- feat(gpt-oss): add YaRN RoPE extensions with mscale for extended context (#2216) by @eous
- [ROCm] Support mxfp8 on gfx950 (#2222) by @RuibinCheung
- Enable memory snapshot for generic devices (#2228) by @frost-intel
- GQA without kv repeats (#2259) by @francesco-bertolotti
- [rl] GQA attention enablement in torchtitan vllm wrapper (#2299) by @wwwjn
- Add peak flops for NVIDIA H20 GPUs (#2307) by @DamonFool
- Add missing `job_config.maybe_log()` calls (#2308) by @EquationWalker
- [DeepEP] Implement shared_experts overlap with deepep.combine() (#2310) by @vivekgoe
- Separate out training for fault tolerance (#2311) by @tushar00jain
- Torchtitan changes to integrate into Verl (#2333) by @acisseJZhong
- Maintain same LR schedule for early stop debug runs (#2340) by @acisseJZhong
- [Compiler Toolkit] Separate process groups for FSDP AG/RS comm overlap (#2368) by @yiming0416
- [AC] Set `preserve_rng_state=True` as default for activation checkpointing (#2380) by @soulitzer
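Among the features above, the YaRN RoPE extension (#2216) rescales attention logits when the context window is stretched. A minimal sketch of the standard YaRN attention-scaling formula, `0.1 * ln(scale) + 1.0` for scale > 1 (function name and defaults here are illustrative, not torchtitan's actual API):

```python
import math

def yarn_mscale(scale: float, mscale_coeff: float = 0.1) -> float:
    """YaRN-style attention scaling for RoPE context extension.

    With no context extension (scale <= 1) logits are left unscaled;
    otherwise they are multiplied by mscale_coeff * ln(scale) + 1.0.
    """
    if scale <= 1.0:
        return 1.0
    return mscale_coeff * math.log(scale) + 1.0

# Stretching a 4k-token model to 32k tokens is a scale factor of 8.
print(yarn_mscale(32768 / 4096))
```

The logarithmic form keeps the correction mild: even a 16x context extension scales attention logits by well under 1.3x.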
🧠 Model
- Add attention scaling to varlen for Qwen3 (#2178) by @liangel-02
- Make get TP mesh optional in Llama4 parallelize (#2185) by @danielvegamyhre
- [GPT-OSS] Graduate from experiments to main (#2203) by @shuhuayu
- [autoparallel] Update `local_map_deepseek_v3` device mesh usage (#2231) by @xmfan
- [varlen_attn] Change `is_causal` to `window_size` (#2267) by @liangel-02
- Remove `_ScaledPartial` placement (#2337) by @Aidyn-A
- Added custom `trunc_normal` (#2342) by @francesco-bertolotti
- Removed weight initialization from model `__init__` (#2361) by @francesco-bertolotti
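On the custom `trunc_normal` item above (#2342): a truncated normal draws from N(mean, std) but discards values outside [a, b], which keeps initial weights bounded. A pure-Python rejection-sampling sketch of the idea (names and defaults are illustrative, not torchtitan's implementation):

```python
import random

def trunc_normal(mean: float = 0.0, std: float = 0.02,
                 a: float = -0.04, b: float = 0.04) -> float:
    """Sample N(mean, std) truncated to [a, b] via rejection sampling."""
    while True:
        x = random.gauss(mean, std)
        if a <= x <= b:
            return x

# Initialize a small weight list with bounded random values.
weights = [trunc_normal() for _ in range(4)]
```

Rejection sampling is the simplest correct approach, though it can loop many times when [a, b] covers little probability mass; library implementations typically use inverse-CDF methods instead.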
🐛 Bug Fixes
- [docs] Fix missing `--model.flavor` flags in compiler_toolkit README (#2201) by @BryanBradfo
- Fix loss computation by handling valid token imbalance in train loop (#2206) by @wwwjn
- [MoE] Fix experts DTensor metadata bug for DCP (#2227) by @shuhuayu
- Fix sdpa-varlen attention mismatch in Qwen3 (#2229) by @francesco-bertolotti
- Weight tying fix for Qwen3 (#2253) by @francesco-bertolotti
- Fix grad norm clipping for AutoP and DSv3 model init (#2270) by @sanketpurandare
- [MoE] DeepEP refactor and fix memory leak during training and inference (#2296) by @shuhuayu
- [docs] Fix type mismatch in model layers comments (#2306) by @DamonFool
- Fix FLUX attention by exposing `is_causal` in SDPA (#2309) by @wwwjn
- Fixing `global_max_loss` computation (#2314) by @Shagun-G
- Fix the CI loss issue (#2315) by @fegin
- Fix gpt-oss implementation (MoE router gate bias + top-k renorm) (#2319) by @linyuhongg
- [SimpleFSDP] Fix HSDP placement mismatch in `_distribute_dtensor` (#2329) by @SongyuanZhao
- [Bugfix] Fix bitwise determinism after vLLM `SiluAndMul` change (#2358) by @Lucaskabela
- [Bugfix] Fix `simple_rl_multiprocess.py` to be runnable with recent vLLM version (#2359) by @Lucaskabela
- Fixing extra averaging performed in validation error (#2366) by @Shagun-G
- Bug fix: don't swallow `OutOfMemoryError` when `enable_memory_snapshot=True` (#2374) by @weifengpy
- Fix: restrict completion logging to rank 0 (#2383) by @fatih-uzlmz
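The `OutOfMemoryError` fix above (#2374) follows the classic "dump diagnostics, then re-raise" pattern: taking a memory snapshot on OOM must not swallow the exception. A sketch in plain Python (`OutOfMemoryError` and `dump_snapshot` here are stand-ins for the real torch symbols, not torchtitan's code):

```python
class OutOfMemoryError(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError."""

def dump_snapshot() -> None:
    """Placeholder for dumping an allocator memory snapshot to disk."""

def run_step(step_fn, enable_memory_snapshot: bool = True):
    """Run one training step; on OOM, snapshot memory and re-raise."""
    try:
        return step_fn()
    except OutOfMemoryError:
        if enable_memory_snapshot:
            dump_snapshot()
        raise  # bare raise keeps the original exception and traceback
```

The key detail is the bare `raise`: catching the exception only to record diagnostics, then re-raising, preserves the original traceback for the caller.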
🧪 Experiments / CI / Infra
- [Experimental][rl][vllm compat] Update simple_rl example to work with vLLM nightly (#2219) by @Lucaskabela
- [Experimental][rl][unified] Update `infer.py` example to work with vLLM nightly (#2226) by @Lucaskabela
- Add ROCm support for H100 tests (#2202) by @akashveramd
- Add ROCm CI support for simple FSDP experiments test (#2220) by @akashveramd
- Add test for DSv3 with flexattn + fsdp + ep + pp + sac op (#2234) by @shuhuayu
- Add ROCm CI support for Auto Parallel & Compiler Toolkit experiments (#2248) by @akashveramd
- Add ROCm CI support for Transformers Modeling Backend & VLM experiments (#2276) by @akashveramd
- Update CPU unit test to use `linux_job_v2` (#2287) by @joecummings
- [BE week] Disable CPU wheel builds in nightly CI (#2289) by @joecummings
- [BE][NFC] Add integration test for simplefsdp + CP deepseek_v3 (#2301) by @aditvenk
- Fixed autoparallel integration tests on ROCm (#2321) by @wenchenvincent
- [rl][ez] Squash landing import and git fixes (#2331) by @zhxchen17
- [ci] Add DSv3 SimpleFSDP `auto_bucketing` to H100 CI jobs (#2347) by @IvanKobzarev
- Bump `tj-actions/changed-files` from 47.0.1 to 47.0.2 (#2367) by @dependabot
- [CI] Disable NVLS (#2372) by @fegin
- Bump `tj-actions/changed-files` from 47.0.2 to 47.0.4 (#2390) by @dependabot
- [rl] Install vllm from pre-built wheels (#2397) by @wwwjn
- [rl/unified] Update default `model-ckpt-path` in `infer.py` to the one from README (#2405) by @daniellepintz
🔧 Typing / Lint Cleanup
- [lint] Ignore all existing pyrefly errors (#2240) by @xmfan
- [Typing] Fix CI Typing Issues (#2245) by @fegin
- [Typing] Improve `ModelProtocol` typing (#2246) by @fegin
- [Typing] Remove deprecated `enable_symm_mem_for_group` (#2260) by @fegin
- [Typing] Remove unused pyrefly ignore (#2280) by @fegin
- [Typing] Fix pyrefly-ignore in `train.py` (#2282) by @fegin
- [Typing] Fix pyrefly ignores in `checkpoint.py` (#2283) by @fegin
- [Typing] Fix the ignores in `activation_checkpoint.py` (#2284) by @fegin
- [Typing] Fix the ignores in `tokenizer.py` (#2285) by @fegin
- [Typing] Fix the ignores in `validate.py` (#2286) by @fegin
- [Typing] Fix some pyrefly ignores in `optimizer.py` (#2294) by @fegin
- [Typing] Improve typing for some distributed modules (#2295) by @fegin
- [Typing] Fix the pyrefly ignores in llama3 `model.py` (#2302) by @fegin
- [Typing] Fix pyrefly ignores in llama4 `model.py` (#2303) by @fegin
- [Typing] Fix pyrefly ignores in qwen3 `model.py` (#2304) by @fegin
- [Typing] Fix pyrefly ignores in deepseek `model.py` (#2305) by @fegin
v0.2.1
Dependency
PyTorch Version: torch-2.11.0.dev20251226+cu126
TorchAO Version: torchao-0.16.0.dev20251226+cu126
What's Changed
Features
- Use new DeviceMesh unflatten to rewrite parallel_dims by @fegin in #1660
- Re:Run Torchtitan ROCm workflow on cron schedule & push to Main branch only by @akashveramd in #2018
- adding variable length attention to llama3 8b by @liangel-02 in #2000
- [Local Tensor] Replace dry_run.py with fake mode implementation by @fegin in #2057
Model
- Enable PP and EP overlap for MoE by @H-Huang in #1721
- Integrate DeepEP to torchtitan by @elfiegg in #2107
- [MoE] Add node limited routing support by @shuhuayu in #2111
- Add Context Parallelism to Flux model training by @limou102 in #1851
- gpt-oss model enablement by @wwwjn in #1754
- [GPT-OSS] Add HF state dict adapter to support loading from HF checkpoints by @shuhuayu in #2021
Bug Fix
- [FLOPs] Fix attention FLOPs estimate by @shuhuayu in #1923
- Fix apply_compile called multiple times in PP initialization by @xmfan in #2135
- Fix qwen3 attention scaling calculation by @wwwjn in #2173
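On the Qwen3 attention-scaling fix above (#2173): SDPA's default softmax scale is `1/sqrt(head_dim)`, and this class of bug is typically a scale computed from the wrong dimension, since Qwen3-style models fix `head_dim` independently of `hidden_size // n_heads`. A minimal sketch of the default scale (illustrative, not torchtitan's code):

```python
import math

def attention_scale(head_dim: int) -> float:
    """Default scaled-dot-product-attention softmax scale: 1/sqrt(head_dim)."""
    return 1.0 / math.sqrt(head_dim)

# With 128-dimensional heads the scale is 1/sqrt(128), regardless of
# what hidden_size // n_heads would give.
scale = attention_scale(128)
```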
Experiments
- Exploring toolkit-style use of the compiler stack @SherlockNoMad @yiming0416: [Compiler Toolkit] JointGraph-based Training Prototype for llama3 by @SherlockNoMad in #1794
- Bit-wise identity RL between torchtitan Trainer and vLLM sampler: Add deterministic RL training experiment with vLLM by @bwasti in #1975
- Train models from `transformers` with torchtitan: 3outeille/transformers backend (Dense model only) by @3outeille in #2048
- Auto Parallel Examples @wconstab @xmfan: Autoparallel as an experiment in main by @xmfan in #2054
- Unified model definition in RL loop @wwwjn @acisseJZhong @zhxchen17: Run vLLM inference using torchtitan model definition (single GPU) by @wwwjn in #2119
Full Changelog: v0.2.0...v0.2.1
v0.2.0
Dependency
PyTorch Version: torch-2.10.0.dev20251019+cu126
TorchAO Version: torchao-0.15.0.dev20251015+cu126
Full Changelog: v0.1.0...v0.2.0