20 Feb 22:47

wwwjn

v0.2.2 Pre-release

Pre-release

Dependency

PyTorch Version: torch-2.12.0.dev20260220+cu126
TorchAO Version: torchao-0.17.0.dev20260220+cu126

What's changed

[CP] Refactor Context Parallel to use new PyTorch CP APIs (#2144) by @fegin
[CP] Enable FlexCP for llama3 (#2145) by @fegin
[Compiler Toolkit] Add option for full inductor (#2150) by @aditvenk
Add docs to explain COMM_MODE (#2162) by @fegin
Disable dynamo LRU cache when AC is enabled (#2204) by @soulitzer
feat(gpt-oss): add YaRN RoPE extensions with mscale for extended context (#2216) by @eous
[ROCm] Support mxfp8 on gfx950 (#2222) by @RuibinCheung
Enable memory snapshot for generic devices (#2228) by @frost-intel
GQA without kv repeats (#2259) by @francesco-bertolotti
[rl] GQA attention enablement in torchtitan vllm wrapper (#2299) by @wwwjn
Add peak flops for NVIDIA H20 GPUs (#2307) by @DamonFool
Add missing job_config.maybe_log() calls (#2308) by @EquationWalker
[DeepEP] Implement shared_experts overlap with deepep.combine() (#2310) by @vivekgoe
Separate out training for fault tolerance (#2311) by @tushar00jain
Torchtitan changes to integrate into Verl (#2333) by @acisseJZhong
Maintain same LR schedule for early stop debug runs (#2340) by @acisseJZhong
[Compiler Toolkit] Separate process groups for FSDP AG/RS comm overlap (#2368) by @yiming0416
[AC] Set preserve_rng_state=True as default for activation checkpointing (#2380) by @soulitzer

Add attention scaling to varlen for Qwen3 (#2178) by @liangel-02
Make get TP mesh optional in Llama4 parallelize (#2185) by @danielvegamyhre
[GPT-OSS] Graduate from experiments to main (#2203) by @shuhuayu
[autoparallel] Update local_map_deepseek_v3 device mesh usage (#2231) by @xmfan
[varlen_attn] Change is_causal to window_size (#2267) by @liangel-02
Remove _ScaledPartial placement (#2337) by @Aidyn-A
Added custom trunc_normal (#2342) by @francesco-bertolotti
Removed weight initialization from model __init__ (#2361) by @francesco-bertolotti

[docs] Fix missing --model.flavor flags in compiler_toolkit README (#2201) by @BryanBradfo
Fix loss computation by handling valid token imbalance in train loop (#2206) by @wwwjn
[MoE] Fix experts DTensor metadata bug for DCP (#2227) by @shuhuayu
Fix sdpa-varlen attention mismatch in Qwen3 (#2229) by @francesco-bertolotti
Weight tying fix for Qwen3 (#2253) by @francesco-bertolotti
Fix grad norm clipping for AutoP and DSv3 model init (#2270) by @sanketpurandare
[MoE] DeepEP refactor and fix memory leak during training and inference (#2296) by @shuhuayu
[docs] Fix type mismatch in model layers comments (#2306) by @DamonFool
Fix FLUX attention by exposing is_causal in SDPA (#2309) by @wwwjn
Fixing global_max_loss computation (#2314) by @Shagun-G
Fix the CI loss issue (#2315) by @fegin
Fix gpt-oss implementation (MoE router gate bias + top-k renorm) (#2319) by @linyuhongg
[SimpleFSDP] Fix HSDP placement mismatch in _distribute_dtensor (#2329) by @SongyuanZhao
[Bugfix] Fix bitwise determinism after vLLM SiluAndMul change (#2358) by @Lucaskabela
[Bugfix] Fix simple_rl_multiprocess.py to be runnable with recent vLLM version (#2359) by @Lucaskabela
Fixing extra averaging performed in validation error (#2366) by @Shagun-G
Bug fix: don't swallow OutOfMemoryError when enable_memory_snapshot=True (#2374) by @weifengpy
Fix: restrict completion logging to rank 0 (#2383) by @fatih-uzlmz

[Experimental][rl][vllm compat] Update simple_rl example to work with vLLM nightly (#2219) by @Lucaskabela
[Experimental][rl][unified] Update infer.py example to work with vLLM nightly (#2226) by @Lucaskabela
Add ROCm support for H100 tests (#2202) by @akashveramd
Add ROCm CI support for simple FSDP experiments test (#2220) by @akashveramd
Add test for DSv3 with flexattn + fsdp + ep + pp + sac op (#2234) by @shuhuayu
Add ROCm CI support for Auto Parallel & Compiler Toolkit experiments (#2248) by @akashveramd
Add ROCm CI support for Transformers Modeling Backend & VLM experiments (#2276) by @akashveramd
Update CPU unit test to use linux_job_v2 (#2287) by @joecummings
[BE week] Disable CPU wheel builds in nightly CI (#2289) by @joecummings
[BE][NFC] Add integration test for simplefsdp + CP deepseek_v3 (#2301) by @aditvenk
Fixed autoparallel integration tests on ROCm (#2321) by @wenchenvincent
[rl][ez] Squash landing import and git fixes (#2331) by @zhxchen17
[ci] Add DSv3 SimpleFSDP auto_bucketing to H100 CI jobs (#2347) by @IvanKobzarev
Bump tj-actions/changed-files from 47.0.1 to 47.0.2 (#2367) by @dependabot
[CI] Disable NVLS (#2372) by @fegin
Bump tj-actions/changed-files from 47.0.2 to 47.0.4 (#2390) by @dependabot
[rl] Install vllm from pre-built wheels (#2397) by @wwwjn
[rl/unified] Update default model-ckpt-path in infer.py to the one from README (#2405) by @daniellepintz

fegin, zhxchen17, and 32 other contributors

Assets 2

26 Dec 23:29

wwwjn

v0.2.1 Pre-release

Pre-release

pytorch verison: torch-2.11.0.dev20251226+cu126
torchao version: torchao-0.16.0.dev20251226+cu126

Use new DeviceMesh unflatten to rewrite parallel_dims by @fegin in #1660
Re:Run Torchtitan ROCm workflow on cron schedule & push to Main branch only by @akashveramd in #2018
adding variable length attention to llama3 8b by @liangel-02 in #2000
[Local Tensor] Replace dry_run.py with fake mode implementation by @fegin in #2057

Enable PP and EP overlap for MoE by @H-Huang in #1721
Integrate DeepEP to torchtitan by @elfiegg in #2107
[MoE] Add node limited routing support by @shuhuayu in #2111
Add Context Parallelism to Flux model training by @limou102 in #1851
gpt-oss model enablement by @wwwjn in #1754
[GPT-OSS] Add HF state dict adapter to support loading from HF checkpoints by @shuhuayu in #2021

Exploring toolkit-style use of the compiler stack @SherlockNoMad @yiming0416 : [Compiler Toolkit] JointGraph-based Training Prototype for llama3 by @SherlockNoMad in #1794
Bit-wise identity RL between torchtitan Trainer and vLLM sampler: Add deterministic RL training experiment with vLLM by @bwasti in #1975
Train models fromtransformers with torchtitan: 3outeille/transformers backend (Dense model only) by @3outeille in #2048
Auto Parallel Examples @wconstab @xmfan: Autoparallel as an experiment in main by @xmfan in #2054
Unified Model definition in RL loop @wwwjn @acisseJZhong @zhxchen17 : Run vLLM inference using torchtitan model definition (single GPU) by @wwwjn in #2119