
Improve compilation time (reduce from ~50 seconds to ~15s for vLLM)#3145

Draft
Lucaskabela wants to merge 1 commit into pytorch:main from Lucaskabela:lucaskabela/improve_vllm_comp

Conversation

@Lucaskabela
Contributor

@Lucaskabela Lucaskabela commented Apr 28, 2026

Fixes #3119 and #3071

Summary

We make significant improvements to vLLM compilation, saving ~40s (20s from cudagraph capture, 1s per step, and ~13s from Dynamo) via the following changes:

  1. Since we are using FA-based attention, which is traceable, we can use full-graph (FULL) cudagraphs instead of piecewise, saving ~1s per step.
  2. We adjust the max cudagraph capture size based on defaults derived from the configs - this cuts cudagraph capture time from 30s to 11s.
  3. We move compilation to the same compile pipeline the trainer model uses. While this misses out on some of vLLM's custom passes, we get more observability and control, ensuring a unified model definition. This also lets us leverage regional compile - reducing compile cost from O(n) to O(1), since we compile one transformer layer and reuse it for the rest - cutting Dynamo compile time from 17s to 4s (see the sketch below).
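Below is a minimal sketch of the regional (per-layer) compile idea, assuming a standard nn.ModuleList of identical transformer blocks; the helper name and model layout are illustrative, not the actual torchtitan compile pipeline.

import torch
import torch.nn as nn

def apply_regional_compile(model: nn.Module) -> None:
    # Compile each transformer block individually instead of the whole model.
    # Because every block runs the same code, Dynamo traces it once and (on
    # recent PyTorch) reuses the compiled artifact for the remaining blocks,
    # so compile cost is roughly O(1) in model depth rather than O(n).
    for i, block in enumerate(model.layers):
        model.layers[i] = torch.compile(block, fullgraph=True)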

Test plan

python torchtitan/experiments/rl/grpo.py --module rl --config rl_grpo_qwen3_0_6b

Test results:

Before

INFO 04-28 16:20:20 [backends.py:1128] [actor=<root>.<torchtitan.experiments.rl.actors.generator.VLLMGenerator generator{'gpus': 0/4}>] Dynamo bytecode transform time: 15.26 s
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:31<00:00,  1.13it/s]
...
[actor=<root>] Step  9 | Loss: +0.0020 | Reward: +0.743 (correctness=+0.450, format=+0.293) | Avg tokens: 100 | Logprob diff: mean=-6.4715e-05, max=2.2314e-01 | Time: 2.7s
[actor=<root>] Post-training validation
[actor=<root>] Summary:
  Pre:  mean_reward=+0.365 (correctness=+0.200, format=+0.165)
  Post: mean_reward=+0.700 (correctness=+0.400, format=+0.300)

After (this PR)

Capturing CUDA graphs (mixed prefill-decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:09<00:00,  1.34s/it]
...
[actor=<root>] Step  9 | Loss: +0.0020 | Reward: +0.743 (correctness=+0.450, format=+0.293) | Avg tokens: 100 | Logprob diff: mean=-6.4715e-05, max=2.2314e-01 | Time: 1.8s
[actor=<root>] Post-training validation
[actor=<root>] Summary:
  Pre:  mean_reward=+0.365 (correctness=+0.200, format=+0.165)
  Post: mean_reward=+0.700 (correctness=+0.400, format=+0.300)

@meta-cla meta-cla bot added the CLA Signed label Apr 28, 2026
@Lucaskabela Lucaskabela force-pushed the lucaskabela/improve_vllm_comp branch 3 times, most recently from fadcce5 to e81ea28 on April 28, 2026 23:01
@Lucaskabela Lucaskabela linked an issue Apr 28, 2026 that may be closed by this pull request
@Lucaskabela Lucaskabela force-pushed the lucaskabela/improve_vllm_comp branch 2 times, most recently from 8bac4cf to 671cc0a on April 29, 2026 00:04
@Lucaskabela Lucaskabela requested review from acisseJZhong, daniellepintz, tianyu-l and wwwjn and removed request for tianyu-l April 29, 2026 00:11
@Lucaskabela Lucaskabela marked this pull request as ready for review April 29, 2026 00:11
graph capture, which is vLLM-specific, is controlled here.
"""

cudagraph_mode: Literal["none", "full"] = "full"
Contributor

let's follow the compile config style and use enable, since it's a boolean option: https://github.com/pytorch/torchtitan/blob/main/torchtitan/config/configs.py#L323
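For illustration, a minimal sketch of the suggested style, mirroring CompileConfig's boolean enable flag; the class and field names here are assumptions, not the PR's final API.

from dataclasses import dataclass

@dataclass
class CudaGraphConfig:
    enable: bool = True
    """Whether vLLM captures CUDA graphs for the generator; maps to
    cudagraph_mode "full" when True and "none" when False."""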

See https://docs.vllm.ai/en/latest/design/cuda_graphs/#cudagraphmodes"""

@property
def is_eager(self) -> bool:
Contributor

don't need this aliasing field anymore

When set, this is passed to vLLM EngineArgs and used to derive
CUDA graph capture sizes, avoiding captures for batch sizes that
will never be reached. Auto-computed by RLTrainer from
num_prompts_per_step * sampling.n when not explicitly set."""
Contributor

what's the benefit of not using this default?

Contributor Author

Better flexibility - if we scale either num_prompts_per_step or sampling.n, this automatically scales with us

Contributor

please elaborate a bit more on what is meant by "if we scale ..., this scales with us". My naive thought is that we will change them per job, which will change this cudagraph setting.

Contributor Author

Sorry, that is what I meant here - if we change num_prompts_per_step for a given job, we do not have to remember to also bump this value
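For illustration, a minimal sketch of that auto-computation; the config attribute paths are assumptions, not the PR's exact fields.

# Derive max_num_seqs when the user did not set it explicitly:
# one concurrent sequence per sampled completion per prompt.
if config.generator.max_num_seqs is None:
    config.generator.max_num_seqs = (
        config.num_prompts_per_step * config.sampling.n
    )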

*,
model_spec: ModelSpec,
model_path: str,
compile_config: CompileConfig | None = None,
Contributor

Should this be part of Generator.Config? If we believe trainer and generator should always use the same compilation config (which I think is fine to assume for now), then should it be part of RLTrainer.Config?

Contributor Author

Will up-level to RLTrainer.Config then

model_spec: ModelSpec,
vllm_config: VllmConfig,
prefix: str = "",
compile_config: CompileConfig | None = None,
Contributor

let's make it required in the call chain

    disable_log_stats=True,
)
if config.max_num_seqs is not None:
    engine_kwargs["max_num_seqs"] = config.max_num_seqs
Contributor

what does this field do?
what if we don't set it here?

Contributor Author

Commented above - this controls which cudagraphs we capture; not setting it defaults to the behavior on main today

I can go ahead and make this set by default to avoid any sort of silent slowdown

Contributor

wait these are two different kwargs -- the other is for vllm cudagraph behavior, what's this additional kwarg for?

Contributor Author

max_num_seqs controls other things like the padding for max size (used in kv cache)

kwargs: dict = dict(cudagraph_mode=self.cudagraph_mode, mode=0)

if max_num_seqs is not None and self.cudagraph_mode != "none":
    kwargs["cudagraph_capture_sizes"] = self._compute_cudagraph_capture_sizes(
Contributor

what if we don't set it when cudagraph is enabled?

Contributor Author

Defaults to 256, which captures ~35 different sizes (ranging from 1 to 256), so no incorrectness - just more memory and startup time used
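For context, a hedged sketch of how such a capture-size computation could look, assuming vLLM's default ladder of 1, 2, 4 and then multiples of 8, capped at max_num_seqs; this is illustrative and may differ from the PR's actual helper.

def compute_cudagraph_capture_sizes(max_num_seqs: int) -> list[int]:
    # 1, 2, 4, then multiples of 8 up to max_num_seqs. With the default cap
    # of 256 this yields ~35 sizes; a smaller cap yields far fewer graphs
    # to capture, which is where the startup-time saving comes from.
    sizes = [s for s in (1, 2, 4) if s <= max_num_seqs]
    sizes += [s for s in range(8, max_num_seqs + 1, 8)]
    if max_num_seqs not in sizes:
        sizes.append(max_num_seqs)
    return sizes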

Comment on lines +49 to +51
require vLLM's whole-model torch.compile to split the graph around
non-capturable ops, which conflicts with per-layer compile.
See https://docs.vllm.ai/en/latest/design/cuda_graphs/#cudagraphmodes"""
Contributor

What happens when we enable cudagraph for per-layer compile?

We can save compile time, but what's the impact on run time, e.g. when going to GB200 with significant CPU overhead.

Contributor Author

The test plan shows the time impact - we observe speedup over piecewise in this particular setup

Contributor

@tianyu-l tianyu-l Apr 29, 2026

also would like to check if it works with EP, being enabled in #3142

MoE has dynamic shapes, despite being full-graph torch-compilable

Comment on lines +49 to +51
require vLLM's whole-model torch.compile to split the graph around
non-capturable ops, which conflicts with per-layer compile.
See https://docs.vllm.ai/en/latest/design/cuda_graphs/#cudagraphmodes"""
Contributor

Also, it seems we move compile back to torchtitan, but cudagraph application is still in vllm. How far are we from moving cudagraph application into torchtitan as well?

Contributor Author

I would estimate 2-3 weeks, but I can leave a TODO here that we should unify the cudagraph config once we have it on the trainer side

@Lucaskabela Lucaskabela force-pushed the lucaskabela/improve_vllm_comp branch from 671cc0a to f587e8a on April 29, 2026 18:00
model_spec: ModelSpec,
hf_assets_path: str = "",
generator_dtype: str = "",
compile_config: CompileConfig,
Contributor

does this work in Python? Putting an arg without a default after args with defaults
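For reference, the Python rule in question: keyword-only parameters (those after a bare *) may omit defaults even when they follow defaulted parameters, whereas positional-or-keyword parameters may not. The function name below is purely illustrative.

# Valid: compile_config is keyword-only, so it needs no default even though
# it follows defaulted keyword-only parameters; callers must pass it by name.
def make_generator(
    *,
    hf_assets_path: str = "",
    generator_dtype: str = "",
    compile_config: "CompileConfig",
) -> None:
    ...

# Invalid without the bare "*": SyntaxError, since a non-default parameter
# would follow a defaulted one.
# def make_generator(hf_assets_path: str = "", compile_config: "CompileConfig"): ...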

# CUDA graph capture despite being torch.compile-compatible.

@staticmethod
def _compute_cudagraph_capture_sizes(max_num_seqs: int) -> list[int]:
Contributor

maybe inline this function into get_vllm_compilation_config

def is_eager(self) -> bool:
    """Inferred from backend and cudagraph_mode."""
    return self.backend == "none" and self.cudagraph_mode == "none"
torch.compile is configured separately via ``CompileConfig`` at the
Contributor

shall we add a TODO to explore moving cudagraph application to torchtitan model as well?

"max_num_seqs must be set (auto-computed by RLTrainer, or "
"set explicitly when using VLLMGenerator standalone)"
)
engine_kwargs["max_num_seqs"] = config.max_num_seqs
Contributor

add comment to educate

max_num_seqs controls other things like the padding for max size (used in kv cache)

and what happens if not set explicitly

# varying tensor types (AsyncCollectiveTensor vs plain tensor from
# TP collectives). Each combination triggers a dynamo recompile.
if compile_config.enable:
    torch._dynamo.config.recompile_limit = 12
Contributor

despite the comments above, 12 still looks like a magic number. Could you elaborate further:

  • where 12 comes from
  • what happens if we don't set it

Comment on lines +134 to +137
"""Maximum number of concurrent sequences the engine will batch.
When set, this is passed to vLLM EngineArgs. Auto-computed by
RLTrainer from num_prompts_per_step * sampling.n when not
explicitly set."""
Contributor

if we change num_prompts_per_step for a given job, we do not have to remember to also bump this value

Not sure if I understand. My (probably wrong) understanding was:

  • "Maximum number of concurrent sequences the engine will batch" should always be obtained from num_prompts_per_step * sampling.n per RL job, and there won't be any room to set it smaller.
  • If not set explicitly in vllm kwargs, it will be inferred by vllm, causing potential overhead.
  • There is no benefit to giving users control, and we should always set it to num_prompts_per_step * sampling.n explicitly.

Where does the argument break?

Contributor Author

@Lucaskabela Lucaskabela Apr 29, 2026

Ah I think I misunderstood this line of questioning (thought it was more why do we need this at all) - I agree we can just hide this from users

@Lucaskabela Lucaskabela marked this pull request as draft April 29, 2026 23:59
@Lucaskabela Lucaskabela force-pushed the lucaskabela/improve_vllm_comp branch from f587e8a to 8509698 on April 30, 2026 00:10

Labels

ciflow/8gpu, CLA Signed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

replace vllm compile with trainer model compile? [rl]
[feature] Cache cudagraphs for unified model generator

2 participants