Improve compilation time (reduce from ~50 seconds to ~15s for vLLM) #3145
Lucaskabela wants to merge 1 commit into pytorch:main
Conversation
Force-pushed from fadcce5 to e81ea28.
Force-pushed from 8bac4cf to 671cc0a.
```
graph capture, which is vLLM-specific, is controlled here.
"""

cudagraph_mode: Literal["none", "full"] = "full"
```
Let's follow the compile config style and use `enable`, since this is a boolean option: https://github.com/pytorch/torchtitan/blob/main/torchtitan/config/configs.py#L323
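For illustration, a minimal sketch of that convention (the class name and docstring here are assumptions; the point is the boolean `enable` field replacing the mode string):

```python
from dataclasses import dataclass


@dataclass
class CudaGraph:
    """Sketch only: a boolean `enable` flag instead of a "none"/"full"
    mode string, mirroring torchtitan's compile config style."""

    enable: bool = True
```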
```
See https://docs.vllm.ai/en/latest/design/cuda_graphs/#cudagraphmodes"""

@property
def is_eager(self) -> bool:
```
We don't need this aliasing property anymore.
```
When set, this is passed to vLLM EngineArgs and used to derive
CUDA graph capture sizes, avoiding captures for batch sizes that
will never be reached. Auto-computed by RLTrainer from
num_prompts_per_step * sampling.n when not explicitly set."""
```
What's the benefit of not using this default?

Better flexibility: if we scale either `num_prompts_per_step` or `sampling.n`, this automatically scales with us.

Please elaborate a bit more on what "if we scale ..., this scales with us" means. My naive thought is that we will change them per job, which will change this cudagraph setting.

Sorry, that is what I meant here: if we change `num_prompts_per_step` for a given job, we do not have to remember to also bump this value.
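A minimal sketch of the auto-scaling being described (the function name and wiring are assumptions based on this thread):

```python
def resolve_max_num_seqs(
    num_prompts_per_step: int,
    sampling_n: int,
    explicit_max_num_seqs: int | None = None,
) -> int:
    """Derive max_num_seqs so it tracks the job's batch shape."""
    if explicit_max_num_seqs is not None:
        return explicit_max_num_seqs
    # Scales automatically when num_prompts_per_step or sampling.n change,
    # so nobody has to remember to bump it per job.
    return num_prompts_per_step * sampling_n


assert resolve_max_num_seqs(32, 8) == 256
```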
```
*,
model_spec: ModelSpec,
model_path: str,
compile_config: CompileConfig | None = None,
```
Should this be part of Generator.Config? If we believe the trainer and generator should always use the same compilation config (which I think is fine to assume for now), then it should be part of RLTrainer.Config.

Will up-level it to RLTrainer.Config then.
```
model_spec: ModelSpec,
vllm_config: VllmConfig,
prefix: str = "",
compile_config: CompileConfig | None = None,
```
Let's make it required in the call chain.
```
    disable_log_stats=True,
)
if config.max_num_seqs is not None:
    engine_kwargs["max_num_seqs"] = config.max_num_seqs
```
What does this field do? What if we don't set it here?

Commented above: this controls which cudagraphs we capture; not setting it defaults to the behavior on main today. I can go ahead and set it by default to avoid any silent slowdown.

Wait, these are two different kwargs; the other one is for vLLM cudagraph behavior. What is this additional kwarg for?

`max_num_seqs` also controls other things, like the padding for the maximum size (used in the KV cache).
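(For context, a minimal sketch of how this flows into vLLM, assuming the standard `EngineArgs` field of the same name; the model name is a placeholder:)

```python
from vllm import EngineArgs

engine_args = EngineArgs(
    model="facebook/opt-125m",  # placeholder
    # Caps concurrent sequences per batch; vLLM also uses it to size
    # padding for the maximum batch (e.g. in KV-cache bookkeeping),
    # not just to pick cudagraph capture sizes.
    max_num_seqs=256,
)
```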
```
kwargs: dict = dict(cudagraph_mode=self.cudagraph_mode, mode=0)

if max_num_seqs is not None and self.cudagraph_mode != "none":
    kwargs["cudagraph_capture_sizes"] = self._compute_cudagraph_capture_sizes(
```
What if we don't set it when cudagraph is enabled?

It defaults to 256, which captures ~35 different sizes (ranging from 1 to 256), so there is no incorrectness; it just uses more memory and startup time.
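For reference, a sketch of a capture-size schedule with that shape (the exact schedule vLLM uses is an assumption here; this one yields the 35 sizes for a cap of 256 mentioned above):

```python
def compute_cudagraph_capture_sizes(max_num_seqs: int) -> list[int]:
    # A few small sizes individually, then multiples of 8 up to the cap;
    # for max_num_seqs=256 this gives 3 + 32 = 35 capture sizes.
    sizes = [1, 2, 4] + list(range(8, max_num_seqs + 1, 8))
    return [s for s in sizes if s <= max_num_seqs]


assert len(compute_cudagraph_capture_sizes(256)) == 35
```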
```
require vLLM's whole-model torch.compile to split the graph around
non-capturable ops, which conflicts with per-layer compile.
See https://docs.vllm.ai/en/latest/design/cuda_graphs/#cudagraphmodes"""
```
What happens when we enable cudagraph for per-layer compile? We can save compile time, but what's the impact on run time, e.g. when going to GB200 with significant CPU overhead?

The test plan shows the time impact: we observe a speedup over piecewise in this particular setup.

I would also like to check whether it works with EP, which is being enabled in #3142. MoE has dynamic shapes, despite being full-graph torch-compilable.
```
require vLLM's whole-model torch.compile to split the graph around
non-capturable ops, which conflicts with per-layer compile.
See https://docs.vllm.ai/en/latest/design/cuda_graphs/#cudagraphmodes"""
```
Also, it seems we moved compile back to torchtitan, but cudagraph application is still in vLLM. How far are we from moving cudagraph application into torchtitan as well?

I would estimate 2-3 weeks, but I can leave a TODO here that we should unify the cudagraph config once we have it on the trainer side.
Force-pushed from 671cc0a to f587e8a.
```
model_spec: ModelSpec,
hf_assets_path: str = "",
generator_dtype: str = "",
compile_config: CompileConfig,
```
Does this work in Python: putting an arg without a default after args with defaults?
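(For reference: yes, this is legal when the parameters are keyword-only, i.e. they follow a bare `*` in the signature; only positional parameters must keep defaults once one has a default. A quick demonstration:)

```python
def build(
    *,
    hf_assets_path: str = "",
    compile_config: object,  # no default after a defaulted arg: fine, it's keyword-only
) -> None:
    print(hf_assets_path, compile_config)


build(compile_config="cfg")  # OK; hf_assets_path falls back to ""
# build("assets", "cfg")     # TypeError: build() takes 0 positional arguments
```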
```
# CUDA graph capture despite being torch.compile-compatible.

@staticmethod
def _compute_cudagraph_capture_sizes(max_num_seqs: int) -> list[int]:
```
Maybe inline this function into get_vllm_compilation_config.
```
def is_eager(self) -> bool:
    """Inferred from backend and cudagraph_mode."""
    return self.backend == "none" and self.cudagraph_mode == "none"

torch.compile is configured separately via ``CompileConfig`` at the
```
Shall we add a TODO to explore moving cudagraph application into the torchtitan model as well?
| "max_num_seqs must be set (auto-computed by RLTrainer, or " | ||
| "set explicitly when using VLLMGenerator standalone)" | ||
| ) | ||
| engine_kwargs["max_num_seqs"] = config.max_num_seqs |
Add a comment to educate:

> max_num_seqs controls other things, like the padding for max size (used in the KV cache)

and explain what happens if it is not set explicitly.
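e.g., a sketch of the kind of comment being requested (wording is only a suggestion):

```python
# max_num_seqs caps how many sequences the engine batches concurrently.
# Besides deriving cudagraph capture sizes, vLLM uses it to size padding
# for the maximum batch (e.g. KV-cache bookkeeping). If left unset, vLLM
# falls back to its own default, which can cost extra memory and startup
# time without failing loudly.
engine_kwargs["max_num_seqs"] = config.max_num_seqs
```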
```
# varying tensor types (AsyncCollectiveTensor vs plain tensor from
# TP collectives). Each combination triggers a dynamo recompile.
if compile_config.enable:
    torch._dynamo.config.recompile_limit = 12
```
Despite the comments above, 12 still looks like a magic number. Could you elaborate further:
- where 12 comes from
- what happens if we don't set it
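For context: as of recent PyTorch versions, Dynamo's default recompile limit is 8 (`torch._dynamo.config.recompile_limit`); once a function exceeds it, Dynamo stops recompiling and runs that frame eagerly. A minimal sketch of raising it and observing recompiles (the value 12 is this PR's choice, not a library default):

```python
import torch

# Allow more guard-failure recompiles per function before Dynamo gives up
# and falls back to eager for that frame (default limit is 8).
torch._dynamo.config.recompile_limit = 12


@torch.compile
def double(x: torch.Tensor) -> torch.Tensor:
    return x * 2


# Run with TORCH_LOGS=recompiles to see which guard failures recompile.
double(torch.randn(4))
```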
| """Maximum number of concurrent sequences the engine will batch. | ||
| When set, this is passed to vLLM EngineArgs. Auto-computed by | ||
| RLTrainer from num_prompts_per_step * sampling.n when not | ||
| explicitly set.""" |
> if we change num_prompts_per_step for a given job, we do not have to remember to also bump this value

Not sure I understand. My (probably wrong) understanding was:
- "Maximum number of concurrent sequences the engine will batch" should always be obtained from `num_prompts_per_step * sampling.n` per RL job, and there won't be any room to set it smaller.
- If not set explicitly in vLLM kwargs, it will be inferred by vLLM, causing potential overhead.
- There is no benefit to giving control to users; we should always set it to `num_prompts_per_step * sampling.n` explicitly.

Where does the argument break?

Ah, I think I misunderstood this line of questioning (I thought it was more "why do we need this at all"). I agree we can just hide this from users.
Force-pushed from f587e8a to 8509698.
Fixes #3119 and #3071
Summary
We make significant improvements to vLLM compilation, saving ~40s (~20s from cudagraph capture, ~1s per step, and ~13s from Dynamo) through the changes discussed above: full-mode cudagraph capture alongside per-layer compile, cudagraph capture sizes derived from max_num_seqs, and a raised Dynamo recompile limit.
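For illustration, the configuration surface this thread converges on (assembled from the diff hunks above; the class name and exact placement are assumptions):

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class GeneratorCompileSettings:
    """Sketch of the knobs discussed in this PR."""

    cudagraph_mode: Literal["none", "full"] = "full"
    # Auto-computed by RLTrainer as num_prompts_per_step * sampling.n and
    # used to derive cudagraph_capture_sizes, so batch sizes that can
    # never be reached are not captured.
    max_num_seqs: int | None = None
```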
Test plan

Test results: compilation timings before vs. after this PR (~50s before, ~15s after).