
Commit 25e0082
[Core][Doc][CI/Build][Bugfix][Profiling] Multi-replica routing policies, prefix caching, uv, and a much faster and lighter Vidur (#56)

* Add documentation: how to run Vidur for different models, GPU SKUs, workloads, etc., and how to run the Vidur Config Explorer (i.e., hundreds of simulations in parallel).
* Support replica-wise metrics, e.g. `prefill_e2e_time_replicawise`, to show differences in TTFT across replicas.
* Implement and refine several routing policies; see `vidur/scheduler/global_scheduler`.
* Port prefix caching support from vLLM V1, along with the `vllm_v1` replica scheduler.
* Switch to `uv` from `mamba`.
* Several quality-of-life enhancements: reproducible output, 4X reduction in RAM usage per simulation, and support for high context lengths (128K), large request counts (~25k), and high QPS.
1 parent 7d47513 commit 25e0082
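The commit message above mentions several new routing policies under `vidur/scheduler/global_scheduler`. As a rough illustration of the simplest such policy, here is a minimal round-robin router sketch; the class and method names are hypothetical and not Vidur's actual API.

```python
from itertools import count


class RoundRobinRouter:
    """Hypothetical sketch: assign each incoming request to the next
    replica in cyclic order (not Vidur's real implementation)."""

    def __init__(self, num_replicas: int) -> None:
        self._num_replicas = num_replicas
        self._next = count()  # monotonically increasing request counter

    def route(self) -> int:
        """Return the replica index that should serve the next request."""
        return next(self._next) % self._num_replicas


router = RoundRobinRouter(num_replicas=4)
assignments = [router.route() for _ in range(6)]
print(assignments)  # -> [0, 1, 2, 3, 0, 1]
```

Round-robin ignores per-replica load; the commit's replica-wise metrics (e.g. `prefill_e2e_time_replicawise`) exist precisely to compare such policies against load-aware ones.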

File tree

155 files changed: +873,773 / −147,487 lines

.github/workflows/lint.yml

Lines changed: 10 additions & 7 deletions

```diff
@@ -16,11 +16,14 @@ jobs:
     steps:
       - name: "Checkout Repository"
         uses: actions/checkout@v3
-      - name: Install Conda environment from environment-dev.yml
-        uses: mamba-org/setup-micromamba@v1
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
         with:
-          environment-file: environment-dev.yml
-      - name: "Run black lint"
-        run: make lint/black
-      - name: "Run isort check"
-        run: make lint/isort
+          # Install a specific version of uv.
+          version: "0.7.3"
+      - name: Install the project
+        run: uv sync --locked --all-extras --dev
+      - name: Run black
+        run: uv run black vidur
+      - name: Run isort
+        run: uv run isort --profile black vidur
```

.gitignore

Lines changed: 6 additions & 1 deletion

```diff
@@ -165,7 +165,7 @@ cache
 cache_random_forrest
 cache_linear_regression
 cache*
-simulator_output
+simulator_outputs
 wandb
 train.zip
 profiling_outputs
@@ -177,3 +177,8 @@ config_optimizer_output_tmpfs
 profiler_traces*
 experiments/profiling/get_profiled_data_from_trace.ipynb
 env_3
+config_optimizer_output*
+experiments/miscellaneous/request_length_trace_analysis.ipynb
+vidur/config_optimizer/config_explorer/config/config_llama3_8b.yml
+experiments/global_scheduler/get_uniform_trace.ipynb
+prefill_throughput_output
```

.python-version

Lines changed: 1 addition & 0 deletions

```diff
@@ -0,0 +1 @@
+3.10
```

.vscode/settings.json

Lines changed: 2 additions & 1 deletion

```diff
@@ -3,5 +3,6 @@
     "INTERNLM",
     "QWEN",
     "vidur"
-  ]
+  ],
+  "python.analysis.fixAll": ["source.unusedImports"]
 }
```

README.md

Lines changed: 28 additions & 69 deletions

````diff
@@ -20,110 +20,69 @@ Vidur is a high-fidelity and extensible LLM inference system simulator. It can h
 
 ## Supported Models
 
-__Instructions on adding a new model to existing or new SKUs can be found [here](docs/profiling.md)__.
-
-| Model / Device | A100 80GB DGX | H100 DGX | 4xA100 80GB Pairwise NVLink Node | 8xA40 Pairwise NVLink Node |
+| Model / Device | H100 DGX | A100 80GB DGX | 4xA100 80GB Pairwise NVLink Node | 8xA40 Pairwise NVLink Node |
 | --- | --- | --- | --- | --- |
-| `meta-llama/Meta-Llama-3-8B` || |||
-| `meta-llama/Meta-Llama-3-70B` || |||
+| `meta-llama/Meta-Llama-3-8B` || |||
+| `meta-llama/Meta-Llama-3-70B` || |||
 | `meta-llama/Llama-2-7b-hf` |||||
 | `codellama/CodeLlama-34b-Instruct-hf"` |||||
 | `meta-llama/Llama-2-70b-hf` |||||
 | `internlm/internlm-20b` |||||
 | `Qwen/Qwen-72B` |||||
 
-* All models support a maximum context length of 4k except `Llama3-8B` and `Llama3-70B` which support 16k context length by passing additional CLI params:
-
-```text
---random_forrest_execution_time_predictor_config_prediction_max_prefill_chunk_size 16384 \
---random_forrest_execution_time_predictor_config_prediction_max_batch_size 512 \
---random_forrest_execution_time_predictor_config_prediction_max_tokens_per_request 16384
-```
-
+* __Instructions on adding a new model to existing or new SKUs can be found [here](docs/profiling.md)__.
+* All models support a maximum context length of 4k except `Llama3-8B` and `Llama3-70B`, which support a 16k context length.
 * Pipeline parallelism is supported for all models. The PP dimension should divide the number of layers in the model.
 * In DGX nodes, there are 8 GPUs, fully connected via NVLink. So TP1, TP2, TP4 and TP8 are supported.
 * In 4x pairwise NVLink nodes, there are 4 GPUs, so TP1, TP2 and TP4 are supported. TP4 here is less performant than TP4 in DGX nodes because (GPU1, GPU2) are connected via NVLink and (GPU3, GPU4) are connected via NVLink, but between these pairs the interconnect is slower.
 * You can use any combination of TP and PP. For example, you can run LLaMA2-70B on TP2-PP2 on a 4xA100 80GB Pairwise NVLink Node.
 
-## Setup
+## Setup (using `uv`)
 
-### Using `mamba`
+1. Install [uv](https://docs.astral.sh/uv/getting-started/installation/#installation-methods)
+2. At project root, run `uv venv` to create a new virtual environment.
+3. Activate the environment using `source .venv/bin/activate`.
+4. Install dependencies using `uv sync`. The environment is now ready for use.
 
-To run the simulator, create a mamba environment with the given dependency file.
-
-```sh
-mamba env create -p ./env -f ./environment.yml
-mamba env update -f environment-dev.yml
-```
-
-### Using `venv`
-
-1. Ensure that you have Python 3.10 installed on your system. Refer <https://www.bitecode.dev/p/installing-python-the-bare-minimum>
-2. `cd` into the repository root
-3. Create a virtual environment using the `venv` module: `python3.10 -m venv .venv`
-4. Activate the virtual environment using `source .venv/bin/activate`
-5. Install the dependencies using `python -m pip install -r requirements.txt`
-6. Run `deactivate` to deactivate the virtual environment
-
-### Using `conda` (Least recommended)
-
-To run the simulator, create a conda environment with the given dependency file.
-
-```sh
-conda env create -p ./env -f ./environment.yml
-conda env update -f environment-dev.yml
-```
-
-### Setting up wandb (Optional)
+## Setting up wandb (Optional)
 
 First, set up your account on `https://<your-org>.wandb.io/` or public wandb, obtain the API key, and then run the following command:
 
 ```sh
 wandb login --host https://<your-org>.wandb.io
 ```
 
-To opt out of wandb, pick any one of the following methods:
-
-1. `export WANDB_MODE=disabled` in your shell or add this in `~/.zshrc` or `~/.bashrc`. Remember to reload using `source ~/.zshrc`.
-2. Set `wandb_project` and `wandb_group` as `""` in `vidur/config/default.yml`. Also, remove these CLI params from the shell command with which the simulator is invoked.
+To opt out of wandb, set `export WANDB_MODE=disabled` in your shell or add it to `~/.zshrc` or `~/.bashrc`. Remember to reload using `source ~/.zshrc` or `source ~/.bashrc`.
 
 ## Running the simulator
 
-To run the simulator, execute the following command from the repository root,
-
-```sh
-python -m vidur.main
-```
-
-or a big example with all the parameters,
+To run the simulator, execute the following command from the repository root:
 
 ```sh
-python -m vidur.main \
-    --replica_config_device a100 \
+python -m vidur.main \
+    --time_limit 10800 \
     --replica_config_model_name meta-llama/Meta-Llama-3-8B \
-    --cluster_config_num_replicas 1 \
+    --replica_config_device h100 \
+    --replica_config_network_device h100_dgx \
+    --cluster_config_num_replicas 8 \
     --replica_config_tensor_parallel_size 1 \
     --replica_config_num_pipeline_stages 1 \
     --request_generator_config_type synthetic \
-    --synthetic_request_generator_config_num_requests 512 \
+    --synthetic_request_generator_config_num_requests 128 \
     --length_generator_config_type trace \
-    --trace_request_length_generator_config_max_tokens 16384 \
-    --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \
+    --trace_request_length_generator_config_trace_file ./data/processed_traces/mooncake_conversation_trace.csv \
     --interval_generator_config_type poisson \
-    --poisson_request_interval_generator_config_qps 6.45 \
-    --replica_scheduler_config_type sarathi \
-    --sarathi_scheduler_config_batch_size_cap 512 \
-    --sarathi_scheduler_config_chunk_size 512 \
-    --random_forrest_execution_time_predictor_config_prediction_max_prefill_chunk_size 16384 \
-    --random_forrest_execution_time_predictor_config_prediction_max_batch_size 512 \
-    --random_forrest_execution_time_predictor_config_prediction_max_tokens_per_request 16384
+    --poisson_request_interval_generator_config_qps 8.0 \
+    --global_scheduler_config_type round_robin \
+    --replica_scheduler_config_type vllm_v1 \
+    --vllm_v1_scheduler_config_chunk_size 512 \
+    --vllm_v1_scheduler_config_batch_size_cap 512 \
+    --cache_config_enable_prefix_caching
 ```
 
-or to get information on all parameters,
+The command above simulates a scenario with an H100 DGX node running 8 replicas of the `Meta-Llama-3-8B` model, with synthetic requests generated at a QPS of 8. The `mooncake_conversation` trace file is used for request lengths, and the scheduler is set to `vllm_v1`, which has been ported from [vLLM V1](https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py).
 
-```sh
-python -m vidur.main -h
-```
+__The simulator supports a plethora of parameters for different simulation scenarios; see [docs/how_to_run.md](docs/how_to_run.md). Also run `python -m vidur.main -n` to get help text on all parameters.__
 
 ## Simulator Output
 
````

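The `--cache_config_enable_prefix_caching` flag in the README example above enables the prefix caching support ported from vLLM V1. As a rough sketch of the underlying idea (hash-chained KV-cache blocks, so a block is reusable only when its entire preceding context matches), here is a minimal toy implementation; all names are illustrative and not Vidur's or vLLM's actual API.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV-cache block (toy value; real systems use larger blocks)


def block_hashes(tokens):
    """Hash each full block chained with the hash of everything before it,
    so a block matches only when the entire preceding context is identical."""
    hashes, prev = [], b""
    num_full = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, num_full, BLOCK_SIZE):
        block = tokens[i:i + BLOCK_SIZE]
        prev = hashlib.sha256(prev + str(block).encode()).digest()
        hashes.append(prev)
    return hashes


def lookup_and_insert(tokens, cache):
    """Count tokens covered by contiguous leading cache hits, then insert
    all of this request's blocks so later requests can reuse them."""
    hashes = block_hashes(tokens)
    hits = 0
    for h in hashes:
        if h not in cache:
            break
        hits += 1
    cache.update(hashes)
    return hits * BLOCK_SIZE  # tokens whose KV entries need no recomputation


cache = set()
a = lookup_and_insert(list(range(12)), cache)                    # cold start
b = lookup_and_insert(list(range(8)) + [99, 98, 97, 96], cache)  # shares 2 blocks
print(a, b)  # -> 0 8
```

The second request reuses the first two blocks (tokens 0..7) and recomputes only its divergent tail, which is the effect the simulator models when prefix caching is enabled.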
assets/batch_size.png: binary image changed (+11.6 KB, −8.4 KB)

assets/prefill_e2e_time.png: binary image removed (−11.6 KB)

assets/request_e2e_time.png: binary image removed (−12.6 KB)
