
Hybrid execution time predictor to solve extrapolation issues #52

Status: Closed
yl3469 wants to merge 1 commit into microsoft:main from yl3469:main

Conversation


yl3469 commented Apr 3, 2025:

The current default random forest predictor overfits the training set and cannot generalize batch execution time predictions to unseen token counts or batch sizes. For example, for any input beyond the profiled range it simply returns the time predicted for the largest profiled num_tokens.

This causes inaccurate end-to-end latency predictions, especially for long-context sequences.

Prefill Latency (TTFT), by percentile:

| Predictor | p50 | p90 | p95 | p99 |
| --- | --- | --- | --- | --- |
| random-forest | 2.09 | 8.59 | 10.49 | 12.81 |
| hybrid-forest | 10.11 | 37.01 | 43.33 | 47.36 |

E2E Latency, by percentile:

| Predictor | p50 | p90 | p95 | p99 |
| --- | --- | --- | --- | --- |
| random-forest | 10.97 | 28.08 | 30.50 | 49.27 |
| hybrid-forest | 47.04 | 90.13 | 100.81 | 111.52 |

We verified that using a polynomial fit for extrapolation, together with a random forest for interpolation, yields the best MAPE (mean absolute percentage error) results and brings predictions closest to the real system.
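A minimal sketch of the hybrid idea described above (hypothetical class and parameter names; the PR's actual implementation differs): a random forest handles points inside the profiled range, while a polynomial fit handles extrapolation beyond it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class HybridPredictorSketch:
    """RF for interpolation inside the profiled range, polynomial fit outside it."""

    def __init__(self, degree=1, n_estimators=10):
        self._rf = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
        self._degree = degree

    def fit(self, x, y):
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        # Remember the profiled range so predict() knows when to extrapolate.
        self._lo, self._hi = x.min(), x.max()
        self._rf.fit(x.reshape(-1, 1), y)  # interpolation model
        self._poly = np.polynomial.Polynomial.fit(x, y, deg=self._degree)  # extrapolation model
        return self

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        out = np.empty_like(x)
        inside = (x >= self._lo) & (x <= self._hi)
        if inside.any():
            out[inside] = self._rf.predict(x[inside].reshape(-1, 1))
        if (~inside).any():
            out[~inside] = self._poly(x[~inside])
        return out
```

Unlike a plain random forest, whose predictions flatten out at the boundary of the training data, the polynomial branch keeps growing with the input.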

yl3469 (Author) commented Apr 6, 2025:

@microsoft-github-policy-service agree

AgrawalAmey (Collaborator) commented:

@nitinkedia7 please review this PR, it introduces an important fix

@nitinkedia7 nitinkedia7 self-requested a review April 9, 2025 11:16
nitinkedia7 (Collaborator) left a comment:


Hi @yl3469, the hybrid predictor really helps when the simulator is used for predictions beyond the min and max datapoints. Most of my comments are about organizing the code; the core logic looks fine. It would be great if you can fix these. Thanks!



```python
@dataclass
class HybridForestExecutionTimePredictorConfig(BaseExecutionTimePredictorConfig):
```


Is it possible for HybridForestExecutionTimePredictorConfig to extend both RandomForrestExecutionTimePredictorConfig and LinearRegressionExecutionTimePredictorConfig?
That way we'll need to specify only the extra parameters, which mainly deal with when to switch between models.

If this is not possible, one way is to group the parameters in the order below:

```python
# Random Forest parameters
...
# Linear Regression parameters
...
# Hybrid parameters
...
```

```python
    default_factory=lambda: [(1, 99)],
    metadata={"help": "Percentile range to use if boundary_method='percentile'."},
)
# rf_params : dict = field(
```


Please remove unused parameters.

```python
@property
def event_type(self):
    return self._event_type
pass
```


Remove pass

```python
)

ExecutionTimePredictorRegistry.register(
    ExecutionTimePredictorType.HYBRID_FOREST, HybridForestExecutionTimePredictor
```


The current naming HYBRID_FOREST doesn't mention the use of Polynomial Regression. A better name would be HYBRID_RF_PR.


```python
Parameters
----------
boundary_method : str, default='auto'
```


Please move these descriptions inline into config.py itself to have a single source of truth.

```python
    Whether to fit an intercept in the LinearRegression model.

    """
    fit_intercept = True
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need each parameter repeated here and again in the arguments of __init__. This class could accept an object of class HybridForestExecutionTimePredictorConfig itself and store it in self._config.
Similarly, repeating default values here and again in __init__ creates three sources of truth that may start to mismatch. Please keep parameter defaults in config.py only.
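The pattern being suggested might look like this sketch (class and field names hypothetical): the predictor stores the config object and exposes parameters through it, so defaults exist only in config.py.

```python
from dataclasses import dataclass

@dataclass
class HybridConfigSketch:
    # Defaults live here, and only here.
    fit_intercept: bool = True
    boundary_method: str = "auto"

class HybridPredictorSketch:
    def __init__(self, config: HybridConfigSketch):
        # No per-parameter arguments or repeated defaults in __init__.
        self._config = config

    @property
    def fit_intercept(self) -> bool:
        return self._config.fit_intercept
```

Changing a default then requires editing exactly one place, and the predictor's signature stays stable as parameters are added.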

```python
Returns self.
"""
# Initialize models
self.rf_model_ = RandomForestRegressor(
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The codebase uses the convention self._rf_model instead of self.rf_model_. Please follow this in your other variable names too.

```python
self.rf_model_.fit(X, y)

# For lower boundary extrapolation
if isinstance(X, np.ndarray):
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to use only the lower 20% of the data and the upper 40% (above the 60th percentile)? What if all the data is used here?
If partial data is indeed required, then let's have either:

  1. One hyperparameter. For example, the lower model uses < 50th percentile and the upper uses > 50th percentile.
  2. Two hyperparameters.

In any case, please extract these constants into variables in HybridForestExecutionTimePredictorConfig.
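The two-hyperparameter option could look like this sketch (field and function names are hypothetical): the cut-offs become config fields instead of hard-coded 20 / 60 literals.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class BoundaryConfigSketch:
    # Percentile cut-offs for the lower/upper extrapolation models.
    lower_percentile: float = 20.0
    upper_percentile: float = 60.0

def boundary_masks(x_feature: np.ndarray, cfg: BoundaryConfigSketch):
    """Return boolean masks selecting the lower and upper data slices."""
    lower = x_feature <= np.percentile(x_feature, cfg.lower_percentile)
    upper = x_feature >= np.percentile(x_feature, cfg.upper_percentile)
    return lower, upper
```

With the single-hyperparameter option, one field (say 50.0) would serve as both cut-offs.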


```python
# For lower boundary extrapolation
if isinstance(X, np.ndarray):
    mask_lower = X_feature <= np.percentile(X_feature, 20)  # Use bottom 20% of data
```

This if/else block is very repetitive and can be condensed.

Suggested change:

```python
# For lower boundary extrapolation
mask_lower = X_feature <= np.percentile(X_feature, 20)  # use bottom 20% of data
X_lower = X[mask_lower]
y_lower = y[mask_lower] if isinstance(X, np.ndarray) else y[mask_lower.to_list()]
self._lr_model_lower.fit(X_lower, y_lower)

# For upper boundary extrapolation
mask_upper = X_feature >= np.percentile(X_feature, 60)  # use top 40% of data
X_upper = X[mask_upper]
y_upper = y[mask_upper] if isinstance(X, np.ndarray) else y[mask_upper.to_list()]
self._lr_model_upper.fit(X_upper, y_upper)
```

```python
config_str = str(self.to_dict())

if df is None:
    # import pdb; pdb.set_trace()
```
nitinkedia7 commented Apr 9, 2025:

Please remove these debugging comments and print statements from this file and other edited files in this PR.

nitinkedia7 (Collaborator) commented:

Hi @yl3469, I spent quite a bit of time with this PR because the core issue it aims to solve (extrapolating to, say, higher context lengths) makes Vidur usable at points where it hasn't been profiled. However, I found several critical issues with this implementation:

  1. GridSearchCV is used to train multiple predictors for the same operation (e.g. mlp_up_proj), and the best one is chosen. Each of the models uses a different hyperparameter set for, say, the random forest. Now say we have three predictors (lower, middle, upper) for the same operation, and there are L unique hyperparameter sets for lower, M for middle, and U for upper. Then the search space for grid search in the current implementation blows up to L * M * U. Cache preparation didn't even finish in an hour when I tried to create predictors for the 128k / 256k prefill attention operation. We need to drastically reduce this performance cost.
     Note that this performance problem is not fundamental. We can find the best lower, middle, and upper models separately and then combine them only at inference. The search space is then L + M + U instead of L * M * U.

  2. We do not need a separate lower predictor because all profiling starts from the minimum (e.g. batch size of 1, context length of 2 etc.). There is a significant performance cost to adding each predictor (explained above) so if we can do without a predictor, we should.

  3. We cache the results of predictions and use them as lookup tables at runtime. Say we have data up to batch size 128, and we want to predict up to 512. A simpler implementation would be: the training function trains both an RF and an LR model with the full profiling data (i.e. up to 128). When predicting below 128 it uses RF; for greater than 128 it uses LR. This also removes the additional "top 20% of the data" hyperparameter for the upper model.
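The simpler scheme described in point 3 can be sketched as follows (class and attribute names are hypothetical, not the PR's actual code): both models train on the full profiled data, and the maximum profiled value is the switch point.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

class SwitchAtMaxPredictorSketch:
    """RF inside the profiled range, LR beyond the max profiled point."""

    def fit(self, x, y):
        x = np.asarray(x, dtype=float).reshape(-1, 1)
        self._max = x.max()  # e.g. largest profiled batch size (128)
        # Both models see the FULL profiling data; no percentile slicing.
        self._rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(x, y)
        self._lr = LinearRegression().fit(x, y)
        return self

    def predict(self, x):
        x = np.asarray(x, dtype=float).reshape(-1, 1)
        beyond = x[:, 0] > self._max
        out = np.empty(len(x))
        if (~beyond).any():
            out[~beyond] = self._rf.predict(x[~beyond])
        if beyond.any():
            out[beyond] = self._lr.predict(x[beyond])
        return out
```

This keeps grid search costs at L + M (one search per model) and needs no lower predictor or slicing hyperparameters.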

This PR remains compatible with the codebase even after the recent big PR #56. However, the implementation needs to be made much more performant, and a few new parameters need to be extracted into the config.
