
Hybrid execution time predictor to solve extrapolation issues #52

Status: Closed
yl3469 wants to merge 1 commit into microsoft:main from yl3469:main

Conversation


yl3469 commented Apr 3, 2025:

The current default random forest predictor overfits the training set and cannot generalize batch execution time predictions to unseen token counts or batch sizes. For example, for any input beyond the profiled range it simply returns the time predicted for the largest profiled num_tokens.

This causes inaccurate end-to-end latency predictions, especially for long-context sequences.

Prefill Latency (TTFT), by percentile:

| Predictor | p50 | p90 | p95 | p99 |
| --- | --- | --- | --- | --- |
| random-forest | 2.09 | 8.59 | 10.49 | 12.81 |
| hybrid-forest | 10.11 | 37.01 | 43.33 | 47.36 |

E2E Latency, by percentile:

| Predictor | p50 | p90 | p95 | p99 |
| --- | --- | --- | --- | --- |
| random-forest | 10.97 | 28.08 | 30.50 | 49.27 |
| hybrid-forest | 47.04 | 90.13 | 100.81 | 111.52 |

We verified that using a polynomial fit for extrapolation, together with a random forest for interpolation, yields the best MAPE (mean absolute percentage error) results and brings predictions closest to the real system.
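A minimal sketch of the hybrid idea described above (hypothetical class and parameter names; the PR's actual implementation differs): a random forest handles points inside the profiled range, while a polynomial fit handles extrapolation beyond it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class HybridPredictorSketch:
    """RF for interpolation inside the profiled range, polynomial fit outside it."""

    def __init__(self, degree=1, n_estimators=10):
        self._rf = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
        self._degree = degree

    def fit(self, x, y):
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        # Remember the profiled range so predict() knows when to extrapolate.
        self._lo, self._hi = x.min(), x.max()
        self._rf.fit(x.reshape(-1, 1), y)  # interpolation model
        self._poly = np.polynomial.Polynomial.fit(x, y, deg=self._degree)  # extrapolation model
        return self

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        out = np.empty_like(x)
        inside = (x >= self._lo) & (x <= self._hi)
        if inside.any():
            out[inside] = self._rf.predict(x[inside].reshape(-1, 1))
        if (~inside).any():
            out[~inside] = self._poly(x[~inside])
        return out
```

Unlike a plain random forest, whose predictions flatten out at the boundary of the training data, the polynomial branch keeps growing with the input.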

yl3469 (Author) commented Apr 6, 2025:

@microsoft-github-policy-service agree

AgrawalAmey (Collaborator) commented:

@nitinkedia7 please review this PR, it introduces an important fix

@nitinkedia7 nitinkedia7 self-requested a review April 9, 2025 11:16
nitinkedia7 (Collaborator) left a comment:


Hi @yl3469, the hybrid predictor really helps when the simulator is used for predictions beyond the min and max datapoints. Most of my comments are about organizing the code; the core logic looks fine. It would be great if you can fix these. Thanks!



```python
@dataclass
class HybridForestExecutionTimePredictorConfig(BaseExecutionTimePredictorConfig):
```


Is it possible for HybridForestExecutionTimePredictorConfig to extend both RandomForrestExecutionTimePredictorConfig and LinearRegressionExecutionTimePredictorConfig?
That way we'll need to specify only the extra parameters, which mainly deal with when to switch between models.

If this is not possible, one way is to group the parameters in the order below:

```python
# Random Forest parameters
...
# Linear Regression parameters
...
# Hybrid parameters
...
```

```python
    default_factory=lambda: [(1, 99)],
    metadata={"help": "Percentile range to use if boundary_method='percentile'."},
)
# rf_params : dict = field(
```


Please remove unused parameters.

```python
@property
def event_type(self):
    return self._event_type
pass
```


Remove pass

```python
)

ExecutionTimePredictorRegistry.register(
    ExecutionTimePredictorType.HYBRID_FOREST, HybridForestExecutionTimePredictor
```


The current naming HYBRID_FOREST doesn't mention the use of Polynomial Regression. A better name would be HYBRID_RF_PR.


```python
Parameters
----------
boundary_method : str, default='auto'
```


Please move these descriptions inline into config.py itself to have a single source of truth.

```python
    Whether to fit an intercept in the LinearRegression model.

    """
    fit_intercept = True
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need each parameter repeated here and again in the arguments of __init__. This class could accept an object of class HybridForestExecutionTimePredictorConfig itself and store it in self._config.
Similarly, repeating default values here and again in __init__ creates three sources of truth that may start to mismatch. Please keep parameter defaults in config.py only.
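The pattern being suggested might look like this sketch (class and field names hypothetical): the predictor stores the config object and exposes parameters through it, so defaults exist only in config.py.

```python
from dataclasses import dataclass

@dataclass
class HybridConfigSketch:
    # Defaults live here, and only here.
    fit_intercept: bool = True
    boundary_method: str = "auto"

class HybridPredictorSketch:
    def __init__(self, config: HybridConfigSketch):
        # No per-parameter arguments or repeated defaults in __init__.
        self._config = config

    @property
    def fit_intercept(self) -> bool:
        return self._config.fit_intercept
```

Changing a default then requires editing exactly one place, and the predictor's signature stays stable as parameters are added.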

```python
Returns self.
"""
# Initialize models
self.rf_model_ = RandomForestRegressor(
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The codebase uses the convention self._rf_model instead of self.rf_model_. Please follow this in your other variable names too.

```python
self.rf_model_.fit(X, y)

# For lower boundary extrapolation
if isinstance(X, np.ndarray):
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to use only the lower 20% of the data and the upper 40% (above the 60th percentile)? What if all the data is used here?
If partial data is indeed required, then let's have either:

  1. One hyperparameter. For example, the lower model uses < 50th percentile and the upper uses > 50th percentile.
  2. Two hyperparameters.

In any case, please extract these constants into variables in HybridForestExecutionTimePredictorConfig.
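The two-hyperparameter option could look like this sketch (field and function names are hypothetical): the cut-offs become config fields instead of hard-coded 20 / 60 literals.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class BoundaryConfigSketch:
    # Percentile cut-offs for the lower/upper extrapolation models.
    lower_percentile: float = 20.0
    upper_percentile: float = 60.0

def boundary_masks(x_feature: np.ndarray, cfg: BoundaryConfigSketch):
    """Return boolean masks selecting the lower and upper data slices."""
    lower = x_feature <= np.percentile(x_feature, cfg.lower_percentile)
    upper = x_feature >= np.percentile(x_feature, cfg.upper_percentile)
    return lower, upper
```

With the single-hyperparameter option, one field (say 50.0) would serve as both cut-offs.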


```python
# For lower boundary extrapolation
if isinstance(X, np.ndarray):
    mask_lower = X_feature <= np.percentile(X_feature, 20)  # Use bottom 20% of data
```

This if/else block is very repetitive and can be condensed.

Suggested change:

```python
# For lower boundary extrapolation
mask_lower = X_feature <= np.percentile(X_feature, 20)  # use bottom 20% of data
X_lower = X[mask_lower]
y_lower = y[mask_lower] if isinstance(X, np.ndarray) else y[mask_lower.to_list()]
self._lr_model_lower.fit(X_lower, y_lower)

# For upper boundary extrapolation
mask_upper = X_feature >= np.percentile(X_feature, 60)  # use top 40% of data
X_upper = X[mask_upper]
y_upper = y[mask_upper] if isinstance(X, np.ndarray) else y[mask_upper.to_list()]
self._lr_model_upper.fit(X_upper, y_upper)
```

```python
config_str = str(self.to_dict())

if df is None:
    # import pdb; pdb.set_trace()
```
nitinkedia7 commented Apr 9, 2025:

Please remove these debugging comments and print statements from this file and other edited files in this PR.

nitinkedia7 (Collaborator) commented:

Hi @yl3469, I spent quite a bit of time with this PR because the core issue it aims to solve (extrapolating to, say, higher context lengths) makes Vidur usable at points where it hasn't been profiled. However, I found several critical issues with this implementation:

  1. GridSearchCV is used to train multiple predictors for the same operation (e.g. mlp_up_proj), and the best one is chosen. Each of the models uses a different hyperparameter set for, say, the random forest. Now say we have three predictors (lower, middle, upper) for the same operation, and there are L unique hyperparameter sets for lower, M for middle, and U for upper. Then the search space for grid search in the current implementation blows up to L * M * U. Cache preparation didn't even finish in an hour when I tried to create predictors for the 128k / 256k prefill attention operation. We need to drastically reduce this performance cost.
     Note that this performance problem is not fundamental. We can find the best lower, middle, and upper models separately and then combine them only at inference. The search space is then L + M + U instead of L * M * U.

  2. We do not need a separate lower predictor because all profiling starts from the minimum (e.g. batch size of 1, context length of 2 etc.). There is a significant performance cost to adding each predictor (explained above) so if we can do without a predictor, we should.

  3. We cache the results of predictions and use them as lookup tables at runtime. Say we have data up to batch size 128, and we want to predict up to 512. A simpler implementation would be: the training function trains both an RF and an LR model with the full profiling data (i.e. up to 128). When predicting below 128 it uses RF; for greater than 128 it uses LR. This also removes the additional "top 20% of the data" hyperparameter for the upper model.
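The simpler scheme described in point 3 can be sketched as follows (class and attribute names are hypothetical, not the PR's actual code): both models train on the full profiled data, and the maximum profiled value is the switch point.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

class SwitchAtMaxPredictorSketch:
    """RF inside the profiled range, LR beyond the max profiled point."""

    def fit(self, x, y):
        x = np.asarray(x, dtype=float).reshape(-1, 1)
        self._max = x.max()  # e.g. largest profiled batch size (128)
        # Both models see the FULL profiling data; no percentile slicing.
        self._rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(x, y)
        self._lr = LinearRegression().fit(x, y)
        return self

    def predict(self, x):
        x = np.asarray(x, dtype=float).reshape(-1, 1)
        beyond = x[:, 0] > self._max
        out = np.empty(len(x))
        if (~beyond).any():
            out[~beyond] = self._rf.predict(x[~beyond])
        if beyond.any():
            out[beyond] = self._lr.predict(x[beyond])
        return out
```

This keeps grid search costs at L + M (one search per model) and needs no lower predictor or slicing hyperparameters.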

This PR remains compatible with the codebase even after the recent big PR #56. However, the implementation needs to be made much more performant, and a few new parameters need to be extracted into the config.
