Replies: 1 comment
Hi @tpfau , not sure if this is still relevant, but I guess this is more of a question about how KEDA scales rather than the HTTP Addon. Note: Closing this discussion as I'm doing a cleanup right now, let me know if something should be reopened.
So here is my situation:
I have a somewhat heterogeneous Kubernetes cluster whose nodes carry several different types of GPUs (let's call them A and B). I would like to host some LLMs on that cluster and scale them with demand.
However, some of the models I want to serve need more memory than certain GPUs provide (e.g. type A offers only 48 GB while the model needs 60 GB). Such a model can still run spread across multiple type-A GPUs, or on a single GPU of type B (which provides 96 GB).
I could easily create two Deployments for such a model: one that requests multiple GPUs of type A, and another that requests a single GPU of type B.
And here is my problem/question:
Is there a way to make the autoscaler scale the total replica count across the combination of the type-A and type-B Deployments?
And/or does someone know a good way to achieve this, or is this simply out of scope for Kubernetes?
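In case it helps anyone landing here later: one way to approximate this with KEDA is to give each Deployment its own ScaledObject driven by the same demand metric, and split the load between them via per-object thresholds and replica caps. This is only a sketch under assumed names (`llm-gpu-a`/`llm-gpu-b` Deployments, a hypothetical `llm_pending_requests` Prometheus metric); KEDA has no built-in notion of a shared replica budget across multiple Deployments, so the two objects scale independently.

```yaml
# Two KEDA ScaledObjects, one per GPU-specific Deployment, both reading
# the same demand signal. Names and metric are assumptions, not from the thread.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-gpu-a-scaler
spec:
  scaleTargetRef:
    name: llm-gpu-a           # Deployment requesting multiple type-A GPUs
  minReplicaCount: 0
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(llm_pending_requests)   # hypothetical demand metric
        threshold: "10"
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-gpu-b-scaler
spec:
  scaleTargetRef:
    name: llm-gpu-b           # Deployment requesting one type-B GPU
  minReplicaCount: 0
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(llm_pending_requests)
        threshold: "10"
```

Because both objects watch the same metric, each will scale toward roughly demand/threshold replicas on its own; actually enforcing a single combined replica count would need an external controller or a custom scaler that observes both Deployments together.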