VPA recommender (Prometheus history): support cluster label matcher to avoid cross-cluster metric mixing #9491

@maichouni-mitek

Which component are you using?:
vertical-pod-autoscaler (recommender, --storage=prometheus mode)

/area vertical-pod-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
When VPA recommender uses a shared Prometheus/Thanos endpoint that contains metrics from multiple clusters, there is no built-in way to scope history queries to a single cluster label/value.
In multi-cluster backends, this can lead to cross-cluster metric mixing (especially when namespace/pod names overlap across clusters), which affects recommendation correctness.

Describe the solution you'd like.:
Add optional recommender flags to scope Prometheus history queries by cluster label, for example:

  • --prometheus-cluster-label=<label_name> (example: cluster)
  • --prometheus-cluster-id=<cluster_value> (example: qa)

When set, append {<label_name>="<cluster_value>"} to the internal PromQL used for:

  • CPU history query
  • memory history query
  • pod-label lookup query (--metric-for-pod-labels path)

If these flags are not set, keep current behavior for full backward compatibility.
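
To illustrate the intended behavior, here is a minimal Go sketch of how an optional cluster matcher could be merged into a selector. This is not the recommender's actual query-building code: the helper names (buildSelector, withClusterMatcher) are hypothetical, and the metric name and job value are only illustrative. The two string parameters stand in for the proposed --prometheus-cluster-label and --prometheus-cluster-id flags, which do not exist yet.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// withClusterMatcher returns a copy of matchers with the cluster matcher added.
// If either clusterLabel or clusterID is empty, the copy is returned unchanged,
// which keeps the current (unscoped) behavior. Hypothetical helper.
func withClusterMatcher(matchers map[string]string, clusterLabel, clusterID string) map[string]string {
	out := make(map[string]string, len(matchers)+1)
	for name, value := range matchers {
		out[name] = value
	}
	if clusterLabel != "" && clusterID != "" {
		out[clusterLabel] = clusterID
	}
	return out
}

// buildSelector renders a PromQL selector such as metric{a="x", b="y"}.
// Label names are sorted so the output is deterministic. Hypothetical helper.
func buildSelector(metric string, matchers map[string]string) string {
	if len(matchers) == 0 {
		return metric
	}
	names := make([]string, 0, len(matchers))
	for name := range matchers {
		names = append(names, name)
	}
	sort.Strings(names)

	parts := make([]string, 0, len(names))
	for _, name := range names {
		// %q quotes and escapes the value; PromQL string literals
		// follow Go's escaping rules.
		parts = append(parts, fmt.Sprintf("%s=%q", name, matchers[name]))
	}
	return metric + "{" + strings.Join(parts, ", ") + "}"
}

func main() {
	// Illustrative base matchers; the real recommender builds its own.
	base := map[string]string{"job": "kubernetes-cadvisor"}

	// Flags set: the query is scoped to one cluster.
	fmt.Println(buildSelector("container_cpu_usage_seconds_total",
		withClusterMatcher(base, "cluster", "qa")))

	// Flags unset: the query is unchanged.
	fmt.Println(buildSelector("container_cpu_usage_seconds_total",
		withClusterMatcher(base, "", "")))
}
```

Requiring both values before adding the matcher is one possible design choice. It avoids emitting a matcher with an empty value, which Prometheus would interpret as "label absent or empty" rather than "no filter".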

I believe this would be a relatively small change, though I am not familiar with Go, so I might be mistaken. I expect these files would need to be amended:

  • vertical-pod-autoscaler/pkg/recommender/config/config.go
  • vertical-pod-autoscaler/pkg/recommender/main.go
  • vertical-pod-autoscaler/pkg/recommender/input/history/history_provider.go
  • vertical-pod-autoscaler/pkg/recommender/input/history/history_provider_test.go

Describe any alternative solutions you've considered.:
I have not tested any of these (yet):

  1. Per-cluster isolated Prometheus/Thanos endpoints (should work, but requires additional infra/routing and is not always feasible).
  2. Maintaining an internal VPA fork that hardcodes or adds this matcher (should work, but creates long-term maintenance burden and upgrade friction).

Additional context.:
Environment details from my testing:

  • VPA recommender 1.6.0
  • --storage=prometheus
  • shared Thanos endpoint serving multiple clusters

Current VPA supports several query-related flags (job name, metric names, pod/container label names), but there is no explicit cluster matcher control for history query scoping.

Happy to test and provide feedback on a proposed implementation.
