
VPA: OOM bump-up creates self-reinforcing recommendation loop when maxAllowed caps memory below real peak #9521

@Sanil2108

Description


Which component are you using?:

/area vertical-pod-autoscaler

What version of the component are you using?:

VPA v1.4.1

Component version: v1.4.1 (recommender deployed with --oom-bump-up-ratio=2.0, --target-memory-percentile=0.99)

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.30.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.15+k3s1

What environment is this in?:

Production Kubernetes cluster on a managed cloud provider. Reproducible on any cluster running VPA recommender with --oom-bump-up-ratio > 1.0 and a VPA whose maxAllowed.memory is below the workload's real peak memory need.

What did you expect to happen?:

VPA recommendations should converge to a stable value near actual observed usage once the workload stabilizes. When maxAllowed caps requests below the workload's real peak, occasional OOMs are expected — but they should not feed back into the recommender in a way that permanently inflates uncappedTarget far beyond any value the workload has ever actually used.

What happened instead?:

VPA enters a self-reinforcing feedback loop that pins uncappedTarget at roughly maxAllowed × oom-bump-up-ratio, regardless of real usage:

  1. Pod is created with memory request = maxAllowed (e.g., 30 GiB), because VPA's recommendation is capped there.
  2. Rare heavy job exceeds 30 GiB → pod is OOMKilled.
  3. VPA's RecordOOM inserts a synthetic memory sample at max(requestedMemory, memoryPeak) × oom-bump-up-ratio = 30 GiB × 2.0 = 60 GiB.
  4. Histogram P99 now lands in the bucket containing 60 GiB → uncappedTarget ≈ 61 GiB.
  5. VPA wants 61 GiB → capped back to 30 GiB by maxAllowed.
  6. Next heavy job → another OOM → another 60 GiB synthetic sample → loop.

With the 14-day histogram decay half-life, a single OOM's synthetic sample keeps significant weight in the histogram for 30–60 days, so recommendations never settle back toward actual usage (see the sketch below).
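For a rough sense of how long that weight persists, here is a standalone Go sketch. It uses a simplified decaying-weight model (one memory-peak sample per day at the real ~19.6 GiB peak, sample weight halving every 14 days), not the recommender's actual bucketed histogram, and estimates how long a single 60 GiB synthetic sample keeps the weighted P99 above real usage:

package main

import (
    "fmt"
    "math"
    "sort"
)

// weightedP99 returns the smallest value v such that at least 99% of the
// total weight lies at or below v.
func weightedP99(values, weights []float64) float64 {
    type vw struct{ v, w float64 }
    pairs := make([]vw, len(values))
    total := 0.0
    for i := range values {
        pairs[i] = vw{values[i], weights[i]}
        total += weights[i]
    }
    sort.Slice(pairs, func(i, j int) bool { return pairs[i].v < pairs[j].v })
    cum := 0.0
    for _, p := range pairs {
        cum += p.w
        if cum >= 0.99*total {
            return p.v
        }
    }
    return pairs[len(pairs)-1].v
}

func main() {
    const halfLifeDays = 14.0 // decay half-life described above (336h)
    const realPeakGiB = 19.6  // highest working set actually observed
    const oomSampleGiB = 60.0 // maxAllowed (30 GiB) x oom-bump-up-ratio (2.0)

    // 90 days of one normal peak sample per day, plus one OOM sample injected
    // ageDays ago; a sample aged d days carries weight 0.5^(d/14).
    for _, ageDays := range []int{7, 14, 30, 45, 60} {
        var values, weights []float64
        for d := 0; d < 90; d++ {
            values = append(values, realPeakGiB)
            weights = append(weights, math.Pow(0.5, float64(d)/halfLifeDays))
        }
        values = append(values, oomSampleGiB)
        weights = append(weights, math.Pow(0.5, float64(ageDays)/halfLifeDays))
        fmt.Printf("OOM sample %2d days old -> weighted P99 ~ %.1f GiB\n",
            ageDays, weightedP99(values, weights))
    }
}

Under this simplified model the weighted P99 only falls back to ~19.6 GiB once the synthetic sample is roughly 32+ days old, which is consistent with the 30–60 day window described above.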

Observed on one workload:

uncappedTarget.memory: 61.3 GiB
target.memory (capped): 30 GiB
Highest memory any pod ever actually used (90d, container_memory_working_set_bytes): 19.57 GiB
P99 memory (14d): 18.2 GiB
Avg memory (14d): 4.0 GiB
OOMKills in 90d: 3

uncappedTarget is ~3× the highest memory the workload has ever used.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy VPA recommender with --oom-bump-up-ratio=2.0 (non-default; upstream default is 1.2). Any value > 1.0 reproduces the loop; higher ratios amplify it.
  2. Create a VPA with maxAllowed.memory set below the workload's actual peak memory need. Example:
    resourcePolicy:
      containerPolicies:
        - containerName: publish
          maxAllowed:
            memory: 30Gi
  3. Run a workload whose occasional heavy jobs exceed maxAllowed, causing OOMKills.
  4. After at least one OOM, inspect VPA status:
    kubectl get vpa <name> -n <ns> -o yaml | sed -n '/status:/,$p'
    Observe that status.recommendation.containerRecommendations[].uncappedTarget.memory is approximately maxAllowed × oom-bump-up-ratio, while target.memory stays pinned at maxAllowed.
  5. Confirm uncappedTarget exceeds any real observed usage, e.g. via Prometheus:
    max by(pod) (max_over_time(
      container_memory_working_set_bytes{namespace="<ns>", container="<c>"}[90d]
    )) / 1024 / 1024 / 1024
    
  6. Confirm the OOMKill history that's driving the synthetic samples:
    sum(max_over_time(kube_pod_container_status_last_terminated_reason{
      namespace="<ns>", reason="OOMKilled"
    }[90d]))
    

Anything else we need to know?:

Root cause — in pkg/recommender/model/container.go RecordOOM:

// Get max of the request and the recent usage-based memory peak.
// Omitting oomPeak here to protect against recommendation running too high on subsequent OOMs.
memoryUsed := ResourceAmountMax(requestedMemory, container.memoryPeak)
memoryNeeded := ResourceAmountMax(
    memoryUsed + MemoryAmountFromBytes(container.GetOOMMinBumpUp()),
    ScaleResource(memoryUsed, container.GetOOMBumpUpRatio()),
)

When maxAllowed caps the request below real need, requestedMemory equals maxAllowed at OOM time, so the synthetic sample becomes maxAllowed × oom-bump-up-ratio. This value is by construction greater than maxAllowed, so the next OOM produces the same synthetic sample — a stable loop rather than a converging one. The existing comment ("Omitting oomPeak here to protect against recommendation running too high on subsequent OOMs") shows the maintainers have already patched a similar amplification path; the maxAllowed interaction appears to be an unhandled case.
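The arithmetic can be checked in isolation. The sketch below mirrors the quoted logic with plain float64s (not the recommender's ResourceAmount types) and the values from this report, assuming a ~100 MiB minimum bump; it shows the synthetic sample landing at 60 GiB and, for any ratio > 1.0, strictly above maxAllowed:

package main

import (
    "fmt"
    "math"
)

const (
    oomBumpUpRatio = 2.0                 // --oom-bump-up-ratio as deployed
    oomMinBumpUp   = 100.0 * 1024 * 1024 // assumed ~100 MiB minimum bump
    gib            = 1024.0 * 1024 * 1024
)

// syntheticSample mirrors the RecordOOM arithmetic quoted above.
func syntheticSample(requestedMemory, memoryPeak float64) float64 {
    memoryUsed := math.Max(requestedMemory, memoryPeak)
    return math.Max(memoryUsed+oomMinBumpUp, memoryUsed*oomBumpUpRatio)
}

func main() {
    maxAllowed := 30 * gib  // request is pinned here by the cap at OOM time
    realPeak := 19.57 * gib // highest usage ever observed

    sample := syntheticSample(maxAllowed, realPeak)
    fmt.Printf("synthetic sample: %.1f GiB (maxAllowed %.0f GiB, real peak %.2f GiB)\n",
        sample/gib, maxAllowed/gib, realPeak/gib)
    fmt.Println("sample > maxAllowed:", sample > maxAllowed) // true, so the loop is stable
}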

Proposed fix: clamp the synthetic OOM sample at maxAllowed (or at the current capped recommendation) so it never exceeds what VPA is actually allowed to recommend. Capping only the base passed to the bump-up would not help here, because requestedMemory already equals maxAllowed at OOM time; the clamp has to apply to the resulting sample:

memoryUsed := ResourceAmountMax(requestedMemory, container.memoryPeak)
memoryNeeded := ResourceAmountMax(
    memoryUsed + MemoryAmountFromBytes(container.GetOOMMinBumpUp()),
    ScaleResource(memoryUsed, container.GetOOMBumpUpRatio()),
)
// maxAllowed is not currently available inside RecordOOM; it would have to be
// plumbed in from the VPA's resourcePolicy. ResourceAmountMin would be a small
// new helper alongside the existing ResourceAmountMax.
if maxAllowed > 0 {
    memoryNeeded = ResourceAmountMin(memoryNeeded, maxAllowed)
}

This preserves OOM bump-up behavior for the common case where maxAllowed is not the binding constraint, while preventing the feedback loop when it is.
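For the unit test, one option is to assert that the synthetic sample never exceeds maxAllowed when the request is already pinned at the cap, and that behavior is unchanged when no cap applies. The sketch below tests a standalone reimplementation of the patched arithmetic (the real test would exercise RecordOOM in container_test.go once maxAllowed is plumbed through; names here are illustrative only):

package model

import "testing"

// cappedSyntheticSample is a stand-in for the patched RecordOOM arithmetic:
// compute the bump-up as today, then clamp the result to maxAllowed (0 = no cap).
func cappedSyntheticSample(requestedMemory, memoryPeak, minBumpUp, bumpUpRatio, maxAllowed float64) float64 {
    used := requestedMemory
    if memoryPeak > used {
        used = memoryPeak
    }
    needed := used + minBumpUp
    if scaled := used * bumpUpRatio; scaled > needed {
        needed = scaled
    }
    if maxAllowed > 0 && needed > maxAllowed {
        needed = maxAllowed
    }
    return needed
}

func TestOOMBumpUpCappedAtMaxAllowed(t *testing.T) {
    const gib = 1024.0 * 1024 * 1024
    const minBump = 100 * 1024 * 1024

    // Request already pinned at maxAllowed: the synthetic sample must not exceed the cap.
    if got := cappedSyntheticSample(30*gib, 19.57*gib, minBump, 2.0, 30*gib); got > 30*gib {
        t.Errorf("synthetic sample %.1f GiB exceeds maxAllowed 30 GiB", got/gib)
    }
    // No cap in play: existing bump-up behavior is preserved.
    if got := cappedSyntheticSample(10*gib, 8*gib, minBump, 2.0, 0); got != 20*gib {
        t.Errorf("expected 20 GiB bump-up, got %.1f GiB", got/gib)
    }
}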

Notes:

  • The upstream default oom-bump-up-ratio=1.2 reduces but does not eliminate the loop — 1.2 × maxAllowed is still > maxAllowed, so the synthetic sample still sits permanently above the cap.
  • The interaction with --target-memory-percentile=0.99 makes this worse, since P99 is more sensitive to a small number of outlier synthetic samples than P90 would be.
  • Decay half-life of 336h (14d) means a single OOM's synthetic sample retains meaningful weight for 30–60 days.
  • The issue is cluster-specific: only VPAs with OOM history and maxAllowed below real peak exhibit it. Workloads that never OOM see normal behavior.

I'd be happy to pick this up and open a PR with the fix plus a unit test covering the capped-OOM case.

    Labels

    area/vertical-pod-autoscaler, kind/bug, needs-triage
