
locationConstraint not respected in ProvReq - node provisioned in wrong zone #9520

@mruoss

Description


Which component are you using?:

We're using Kueue with DWS (Dynamic Workload Scheduler) on GKE (classic) to get nodes with GPUs provisioned, so this concerns the cluster autoscaler's ProvisioningRequest handling.

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.34.4-gke.1193000
WARNING: version difference between client (1.32) and server (1.34) exceeds the supported minor version skew of +/-1

What environment is this in?:

GKE

What did you expect to happen?:

Have a look at the following ProvisioningRequest that was created by Kueue. Its locationConstraint parameter is deliberately set to europe-west4-a. I'd expect the autoscaler to respect this constraint and provision the node in that zone.

What happened instead?:

As you can see in the status, the SelectedZone was europe-west4-b. The problem is that the requested zone (europe-west4-a) is also configured as a node label on Kueue's ResourceFlavor, which results in a nodeSelector (topology.kubernetes.io/zone: europe-west4-a) being added to the pod. Since the node is provisioned in europe-west4-b but the pod requires europe-west4-a, the pod never gets scheduled and the capacity booking for the ProvisioningRequest expires.
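To make the failure mode concrete, here is a minimal sketch (not Kueue or autoscaler code) of the scheduler's nodeSelector predicate: every selector entry must match a node label exactly, so a node in the wrong zone can never satisfy the pod, using the labels from the manifests in this report:

```python
def node_selector_matches(node_labels: dict, node_selector: dict) -> bool:
    """True only if every nodeSelector key/value pair is present on the node."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())

# Labels the provisioned node carries (SelectedZone was europe-west4-b).
node_labels = {
    "topology.kubernetes.io/zone": "europe-west4-b",
    "cloud.google.com/gke-accelerator": "nvidia-tesla-a100",
}

# nodeSelector injected on the pod from the ResourceFlavor's nodeLabels.
pod_selector = {
    "topology.kubernetes.io/zone": "europe-west4-a",
    "cloud.google.com/gke-accelerator": "nvidia-tesla-a100",
}

print(node_selector_matches(node_labels, pod_selector))  # -> False: pod stays Pending
```

With the zones disagreeing, the predicate is False for the new node, and since the nodeSelector is a hard requirement, the pod stays Pending until the booking expires.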

Provisioning Request manifest

apiVersion: autoscaling.x-k8s.io/v1
kind: ProvisioningRequest
metadata:
  creationTimestamp: "2026-04-17T13:55:48Z"
  generation: 1
  labels:
    kueue.x-k8s.io/managed: "true"
  name: pod-fjjfwxiq-1f00b-europe-west4-a-1
  namespace: production
  ownerReferences:
  - apiVersion: kueue.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Workload
    name: pod-fjjfwxiq-1f00b
    uid: 3b684cd6-ccc2-45ea-9b35-905ab16cb326
  resourceVersion: "1776437424362047021"
  uid: 3d49e75f-7c8f-47c5-b33d-d7333c2ea464
spec:
  parameters:
    locationConstraint: europe-west4-a
  podSets:
  - count: 1
    podTemplateRef:
      name: ppt-pod-fjjfwxiq-1f00b-europe-west4-a-1-main
  provisioningClassName: queued-provisioning.gke.io
status:
  conditions:
  - lastTransitionTime: "2026-04-17T13:56:07Z"
    message: Provisioning Request was successfully queued.
    observedGeneration: 1
    reason: SuccessfullyQueued
    status: "True"
    type: Accepted
  - lastTransitionTime: "2026-04-17T14:40:24Z"
    message: Provisioning Request was successfully provisioned.
    observedGeneration: 1
    reason: Provisioned
    status: "True"
    type: Provisioned
  - lastTransitionTime: "2026-04-17T14:50:24Z"
    message: Capacity booking for the Provisioning Request has expired and the nodes
      are now candidates for scale down when underutilized.
    observedGeneration: 1
    reason: BookingExpired
    status: "True"
    type: BookingExpired
  provisioningClassDetails:
    AcceleratorType: nvidia-tesla-a100
    NodeGroupName: gke-saas-gke-cluster-nap-a2-highgpu-1-0a2fd8a7-grp
    NodePoolAutoProvisioned: "true"
    NodePoolName: nap-a2-highgpu-1g-gpu1-rlp1cspi
    PodTemplateName: ppt-pod-fjjfwxiq-1f00b-europe-west4-a-1-main
    ProvisioningMode: resize_request
    ResizeRequestName: gke-production-pod-fjjfwxiq-1f00b-e-a3cc38d07087bd9a
    SelectedZone: europe-west4-b

Node selectors on Pod

  nodeSelector:
    autoscaling.gke.io/provisioning-request: gke-production-pod-fjjfwxiq-1f00b-e-a3cc38d07087bd9a
    cloud.google.com/gke-accelerator: nvidia-tesla-a100
    topology.kubernetes.io/zone: europe-west4-a

Kueue resource flavor

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  annotations:
    argocd.argoproj.io/tracking-id: saas-gke-cluster-blue-kueue:kueue.x-k8s.io/ResourceFlavor:kueue-system/nvidia-tesla-a100-flavor
  creationTimestamp: "2024-10-11T10:04:08Z"
  finalizers:
  - kueue.x-k8s.io/resource-in-use
  generation: 2
  name: nvidia-tesla-a100-flavor
  resourceVersion: "1775716960047423022"
  uid: 3ac1dab6-66a3-4f49-96d1-9761ce87b71d
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100
    topology.kubernetes.io/zone: europe-west4-a

Kueue ProvisioningRequestConfig

apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: europe-west4-a
spec:
  provisioningClassName: queued-provisioning.gke.io
  managedResources:
    - nvidia.com/gpu
  parameters:
    locationConstraint: "europe-west4-a"

How to reproduce it (as minimally and precisely as possible):

It's not easy to reproduce: most of the time we actually do get a node in the requested zone. To trigger it deliberately, it might be easier to set the constraint to europe-west4-b and wait for a node to be provisioned in -a instead.
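Since the mismatch is intermittent, a small helper can flag affected requests. This is a hypothetical sketch, not an official tool: it takes a ProvisioningRequest object as a dict (e.g. parsed from kubectl get provreq -o json) and compares the requested locationConstraint against the SelectedZone reported in the status:

```python
def location_constraint_violated(provreq: dict) -> bool:
    """True if the autoscaler picked a different zone than the one requested."""
    requested = provreq.get("spec", {}).get("parameters", {}).get("locationConstraint")
    selected = (provreq.get("status", {})
                       .get("provisioningClassDetails", {})
                       .get("SelectedZone"))
    return requested is not None and selected is not None and requested != selected

# The values from the manifest above:
provreq = {
    "spec": {"parameters": {"locationConstraint": "europe-west4-a"}},
    "status": {"provisioningClassDetails": {"SelectedZone": "europe-west4-b"}},
}
print(location_constraint_violated(provreq))  # -> True: wrong zone was selected
```

Running this over all ProvisioningRequests in a cluster would show how often the constraint is ignored.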

Anything else we need to know?:

Labels

area/cluster-autoscaler, kind/bug, triage/accepted