Add minimal EKS Auto Mode test tooling #4285

Open
pcnudde wants to merge 6 commits into NVIDIA:main from pcnudde:feature/aws-k8s

pcnudde (Collaborator) commented Mar 10, 2026

Summary

  • add a minimal EKS Auto Mode cluster config under tests/tools/aws/eks
  • add a tiny non-FLARE workload manifest to verify pod scheduling on the cluster
  • document the direct eksctl and kubectl workflow in the README
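The eksctl and kubectl workflow mentioned above can be sketched roughly as follows. This is a minimal illustration, not the README's actual content: the file names `cluster.yaml` and `inflate.yaml` in the working directory, the `run` helper, and the `DRY_RUN` switch are all assumptions introduced here. By default the script only prints the commands, so nothing is created by accident.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file names; the PR only states that the files live under tests/tools/aws/eks.
CLUSTER_CONFIG="${CLUSTER_CONFIG:-cluster.yaml}"
WORKLOAD="${WORKLOAD:-inflate.yaml}"
# Default to printing commands; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# run: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

eks_smoke_test() {
  run eksctl create cluster -f "${CLUSTER_CONFIG}"         # create the cluster from the config file
  run kubectl apply -f "${WORKLOAD}"                       # schedule the test workload
  run kubectl get pods -o wide                             # check that the pod was placed on a node
  run kubectl get nodes                                    # check that a node was provisioned
  run kubectl delete -f "${WORKLOAD}"                      # remove the workload
  run eksctl delete cluster -f "${CLUSTER_CONFIG}" --wait  # synchronous teardown
}

eks_smoke_test
```

Running it with `DRY_RUN=0` would execute the same six commands against the current AWS credentials.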

Testing

  • not run (documentation and manifest additions only)

greptile-apps bot (Contributor) commented Mar 10, 2026

Greptile Summary

This PR adds minimal Kubernetes cluster test tooling for three cloud providers — AWS EKS Auto Mode, Azure AKS Automatic, and GCP GKE Autopilot — under tests/tools/, including shell scripts for cluster lifecycle management, small inflate.yaml workloads to verify pod scheduling, and README guides for each provider. All issues raised in the previous review round have been properly addressed (removed --sort-by from the EKS watch command, removed --overwrite-existing from AKS credential fetch, removed --no-wait from AKS group delete, replaced the real GCP project ID in the GKE README, and added GKE firewall-rule cleanup before VPC deletion).

Key observations:

  • All scripts use set -euo pipefail consistently and provide environment-variable overrides for customisation.
  • EKS teardown uses --wait for synchronous cluster deletion; AKS teardown deletes the entire resource group synchronously; GKE teardown deletes the cluster, cleans up residual gke-* firewall rules, then removes the VPC — a solid multi-step teardown pattern.
  • The three inflate.yaml manifests are intentionally minimal pause-container deployments; the EKS one adds an eks.amazonaws.com/compute-type: auto nodeSelector and a pod-level security context which are AWS-specific requirements.
  • The GKE delete_cluster.sh firewall-rule filter (name~'^gke-') covers the well-known gke-{cluster} prefix; GKE Autopilot can occasionally create rules with a gk3- prefix as well, so the filter may leave a small number of orphaned rules in edge cases.
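The first observation (strict mode plus environment-variable overrides) is a common shell pattern. A minimal sketch follows; the `load_settings` helper, the variable names, and the default values are hypothetical, since the scripts' actual contents are not quoted in this thread.

```shell
#!/usr/bin/env bash
# Abort on errors, on unset variables, and on failures inside pipelines.
set -euo pipefail

# load_settings: hypothetical helper; names and defaults are illustrative only.
load_settings() {
  # ${VAR:-default} keeps a caller-supplied value and falls back otherwise.
  CLUSTER_NAME="${CLUSTER_NAME:-flare-test}"
  REGION="${REGION:-us-west-2}"
}

load_settings
printf 'cluster=%s region=%s\n' "${CLUSTER_NAME}" "${REGION}"
```

With this pattern, a caller can override a single value inline, for example `CLUSTER_NAME=my-cluster ./create_cluster.sh`, and leave the other defaults in place.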

Confidence Score: 5/5

  • This PR is safe to merge — it adds net-new test tooling files with no impact on existing NVFLARE source code.
  • All changes are documentation, shell scripts, and YAML manifests under tests/tools/. They do not touch any NVFLARE library or application code, carry no secrets, and all previous review concerns have been addressed. The one minor observation (GKE firewall filter potentially missing gk3- prefixed rules) is a very low-probability edge case in test tooling and does not block merging.
  • No files require special attention; tests/tools/gcp/gke/delete_cluster.sh has a minor edge-case note worth being aware of.

Important Files Changed

Filename | Overview
tests/tools/aws/eks/inflate.yaml | EKS-specific inflate workload using pause container on Auto Mode nodes; includes pod-level security context, nodeSelector, and allowPrivilegeEscalation: false.
tests/tools/azure/aks/create_cluster.sh | Creates resource group, AKS Automatic cluster, and fetches credentials; --overwrite-existing was removed per previous review; no remaining issues.
tests/tools/azure/aks/delete_cluster.sh | Synchronous resource group deletion (--no-wait was removed per previous review); clean and consistent with the other cloud scripts.
tests/tools/gcp/gke/create_cluster.sh | Creates auto-mode VPC if absent, then creates GKE Autopilot cluster and fetches credentials; PROJECT_ID fallback with validation is robust.
tests/tools/gcp/gke/delete_cluster.sh | Synchronous cluster deletion followed by cleanup of residual gke-* firewall rules and VPC removal; previous feedback on firewall cleanup was addressed, though the name~'^gke-' filter may miss any non-gke- prefixed rules left by GKE Autopilot.
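The "PROJECT_ID fallback with validation" noted for the GKE create script is not shown in this thread. One plausible shape, assuming the fallback is the active gcloud configuration, is sketched below; the `resolve_project_id` function name and the error message are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# resolve_project_id: hypothetical helper, not taken from the PR.
# Prefers the PROJECT_ID environment variable, then the gcloud default project,
# and fails with a message if neither yields a value.
resolve_project_id() {
  local project="${PROJECT_ID:-}"
  if [ -z "${project}" ] && command -v gcloud >/dev/null 2>&1; then
    project="$(gcloud config get-value project 2>/dev/null || true)"
  fi
  if [ -z "${project}" ]; then
    echo "PROJECT_ID is not set and no gcloud default project was found" >&2
    return 1
  fi
  printf '%s\n' "${project}"
}

# Guarded call so the sketch does not abort when no project can be resolved.
if project="$(resolve_project_id)"; then
  echo "Using project: ${project}"
else
  echo "A real script would stop here." >&2
fi
```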

Sequence Diagram

sequenceDiagram
    participant U as User
    participant CS as create_cluster.sh
    participant CP as Cloud Provider API
    participant K as kubectl
    participant DS as delete_cluster.sh

    U->>CS: ./create_cluster.sh
    CS->>CP: Create VPC/Network (GKE only)
    CP-->>CS: Network ready
    CS->>CP: Create cluster (eksctl / az aks / gcloud)
    CP-->>CS: Cluster ready
    CS->>CP: Get credentials (kubeconfig)
    CP-->>CS: kubeconfig updated
    CS-->>U: Cluster live

    U->>K: kubectl apply -f inflate.yaml
    K->>CP: Schedule Deployment (1 replica)
    CP-->>K: Node provisioned, pod running
    U->>K: kubectl get pods / nodes
    K-->>U: Pod + node status

    U->>K: kubectl delete -f inflate.yaml
    K->>CP: Remove Deployment
    CP-->>K: Node scaled down

    U->>DS: ./delete_cluster.sh
    DS->>CP: Delete cluster (--wait / synchronous)
    CP-->>DS: Cluster deleted
    DS->>CP: Delete residual firewall rules (GKE only)
    CP-->>DS: Firewall rules removed
    DS->>CP: Delete VPC/Network (GKE only)
    CP-->>DS: Network deleted
    DS-->>U: Teardown complete

Last reviewed commit: d695247

Comment on lines +27 to +30
gcloud compute firewall-rules list \
--filter="network=${NETWORK_NAME} AND name~'^gke-'" \
--format="value(name)" \
--project "${PROJECT_ID}"

GKE Autopilot gk3- firewall rules may not be caught by this filter

GKE Autopilot (as opposed to Standard) sometimes creates firewall rules with a gk3- prefix (the "3" denoting the Autopilot generation) rather than the classic gke- prefix. The current filter name~'^gke-' would miss those rules, leaving them attached to the network and potentially causing the subsequent gcloud compute networks delete to fail.

Consider broadening the prefix filter to match both known prefixes:

Suggested change
- gcloud compute firewall-rules list \
- --filter="network=${NETWORK_NAME} AND name~'^gke-'" \
- --format="value(name)" \
- --project "${PROJECT_ID}"
+ gcloud compute firewall-rules list \
+ --filter="network=${NETWORK_NAME} AND (name~'^gke-' OR name~'^gk3-')" \
+ --format="value(name)" \
+ --project "${PROJECT_ID}"
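To see the effect of the broader filter without touching a real project, the two prefix checks can be mirrored locally with bash's `=~` operator. This is only an approximation of the gcloud filter semantics, and the rule names below are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Mirrors the original filter: name~'^gke-'
matches_original() { [[ "$1" =~ ^gke- ]]; }

# Mirrors the suggested filter: name~'^gke-' OR name~'^gk3-'
matches_suggested() { [[ "$1" =~ ^gke- || "$1" =~ ^gk3- ]]; }

# Made-up rule names for illustration only.
for rule in gke-demo-1234-all gk3-demo-1234-vms default-allow-ssh; do
  orig=no
  sugg=no
  if matches_original "${rule}"; then orig=yes; fi
  if matches_suggested "${rule}"; then sugg=yes; fi
  printf '%-20s original=%s suggested=%s\n' "${rule}" "${orig}" "${sugg}"
done
```

Only the `gk3-` name changes outcome between the two checks; unrelated rules such as `default-allow-ssh` stay excluded in both cases.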
