Skip to content

Conversation

@jpayne3506
Copy link
Contributor

Reason for Change:

Resolves lingering issues from #4175.
Adds validate template in an effort to unify CI/CD.

Issues:

  • Device (eth0) Change/drop | Requires upstream changes to fix.
    • Removed systemd-networkd restart test from cilium nightly until upstream fix is merged.
    • This resulted in splitting out the test from TestStateValidation. Added make recipes to support and retain original intent
  • Error logged during k8sE2E test when running service conformance test. Non impacting, but caught in cilium agent logs which fails the pipeline.
    • Testing confirmed that failure was a false negative from too many tests occurring at the same time. Added a lever to control the amount of parallelized processes and restricted the nightly to be serial. This also has the side benefit of giving us complete logs from every test.

Issue Fixed:

Requirements:

Notes:

@jpayne3506 jpayne3506 self-assigned this Jan 8, 2026
@jpayne3506 jpayne3506 added the ci Infra or tooling. label Jan 8, 2026
Copilot AI review requested due to automatic review settings January 8, 2026 22:26
@jpayne3506 jpayne3506 requested a review from a team as a code owner January 8, 2026 22:26
@jpayne3506
Copy link
Contributor Author

Will address .pipelines/cni templates in a follow up PR.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses lingering issues from #4175 by refactoring validation tests and adding better control over test parallelization. The main focus is on separating the systemd-networkd restart validation from general state validation and introducing a unified validation template for CI/CD pipelines.

Key changes:

  • Split systemd-networkd restart test into a separate function (ValidateNetworkdRestart) and test case (TestValidateNetworkdRestart)
  • Added a new unified validate-tests.yaml template to consolidate duplicate validation logic across multiple pipeline files
  • Introduced a processes parameter to control parallelization in k8s E2E tests, setting it to 1 (serial) for Cilium nightly to prevent race conditions and false negatives

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
test/validate/validate.go Extracted networkd restart validation into a separate public method ValidateNetworkdRestart
test/integration/load/load_test.go Added ValidateNetworkdRestart config field and TestValidateNetworkdRestart test function; updated calls to use ValidateStateFile instead of Validate
Makefile Added test-validate meta-target and test-validate-networkd-restart target to support separate test execution
.pipelines/templates/validate-tests.yaml New unified template for validation tests with conditional execution based on validation type
.pipelines/templates/cilium-tests.yaml Refactored to use the new validate-tests.yaml template
.pipelines/templates/cilium-nightly-checks.yaml Updated to use test-validate instead of test-validate-state
.pipelines/singletenancy//e2e.yaml Multiple pipeline files refactored to use the new validate-tests.yaml template and add VALIDATE_NETWORKDRESTART flag
.pipelines/cni/k8s-e2e/k8s-e2e-job-template.yaml Added configurable processes parameter with default of 8
.pipelines/cni/cilium/nightly-release-test.yml Set processes to 1 for serial test execution in nightly builds
Comments suppressed due to low confidence (1)

.pipelines/singletenancy/cilium-overlay/cilium-overlay-e2e.steps.yaml:69

  • The export VALIDATE_NETWORKDRESTART=true on line 62 is set but not explicitly passed to the make test-load command on lines 65-69. While the export may be inherited by sudo -E, it would be more consistent and explicit to pass VALIDATE_NETWORKDRESTART=true directly in the make command, as done in the cilium-overlay-e2e-step-template.yaml file (line 71) and other similar files.
          export VALIDATE_NETWORKDRESTART=true
      fi
      kubectl get po -owide -A
      sudo -E env "PATH=$PATH" make test-load \
        SCALE_UP=32 OS_TYPE=linux VALIDATE_STATEFILE=true \
        INSTALL_CNS=true INSTALL_OVERLAY=true CLEANUP=true \
        AZURE_IPAM_VERSION=$(AZURE_IPAM_VERSION) CNS_VERSION=$(CNS_VERSION) \
        IPAM_IMAGE_NAME_OVERRIDE=$(IPAM_IMAGE_NAME_OVERRIDE) CNS_IMAGE_NAME_OVERRIDE=$(CNS_IMAGE_NAME_OVERRIDE)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

done
sudo -E env "PATH=$PATH" make test-load \
SCALE_UP=32 OS_TYPE=linux CNI_TYPE=stateless VALIDATE_STATEFILE=true VALIDATE_V4OVERLAY=true \
SCALE_UP=32 OS_TYPE=windows CNI_TYPE=stateless VALIDATE_STATEFILE=true VALIDATE_V4OVERLAY=true \
Copy link

Copilot AI Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OS_TYPE value appears to be incorrect. This line is in the azure-cni-overlay-stateless e2e steps for Linux, but the OS_TYPE has been changed from "linux" to "windows". This seems like an unintended change that would cause the test to run with the wrong OS type.

Suggested change
SCALE_UP=32 OS_TYPE=windows CNI_TYPE=stateless VALIDATE_STATEFILE=true VALIDATE_V4OVERLAY=true \
SCALE_UP=32 OS_TYPE=linux CNI_TYPE=stateless VALIDATE_STATEFILE=true VALIDATE_V4OVERLAY=true \

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stateless is restricted to windows

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only windows ? Was this incorrect at the first place ? cc: @behzad-mir

Co-authored-by: Copilot <[email protected]>
Signed-off-by: John Payne <[email protected]>
@camrynl
Copy link
Contributor

camrynl commented Jan 8, 2026

is there an issue you can link for systemd-networkd restart fix?

camrynl
camrynl previously approved these changes Jan 8, 2026
parameters:
validate: "nodeRestart"
cni: "cilium"
os: ${{ parameters.os }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is os a parameter passed in here? if this is an issue there might be other places without the os like .pipelines/singletenancy/cilium-overlay/cilium-overlay-e2e-step-template.yaml etc. unless I'm missing where it's defined (is it like a pipeline var?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is passed in from any parent template the parameters.os will be populated. Varying on the scenario it is passed in from pipeline.yaml or the job template.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if I had a parameter for a stage template, and then we had a template for a job, and a template for a task-- in the task I could use the stage template's parameters? wondering if it would be too much hassle to explicitly declare the parameters and pass them down into the template?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure this covers all variations

so if I had a parameter for a stage template, and then we had a template for a job, and a template for a task-- in the task I could use the stage template's parameters?

  • you cannot explicitly reference a given stage, job, task parameter.
  • a template call will take any parameter. Defined or not in the respective template that is being called
  • If a given/expected parameter is not passed into the template then it will use the default parameters we define at the top of our template files. We leverage the defaults as empty strings as helpers to people using them to know that this template requires A,B,C,X,Y,Z parameters to function.
    parameters:
    name: ""
    clusterName: ""
    testHubble: false
    testLRP: false
    scaleup: ""
    nightly: false
  • if you have 3 successive template calls A calls B, B calls C, C calls D.. that all pass the same parameter.os, then the 1st template call value will bubble all the way to the 3rd.
    • pass-ing here means that when using - template: you explictily define what parameter.os is.
    • Note if the 2nd template call does not pass parameter.os then the 3rd template call will use the default of template file being called (if there even is one).

wondering if it would be too much hassle to explicitly declare the parameters and pass them down into the template?

I don't think it would be too much a hassle, but some parameters just don't need to be defined as the file name is descriptive enough to know that a given template should have a default value of cilium and linux for example.

If we were to merge job/step templates into single files then we 100% should be descriptive as to what parameters we want to use.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

John to update the PR so we don't have a scenario where A calls B and B uses a parameter that A never passed in and B never set as a default/defined

dns: true
portforward: true
service: true
processes: 1 # Serialize testing to get granular data, prevent race conditions, false-negatives
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean "prevent race conditions" ? Is that specific to the test cases running in a specific order ? It should be good to have non serialize right ? (to catch races)

Copy link
Contributor Author

@jpayne3506 jpayne3506 Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are cleaning up namespace for the test suite before cilium can assign a CEP. The race is due to multiple parallel tests flooding the agent. The tests themselves have 0 to do with actual race conditions within cilium or the tests. It is just reporting as an error within cilium logs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the processes from 1 back to 8 ( the default ) in favor of just not deleting the namespaces during testing.

dependsOn: ""
sub: ""
cni: cni
processes: 8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

magic number ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty much. Idea is to have each processes take care of potentially 2 tests as most test blocks only have at most 15-20 tests. I don't think we would benefit of having (test count/processes) < 2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be added to other pipelines too if not already present ? private/PR

@jpayne3506 jpayne3506 force-pushed the jpayne3506/nightly-temp branch from 5f6a06c to cb34c85 Compare January 12, 2026 19:20
@jpayne3506 jpayne3506 force-pushed the jpayne3506/nightly-temp branch from db80699 to 3ee06cc Compare January 12, 2026 20:20
@jpayne3506 jpayne3506 force-pushed the jpayne3506/nightly-temp branch from b857270 to caf2fe6 Compare January 12, 2026 22:59
@jpayne3506 jpayne3506 force-pushed the jpayne3506/nightly-temp branch from af04f49 to 5d372db Compare January 20, 2026 21:10
@jpayne3506 jpayne3506 force-pushed the jpayne3506/nightly-temp branch from 5d372db to 710645f Compare January 20, 2026 22:25
@jpayne3506 jpayne3506 force-pushed the jpayne3506/nightly-temp branch from cbb7bc6 to cfa58f0 Compare January 21, 2026 02:58
kubectl cluster-info
kubectl get po -owide -A
if [ "$CILIUM_VERSION_TAG" = "cilium-nightly-pipeline" ]; then
echo "Running nightly"
Copy link
Contributor

@QxBytes QxBytes Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we branch on the nightly variable, but we also have this branch here in the else (which seems to be for non-nightly pipeline runs). Are we concerned that this template could be called without nightly being properly specified? Is it complex to use the parameter passed in inside these scripts as a replacement for checking CILIUM_VERSION_TAG (also done below when we "Start Azilium E2E Tests on Overlay Cluster")

dependsOn: ""
sub: ""
cni: cni
processes: 8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be added to other pipelines too if not already present ? private/PR

done
sudo -E env "PATH=$PATH" make test-load \
SCALE_UP=32 OS_TYPE=linux CNI_TYPE=stateless VALIDATE_STATEFILE=true VALIDATE_V4OVERLAY=true \
SCALE_UP=32 OS_TYPE=windows CNI_TYPE=stateless VALIDATE_STATEFILE=true VALIDATE_V4OVERLAY=true \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only windows ? Was this incorrect at the first place ? cc: @behzad-mir

CNI_TYPE=stateless RESTART_CASE=true go test -timeout 30m -tags load -run ^TestValidateState$
displayName: "Validate Node Restart"
retryCountOnTaskFailure: 3
- template: ../../templates/validate-tests.yaml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it does the load test and then run the noderestart etc. It should not impact anything but want to make sure we are aware of this change.

echo "Check for additional errors."
# Check for any other errors aside from node restart
if grep -v "Non-zero (1)" output.log | grep -v "previous terminated container" | grep "check-log-errors" | grep "Error"; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previous terminated container what is the error line for this ?

@vipul-21
Copy link
Contributor

lgtm waiting for a green pipeline run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci Infra or tooling.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants