-
Notifications
You must be signed in to change notification settings - Fork 263
ci: General Cilium Nightly fixes ( cont. ) #4191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Will address |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR addresses lingering issues from #4175 by refactoring validation tests and adding better control over test parallelization. The main focus is on separating the systemd-networkd restart validation from general state validation and introducing a unified validation template for CI/CD pipelines.
Key changes:
- Split systemd-networkd restart test into a separate function (
ValidateNetworkdRestart) and test case (TestValidateNetworkdRestart) - Added a new unified
validate-tests.yamltemplate to consolidate duplicate validation logic across multiple pipeline files - Introduced a
processesparameter to control parallelization in k8s E2E tests, setting it to 1 (serial) for Cilium nightly to prevent race conditions and false negatives
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| test/validate/validate.go | Extracted networkd restart validation into a separate public method ValidateNetworkdRestart |
| test/integration/load/load_test.go | Added ValidateNetworkdRestart config field and TestValidateNetworkdRestart test function; updated calls to use ValidateStateFile instead of Validate |
| Makefile | Added test-validate meta-target and test-validate-networkd-restart target to support separate test execution |
| .pipelines/templates/validate-tests.yaml | New unified template for validation tests with conditional execution based on validation type |
| .pipelines/templates/cilium-tests.yaml | Refactored to use the new validate-tests.yaml template |
| .pipelines/templates/cilium-nightly-checks.yaml | Updated to use test-validate instead of test-validate-state |
| .pipelines/singletenancy//e2e.yaml | Multiple pipeline files refactored to use the new validate-tests.yaml template and add VALIDATE_NETWORKDRESTART flag |
| .pipelines/cni/k8s-e2e/k8s-e2e-job-template.yaml | Added configurable processes parameter with default of 8 |
| .pipelines/cni/cilium/nightly-release-test.yml | Set processes to 1 for serial test execution in nightly builds |
Comments suppressed due to low confidence (1)
.pipelines/singletenancy/cilium-overlay/cilium-overlay-e2e.steps.yaml:69
- The export VALIDATE_NETWORKDRESTART=true on line 62 is set but not explicitly passed to the make test-load command on lines 65-69. While the export may be inherited by sudo -E, it would be more consistent and explicit to pass VALIDATE_NETWORKDRESTART=true directly in the make command, as done in the cilium-overlay-e2e-step-template.yaml file (line 71) and other similar files.
export VALIDATE_NETWORKDRESTART=true
fi
kubectl get po -owide -A
sudo -E env "PATH=$PATH" make test-load \
SCALE_UP=32 OS_TYPE=linux VALIDATE_STATEFILE=true \
INSTALL_CNS=true INSTALL_OVERLAY=true CLEANUP=true \
AZURE_IPAM_VERSION=$(AZURE_IPAM_VERSION) CNS_VERSION=$(CNS_VERSION) \
IPAM_IMAGE_NAME_OVERRIDE=$(IPAM_IMAGE_NAME_OVERRIDE) CNS_IMAGE_NAME_OVERRIDE=$(CNS_IMAGE_NAME_OVERRIDE)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| done | ||
| sudo -E env "PATH=$PATH" make test-load \ | ||
| SCALE_UP=32 OS_TYPE=linux CNI_TYPE=stateless VALIDATE_STATEFILE=true VALIDATE_V4OVERLAY=true \ | ||
| SCALE_UP=32 OS_TYPE=windows CNI_TYPE=stateless VALIDATE_STATEFILE=true VALIDATE_V4OVERLAY=true \ |
Copilot
AI
Jan 8, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The OS_TYPE value appears to be incorrect. This line is in the azure-cni-overlay-stateless e2e steps for Linux, but the OS_TYPE has been changed from "linux" to "windows". This seems like an unintended change that would cause the test to run with the wrong OS type.
| SCALE_UP=32 OS_TYPE=windows CNI_TYPE=stateless VALIDATE_STATEFILE=true VALIDATE_V4OVERLAY=true \ | |
| SCALE_UP=32 OS_TYPE=linux CNI_TYPE=stateless VALIDATE_STATEFILE=true VALIDATE_V4OVERLAY=true \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stateless is restricted to windows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only windows ? Was this incorrect at the first place ? cc: @behzad-mir
Co-authored-by: Copilot <[email protected]> Signed-off-by: John Payne <[email protected]>
|
is there an issue you can link for systemd-networkd restart fix? |
| parameters: | ||
| validate: "nodeRestart" | ||
| cni: "cilium" | ||
| os: ${{ parameters.os }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is os a parameter passed in here? if this is an issue there might be other places without the os like .pipelines/singletenancy/cilium-overlay/cilium-overlay-e2e-step-template.yaml etc. unless I'm missing where it's defined (is it like a pipeline var?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is passed in from any parent template the parameters.os will be populated. Varying on the scenario it is passed in from pipeline.yaml or the job template.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so if I had a parameter for a stage template, and then we had a template for a job, and a template for a task-- in the task I could use the stage template's parameters? wondering if it would be too much hassle to explicitly declare the parameters and pass them down into the template?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure this covers all variations
so if I had a parameter for a stage template, and then we had a template for a job, and a template for a task-- in the task I could use the stage template's parameters?
- you cannot explicitly reference a given stage, job, task parameter.
- a template call will take any parameter. Defined or not in the respective template that is being called
- If a given/expected parameter is not passed into the template then it will use the default parameters we define at the top of our template files. We leverage the defaults as empty strings as helpers to people using them to know that this template requires A,B,C,X,Y,Z parameters to function.
Lines 1 to 7 in ab1c159
parameters: name: "" clusterName: "" testHubble: false testLRP: false scaleup: "" nightly: false - if you have 3 successive template calls A calls B, B calls C, C calls D.. that all pass the same
parameter.os, then the 1st template call value will bubble all the way to the 3rd.- pass-ing here means that when using
- template:you explictily define whatparameter.osis. - Note if the 2nd template call does not pass
parameter.osthen the 3rd template call will use the default of template file being called (if there even is one).
- pass-ing here means that when using
wondering if it would be too much hassle to explicitly declare the parameters and pass them down into the template?
I don't think it would be too much a hassle, but some parameters just don't need to be defined as the file name is descriptive enough to know that a given template should have a default value of cilium and linux for example.
If we were to merge job/step templates into single files then we 100% should be descriptive as to what parameters we want to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
John to update the PR so we don't have a scenario where A calls B and B uses a parameter that A never passed in and B never set as a default/defined
.pipelines/singletenancy/cilium-dualstack-overlay/cilium-dualstackoverlay-e2e.steps.yaml
Show resolved
Hide resolved
| dns: true | ||
| portforward: true | ||
| service: true | ||
| processes: 1 # Serialize testing to get granular data, prevent race conditions, false-negatives |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean "prevent race conditions" ? Is that specific to the test cases running in a specific order ? It should be good to have non serialize right ? (to catch races)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests are cleaning up namespace for the test suite before cilium can assign a CEP. The race is due to multiple parallel tests flooding the agent. The tests themselves have 0 to do with actual race conditions within cilium or the tests. It is just reporting as an error within cilium logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the processes from 1 back to 8 ( the default ) in favor of just not deleting the namespaces during testing.
| dependsOn: "" | ||
| sub: "" | ||
| cni: cni | ||
| processes: 8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
magic number ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pretty much. Idea is to have each processes take care of potentially 2 tests as most test blocks only have at most 15-20 tests. I don't think we would benefit of having (test count/processes) < 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be added to other pipelines too if not already present ? private/PR
5f6a06c to
cb34c85
Compare
db80699 to
3ee06cc
Compare
b857270 to
caf2fe6
Compare
af04f49 to
5d372db
Compare
5d372db to
710645f
Compare
cbb7bc6 to
cfa58f0
Compare
| kubectl cluster-info | ||
| kubectl get po -owide -A | ||
| if [ "$CILIUM_VERSION_TAG" = "cilium-nightly-pipeline" ]; then | ||
| echo "Running nightly" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we branch on the nightly variable, but we also have this branch here in the else (which seems to be for non-nightly pipeline runs). Are we concerned that this template could be called without nightly being properly specified? Is it complex to use the parameter passed in inside these scripts as a replacement for checking CILIUM_VERSION_TAG (also done below when we "Start Azilium E2E Tests on Overlay Cluster")
| dependsOn: "" | ||
| sub: "" | ||
| cni: cni | ||
| processes: 8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be added to other pipelines too if not already present ? private/PR
| done | ||
| sudo -E env "PATH=$PATH" make test-load \ | ||
| SCALE_UP=32 OS_TYPE=linux CNI_TYPE=stateless VALIDATE_STATEFILE=true VALIDATE_V4OVERLAY=true \ | ||
| SCALE_UP=32 OS_TYPE=windows CNI_TYPE=stateless VALIDATE_STATEFILE=true VALIDATE_V4OVERLAY=true \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only windows ? Was this incorrect at the first place ? cc: @behzad-mir
| CNI_TYPE=stateless RESTART_CASE=true go test -timeout 30m -tags load -run ^TestValidateState$ | ||
| displayName: "Validate Node Restart" | ||
| retryCountOnTaskFailure: 3 | ||
| - template: ../../templates/validate-tests.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it does the load test and then run the noderestart etc. It should not impact anything but want to make sure we are aware of this change.
| echo "Check for additional errors." | ||
| # Check for any other errors aside from node restart | ||
| if grep -v "Non-zero (1)" output.log | grep -v "previous terminated container" | grep "check-log-errors" | grep "Error"; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previous terminated container what is the error line for this ?
|
lgtm waiting for a green pipeline run |
Reason for Change:
Resolves lingering issues from #4175.
Adds validate template in an effort to unify CI/CD.
Issues:
Device (eth0) Change/drop | Requires upstream changes to fix.TestStateValidation. Added make recipes to support and retain original intentError logged during k8sE2E test when running service conformance test. Non impacting, but caught in cilium agent logs which fails the pipeline.Issue Fixed:
Requirements:
Notes: