THREESCALE-12161 - Fix DB Migrations not executed on upgrade to 2.16 #1138
urbanikb wants to merge 5 commits into 3scale:master
Conversation
✅ Snyk checks have passed. No issues have been found so far.
/retest

/ok-to-test
Thanks, the code looks much better. I have performed the following tests:
During the upgrade, the system-app-pre hooks run AFTER system-app, which is the behavior we've seen in THREESCALE-12013. I'll run the test again tomorrow. In the meantime, could you build the bundle and check the upgrade process?
This commit fixes an issue with the system deployment reconciler that occurs on upgrade from 2.15 to 2.16. The "state" of the upgrade process is persisted in an annotation on the job that records the deployment version. The basic idea is that the job annotation should always match the "version" of the system-app deployment. 2.15 used the generation as the version; 2.16 uses the revision. The reconciliation logic only activates the upgrade logic if the revision on the existing job exactly matches the deployment revision (otherwise it assumes the upgrade is in-flight). This condition has been causing an issue on upgrade to 2.16 because, as part of the upgrade, customers need to scale the deployment down, which bumps the generation - causing the upgrade logic to incorrectly assume that a migration is in flight and fail to delete the old job. Reconciliation then simply updates the metadata, including the annotation, without recreating the job, and the process silently fails to execute the migration. This PR also covers all the edge cases I could think of... even though we only hit the one described above.
The function needed to deal with cases where the job didn't exist or the annotation didn't exist. It could return an error, but that would just complicate the calling code. The new function needAppHookJobRerun returns false if the job doesn't exist or the revision matches the deployment revision. It returns true if the job revision is unknown or the deployment revision has changed.
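For context, a minimal sketch of that decision, assuming the revisions have already been extracted (the real function in the PR may take the job and deployment objects instead), with -1 standing for "job exists but the revision annotation is missing or unparseable":

// Sketch only - the signature is illustrative, not copied from this PR.
func needAppHookJobRerun(jobExists bool, trackedRevision, deploymentRevision int64) bool {
	if !jobExists {
		// No job to delete; the regular reconcile will create one.
		return false
	}
	if trackedRevision == -1 {
		// Revision unknown (e.g. a job left over from 2.15, which tracked the
		// generation under a different annotation): rerun to be safe.
		return true
	}
	// Rerun only when the deployment revision has moved past the revision
	// the existing job was created for.
	return trackedRevision != deploymentRevision
}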
Force-pushed from b85cb4e to d5facb4
This should not be possible on the upgrade path - the deployment should update only after the job with revision+1 has been scheduled. Having jobs run after is possible on the "manual trigger" path - on revision change - although it shouldn't take a minute to trigger, so even for that path it seems a little off. If you still have this instance, could you check the logs of the "pre" hook to see whether the migration was executed in that run or it was already a no-op?
I built the bundle but couldn't reproduce the state where the migration isn't run from the "image changed" path - the bundle I tested with is this one (contains the CSV from 2.15 and "alpha" with quay.io images):
No tests have been changed - this is mainly to change the flow of the Reconcile function to avoid calling ReconcileJob when the revision has not changed.
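Roughly, the new flow looks like the sketch below (simplified: error handling around the job lookup and the image check are elided, and trackedPreHookRevision is an illustrative helper name rather than the exact one in the PR):

targetRevision, err := component.ParseSystemAppHookRevision(desired)
if err != nil {
	return reconcile.Result{}, err
}
trackedRevision, jobExists := r.trackedPreHookRevision(existing) // illustrative helper
if jobExists && trackedRevision == targetRevision {
	// Already at the correct revision: leave the job alone, so a
	// metadata-only update can never mask a skipped migration.
	return reconcile.Result{}, nil
}
// Revision changed (or never recorded): delete the stale job so that
// ReconcileJob creates a fresh one that actually runs the migration.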
Force-pushed from 00a5d4e to 9f53af0
Tested an upgrade of the current PR from the following index: Tried scaling down+up via the CR and it came up correctly. There were some logs in the 2.15 operator regarding the pod being deleted ("won't delete job system-app-post because it's still running"), which means the job was being re-created twice. There were 2 caching-related Reconcile errors in the new version that don't seem to be anything to worry about - the corresponding logic was re-queued and the jobs were re-created within a minute.
    return reconcile.Result{}, err
}
if trackedRevision == -1 {
    r.Logger().Info("Deleted PreHook job that was missing revision annotation")
}

// Already at correct revision
if trackedRevision == targetRevision {
Can you think of a scenario where we have the same job revision but different image? Perhaps we should check the job images here?
I've added a test/code for this, I think it's a good catch - even if there isn't a "normal" way to get into that state, it fits the function - after reconcile I'd expect the image to match.
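For reference, the extra check could look roughly like this (a sketch, not the exact code added in the PR; batchv1 is k8s.io/api/batch/v1 and the helper name is illustrative):

// Rerun the hook job even at the same revision when the images differ.
func appHookJobImagesMatch(existing, desired *batchv1.Job) bool {
	existingContainers := existing.Spec.Template.Spec.Containers
	desiredContainers := desired.Spec.Template.Spec.Containers
	if len(existingContainers) != len(desiredContainers) {
		return false
	}
	for i := range desiredContainers {
		if existingContainers[i].Image != desiredContainers[i].Image {
			return false
		}
	}
	return true
}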
targetRevision, err := component.ParseSystemAppHookRevision(desired)
if err != nil {
    return reconcile.Result{}, err
}
Small nit: We create the desired job before calling this function with the desired revision, so perhaps we should pass that revision into this function call instead of calling ParseSystemAppHookRevision?
Sounds good to me - I've moved the targetRevision parameter from the component "factory" function here. I could have it in two places, but then it would be a little less clear. Does the new version look OK?
Index with 2.15 and the bundle for the PR above:

/retest
@urbanikb: The following test failed:
Co-Authored-By: Claude Sonnet 4.5 noreply@anthropic.com
What
Fix:
This PR fixes an issue with the system deployment reconciler that occurs on upgrade from 2.15 to 2.16.
The "state" of the upgrade process is persisted in an annotation on the job that records the deployment version.
The basic idea is that the job annotation should always match the "version" of the system-app deployment.
2.15 used the generation as the version; 2.16 uses the revision.
The reconciliation logic only activates the upgrade logic if the revision on the existing job exactly matches the deployment revision (otherwise it assumes the upgrade is in-flight).
This condition has been causing an issue on upgrade to 2.16 because, as part of the upgrade, customers need to scale the deployment down, which bumps the generation - causing the upgrade logic to incorrectly assume that a migration is in flight and fail to delete the old job.
Reconciliation then simply updates the metadata, including the annotation, without recreating the job, and the process silently fails to execute the migration.
This PR also includes some more edge cases, a renamed annotation, and some very minor cleanup of the test suite.
Steps to reproduce
The deployment generation and job annotation should match. However, the deployment generation should be ahead of the revision - this is needed to reproduce the bug. In my case I have generation 6 and revision 2 (a client-go sketch for reading both values follows these steps):
Upgrade the operator to 2.16... note that the job image hasn't been updated but the deployment has.
The 2.16 operator has also updated the annotation to the current revision of the deployment - that is part of the bug.
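If it helps when reproducing, here is a minimal client-go sketch (the namespace "3scale" and the kubeconfig path are assumptions) for reading the two values in question - metadata.generation versus the deployment.kubernetes.io/revision annotation maintained by the Deployment controller:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	dep, err := client.AppsV1().Deployments("3scale").Get(context.TODO(), "system-app", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// Generation bumps on every spec change (including scaling), while the
	// revision annotation only bumps on pod-template changes - which is why
	// scaling down during the upgrade pushes generation ahead of revision.
	fmt.Println("generation:", dep.Generation)
	fmt.Println("revision:  ", dep.Annotations["deployment.kubernetes.io/revision"])
}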
Verification
The regression test case is implemented with the test case name
"EDGE CASE: Both image AND revision changed - old job deleted and new one created". Due to this PR also renaming the annotation, the next upgrade will fall into the test case
"UPGRADE: Job exists without annotation during upgrade - deleted and recreated".