Description
Here are a few enhancements for bundle quarantine:
- Dedicated REST endpoint for quarantine: `POST @ /Bundles/actions/quarantine/UUID`
  - This will guarantee certain fields are updated as required: reason, etc.
  - Disable `PATCH @ /Bundles/UUID` for `status='quarantine'`
- Dedicated REST endpoint for de-quarantine: `DELETE @ /Bundles/actions/quarantine/UUID`
  - This will make manual intervention easier, as the status fields will be auto-corrected.
  - This would make Better Quarantine: Create Reviver Component #318 streamlined.
  - Alternatively, `POST @ /Bundles/actions/dequarantine/UUID`
- More: TBD
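As a rough illustration of what the dedicated actions could guarantee, here is a minimal Python sketch of the handler logic, independent of any web framework. Apart from `status` and `reason`, which come from the list above, every name is an assumption made for illustration: the `original_status` and `quarantined_by` fields, the function names, and the choice to reject the `PATCH` with a `ValueError`. It is not the existing LTA implementation.

```python
from typing import Any, Dict

Bundle = Dict[str, Any]


def quarantine_bundle(bundle: Bundle, reason: str, component: str) -> Bundle:
    """Hypothetical body of POST /Bundles/actions/quarantine/UUID."""
    if not reason:
        # The dedicated action can insist on the fields a raw PATCH might omit.
        raise ValueError("a quarantine reason is required")
    updated = dict(bundle)
    # Assumed field: remember where the bundle was so it can be restored later.
    updated["original_status"] = bundle["status"]
    updated["status"] = "quarantine"
    updated["reason"] = reason
    updated["quarantined_by"] = component  # assumed field name
    return updated


def dequarantine_bundle(bundle: Bundle) -> Bundle:
    """Hypothetical body of DELETE /Bundles/actions/quarantine/UUID."""
    if bundle.get("status") != "quarantine":
        raise ValueError("bundle is not in quarantine")
    updated = dict(bundle)
    # Auto-correct the status fields instead of relying on manual edits.
    updated["status"] = updated.pop("original_status")
    updated["reason"] = ""
    updated.pop("quarantined_by", None)
    return updated


def patch_bundle(bundle: Bundle, changes: Bundle) -> Bundle:
    """Hypothetical body of PATCH /Bundles/UUID with quarantine disabled."""
    if changes.get("status") == "quarantine":
        raise ValueError("use the dedicated quarantine action instead")
    return {**bundle, **changes}
```

Routing every transition through these two functions is what would let the server "keep score" later, since there is exactly one place where a bundle enters or leaves quarantine.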
From @blinkdog:
I think #201 is still relevant because quarantine isn't quite fixed yet.
I think the three major steps to 'fixing' quarantine are probably:
1. Adding counting: both a quarantine_total_count (count of all quarantines since the Bundle record was created) and a quarantine_count or quarantine_streak, meaning the most recent count of quarantines performed by a single component type (e.g., GlobusReplicator has sent this bundle to quarantine 5 times).
2. Adding a "Try everything again" button: all the bundles come out of quarantine and go back to the pipeline for another shot. At first, this is a manual tool for the operators (maybe ./ltacmd bundle repair --uuid $BUNDLE_UUID and ./ltacmd bundle repair --all). Later, it becomes a tool for automation to invoke regularly (keep retrying).
3. Having that automation check the streak count and change bundles over a certain limit to some other status (maybe operator?), so that they don't go back in the retry bin and can be used to raise something on a dashboard, send an e-mail, send a Slack alert, etc.
Now underlying all this are those new REST actions; they can 'keep score' and move things in and out of quarantine in a predictable way.
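The three steps above can be sketched in Python as follows. This is only one possible reading: the field names `quarantine_total_count` and `quarantine_streak` and the `operator` status come from the comment, while `quarantine_component`, `original_status`, `STREAK_LIMIT`, the function names, and the rule that the streak resets when a different component quarantines the bundle are all assumptions.

```python
from typing import Any, Dict, List

Bundle = Dict[str, Any]

STREAK_LIMIT = 5  # hypothetical threshold; the comment does not fix a value


def record_quarantine(bundle: Bundle, component: str) -> Bundle:
    """Step 1: keep score each time a component quarantines a bundle."""
    updated = dict(bundle)
    updated["quarantine_total_count"] = bundle.get("quarantine_total_count", 0) + 1
    if bundle.get("quarantine_component") == component:
        # Same component as last time: the streak continues.
        updated["quarantine_streak"] = bundle.get("quarantine_streak", 0) + 1
    else:
        # Assumption: a different component starts a new streak.
        updated["quarantine_streak"] = 1
    updated["quarantine_component"] = component
    # Do not overwrite the saved status if the bundle is already quarantined.
    if bundle["status"] != "quarantine":
        updated["original_status"] = bundle["status"]
    updated["status"] = "quarantine"
    return updated


def retry_or_escalate(bundles: List[Bundle], limit: int = STREAK_LIMIT) -> List[Bundle]:
    """Steps 2 and 3: release quarantined bundles, or hand them to an operator."""
    result = []
    for bundle in bundles:
        if bundle.get("status") != "quarantine":
            result.append(bundle)  # not quarantined: leave untouched
            continue
        updated = dict(bundle)
        if bundle.get("quarantine_streak", 0) > limit:
            # Over the limit: keep it out of the retry bin and flag it.
            updated["status"] = "operator"
        else:
            # Another shot: send it back to the status it came from.
            updated["status"] = bundle["original_status"]
        result.append(updated)
    return result
```

A periodic job could call `retry_or_escalate` on all quarantined bundles; anything that ends up in `operator` is then the natural input for a dashboard or alert.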
From @ric-evans:
I did something similar in ewms. I made a field that holds the history of status changes:
https://github.com/Observation-Management-Service/ewms-workflow-management-service/blob/667a4d713d777915ff616e3fe333af328968537f/wms/taskforce_launch_control.py#L42-L49
I used "phase" instead of "status", but this made retries and the like queryable in Mongo using filters and count.
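To illustrate the history-field idea for bundles, here is a minimal Python sketch. It is not the EWMS code linked above; the `status_history` field, its entry keys (`status`, `source`, `timestamp`), and the function names are assumptions. The counting is done in plain Python, and the final constant shows a MongoDB filter document of the kind that could select matching bundles (it is only defined here, not run against a database).

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

Bundle = Dict[str, Any]


def set_status(bundle: Bundle, new_status: str, source: str) -> Bundle:
    """Change the status and append an entry to the (assumed) history field."""
    updated = dict(bundle)
    history = list(bundle.get("status_history", []))
    history.append(
        {
            "status": new_status,
            "source": source,  # which component or operator made the change
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
    )
    updated["status"] = new_status
    updated["status_history"] = history
    return updated


def count_status(bundle: Bundle, status: str, source: Optional[str] = None) -> int:
    """Count how often a bundle entered `status`, optionally by one source."""
    return sum(
        1
        for entry in bundle.get("status_history", [])
        if entry["status"] == status and (source is None or entry["source"] == source)
    )


# A MongoDB filter matching bundles that GlobusReplicator quarantined at least
# once, using the standard $elemMatch operator on the assumed history array.
QUARANTINED_BY_GLOBUS = {
    "status_history": {
        "$elemMatch": {"status": "quarantine", "source": "GlobusReplicator"}
    }
}
```

With a history like this, the total and per-component quarantine counts can be derived from the entries rather than stored as separate counters, at the cost of a growing array per bundle.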
Note: This could also apply to transfer-request quarantine.