Post Mortems
Mark Bussey edited this page Jun 22, 2018 · 3 revisions
We practice no-fault post mortems. The goal is not to assign blame, but to understand what happened and what practices we might put in place to make our lives easier in the future.
- Monday morning, Bess was looking for tickets to act on proactively and found https://github.com/curationexperts/laevigata/issues/1028
- Bess found Unix accounts for Alicia on qa, stage, and production and removed them
- There was also an application SuperUser account for Alicia on production (but not on qa or stage)
- Bess removed the application user account, then checked out the application
- Around 3pm, Collin reported that two students were getting error messages on submission
- Because the students were getting "this page does not exist" errors, they resubmitted multiple times
- Even though users were seeing errors, the ETDs were being accepted
- The primary issue is that the ETDs were not being added to the expected workflow
- Bess placed the system in read-only mode, notified Emory, and restored the backup from that morning
- We wanted to prioritize getting production back to working order rather than diagnostics
- Tested the same change on the backup (deleting the SuperAdmin account)
- Time to respond & Time to resolve +++
- Restore from snapshot nightly backup
- Read-only mode
- Pairing for problem resolution (esp. someone to stay calm and bounce ideas off)
- We need a bulletproof self-service means to remove inactive Approvers and SuperAdmins
- Need to see if we can reproduce the problem on:
  - 2.x Laevigata - see https://github.com/curationexperts/laevigata/issues/1102
  - 2.1.x Hyrax
- Track timings of events:
  - When the incident was reported - 3:03pm
  - When the incident was resolved - 4:35pm
  - When the event that triggered the incident occurred, if we can determine it - between 8am and 10am
- Follow up on whether there were corresponding Honeybadger issues
- Follow up on whether there are ways we can detect "students having trouble"
- Ensure we "script" non-UI activities against production and test them in other environments
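
As a rough illustration of the "track timings of events" item, the sketch below works out the intervals implied by the times recorded above. It is only a minimal example: the calendar date is a placeholder (the page records only times of day), and the helper names `at` and `minutes_between` are hypothetical.

```python
from datetime import datetime

# Placeholder date (an assumption): the notes above give only times of day.
INCIDENT_DATE = "2018-01-01"


def at(clock: str) -> datetime:
    """Parse a time of day such as '3:03pm' on the placeholder incident date."""
    return datetime.strptime(f"{INCIDENT_DATE} {clock.upper()}", "%Y-%m-%d %I:%M%p")


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


# Times taken from the list above.
reported = at("3:03pm")
resolved = at("4:35pm")
trigger_earliest = at("8:00am")
trigger_latest = at("10:00am")

# Time from report to resolution.
time_to_resolve = minutes_between(reported, resolved)

# The triggering change happened somewhere in a two-hour window, so the time
# it went unnoticed can only be given as a range.
undetected_min = minutes_between(trigger_latest, reported)
undetected_max = minutes_between(trigger_earliest, reported)

print(f"Time to resolve: {time_to_resolve} minutes")
print(f"Undetected for between {undetected_min} and {undetected_max} minutes")
```

Recording these timestamps during an incident, rather than reconstructing them afterwards, would make the same calculation possible for time to respond as well.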