Incident, Mitigate, Learn
Published on Dec 28, 2023, filed under development, management.
During my time at LivePerson, specifically when managing the Data Protection and Privacy team (I later took on Developer Experience), I served as one of a couple dozen call leaders: managers and directors in our organization who did a few 12-hour shifts per month to take charge of all production incidents.
That time, short as it was, was one of the most insightful ones for me when it comes to incident management.
Now, the details of incident management are not the point here; as so often, I like to keep it high-level. For some details, however, I recommend Google’s Site Reliability Engineering; when it comes to the point, I wish to emphasize mitigation and learning:
1. Mitigate
The first rule of incident management is to mitigate.
Nothing terribly new here: Nothing else matters more than resolving an incident, as fast as possible.
Although people involved in an incident occasionally veer off (we just did so on my team lately, when we lost time looking into a possible hotfix as opposed to a rollback), mitigation is the number one priority.
2. Learn
The second most important rule is to learn, to prevent similar incidents from happening again.
Now, this one is also a well-known and well-established priority, as our industry practices around root cause analyses (RCAs), postmortem analyses (PMAs), and corrections of errors (COEs) attest.
However, learning may be loathed and skimped on; that is, this is where we seem to drop the ball most easily. While there is such a thing as “too much RCA/PMA/COE” (which appears to lead to the loathing and skimping), that’s less frequently an issue than too little RCA/PMA/COE, when we don’t do enough to learn and prevent issues from recurring.
Telltales of this? No follow-up actions, or regularly identifying the same follow-up actions.
One solution? Aim for as little PMA as possible, but as much PMA as necessary, then approach that target from the side of “slightly more PMA,” to err on the side of learning too much rather than too little. Communicate both aim and approach.
❧ Is there more to incidents? Yes, absolutely. Are those mitigation and learning challenges all addressed now? No, certainly not. But in the end, when in an incident: mitigate, and learn; rinse, and repeat. Don’t skimp on this.
On this light note, I’m wrapping up 2023! Have a great start to the new year, everyone.
About Me
I’m Jens (long: Jens Oliver Meiert), and I’m a web developer, manager, and author. I’ve been working as a technical lead and engineering manager for companies you’ve never heard of and companies you use every day, I’m an occasional contributor to web standards (like HTML, CSS, WCAG), and I write and review books for O’Reilly and Frontend Dogma.
I love trying things, not only in web development and engineering management, but also in other areas like philosophy. Here on meiert.com I share some of my experiences and views. (I value you being critical, interpreting charitably, and giving feedback.)