Incident, Mitigate, Learn
During my time at LivePerson, specifically when managing the Data Protection and Privacy team (I later took on Developer Experience), I served as one of a couple dozen call leaders: managers and directors in our organization who each did a few 12-hour shifts per month to take charge of all production incidents.
That time, short as it was, was one of the most insightful for me when it comes to incident management.
Now, the details of incident management are not the point here; as so often, I like to keep it high-level. For details, I recommend Google’s Site Reliability Engineering. The point I wish to emphasize is mitigation and learning:
The first rule of incident management is to—mitigate.
Nothing terribly new here: Nothing else matters more than resolving an incident, as fast as possible.
Although people involved in an incident occasionally veer off (we just did so on my team lately, when we lost time looking into a possible hotfix as opposed to a rollback), mitigation is the number one priority.
The second most important rule is to learn, in order to prevent similar incidents from happening again.
However, learning may be loathed and skimped on; this is where we seem to drop the ball most easily. While there is such a thing as “too much RCA/PMA/COE” (root cause analysis, post-mortem analysis, correction of error), which appears to lead to the loathing and skimping, that’s less frequently an issue than too little RCA/PMA/COE, when we don’t do enough to learn and prevent issues from recurring.
Telltales of this? No follow-up actions, or regularly identifying the same follow-up actions.
One solution? Aim for as little PMA as possible, but as much PMA as necessary, then approach that balance from the side of “slightly more PMA” to err on the side of learning too much rather than too little. Communicate both aim and approach.
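For teams that keep their post-mortems in some structured form, the two telltales above (no follow-up actions, or the same follow-up actions again and again) can be checked mechanically. The following is only a minimal sketch under assumptions: the `PostMortem` record, the `telltales` function, and the crude text normalization are hypothetical names and choices made for illustration, not part of any particular incident management tool.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PostMortem:
    """Hypothetical record of one post-mortem analysis (PMA)."""
    incident_id: str
    follow_ups: list[str] = field(default_factory=list)


def normalize(action: str) -> str:
    """Crude normalization so trivially different wordings compare equal."""
    return " ".join(action.lower().split())


def telltales(post_mortems: list[PostMortem], min_repeats: int = 2):
    """Return incidents without follow-ups and follow-ups that keep recurring.

    A follow-up counts as recurring when it shows up in at least
    `min_repeats` different post-mortems (an assumed threshold).
    """
    without_follow_ups = [pm.incident_id for pm in post_mortems if not pm.follow_ups]

    counts = Counter()
    for pm in post_mortems:
        # Count each action once per incident, even if it is listed twice.
        counts.update({normalize(action) for action in pm.follow_ups})

    recurring = sorted(action for action, n in counts.items() if n >= min_repeats)
    return without_follow_ups, recurring


if __name__ == "__main__":
    history = [
        PostMortem("INC-1", ["Add rollback runbook", "Alert on error rate"]),
        PostMortem("INC-2", []),
        PostMortem("INC-3", ["add rollback  runbook"]),
    ]
    missing, repeated = telltales(history)
    print("No follow-up actions:", missing)
    print("Recurring follow-up actions:", repeated)
```

Exact-match comparison after normalization will miss follow-ups that are phrased differently, so a check like this can only prompt a conversation; it does not replace reading the post-mortems.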
❧ Is there more to incidents? Yes, absolutely. Are those mitigation and learning challenges all addressed now? No, certainly not. But in the end, when in an incident—mitigate, and learn; rinse, and repeat. Don’t skimp on this.
On this light note, I’m wrapping up 2023! Have a great start to the new year, everyone.
I’m Jens, and I’m an engineering lead and author. I’ve worked as a technical lead for companies like Google, I’m close to W3C and WHATWG, and I write and review books for O’Reilly and Frontend Dogma. I love trying things, not only in web development, but also in other areas like philosophy. Here on meiert.com I share some of my views and experiences.