Incident, Mitigate, Learn
Published on December 28, 2023, filed under Development and Management.
During my time at LivePerson, specifically while managing the Data Protection and Privacy team (I later took on Developer Experience), I served as one of a couple dozen call leaders: managers and directors in our organization who each worked a few 12-hour shifts per month, taking charge of all production incidents.
That time, short as it was, was one of the most insightful for me when it comes to incident management.
Now, the details of incident management are not the point here; as so often, I like to keep it high-level. For details, I recommend Google’s Site Reliability Engineering. What I wish to emphasize here is mitigation and learning:
1. Mitigate
The first rule of incident management is to—mitigate.
Nothing terribly new here: Nothing matters more than resolving an incident as fast as possible.
Although people involved in an incident occasionally veer off (we did so on my team recently, when we lost time looking into a possible hotfix instead of rolling back), mitigation is the number one priority.
2. Learn
The second most important rule is to learn, in order to prevent similar incidents from happening again.
Now, this one is also a well-known and well-established priority, as our industry practices around RCAs (root cause analyses), PMAs (postmortem analyses), and COEs (corrections of errors) attest.
However, learning may be loathed and skimped on; this is where we seem to drop the ball most easily. While there is such a thing as “too much RCA/PMA/COE” (which appears to lead to the loathing and skimping), that’s less frequently an issue than too little RCA/PMA/COE, when we don’t do enough to learn and prevent issues from recurring.
Telltales of this? No follow-up actions, or regularly identifying the same follow-up actions.
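Both telltales can be checked mechanically if follow-up actions are recorded somewhere. The following is a minimal sketch under assumptions of my own, not a description of any particular tool: the `Postmortem` record, the incident IDs, and the sample actions are all hypothetical, and the text normalization is deliberately crude.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Hypothetical record of one incident review (RCA/PMA/COE)."""
    incident_id: str
    follow_ups: list[str] = field(default_factory=list)


def normalize(action: str) -> str:
    """Lowercase and collapse whitespace so trivially different wordings match."""
    return " ".join(action.lower().split())


def without_follow_ups(postmortems: list[Postmortem]) -> list[str]:
    """Return the incident IDs whose review produced no follow-up actions."""
    return [p.incident_id for p in postmortems if not p.follow_ups]


def recurring_follow_ups(
    postmortems: list[Postmortem], threshold: int = 2
) -> dict[str, int]:
    """Return actions that appear in at least `threshold` different reviews."""
    counts: Counter[str] = Counter()
    for p in postmortems:
        # Count each action once per review, even if it is listed twice in it.
        counts.update({normalize(a) for a in p.follow_ups})
    return {action: n for action, n in counts.items() if n >= threshold}


# Hypothetical sample data, for illustration only.
reviews = [
    Postmortem("INC-1", ["Add rollback runbook", "Alert on error rate"]),
    Postmortem("INC-2", []),
    Postmortem("INC-3", ["add rollback  runbook"]),
]

print(without_follow_ups(reviews))    # ['INC-2']
print(recurring_follow_ups(reviews))  # {'add rollback runbook': 2}
```

A review with no follow-ups, or an action that keeps coming back across reviews, is only a prompt to look closer; real follow-up actions are rarely worded identically, so exact matching like this would miss many repeats.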
One solution? Aim for as little PMA as possible but as much as necessary, then approach that target from the side of “slightly more PMA,” so that you err on the side of learning too much rather than too little. Communicate both the aim and the approach.
❧ Is there more to incidents? Yes, absolutely. Are those mitigation and learning challenges all addressed now? No, certainly not. But in the end, when in an incident—mitigate, and learn; rinse, and repeat. Don’t skimp on this.
On this light note, I’m wrapping up 2023! Have a great start to the new year, everyone.
About Me
I’m Jens (long: Jens Oliver Meiert), and I’m a frontend engineering leader and tech author/publisher. I’ve worked as a technical lead for companies like Google and as an engineering manager for companies like Miro, I’m a contributor to several web standards, and I write and review books for O’Reilly and Frontend Dogma.
I love trying things, not only in web development (and engineering management), but also in other areas like philosophy. Here on meiert.com I share some of my experiences and views. (Please be critical, interpret charitably, and give feedback.)