Incident, Mitigate, Learn

Published on December 28, 2023.

During my time at LivePerson, specifically when managing the Data Protection and Privacy team (I later took on Developer Experience), I served as one of a couple dozen call leaders: managers and directors in our organization who each did a few 12-hour shifts per month to take charge of all production incidents.

That time, short as it was, was among the most insightful for me when it comes to incident management.

Now, the details of incident management are not the point here; as so often, I like to keep it high-level. For details, I recommend Google’s Site Reliability Engineering. The point I wish to emphasize is mitigation and learning:

1. Mitigate

The first rule of incident management is to—mitigate.

Nothing terribly new here: Nothing else matters more than resolving an incident, as fast as possible.

Although people involved in an incident occasionally veer off (we did so on my team recently, when we lost time looking into a possible hotfix as opposed to a rollback), mitigation is the number one priority.

2. Learn

The second-most important rule is to learn—to prevent similar incidents from happening again.

Now, this one is also a well-known and well-established priority, as our industry practices around RCAs (root cause analyses), PMAs (post-mortem analyses), and COEs (corrections of errors) attest.

However, learning may be loathed and skimped on; this is where we seem to drop the ball most easily. While there is such a thing as “too much RCA/PMA/COE” (which appears to lead to the loathing and skimping), that’s less frequently an issue than too little RCA/PMA/COE, when we don’t do enough to learn and prevent issues from recurring.

Telltales of this? No follow-up actions, or regularly identifying the same follow-up actions.
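To make these two telltales a little more tangible, here is a minimal Python sketch of how one might scan a list of past post-mortems for them. Everything in it is an assumption for illustration: the `Postmortem` record, its field names, the `learning_telltales` helper, and the sample data are hypothetical and not taken from any particular tool or process.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    # Hypothetical record; the field names are assumptions for illustration.
    incident_id: str
    follow_ups: list[str] = field(default_factory=list)


def normalize(action: str) -> str:
    # Crude normalization so that trivially different wordings compare equal.
    return " ".join(action.lower().split())


def learning_telltales(postmortems, min_repeats=2):
    """Return (incidents without follow-ups, follow-ups recurring across post-mortems)."""
    without = [p.incident_id for p in postmortems if not p.follow_ups]
    counts = Counter()
    for p in postmortems:
        # Count each action at most once per post-mortem.
        counts.update({normalize(a) for a in p.follow_ups})
    repeated = sorted(a for a, n in counts.items() if n >= min_repeats)
    return without, repeated


# Made-up sample data, only to show the shape of the check.
history = [
    Postmortem("INC-1", ["Add alert on queue depth", "Document rollback steps"]),
    Postmortem("INC-2", []),
    Postmortem("INC-3", ["document rollback  steps"]),
]
without, repeated = learning_telltales(history)
print(without)   # ['INC-2']
print(repeated)  # ['document rollback steps']
```

Exact string matching after normalization is deliberately naive; in practice, similar follow-ups are often worded differently, so a check like this would likely undercount repeats.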

One solution? Aim for as little PMA as possible, but as much PMA as necessary, then approach that balance from “slightly more PMA” so as to err on the side of learning too much rather than too little. Communicate both aim and approach.

❧ Is there more to incidents? Yes, absolutely. Are those mitigation and learning challenges all addressed now? No, certainly not. But in the end, when in an incident—mitigate, and learn; rinse, and repeat. Don’t skimp on this.

On this light note, I’m wrapping up 2023! Have a great start to the new year, everyone.

About Me

Jens Oliver Meiert, on September 30, 2021.

I’m Jens, and I’m an engineering lead and author. I’ve worked as a technical lead for companies like Google, I’m close to W3C and WHATWG, and I write and review books for O’Reilly and Frontend Dogma. I love trying things, not only in web development, but also in other areas like philosophy. Here on meiert.com I share some of my views and experiences.
