When IT Disaster Strikes, Part 1: Resolving Incidents

As a developer or operations team member, there is nothing quite like the dread you feel when you hear the familiar ringtone of your on-call page at 3 a.m. Being on call means that you may be contacted at any time to investigate and fix issues that arise for the system, but that doesn’t mean you can’t get ready beforehand. There are steps your team can take to prepare for incidents and streamline the process of resolving them, leading to fewer 3 a.m. wake-up calls and better-running software and services.

Below is an overview of a proven approach your organization can take before, during and after an IT incident to reduce headaches and increase hours of sleep.

Before: Define the Who, What, When of Incidents

The first step in any incident prevention and orchestration process is determining how your organization defines an incident. Generally, this is done by defining several degrees of severity, with lower-numbered severities being more urgent. Each operational issue can then be classified at one of these severity levels. As a general rule of thumb, if you are unsure which level an incident is, treat it as the more severe one to ensure it is dealt with promptly.
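As a rough illustration of the rule of thumb above, here is a minimal Python sketch. The four-level scale, the names SEV1 through SEV4 and the classify helper are assumptions made for this example; your organization may define more or fewer levels and name them differently.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale: a lower number means more urgent."""
    SEV1 = 1  # assumed to be the most urgent level
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4  # assumed to be the least urgent level


def classify(candidates):
    """Return the most severe (lowest-numbered) of the plausible levels.

    When responders are unsure which level applies, passing every
    plausible level here applies the rule of thumb: treat the incident
    as the more severe option.
    """
    if not candidates:
        raise ValueError("at least one candidate severity is required")
    return min(candidates)


# Unsure whether this is a SEV3 or a SEV2? Treat it as a SEV2.
print(classify([Severity.SEV3, Severity.SEV2]).name)  # SEV2
```

Because the levels are plain integers, "more severe" is simply the smaller number, which keeps the tie-breaking rule to a single comparison.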

In addition to aligning your team on how an incident is defined, which communicates the sense of urgency involved, there are several main roles that should be designated ahead of potential IT issues. Certain roles have only one person per incident, while other roles can have multiple people assigned to them. It’s all about coming together as a team, working the problem and getting a solution quickly. Generally speaking, developer, IT Ops and DevOps teams must designate the following roles:

A point person or “Incident Commander”: someone who will drive the process forward, but not be involved in the actual remediation;

Someone to document a timeline of an incident as it progresses for future analysis and learning and to act as a backup for the point person;

One or more subject matter experts who are deeply familiar with the specified component or service and who will take the remediation steps; and

One person to manage customer support and communication during and after an incident.

During: Keep Calm and Communicate

If you are alerted to a major incident, don’t panic. The first step is to join the previously agreed-upon communication channel for incidents so that communication can run smoothly throughout the resolution process. The incident commander (IC) and the IC deputy should announce the issue in the appropriate communication channels and lead the process, and it is best to defer to the subject matter experts assigned to the incident so that non-essential communication is kept to a minimum.

The steps you take to prepare for an incident have a significant impact on how quickly you are able to move when disaster strikes. If incident prep has been done correctly, each member of the incident response team will have a very specific role and set of responsibilities carved out, ranging from someone to provide regular updates in the chat client to someone to modify your company’s status page to keep customers informed. By having these roles defined beforehand, you don’t have to spend valuable time during an incident figuring them out instead of fixing the problem. In a future article, I will break down the specific role of each member of the incident response team and the steps they should follow during an incident.
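One way to make sure the roles are settled before an incident is to write the roster down in a form that can be checked. The Python sketch below is only an illustration under assumed names: IncidentRoster, its field names and the validate method are hypothetical, not part of any particular tool. It encodes the constraints described earlier: one incident commander who does not remediate, one deputy, one customer contact, and one or more subject matter experts.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentRoster:
    """Hypothetical roster of incident response roles."""
    incident_commander: str  # exactly one; drives the process, does not remediate
    deputy: str              # exactly one; documents the timeline, backs up the commander
    customer_liaison: str    # exactly one; handles customer support and communication
    subject_matter_experts: List[str] = field(default_factory=list)  # one or more

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the roster is usable."""
        problems = []
        if not self.subject_matter_experts:
            problems.append("no subject matter expert assigned")
        if self.incident_commander in self.subject_matter_experts:
            problems.append("incident commander should not also remediate")
        return problems


roster = IncidentRoster(
    incident_commander="alice",
    deputy="bob",
    customer_liaison="carol",
    subject_matter_experts=["dave", "erin"],
)
print(roster.validate())  # []
```

Running a check like this when the on-call schedule changes, rather than during an incident, is what keeps the role discussion out of the critical path.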

After: Conduct a Post-Mortem Without Finger-Pointing

For every major incident, you should follow up with a post-mortem: a blame-free, detailed description of exactly what went wrong to cause the incident, along with a list of steps to take to prevent a similar incident from occurring again in the future. The incident response process itself should also be included as part of the review.

As the IT incidents we deal with daily become increasingly tied to larger organizational success and business objectives, streamlining the resolution process is a must. According to a report from IDC, the average cost of an infrastructure failure is $100,000 per hour, and the average cost of a critical application failure is $500,000 to $1 million per hour. In future articles, I will break down the varying roles on an incident response team, specific steps to follow during an incident, a tried-and-true template for conducting a successful post-mortem and more, to ensure effective incident prevention and resolution.

About the Author / Eric Sigler

Eric Sigler is the Head of DevOps at PagerDuty, helping protect its customers from the pains of downtime. Before his current role, Eric led infrastructure teams at Minted, Expensify, and the Missouri University of Science and Technology. Connect with him on Twitter.
