How to Notify Your Team of Errors: Email vs. Slack vs. PagerDuty
Site Reliability Engineering (SRE) and Operations (Ops) teams rely heavily on notifications. We use them to know what’s going on with application workloads and how applications are performing. Notifications are critical to ensuring SREs and Ops teams can resolve errors and reduce downtime. They’re also crucial when monitoring environments, not only in production but also during the dev-test or staging phase.
Having monitoring tools available is essential. They collect and provide the data that enables investigation and mitigation of issues. But teams don’t have to watch complex monitoring dashboards anymore. We all remember the 40-inch flat screens hanging in the IT helpdesk office. These dashboards mainly showed reactive information: when an incident occurred, the screen would switch an object from green (healthy) to amber (warning) or red (alert). Since then, numerous ways of notifying people have evolved. This article compares three of them: email, Slack, and PagerDuty.
The Importance of Error Notifications
The pandemic has forced many people to work remotely. With so many of us working from home, a big screen in the IT helpdesk office won’t help us anymore. But even when SRE and Ops engineers were still working in the office, the big flat-screen monitor had already lost its charm.
The color of an object on a dashboard doesn’t reliably identify the root cause or the full impact of an outage. Some alerts, like a partial or complete outage, need our immediate attention and must be resolved as soon as possible. Others, such as an application occasionally throwing a non-critical exception, can wait until we have time available.
When it comes to IT incident management, there are two important metrics to know:
- Mean Time to Detection (MTTD) measures the average time it takes to detect an incident. The shorter the MTTD, the faster DevOps teams can start investigating and fixing the incident.
- Mean Time to Resolution (MTTR) measures the average time it takes to resolve an alert or incident. Again, the shorter this time, the better.
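As a rough illustration of how these two averages could be computed, here is a minimal Python sketch. The `Incident` record and its three timestamps are hypothetical, and the sketch assumes MTTR is measured from detection to resolution; some teams measure it from the start of the incident instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime    # when the fault actually began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def mean_time_to_detection(incidents: list[Incident]) -> timedelta:
    """Average gap between an incident starting and being detected."""
    gaps = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(gaps))


def mean_time_to_resolution(incidents: list[Incident]) -> timedelta:
    """Average gap between detection and resolution (an assumption, see above)."""
    gaps = [(i.resolved - i.detected).total_seconds() for i in incidents]
    return timedelta(seconds=mean(gaps))


# Two made-up incidents: detected after 5 and 15 minutes,
# resolved 30 and 50 minutes after detection.
incidents = [
    Incident(datetime(2022, 3, 1, 10, 0), datetime(2022, 3, 1, 10, 5), datetime(2022, 3, 1, 10, 35)),
    Incident(datetime(2022, 3, 2, 14, 0), datetime(2022, 3, 2, 14, 15), datetime(2022, 3, 2, 15, 5)),
]

print("MTTD:", mean_time_to_detection(incidents))   # MTTD: 0:10:00
print("MTTR:", mean_time_to_resolution(incidents))  # MTTR: 0:40:00
```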
Unfortunately, the right targets for these two metrics aren’t always obvious, because each incident may call for a different response time depending on its root cause and criticality. A low-criticality issue can tolerate a lengthy resolution time, while a high-criticality outage needs to be resolved as quickly as possible.
A recommended way to triage alerts and incidents is to set up proper notification channels. Coordinating your notification channels well lets you identify the severity and frequency of a situation and route it to the correct person in a way that they’re comfortable with and, more importantly, will acknowledge.
Error Notification Channels
Let’s compare some common uses of notification channels:
- High-risk (critical and real-time) notifications using PagerDuty
- Medium-risk alerts on Slack
- Low-risk notifications using email
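The mapping above can be sketched as a small routing table. This is only an illustration: the `Severity` enum, the `ROUTES` table, and the channel names are hypothetical, and a real setup would typically configure this routing in the monitoring or log management tool rather than in application code.

```python
from enum import Enum


class Severity(Enum):
    """Hypothetical risk levels matching the list above."""
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# One channel per risk level, as described in the article.
ROUTES = {
    Severity.HIGH: "pagerduty",
    Severity.MEDIUM: "slack",
    Severity.LOW: "email",
}


def route(severity: Severity) -> str:
    """Return the name of the channel that should carry this alert."""
    return ROUTES[severity]


for level in Severity:
    print(f"{level.value}-risk alerts go to {route(level)}")
```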
PagerDuty is an incident and event management tool. It analyzes signals from our IT environment, whether running on-premises, in a public cloud, or in a hybrid cloud. PagerDuty recognizes alerts coming from monitoring tools and identifies similar incidents. Then, it helps on-call teams run automated playbooks and keeps them up to date with relevant information. PagerDuty also uses machine learning to recognize possibly related incidents, drawing on both public and in-house information sources for detailed analysis. All of these capabilities make PagerDuty a strong fit for handling high-risk, business-critical outages, and they help reduce both MTTD and MTTR.
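For teams that send alerts from their own code, one possible shape of a high-risk notification is a trigger event for the PagerDuty Events API v2. The sketch below only builds the HTTP request and does not send it. The integration key, summary, and source values are placeholders, and `build_pagerduty_trigger` is a hypothetical helper, not part of any PagerDuty library.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_pagerduty_trigger(routing_key: str, summary: str, source: str) -> urllib.request.Request:
    """Build (but do not send) a 'trigger' event for a critical incident."""
    event = {
        "routing_key": routing_key,   # integration key of the PagerDuty service
        "event_action": "trigger",    # open a new alert
        "payload": {
            "summary": summary,       # short human-readable description
            "source": source,         # the affected host or system
            "severity": "critical",   # high-risk alerts only, per the split above
        },
    }
    return urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values for illustration only.
pd_request = build_pagerduty_trigger(
    "YOUR_INTEGRATION_KEY", "Checkout API is returning 5xx errors", "checkout-prod-01"
)
# urllib.request.urlopen(pd_request) would send it; omitted so the sketch runs offline.
print(pd_request.get_method(), pd_request.full_url)
```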
Slack is a collaboration and communication platform that uses channels. In IT incident management, we can use Slack to contact the right people through a specific channel. We can integrate it with a DevOps pipeline that deploys application workloads and use it for channel notifications, such as updates about failed pipeline runs. The team working on pipeline deployment will immediately see the notifications and act. In this case, that could mean reverting to the last successful pipeline run. In the event of outages and larger incidents, Slack is helpful as a forum for teams to discuss ideas. Organizing application landscape components in dedicated channels can shorten both MTTD and MTTR, even if the issues are not business-critical. Plus, these channels are accessible to developers, infrastructure teams, and any other required stakeholders.
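A failed-pipeline notification like the one described here could be posted through a Slack incoming webhook, which accepts a JSON body with a `text` field. The sketch below builds the request without sending it. The webhook URL, pipeline name, and run ID are placeholders, and `build_slack_notification` is a hypothetical helper.

```python
import json
import urllib.request


def build_slack_notification(webhook_url: str, pipeline: str, run_id: str) -> urllib.request.Request:
    """Build (but do not send) a channel message about a failed pipeline run."""
    message = {
        "text": (
            f"Pipeline `{pipeline}` failed on run {run_id}. "
            "Consider reverting to the last successful run."
        )
    }
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder webhook URL; a real one is issued per channel by Slack.
slack_request = build_slack_notification(
    "https://hooks.slack.com/services/EXAMPLE/WEBHOOK/PATH", "deploy-web", "1234"
)
# urllib.request.urlopen(slack_request) would post it; omitted so the sketch runs offline.
print(json.loads(slack_request.data)["text"])
```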
While email used to be the typical notification medium, it’s lost momentum as a business-critical notification method. One reason for this is that it leads to information overload, known as incident information fatigue. Incident information fatigue is often the product of receiving hundreds or thousands of emails sent to multiple distribution groups. As a result, nobody acts or knows what to do.
Though it doesn’t always reduce MTTD and MTTR, email is still a viable solution for errors that teams will need to solve eventually. Email also helps with archiving and compliance. For example, we might summarize the incident email traffic from the last few days or weeks and use that summary to help with sprint planning. Since most systems can communicate with email, it can also be a useful last resort to inform teams using legacy systems.
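One way to limit email overload for low-risk errors is to batch them into a periodic digest instead of sending one message per event. The following is a minimal sketch using Python's standard `email` package. The error names and addresses are made up, and `build_digest` is a hypothetical helper.

```python
from collections import Counter
from email.message import EmailMessage


def build_digest(errors: list[str], sender: str, recipient: str) -> EmailMessage:
    """Summarize low-risk errors into a single email, grouped by error type."""
    counts = Counter(errors)
    lines = [f"{name}: {count}" for name, count in counts.most_common()]

    msg = EmailMessage()
    msg["Subject"] = f"Low-risk error digest: {len(errors)} events"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(lines))
    return msg


# Made-up error names and addresses for illustration.
digest = build_digest(
    ["TimeoutError", "KeyError", "TimeoutError"],
    "alerts@example.com",
    "ops-team@example.com",
)
print(digest["Subject"])
print(digest.get_content())
```

Sending the message would typically go through `smtplib.SMTP(...).send_message(digest)` against your own mail server, which is left out here.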
Mezmo, formerly known as LogDNA, understands the need for different notifications for different types of issues and offers you plenty of flexibility. If you’re not a Mezmo customer yet, you can sign up for a free trial. If you’re already using Mezmo as your trusted centralized log management solution, you can review the documentation on configuring the proper notifications for your organization at any time.