How to Notify Your Team of Errors: Email vs. Slack vs. PagerDuty

4 MIN READ
MIN READ

Site Reliability Engineering (SRE) and Operations (Ops) teams heavily rely on notifications. We use them to know what’s going on with application workloads and how applications are performing. Notifications are critical to ensuring SREs and Ops teams can resolve errors and reduce downtime. They’re also crucial when monitoring environments — not only when running in production but also during the dev-test or staging phase.


Having monitoring tools available is essential. They’re responsible for collecting and providing the necessary data that enables investigation and mitigation of issues. Teams don’t have to watch complex monitor dashboards anymore. We all remember the 40-inch flat screens hanging up in the IT helpdesk office. These dashboards mainly showed reactive information. When an incident occurred, the screen would switch an object from green (healthy) to amber (warning) or red (alert). Since then, numerous ways have evolved to notify people. This article discusses three such improvements called email, Slack, and PagerDuty.

The Importance of Error Notifications

The pandemic has forced many people to work remotely. Since we’re all working at home, having a big screen in the IT Helpdesk office won’t help us anymore. But even when SRE and Ops engineers were still working in the office, the big flat screen monitor had already lost its charm. 

Having the color of an object on the dashboard doesn’t consistently identify the root cause or the full impact of an outage. Some notices, like a partial or complete outage, must be resolved as soon as possible. They need our immediate attention. Other alerts, such as when an application occasionally throws a non-critical exception, can wait until we have sufficient time available. 

When it comes to IT incident management, there are two important definitions to know:

  • Mean Time to Detection (MTTD) identifies the amount of time it takes to detect and identify an incident. The shorter the MTTD, the faster DevOps teams can start taking action to investigate and fix the incident.
  • Mean Time to Resolution (MTTR) identifies how long it took to mitigate an alert or incident. Again, the shorter amount of time this takes, the better.

Unfortunately, those two metrics aren’t always obvious, as each incident might require a different response time. It depends on the root cause or the criticality of an incident. The time it takes to resolve the issue could be lengthy for situations with low criticality, or it could be as short as possible for high criticality.

A recommended way to triage alerts and incidents is by having a proper notification channel. Properly coordinating your notification channels can allow you to identify the severity and frequency of a situation and route it to the correct person in a way that they’re comfortable with and, more importantly, will acknowledge. 

Error Notification Channels

Let’s compare some common uses of notification channels:

  • High-risk (critical and real-time) notifications using PagerDuty
  • Medium-risk alerts on Slack
  • Low-risk notifications using email 

PagerDuty is an incident event management tool. It analyzes signals from our IT environment, whether running on-premises, in a public cloud, or a hybrid cloud. PagerDuty recognizes alerts coming from monitoring tools. It identifies similar incidents. Then, it helps on-call teams run automated playbooks and keep them up-to-date with relevant information. PagerDuty also recognizes possibly related incidents by relying on machine learning intelligence. The use of machine learning enables detailed analysis using public and in-house created information sources. All of these capabilities make PagerDuty a perfect solution for handling high-risk, business-critical outages. It also optimizes both MTTD and MTTR.

Slack is a collaboration and communication platform that uses channels. In IT incident management, we can use Slack to contact the right people through using a specific channel. We can integrate it with a DevOps pipeline that deploys application workloads. Also, we can use it for channel notification updates, such as updates about failed pipeline runs. The team working on pipeline deployment will immediately see the notifications and act. In this case, it could mean reverting to the last pipeline run. In the event of outages and larger incidents, Slack is helpful as a forum for teams to discuss ideas. Organizing application landscape components in dedicated channels can shorten both MTTD and MTTR, even if the issues are not business-critical. Plus, they’re accessible by developers and infrastructure teams, and any required stakeholder.

While email used to be the typical notification medium, it’s lost momentum as a business-critical notification method. One reason for this is that it leads to information overload, known as incident information fatigue. Incident information fatigue is often the product of receiving hundreds or thousands of emails sent to multiple distribution groups. As a result, nobody acts or knows what to do.

Though it doesn’t always reduce MTTD and MTTR, email is still a viable solution for errors that teams will need to solve eventually. Email helps with archiving or compliance purposes. For example, we might use it to summarize the incident email traffic from the last few days or weeks and use it to help outline sprint planning. Since most systems can communicate with email, it can also be a useful last resort to inform teams using legacy systems.

Conclusion

Mezmo, formerly known as LogDNA, understands the need for different notifications for different types of issues and offers you plenty of flexibility. If you’re not a Mezmo customer yet, you can sign up for a free trial. If you’re already using Mezmo as your trusted centralized log management solution, you can review the documentation on configuring the proper notifications for your organization at any time.


Table of Contents

    Share Article

    RSS Feed

    Next blog post
    You're viewing our latest blog post.
    Previous blog post
    You're viewing our oldest blog post.
    Mezmo’s AI-powered Site Reliability Engineering (SRE) agent for Root Cause Analysis (RCA)
    What is Active Telemetry
    Launching an agentic SRE for root cause analysis
    Paving the way for a new era: Mezmo's Active Telemetry
    The Answer to SRE Agent Failures: Context Engineering
    Empowering an MCP server with a telemetry pipeline
    The Debugging Bottleneck: A Manual Log-Sifting Expedition
    The Smartest Member of Your Developer Ecosystem: Introducing the Mezmo MCP Server
    Your New AI Assistant for a Smarter Workflow
    The Observability Problem Isn't Data Volume Anymore—It's Context
    Beyond the Pipeline: Data Isn't Oil, It's Power.
    The Platform Engineer's Playbook: Mastering OpenTelemetry & Compliance with Mezmo and Dynatrace
    From Alert to Answer in Seconds: Accelerating Incident Response in Dynatrace
    Taming Your Dynatrace Bill: How to Cut Observability Costs, Not Visibility
    Architecting for Value: A Playbook for Sustainable Observability
    How to Cut Observability Costs with Synthetic Monitoring and Responsive Pipelines
    Unlock Deeper Insights: Introducing GitLab Event Integration with Mezmo
    Introducing the New Mezmo Product Homepage
    The Inconvenient Truth About AI Ethics in Observability
    Observability's Moneyball Moment: How AI Is Changing the Game (Not Ending It)
    Do you Grok It?
    Top Five Reasons Telemetry Pipelines Should Be on Every Engineer’s Radar
    Is It a Cup or a Pot? Helping You Pinpoint the Problem—and Sleep Through the Night
    Smarter Telemetry Pipelines: The Key to Cutting Datadog Costs and Observability Chaos
    Why Datadog Falls Short for Log Management and What to Do Instead
    Telemetry for Modern Apps: Reducing MTTR with Smarter Signals
    Transforming Observability: Simpler, Smarter, and More Affordable Data Control
    Datadog: The Good, The Bad, The Costly
    Mezmo Recognized with 25 G2 Awards for Spring 2025
    Reducing Telemetry Toil with Rapid Pipelining
    Cut Costs, Not Insights:   A Practical Guide to Telemetry Data Optimization
    Webinar Recap: Telemetry Pipeline 101
    Petabyte Scale, Gigabyte Costs: Mezmo’s Evolution from ElasticSearch to Quickwit
    2024 Recap - Highlights of Mezmo’s product enhancements
    My Favorite Observability and DevOps Articles of 2024
    AWS re:Invent ‘24: Generative AI Observability, Platform Engineering, and 99.9995% Availability
    From Gartner IOCS 2024 Conference: AI, Observability Data, and Telemetry Pipelines
    Our team’s learnings from Kubecon: Use Exemplars, Configuring OTel, and OTTL cookbook
    How Mezmo Uses a Telemetry Pipeline to Handle Metrics, Part II
    Webinar Recap: 2024 DORA Report: Accelerate State of DevOps
    Kubecon ‘24 recap: Patent Trolls, OTel Lessons at Scale, and Principle Platform Abstractions
    Announcing Mezmo Flow: Build a Telemetry Pipeline in 15 minutes
    Key Takeaways from the 2024 DORA Report
    Webinar Recap | Telemetry Data Management: Tales from the Trenches
    What are SLOs/SLIs/SLAs?
    Webinar Recap | Next Gen Log Management: Maximize Log Value with Telemetry Pipelines
    Creating In-Stream Alerts for Telemetry Data
    Creating Re-Usable Components for Telemetry Pipelines
    Optimizing Data for Service Management Objective Monitoring
    More Value From Your Logs: Next Generation Log Management from Mezmo
    A Day in the Life of a Mezmo SRE
    Webinar Recap: Applying a Data Engineering Approach to Telemetry Data
    Dogfooding at Mezmo: How we used telemetry pipeline to reduce data volume
    Unlocking Business Insights with Telemetry Pipelines
    Why Your Telemetry (Observability) Pipelines Need to be Responsive
    How Data Profiling Can Reduce Burnout
    Data Optimization Technique: Route Data to Specialized Processing Chains
    Data Privacy Takeaways from Gartner Security & Risk Summit
    Mastering Telemetry Pipelines: Driving Compliance and Data Optimization
    A Recap of Gartner Security and Risk Summit: GenAI, Augmented Cybersecurity, Burnout
    Why Telemetry Pipelines Should Be A Part Of Your Compliance Strategy
    Pipeline Module: Event to Metric
    Telemetry Data Compliance Module
    OpenTelemetry: The Key To Unified Telemetry Data
    Data optimization technique: convert events to metrics
    What’s New With Mezmo: In-stream Alerting
    How Mezmo Used Telemetry Pipeline to Handle Metrics
    Webinar Recap: Mastering Telemetry Pipelines - A DevOps Lifecycle Approach to Data Management
    Open-source Telemetry Pipelines: An Overview
    SRECon Recap: Product Reliability, Burn Out, and more
    Webinar Recap: How to Manage Telemetry Data with Confidence
    Webinar Recap: Myths and Realities in Telemetry Data Handling
    Using Vector to Build a Telemetry Pipeline Solution
    Managing Telemetry Data Overflow in Kubernetes with Resource Quotas and Limits
    How To Optimize Telemetry Pipelines For Better Observability and Security
    Gartner IOCS Conference Recap: Monitoring and Observing Environments with Telemetry Pipelines
    AWS re:Invent 2023 highlights: Observability at Stripe, Capital One, and McDonald’s
    Webinar Recap: Best Practices for Observability Pipelines
    Introducing Responsive Pipelines from Mezmo
    My First KubeCon - Tales of the K8’s community, DE&I, sustainability, and OTel
    Modernize Telemetry Pipeline Management with Mezmo Pipeline as Code
    How To Profile and Optimize Telemetry Data: A Deep Dive
    Kubernetes Telemetry Data Optimization in Five Steps with Mezmo
    Introducing Mezmo Edge: A Secure Approach To Telemetry Data
    Understand Kubernetes Telemetry Data Immediately With Mezmo’s Welcome Pipeline
    Unearthing Gold: Deriving Metrics from Logs with Mezmo Telemetry Pipeline
    Webinar Recap: The Single Pane of Glass Myth
    Empower Observability Engineers: Enhance Engineering With Mezmo
    Webinar Recap: How to Get More Out of Your Log Data
    Unraveling the Log Data Explosion: New Market Research Shows Trends and Challenges
    Webinar Recap: Unlocking the Full Value of Telemetry Data
    Data-Driven Decision Making: Leveraging Metrics and Logs-to-Metrics Processors
    How To Configure The Mezmo Telemetry Pipeline
    Supercharge Elasticsearch Observability With Telemetry Pipelines
    Enhancing Grafana Observability With Telemetry Pipelines
    Optimizing Your Splunk Experience with Telemetry Pipelines
    Webinar Recap: Unlocking Business Performance with Telemetry Data
    Enhancing Datadog Observability with Telemetry Pipelines
    Transforming Your Data With Telemetry Pipelines
    6 Steps to Implementing a Telemetry Pipeline