A Day in the Life of a Mezmo SRE

4 MIN READ

What keeps an SRE at the top of his game? I had an insightful conversation with Jon Duarte, a Site Reliability Engineer (SRE) at Mezmo, who walked me through his role and the tasks he manages on a typical day. Here’s Jon offering a brief glimpse into the challenges he faces, the thinking behind his approach, and the solutions SREs come up with.

Tell us about your background and the role at Mezmo.

Just out of college, I went into operations application support, and that's when I got my first glimpse of DevOps / SRE. It's the role I always wanted, because I like clarifying things and learning new things. I have been an SRE since 2011 and with Mezmo for the last two years.

The SRE role at Mezmo involves end-to-end support, working with new applications, testing, and solving challenging problems. 

What does a day in your life look like? 

Here at Mezmo, very generally speaking, our work falls into two categories: things that need fixing or improving, and project work. For both, we follow a workflow for identifying and picking up tickets in a timely manner. It is a balance between providing detailed information in the tickets and managing communication across different channels, such as Slack, during incidents or projects. A key task is folding the information from Slack or other channels back into the tickets or documentation, and our goal is to automate that process to minimize toil and reduce the time spent.
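
To make that concrete, here is a minimal sketch of the kind of automation Jon is describing, assuming the slack_sdk and jira Python packages; the channel ID, thread timestamp, and issue key are hypothetical placeholders, and Mezmo's actual tooling may look quite different.

```python
# Minimal sketch: copy an incident's Slack thread into a ticket comment.
# Assumes the slack_sdk and jira packages; the channel ID, thread timestamp,
# and issue key below are hypothetical placeholders.
import os

from jira import JIRA
from slack_sdk import WebClient

slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
tracker = JIRA(server="https://example.atlassian.net",
               basic_auth=(os.environ["JIRA_USER"], os.environ["JIRA_TOKEN"]))

def archive_thread(channel_id: str, thread_ts: str, issue_key: str) -> None:
    """Fetch every reply in an incident thread and post it as one ticket comment."""
    replies = slack.conversations_replies(channel=channel_id, ts=thread_ts)
    lines = [f"{msg.get('user', 'unknown')}: {msg.get('text', '')}"
             for msg in replies["messages"]]
    tracker.add_comment(issue_key, "Incident timeline from Slack:\n" + "\n".join(lines))

archive_thread("C0INCIDENTS", "1714000000.000100", "OPS-123")
```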

How do you know, agnostic of the tools, what data will be relevant to a particular incident?

This is one of our challenges: for every incident, small or large, there is a lot of data spread across tools like Datadog, Sysdig, Quickwit, and others. We have to sift through that large volume of data to identify the relevant details and add them to runbooks or other documentation in Slack or Confluence.

Many SREs develop internal tools to streamline this process, and we are always adding commands to reduce the toil of searching for information. New commands come with a learning curve, though, and we need to pick them up quickly to handle incidents effectively.

Of course, identifying relevant data and getting at it can be challenging, so we want to automate repetitive tasks and cut down the time we spend on them. One cool thing that I'm testing, and actually have been using quite often, is summarizing incident data with AI. The summary can serve as a template for documenting longer incidents efficiently.
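
As a rough illustration of that idea, and not Mezmo's actual tooling, the sketch below assembles an incident timeline into a prompt and asks a chat-completion model for a first-draft summary. It assumes the openai Python package; the model name and timeline entries are placeholders.

```python
# Rough sketch: turn a raw incident timeline into a first-draft summary that
# can be pasted into a ticket. Assumes the openai package and an API key in
# OPENAI_API_KEY; the model name and timeline entries are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_incident(timeline: list[str]) -> str:
    """Ask the model for impact, root cause, and a bulleted timeline."""
    prompt = (
        "Summarize this incident for a postmortem. Include impact, root cause "
        "(if known), and a bulleted timeline:\n" + "\n".join(timeline)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(summarize_incident([
    "14:02 PagerDuty alert: consumer lag rising on topic 'ingest'",
    "14:10 Scaled consumers from 6 to 12 pods",
    "14:25 Lag recovered; alert resolved",
]))
```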

Can you explain the workflow for integrating a new feature or app into the system? Do you work directly with the engineers during this process, or do you explore the data yourself?

We Site Reliability Engineers work primarily through Slack and other channels integrated with GitHub to track releases and project updates. The focus is on gathering pertinent data from those channels; when something interesting comes up, we reach out to the developers or research it ourselves to get more information.

Rather than diving directly into raw logs, metrics, or traces, we start at a higher level. For significant releases, like updates to Kafka, proactively monitoring logs and performance helps us understand potential impacts and prepare for incidents.

From my perspective, understanding different technologies like Kafka and their roles in telemetry pipelines is essential for SREs to maintain system reliability and keep the team informed.

Tell us about your process when an incident happens. What does that look like?

When an alert appears in PagerDuty, what matters first is that Kafka's observability has been set up correctly: integrating with Sysdig, configuring the app settings, writing and testing PromQL queries for the metrics we care about, and using Terraform to manage the alerts. We collect metrics from Kubernetes, weed out false alerts, and monitor everything once it's integrated. That complex process is what streamlines alerting and reduces toil.
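
For flavor, here is a minimal sketch of testing a PromQL query before wiring it into a managed alert. It assumes a Prometheus-compatible query endpoint and a kafka_consumergroup_lag metric from a Kafka exporter; both are assumptions for illustration, not a description of Mezmo's setup.

```python
# Minimal sketch: run a candidate PromQL alert query against a
# Prometheus-compatible HTTP API to see which series would fire.
# The endpoint URL and the kafka_consumergroup_lag metric name are assumptions.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"
# Candidate alert condition: 5-minute average consumer lag above 10k messages.
QUERY = "avg_over_time(kafka_consumergroup_lag[5m]) > 10000"

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    _, value = series["value"]  # instant vector sample: [timestamp, value]
    print(f"group={labels.get('consumergroup')} topic={labels.get('topic')} lag={value}")
```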

So, that is the toil I was talking about, but the goal is to get everything working and streamlined. Our SRE team uses Sysdig to aggregate metric events based on applied logic, which lets us define precisely what gets aggregated.

So, you want to aggregate the data about incidents, but how does that help you?

Consider a scenario where a log analysis tool sees a sudden surge in log data from a specific source, and that surge overloads the system. Often some of those log entries are null or repetitive, so they can be filtered or reduced to prevent the overload. Sometimes they can even be collapsed into a single line saying that a given string was repeated a number of times.
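
As a toy illustration of that reduction step, and not Mezmo's pipeline implementation, the sketch below drops empty entries and collapses consecutive duplicates into a single line with a repeat count.

```python
# Toy sketch: filter out empty log lines and collapse consecutive duplicates
# into one line with a repeat count, the kind of reduction described above.
from itertools import groupby

def reduce_logs(lines: list[str]) -> list[str]:
    kept = [line for line in lines if line.strip()]  # drop null/empty entries
    reduced = []
    for line, run in groupby(kept):
        count = sum(1 for _ in run)
        reduced.append(line if count == 1 else f"{line}  (repeated {count} times)")
    return reduced

surge = [
    "connection reset by peer",
    "connection reset by peer",
    "connection reset by peer",
    "",
    "request completed in 12ms",
]
print("\n".join(reduce_logs(surge)))
# connection reset by peer  (repeated 3 times)
# request completed in 12ms
```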

That example is one event consuming huge resources. On dedicated hardware there are limits, because we don't want performance affected, and in more scalable environments we have to be more careful about data volume, storage space, and cost. So, aggregating data about incidents saves resources.

Changing gears here, what would your life be like without a tool like Mezmo?

Before Mezmo, if we had multiple environments, moving logs between repositories was labor-intensive, with manual toil to configure every change. Mezmo simplifies log management with telemetry pipelines: we can take logs from different sources and send them to our log analysis tools, where we can line-parse, filter, and search them in an easy user interface, and we can do that for any combination of sources and destinations. Mezmo streamlines log movement between environments with drag-and-drop simplicity. Without a tool like that, there would be a lot of toil and wasted resources.
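
Conceptually, and only conceptually, since Mezmo pipelines are configured through its interface rather than code like this, the routing Jon describes is a chain of small processors sitting between sources and destinations. The parse and filter steps in the sketch below are invented for illustration.

```python
# Conceptual sketch of a telemetry pipeline: parse each log line, run it
# through filter processors, and emit what survives to a destination.
# This mirrors the idea described above, not Mezmo's configuration model.
import json
from typing import Callable, Iterable, Optional

Processor = Callable[[dict], Optional[dict]]

def parse(line: str) -> dict:
    """Line-parse a 'key=value' style log line into fields."""
    return dict(part.split("=", 1) for part in line.split() if "=" in part)

def drop_debug(event: dict) -> Optional[dict]:
    """Filter step: discard low-value debug events before they reach storage."""
    return None if event.get("level") == "debug" else event

def run_pipeline(lines: Iterable[str], processors: list[Processor]) -> list[str]:
    out = []
    for line in lines:
        event: Optional[dict] = parse(line)
        for proc in processors:
            if event is None:
                break
            event = proc(event)
        if event is not None:
            out.append(json.dumps(event))  # stand-in for sending to an analysis tool
    return out

print(run_pipeline(
    ["level=debug msg=heartbeat", "level=error msg=timeout source=api"],
    [drop_debug],
))
```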

Do you wait for an incident to happen, or do you try to identify things before they happen?

We are always trying to identify things proactively. As SREs, we rely on observability tools to analyze data early on, and we are, in general, skeptical of the data presented to us. So, it would be nice to have tools, like our pipeline tools, that provide actionable insights such as query improvements and performance enhancements, suggesting or even implementing optimizations to help SREs stay ahead of potential problems. That approach reduces toil and helps us manage the complexities of system operations more effectively.

During our discussion, Jon offered a unique perspective on how automation and AI are transforming SRE tasks, reducing toil and helping with optimization.

As we approached the end of the chat, Jon also discussed the challenges of understanding the metrics needed for new applications, and distinguishing between known and unknown metrics. He suggested that it would be helpful to have tools that identify and suggest essential metrics based on the system's usage, to optimize data collection and reduce costs.

It was a rewarding conversation. Thank you, Jon, for giving us a peek into your day and helping us understand the importance of telemetry pipelines and a proactive approach to prevent potential issues.
