A Day in the Life of a Mezmo SRE
8.28.24
What keeps an SRE at the top of his game? I had an insightful conversation with Jon Duarte, a Site Reliability Engineer (SRE) at Mezmo, and he walked me through his role and the various tasks he manages on a typical day. Here’s Jon offering a brief glimpse into the challenges he faces, the thought processes behind his approach, and the innovative solutions SREs come up with.
Tell us about your background and the role at Mezmo.
Just out of college, I went into operations application support, and that's when I got my first glimpse of DevOps / SRE. This is the role I always wanted, as I like to help clarify things and learn new things. I have been an SRE since 2011 and with Mezmo for the last two years.
The SRE role at Mezmo involves end-to-end support, working with new applications, testing, and solving challenging problems.
What does a day in your life look like?
Here at Mezmo, very generally speaking, our work falls into two categories: things that need fixing or improving, and project work. For both, we follow a workflow for identifying and working on tickets in a timely manner. It involves a balance of providing detailed information in tickets and managing communication across different channels, such as Slack, during incidents or projects. A key task is integrating the information from Slack or other channels into the tickets or documentation. Our goal is to automate this process to minimize toil and reduce the time spent.
How do you know, agnostic of the tools, what data will be relevant to a particular incident?
This is one challenge for us: for every incident, small or large, there is a lot of data in various tools like Datadog, Sysdig, QuickWit, and others. We have to sift through a large volume of data to identify the relevant details and add them to runbooks or other documentation systems like Slack or Confluence.
Many SREs develop internal tools to streamline this process, and we are always adding commands to help reduce the toil of searching for information. However, when new commands are introduced, there may be a learning curve, and we need to learn them quickly to handle incidents effectively.
Of course, identifying relevant data and accessing it can be challenging. So, we want to automate and reduce the amount of time that we spend doing repetitive tasks. One cool thing that I'm testing and actually have been using quite often is summarizing incident data using AI. It can serve as a template for documenting longer incidents efficiently.
Can you explain the workflow for integrating a new feature or app into the system? Do you work directly with the engineers during this process or explore through data?
We Site Reliability Engineers work primarily through Slack and other channels integrated with GitHub to track releases and project updates. The focus is on gathering pertinent data from these channels. When something interesting comes up, we reach out to developers or do our own research to get more information.
Rather than diving directly into raw logs, metrics, or traces, we start at a higher level. For significant releases, like updates to Kafka, proactively monitoring logs and performance helps us understand potential impacts and prepare for incidents.
From my perspective, understanding different technologies like Kafka and their roles in telemetry pipelines is essential for SREs to maintain system reliability and keep the team informed.
Tell us about going through your process when an incident happens. What does that look like?
When an alert appears in PagerDuty, we first ensure that Kafka's observability is correctly set up by integrating with Sysdig, configuring app settings, writing and testing PromQL queries for metrics, and using Terraform to manage alerts. We collect metrics from Kubernetes, prevent false alerts, and monitor everything once integrated. This complex process is necessary to streamline alerts and reduce toil.
So, that is the toil I was talking about. But the goal is to get everything working and streamlined. Our SRE team uses Sysdig to aggregate metric events based on applied logic, allowing for a precise definition of what gets aggregated.
So, you want to aggregate the data about incidents, but how does that help you?
Consider a scenario where a log analysis tool sees a sudden surge in log data from a specific source, and that surge overloads the system. However, some log entries are null or repetitive. They can be filtered or reduced to prevent system overload. Sometimes they can even be aggregated into a single line saying that this string was repeated a certain number of times.
That is the case of a single event consuming huge resources. There are limitations in dedicated hardware because we don't want performance affected. In more scalable environments, we have to be more careful about data, storage space, and cost. So, aggregating data about incidents saves resources.
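As a rough illustration of the filtering and aggregation Jon describes, here is a minimal Python sketch. It is not Mezmo's implementation: the function name `reduce_log_lines`, the choice to treat blank lines and the literal string "null" as empty entries, and the wording of the summary line are all assumptions made for this example.

```python
from itertools import groupby


def reduce_log_lines(lines):
    """Drop empty entries, then collapse runs of identical lines into one summary line.

    Hypothetical helper for illustration only; the "null" convention is an assumption.
    """
    # Filter first: skip None, blank strings, and the literal text "null".
    kept = []
    for line in lines:
        if line is None:
            continue
        text = line.strip()
        if not text or text.lower() == "null":
            continue
        kept.append(text)

    # Then aggregate: consecutive identical lines (after filtering) become one line.
    reduced = []
    for text, run in groupby(kept):
        count = sum(1 for _ in run)
        if count == 1:
            reduced.append(text)
        else:
            reduced.append(f"{text} (repeated {count} times)")
    return reduced


if __name__ == "__main__":
    sample = [
        "connection timeout",
        "connection timeout",
        "connection timeout",
        "",
        "null",
        "user login ok",
    ]
    for entry in reduce_log_lines(sample):
        print(entry)
```

Because filtering happens before grouping, identical lines separated only by empty entries are counted as one run. A real pipeline would likely also bound the aggregation by a time window, which this sketch leaves out.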
Changing gears here, what would your life be without a tool like Mezmo?
Before Mezmo, if we had multiple environments, moving logs between the repositories was labor-intensive, with manual toil to configure the changes. Mezmo simplifies log management with telemetry pipelines. We can easily take multiple logs from different sources and send them to our log analysis tools, where we can do things like parse lines, filter, and search in an easy user interface. And we can do that for any combination. So, Mezmo streamlines log movement between environments with drag-and-drop simplicity. Without a tool like Mezmo, there would be a lot of toil and resource spending.
Do you wait for an incident to happen, or do you try to identify things before they happen?
We are always trying to proactively identify things. As SREs, we rely on observability tools to analyze data early on. We Site Reliability Engineers, in general, are skeptical of the data presented to us. So, it would be nice to have tools, like our pipeline tools, that provide actionable insights, such as query improvements and performance enhancements, and that suggest or implement optimizations to help SREs stay ahead of potential problems. This approach reduces toil and helps manage the complexities of system operations more effectively.
During our discussion, Jon offered a unique perspective on how automation and AI are transforming SRE tasks, reducing toil and helping with optimization.
As we approached the end of the chat, Jon also discussed the challenges of understanding the metrics needed for new applications, and distinguishing between known and unknown metrics. He suggested that it would be helpful to have tools that identify and suggest essential metrics based on the system's usage, to optimize data collection and reduce costs.
It was a rewarding conversation. Thank you, Jon, for giving us a peek into your day and helping us understand the importance of telemetry pipelines and a proactive approach to prevent potential issues.