Mezmo Launches Fast & Precise AI SRE for Kubernetes Ahead of KubeCon
The observability market stands at an inflection point as AI-powered site reliability engineering moves from theoretical promise to practical reality. Yet separating genuine capability from vendor hype remains challenging, particularly as organizations grapple with spiraling telemetry costs and question whether today's AI models can truly deliver on their transformative potential. Tucker Callaway, CEO of Mezmo, believes the answer lies not in waiting for better models, but in fundamentally rethinking how telemetry data is processed and delivered to AI agents.
In this wide-ranging VMblog Q&A, Callaway makes the case that agentic root cause analysis is achievable today—provided organizations shift their approach from traditional prompt engineering to what Mezmo calls "context engineering." The company's recently rebranded Active Telemetry platform claims to deliver root cause analysis outcomes a standard deviation faster than industry benchmarks, reducing typical troubleshooting time from 50 minutes to just 5 minutes. Callaway also shares his provocative vision for observability's future, where dashboards transition from operational necessities to mere trust verification tools, and where the economics of the entire category face disruption as AI agents replace human analysis.
++
VMblog: There's a lot of energy around AI SRE right now. How can folks separate what's reality from hype?
Tucker Callaway: I think "energy" is the right word - there is excitement about the potential, skepticism about the reality, concern about cost, and debate over approaches and over the expectation of getting deterministic outcomes from probabilistic systems. Some feel today's models aren't enough and need further training; others (myself included) feel we have proven they are more than sufficient to deliver outcomes today that are both performant and accurate.
In terms of separating the hype from the reality, it's further complicated by the diversity of tasks and expectations within the SRE role, before we even get into which components of that role can be delivered through AI.
So I will answer that question through the specific lens of Root Cause Analysis, which is effectively why Observability exists, and say that we have proven this can be delivered affordably and repeatably today.
The reason we focus on RCA is that it is the critical gate to a fully agentic future - which I believe is closer than many think. If you cannot confidently identify and diagnose anomalous behavior, then none of the downstream potential can ever be realized.
So in my view, Agentic RCA is a reality today, but when we conflate the diversity of the tasks human SREs perform and don't break down the critical workflows and tasks, we quickly leave reality and drift into hype.
VMblog: Mezmo has recently rebranded with a heavy emphasis on Active Telemetry. Can you explain what Active Telemetry is and what value it delivers?
Callaway: At the risk of overgeneralizing, Observability today is largely driven by a single purpose supported by a single approach.
The purpose is to make complex data consumable by humans with the intent of identifying and diagnosing issues, and the approach is to store all of the data and ask questions of it later.
We recognized 3 years ago, even before AI amplified both the problem and the opportunity, that the physics behind the growth of data (and the corresponding cost) and the efficiency with which value is derived from that data were fundamentally broken. Our hypothesis was, and still is, that the processing of data has to shift left, closer to the point of creation, and that we needed to create a platform that could handle the dynamic analysis of that data in motion. The linearity of collection, ingestion, storage and analysis has brought us to a point where the cost of Observability is rapidly approaching the cost of delivering what we are observing in the first place.
Active Telemetry is our answer to that problem. We have incorporated the processing, retention and analysis of telemetry data into a platform that gives developers instant access to the data they need, gives agents curated context to deliver performant, accurate results, and gives platform teams the governance, control and data orchestration they require.
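To make the shift-left idea concrete, here is a minimal sketch of processing telemetry in motion - filtering and aggregating as data streams, before it hits storage, rather than storing everything raw and querying later. All class and event names here are illustrative assumptions for the pattern Callaway describes, not Mezmo's actual APIs.

```python
# Hypothetical sketch of "shift-left" telemetry processing: events are
# filtered and aggregated in motion, so only high-value data is retained.
# Names are illustrative, not Mezmo interfaces.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class LogEvent:
    service: str
    level: str
    message: str

class InMotionProcessor:
    """Processes telemetry as it streams, keeping only high-signal data."""

    def __init__(self):
        # Rolling aggregates replace storing every repetitive raw line.
        self.error_counts = defaultdict(int)

    def process(self, event: LogEvent):
        # Drop low-signal data at the source instead of paying to store it.
        if event.level == "DEBUG":
            return None
        # Fold repetitive errors into counters for cheap downstream analysis.
        if event.level == "ERROR":
            self.error_counts[(event.service, event.message)] += 1
        # Forward only the events worth retaining.
        return event

pipeline = InMotionProcessor()
for raw in [LogEvent("checkout", "DEBUG", "cache hit"),
            LogEvent("checkout", "ERROR", "payment timeout")]:
    kept = pipeline.process(raw)
    if kept:
        print("retain:", kept)
print("aggregates:", dict(pipeline.error_counts))
```

The point of the pattern is economic: the DEBUG line never incurs ingestion or storage cost, and the ERROR aggregate is already shaped for analysis when it arrives.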
VMblog: In a recent press release, Mezmo states that the new AI SRE is a standard deviation faster than the industry standard when it comes to resolving issues in Kubernetes. How is that possible and what data do you have to back that up?
Callaway: The short answer is Active Telemetry, and yes. :) That obviously requires more explanation, so I will first cover the approach and then we can discuss the results.
As I alluded to earlier, a benefit of Active Telemetry is the ability to direct real-time, curated context to an agent. This is another way of saying it enables Context Engineering of telemetry data. Context engineering is the future of agentic outcomes and performance - Anthropic recently published a great post, "Effective context engineering for AI agents," on the theory behind the concept that allows us to deliver these outcomes. It's definitely worth a read for everyone thinking about driving more cost-effective and deterministic outcomes with AI.
The general premise is that the models today are sufficient to deliver cost-effective RCA 90% faster than a human-based approach. The reason this is not pervasive today lies in the inefficiencies of prompt-based approaches and their inability to provide the models with the context they need.
ClickHouse recently published a benchmark using a prompt engineering approach to identify root causes on the OTel demo application - we repeated the exercise with a context engineering approach powered by Active Telemetry. The difference in the results was striking: we saw 90% fewer tokens consumed and positive identification on the first try - no prompt iterations needed, just the right context.
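A minimal sketch of the distinction, assuming a generic RCA-agent shape: rather than prompting a model repeatedly over raw telemetry, a curated, structured context is assembled up front for a single call. Every name below is a hypothetical illustration, not Mezmo's or ClickHouse's actual interface.

```python
# Hypothetical sketch of context engineering for an RCA agent: build one
# compact, high-signal payload instead of many prompt-and-retry rounds
# over raw data. All names are illustrative assumptions.
import json

def curate_context(anomaly, topology, logs, max_log_lines=20):
    """Assembles a small, structured context for a single model call."""
    upstream = topology.get(anomaly["service"], [])
    # Keep only high-signal lines from the anomalous service and its
    # dependencies - the places a root cause could plausibly hide.
    relevant = [line for line in logs
                if line["service"] in {anomaly["service"], *upstream}
                and line["level"] in ("ERROR", "WARN")][:max_log_lines]
    return json.dumps({
        "anomaly": anomaly,        # what looks wrong
        "dependencies": upstream,  # where a cause could hide
        "evidence": relevant,      # only the lines worth the tokens
    }, indent=2)

topology = {"checkout": ["payments", "inventory"]}
logs = [
    {"service": "payments", "level": "ERROR", "msg": "connection pool exhausted"},
    {"service": "checkout", "level": "INFO", "msg": "request served"},
]
anomaly = {"service": "checkout", "metric": "p99_latency", "change": "+400%"}

# One model call over curated context, versus iterative prompting over
# raw telemetry - fewer tokens and fewer attempts.
print(curate_context(anomaly, topology, logs))
```

The token savings in the benchmark follow directly from this shape: the model sees a few kilobytes of relevant evidence instead of being walked through the raw dataset one prompt at a time.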
Beyond the benchmarks, our customers are experiencing the same outcomes. Long-running issues that had gone undiagnosed for years are resolved in minutes. Typical troubleshooting time to identification is reduced from 50 minutes to 5, and the biggest surprise is always that the less we prompt, the better the outcome.
It's incredibly exciting and we are just getting started - we have some significant enhancements coming before the end of the year.
VMblog: How do you think the rise of agents will impact observability? Should we expect humans to still be looking at dashboards in 2026?
Callaway: Going back to my earlier statement that the foundation of Observability today is making complex data consumable by humans - the impact and opportunity can't be overstated. Driven by proper context, agentic RCA is possible today. That, combined with the ability to better manage and orchestrate data retention behind the scenes, will turn RCA and Observability into an AI-driven outcome, and SREs can get back to what they really love: designing and architecting systems.
When the analysis is performed by agents, there is no need for charts and graphs; the analysis is commoditized by the models, and the curation and management of data becomes the driver of success. I don't believe the players in the space today have the ability to respond at the speed this shift will happen. So yeah - I think there is going to be an impact: a huge benefit for consumers and a massive shift for providers.
Now ... you put a timeframe on the end of dashboards. I wouldn't take 2026 in the dashboard death pool. It will start to happen, but as always, the typical enterprise will need time. I do think 2026 will bring a shift from the dashboard as an operational tool to the dashboard as a source of trust and confidence. There is also an element of risk management, compliance and audit underlying a lot of my hypothesis: trust and auditability will become more embedded capabilities as we remove humans from the loop - but that's a conversation for another time.