Reducing Telemetry Toil with Rapid Pipelining

4 MIN READ

Intellyx BrainBlog by Jason English for Mezmo

“Bubble bubble, toil and trouble” describes the mysterious process of mixing together log data and metrics from multiple sources as they enter an observability data pipeline.

Customers demand high-performance, functionality-rich digital experiences with near-instantaneous response times. This drives enterprise development teams to build services that integrate with external APIs and to modernize their applications, using ephemeral containers and clusters atop highly distributed cloud architectures and data lakes.

To make this brew of disparate elements work together, we are constantly adding new sources of data, each emitting a steady stream of logs and metrics that could tell us something about the brew's consistency.

All of this emitted data that could tell us about the condition of a system is what we call telemetry. Telemetry data helps engineers zero in on whatever could impact the availability and performance of an application. Unfortunately, there is so much telemetry data coming in that we aren't sure how to deal with it, much less figure out what useful information is inside it.

Telemetry data at the boiling point

As log volumes continue to grow, dealing with the data boil-over is both expensive and troublesome, requiring too much low-value work, or toil, from SREs and developers. 

The toil of dealing with excessive log data isn't just a minor nuisance; it's an endemic problem across enterprise architectures. Developers and operations engineers can spend 20% to 40% of their time sorting through massive log data volumes for relevance, or writing brittle automation scripts to try to normalize log data for consumption within observability and security analysis tools.

Much like crude oil extracted from a ‘tar sands’ field as it enters a pipeline, there’s a lot of completely non-essential, or ‘crude’, data polluting the telemetry data pipeline that offers little insight. How can we reduce the burden of handling so much crude data before it overwhelms the team?

Why data retention policies don’t cut it

Traditionally, we used data retention policies to address this flood of data at its destination, when log files arrived at the data store or cloud data warehouse used by an observability or SIEM platform. These approaches became common a decade or so ago, seeking to reduce cloud and on-prem storage costs as well as data processing effort.

By manipulating the retention settings in a data management tool or a time-series database, engineers could set acceptable intervals for downsampling and eliminating logs.


To illustrate this practice: instead of capturing and storing a million logs a day from each service, what if you could set an automated policy to retain one log per second, or 86,400 logs a day? That would reduce data volume by a factor of roughly 11.5X, which sounds impressive. Then, a month or a year in the future, downsample further to daily or monthly summaries. So cheap!
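
To make the arithmetic concrete, here is a minimal sketch of such a one-log-per-second policy. It is purely illustrative, assuming logs arrive as dictionaries carrying a Unix timestamp; it is not any particular tool's retention engine.

```python
# A minimal, illustrative sketch of a "one log per second" retention policy,
# not any vendor's retention engine. Assumes each log is a dict carrying a
# Unix "timestamp" in seconds.
from typing import Iterable, Iterator

def downsample_per_second(logs: Iterable[dict]) -> Iterator[dict]:
    """Keep only the first log seen in each one-second bucket; drop the rest."""
    seen_seconds = set()
    for log in logs:
        bucket = int(log["timestamp"])  # truncate to the whole second
        if bucket not in seen_seconds:
            seen_seconds.add(bucket)
            yield log

# A day of one million evenly spread logs collapses to at most 86,400 kept logs,
# roughly the 11.5X reduction described above.
```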

After all, if you looked at a customer survey or political poll, an evenly sampled set of several thousand answers drawn from a population of millions should typically yield statistically accurate results.

Unfortunately, that’s not how software telemetry works. Indicators of oncoming failure conditions are quite momentary. Anomalous activity can appear in one log, and disappear in the next. Blink, and you might miss the issue, until a much more serious performance lag or security incident is experienced by users.

Instead of tossing out logs with retention policies, you could choose to tag or sample the data stream at the end of the pipeline, where it enters the observability platform, thereby reducing the number of logs engineers need to work with, perhaps at those one-second intervals, or by sampling on some other property, such as unique source IDs, urgency, or geography.
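
As a rough illustration of sampling on a property rather than on time alone, the sketch below keeps one in N logs per distinct source. The source_id field and the keep_every ratio are hypothetical choices, not settings from any specific platform.

```python
# A rough sketch of sampling at the platform's point of ingress by a chosen
# property instead of a fixed time interval. The "source_id" field and the
# keep_every ratio are hypothetical examples.
from typing import Dict, Iterable, Iterator

def sample_by_property(logs: Iterable[dict], key: str, keep_every: int) -> Iterator[dict]:
    """Keep every Nth log per distinct value of `key` (e.g. per source ID)."""
    counters: Dict[str, int] = {}
    for log in logs:
        value = str(log.get(key, "unknown"))
        counters[value] = counters.get(value, 0) + 1
        if (counters[value] - 1) % keep_every == 0:
            yield log

# e.g. sample_by_property(raw_logs, key="source_id", keep_every=100)
```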

While that might provide incremental improvements, we're still likely to miss some issues, and if we're paying the ingress costs for the full flood of log data in our observability platform, simply downsampling at the destination doesn't address the whole cost problem.

Early processing at the ‘first mile’ of telemetry data

Rather than manipulating crude data at its destination, what if we could instead look for patterns within the flow of all telemetry data as it enters the ‘first mile’ of the data pipeline, allowing comparisons to be made and anomalies to be detected without dropping any logs before they can be considered?

If writing automated queries and complex sorts and joins against data at rest already seems difficult, imagine trying to find relevance within the open flood of incoming telemetry data at its sources. But that is precisely what we need to do.

Mezmo's next-gen Log Management takes a pipeline-first approach to telemetry data, allowing developers and operators to quickly build telemetry pipelines that use in-stream processors to refine crude logs, metrics, and trace data in motion.

The sources, processors, and outputs can be assembled in an intuitive dashboard that allows telemetry data sources to be routed with drag-and-drop ease through processors, including steps such as the following (a minimal sketch of chaining these processors appears after the list):

  • Dedupe: Most logs of a normally functioning system aren’t interesting, so removing redundant logs such as status pings and duplicate events from the stream will reduce data volume.
  • Sample: Summarizing sets of logs into single events, or a series of events into a single trend metric, is more valuable on the front end of the pipeline, as it cuts downstream data costs and processing overhead and reduces the cognitive load of analytics work.
  • Filter Out: Conditioning incoming data by source or attribute type can be extremely useful for obfuscating private user data or transaction information, or removing logs and events that are irrelevant to the consuming engineering team’s purview.
  • Throttle: Rate-limiting the data pipeline addresses cost or performance concerns.
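
Conceptually, these processors behave like composable stages applied to a stream of events. The sketch below illustrates that idea in plain Python, with hypothetical generator functions standing in for the dedupe, filter, and throttle steps; it is not the Mezmo Flow API.

```python
# A rough illustration of in-stream processors as composable generator stages.
# The function names mirror the steps above but are hypothetical stand-ins,
# not Mezmo Flow processors.
import time
from typing import Iterable, Iterator

def dedupe(logs: Iterable[dict]) -> Iterator[dict]:
    """Drop events whose message has already been seen (e.g. repeated status pings)."""
    seen = set()
    for log in logs:
        if log["message"] not in seen:
            seen.add(log["message"])
            yield log

def filter_out(logs: Iterable[dict], drop_field: str) -> Iterator[dict]:
    """Strip a field (e.g. a PII attribute) before the event leaves the pipeline."""
    for log in logs:
        yield {k: v for k, v in log.items() if k != drop_field}

def throttle(logs: Iterable[dict], max_per_second: int) -> Iterator[dict]:
    """Rate-limit the stream, dropping events beyond the per-second budget."""
    window, count = int(time.time()), 0
    for log in logs:
        now = int(time.time())
        if now != window:
            window, count = now, 0
        if count < max_per_second:
            count += 1
            yield log

# Stages compose like pipeline segments:
# refined = throttle(filter_out(dedupe(raw_logs), "user_email"), max_per_second=500)
```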

Figure 1. Part of next-gen log management, the Mezmo Flow interface shows telemetry source types and processing actions, and their impact on reducing the size or volume of data passing through the telemetry pipeline.

Combining in-stream processors for new effects

There are more ways to apply these and other in-stream processors in Mezmo Flow, including combining sets of them, with different settings and in different orders, as telemetry pipeline templates for particular source types, application types, and the teams analyzing the data.

The real power behind telemetry pipelines lies in quickly being able to configure them to route any number of log sources, through unique processor sets, to any number of unique destinations, based on the intended use case.

Say I’m an SRE at an eCommerce vendor supporting the European region. I would want a telemetry pipeline covering all clusters within the Kubernetes namespaces running in AWS and Azure regions located in the EU, one that shows me performance trendlines, filters out PII data to comply with GDPR regulations, and sends the results to my Snowflake instance for analysis in New Relic or AppDynamics.

My development peer in another group might want another pipeline that samples user session and network logs into a rate-limited set of possible alert events for viewing in a SIEM like Elastic or Splunk, while allowing the rest of the log data to pass through to a low-cost S3 bucket or a data lakehouse like Starburst or Axiom for later historical exploration if needed.
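
In rough terms, both scenarios amount to routing source streams through different processor chains to different destinations. The sketch below captures that shape; the route fields and names are illustrative placeholders, not Mezmo configuration.

```python
# A loose sketch of use-case routing: each route binds source streams to a
# processor chain and a destination. All names here (eu_k8s_logs, Snowflake,
# SIEM, S3 archive) are illustrative placeholders, not Mezmo configuration.
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List

Processor = Callable[[Iterable[dict]], Iterator[dict]]

@dataclass
class Route:
    sources: List[str]                      # e.g. ["eu_k8s_logs", "eu_network_logs"]
    processors: List[Processor] = field(default_factory=list)
    destination: str = "s3://archive"       # e.g. Snowflake, a SIEM index, an S3 bucket

    def run(self, stream: Iterable[dict]) -> Iterator[dict]:
        # Apply each processor in order; shipping to self.destination is out of scope here.
        for processor in self.processors:
            stream = processor(stream)
        return iter(stream)

# One Route per use case: GDPR-scrubbed EU performance data to Snowflake for New Relic
# or AppDynamics, sampled alert candidates to the SIEM, and the remainder to cheap storage.
```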

“For telemetry data, there are three phases for refining the raw data into useful information: understand, optimize, and respond. Each of these is reflected in the functional aspects of a telemetry pipeline.”

The Intellyx Take

As a homebrewer, I find it easy to think of these telemetry processors as interchangeable steps of adding new ingredients, such as grains or hops, at different times and temperatures to achieve the desired flavor later in the brewing process, before the brew goes down the pipe to sit in a fermenter for a week or two.

In the telemetry world, we are dealing with exponentially greater supplier complexity and far more source materials in this brew, so we can’t wait until the data reaches its final destinations to find out what went down the pipe.

Telemetry pipelines like Mezmo’s can rapidly improve the signal-to-noise ratio of telemetry data, so you can understand, optimize, and respond to the flood of events coming in from complex application architectures.

©2025 Intellyx B.V. At the time of writing, Mezmo is an Intellyx customer, and Elastic, New Relic, and Splunk are former Intellyx customers. No AI was used to source or write this content. Image sources: Screenshot from Mezmo Flow, feature image from Adobe Express.
