Reducing Telemetry Toil with Rapid Pipelining

4 MIN READ

Intellyx BrainBlog by Jason English for Mezmo

“Bubble bubble, toil and trouble” describes the mysterious process of mixing together log data and metrics from multiple sources as they enter an observability data pipeline.

Customers demand high-performance, functionality-rich digital experiences with near-instantaneous response times. This drives enterprise development teams to build services that integrate with external APIs and to modernize their applications, using ephemeral containers and clusters atop highly distributed cloud architectures and data lakes.

To make this brew of disparate elements work together, we are constantly adding new sources of data, each emitting a steady stream of logs and metrics that could tell us something about the brew's consistency.

All of this emitted data that could tell us about the condition of a system is what we call telemetry. Telemetry data helps engineers zero in on whatever could impact the availability and performance of an application. Unfortunately, there is so much telemetry data coming in that we aren't sure how to deal with it, much less figure out what useful information is inside it.

Telemetry data at the boiling point

As log volumes continue to grow, dealing with the data boil-over is both expensive and troublesome, requiring too much low-value work, or toil, from SREs and developers. 

The toil of dealing with excessive log data isn't just a minor nuisance; it's an endemic problem across enterprise architectures. Developers and operations engineers can spend 20% to 40% of their time sorting through massive log data volumes for relevance, or writing brittle automation scripts to try to normalize log data for consumption within observability and security analysis tools.

Much like crude oil extracted from a ‘tar sands’ field as it enters a pipeline, there’s a lot of completely non-essential, or ‘crude’, data polluting the telemetry data pipeline that offers little insight. How can we reduce the burden of handling so much crude data before it overwhelms the team?

Why data retention policies don’t cut it

Traditionally, we used data retention policies to address this flood of data at its destination, when log files arrived at the data store or cloud data warehouse used by an observability or SIEM platform. These approaches became common a decade or so ago, seeking to reduce cloud and on-prem storage costs as well as data processing effort.

By manipulating the retention settings in a data management tool or a time-series database, engineers could set acceptable intervals for downsampling and eliminating logs.


To illustrate this practice: instead of capturing and storing a million logs a day from each service, what if you could set an automated policy to retain one log per second, or 86,400 logs a day? That would reduce data volume by a factor of roughly 11.5X, which sounds impressive. Then, a month or a year in the future, downsample further to daily or monthly summaries. So cheap!
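
To make the arithmetic concrete, here is a minimal sketch of such a one-log-per-second policy. It is purely illustrative, assuming logs arrive as dictionaries carrying a Unix timestamp; it is not any particular tool's retention engine.

```python
# A minimal, illustrative sketch of a "one log per second" retention policy,
# not any vendor's retention engine. Assumes each log is a dict carrying a
# Unix "timestamp" in seconds.
from typing import Iterable, Iterator

def downsample_per_second(logs: Iterable[dict]) -> Iterator[dict]:
    """Keep only the first log seen in each one-second bucket; drop the rest."""
    seen_seconds = set()
    for log in logs:
        bucket = int(log["timestamp"])  # truncate to the whole second
        if bucket not in seen_seconds:
            seen_seconds.add(bucket)
            yield log

# A day of one million evenly spread logs collapses to at most 86,400 kept logs,
# roughly the 11.5X reduction described above.
```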

After all, if you looked at a customer survey or political poll, an evenly sampled set of several thousand answers drawn from a population of millions should typically yield statistically accurate results.

Unfortunately, that’s not how software telemetry works. Indicators of oncoming failure conditions are quite momentary. Anomalous activity can appear in one log, and disappear in the next. Blink, and you might miss the issue, until a much more serious performance lag or security incident is experienced by users.

Instead of tossing out logs with retention policies, you could choose to tag or sample the data stream at the end of the pipeline, where it enters the observability platform, thereby reducing the number of logs engineers need to work with, perhaps at those one-second intervals, or by sampling on some other property, such as unique source IDs, urgency, or geography.
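
As a rough illustration of sampling on a property rather than on time alone, the sketch below keeps one in N logs per distinct source. The source_id field and the keep_every ratio are hypothetical choices, not settings from any specific platform.

```python
# A rough sketch of sampling at the platform's point of ingress by a chosen
# property instead of a fixed time interval. The "source_id" field and the
# keep_every ratio are hypothetical examples.
from typing import Dict, Iterable, Iterator

def sample_by_property(logs: Iterable[dict], key: str, keep_every: int) -> Iterator[dict]:
    """Keep every Nth log per distinct value of `key` (e.g. per source ID)."""
    counters: Dict[str, int] = {}
    for log in logs:
        value = str(log.get(key, "unknown"))
        counters[value] = counters.get(value, 0) + 1
        if (counters[value] - 1) % keep_every == 0:
            yield log

# e.g. sample_by_property(raw_logs, key="source_id", keep_every=100)
```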

While that might provide incremental improvements, we're still likely to miss some issues, and if we're paying the ingress costs for the full flood of log data in our observability platform, simply downsampling at the destination doesn't address the whole cost problem.

Early processing at the ‘first mile’ of telemetry data

Rather than manipulating crude data at its destination, what if we could instead look for patterns within the flow of all telemetry data as it enters the ‘first mile’ of the data pipeline, allowing comparisons to be made and anomalies to be detected without dropping any logs before they can be considered?

If writing automated queries and complex sorts and joins against data at rest already seems difficult, imagine trying to find relevance within the open flood of incoming telemetry data at its sources. But that is precisely what we need to do.

Mezmo's next-gen Log Management takes a pipeline-first approach to telemetry data, allowing developers and operators to quickly build telemetry pipelines that use in-stream processors to refine crude logs, metrics, and trace data in motion.

The sources, processors, and outputs can be assembled in an intuitive dashboard that allows telemetry data sources to be routed with drag-and-drop ease through processors, including steps such as the following (a minimal sketch of chaining these processors appears after the list):

  • Dedupe: Most logs of a normally functioning system aren’t interesting, so removing redundant logs such as status pings and duplicate events from the stream will reduce data volume.
  • Sample: Summarizing sets of logs into single events, or a series of events into a single trend metric, is more valuable on the front end of the pipeline, as it cuts downstream data costs and processing overhead and reduces the cognitive load of analytics work.
  • Filter Out: Conditioning incoming data by source or attribute type can be extremely useful for obfuscating private user data or transaction information, or removing logs and events that are irrelevant to the consuming engineering team’s purview.
  • Throttle: Rate-limiting the data pipeline addresses cost or performance concerns.
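
Conceptually, these processors behave like composable stages applied to a stream of events. The sketch below illustrates that idea in plain Python, with hypothetical generator functions standing in for the dedupe, filter, and throttle steps; it is not the Mezmo Flow API.

```python
# A rough illustration of in-stream processors as composable generator stages.
# The function names mirror the steps above but are hypothetical stand-ins,
# not Mezmo Flow processors.
import time
from typing import Iterable, Iterator

def dedupe(logs: Iterable[dict]) -> Iterator[dict]:
    """Drop events whose message has already been seen (e.g. repeated status pings)."""
    seen = set()
    for log in logs:
        if log["message"] not in seen:
            seen.add(log["message"])
            yield log

def filter_out(logs: Iterable[dict], drop_field: str) -> Iterator[dict]:
    """Strip a field (e.g. a PII attribute) before the event leaves the pipeline."""
    for log in logs:
        yield {k: v for k, v in log.items() if k != drop_field}

def throttle(logs: Iterable[dict], max_per_second: int) -> Iterator[dict]:
    """Rate-limit the stream, dropping events beyond the per-second budget."""
    window, count = int(time.time()), 0
    for log in logs:
        now = int(time.time())
        if now != window:
            window, count = now, 0
        if count < max_per_second:
            count += 1
            yield log

# Stages compose like pipeline segments:
# refined = throttle(filter_out(dedupe(raw_logs), "user_email"), max_per_second=500)
```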

Figure 1. Part of next-gen log management, the Mezmo Flow interface shows telemetry source types and processing actions, and their impact on reducing the size or volume of data passing through the telemetry pipeline.

Combining in-stream processors for new effects

There are more ways to apply these and other in-stream processors in Mezmo Flow, including combining sets of them, with different settings and in different orders, as telemetry pipeline templates for particular source types, application types, and the teams analyzing the data.

The real power behind telemetry pipelines lies in quickly being able to configure them to route any number of log sources, through unique processor sets, to any number of unique destinations, based on the intended use case.

Say I’m an SRE at an eCommerce vendor supporting the European region. I would want a telemetry pipeline covering all clusters within the Kubernetes namespaces running in AWS and Azure regions located in the EU, one that shows me performance trendlines, filters out PII data to comply with GDPR regulations, and sends the results to my Snowflake instance for analysis in New Relic or AppDynamics.

My development peer in another group might want another pipeline that samples user session and network logs into a rate-limited set of possible alert events for viewing in a SIEM like Elastic or Splunk, while allowing the rest of the log data to pass through to a low-cost S3 bucket or a data lakehouse like Starburst or Axiom for later historical exploration if needed.
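
In rough terms, both scenarios amount to routing source streams through different processor chains to different destinations. The sketch below captures that shape; the route fields and names are illustrative placeholders, not Mezmo configuration.

```python
# A loose sketch of use-case routing: each route binds source streams to a
# processor chain and a destination. All names here (eu_k8s_logs, Snowflake,
# SIEM, S3 archive) are illustrative placeholders, not Mezmo configuration.
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List

Processor = Callable[[Iterable[dict]], Iterator[dict]]

@dataclass
class Route:
    sources: List[str]                      # e.g. ["eu_k8s_logs", "eu_network_logs"]
    processors: List[Processor] = field(default_factory=list)
    destination: str = "s3://archive"       # e.g. Snowflake, a SIEM index, an S3 bucket

    def run(self, stream: Iterable[dict]) -> Iterator[dict]:
        # Apply each processor in order; shipping to self.destination is out of scope here.
        for processor in self.processors:
            stream = processor(stream)
        return iter(stream)

# One Route per use case: GDPR-scrubbed EU performance data to Snowflake for New Relic
# or AppDynamics, sampled alert candidates to the SIEM, and the remainder to cheap storage.
```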

“For telemetry data, there are three phases for refining the raw data into useful information: understand, optimize, and respond. Each of these is reflected in the functional aspects of a telemetry pipeline.”

The Intellyx Take

As a homebrewer, I find it easy to think of these telemetry processors as interchangeable steps of adding new ingredients, such as grains or hops, at different times and temperatures to achieve the desired flavor later in the brewing process, before the brew goes down the pipe to sit in a fermenter for a week or two.

In the telemetry world, we are dealing with exponentially greater supplier complexity and far more source materials in this brew, so we can’t wait until the data reaches its final destinations to find out what went down the pipe.

Telemetry pipelines like Mezmo’s can rapidly improve the signal-to-noise ratio of telemetry data, so you can understand, optimize, and respond to the flood of events coming in from complex application architectures.

©2025 Intellyx B.V. At the time of writing, Mezmo is an Intellyx customer, and Elastic, New Relic, and Splunk are former Intellyx customers. No AI was used to source or write this content. Image sources: Screenshot from Mezmo Flow, feature image from Adobe Express.
