Managing Telemetry Data Overflow in Kubernetes with Resource Quotas and Limits


One of the inherent challenges you'll face when working with Kubernetes is that a typical cluster includes many components that produce telemetry data. Because producing and moving telemetry data consumes resources, different workloads can end up competing for the capacity needed to manage that data. And if some workloads lack the resources necessary to process telemetry data quickly enough, the result can be blind spots in your Kubernetes monitoring and observability strategy.

Fortunately, Kubernetes offers some built-in features that can help alleviate these challenges: resource quotas and limits. Keep reading for a look at how resource quotas and limits can improve your clusters' ability to handle telemetry data, as well as how to get started leveraging these features for this purpose.

Challenges With Managing Telemetry Data in Kubernetes

Before diving into how resource quotas and limits help to streamline telemetry in Kubernetes, let's talk a bit more about why telemetry can be a challenge in Kubernetes.

Telemetry is the generation, processing, and management of data that teams can use to monitor and observe IT environments. Typically, telemetry data is generated by an IT resource in the form of logs, traces, or metrics. Then, the resource transmits it – often with the help of a telemetry pipeline – to a location where it can be analyzed and/or stored over the long term.

Like any IT process, processing telemetry data consumes CPU and memory resources – and the more telemetry data you are working with, the higher the CPU and memory load tends to be. Telemetry processing may also consume network bandwidth when transmitting data from one server to another.

In Kubernetes, it's common to have many components that produce their own telemetry streams. Various elements of the Kubernetes control plane – such as the API server, the etcd key-value store, and the scheduler – produce telemetry data, as do the worker nodes themselves and most applications hosted on Kubernetes.

Now, this is fine so long as each component that needs to generate and transmit telemetry data has sufficient CPU and memory resources to do so. But because the total CPU and memory resources available to a Kubernetes cluster are limited based on the number of servers (or nodes, as they are called in Kubernetes) in the cluster, and because a single cluster may contain dozens or even hundreds of individual nodes and pods that generate telemetry data, the problem can arise that sufficient resources are not available for processing telemetry data – especially if heavier-than-usual load causes an increased volume of telemetry data.

This is one example of what's commonly referred to as the "noisy neighbor" problem. When one component inside a Kubernetes cluster begins a resource-intensive task (such as responding to an increase in requests), it can become a noisy neighbor, sucking up resources that other components require to perform important tasks.

If that happens – if one component in a Kubernetes cluster hogs resources during telemetry processing to the point that other components can't do their jobs – two types of problems may occur:

  1. Other components won't be able to generate and/or transmit telemetry data quickly enough to support real-time monitoring and observability. The data they send may be delayed until CPU and memory resources free up, preventing teams from detecting issues as quickly as possible.
  2. In extreme cases, other components may stop operating properly because they lack the CPU and memory necessary to do so. In other words, an application could begin generating errors or dropping requests because the resources it needs to operate normally are being tied up by another application's telemetry operations.

Examples Of Kubernetes Telemetry Constraints

To put this in a real-world context, imagine that you have a Kubernetes cluster that hosts five different applications. During periods of normal activity, the cluster operates with total CPU and memory consumption rates of 90 percent – meaning that 10 percent of its capacity is held in reserve. Ideally, clusters would have a larger resource buffer than this, but because increasing spare resource capacity requires adding servers to a cluster – and because adding servers costs money – it's not uncommon for the resource margins of a Kubernetes cluster to be tighter than they should be.

One day, due to a configuration oversight in how it is supposed to process telemetry data, one application begins generating ten times as many logs, traces, and metrics as usual. As a result, the application's CPU and memory consumption climbs sharply as it works through the extra telemetry, pushing the cluster's CPU and memory loads to 100 percent.

Because the cluster's resources are now maxed out, there is no spare CPU or memory available for other applications, and the misconfigured app will keep hogging resources until it either ends up in a CrashLoopBackOff state or is reconfigured to manage telemetry data properly. In the meantime, other applications may not be able to function properly because there aren't enough resources to accommodate them.

Kubernetes Won't Automatically Solve Telemetry Problems

You might think that Kubernetes would automatically distribute resources to components within a cluster based on their needs. But you'd be wrong. Kubernetes doesn't reserve resources for components automatically (unless admins explicitly configure resource requests and limits).

After all, Kubernetes doesn't know what the apps that you deploy need to do. It can't tell the difference between a mission-critical application and a dummy app that your developers are just testing.

Thus, when a cluster runs short of resources, Kubernetes doesn't intervene intelligently by default. It can do things like restarting crashed Pods to try to keep them running, but it won't step in and say "app Y is hogging too much CPU, so I'm going to redistribute that CPU to other apps" unless you configure Kubernetes features such as Pod priority and preemption (sketched below).
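For reference, here's a minimal sketch of what priority and preemption configuration looks like: a PriorityClass plus a Pod that references it. When the cluster is full, the scheduler can preempt (evict) lower-priority Pods to make room for Pods in this class. The names, priority value, and image are illustrative rather than taken from any particular environment:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-telemetry        # illustrative name
value: 1000000                    # higher value = higher scheduling priority
globalDefault: false
description: "For workloads whose telemetry must not be starved."
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-app              # illustrative name
spec:
  priorityClassName: critical-telemetry
  containers:
  - name: app
    image: example/app:1.0        # placeholder image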

How Resource Quotas and Limits Enable Efficient Telemetry

However, there is a way to tell Kubernetes how many resources a given app or set of apps should be able to consume. You can do this by setting up resource quotas and limits.

Here's what each of these things does:

  • A resource quota defines how many resources a namespace can use. In Kubernetes, a namespace is a virtual cluster that can host multiple apps. By setting a resource quota for a namespace, then, you effectively define how many total resources should be available collectively to the apps running in that namespace.
  • Resource requests and limits define the range of resources that a specific container (and, by extension, its Pod) can consume. A request tells Kubernetes the minimum CPU or memory to reserve for a container, while a limit caps the maximum it may use (see the sketch after this list).
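
To make the second bullet concrete, here is a minimal sketch of a Pod spec with requests and limits attached to a container. The names, namespace, image, and numbers are illustrative: the request is what Kubernetes reserves for the container when scheduling it, and the limit is the ceiling enforced at runtime.

apiVersion: v1
kind: Pod
metadata:
  name: telemetry-agent           # illustrative name
  namespace: production           # illustrative namespace
spec:
  containers:
  - name: log-shipper             # illustrative container name
    image: example/log-shipper:1.0  # placeholder image
    resources:
      requests:
        cpu: 250m                 # reserved for the container at scheduling time
        memory: 256Mi
      limits:
        cpu: 500m                 # CPU usage above this is throttled
        memory: 512Mi             # exceeding this gets the container OOM-killed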

Resource quotas and limits help prevent the "noisy neighbor" problem described above because they stop one application or set of applications from hogging resources and depriving other apps of what they need to operate normally.

For example, imagine that you have two namespaces in your cluster – one for production apps and one for apps that are in testing. Since the testing apps are not mission-critical, you could define a resource quota for their namespace that prevents those apps from consuming more than 20 percent of the total resources available to your cluster. Then, if a testing app were to experience a surge in telemetry operations, the resource quota imposed on the app's namespace would prevent the creation of additional pods that could increase resource utilization and risk destabilizing production apps running in the other namespace.
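
As a rough sketch, assume the cluster has 10 CPU cores and 40 GiB of memory in total; a quota capping the testing namespace at roughly 20 percent of that might look like the following (the names and numbers are illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: testing-quota             # illustrative name
  namespace: testing              # illustrative namespace
spec:
  hard:
    requests.cpu: "2"             # ~20% of an assumed 10-core cluster
    requests.memory: 8Gi          # ~20% of an assumed 40Gi of memory
    limits.cpu: "2"
    limits.memory: 8Gi
    pods: "20"                    # optional cap on the number of Pods in the namespace

Once this quota is in place, Kubernetes rejects new Pods in the testing namespace whose combined requests or limits would push the namespace past these totals.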

As another example, imagine that you run multiple production apps, but one is especially mission-critical. You could use resource requests and limits to reserve a larger share of resources for that app, so that Kubernetes protects it when cluster resources are maxed out. Doing so might come at the expense of other apps, but it ensures that the most important app has the resources necessary to operate stably. That's better than leaving it to chance which app or apps will be able to function normally during a resource shortage.
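
One way to express that protection, as a sketch, is to give the critical app's containers requests equal to their limits, which places its Pods in the Guaranteed QoS class; under node resource pressure, Kubernetes evicts Guaranteed Pods last. The name, namespace, image, and numbers here are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: critical-checkout         # illustrative name
  namespace: production           # illustrative namespace
spec:
  containers:
  - name: app
    image: example/checkout:1.0   # placeholder image
    resources:
      requests:                   # requests == limits puts the Pod in the Guaranteed QoS class
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "2"
        memory: 4Gi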

How to Set Up Resource Quotas and Limits

Configuring resource quotas and limits is simple enough. You include them in manifests when describing objects in Kubernetes.

For example, this YAML code (borrowed from the Kubernetes documentation) defines a ResourceQuota named mem-cpu-demo, which caps the aggregate CPU and memory requests and limits of whichever namespace you apply it to:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-cpu-demo
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
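
To put the quota into effect, apply the manifest to the namespace you want to constrain and then inspect it with kubectl. The namespace and file name below are placeholders for whatever you use in your own cluster:

kubectl create namespace quota-demo
kubectl apply -f mem-cpu-quota.yaml --namespace=quota-demo
kubectl describe resourcequota mem-cpu-demo --namespace=quota-demo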

Limitations of Resource Quotas and Limits for Telemetry

Overall, it's important to note that resource quotas and limits don't automatically eliminate the risks associated with telemetry data overload or other resource shortages in Kubernetes.

These features don't magically generate additional resources when your cluster runs short on CPU or memory. Even with resource quotas and/or limits set up, some apps may fail until cluster admins either add more infrastructure to the cluster or reduce the resource consumption of the cluster's applications.

However, resource quotas and limits will at least tell Kubernetes which workloads to prioritize during times of insufficient resources. They're a way to protect your most important workloads, ensuring that they can continue to manage telemetry operations and otherwise function normally.

How Telemetry Pipelines Keep Kubernetes Stable

Another way to prevent the noisy neighbor problem in Kubernetes is to offload the processing of telemetry data from the local cluster as much as possible. That way, cluster resources aren't tied up processing telemetry data that could be handled elsewhere.

That's where telemetry pipelines come into play. By making it possible to transform, merge, deduplicate, or otherwise process data while it is in transit, telemetry pipelines reduce the data processing load placed on the workloads where telemetry data originates. In that way, they also reduce the risk that telemetry operations will cause your cluster resources to max out.

The bottom line: configuring resource quotas and limits is a best practice that will help keep your most important workloads operating during times of peak cluster load. But it's also a best practice to leverage telemetry pipelines to lighten overall cluster load, reducing the risk that Kubernetes will need to enforce resource maximums at all.
