Top Kubernetes Metrics & Logs for End-to-End Log Monitoring


Kubernetes makes life as a DevOps professional easier by creating levels of abstraction, like pods and services, that are self-sufficient. Although this means we no longer have to worry as much about infrastructure and dependencies, we still need to monitor our apps, the containers they run in, and the orchestrator itself. What makes things more interesting is that the more levels of abstraction Kubernetes piles on to “simplify” our lives, the more levels we have to see through to monitor the stack effectively. Across these levels you need to monitor resource sharing, communication, application deployment and management, and discovery.

Pods are the smallest deployable units in Kubernetes. They run on nodes, which are grouped into clusters. So when we say “monitoring” in Kubernetes, it could be at any of several levels: the containers themselves, the pods they’re running in, the services they make up, or the entire cluster. Let’s look at the key metrics and log data that we need to analyze to achieve end-to-end visibility in a Kubernetes stack.

Usage Metrics

Performance issues generally arise from CPU and memory usage, so these are likely the first resource metrics users will want to review. This brings us to cAdvisor, an open source tool that automatically discovers every container on a machine and collects CPU, memory, filesystem, and network usage statistics. cAdvisor also reports overall machine usage by analyzing the ‘root’ container on the machine. Sounds too good to be true, doesn’t it? Well, there is a catch: cAdvisor only collects basic resource utilization and doesn’t offer any long-term storage or analysis capabilities.

CPU, Memory and Disk I/O

Why is this important? With traditional monitoring, we’re used to tracking actual resource consumption at the node level. With Kubernetes, we’re looking at the sum of the resources consumed by all the containers across nodes and clusters, which keeps changing dynamically. If this sum is less than your nodes’ capacity, your containers have all the resources they need, and there’s room for Kubernetes to schedule another container if load increases. However, if it goes the other way and you have too few nodes, your containers might not have enough resources to meet requests. This is why making sure that requests never exceed your collective node capacity is more important than monitoring simple CPU or memory usage.

With regard to disk usage and I/O, in Kubernetes we’re more interested in the percentage of disk in use than in the absolute size of our clusters, so graphs are wired to trigger alerts based on the percentage of disk being used. I/O is also monitored per node, so you can easily tell whether increased I/O activity is the cause of issues like latency spikes in particular locations.
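To make the capacity comparison concrete, here is a minimal sketch in Python. The `ContainerRequest` type, the function names, and the sample numbers are all hypothetical and chosen for illustration only; in a real cluster the inputs would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class ContainerRequest:
    """Resources one container asks to have reserved (hypothetical type)."""
    cpu_millicores: int
    memory_bytes: int


def remaining_headroom(requests, capacity_cpu_millicores, capacity_memory_bytes):
    """Return (cpu, memory) still unreserved across the given node capacity.

    A negative value means the containers request more than the nodes offer.
    """
    total_cpu = sum(r.cpu_millicores for r in requests)
    total_mem = sum(r.memory_bytes for r in requests)
    return capacity_cpu_millicores - total_cpu, capacity_memory_bytes - total_mem


def disk_percent_used(used_bytes, total_bytes):
    """Percentage of disk in use, the figure a disk alert would be keyed on."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


GIB = 1024 ** 3

# Made-up sample: two containers on nodes offering 4 cores and 8 GiB in total.
requests = [
    ContainerRequest(cpu_millicores=500, memory_bytes=1 * GIB),
    ContainerRequest(cpu_millicores=1500, memory_bytes=2 * GIB),
]
cpu_left, mem_left = remaining_headroom(requests, 4000, 8 * GIB)
print(f"CPU headroom: {cpu_left}m, memory headroom: {mem_left // GIB} GiB")
```

An alert on negative headroom, or on `disk_percent_used` crossing a threshold you choose, follows the same pattern.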

Kubernetes Metrics

There are a number of ways to collect metrics from Kubernetes. Kubernetes doesn’t report resource metrics by reading the cgroup files directly, as container-level tools do; instead it relies on tools like Heapster (which has since been retired in favor of metrics-server). This is why a lot of experts say that container metrics should usually be preferred to Kubernetes metrics. A good practice, however, is to collect Kubernetes data along with Docker container resource metrics and correlate them with the health and performance of the apps they run.

While Heapster focuses on forwarding metrics already generated by Kubernetes, kube-state-metrics is a simple service focused on generating completely new metrics from Kubernetes. These metrics have really long names that are pretty self-explanatory: kube_node_status_capacity_cpu_cores and kube_node_status_capacity_memory_bytes report your node’s CPU and memory capacity respectively. Similarly, kube_node_status_allocatable_cpu_cores tracks the CPU resources on a node that are available for scheduling, and kube_node_status_allocatable_memory_bytes does the same for memory. (Newer kube-state-metrics releases expose the same data as kube_node_status_capacity and kube_node_status_allocatable with a resource label.) Once you get the hang of how they’ve been named, it’s pretty easy to make out what each metric keeps track of.
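As an illustration of how these names can be read programmatically, the sketch below parses a few lines of Prometheus-style text and compares capacity with allocatable CPU. The sample text, the node name `node-a`, and the helper names are made up for this example, and the parser is deliberately simplified (it assumes no commas or spaces inside label values and no timestamps).

```python
def parse_metrics(text):
    """Parse simple Prometheus text-format lines into {(name, labels): value}.

    Simplified on purpose: comment lines are skipped, and label values are
    assumed to contain no commas or spaces.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        name, _, label_part = series.partition("{")
        labels = {}
        if label_part:
            for pair in label_part.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key] = val.strip('"')
        samples[(name, tuple(sorted(labels.items())))] = float(value)
    return samples


def node_value(samples, metric, node):
    """Look up a metric that carries a single `node` label."""
    return samples[(metric, (("node", node),))]


# Made-up sample in the style of kube-state-metrics output.
SAMPLE = """
# TYPE kube_node_status_capacity_cpu_cores gauge
kube_node_status_capacity_cpu_cores{node="node-a"} 4
# TYPE kube_node_status_allocatable_cpu_cores gauge
kube_node_status_allocatable_cpu_cores{node="node-a"} 3.92
"""

samples = parse_metrics(SAMPLE)
capacity = node_value(samples, "kube_node_status_capacity_cpu_cores", "node-a")
allocatable = node_value(samples, "kube_node_status_allocatable_cpu_cores", "node-a")
reserved = capacity - allocatable
print(f"node-a reserves about {reserved:.2f} cores for system use")
```

For production use, a maintained Prometheus client library is a better choice than a hand-written parser like this one.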

Consuming Kubernetes Metrics

These metrics are designed to be consumed by Prometheus or a compatible scraper, and you can also open /metrics in a browser to view them raw. Monitoring a Kubernetes cluster with Prometheus is becoming a very popular choice, as Kubernetes and Prometheus have similar origins and Kubernetes components already expose their metrics in the Prometheus format. This means less time and effort lost in “translation” and more productivity. With these metrics in Prometheus, you can also keep track of the number of replicas in each deployment, which is an important metric. Pods typically sit behind services and are scaled by ReplicaSets, which create or destroy pods as needed. ReplicaSets are in turn controlled by Deployments, where you declare the desired state, including how many replicas should be running. This is another example of a feature built to improve performance that makes monitoring more difficult. ReplicaSets need to be monitored and kept track of just like everything else if you want your applications to keep performing better and faster.
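A replica check of the kind described above can be sketched as a comparison of desired and available counts. The example below works on plain dictionaries with made-up deployment names; the assumption (not stated in this article) is that the two inputs would be filled from kube-state-metrics series such as kube_deployment_spec_replicas and kube_deployment_status_replicas_available.

```python
def under_replicated(desired, available):
    """Return {deployment: missing_count} for deployments below their desired size.

    `desired` and `available` map deployment names to replica counts.
    A deployment missing from `available` is treated as having zero replicas.
    """
    return {
        name: want - available.get(name, 0)
        for name, want in desired.items()
        if available.get(name, 0) < want
    }


# Made-up sample data for two hypothetical deployments.
desired = {"web": 3, "worker": 2}
available = {"web": 3, "worker": 1}

missing = under_replicated(desired, available)
print(missing)
```

A deployment that stays in this result for longer than a rollout normally takes is a reasonable candidate for an alert.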

Network Metrics

As with everything else in Kubernetes, networking is about a lot more than network in, network out, and network errors. Instead, you have a boatload of metrics to look out for, including request rate, read IOPS, write IOPS, error rate, network traffic per second, and network packets per second. This is because there are new issues to deal with as well, like load balancing and service discovery, and where you used to have a single network in and network out, there are now thousands of containers. These containers make up hundreds of microservices that are all communicating with each other, all the time. A lot of organizations are turning to virtual networks to support their microservices, as software-defined networking gives you the level of control you need in this situation. That’s why solutions like Calico, Weave, Istio, and Linkerd are gaining popularity. SD-WAN in particular is becoming a popular choice for dealing with microservice architectures.
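Most of the per-second figures listed above are derived from cumulative counters. The following sketch shows one way to turn counter samples into a per-second rate and an error ratio. The function names and sample values are hypothetical, and the reset handling is a simplification of what a monitoring system would do.

```python
def per_second_rate(samples):
    """Average per-second increase of a cumulative counter.

    `samples` is a list of (unix_seconds, counter_value) pairs, oldest first.
    A drop in value is treated as a counter reset, i.e. a restart from zero.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must increase")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    return increase / elapsed


def error_ratio(error_samples, request_samples):
    """Fraction of requests that failed over the sampled window."""
    request_rate = per_second_rate(request_samples)
    if request_rate == 0:
        return 0.0
    return per_second_rate(error_samples) / request_rate


# Made-up counters sampled one minute apart.
request_samples = [(0, 100), (60, 700)]
error_samples = [(0, 5), (60, 35)]

print(per_second_rate(request_samples))
print(error_ratio(error_samples, request_samples))
```

The same `per_second_rate` function applies to byte and packet counters, which is how traffic per second and packets per second are usually obtained.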

Kubernetes Logs

Everything a containerized application writes to stdout and stderr is handled and redirected by the container engine and, more importantly, is logged somewhere. The functionality of a container engine or runtime, however, is usually not enough for a complete logging solution because when a container crashes, for example, it can take everything with it, including the logs. Therefore, logs need separate storage, independent of nodes, pods, or containers. This is known as cluster-level logging, and it requires a separate backend to store and analyze your logs. Kubernetes provides no native storage solution for log data, but you can integrate quite a few existing ones.

Kubectl Logs

kubectl logs is the command for viewing logs from the Kubernetes CLI, and it can be used as follows:

  • $ kubectl logs <pod-name>

This is the most basic way to view logs on Kubernetes, and there are a lot of flags to make your commands more specific. For example, “$ kubectl logs pod1” returns only the logs from pod1, “$ kubectl logs -f my-pod” streams your pod logs, and “$ kubectl logs job/hello” gives you the logs from the first container of a job named hello.

Logs for Troubleshooting

Logs are particularly useful for debugging problems and troubleshooting cluster activity. Some variations of kubectl logs for troubleshooting are:

  • “kubectl logs --tail=20 pod1”, which displays only the most recent 20 lines of output from pod1; or
  • “kubectl logs --since=1h pod1”, which shows all logs from pod1 written in the last hour.
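If you find yourself running these variations from scripts, a small helper can assemble the command for you. The sketch below is a hypothetical wrapper (the function name and parameters are not part of kubectl); it only builds the argument list from the flags shown above and does not run anything.

```python
def kubectl_logs_args(target, tail=None, since=None, follow=False):
    """Build the argument list for a `kubectl logs` call.

    target: pod name or resource such as "pod1" or "job/hello"
    tail:   number of most recent lines to show (--tail)
    since:  relative duration such as "1h" (--since)
    follow: stream the logs (-f)
    """
    args = ["kubectl", "logs"]
    if tail is not None:
        args.append(f"--tail={tail}")
    if since is not None:
        args.append(f"--since={since}")
    if follow:
        args.append("-f")
    args.append(target)
    return args


print(" ".join(kubectl_logs_args("pod1", tail=20)))
print(" ".join(kubectl_logs_args("pod1", since="1h")))
```

On a machine where kubectl is installed and configured, the returned list can be passed to `subprocess.run(args, check=True)` to execute the command.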

To get the most out of your log data, you can export your logs to a log analysis service like LogDNA and leverage its advanced logging features. LogDNA’s Live Streaming Tail makes troubleshooting with logs even easier since you can monitor for stack traces and exceptions in real time, in your browser. It also lets you combine data from multiple sources with all related events so you can do a thorough root cause analysis while looking for bugs.

Logging Levels and Verbosity

Additionally, there are different logging levels depending on how deep you want to go. If you don’t see anything useful in the logs and want to dig deeper, you can select a higher level of verbosity. To enable verbose logging on the Kubernetes component you are trying to debug, use the --v or --vmodule flag and set it to at least level 4, though the levels covered here go all the way up to 8. While level 3 gives you a reasonable amount of information about recent changes, level 4 is considered debug-level verbosity. Level 6 displays requested resources, level 7 displays HTTP request headers, and level 8 displays HTTP request contents. The level of verbosity you choose will depend on the task at hand, but it’s good to know that Kubernetes gives you deep visibility when you need it.

Kubernetes monitoring is changing and improving every day because, at the end of the day, that’s the name of the new game. Monitoring is so much more “proactive” now because everything rests on how well you understand the ins and outs of your containers. The better the understanding, the better the chances of improvement and the better the end user experience. So in conclusion, everything depends on how well you monitor your applications.
