Best Practices For Working With Difficult-To-Understand Kubernetes Logs
- Understand why Kubernetes logging is difficult
- Learn about the different types of Kubernetes logs and how to make them easier to collect
- Explore the benefits of centralized logging for Kubernetes
The Kubernetes platform consists of several components, each with distinct functionality and expected outcomes. Yet even with those features, if you want to understand how Kubernetes orchestrates workloads and manages its resources, you need an effective logging solution: one that tracks a cluster’s current runtime information, errors, application logs, problems, and contextual information.
Although Kubernetes offers a logging architecture, you will find that there are lots of critical features left for DevOps teams to configure on their own. These teams have demanding requirements for centralized logging and monitoring. Therefore, it becomes an engineering effort to figure out the most efficient solution for this problem.
This article explains the initial difficulties of collecting and analyzing Kubernetes logs in a meaningful way. It subsequently explains the different kinds of logs and ways to make it easier for DevOps teams to collect and monitor them in their workflows.
Let’s get started.
Why Kubernetes Logging is Difficult
We can identify a few significant reasons why logging in Kubernetes is challenging and requires particular attention:
- It’s limited by default: With kubectl, you can view or watch logs from the kube-system namespace, from nodes, or from applications. But that's about it. This default logging functionality is suitable for quick spot checks only. If you want to stream, filter, or aggregate any of those logs, you need to use third-party solutions.
- Pods are ephemeral: Kubernetes autoscalers add or remove pods based on the active configuration, and Kubernetes often creates pods for jobs or workloads that have a limited lifetime. The fundamental challenge with those pods is having enough time to capture all of their logs, and to associate those logs with a particular host or job ID, before Kubernetes cleans the pods up.
- There are scaling limitations: Kubernetes can scale up to thousands of nodes, each running dozens or even hundreds of pods. You need to ensure that no logs are lost before they are persisted. Collecting all those logs without exhausting system resources can also be problematic. In addition, application code often uses inefficient loggers, which become bottlenecks over time.
- There are limited integrations: Not all external log collectors can integrate properly with Kubernetes. When evaluating a log collector or processor, you need to check for specific criteria such as awareness of Kubernetes pod/node lifecycle events, types of workloads, and the ability to run as a node-level agent or a sidecar process.
Put simply, Kubernetes offers only basic logging functionality and leaves most other features to third-party logging tools. This makes sense, though, as the primary goals of the platform are to be extensible and easy to adopt by many organizations irrespective of their logging requirements. Next, we’ll discuss the difficulties you may encounter when collecting and analyzing Kubernetes logs.
Kubernetes Logging Challenges
Kubernetes emits logs from several places, such as pods, nodes, the control plane (cluster logs), lifecycle events, and audit logs. These logs differ in format and in the context information they carry. This inconsistency requires normalization, especially if you plan to aggregate those logs with external tools such as Fluentd or CloudWatch.
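To make the idea of normalization concrete, here is a minimal Python sketch that maps two differently shaped log lines onto one common record. It assumes that applications emit JSON with "level", "time", and "message" keys (hypothetical field names chosen for this example) and that system components use the klog header layout (Lmmdd hh:mm:ss.uuuuuu threadid file:line] msg). The normalize function and the record shape are illustrative, not part of any particular log collector.

```python
import json
import re

# klog header layout used by many Kubernetes components:
#   Lmmdd hh:mm:ss.uuuuuu threadid file:line] msg
KLOG_RE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<origin>[^\]]+)\] (?P<message>.*)$"
)
KLOG_LEVELS = {"I": "info", "W": "warning", "E": "error", "F": "fatal"}


def normalize(line: str, source: str) -> dict:
    """Map one raw log line to a common {source, level, time, message} record."""
    line = line.rstrip("\n")

    # Case 1: the application already logs JSON (field names are assumptions).
    try:
        parsed = json.loads(line)
    except ValueError:
        parsed = None
    if isinstance(parsed, dict):
        return {
            "source": source,
            "level": str(parsed.get("level", "unknown")).lower(),
            "time": parsed.get("time"),
            "message": parsed.get("message", ""),
        }

    # Case 2: a klog-style line from a system component.
    match = KLOG_RE.match(line)
    if match:
        return {
            "source": source,
            "level": KLOG_LEVELS[match["level"]],
            # klog omits the year, so the timestamp is kept as partial text.
            "time": f"{match['month']}-{match['day']}T{match['time']}",
            "message": match["message"],
        }

    # Fallback: keep the raw text so that nothing is silently dropped.
    return {"source": source, "level": "unknown", "time": None, "message": line}
```

In practice, a collector such as Fluentd would do this work through its own parser configuration. The sketch only shows the kind of mapping that has to happen somewhere in the pipeline.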
Let's explain the primary concerns for each type of Kubernetes log and how to make them easier to collect:
- Container Logs: These are also known as application logs. Applications often write them to a file, for example in the /var/log directory. However, this approach has many disadvantages, as such log files are difficult to collect across many containers. This is why all apps that run inside a container and get deployed in Kubernetes should stream their logs to stdout and stderr. This is not hard to accomplish and, quite often, logging frameworks offer this functionality by default. The main consideration for application logs is their format. Different frameworks log different levels of information, and there is no consistency in that area. You should therefore standardize the log format of all deployed applications for easier management and contextual understanding across teams.
- Node-level Logs: As the pods running on a node stream logging information, the output is collected in files on that host. By default, these logs are rotated at a low size limit (10 MiB), so it's important to persist them elsewhere regularly. If a pod is evicted from the node at any point, the logs of its containers disappear with it. So if you miss the opportunity to collect them from there, they are gone forever.
A good recommendation is to use a centralized logging agent that collects all logs from each pod. This agent can run as a DaemonSet (one agent per node) or as a sidecar container deployed in each pod. Usually, the external log collector provides deployment options for both cases.
- Cluster Logs: The logs of all system components are called cluster logs and can be viewed under the kube-system namespace. Many components run inside the control plane, so it’s important that Kubernetes offers a consistent experience when logging events from there. Managed Kubernetes distributions conveniently stream those logs to their native logging services (for instance, from the Amazon EKS control plane to CloudWatch). When this functionality is not provided, you can use the same centralized logging agents to monitor the control plane nodes as well. For more efficient troubleshooting, you should have a dedicated dashboard and be able to configure alerts for cluster logs.
- Audit Logs: Audit logs contain information from the kube-apiserver, which is the component that interfaces with all the other components (either clients or control plane components). As this service acts as a central gateway between them, it’s an ideal candidate for capturing audit logs. This also means that numerous log events can come in at any time. You can capture those logs with greater granularity than other log types. For example, you can apply audit policies to each request stage (ResponseStarted, ResponseComplete, etc.) and also stream to multiple backends. This makes audit logs a very convenient and valuable feature.
- Event Logs: Kubernetes events describe what is happening inside the cluster. Events come with their own limitations and are harder to handle, as they persist only for a short period of time (about one hour). Their log format is also not standardized or configurable, which is why there is a separate kubectl command to inspect them. However, as they are really important, make sure you persist them to a secure place through an external collector agent.
- Ingress Logs: These logs capture the events that happen inside the ingress controller, which is one of the most important pieces of the infrastructure. The ingress controller is responsible for controlling external access to the cluster components and logs important information like incoming connections, HTTP headers, and metadata. If you install an ingress controller and don’t configure the log format, you can capture excessive information or create bottlenecks during traffic spikes. Inconsistent formats also make the logs more difficult to parse and recognize in log collectors. One solution is to log only the specific request fields that are useful for debugging or introspection. For reference, the Nginx Ingress Controller lists its available log formats, which are built from predefined variables.
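As an illustration of the ingress point above, the following Python sketch parses access-log lines written in a trimmed-down format. It assumes the controller has been configured to emit only four standard NGINX variables in this order: $remote_addr "$request" $status $request_time. That particular layout, and the parse_access_line helper, are assumptions made for this example rather than a format the controller ships with.

```python
import re
from typing import Optional

# Hypothetical trimmed access-log layout (an assumption for this sketch):
#   $remote_addr "$request" $status $request_time
ACCESS_RE = re.compile(
    r'^(?P<remote_addr>\S+) '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r"(?P<status>\d{3}) (?P<request_time>\d+\.\d+)$"
)


def parse_access_line(line: str) -> Optional[dict]:
    """Return the parsed fields of one access-log line, or None if it is malformed."""
    match = ACCESS_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Convert numeric fields so that collectors can filter and aggregate on them.
    record["status"] = int(record["status"])
    record["request_time"] = float(record["request_time"])
    return record
```

Because the format carries only a handful of fields in a fixed order, a single regular expression is enough to parse it, and malformed lines can be counted or routed separately instead of breaking the pipeline.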
Besides maintaining a consistent log format for all log types, it’s also recommended to set resource constraints on all agents so that they can be scheduled on another node if the current one is reaching its capacity limits or is failing. That way, there will be fewer interruptions to the logging process.
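On the application side, one way to get a consistent format is to have every service write one JSON object per line to stdout. The sketch below does this with Python's standard logging module. The field names ("time", "level", "logger", "message") and the build_logger helper are assumptions for illustration, so pick whatever schema your teams agree on.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to stdout (or to the given stream)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace any handlers configured earlier
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info("order %s accepted", 1234)
```

Since the output goes to stdout, the container runtime captures it on the node, where a node-level agent can pick it up without any per-application file handling.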
Best Practice: Native Kubernetes Integration
We have explained the problem of logging in Kubernetes and the factors that make its logs difficult to collect and analyze. However, as more companies like Mezmo, formerly known as LogDNA, offer a native integration with Kubernetes, taking control of your logs in that environment becomes easier.