Logging In An Era Of Containers
Log analysis will always be fundamental to any monitoring strategy. It is the closest you can get to the source of what’s happening with your app or infrastructure. Application development has changed drastically over the past few years with the rise of DevOps, containerization, and microservices architecture. Logs have not become less important as a result; rather, they are now at the forefront of running stable, highly available, and performant applications. However, the practice of logging has changed. Where it was once simple, it is now complex; where there were a few hundred lines of logs, there are now often millions; and where logs once lived in one place, they are now distributed. Yet as logging has become more challenging, a new breed of tools has arrived on the scene to manage and make sense of all this logging activity. In this post, we’ll look at the sea change that logging has undergone, and how innovative, cloud-based solutions have sprung up to address these challenges.
Complexity of the stack
Traditional client-server applications were simple to build, understand, and manage. The front end was required to run on a few browsers or operating systems. The backend consisted of a single consolidated database, or at most a couple of databases on a single server. When something went wrong, you could jump into your system logs in /var/log and easily identify the source of the failure and how to fix it.
With today’s cloud-native apps, the application stack has become tremendously complex. Your apps need to run on numerous combinations of mobile devices, browsers, operating systems, edge devices, and enterprise platforms. Cloud computing has made it possible to deliver apps consistently across the world over the internet, but it comes with its own management challenges. VMs (virtual machines) brought more flexibility and cost efficiency than hardware servers, but organizations soon outgrew them and needed a faster way to deliver apps.
Enter Docker. Containers bring consistency to the development pipeline by breaking down complex tasks and code into small, manageable chunks. This fragmentation lets organizations ship software faster, but it requires you to manage a completely new set of components. Container registries, the container runtime, and an orchestration tool or CaaS (containers as a service) platform all make a container stack more complex than VMs.
Volume of data has spiked
Each component generates its own set of logs. Monolithic apps are decomposed into microservices, with each service powered by numerous containers. These containers have short life spans of a few hours, compared to VMs, which typically run for months or even years. Every request is routed across the service mesh and touches many services before being returned as a response to the end user. As a result, the total volume of logs has multiplied. Correlating the logs in one part of the system with those of another part is difficult, and insights are hard won. Having more log data is an opportunity for better monitoring, but only if you’re able to glean insights from the data efficiently.
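To make the correlation problem concrete, here is a minimal Python sketch. It assumes every service writes JSON log lines that carry a shared request identifier; the field names (ts, service, request_id, msg) and the sample lines are hypothetical and not taken from any particular tool.

```python
import json
from collections import defaultdict

# Hypothetical JSON log lines from three services, arriving out of order.
RAW_LOGS = [
    '{"ts": "2024-01-01T10:00:00.310Z", "service": "inventory", "request_id": "req-1", "msg": "stock checked"}',
    '{"ts": "2024-01-01T10:00:00.120Z", "service": "gateway", "request_id": "req-1", "msg": "received GET /cart"}',
    '{"ts": "2024-01-01T10:00:00.450Z", "service": "gateway", "request_id": "req-2", "msg": "received GET /health"}',
    '{"ts": "2024-01-01T10:00:00.200Z", "service": "cart", "request_id": "req-1", "msg": "loading cart"}',
]


def correlate(lines):
    """Group parsed log records by request_id and order each group by time."""
    traces = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        traces[record["request_id"]].append(record)
    for records in traces.values():
        # The sample timestamps share one fixed-width UTC format,
        # so sorting them as plain strings matches chronological order.
        records.sort(key=lambda r: r["ts"])
    return dict(traces)


if __name__ == "__main__":
    for request_id, records in correlate(RAW_LOGS).items():
        print(request_id, [r["service"] for r in records])
```

Grouping by a shared identifier is only possible if every service actually logs one, which is why this sketch assumes the identifier is already present on each line.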
Many logging mechanisms
Each layer has its own logging mechanism. For example, Docker has logging drivers for many log aggregators. Kubernetes, on the other hand, doesn’t provide logging drivers of its own. Instead, a common setup runs a logging agent such as Fluentd on every node. More on Fluentd later in this post. Kubernetes doesn’t have native log storage, so you need to configure log storage externally. If you use a CaaS platform like ECS, it has its own set of log data and its own log collector. With log collection so fragmented, it can be dizzying to jump from one tool to another to make sense of errors when troubleshooting. With containers, you need to unify logging from all the various components for the logs to be useful.
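As a rough illustration of what unifying logs involves, the Python sketch below maps two differently shaped inputs onto one common record. The first shape follows the log/stream/time fields written by Docker’s json-file logging driver; the second is a hypothetical plain-text application format invented for this example, as are the function names and the target schema.

```python
import json
import re

# Hypothetical plain-text application format, for example:
# "2024-01-01T10:00:02Z ERROR payment declined"
APP_LINE = re.compile(r"^(?P<time>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$")


def from_docker_json(line, source):
    """Normalize a line shaped like Docker's json-file driver output."""
    entry = json.loads(line)
    return {
        "time": entry["time"],
        "source": source,
        "level": None,  # this format carries a stream name, not a level
        "stream": entry["stream"],
        "message": entry["log"].rstrip("\n"),
    }


def from_app_text(line, source):
    """Normalize a line in the hypothetical plain-text format above."""
    match = APP_LINE.match(line)
    if match is None:
        # Keep unparseable lines instead of silently dropping them.
        return {"time": None, "source": source, "level": None,
                "stream": None, "message": line}
    return {
        "time": match["time"],
        "source": source,
        "level": match["level"],
        "stream": None,
        "message": match["message"],
    }


if __name__ == "__main__":
    docker_line = '{"log": "cache warmed\\n", "stream": "stdout", "time": "2024-01-01T10:00:01.000000000Z"}'
    app_line = "2024-01-01T10:00:02Z ERROR payment declined"
    unified = [
        from_docker_json(docker_line, source="cache"),
        from_app_text(app_line, source="payments"),
    ]
    for record in unified:
        print(record)
```

Once every record has the same keys, a single search or alerting rule can run across all sources, which is the point of unifying the logs in the first place.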
The rise of open source tools
As log data has become more complex, the solutions for logging have matured as well. Today, there are many open source tools available. The most popular open source logging tool is the ELK stack. It’s actually a collection of three different open source tools: Elasticsearch, Logstash, and Kibana. Elasticsearch is a distributed full-text search engine, Logstash is a log aggregation tool, and Kibana is a visualization tool for time-series data. It’s easy to get started with the ELK stack when you’re dipping your toes into container logging, and it packs a lot of powerful features like high availability, near-real-time analysis, and great visualizations. However, once your logs reach the limits of the physical nodes that power the ELK stack, it becomes challenging to keep operations running smoothly. Performance lags and resource consumption become an issue. Despite this, the ELK stack has sparked many other container logging solutions like Mezmo, formerly known as LogDNA. These solutions have found innovative ways to deal with the problems that weigh down the ELK stack.
Fluentd is another tool commonly used along with the ELK stack. It is a log collection tool that manages the flow of log data from the source app to any log analysis platform. Its strength is that it has a wide range of plugins and can integrate with a wide variety of sources. However, in a Kubernetes setup, sending logs to Elasticsearch means running a Fluentd agent on every node, which can become a drain on system resources.
Machine learning is the future
While open source tools have led the way in making logging solutions available, they require a lot of maintenance overhead when monitoring real-world applications. Considering the complexity of the stack, the volume of data, and the various logging mechanisms, what’s needed is a modern log analysis platform that can intelligently analyze log data and derive insights. Analyzing log data by manual methods is a thing of the past. Instead, machine learning is opening up possibilities to let algorithms do the heavy lifting of crunching log data and extracting meaningful outcomes. Because algorithms can spot minute anomalies that would be invisible to humans, they can identify threats well before a human would, and in doing so can help prevent outages before they happen. Mezmo is one of the pioneers in this attempt to use machine learning to analyze log data.
In conclusion, it is an exciting time to build and use log analysis solutions. The challenge is great, and the options are plenty. As you choose a logging solution for your organization, remember the differences between legacy applications and modern cloud-native ones, and choose a tool that supports the latter most comprehensively. And as you think about the future of log management, remember that the key words are ‘machine learning’.
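For readers who want a feel for what spotting an anomaly in log data can mean in practice, here is a deliberately simple Python sketch. It is a statistical stand-in (a trailing-window z-score over per-minute error counts), far simpler than the machine learning models discussed above, and it does not represent how any particular product works. The function name, window size, threshold, and sample counts are all assumptions made for illustration.

```python
from statistics import mean, stdev


def find_anomalies(counts, window=5, threshold=3.0):
    """Return indexes whose value deviates from the mean of the preceding
    `window` values by more than `threshold` sample standard deviations."""
    anomalies = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as a deviation.
            if counts[i] != mu:
                anomalies.append(i)
            continue
        if abs(counts[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical error counts per minute, with one obvious spike.
    errors_per_minute = [4, 5, 3, 4, 5, 4, 6, 40, 5, 4]
    print(find_anomalies(errors_per_minute))
```

A fixed threshold like this needs tuning for each data source, which hints at why platforms turn to learned models once the volume and variety of logs grow.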