The Role of Infrastructure Monitoring in DevOps

Conversations about DevOps tend to focus first and foremost on software delivery. After all, the core goal of DevOps is enabling fast and reliable delivery of software.

But what can be easy to overlook is that you can’t deliver software very effectively if it lacks the reliable infrastructure to run on. Although DevOps teams may not think of infrastructure monitoring and management as a core part of their jobs, it arguably is for any team that wants to maximize its ability to deliver application releases continuously and quickly.

With that reality in mind, here’s a look at five infrastructure metrics that DevOps teams should track to ensure the success of the DevOps delivery chain.

Total Available Cluster Nodes

In many cases today, DevOps teams deliver software into distributed and orchestrated environments using a platform like Kubernetes. These environments need enough nodes to host applications reliably, and enough spare capacity to absorb growth as applications gain more features (and therefore consume more resources).

Toward that end, DevOps teams should track the total available nodes within their clusters. That’s especially true because generic Kubernetes does not add nodes to a cluster on its own. (Some cloud-based Kubernetes distributions, like GKE, offer node autoscaling, but it’s not a core feature of generic Kubernetes.) If the number of available nodes drops, teams must be ready to provision more.

Remember that the total node count is not necessarily the same as the total available node count. Some nodes could be assigned to a cluster but not able to host containers, for example because they are not in the Ready state or have been marked unschedulable. You’ll need to check the status of each node (the kubectl get nodes command reports it) to determine which of your assigned nodes are actually up and able to accept workloads.
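
One way to make the difference between total and available nodes concrete is to count only the nodes whose Ready condition is True and that are not marked unschedulable. The following is a minimal sketch, assuming you have saved the output of kubectl get nodes -o json and parsed it into a dictionary. The function name count_available_nodes and the sample data are hypothetical and exist only for illustration.

```python
def count_available_nodes(node_list):
    """Return (available, total) for a parsed `kubectl get nodes -o json` document.

    A node counts as available here if its Ready condition is "True" and it
    has not been cordoned (spec.unschedulable). This definition is an
    assumption; adjust it to match your own notion of "available".
    """
    nodes = node_list.get("items", [])
    available = 0
    for node in nodes:
        conditions = node.get("status", {}).get("conditions") or []
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        cordoned = node.get("spec", {}).get("unschedulable", False)
        if ready and not cordoned:
            available += 1
    return available, len(nodes)


# Hypothetical sample: one healthy node, one NotReady node, one cordoned node.
sample_nodes = {
    "items": [
        {"metadata": {"name": "node-a"}, "spec": {},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "node-b"}, "spec": {},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
        {"metadata": {"name": "node-c"}, "spec": {"unschedulable": True},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    ]
}

available, total = count_available_nodes(sample_nodes)
print(f"{available} of {total} nodes available")
```

In practice you might feed this function the result of json.load on the piped kubectl output and raise an alert when the available count falls below a floor you choose.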

Container Restart Count

One of the reasons why DevOps teams tend to love Kubernetes is that Kubernetes is good at keeping things running under non-ideal circumstances. It does this by, among other strategies, automatically restarting containers when they fail.

Ultimately, this means that a container with a problem may continue to function in Kubernetes relatively well because Kubernetes keeps restarting it when it crashes.

Of course, having a container that repeatedly crashes is not ideal. It could be a sign of a problem with your application. If left unaddressed, the situation could escalate until it leads to total failure.

That’s why it’s worth tracking the Restart Count metric, which appears in the output of the kubectl describe pod <pod-name> command. Restart Count tells you how many times Kubernetes has restarted each container.

Ideally, this number will be zero. If it’s not, you should probably figure out why. Although it’s not necessarily a problem for containers to restart periodically, continual restarts could, again, be a sign that there is a bug in your application, which is something you’ll need to fix as part of the DevOps release process.
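
Checking pods one at a time does not scale well, so it can help to pull restart counts for every container at once. The sketch below assumes the output of kubectl get pods --all-namespaces -o json has been parsed into a dictionary, and it reads the restartCount field from each container status. The function name restart_counts and the sample pods are hypothetical.

```python
def restart_counts(pod_list):
    """Map "namespace/pod/container" to its restart count, for containers
    that have restarted at least once.

    Expects a parsed `kubectl get pods -o json` document.
    """
    counts = {}
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # Pods that have not started yet may have no containerStatuses.
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for status in statuses:
            restarts = status.get("restartCount", 0)
            if restarts > 0:
                key = f'{meta.get("namespace")}/{meta.get("name")}/{status.get("name")}'
                counts[key] = restarts
    return counts


# Hypothetical sample: one stable pod and one pod that has restarted 4 times.
sample_pods = {
    "items": [
        {"metadata": {"namespace": "shop", "name": "web-5c8d"},
         "status": {"containerStatuses": [{"name": "web", "restartCount": 0}]}},
        {"metadata": {"namespace": "shop", "name": "cart-7d9f"},
         "status": {"containerStatuses": [{"name": "cart", "restartCount": 4}]}},
    ]
}

for container, restarts in restart_counts(sample_pods).items():
    print(f"{container}: {restarts} restarts")
```

An empty result corresponds to the ideal case described above, where no container has needed a restart.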

Monitoring Kubernetes Namespaces

Another handy feature in Kubernetes is namespaces. Namespaces are a way to isolate workloads into virtual environments within the same Kubernetes cluster. They help ensure that an issue with one workload doesn’t cause problems for another workload running alongside it.

For DevOps teams, it’s best to isolate each workload into its own namespace unless there is a reason for workloads to share one (there may be if the workloads share an identical configuration or interact in some way).

However, it can be easy to forget to follow this rule, especially as clusters evolve. As the workloads increase and more and more users and teams share the same cluster, unrelated workloads may run in the same namespace.

To minimize this risk, keep track of how many namespaces you have and which workloads are running in each one.

You can list all existing namespaces with:

kubectl get namespaces

To see which pods are running in which namespace, run:

kubectl get pods --all-namespaces
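
To turn that listing into a check for unrelated workloads sharing a namespace, you could group pods by namespace and look for namespaces that host more than one application. The sketch below assumes your pods carry an app label that identifies the workload, which is a common convention but not something Kubernetes enforces. The function name namespaces_with_mixed_apps and the sample data are hypothetical, and the input is assumed to be the parsed output of kubectl get pods --all-namespaces -o json.

```python
from collections import defaultdict


def namespaces_with_mixed_apps(pod_list, label="app"):
    """Return {namespace: sorted app names} for namespaces that host more
    than one distinct value of the given label.

    Pods without the label are grouped under "(unlabeled)".
    """
    apps_by_namespace = defaultdict(set)
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        labels = meta.get("labels") or {}
        namespace = meta.get("namespace", "default")
        apps_by_namespace[namespace].add(labels.get(label, "(unlabeled)"))
    return {
        ns: sorted(apps)
        for ns, apps in apps_by_namespace.items()
        if len(apps) > 1
    }


# Hypothetical sample: "shop" holds one app, "shared" holds two unrelated apps.
sample_pod_list = {
    "items": [
        {"metadata": {"namespace": "shop", "name": "cart-1", "labels": {"app": "cart"}}},
        {"metadata": {"namespace": "shop", "name": "cart-2", "labels": {"app": "cart"}}},
        {"metadata": {"namespace": "shared", "name": "billing-1", "labels": {"app": "billing"}}},
        {"metadata": {"namespace": "shared", "name": "search-1", "labels": {"app": "search"}}},
    ]
}

print(namespaces_with_mixed_apps(sample_pod_list))
```

A namespace that shows up in the result is not necessarily misconfigured, since some workloads share a namespace on purpose, but it is a reasonable place to start reviewing.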

Cloud Billing Metrics

You may not think of your cloud computing bill as a source of data that will help you achieve DevOps success. But it can be. After all, inefficiencies in application design or deployment, as well as performance issues that cause applications to consume more cloud infrastructure resources than they ideally would, often show up as higher cloud costs.

Most cloud providers offer tools to help you track your cloud spend and generate alarms when it spikes unexpectedly. Of course, you’ll have to keep track of how your cloud environment changes as well, because the addition of cloud services or changes in workload configuration could trigger billing anomalies just as easily as application performance issues.

Nonetheless, consider billing metrics as an out-of-the-box means of gaining additional insight into infrastructure usage patterns and their relationship to application performance.
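
If you export daily cost figures from your provider, a simple baseline comparison can flag the kind of unexpected jump described above. The following is a rough sketch, not a substitute for your provider's own alerting. The function name find_cost_spikes, the 7-day window, the 1.5x threshold, and the sample figures are all assumptions chosen for illustration.

```python
from statistics import mean


def find_cost_spikes(daily_costs, window=7, threshold=1.5):
    """Return the indexes of days whose cost exceeds `threshold` times the
    average cost of the preceding `window` days."""
    spikes = []
    for i in range(window, len(daily_costs)):
        baseline = mean(daily_costs[i - window:i])
        if baseline > 0 and daily_costs[i] > threshold * baseline:
            spikes.append(i)
    return spikes


# Hypothetical daily spend in dollars; the ninth day (index 8) jumps sharply.
daily_costs = [100, 102, 98, 101, 99, 103, 100, 104, 180, 105]

for day in find_cost_spikes(daily_costs):
    print(f"Day {day}: ${daily_costs[day]} is well above the trailing average")
```

Before treating a flagged day as a performance problem, compare it against your change history, since a newly added service or a configuration change can produce the same pattern.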

Infrastructure Utilization Rates

We’ve saved infrastructure utilization rates for the end of this list because they are the least exciting category of infrastructure metrics to track. But they are also one of the most important.

Infrastructure utilization rates are basic metrics like CPU and memory utilization. Depending on how you run your applications, you may track these metrics for each VM, container, or other resource that you run.

From the perspective of DevOps, infrastructure utilization rates are significant because they are another means of identifying potential performance issues or bugs within your application. If infrastructure consumption creeps up over time without a corresponding increase in application requests, it may be a sign that the application is running sub-optimally. It could also reflect feature bloat, meaning the addition of features that consume more resources but don’t necessarily add value for users. Either way, you’ll want to take a closer look to figure out what is going on.
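
One way to separate genuine load growth from creeping consumption is to normalize utilization by request volume. The sketch below compares CPU consumed per request in the first and last periods of a series. The function name utilization_is_creeping, the 20 percent tolerance, and the sample numbers are assumptions made for illustration, and the inputs are assumed to come from whatever monitoring system you already use.

```python
def utilization_is_creeping(cpu_seconds, request_counts, tolerance=0.2):
    """Return True if CPU seconds per request in the last period exceed the
    first period's value by more than `tolerance` (0.2 means 20 percent).

    Both lists hold one value per period, in chronological order.
    Periods with zero requests are skipped.
    """
    per_request = [
        cpu / requests
        for cpu, requests in zip(cpu_seconds, request_counts)
        if requests > 0
    ]
    if len(per_request) < 2 or per_request[0] == 0:
        return False
    return per_request[-1] / per_request[0] > 1 + tolerance


# Hypothetical weekly figures: CPU use climbs while traffic stays flat.
cpu_flat_traffic = [500, 520, 560, 640]
requests_flat_traffic = [10000, 10100, 9900, 10050]
print(utilization_is_creeping(cpu_flat_traffic, requests_flat_traffic))

# Hypothetical weekly figures: CPU use climbs in step with traffic.
cpu_growing_traffic = [500, 600]
requests_growing_traffic = [10000, 12000]
print(utilization_is_creeping(cpu_growing_traffic, requests_growing_traffic))
```

The first series is flagged because each request costs noticeably more CPU than it used to, while the second is not because consumption grew in proportion to demand.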

Conclusion: Infrastructure Monitoring as a Key to DevOps Success

Infrastructure monitoring may not be a top-of-mind priority for the typical DevOps team, and it may feel bland compared to the flashier work of designing, creating, and deploying new software. However, keeping track of what is happening in your application hosting infrastructure is a critical ingredient in DevOps success. You can’t deliver software consistently and continuously if you can’t rely on your infrastructure.
