Containers have gone mainstream, and Kubernetes is fast becoming the de facto standard for container orchestration. Flexera’s 2021 State of the Cloud Report found that 48% of enterprises are using Kubernetes, and another 25% plan to use it.
Kubernetes is growing increasingly popular because it enables IT organizations to optimize the utilization of their compute resources while greatly simplifying the configuration and management of containerized applications at scale. However, for all that it simplifies, Kubernetes is itself a highly complex software application. From DNS issues to hardware failures, there are a lot of things that can, and do, go wrong with Kubernetes deployments, so many that there’s an entire website devoted to cataloging Kubernetes failure stories, including tips on how to avoid making the same mistakes.
Applications deployed on Kubernetes can take advantage of Kubernetes’ built-in resiliency features, such as auto-scaling and auto-restarts, but in an increasingly digitized world, organizations can’t afford to take a wait-and-see approach. The average hourly cost of downtime for high-priority applications is estimated to be nearly $68,000, while the hourly cost for normal applications comes in at nearly $62,000.
Enter chaos engineering, a proactive approach embraced by DevOps teams at major tech companies like Netflix and Amazon. Instead of waiting for failures to happen, chaos engineers purposefully introduce faults into production systems and observe what happens, enabling them to fix vulnerable parts of their applications and infrastructure before outages happen.
In addition to preventing outages, chaos experiments aid in the development of resilient apps and systems by giving IT teams a deeper understanding of how their systems act under volatile, real-world conditions. This also helps resolve the sometimes-conflicting goals of DevOps teams, who want to push changes as soon as possible, and site reliability engineers, who may be skittish about how these changes will impact availability. Through chaos engineering, IT teams can build a resilient environment where SRE teams can feel confident that the changes being made will not negatively impact system reliability or performance.
Kubernetes components can interact with each other, and with other systems, in highly unpredictable ways, and as an organization’s container environment grows, so does the number of things that can go wrong. Traditional pre-production testing may not uncover scenarios such as the one Target found itself in when it upgraded its OpenStack infrastructure in its development environment, setting off a series of events that ended up impacting Kubernetes and causing it to provision tens of thousands of new nodes.
Target uses Kafka as a message broker between applications and disconnected systems, including shipping logs and metrics from apps to backend systems. The OpenStack upgrade unexpectedly disrupted network connectivity for several hours, which in turn disrupted connectivity to Kafka. Meanwhile, one of Target’s Kubernetes clusters was significantly larger than the others, hosting approximately 2,000 development environment workloads.
Kafka’s connectivity issues caused the logging sidecars for all applications, which normally consume only minimal CPU resources, to “wake up” simultaneously, which placed a high load on the Docker daemons on the nodes in the Kubernetes cluster. This higher load prompted the nodes to report to Kubernetes that they were unhealthy, and the Kubernetes scheduler attempted to “solve” the problem by moving workloads off the “unhealthy” nodes and onto healthy ones. During this rescheduling event, Kubernetes provisioned approximately 41,000 new nodes.
Despite their name, chaos experiments are very carefully planned and executed, following a process that mimics the scientific method.
Typically, IT teams perform chaos experiments using one of the many tools designed specifically for chaos testing in Kubernetes, including Gremlin, Litmus, and Chaos Mesh. While the nitty-gritty details vary between tools, all of them trigger problems, then report back on how Kubernetes handled them.
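To make the "trigger a problem, then report back" pattern concrete, here is a minimal sketch in Python. It does not use the API of Gremlin, Litmus, or Chaos Mesh, and it does not talk to a real cluster. The `ToyDeployment` class, its `reconcile` method, and the `run_experiment` helper are all hypothetical names invented for illustration; they only imitate the way a Kubernetes controller restores a desired replica count after a pod is killed.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyDeployment:
    """Hypothetical in-memory stand-in for a Deployment (not a real client)."""
    desired_replicas: int
    running: list = field(default_factory=list)
    next_id: int = 0

    def reconcile(self) -> None:
        # Imitates a controller loop: start pods until the desired count is met.
        while len(self.running) < self.desired_replicas:
            self.running.append(f"pod-{self.next_id}")
            self.next_id += 1

    def kill_random_pod(self, rng: random.Random) -> str:
        # The injected fault: remove one running pod chosen at random.
        victim = rng.choice(self.running)
        self.running.remove(victim)
        return victim


def steady_state(dep: ToyDeployment) -> bool:
    # Steady state here simply means "all desired replicas are running".
    return len(dep.running) == dep.desired_replicas


def run_experiment(dep: ToyDeployment, rng: random.Random) -> dict:
    """Check steady state, inject a fault, let the system react, and report."""
    report = {"steady_before": steady_state(dep)}
    if not report["steady_before"]:
        # Never inject faults into a system that is already unhealthy.
        report["aborted"] = True
        return report
    report["aborted"] = False
    report["killed"] = dep.kill_random_pod(rng)
    report["steady_during"] = steady_state(dep)
    dep.reconcile()
    report["steady_after"] = steady_state(dep)
    return report


if __name__ == "__main__":
    deployment = ToyDeployment(desired_replicas=3)
    deployment.reconcile()
    print(run_experiment(deployment, random.Random(0)))
```

A real tool would replace `kill_random_pod` with an actual fault injection and `steady_state` with checks against live metrics, but the overall shape (verify, inject, observe, report) stays the same.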
Before conducting a chaos experiment, IT teams need a centralized log management solution in place, both to confirm that their Kubernetes deployment is stable enough to begin experimenting and to establish the baseline of normal behavior needed to distill actionable insights from chaos tests.
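As a rough illustration of what "establishing a baseline" can mean in practice, the sketch below summarizes a series of per-minute error counts, which are assumed to have been extracted from centralized logs. The function names, the three-sigma upper bound, and the `max_cv` stability threshold are illustrative assumptions, not values recommended by any particular log management product.

```python
import statistics


def build_baseline(errors_per_minute, sigmas=3.0):
    """Summarize normal behavior from log-derived error counts.

    `errors_per_minute` is assumed to hold at least two samples taken
    while no experiment was running.
    """
    mean = statistics.fmean(errors_per_minute)
    stdev = statistics.stdev(errors_per_minute)
    return {
        "mean": mean,
        "stdev": stdev,
        # Values above this bound are treated as abnormal (an assumption).
        "upper": mean + sigmas * stdev,
    }


def is_stable(errors_per_minute, max_cv=0.5):
    """Crude stability gate: the coefficient of variation must stay small."""
    baseline = build_baseline(errors_per_minute)
    if baseline["mean"] == 0:
        return baseline["stdev"] == 0
    return baseline["stdev"] / baseline["mean"] <= max_cv


if __name__ == "__main__":
    quiet_week = [2, 3, 2, 4, 3, 2, 3, 3]
    print(build_baseline(quiet_week))
    print("stable enough to experiment:", is_stable(quiet_week))
```

If `is_stable` returns `False`, the environment is already noisy, and any change observed during an experiment would be hard to attribute to the injected fault.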
During the experiment, chaos testing tools generate event logs that capture exactly when and where an event was triggered, along with which parts of the system were impacted by the event and how. By correlating event logs with the baseline normal captured in standard logs, IT teams can observe how the system’s behavior changed and determine whether they need to make adjustments to prevent a future failure. During a real-world event, incident response teams must piece together information about the conditions leading up to a failure, but chaos event logs let developers and IT administrators see the big picture as it unfolds.
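The correlation described above can be sketched as a simple time-window join. The record layout (`time`, `component`, `level`), the five-minute window, and the `impacted_components` helper are assumptions made for this example; real chaos tools and log platforms each have their own schemas.

```python
from datetime import datetime, timedelta


def impacted_components(event_time, log_records, baseline_errors,
                        window=timedelta(minutes=5)):
    """Return components whose error count after a chaos event exceeds baseline.

    `baseline_errors` maps a component name to the number of errors it
    normally logs in a window of the same length (assumed to be known).
    """
    counts = {}
    for record in log_records:
        in_window = event_time <= record["time"] < event_time + window
        if record["level"] == "ERROR" and in_window:
            name = record["component"]
            counts[name] = counts.get(name, 0) + 1
    # Keep only components that logged more errors than they normally do.
    return {name: n for name, n in counts.items()
            if n > baseline_errors.get(name, 0)}


if __name__ == "__main__":
    fault_injected_at = datetime(2021, 6, 1, 12, 0, 0)
    logs = [
        {"time": fault_injected_at + timedelta(seconds=10),
         "component": "checkout", "level": "ERROR"},
        {"time": fault_injected_at + timedelta(seconds=20),
         "component": "checkout", "level": "ERROR"},
        {"time": fault_injected_at + timedelta(seconds=30),
         "component": "search", "level": "ERROR"},
        {"time": fault_injected_at - timedelta(minutes=1),
         "component": "cart", "level": "ERROR"},
    ]
    normal = {"checkout": 1, "search": 1}
    print(impacted_components(fault_injected_at, logs, normal))
```

In this sample data, only `checkout` logs more errors than its baseline during the window, so it is the one component flagged for follow-up.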
With so many moving parts, Kubernetes failures are bound to happen. How teams respond to failure is critical to the success and quality of deployments. Chaos experiments in production aren’t a replacement for experiments during testing and staging; in fact, some chaos testing tools run in testing and staging as well as production. However, testing and staging environments can’t replicate the real world. Chaos experiments let IT teams safely test failure scenarios and head off catastrophes down the road.