How to Use Logging with Chaos Engineering

Learning Objectives

  • Define chaos engineering and explain how chaos experiments are conducted.
  • Understand the benefits of chaos engineering.
  • Understand the role of logging and log analysis in chaos engineering.

Logging plays an integral role in all aspects of chaos engineering

In the post-pandemic world, both consumers and organizations rely on digital technologies more heavily than ever. Meanwhile, with the rise of microservices and distributed system architectures, organizational data environments have grown increasingly complex, which means system failures are more difficult to predict than ever.

In addition to being highly disruptive to end users, application downtime is extremely costly for organizations. One study estimates that the average hourly cost of downtime is nearly $68,000 for high-priority applications and nearly $62,000 for normal applications.

Waiting for an outage to happen, then cleaning up the mess, is no longer a realistic option. Organizations need a proactive approach that prevents failures from happening in the first place. That’s why DevOps teams at companies like Netflix and Amazon have embraced chaos engineering. 

What is chaos engineering?

Chaos engineering is about “breaking things on purpose.” IT teams introduce faults, or chaos, into systems in production, then measure how those systems respond. By conducting planned experiments that test how a system performs under stress, developers and engineers can gain a better understanding of how complex, distributed systems behave, identify and fix vulnerabilities before they turn into outages, and build more resilient systems.

Despite its name, chaos engineering involves very carefully planned and executed experiments. It can be compared to firefighters intentionally starting controlled fires to ensure that they’re trained and equipped to contain a real blaze. Chaos experiments are designed and implemented following a process similar to the scientific method:

  1. Define a system’s steady state according to a measurable output that indicates normal behavior.
  2. Formulate a hypothesis. Ask, “What could go wrong?” and “How will the system respond when this thing goes wrong?” For example, what happens if a server crashes, or a network connection is severed?
  3. Design the smallest possible experiment to test this hypothesis. Because the experiment will be conducted on a system in production, make sure to have a rollback plan in case things go off the rails!
  4. Execute the experiment, increasing the “blast radius” if necessary to induce a measurable impact. Measure this impact along the way.
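The four steps above can be sketched in code. What follows is a minimal Python sketch under stated assumptions, not a production chaos tool: the Experiment class, its fields, and run_experiment are hypothetical names introduced for illustration, and the measurement, the fault, and the rollback are supplied by the caller as plain functions. The hypothesis is expressed as "the metric stays within a tolerance of its steady-state value."

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("chaos")


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment."""
    name: str
    measure: Callable[[], float]       # returns the steady-state metric
    inject_fault: Callable[[], None]   # introduces the fault
    rollback: Callable[[], None]       # undoes the fault
    tolerance: float                   # allowed deviation from the baseline


def run_experiment(exp: Experiment) -> bool:
    """Return True if the metric stayed within tolerance of the baseline."""
    # Step 1: record the steady state before touching anything.
    baseline = exp.measure()
    logger.info("experiment=%s phase=baseline value=%.4f", exp.name, baseline)
    try:
        # Steps 3 and 4: inject the fault and measure its impact.
        exp.inject_fault()
        observed = exp.measure()
        logger.info("experiment=%s phase=fault value=%.4f", exp.name, observed)
    finally:
        # The rollback runs even if the fault or the measurement raises.
        exp.rollback()
        logger.info("experiment=%s phase=rollback", exp.name)
    # Step 2: the hypothesis holds if the deviation is within tolerance.
    return abs(observed - baseline) <= exp.tolerance
```

Placing the rollback in a finally clause reflects the advice in step 3: whatever happens during the experiment, the fault is removed before the function returns.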

The benefits of chaos engineering

Why would anyone dare experiment on a system in production, regardless of how controlled the experiments are? Because while many DevOps testing tools exist, these tools have limits. They can only test foreseen failures. Chaos experiments uncover the problems that IT teams could not have foreseen, which are the very type of problems that cause system outages.

In addition to uncovering otherwise unforeseen issues, chaos engineering enables IT teams to build a much deeper understanding of how their data environments behave under real-world conditions. Armed with this understanding, engineers can build systems that aren’t just resilient but are “antifragile,” meaning that they don’t just keep running in the face of failures but improve with each event. 

From this perspective, chaos engineering bridges the gap between DevOps teams, who want to push changes as quickly as possible, and site reliability engineers, who are concerned with keeping systems running. The end goal of chaos engineering is an antifragile data environment where DevOps teams can scale systems, introduce new apps and features, and make other changes without compromising system reliability or performance.

Using logging in chaos engineering

Before a chaos experiment is performed, a system absolutely must be in a steady state. If a system isn’t stable to begin with, the chaos experiment could cause an outage, which will sour company leadership on any further experiments while also failing to produce any meaningful insights. Additionally, a chaos experiment requires a baseline of normal behavior to measure against. Log analysis is essential to ensuring that a system is stable enough to begin experimenting, and it also establishes the baseline needed to glean actionable insights from an experiment.
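As a rough illustration of how logs can supply both the stability check and the baseline, the sketch below extracts latency values from log lines and treats the system as steady when their relative spread is small. The log format (a latency_ms=<number> field), the function names, and the 10% threshold are all assumptions made for this example. A real environment would use its own log schema and its own definition of stable.

```python
import re
import statistics

# Assumed log format: each request line carries a field like "latency_ms=123".
LATENCY = re.compile(r"latency_ms=(\d+(?:\.\d+)?)")


def latencies(log_lines):
    """Extract latency values (in ms) from lines that contain the field."""
    values = []
    for line in log_lines:
        match = LATENCY.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


def summarize(values):
    """Baseline summary: mean and population standard deviation."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),
    }


def is_steady(values, max_cv=0.10):
    """Steady if the coefficient of variation (stdev / mean) is small.

    The 0.10 default is an arbitrary illustrative threshold.
    """
    if not values:
        return False  # no data means no evidence of stability
    summary = summarize(values)
    if summary["mean"] == 0:
        return summary["stdev"] == 0
    return summary["stdev"] / summary["mean"] <= max_cv
```

The summary computed before the experiment then serves as the baseline that measurements taken during the experiment are compared against.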

Logging during an experiment is crucial to understanding the results. Without log analysis, it’s impossible to observe which parts of the system are impacted by the fault and how they are impacted.
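One way to make the impact of a fault visible in the logs is to stamp every log line with the identifier of the experiment that is currently running, so that lines written during the fault can be separated from normal traffic afterward. The sketch below does this with a filter from Python's standard logging module. The ExperimentFilter class, the build_logger helper, the "service" logger name, and the experiment=<id> field are hypothetical choices for this example.

```python
import logging
import sys


class ExperimentFilter(logging.Filter):
    """Attach the current experiment ID to every record it sees."""

    def __init__(self):
        super().__init__()
        self.experiment_id = "none"  # value used outside any experiment

    def filter(self, record):
        record.experiment_id = self.experiment_id
        return True  # never drop a record, only annotate it


def build_logger(stream=sys.stderr):
    """Return a logger plus the filter that controls its experiment tag."""
    tag = ExperimentFilter()
    handler = logging.StreamHandler(stream)
    handler.addFilter(tag)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s experiment=%(experiment_id)s %(message)s"
    ))
    logger = logging.getLogger("service")
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    return logger, tag
```

Setting tag.experiment_id just before the fault is injected, and resetting it to "none" after the rollback, lets a later query select only the lines written while the experiment was active.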

What should be logged during a chaos experiment?

While specifics depend on the experiment and data environment, Google’s “four golden signals” of monitoring distributed systems are an excellent starting point.

  1. Latency, or the time it takes to service a request. To avoid misleading calculations, make sure to distinguish between the latency of successful requests and the latency of failed requests.
  2. Traffic, or how much demand is being placed on the system. This is measured in a high-level system-specific metric, such as HTTP requests per second for a web service or transactions and retrievals per second for a key-value storage system.
  3. Errors, or the rate of requests that fail explicitly (e.g., HTTP 500s), implicitly (e.g., an HTTP 200 success response, but coupled with the wrong content), or by policy (e.g., “If you committed to one-second response times, any request over one second is an error”).
  4. Saturation, or a measure of how “full” your system is, with an emphasis on the resources that are most constrained. A utilization target is essential, as many systems degrade in performance before achieving 100% utilization.
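To make the four signals concrete, here is a small Python sketch that derives them from a batch of request records. The Request type, the golden_signals function, and the one-second policy threshold are assumptions for illustration. Saturation is passed in as a number because it normally comes from resource metrics rather than request logs, and implicit errors (a success response with the wrong content) are not detected here, since that requires inspecting the response itself.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical parsed log record for one request."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to service the request


def _mean(values):
    return sum(values) / len(values) if values else None


def golden_signals(requests, window_seconds, utilization, slo_ms=1000.0):
    """Summarize latency, traffic, errors, and saturation for one window.

    A request counts as failed if it returned a 5xx status (explicit)
    or took longer than slo_ms (failure by policy).
    """
    def failed(r):
        return r.status >= 500 or r.latency_ms > slo_ms

    bad = [r.latency_ms for r in requests if failed(r)]
    good = [r.latency_ms for r in requests if not failed(r)]
    return {
        # Latency is reported separately for successes and failures.
        "latency_ok_ms": _mean(good),
        "latency_failed_ms": _mean(bad),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(bad) / len(requests) if requests else 0.0,
        # Utilization of the most constrained resource, supplied by caller.
        "saturation": utilization,
    }
```

Computing these values once before the experiment and again during it gives a like-for-like comparison between the baseline and the system under fault.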

Chaos engineering is an exceptionally powerful discipline that is transforming the way in which systems are being designed and built at some of the world’s largest businesses. Testing systems in production is ethical and low-risk so long as the experiment is carefully planned, with a contained blast radius, a rollback plan, and of course buy-in from all applicable organizational stakeholders.

Go forth and break things!
