Incident Postmortem: 08 June 2020

4 MIN READ

We encountered four interwoven bugs that caused a degradation of service in one of our production instances. As the bugs have particularly interesting implications and some significant lessons learned, we thought it would be prudent to explain the situation to the community to help the next person who may encounter one of these problems. The majority of the problems we encountered were external in nature, and we’re working toward mitigating the impact of future external problems.

Note: All times listed are in UTC.

On 08 June 2020 at 10:35, our on-call SRE received an alert that our production instance needed attention. Investigation began immediately.

What Happened

When addressing an unusual problem with a spike in incoming data, we discovered four distinct but interconnected bugs that compounded the effects of the data spike. Our Elasticsearch clusters were exhibiting high thread pool counts that exacerbated the impact of the data spike, and when we attempted to increase our hardware allocation, we ran into three external bugs with our disks that triggered index corruption issues, further compounding the problem. While identifying mitigations and workarounds for two of the external bugs, we figured out a way to adjust our system to handle data spikes better and address the thread pool counts in the process.

Detailed Times (in UTC)

2020-06-08 ~07:00

A customer’s incoming data started spiking. This spike didn’t start impacting other systems until our first alert at 10:35.

2020-06-08 10:35

We received an alert, which kicked off our incident management process.

2020-06-08 11:31 – 2020-06-09 14:58

We found our first indication of a “noisy neighbor” problem: a customer’s spike in incoming data was so large that other clusters were affected as the load balanced across them. Our typical mitigation strategy wasn’t working very well, and we started investigating why. One issue we identified was that our prioritization strategies were not working properly. We also received a number of noisy alerts about pod counts that briefly sent us down the wrong path. Elasticsearch high thread pool issues arose that were unlike any we had seen before. We continued to mitigate with manual intervention to keep the system running while we tried to identify the underlying issue.

2020-06-09 15:30 – 2020-06-10 02:00

We started noticing Elasticsearch pods on various clusters had high thread pools and were constantly restarting. We attempted to fix it by clearing the Elasticsearch cache and adjusting how different components could request CPU resources to free up those resources for Elasticsearch. Neither approach worked, and we continued to manage the environment manually while searching for a cause.
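As a reference for the cache-clearing step, here is a minimal sketch using Elasticsearch’s REST API via Python. The endpoint is a placeholder and authentication is omitted; the CPU-request change was a separate Kubernetes adjustment not shown here.

```python
import requests

# Placeholder endpoint; in practice this points at the affected cluster.
ES = "http://localhost:9200"

# Clear the fielddata, query, and request caches on all indices.
resp = requests.post(f"{ES}/_cache/clear")
resp.raise_for_status()
print(resp.json())
```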

2020-06-10 05:00 – 16:45

We provisioned new hardware so our backend systems could scale up and manage the load better. While setting up the new hardware, we continued to manage the environment and kept hitting the thread pool problem.

2020-06-10 16:45 – 2020-06-11 04:45

When scaling up the new hardware, we discovered a bug, which we’ll call here a group label bug, in a third-party provider that blocked connections from Elasticsearch to storage volumes provisioned on the new hardware. We engaged with the third-party vendor and began to work around the problem manually.

2020-06-11 04:45 – 09:30

While still working around the group label bug, we observed Elasticsearch pod restarts across the clusters, seemingly at random, which made the Elasticsearch recovery even more difficult. We continued engaging with the third-party vendor while managing the environment as best we could.

2020-06-11 09:30

We identified patterns in the pod restarts. The restarts did not only affect Elasticsearch pods; all disk-backed pods on one hardware node would restart simultaneously, then another random node would proceed through the same restart cycle. This was our first suspicion of a second bug related to the third-party provider.
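For context, a pattern like this (every disk-backed pod on one node restarting together) tends to jump out when restarts are grouped by node. Below is a minimal sketch using the standard Kubernetes Python client; cluster credentials are assumed to come from the local kubeconfig.

```python
from collections import defaultdict

from kubernetes import client, config

# Load credentials from the local kubeconfig.
config.load_kube_config()
v1 = client.CoreV1Api()

# Sum container restart counts per hardware node; a node whose pods are all
# cycling together stands out immediately.
restarts_by_node = defaultdict(int)
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    node = pod.spec.node_name
    for status in pod.status.container_statuses or []:
        restarts_by_node[node] += status.restart_count

for node, restarts in sorted(restarts_by_node.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {restarts} container restarts")
```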

2020-06-11 09:35

We confirmed that we had encountered a bug with Stork, which our third-party storage provider uses for orchestration. A Stork health check that ensures the storage driver is running on a node was killing disk-backed pods on any node that did not appear to be running that storage driver. Even though Portworx (our storage driver) was actually running on these nodes, the health check failed to detect the running instance and therefore killed all of the pods on the node. We disabled the health checks to stop the pods from crashing while the core developers from the Stork project started digging.

2020-06-11 13:00 – 18:15

We continued to manage the environment as it started to stabilize a bit after the health check bug was identified and mitigated, and we continued to work around the group label bug manually so we could scale up. We were still seeing high thread pool issues, which slowed down Elasticsearch, so we decided to lock all Elasticsearch indices from past days and stop shard assignment for all days other than the current day.
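For illustration, that locking and allocation change can be applied through standard Elasticsearch index settings (index.blocks.write and index.routing.allocation.enable). The sketch below uses a placeholder endpoint and hypothetical daily index names, not our actual configuration.

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

# Hypothetical daily index names; our real naming scheme differs.
older_indices = ["logs-2020-06-09", "logs-2020-06-10"]

for index in older_indices:
    # Block writes to the older index and stop the cluster from (re)assigning
    # its shards, leaving only the current day's index fully active.
    settings = {
        "index.blocks.write": True,
        "index.routing.allocation.enable": "none",
    }
    requests.put(f"{ES}/{index}/_settings", json=settings).raise_for_status()
```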

2020-06-11 19:15 – 20:00

We disabled thin provisioning on nodes, which is enabled by default, to manage our resources in a more controlled fashion.

2020-06-11 22:30 – 2020-06-12 02:00

We identified a strong correlation between high merge counts and high thread pool queues in our Elasticsearch pods. While we worked to understand why we were dealing with the high thread pool issues, we continued to manage the environment.
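A correlation like this can be spotted with Elasticsearch’s standard monitoring APIs. The sketch below, which assumes a recent Elasticsearch version and a placeholder endpoint, pulls thread pool queue depths from _cat/thread_pool and in-flight merge counts from the nodes stats API.

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

# Thread pool queue and rejection counts per node (the "write" pool handles
# bulk indexing on recent Elasticsearch versions).
pools = requests.get(
    f"{ES}/_cat/thread_pool/write,search",
    params={"format": "json", "h": "node_name,name,active,queue,rejected"},
).json()

# In-flight merge counts per node from the nodes stats API.
stats = requests.get(f"{ES}/_nodes/stats/indices/merges").json()
merges_by_node = {
    node["name"]: node["indices"]["merges"]["current"]
    for node in stats["nodes"].values()
}

for pool in pools:
    print(pool["node_name"], pool["name"],
          "queue:", pool["queue"], "rejected:", pool["rejected"],
          "merges in flight:", merges_by_node.get(pool["node_name"]))
```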

2020-06-12 02:21 – 04:50

While trying to manage the environment, we observed a sudden drop in our deferred backlog and realized that some of the batches in the deferred queue had hit our 18-hour retention limit and been deleted from disk. We immediately began working on the data loss and engaged our Elasticsearch experts to understand how we might mitigate it.

2020-06-12 05:00 – 08:00

We hit a setback with the group label bug where nodes rebooted and reset back to the state where the bug prevented volume mounting, so we had to re-decommission the nodes. Meanwhile, we worked with our Elasticsearch experts to handle the thread pool issue while continuing to manage the environment and investigate the data loss.

2020-06-12 08:32

We believed we had gotten the situation under control and had updated the status page accordingly.

2020-06-12 09:30 – 12:15

While monitoring the environment (the incident remained open until all Elasticsearch clusters had fully recovered), the on-call SRE noticed some Elasticsearch pods living on the same hardware node crash-looping with an Elasticsearch error about index corruption. This set of errors kicked off more investigation and another round of environment management. We discovered that one of the nodes was missing primaries on a volume that showed corruption errors, so we engaged with our third-party vendor to determine whether we were facing disk corruption.
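Missing primaries show up as unassigned primary shards in the cluster state. The sketch below, using a placeholder endpoint, lists them with the _cat/shards API along with the reason Elasticsearch records for the unassignment.

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

# List every shard with its state; rows with prirep == "p" are primaries.
shards = requests.get(
    f"{ES}/_cat/shards",
    params={"format": "json", "h": "index,shard,prirep,state,unassigned.reason"},
).json()

missing_primaries = [
    s for s in shards if s["prirep"] == "p" and s["state"] == "UNASSIGNED"
]
for s in missing_primaries:
    print(s["index"], "shard", s["shard"], "unassigned because:", s["unassigned.reason"])
```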

2020-06-12 13:00 – 18:00

We suspected we had encountered an NVMe bug that corrupted the data and started digging to confirm or refute that theory. By 15:37, we were able to confirm that the NVMe bug did not apply to our environment and was therefore a red herring. Meanwhile, we continued to manage the environment because the initial noisy neighbor hit again and the thread pool issue persisted, and we continued all of the previous investigations with the assistance of our Elasticsearch experts.

2020-06-12 18:30 – 20:15

We found stale shard copies for all of the unassigned shards and started allocating them as primaries in an attempt to retrieve lost data. The nodes holding the stale copies had not been touched since before the data corruption began, so they were less likely to be corrupted.
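Before reallocating, the cluster allocation explain API shows which nodes still hold a copy of an unassigned primary and whether that copy is in sync or stale. A minimal sketch, with placeholder endpoint, index, and shard values:

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

# Ask why a specific primary shard is unassigned and which nodes still hold
# a copy of it; the index name and shard number here are placeholders.
explain = requests.get(
    f"{ES}/_cluster/allocation/explain",
    json={"index": "logs-2020-06-10", "shard": 0, "primary": True},
).json()

print("reason:", explain["unassigned_info"]["reason"])
for decision in explain.get("node_allocation_decisions", []):
    store = decision.get("store", {})
    # in_sync == False marks a stale copy that could still be promoted.
    print(decision["node_name"], decision["node_decision"],
          "in_sync:", store.get("in_sync"))
```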

2020-06-13 02:10

We found the same index corruption error on a different hardware node and its Elasticsearch pods.

2020-06-13 ~02:20 – 03:00

We started to provision more nodes to handle the increased load and allocated more stale primaries on certain clusters to recover data. Some customers on these clusters experienced degraded search performance while this work was performed.

2020-06-13 03:15

We discovered one corrupted Elasticsearch pod that held a few primary shards from unreplicated indices, so there were no replica shards available to recover the lost data.

2020-06-13 03:25 – 04:50

We released a set of hotfixes to handle this new version of the noisy neighbor problem and immediately began seeing significant improvement in the clusters, which we later attributed to the hotfix.

2020-06-13 05:31 – 05:49

The new nodes were ready and bootstrapping began.

2020-06-13 06:07

We again believed we had the situation under control and updated the status page accordingly.

2020-06-13 06:38

A final hotfix was released to handle the last of the edge cases.

2020-06-13 10:57 – onward

We moved to monitoring the incident rather than actively managing the environment, as it appeared the last of the hotfixes had worked, and we shifted toward a stronger focus on recovery.

2020-06-13 18:18

A final update to the status page was added to mark this incident as resolved.

2020-06-13 21:19

We ran into a similar thread pool issue, but with the new hotfix in place it didn’t have the same effect as before.

2020-06-13 21:56

The on-call SRE observed that the increased speed from the hotfix caused a change in our deferred batch processing. As the problem seemed to resolve itself fairly quickly, the bug was considered low severity and relegated to the development backlog.

2020-06-15 15:56 – 17:14

After some further monitoring during a particularly busy period that was similar to the trigger of the initial event, we decided the hotfix was holding and the incident was fully resolved in our incident management process.

Key Factors

First, a quick general note: Elasticsearch is built on the Apache Lucene platform,[1] and much of the following discussion refers to the underlying Lucene processes and how they affected our Elasticsearch clusters.

The “Noisy Neighbor” Problem

In the past, we have handled the issue of one customer generating a data spike, and the resulting shift of resources to absorb that spike, through a semi-automated system. Normally, such a spike would be a small blip on the radar at most, requiring minimal intervention to offset. While this factor was mainly the trigger that kicked off the incident, it prompted us to add a more code-based, permanent solution to our architecture that reduces the need for manual intervention in similar future cases.

Merge Threads Problem

Lucene has the concept of an index, which is essentially a specialized data store that holds documents: objects that hold data in a standardized format. These indices are further broken down into segments, or sub-indices, which are immutable once created. As more data is added to the overall index and more segments are generated, Lucene merges smaller segments together into larger ones.[2] Elasticsearch shards are essentially Lucene indices, so by extension Elasticsearch shards also have this merge function. Elasticsearch also uses the thread pool design pattern throughout its architecture to manage functions like bulk insertion of data into shards and flushing indices to disk.

In our case, we were constantly hitting high thread pool counts while data was being written, which caused significant lag. Generally, such high thread pool counts are related to the noisy neighbor problem, with significant spikes of data coming into the system. Here, the coupling of the noisy neighbor problem with the external factors we were dealing with made the issue appear in an unusual form. The fact that the hotfix reduced the impact of high thread pool counts on our architecture supports this interpretation.

Incidentally, the thread pool spike could also account for the initial impact on search when the data spike occurred. High thread pool counts on merge can be a symptom of a large number of segments being created all at once as large amounts of data are added to the index, and that rapid turnover of memory cache can cause invalidation of search results and therefore degradation of search availability.
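Because each burst of indexing creates new segments, one way to watch this pressure build is to count segments per shard. The sketch below assumes a placeholder endpoint and uses the _cat/segments API.

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

# Count Lucene segments per shard; a sudden jump during a data spike means the
# merge threads (and their pools) are about to get very busy.
segments = requests.get(
    f"{ES}/_cat/segments",
    params={"format": "json", "h": "index,shard,prirep,segment"},
).json()

per_shard = {}
for seg in segments:
    key = (seg["index"], seg["shard"], seg["prirep"])
    per_shard[key] = per_shard.get(key, 0) + 1

for (index, shard, prirep), count in sorted(per_shard.items(), key=lambda kv: -kv[1])[:10]:
    print(index, "shard", shard, prirep, "segments:", count)
```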

External Concerns

Index Corruption and Storage Timeouts

After some investigation, the index corruption we saw on our longer-term storage systems seemed to stem from a connection timeout to our third-party storage system, and this symptom does not appear to have been isolated to our systems.[3] The underlying cause was out of our scope, but the technical recovery process is worth discussing separately.

Elasticsearch replicates data across a cluster, creating primary and secondary (replica) shards on different nodes.[4] The source of truth for a document is the primary shard. Generally, whatever happens on the primary shard is then replicated to the secondary shards, and new secondaries can be created or destroyed at any time. In our case, the secondaries are created over time, leaving a trail of “stale” copies of the data until those secondary shards are eventually reassigned and wiped. The newer secondaries had the same data corruption issues as the failed primaries. However, the stale secondaries held the majority of the data from the primary shard without the corruption problem, and luckily they had not yet been wiped. We were able to manually reassign the stale secondaries as primaries, buffering them to memory and then flushing them to disk. In doing so, we were able to recover nearly all of the missing data from the corrupted indices.
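Mechanically, this reassignment relies on Elasticsearch’s cluster reroute API, which can promote a stale copy to primary when you explicitly accept the possible data loss. A minimal sketch of the approach, with placeholder endpoint, index, and node names (not our actual values):

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint
INDEX = "logs-2020-06-10"      # placeholder index name
NODE = "es-data-7"             # placeholder node known to hold the stale copy

# Promote the stale copy on NODE to primary. Elasticsearch requires the explicit
# accept_data_loss flag because a stale copy may be missing the most recent writes.
reroute = {
    "commands": [{
        "allocate_stale_primary": {
            "index": INDEX,
            "shard": 0,
            "node": NODE,
            "accept_data_loss": True,
        }
    }]
}
requests.post(f"{ES}/_cluster/reroute", json=reroute).raise_for_status()

# After recovery, flush the index so the restored data is persisted to disk
# rather than sitting only in memory buffers and the transaction log.
requests.post(f"{ES}/{INDEX}/_flush").raise_for_status()
```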

Other External Bugs

Two other factors involved external sources where we ran into unique bugs that have since been reported and solved by third parties. As these bugs belong to other projects, we won’t spend too much time explaining them here. One bug involved the labeling of groups of volumes in Kubernetes-based clusters.[5] The other, as noted, was related to the Stork health check call. The health check failed to recognize that the storage driver was running (or was still booting), which caused all of the disk-backed pods to be ejected from the node. For more information on this last bug, we refer you to the relevant pull request on the open-source project.

In a larger sense, one of the hallmarks of a mature platform is the ability to mitigate external factors swiftly with minimal impact on the customer’s experience. That hallmark is why the industry balances loads across datacenters and encourages rolling updates and restarts across microservices architectures, as just two examples. The aim of mitigation strategies is to handle issues gracefully, such as seamlessly transitioning a user’s session to a different datacenter if the first datacenter’s connectivity falters or preserving a known working function if an update to an external dependency breaks part of an application’s functionality. No one expects that a third party’s systems work all of the time. Instead, we all work off of service-level agreements and uptime statistics and then attempt to have worst-case scenarios planned and ready should something we rely on go down. And we find solid third-party partners that react swiftly to reports that something isn’t working properly.

Next Steps

We are continuing to monitor and tweak the hotfix we released to production, which seems to have addressed our “noisy neighbor” problem once and for all. More optimization will certainly be needed, simply because it was a hotfix released at speed to address a pressing issue in production. Examples of this optimization include the following potential investigations:

  • Improving our cluster balance based on the new metrics around data spikes, to ensure that no single cluster can take over all of the available processing power when it encounters a spike in incoming data.
  • Decoupling our clusters by ensuring our worker pools pause processing on problematic clusters while maintaining throughput on all other clusters.
  • Revising our auto-resume functionality. This functionality automatically attempted to resume problematic clusters with exponential backoff, but it actually compounded the issue: work from the slow clusters leaked into the worker pool and slowed down total worker throughput. For now, the SRE team will decide when to pause and resume clusters, and we may modify our data pipeline to give the SRE team finer throttling tools during data spikes. (A rough sketch of the pause-and-backoff idea follows this list.)
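To make the failure mode concrete, here is a purely illustrative sketch of a pause-with-exponential-backoff guard for a shared worker pool. Every name in it is hypothetical; it is not taken from our pipeline code.

```python
import time

# Hypothetical illustration only: none of these names come from our codebase.
# It shows why resuming a problematic cluster too eagerly compounds a noisy
# neighbor problem, and why backoff should grow while the cluster stays unhealthy.

PAUSED = {}  # cluster_id -> (pause_started_at, current_backoff_seconds)

def pause_cluster(cluster_id, base_backoff=60):
    PAUSED[cluster_id] = (time.time(), base_backoff)

def maybe_resume(cluster_id, is_healthy):
    """Resume only when the backoff window has elapsed AND the cluster looks healthy."""
    started, backoff = PAUSED[cluster_id]
    if time.time() - started < backoff:
        return False  # still inside the backoff window
    if not is_healthy:
        # Double the backoff instead of resuming; an early resume leaks slow
        # work back into the shared worker pool and drags total throughput down.
        PAUSED[cluster_id] = (time.time(), backoff * 2)
        return False
    del PAUSED[cluster_id]
    return True
```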

With regard to the external factors that complicated this incident, we were generally lucky to get quick engagement from our third-party partners. We also have some new internal mitigation techniques to standardize in case we run into similar issues in the future:

  • First, we are working to ensure that we are running the same version of our storage driver in every environment. Due to code freezes requested by various business partners and a transition to a different storage driver that is more robust, our environments have gotten out of sync, and we did not have a set way to catch up everything in a safe, controlled manner. We are standardizing that catchup process and adding in additional testing capacity to ensure that any process we do land on does not introduce further errors. We are moving everything to the more robust storage driver, as well, which should address some of the errors we have encountered in other incidents.
  • We are adding additional testing capacity to our post-upgrade and post-deployment processes, such as canary processes, to reduce the potential for uncaught or unknown bugs to affect our production environments. This addition will help in the case of unidentified kernel bugs, for example, so we can immediately mitigate and roll back as necessary.
  • We are adding additional checks and tooling to ensure that no shard is unreplicated in any of our environments, to prevent data loss like we experienced here (a minimal example of such a check appears after this list). While we may still run into similar index corruption issues in the future, replication and better backups will mitigate that external factor. We’re also standardizing the manual secondary-assignment procedure so we can replicate the process in any future incident.
  • We are standardizing the procedure generated during the incident for managing the high thread pool issue in Elasticsearch. By writing up and standardizing this playbook, we aim to ensure anyone on the team can replicate the process efficiently and accurately, addressing issues as soon as they arise rather than letting them build up while we relearn how to manage the problem.
  • Finally, we are improving our tooling to ensure faster transparency with our customers when we do encounter incidents that affect them. We aim to ensure that customers who are immediately affected get faster communications and provide better information to our customer support and success teams.
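As a minimal example of the replication check mentioned a couple of bullets up, a periodic job could flag indices configured with zero replicas. The sketch below uses the standard _cat/indices API with a placeholder endpoint; it is illustrative rather than our production check.

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

# Flag any index configured with zero replicas so it can be fixed before an
# incident rather than during one.
indices = requests.get(
    f"{ES}/_cat/indices",
    params={"format": "json", "h": "index,rep,health"},
).json()

for idx in indices:
    if int(idx["rep"]) == 0:
        print("unreplicated index:", idx["index"], "health:", idx["health"])
        # One possible remediation: add a replica.
        # requests.put(f"{ES}/{idx['index']}/_settings",
        #              json={"index": {"number_of_replicas": 1}})
```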

Wrap Up

While this incident mostly ended up being compounded by external factors, we learned more about how to mitigate those factors and addressed some underlying infrastructure issues that the mitigation attempts uncovered. Kudos to our partners who stepped in immediately to help us understand the problem and worked on their respective ends to address the underlying issues we encountered. We were also pleased that our new in-app customer notification system, one of the initial action items from our last postmortem, seemed to work so well in its first real test.

1. See https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up for a still-relevant, solid dive into how Elasticsearch is built on Lucene.

2. See http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

3. See https://forums.portworx.com/t/failed-to-mount-volumes-in-pod-http-error-404/445

4. See https://www.elastic.co/guide/en/elasticsearch/reference/current/scalability.html

5. See https://docs.portworx.com/reference/release-notes/portworx/#2-5-0-2
