Incident Postmortem: 08 June 2020
We encountered four interwoven bugs that caused a degradation of service in one of our production instances. As the bugs have particularly interesting implications and some significant lessons learned, we thought it would be prudent to explain the situation to the community to help the next person who may encounter one of these problems. The majority of the problems we encountered were external in nature, and we’re working toward mitigating the impact of future external problems.
Note: All times listed are in UTC.
On 08 June 2020 at 10:35, our on-call SRE received an alert that our production instance needed attention. Investigation began immediately.
When addressing an unusual problem with a spike in incoming data, we discovered four distinct but interconnected bugs that compounded the reverberations of the data spike. Our Elasticsearch clusters were exhibiting high thread pool counts that exacerbated the data spike, and when we attempted to increase our hardware allocation, we ran into three external bugs with our disks that triggered index corruption issues, further compounding the problem. While identifying mitigations and workarounds for two of the external bugs, we figured out a way to adjust our system to handle data spikes better and address the thread pool count in the process.
Detailed Times (in UTC)
A customer’s incoming data started spiking. This spike didn’t start impacting other systems until our first alert at 10:35.
We received an alert, which kicked off our incident management process.
2020-06-08 11:31 –2020-06-09 14:58
We found our first indication of a “noisy neighbor” problem: A customer having a spike in incoming data that was so large other clusters were affected as the load balanced. Our typical mitigation strategy wasn’t working very well, and we started investigating why. One issue we identified was that our prioritization strategies were not working properly.We also were getting a number of noisy alerts about pod numbers that briefly sent us down the wrong path.Elasticsearch high thread pool issues arose that were not like any we had seen before. We continued to mitigate with manual intervention to keep the system running while we tried to identify the underlying issue.
2020-06-09 15:30 –2020-06-10 02:00
We started noticing Elasticsearch pods on various clusters had high thread pools and were constantly restarting. We attempted to fix it by clearing the Elasticsearch cache and adjusting how different components could request CPU resources to free up those resources for Elasticsearch. Neither approach worked, and we continued to manage the environment manually while searching for a cause.
2020-06-10 05:00 –16:45
We provisioned new hardware for our backend systems to scale up and manage the load better. While setting up the new hardware, we continued to manage the environment and kept hitting the threadpool problem.
2020-06-10 16:45 –2020-06-11 04:45
When scaling up the new hardware, we discovered a bug, which we’ll call here a group label bug, in a third-party provider that blocked connections from Elasticsearch to storage volumes provisioned on the new hardware. We engaged with the third-party vendor and began to work around the problem manually.
2020-06-11 04:45 –09:30
While still working around the group label bug, we observed random Elasticsearch pod restarts all around the clusters, seemingly at random, making the Elasticsearch recovery even more difficult. We continued engaging with the third-party vendor while managing the environment as best we could.
We identified patterns in the pod restarts. The restarts did not only affect Elasticsearch pods; all disk-backed pods on one hardware node would restart simultaneously, then another random node would proceed through the same restart cycle. This was our first suspicion of a second bug related to the third-party provider.
We confirmed that we encountered a bug with Stork, which our third-party storage provider uses for orchestration. A Stork healthcheck that ensures the storage driver is running on a node killed disk-backed pods on any nodes that did not seem to be running that storage driver. Despite these nodes actually having Portworx (our storage driver) running, the healthcheck didn’t identify the running instance and therefore killed all of the pods on the node. We disabled the healthchecks to stop the pods from crashing while the core devs from the Stork project started digging.
2020-06-11 13:00 –18:15
We continued to manage the environment as it started to stabilize a bit after the healthcheck bug was identified and mitigated, and we continued to work around the group label bug manually to scale up. We still were seeing high thread pool issues, which slowed down Elasticsearch. We decided to lock all Elasticsearch indices from past days and stop shard assignment from all days other than the current day.
2020-06-11 19:15 –20:00
We disabled thin provisioning on nodes, which is enabled by default, to manage our resources in a more controlled fashion.
2020-06-11 22:30 –2020-06-12 02:00
We figured out there was a strong correlation between high merge counts and high thread pool queue in our Elasticsearch pods. As we continued to try to identify why we were dealing with high thread pool issues, we continued to manage the environment.
2020-06-12 02:21 –04:50
While trying to manage the environment, we observed a sudden drop in our deferred backlog and realized that some of the batches in the deferred queue hit our 18-hour retention limit, causing them to get deleted from disk. We immediately worked on the data loss and engaged with our Elasticsearch experts to understand how we might mitigate the data loss.
2020-06-12 05:00 –08:00
We hit a setback with the group label bug where nodes rebooted and reset back to the state where the bug prevented volume mounting, so we had to re-decommission the nodes. Meanwhile, we worked with our Elasticsearch experts to handle the thread pool issue while continuing to manage the environment and investigate the data loss.
We believed we had gotten the situation under control and had updated the status page accordingly.
2020-06-12 09:30 –12:15
While monitoring the environment since the incident was still open until all Elasticsearch clusters had fully recovered, the oncall SRE noticed some Elasticsearch pods living on the same hardware node crash-looping with an Elasticsearch error about index corruption. This set of errors kicked off more investigation and another round of environment management.We discovered that one of the nodes was missing primaries on a volume that showed corruption errors. As a result, we engaged with our third-party vendor to see if we were facing disk corruption.
2020-06-12 13:00 –18:00
We thought we had encountered an NVME bug that corrupted the data and started digging to confirm or refute that theory. By 15:37, we were able to confirm that the NVME bug was inapplicable to our environment and therefore a red herring. Meanwhile, we continued to manage the environment due to the initial noisy neighbor hitting again and the continual thread pool issue, and we continued all of the previous investigations with the assistance of our Elasticsearch experts.
2020-06-12 18:30 –20:15
We found and started allocating stale primaries for all of the unassigned shards that are around in an attempt to retrieve lost data. The nodes holding the stale primaries had not been touched before the data corruption began, so they were less likely to have corruption.
We found the same index corruption error on a different hardware node and its Elasticsearch pods.
2020-06-13 ~02:20 –3:00
We started to provision more nodes to handle increased load and allocated more stale primaries on certain clusters to recover data. Some customers on these clusters experienced a degradation in service of the search function while this action was performed.
We discovered one corrupted Elasticsearch pod that had a few primary shards from unreplicated indices, so there was no replica shard available to use to recover lost data.
2020-06-13 03:25 –04:50
We released a set of hotfixes to handle this new version of the noisy neighbor problem. We immediately began seeing significant improvement in the clusters later attributed to the hotfix.
2020-06-13 05:31 –05:49
The new nodes were ready and bootstrapping began.
We again believed we had the situation under control and updated the status page accordingly.
A final hotfix was released to handle the last of the edge cases.
2020-06-13 10:57 –onward
We moved to more monitoring the incident rather than actively managing the environment as it appeared the last of the hotfixes worked. We moved toward a stronger focus on recovery.
A final update to the status page was added to mark this incident as resolved.
We ran into a similar thread pool issue, but it didn’t seem to have the same effect as before with the new hotfix.
The oncall SRE observed that the increased speed from the hotfix caused a change in our deferred batch processing. As the problem seemed to be resolving itself generally quickly, the bug was considered a low severity and was relegated to the development backlog.
2020-06-15 15:56 –17:14
After some further monitoring during a particularly busy period that was similar to the trigger of the initial event, we decided the hotfix was holding and the incident was fully resolved in our incident management process.
First, a quick general note. Elasticsearch is built off of the Apache Lucene platform,1 and a lot of the following discussions will refer to the underlying Lucene processes and how they affected our Elasticsearch clusters.
The “Noisy Neighbor” Problem
The issue of one customer generating a data spike that kicks off a shift of resources to manage that spike has been something we’ve managed through a semi-automated system in the past. Normally, this spike would be a small blip on the radar at most, requiring minimal intervention to offset. While this factor was mainly a trigger that kicked off the incident, we were able to add a more code-based, permanent solution to our architecture as a result that reduced the need for more manual intervention in similar future cases.
Merge Threads Problem
Lucene has the concept of an index, which is essentially a specialized data store that holds documents--objects that hold data in a standardized format. These indices are further broken down into segments, or sub-indices, that are immutable, or unable to be changed once created. Lucene will merge these segments together to create larger segments as more segments are generated when more data is added to the overall index.2 Elasticsearch’s shards are essentially Lucene indices, and therefore, by extension, Elasticsearch shards also have this merge function. Elasticsearch also uses the thread pool design pattern throughout its architecture to manage the functions like bulk insertion of data into shards and flushing indices to disk.
In our case, we were constantly hitting a high thread pool count while data were getting written, which caused significant lag times. Generally, this high thread pool count is related to the noisy neighbor problem with significant spikes of data incoming to the system. Here, the coupling of the noisy neighbor problem with the external factors we were dealing with made the issue appear in an unusual form. The fact that the hotfix reduced the impact of high thread pool counts on our architecture bolsters this argument.
Incidentally, the thread pool spike could also account for the initial impact on search when the data spike occurred. High thread pool counts on merge can be a symptom of a large number of segments being created all at once as large amounts of data are added to the index, and that rapid turnover of memory cache can cause invalidation of search results and therefore degradation of search availability.
Index Corruption and Storage Timeouts
After some investigation, the index corruption issue we saw on our longer-storage systems seemed to stem from a connection timeout to our third-party storage system, and this symptom does not seem to have been isolated to our systems.3 The underlying cause was out of our scope, but the technical recovery process is worth talking about separately.
Elasticsearch duplicates clusters across an entire ecosystem, creating primary and secondary (replica) shards on different clusters.4 The source of truth for a document in the cluster is the primary shard. Generally, whatever happens on the primary shard then is replicated to the secondary shards, and new secondaries can then be created or destroyed at any time. In our case, the secondaries are created over time, leaving a trail of “stale” copies of the data until those secondary shards are eventually assigned and wiped. The newer secondaries had the same data corruption issues as the failed primaries. However, the stale secondaries had the majority of the data from the primary shard without the corruption problem, and they luckily had not yet been wiped. We were able to manually reassign the secondaries as primaries, buffering them to memory and then flushing them to disk. In doing so, we were able to recover nearly all of the missing data from the corrupted indices.
Other External Bugs
Two other factors involved external sources where we ran into unique bugs that have since been reported and solved by third parties. As these bugs belong to other sources, we won’t spend too much time here explaining them. One bug involved labeling of groups of volumes in Kubernetes-based clusters.5 The other, as noted, was related to the Stork health check call. The health check didn’t seem to identify the storage driver that was running or that the storage driver was still booting, and that caused all of the disk-backed pods to get ejected from a node. For more information on this last bug, we refer you to the relevant pull request on the open-source project.
In a larger sense, one of the hallmarks of a mature platform is the ability to mitigate external factors swiftly with minimal impact on the customer’s experience. That hallmark is why the industry balances loads across datacenters and encourages rolling updates and restarts across microservices architectures, as just two examples. The aim of mitigation strategies is to handle issues gracefully, such as seamlessly transitioning a user’s session to a different datacenter if the first datacenter’s connectivity falters or preserving a known working function if an update to an external dependency breaks part of an application’s functionality. No one expects that a third party’s systems work all of the time. Instead, we all work off of service-level agreements and uptime statistics and then attempt to have worst-case scenarios planned and ready should something we rely on go down. And we find solid third-party partners that react swiftly to reports that something isn’t working properly.
We are continuing to monitor and tweak the hotfix we released to production that seemed to address our “noisy neighbor” problem once and for all. There will definitely be more optimization that is needed simply because it was a hotfix released at speed to address a pressing issue in production. Examples of this optimization include the following potential investigations:
- Improving our cluster balance based on the new metrics around data spikes to ensure no single cluster can take over all of the processing power available in a cluster when encountering a spike in incoming data.
- Decoupling our clusters by ensuring our worker pools pause processing on problematic clusters while maintaining throughput on all other clusters.
- Revising our auto-resume functionality, which would automatically attempt to resume problematic clusters in an exponential backoff fashion but which actually compounded the issue as work from the slow clusters would leak into the worker pool and slow the total worker throughput down. For now, the SRE team will make the determination of when to pause and resume clusters, and we may modify our data pipeline to provide finer throttling tools for our SRE team when there are spikes in data.
In regards to the external factors that complicated this incident, we generally were lucky to get engagement quickly from our third-party partners. We have some new internal mitigation techniques to standardize in case we run into similar issues in the future:
- First, we are working to ensure that we are running the same version of our storage driver in every environment. Due to code freezes requested by various business partners and a transition to a different storage driver that is more robust, our environments have gotten out of sync, and we did not have a set way to catch up everything in a safe, controlled manner. We are standardizing that catchup process and adding in additional testing capacity to ensure that any process we do land on does not introduce further errors. We are moving everything to the more robust storage driver, as well, which should address some of the errors we have encountered in other incidents.
- We are adding additional testing capacity to our post-upgrade and post-deployment processes, such as canary processes, to reduce the potential for uncaught or unknown bugs to affect our production environments. This addition will help in the case of unidentified kernel bugs, for example, so we can immediately mitigate and roll back as necessary.
- We are adding additional checks and tooling to ensure that no shard is unreplicated in any of our environments to prevent data loss as we experienced here. While we may still run into similar index corruption issues in the future, replication and better backups will mitigate that external factor. We’re also standardizing the manual secondaries assignment to replicate the process in any necessary future incident.
- We are standardizing the procedure generated during the incident to manage the high thread pool issue in Elasticsearch. By writing up and standardizing this playbook, we aim to ensure anyone on the team can replicate the process efficiently and accurately, addressing issues as soon as they arise rather than allowing them to build up while we go back to remember how to manage the issue again.
- Finally, we are improving our tooling to ensure faster transparency with our customers when we do encounter incidents that affect them. We aim to ensure that customers who are immediately affected get faster communications and provide better information to our customer support and success teams.
While this incident mostly ended up being compounded by external factors, we learned more about how to mitigate those factors and addressed some underlying issues with our infrastructure that the attempts to mitigate the problems uncovered. Kudos goes to our partners that stepped in immediately to help understand the problem and work on their respective ends to address underlying issues that we encountered. We also were pleased that our new notification systems for customers within the app seemed to work so well in our first test of our initial action items from the last postmortem.
1. See https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up for a still-relevant, solid dive into how Elasticsearch is built on Lucene. ↩