Managing Dynamic Data Flows Across Elasticsearch Clusters

Massively scaling free-text search has always been the holy grail in big data. Many software firms now face the growing challenge of searching through previously untapped data sources, and the volumes involved are trending well beyond the petabyte scale. Here at Mezmo, formerly known as LogDNA, we manage free-text search for thousands of customers with distinct traffic profiles across a multitude of Elasticsearch clusters. The mapping of this customer data across Elasticsearch clusters is dynamic and can become a huge headache to manage. This article details our Elasticsearch Index Manager (ESIM), which automates the mapping of disparate customer data sources to particular indices segmented within or between Elasticsearch clusters.

Mezmo's system maps each customer account to a particular Elasticsearch cluster and an index on that cluster; we refer to this mapping as the Account-Cluster Map (AC map). All components write customer data to the appropriate index, and when that data needs to be searched, each component queries multiple Elasticsearch clusters, stitches the responses into one, and passes the results downstream. When a customer creates a new account, we add an entry to the AC map based on their data flow rate and the available resources in the several Elasticsearch clusters we run. However, customer data flow is dynamic and fluctuates on hourly, daily, or even longer timescales. As customer data flow scales up or down, some Elasticsearch clusters are underutilized while others are overburdened, experiencing high throughput and slow insertion times. The AC map needs to be updated to reflect varying customer traffic patterns and cluster resource utilization.
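To make the write path and the stitched read path concrete, here is a minimal Python sketch. It is not Mezmo's implementation: the `AC_MAP` contents, the `Placement` type, and the `search` callable (standing in for a real per-cluster Elasticsearch query) are all hypothetical, and it assumes each cluster returns hits already sorted by a `ts` timestamp field.

```python
import heapq
from typing import Callable, Dict, Iterable, List, NamedTuple


class Placement(NamedTuple):
    cluster: str
    index: str


# Hypothetical AC map: account id -> the cluster and index its data is written to.
AC_MAP: Dict[str, Placement] = {
    "acct-1": Placement("cluster-small-1", "acct-1-logs"),
    "acct-2": Placement("cluster-large-1", "acct-2-logs"),
}

Hit = Dict[str, object]
# Stand-in for a real per-cluster query: (cluster, query) -> hits sorted by "ts".
SearchFn = Callable[[str, str], List[Hit]]


def write_target(account: str) -> Placement:
    """Look up where new documents for this account should be indexed."""
    return AC_MAP[account]


def search_all(clusters: Iterable[str], query: str, search: SearchFn) -> List[Hit]:
    """Query every cluster and stitch the sorted responses into one sorted list."""
    per_cluster = [search(cluster, query) for cluster in clusters]
    return list(heapq.merge(*per_cluster, key=lambda hit: hit["ts"]))
```

Merging already-sorted responses with `heapq.merge` avoids re-sorting the combined result; if the per-cluster responses were not sorted, a plain `sorted` over the concatenation would be the simpler choice.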

Updating the AC map was largely a manual task for our operations team, usually triggered by a spike in customer traffic or revised over time as the clusters became unbalanced. As the number of environments and on-premise deployments managed by the team grew, this manual task became cumbersome to carry out alongside their other responsibilities. This became the impetus for Elasticsearch Index Manager, a cron job that periodically scans data flow patterns and resource utilization, then balances account traffic across the appropriate clusters by updating the AC map.

Elasticsearch Index Manager begins by querying all available Elasticsearch clusters for their resources. This includes the number of shards and documents within each cluster, along with the total number of nodes available in the cluster. This gives ESIM insight into the available supply, and the clusters can be sorted by capacity. Next, ESIM must characterize the data flows into the clusters, so it queries the number of documents per minute (throughput) and total documents per index (total volume) for each customer account. This tells ESIM the total demand that customers generate for the available Elasticsearch resources. ESIM now has enough information to move larger accounts to clusters with more capacity and smaller accounts to clusters with less capacity. If the largest clusters do not have enough resources to handle all the large accounts, ESIM can also suggest adding resources to the Elasticsearch clusters in order to scale them up.
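The supply-and-demand matching described above can be sketched as a simple packing problem. The snippet below uses a best-fit-decreasing heuristic, which is an assumption chosen for illustration and not necessarily what ESIM does. The `Cluster` and `Account` types, the `plan` function, and the `DOCS_PER_NODE` budget are all hypothetical, and capacity is reduced to a single number derived from node count.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

DOCS_PER_NODE = 1_000_000  # assumed per-node document budget, not a Mezmo figure


@dataclass
class Cluster:
    name: str
    nodes: int
    shards: int = 0  # collected as supply data; unused in this simplified model
    docs: int = 0    # likewise

    @property
    def capacity(self) -> int:
        return self.nodes * DOCS_PER_NODE


@dataclass
class Account:
    name: str
    docs_per_min: float  # throughput
    total_docs: int      # total volume


def plan(clusters: List[Cluster], accounts: List[Account]) -> Tuple[Dict[str, str], List[str]]:
    """Return (account -> cluster assignments, accounts that fit nowhere)."""
    remaining = {c.name: c.capacity for c in clusters}
    assignments: Dict[str, str] = {}
    unplaced: List[str] = []
    # Place the largest demand first: volume, then throughput as a tie-breaker.
    by_demand = sorted(accounts, key=lambda a: (a.total_docs, a.docs_per_min), reverse=True)
    for acct in by_demand:
        fits = [name for name, free in remaining.items() if free >= acct.total_docs]
        if not fits:
            unplaced.append(acct.name)  # signal that a cluster should be scaled up
            continue
        # Best fit: the tightest cluster that still has room, so small accounts
        # land on small clusters and large clusters stay free for large accounts.
        best = min(fits, key=lambda name: (remaining[name], name))
        assignments[acct.name] = best
        remaining[best] -= acct.total_docs
    return assignments, unplaced
```

Any account returned in the second list corresponds to the scale-up suggestion: no existing cluster has room for it under the assumed budget.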

Today, Elasticsearch Index Manager enforces a threshold template set by the operations team to decide when to move a customer up to a larger cluster or down to one with fewer resources. The template describes the “type” of each cluster (small, medium, or large) and the minimum and maximum thresholds that trigger a move up from a smaller cluster to a larger one or vice versa. Thresholds are set for total documents in an index, documents added per minute to the index, and total shard size of the index. Each threshold is defined as a min or max and has an associated interval over which averages are calculated. If the total size, or the average size over the interval, reaches a max or min threshold, this triggers an update of the AC map. The operations team tweaks these thresholds to keep the clusters in a healthy state. As ESIM scans each account, it detects whether the account triggers any of the thresholds and prints the evidence for exactly how that threshold was met over time. This allows us to run ESIM in two modes: SUGGEST or ENFORCE. In SUGGEST mode, ESIM runs as above but only prints evidence for the AC map updates it wants to make for each account. In ENFORCE mode, ESIM actually updates the AC map with the recommendations derived from the threshold template.
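As a rough illustration of the threshold template and the two run modes, consider the Python sketch below. The `Threshold` fields, the metric names, the `scan_account` function, and the rule that a max hit takes priority over a min hit are assumptions made for this example. For brevity, the AC map here maps an account directly to a cluster type instead of a cluster and index.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean
from typing import Dict, List, Optional, Sequence, Tuple

TIERS = ["small", "medium", "large"]  # cluster "types" named in the template


class Mode(Enum):
    SUGGEST = "suggest"  # only print the proposed moves
    ENFORCE = "enforce"  # also write them to the AC map


@dataclass(frozen=True)
class Threshold:
    metric: str    # e.g. "total_docs", "docs_per_min", "shard_bytes" (assumed names)
    kind: str      # "max" proposes a move up, "min" a move down
    limit: float
    interval: int  # number of most recent samples to average


def evaluate(t: Threshold, samples: Sequence[float]) -> Optional[str]:
    """Return an evidence string if the latest value or the average meets the threshold."""
    if not samples:
        return None
    latest, avg = samples[-1], mean(samples[-t.interval:])
    if t.kind == "max":
        met = latest >= t.limit or avg >= t.limit
    else:
        met = latest <= t.limit or avg <= t.limit
    if not met:
        return None
    return f"{t.metric} {t.kind} {t.limit}: latest={latest}, avg(last {t.interval})={avg}"


def scan_account(
    account: str,
    metrics: Dict[str, Sequence[float]],
    template: Dict[str, List[Threshold]],
    ac_map: Dict[str, str],
    mode: Mode,
) -> Tuple[str, List[str]]:
    """Return (target tier, evidence); update ac_map only in ENFORCE mode."""
    tier = ac_map[account]
    ups: List[str] = []
    downs: List[str] = []
    for t in template.get(tier, []):
        evidence = evaluate(t, metrics.get(t.metric, []))
        if evidence:
            (ups if t.kind == "max" else downs).append(evidence)
    # Assumption: a "max" hit takes priority over a "min" hit.
    if ups:
        step, evidence_lines = 1, ups
    elif downs:
        step, evidence_lines = -1, downs
    else:
        step, evidence_lines = 0, []
    position = min(max(TIERS.index(tier) + step, 0), len(TIERS) - 1)
    target = TIERS[position]
    if target != tier:
        for line in evidence_lines:
            print(f"[{mode.name}] {account}: {tier} -> {target} because {line}")
        if mode is Mode.ENFORCE:
            ac_map[account] = target
    return target, evidence_lines
```

Running the same scan in SUGGEST mode first lets operators review the printed evidence before switching to ENFORCE, since the only difference between the modes is the final write to the map.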

In a world where we manage petabytes of fluctuating data flows across multiple Elasticsearch clusters, routing the data to use resources efficiently becomes a major concern. The amount of manual intervention required to manage a system like this can be quite draining on operators. An automated system that attempts to balance dynamic traffic patterns across clusters with varying resources can help alleviate the burden. At Mezmo, implementing ESIM has improved resource utilization across clusters and minimized the time that operations spends responding to customer traffic spikes. In my experience, the best solution for searching free-text big data is to horizontally scale Elasticsearch clusters. I’d love to hear how you manage petabytes of data across your Elasticsearch clusters.

If you'd like to learn more, you can check out our on-demand webinar: Elasticsearch Index Manager.
