Querying Multiple ElasticSearch Clusters with Cross-Cluster Search

4 MIN READ
MIN READ

With great power comes great responsibility, and with big data comes the necessity to query it effectively. ElasticSearch has been around for over seven years and changed the game in terms of running complex queries on big data (petabyte scale). Tasks like e-commerce product search, real-time log analysis for troubleshooting or generally anything that involves querying big data is considered “data intensive”. ElasticSearch is a distributed, full-text search engine, or database, and the key word here is “distributed”.A lot of small problems are much easier to deal with than a few big ones, and DevOps is all about spreading out dependencies and responsibility so it’s easier on everyone. ElasticSearch uses the same concept to help query big data; it’s also highly scalable and open-source. Imagine you need to setup an online store, a private “Google search box” that your customers could use to search for anything in your inventory. That’s exactly what ElasticSearch can do for your application monitoring and logging data. It stores all your data, or in the context of our post, all your logging data in nodes that make up several clusters.

ElasticSearch and Data Analysis

Staying with the online store example, modern day queries can get pretty technical and a customer could, for example, be looking for only products in a certain price range, or a certain color, or a certain anything. Things can get more complicated if you’re also running a price alerting system that lets customers set alerts if things on their wish list drop below a certain price. ElasticSearch gives you those full-text search and analytics capabilities by breaking data down into nodes, clusters, indexes, types, documents, shards and replicas. This is how it allows you to store, search, and analyze big data quickly and in “near” real time (NRT).The architecture and design of ElasticSearch is based on the assumption that all nodes are located on a local network. This is the use case that is extensively tested for and in a lot of cases is the environment that users operate in. However, monitoring data can be stored on different servers and clusters and to query them, ElasticSearch needs to run across clusters. If your clusters are at different remote locations, this is where ElasticSearch’s assumption that all nodes are on the same network starts working against you. When data is stored across multiple ElasticSearch clusters, querying it becomes harder.

Global Search

sar_helicopter

Network disruptions are much more common between distributed infrastructure (even with a dedicated link) along with a host of other problems. Workarounds are what adapting to new technology is all about, and there have been a number of them -- one of the most recent and effective ones being tribe nodes. ElasticSearch has a number of different use cases within organizations and are spread across departments. It could be used for logging visitor data in one, analyzing financial transactions in another, and deriving insights from social media data across a third.Since data resides across the cluster on different nodes, some complex queries need to get data from multiple nodes and process them; for that you need to query multiple clusters. If all these clusters are not at the same physical location, tribe node connects them and lets you deal with them like one big cluster. What makes the tribe node unique is that it doesn’t impose any restrictions on core APIs like the cross-cluster search. The tribe node supports almost all APIs, with the exception of meta-level APIs like Create Index, for example, which must be executed on each cluster separately.

Tribe Nodes

The tribe node works by executing search requests across multiple clusters and merging the results from each cluster into a single global cluster result. It does this by actually joining each cluster as a node that keeps updating itself on the state of the cluster. This uses considerable resources, as the node has to acknowledge every single cluster state update from every remote cluster.Additionally, with tribe nodes, the node that receives the request (the corresponding node) basically does all the work. This means the node that receives the request identifies which indices, shards, and nodes the search has to be executed against. It sends requests to all relevant nodes, decides what the top N-hits that need to be fetched are and then actually fetches them.The tribe node is also very hard to maintain code-wise over time -- especially since it’s the only exception to ElasticSearch’s rule that a node must belong to one and only one cluster.

Cross-Cluster Search

If DevOps is about spreading the load around, it’s pretty obvious what the problem is with tribe node. One node is being taxed with all the processing work while the nodes not relevant to the query standby idle. With cross-cluster search, you’re actually remotely querying each cluster with its own _search APIs, so no additional nodes that need to be constantly updated would join the cluster and slow it down. When a search request is executed on a node, instead of doing everything itself, the node forwards the indices at a rate of one _search_shard request per cluster.The _search API allows ElasticSearch to execute searches, queries, aggregations, suggestions, and more against multiple indices which are in turn broken down into shards. The concept is that instead of having a huge Database A, Database B, Database C and so on, it merges everything into one giant solid block of data. The next step is to break it down into bits (shards) and give every node a piece to look after, worry about, care for, maintain, and query when required. This makes it a lot easier to query since the load is spread evenly across all the nodes.Now, unlike in tribe nodes where the first node would wait for every node to reply, do the math and fetch the documents, with cross-cluster search the initial node has done it’s job already. Once the shards are sent to all clusters for comparison, all further processing is done locally on the relevant clusters. Further time and processing power is saved by sending shards to only 3 nodes per cluster by default; you can also choose how many nodes per cluster you would like discovered.

One Direction

one-way-sign-1434558486ukg

Now traffic flows only one way in cross-cluster searches and that means the corresponding node just passes the message on to the three default nodes that carry on the process. You can also choose which nodes you would like to act as gateways and which nodes you would actually like to store data on. This gives you a lot more control over the traffic going in and out of your cluster. Again, unlike tribe nodes that require an actual additional node in each of your clusters, with cross-cluster search, no additional or special nodes are required for cross cluster searches and it isn’t tied to any specific API. Any node can act as a corresponding node and you can control which nodes get to be corresponding nodes and which nodes don’t. Furthermore, when merging clusters, tribe nodes can’t keep two indices with the same name even if they’re from different clusters. Cross-cluster search aims to fix this limitation by being able to dynamically register, update, and remove remote clusters.

The Need For Logging

There are also commercial algorithms built on ElasticSearch to make life and logging even easier. LogDNA is a good example, and we've been known to talk about our product in context of it being the “Apple of logging”. Along with predictive intelligence and machine learning, LogDNA allows users to aggregate, search and filter from all hosts and apps. LogDNA also features automatic parsing of fields from common log formats, such as weblogs, Mongo, Postgres and JSON. Additionally, we offer a live-streaming tail using a web interface or command line interface (CLI). LogDNA provides the power to query your log data end-to-end without having to worry about clusters, nodes, or indices. It does all the heavy lifting behind the scenes so you enjoy an intuitive and intelligent experience when analyzing your log data. As we discussed earlier, maintaining your own ElasticSearch stack and stitching together all the infrastructure and dependencies at every level is a pain. Instead, you’re better off saving all that time and opting for a third-party tool that abstracts away low-lying challenges and lets you deal with your log data directly. That’s what tools like LogDNA do. From not being able to query across clusters, to querying through a special node, to finally remotely querying across clusters, ElasticSearch is certainly making progress. In an age where data is “big”, nothing is as important as the ability to make it work for you and we can sure expect ElasticSearch to continue to make this feature better. However, if you’d rather save yourself all the effort of managing multi-cluster querying, and instead analyze and derive value from your log data, LogDNA is the way to go.

Table of Contents

    Share Article

    RSS Feed

    Next blog post
    You're viewing our latest blog post.
    Previous blog post
    You're viewing our oldest blog post.
    Mezmo + Catchpoint deliver observability SREs can rely on
    Mezmo’s AI-powered Site Reliability Engineering (SRE) agent for Root Cause Analysis (RCA)
    What is Active Telemetry
    Launching an agentic SRE for root cause analysis
    Paving the way for a new era: Mezmo's Active Telemetry
    The Answer to SRE Agent Failures: Context Engineering
    Empowering an MCP server with a telemetry pipeline
    The Debugging Bottleneck: A Manual Log-Sifting Expedition
    The Smartest Member of Your Developer Ecosystem: Introducing the Mezmo MCP Server
    Your New AI Assistant for a Smarter Workflow
    The Observability Problem Isn't Data Volume Anymore—It's Context
    Beyond the Pipeline: Data Isn't Oil, It's Power.
    The Platform Engineer's Playbook: Mastering OpenTelemetry & Compliance with Mezmo and Dynatrace
    From Alert to Answer in Seconds: Accelerating Incident Response in Dynatrace
    Taming Your Dynatrace Bill: How to Cut Observability Costs, Not Visibility
    Architecting for Value: A Playbook for Sustainable Observability
    How to Cut Observability Costs with Synthetic Monitoring and Responsive Pipelines
    Unlock Deeper Insights: Introducing GitLab Event Integration with Mezmo
    Introducing the New Mezmo Product Homepage
    The Inconvenient Truth About AI Ethics in Observability
    Observability's Moneyball Moment: How AI Is Changing the Game (Not Ending It)
    Do you Grok It?
    Top Five Reasons Telemetry Pipelines Should Be on Every Engineer’s Radar
    Is It a Cup or a Pot? Helping You Pinpoint the Problem—and Sleep Through the Night
    Smarter Telemetry Pipelines: The Key to Cutting Datadog Costs and Observability Chaos
    Why Datadog Falls Short for Log Management and What to Do Instead
    Telemetry for Modern Apps: Reducing MTTR with Smarter Signals
    Transforming Observability: Simpler, Smarter, and More Affordable Data Control
    Datadog: The Good, The Bad, The Costly
    Mezmo Recognized with 25 G2 Awards for Spring 2025
    Reducing Telemetry Toil with Rapid Pipelining
    Cut Costs, Not Insights:   A Practical Guide to Telemetry Data Optimization
    Webinar Recap: Telemetry Pipeline 101
    Petabyte Scale, Gigabyte Costs: Mezmo’s Evolution from ElasticSearch to Quickwit
    2024 Recap - Highlights of Mezmo’s product enhancements
    My Favorite Observability and DevOps Articles of 2024
    AWS re:Invent ‘24: Generative AI Observability, Platform Engineering, and 99.9995% Availability
    From Gartner IOCS 2024 Conference: AI, Observability Data, and Telemetry Pipelines
    Our team’s learnings from Kubecon: Use Exemplars, Configuring OTel, and OTTL cookbook
    How Mezmo Uses a Telemetry Pipeline to Handle Metrics, Part II
    Webinar Recap: 2024 DORA Report: Accelerate State of DevOps
    Kubecon ‘24 recap: Patent Trolls, OTel Lessons at Scale, and Principle Platform Abstractions
    Announcing Mezmo Flow: Build a Telemetry Pipeline in 15 minutes
    Key Takeaways from the 2024 DORA Report
    Webinar Recap | Telemetry Data Management: Tales from the Trenches
    What are SLOs/SLIs/SLAs?
    Webinar Recap | Next Gen Log Management: Maximize Log Value with Telemetry Pipelines
    Creating In-Stream Alerts for Telemetry Data
    Creating Re-Usable Components for Telemetry Pipelines
    Optimizing Data for Service Management Objective Monitoring
    More Value From Your Logs: Next Generation Log Management from Mezmo
    A Day in the Life of a Mezmo SRE
    Webinar Recap: Applying a Data Engineering Approach to Telemetry Data
    Dogfooding at Mezmo: How we used telemetry pipeline to reduce data volume
    Unlocking Business Insights with Telemetry Pipelines
    Why Your Telemetry (Observability) Pipelines Need to be Responsive
    How Data Profiling Can Reduce Burnout
    Data Optimization Technique: Route Data to Specialized Processing Chains
    Data Privacy Takeaways from Gartner Security & Risk Summit
    Mastering Telemetry Pipelines: Driving Compliance and Data Optimization
    A Recap of Gartner Security and Risk Summit: GenAI, Augmented Cybersecurity, Burnout
    Why Telemetry Pipelines Should Be A Part Of Your Compliance Strategy
    Pipeline Module: Event to Metric
    Telemetry Data Compliance Module
    OpenTelemetry: The Key To Unified Telemetry Data
    Data optimization technique: convert events to metrics
    What’s New With Mezmo: In-stream Alerting
    How Mezmo Used Telemetry Pipeline to Handle Metrics
    Webinar Recap: Mastering Telemetry Pipelines - A DevOps Lifecycle Approach to Data Management
    Open-source Telemetry Pipelines: An Overview
    SRECon Recap: Product Reliability, Burn Out, and more
    Webinar Recap: How to Manage Telemetry Data with Confidence
    Webinar Recap: Myths and Realities in Telemetry Data Handling
    Using Vector to Build a Telemetry Pipeline Solution
    Managing Telemetry Data Overflow in Kubernetes with Resource Quotas and Limits
    How To Optimize Telemetry Pipelines For Better Observability and Security
    Gartner IOCS Conference Recap: Monitoring and Observing Environments with Telemetry Pipelines
    AWS re:Invent 2023 highlights: Observability at Stripe, Capital One, and McDonald’s
    Webinar Recap: Best Practices for Observability Pipelines
    Introducing Responsive Pipelines from Mezmo
    My First KubeCon - Tales of the K8’s community, DE&I, sustainability, and OTel
    Modernize Telemetry Pipeline Management with Mezmo Pipeline as Code
    How To Profile and Optimize Telemetry Data: A Deep Dive
    Kubernetes Telemetry Data Optimization in Five Steps with Mezmo
    Introducing Mezmo Edge: A Secure Approach To Telemetry Data
    Understand Kubernetes Telemetry Data Immediately With Mezmo’s Welcome Pipeline
    Unearthing Gold: Deriving Metrics from Logs with Mezmo Telemetry Pipeline
    Webinar Recap: The Single Pane of Glass Myth
    Empower Observability Engineers: Enhance Engineering With Mezmo
    Webinar Recap: How to Get More Out of Your Log Data
    Unraveling the Log Data Explosion: New Market Research Shows Trends and Challenges
    Webinar Recap: Unlocking the Full Value of Telemetry Data
    Data-Driven Decision Making: Leveraging Metrics and Logs-to-Metrics Processors
    How To Configure The Mezmo Telemetry Pipeline
    Supercharge Elasticsearch Observability With Telemetry Pipelines
    Enhancing Grafana Observability With Telemetry Pipelines
    Optimizing Your Splunk Experience with Telemetry Pipelines
    Webinar Recap: Unlocking Business Performance with Telemetry Data
    Enhancing Datadog Observability with Telemetry Pipelines
    Transforming Your Data With Telemetry Pipelines