How LogDNA Scales Elasticsearch

4 MIN READ

I spoke at Container World 2019 in Santa Clara and shared insights into how LogDNA has scaled Elasticsearch on Kubernetes over the years.

https://youtu.be/3SqtHoRsyF8

Here are some highlights from the talk; you can also find the slide deck below.

First, the basics...

What is Elasticsearch (ES) and why would I use it?

Elasticsearch is the “E” in the popular ELK stack and makes it easy to search unstructured data. It is a distributed full-text search engine that is queryable through a JSON API, which makes it great for logging. It scales relatively easily because it handles clustering and syncing tasks across nodes, and new nodes can be added with little effort. It’s a popular choice if you’re in the market for an out-of-the-box solution that’s easy to get started with.

Why would you want to run Elasticsearch (ES) on Kubernetes (k8s)?

Kubernetes is an open source container orchestration platform originally developed by Google. It schedules all of your workloads onto the available resources. The cloud providers also have integrations that scale resources like memory, so you don’t have to do it by hand. Kubernetes allows for configuration as code, and static Docker images enforce consistent pod behavior across your infrastructure. And, of course, you’ve been watching the Kubernetes hype train and want to jump on board.

Why does LogDNA use ES and k8s?

At LogDNA, we have made many modifications to the Elasticsearch interface and we’ve built in-house versions of the L (Logstash) and K (Kibana) of the ELK stack for better performance.

We needed a consistent way to deploy our software across varying infrastructures. We run our application both in the cloud and on-premises, and we are agnostic about where our customers want to run LogDNA, whether it’s Amazon, Azure, a data center in Las Vegas, a barn in Russia, anywhere.

We use Kubernetes to help us better automate versioning, CI/CD, and maintenance. We run ES on k8s at scale.

Running Elasticsearch on Kubernetes is not straightforward

These are a few of the steps involved in running ES on Kubernetes:

  • Choose the appropriate Elasticsearch version and select the correct settings (there are hundreds of settings)
  • Learn the expansive query language for Elasticsearch and integrate it into your workflows
  • Set up a Kubernetes environment with access to appropriately sized hardware
  • Configure the Elasticsearch k8s workload to request the appropriate resources, including disks
  • Ensure the correct index templates and cluster settings are applied after launching your ES cluster
  • Create k8s services such that Elasticsearch pods can find each other
  • Troubleshoot all remaining issues as they arise and continue to manage and scale the cluster

How LogDNA got started running ES on k8s

First, at LogDNA we started with a few sane defaults that we recommend:

  • ES version 5.5 & Kubernetes cluster v1.11+. We use the alpine flavor of Elasticsearch to keep our Docker image small: elasticsearch:5.5.2-alpine.
  • Hardware that worked for us: k8s nodes with at least 64 GB of RAM and 16 vCPUs (this depends on your volume)
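For illustration only (this is not our actual manifest, and the numbers are placeholders you would size to your own volume), the container portion of a data-node pod spec might pin the image and request resources along these lines:

    # Excerpt from a pod template - illustrative sizing only
    containers:
      - name: elasticsearch
        image: elasticsearch:5.5.2-alpine   # the alpine flavor keeps the image small
        resources:
          requests:
            memory: "32Gi"   # placeholder - tune to your node size and volume
            cpu: "8"
          limits:
            memory: "32Gi"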

We will dive deeper into:

  1. StatefulSet and Service YAML configurations (we need them for identity, disks, and networking)
  2. Basic but important cluster settings & a good starter index template. Index templates define how data is saved to an index; knowing how to configure them has helped us increase performance by an order of magnitude.
  3. Deploying an ES cluster management GUI (Cerebro) to help with troubleshooting

1. Configuration Tips

  • Use two ConfigMaps:
      ◦ One for the Elasticsearch configuration file
      ◦ One for a start script used to configure ulimits, permissions, and JVM heap size
  • We use three ES role types (StatefulSets):
      ◦ Master - handles lightweight cluster-wide actions (does not require disk)
      ◦ Hot - handles incoming writes to active indices (higher CPU-to-disk ratio)
      ◦ Cold - stores and queries older indices (lower CPU-to-disk ratio)
  • volumeClaimTemplates is a field in the StatefulSet spec that ties a disk to each pod’s identity and lets you dynamically provision those disks.
  • There are important security context settings that you need in each StatefulSet. This is where the obscure knowledge starts showing: you need to add a security context for privileged containers to run Elasticsearch this way, plus the IPC_LOCK capability so it can lock its memory and SYS_RESOURCE so it can raise limits such as the number of file handles (see the sketch after this list).
  • Use k8s pre-emption to ensure your ES pods get scheduling priority. When you’re managing a large set of nodes in Kubernetes, sometimes small microservices take up 10% of the nodes, and you need that extra bit of capacity to launch an Elasticsearch pod because it takes up far more resources. Pre-emption lets the scheduler say "hey, my Elasticsearch pod is more important than your microservice, I’m going to bump you off and be scheduled here" - and that’s great at scale.
  • Create a startup script to set the correct configuration prior to starting the JVM (note vm.max_map_count; we couldn’t run without it).
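To make those pieces concrete, here is a minimal sketch of a start-script ConfigMap and a StatefulSet. This is not our production configuration: the names, heap size, replica count, and storage class are placeholders, and only the fields discussed above (start script, security context, pre-emption, and volumeClaimTemplates) are shown.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: es-start-script              # hypothetical name
    data:
      start.sh: |
        #!/bin/sh
        # Raise kernel and process limits before handing off to Elasticsearch
        sysctl -w vm.max_map_count=262144   # ES will not start without this
        ulimit -l unlimited                 # allow memory locking (bootstrap.memory_lock)
        ulimit -n 65536                     # more file handles than the default
        export ES_JAVA_OPTS="-Xms16g -Xmx16g"   # JVM heap size placeholder
        # Hand off to the image's stock entrypoint, which drops to the elasticsearch user
        exec /docker-entrypoint.sh elasticsearch
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: es-hot                       # one StatefulSet per role: master, hot, cold
    spec:
      serviceName: es-hot
      replicas: 3
      selector:
        matchLabels:
          app: es-hot
      template:
        metadata:
          labels:
            app: es-hot
        spec:
          priorityClassName: es-high-priority   # pre-emption: requires a PriorityClass; bumps lower-priority pods
          containers:
            - name: elasticsearch
              image: elasticsearch:5.5.2-alpine
              command: ["/bin/sh", "/scripts/start.sh"]
              securityContext:
                privileged: true                # the start script sets a sysctl
                capabilities:
                  add: ["IPC_LOCK", "SYS_RESOURCE"]   # memory locking, raising limits
              volumeMounts:
                - name: data
                  mountPath: /usr/share/elasticsearch/data
                - name: start-script
                  mountPath: /scripts
          volumes:
            - name: start-script
              configMap:
                name: es-start-script
      volumeClaimTemplates:              # each pod gets its own dynamically provisioned disk
        - metadata:
            name: data
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: fast-ssd   # placeholder storage class
            resources:
              requests:
                storage: 500Gi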

2. Basic Cluster Settings

Service Discovery in Kubernetes

Once you have your pods, you need to think about networking: since ES is a distributed database, the pods need to talk to each other. The ES hot and cold tiers each have a single load-balanced ClusterIP service endpoint for inserting and querying data.

ES masters are especially important because they hold an election to discover each other, so you have to make sure you have:

  • 1 load-balanced ClusterIP service for handling transport (9300) and HTTP API (9200) requests
  • 1 clusterIP:None service used for ES unicast discovery.

The clusterIP:None (headless) service lists the IP addresses of all the pods in the group. Instead of getting a single load-balanced endpoint, the masters can discover each other directly.

Two important settings for the clusterIP:None service (shown in the sketch below):

  • Ensure DNS addresses are published immediately, before the pods are marked ready
  • No sessionAffinity, which ensures that your addresses stay up to date. Even if a master pod has a readiness probe that hasn’t passed yet, its address still gets published, so the masters can discover each other without going into CrashLoopBackOff.
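A sketch of the two master services, with illustrative names (the "publish DNS immediately" behavior is set here via publishNotReadyAddresses, which assumes Kubernetes 1.9+):

    apiVersion: v1
    kind: Service
    metadata:
      name: es-master                  # load-balanced endpoint for HTTP and transport
    spec:
      selector:
        app: es-master
      ports:
        - name: http
          port: 9200
        - name: transport
          port: 9300
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: es-master-discovery        # headless service used for unicast discovery
    spec:
      clusterIP: None                  # return every pod IP instead of one virtual IP
      publishNotReadyAddresses: true   # publish DNS entries before readiness passes
      # sessionAffinity stays at its default (None) so the address list stays current
      selector:
        app: es-master
      ports:
        - name: transport
          port: 9300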

ES Startup Settings

Here's what we use:

  • Ensure memory_lock is on - important when consuming large amounts of memory
  • Adjust minimum_master_nodes based on the total number of masters you have; make sure it is sized appropriately for a functioning cluster.
  • The clusterIP:None service is referenced by the unicast settings
  • Set the correct ES role
  • Specify the number of cores that the JVM will use
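Pulling those settings together, here is a rough sketch of an elasticsearch.yml for a master node, assuming three masters and the discovery service from the earlier sketch; hot and cold nodes flip the role flags and keep the rest:

    cluster.name: logs
    node.name: ${HOSTNAME}

    # Role: a dedicated master (data nodes set node.master: false, node.data: true)
    node.master: true
    node.data: false

    # Lock the JVM heap in memory (requires the IPC_LOCK capability and ulimit -l)
    bootstrap.memory_lock: true

    # With three masters, two must agree before the cluster can form
    discovery.zen.minimum_master_nodes: 2

    # Unicast discovery through the headless (clusterIP: None) service
    discovery.zen.ping.unicast.hosts: es-master-discovery

    # Tell ES how many cores it can actually use
    processors: 4

    network.host: 0.0.0.0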

Configuring an index template

What’s irritating is that index templates can’t be set ahead of time. You have to ping the API once ES is up and then add your index templates. We have a job that does that (a sketch follows the list below).

  • Optimizing shards is helpful for fine-tuning performance. The more shards you have, the more distributed your system is, but also the more state that has to be synchronized.
  • Configure index.total_shards_per_node based on your expected load
  • Optimizing shards can increase performance and reduce cluster state overhead
  • Set a refresh_interval that works for you - it controls how soon data becomes searchable after insertion.
  • Higher refresh intervals offer better throughput performance at the cost of latency
  • We typically use 15-30 seconds
  • Change translog.durability to async (allow asynchronous translog writes)
  • We regret not discovering this setting sooner, as it gave us a 5-10x increase in performance
  • Note: index templates MUST be applied AFTER the ES masters are already running
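Because the template has to go in after the masters are up, a small Kubernetes Job can apply it once the cluster answers. This is a hedged sketch rather than our actual job, and the values in the template body are placeholders for the settings above:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: es-index-template          # hypothetical name
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: apply-template
              image: curlimages/curl   # any image with curl works
              command:
                - sh
                - -c
                - |
                  curl -sS -XPUT http://es-master:9200/_template/logs \
                    -H 'Content-Type: application/json' -d '
                  {
                    "template": "logs-*",
                    "settings": {
                      "index.number_of_shards": 6,
                      "index.routing.allocation.total_shards_per_node": 2,
                      "index.refresh_interval": "30s",
                      "index.translog.durability": "async"
                    }
                  }'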

3. Managing Elasticsearch

A) Cerebro (Manage using a GUI)

  • If you’re using ES v2.x or lower, use kopf

Cerebro connects to your ES service endpoint(s). It lists the ES nodes/pods along with their health stats. You can easily view indices and shards across the available data nodes. You can modify index settings, templates, and data. Most importantly, you can move shards around.
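If you want to run Cerebro inside the same cluster, a bare-bones Deployment can look like the sketch below; the image and names are assumptions rather than a prescription, and you point the UI at your ES HTTP service (for example http://es-master:9200) once it loads:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cerebro
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: cerebro
      template:
        metadata:
          labels:
            app: cerebro
        spec:
          containers:
            - name: cerebro
              image: lmenezes/cerebro   # community-maintained Cerebro image
              ports:
                - containerPort: 9000   # Cerebro's default web port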

Not everything is available via Cerebro.

B) Managing ES through API calls

We use Insomnia (a REST API GUI that makes it easy to share API calls), though curl works too.

Here are the API calls we commonly use:

  • /_cluster/health
      ◦ Outputs useful information about the cluster. Cerebro displays this status color as a bar across the top.
  • /_cat/pending_tasks?v
      ◦ We look at this a lot. When there is a large number of pending tasks, operations that require a cluster sync take a while. One thing you can do if you run into this problem is lock shard allocations (easily done through the Cerebro GUI), which clears out the pending task list.
  • /_flush?force & /_cluster/reroute?retry_failed=true
      ◦ When you see unassigned shards in your cluster, it means the shards can’t figure out where to live; they retry for a while and then stop. One thing you can do is force a flush and then tell the cluster to retry allocating the failed shards.

Wrap Up

I know we’ve walked through a lot of seemingly obscure Elasticsearch settings. When you’re running Elasticsearch in a Docker container, you have to realize that it was not designed for containers; it requires some coaxing to run properly inside one.

  • Remember to use the correct security context, ulimit, and vm settings.
  • There are native concepts in Kubernetes that can make running ES easier
  • Index templates have a big impact on how well your ES cluster runs
  • We recommend managing Elasticsearch with both a GUI (Cerebro) and the ES APIs. Both are extremely useful for tuning performance.
Download Slides

Feel free to reach out and share your experience in scaling ES with k8s. Instead of worrying about scaling your own log management solution, give LogDNA a try and sign up for a 14-day free trial.
