Redesigning Kafka — A Message Streaming Platform Built for Logging

When it comes to delivering logs in distributed, high-volume environments, many DevOps teams use Apache Kafka. Kafka is well known for its scalability, throughput, and ability to load balance message streams across multiple nodes and consumers. At Mezmo, formerly known as LogDNA, we have a lot of experience using Kafka for logs, and while it worked for us initially, we quickly ran into performance and scaling limitations when prioritizing live data.

To let our operations team keep scaling efficiently, we realized we would need to either find or build an alternative.

Limitations of Kafka for Logging

1. No deferment queue

As a log management platform, we often receive sudden bursts and spikes of log data that queue up to be processed. Sometimes messages appear in the queue that contain unexpected or complex data structures, causing consumers to process the data more slowly or even fail in some cases. In Kafka, this data can lead to a severe backlog until either the consumer finishes processing the message, or the broker reassigns the workload to another consumer.

With the fastest LiveTail in the market, we want to prioritize live data over older messages that might be delaying or blocking the stream. To address this problem, we temporarily store these messages in a separate queue known as a defer queue. If a consumer fails to process a message, or if it takes too long to process a message, the message is moved to the defer queue and the consumer moves on to the next message. Later, when the cluster is idle, we go back and reprocess the events in the defer queue. This lets us address the problem of bad data without backing up the message stream or losing data.
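
The defer-queue idea described above can be sketched roughly as follows. This is a minimal illustration and not Mezmo's implementation: the names `drain`, `handler`, and `lenient_handler` are hypothetical, and the sketch assumes the handler itself signals a failure or an exceeded time budget by raising an exception.

```python
from collections import deque
from typing import Callable, Deque, List


def drain(queue: Deque[str], handler: Callable[[str], None],
          deferred: Deque[str]) -> List[str]:
    """Process every message in `queue`; park failures in `deferred`."""
    done = []
    while queue:
        message = queue.popleft()
        try:
            handler(message)          # assumed to raise on failure or timeout
        except Exception:
            deferred.append(message)  # park it and keep the stream moving
        else:
            done.append(message)
    return done


def handler(message: str) -> None:
    # Hypothetical fast path: rejects a payload it cannot parse.
    if message == "bad":
        raise ValueError("unexpected structure")


def lenient_handler(message: str) -> None:
    # Hypothetical slow path used when the cluster is idle.
    return None


live = deque(["a", "bad", "b"])
deferred: Deque[str] = deque()
processed = drain(live, handler, deferred)   # live data is never blocked

# Later, when the cluster is idle, retry what was deferred.
still_failing: Deque[str] = deque()
reprocessed = drain(deferred, lenient_handler, still_failing)
```

The point of the design is that a bad message costs the live stream one failed attempt, not a stalled partition, and nothing is dropped because the message stays in the defer queue until it is retried.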

2. Constraints scaling consumers

Kafka splits each topic’s message stream across multiple partitions, and within a consumer group each partition is assigned to exactly one consumer at a time. Because that ratio is fixed, the partition count caps how many consumers in a group can work in parallel, so scaling out consumers means adding partitions too. Adding partitions has its own costs: increased latency, higher resource usage, and possible unavailability.

In addition, certain actions, such as adding a consumer or partition, trigger a reassignment (also known as a rebalance) of partitions across the consumers in a group. This is a lengthy and expensive process, especially for larger consumer groups, during which processing on that topic is halted. If you’re scaling to meet increased demand, the last thing you need is a sudden drop in throughput.
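
To make the scaling constraint concrete, here is a simplified model of partition assignment within one consumer group. It is only a sketch: the round-robin `assign` function and the `moved` helper are hypothetical and do not reproduce Kafka's actual assignor implementations, but they follow the rule that each partition has a single owner in the group.

```python
from typing import Dict, List


def assign(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Give each partition to exactly one consumer, round-robin."""
    owned: Dict[str, List[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        owned[consumers[i % len(consumers)]].append(partition)
    return owned


def moved(before: Dict[str, List[int]], after: Dict[str, List[int]]) -> List[int]:
    """Partitions whose owner changed between two assignments."""
    owner_before = {p: c for c, ps in before.items() for p in ps}
    owner_after = {p: c for c, ps in after.items() for p in ps}
    return sorted(p for p in owner_before if owner_before[p] != owner_after[p])


partitions = [0, 1, 2, 3]
two = assign(partitions, ["c1", "c2"])
three = assign(partitions, ["c1", "c2", "c3"])
five = assign(partitions, ["c1", "c2", "c3", "c4", "c5"])

reassigned = moved(two, three)   # partitions that change hands in the rebalance
idle = [c for c, ps in five.items() if not ps]  # consumers with nothing to read
```

In this model, going from two to three consumers changes the owner of half the partitions, and a fifth consumer on four partitions receives no work at all.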

Instead of enforcing a fixed consumer-per-partition policy, we needed a way to assign any number of consumers to any number of partitions while still ensuring that each message is read only once. We designed a different mechanism that preserves total order across topics and still delivers each message only once. This lets us add as many brokers, consumers, or nodes as necessary and scale much more efficiently.

3. Kafka consumers don’t fit the containerization model

At Mezmo, our operations rely on Kubernetes for orchestration of our services. However, running Kafka on Kubernetes is a challenge.

Kafka consumers are stateful and responsible for tracking their progress through a partition. When a consumer finishes processing a message, it reports its current position in the partition back to the broker. Outside of storing each consumer’s position and availability, the broker is entirely unaware of the consumer’s state or progress. Kafka relies not only on persistent consumers but also on direct routes between consumers and brokers. In addition, when a consumer dies, Kafka halts in order to rebalance partitions and, if necessary, elect a new leader.
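
The "smart consumer, dumb broker" split described above might look like this in miniature. The `Broker` and `Consumer` classes here are hypothetical stand-ins, not the Kafka client API: the broker only stores the log and the last reported position, while the consumer decides what to read next.

```python
from typing import Dict, List


class Broker:
    """Stores the log and the last committed offset per consumer group."""

    def __init__(self, log: List[str]) -> None:
        self.log = log
        self.committed: Dict[str, int] = {}

    def fetch(self, offset: int) -> str:
        return self.log[offset]

    def commit(self, group: str, offset: int) -> None:
        self.committed[group] = offset


class Consumer:
    """Tracks its own position and reports it back after each message."""

    def __init__(self, broker: Broker, group: str) -> None:
        self.broker = broker
        self.group = group
        # Resume from the last committed position, or start at the beginning.
        self.position = broker.committed.get(group, 0)

    def poll(self) -> str:
        message = self.broker.fetch(self.position)
        self.position += 1
        self.broker.commit(self.group, self.position)
        return message


broker = Broker(["m0", "m1", "m2"])
first = Consumer(broker, "g")
seen = [first.poll(), first.poll()]

# A replacement consumer can only resume from what the first one reported.
replacement = Consumer(broker, "g")
resumed = replacement.poll()
```

Everything the replacement knows comes from the committed offset, which is why progress tracking lives with the consumer in this model.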

Our consumers run on Kubernetes, which assumes that they can be stopped and replaced at any time. Out of the box, Kafka isn’t designed to run as a containerized workload, which can lead to performance and stability problems. Rather than fit a square peg into a round hole, we took the opportunity to design something that would work for us, and we inverted the Kafka model of dumb brokers and smart consumers. Consumers poll for messages, and the broker tracks which messages have been successfully delivered and processed. This lets us add or remove consumers on demand without the need to coordinate consumer groups or reassign partitions.
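
As a rough sketch of the inverted model, the broker below hands out messages on request and keeps the delivery state itself. This is an illustration under assumptions, not Mezmo's actual design: `TrackingBroker`, `ack`, and `consumer_gone` are hypothetical names, and a real system would detect a lost consumer through timeouts or health checks.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Set, Tuple


class TrackingBroker:
    """Broker that tracks which messages are pending, in flight, or done."""

    def __init__(self, messages: List[str]) -> None:
        self.pending: Deque[Tuple[int, str]] = deque(enumerate(messages))
        self.in_flight: Dict[int, Tuple[str, str]] = {}  # id -> (consumer, body)
        self.done: Set[int] = set()

    def poll(self, consumer: str) -> Optional[Tuple[int, str]]:
        """Hand the next pending message to whichever consumer asks."""
        if not self.pending:
            return None
        msg_id, body = self.pending.popleft()
        self.in_flight[msg_id] = (consumer, body)
        return msg_id, body

    def ack(self, msg_id: int) -> None:
        """Record that a delivered message was processed successfully."""
        del self.in_flight[msg_id]
        self.done.add(msg_id)

    def consumer_gone(self, consumer: str) -> None:
        """Requeue anything a vanished consumer was still holding."""
        for msg_id, (owner, body) in list(self.in_flight.items()):
            if owner == consumer:
                del self.in_flight[msg_id]
                self.pending.appendleft((msg_id, body))


broker = TrackingBroker(["a", "b", "c"])
first = broker.poll("c1")
second = broker.poll("c2")
broker.ack(first[0])          # c1 finished its message
broker.consumer_gone("c2")    # c2's pod was replaced before it acked
retry = broker.poll("c3")     # a brand-new consumer picks up the work
```

Because consumers hold no state the broker needs, a new consumer can start polling immediately, and no partition reassignment is involved.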

Closing Notes

We wanted to share the scaling challenges we hit using Kafka as our message broker because we know that many of you are running into similar issues when scaling your ELK stack. I would love to hear how you’ve modified Kafka for your needs.
