Redesigning Kafka — A Message Streaming Platform Built for Logging
When it comes to delivering logs in distributed, high-volume environments, many DevOps teams use Apache Kafka. Kafka is well known for its scalability, throughput, and ability to load balance message streams across multiple nodes and consumers. At Mezmo, formerly known as LogDNA, we have a lot of experience using Kafka for logs, and while it worked for us initially, we quickly ran into performance and scaling limitations when prioritizing live data.
In order to enable our operations team to continue to scale efficiently, we realized we would need to either find or create an alternative solution.
Limitations of Kafka for Logging
1. No defer queue
As a log management platform, we often receive sudden bursts and spikes of log data that queue up to be processed. Sometimes messages appear in the queue that contain unexpected or complex data structures, causing consumers to process the data more slowly or even fail in some cases. In Kafka, this data can lead to a severe backlog until either the consumer finishes processing the message, or the broker reassigns the workload to another consumer.
With the fastest LiveTail on the market, we want to prioritize live data over older messages that might be delaying or blocking the stream. To address this problem, we temporarily store these messages in a separate queue known as a defer queue. If a consumer fails to process a message, or if it takes too long to process a message, the message is moved to the defer queue and the consumer moves on to the next message. Later, when the cluster is idle, we go back and reprocess the events in the defer queue. This lets us address the problem of bad data without backing up the message stream or losing data.
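The defer-and-retry flow described above can be sketched in a few lines. This is a minimal illustration with hypothetical names, not Mezmo's actual implementation; it parks messages that fail processing so they never block the live stream, then retries them during an idle pass. (The real system also defers messages that simply take too long, which requires cancelling the handler mid-flight and is omitted here.)

```python
from collections import deque

class DeferQueueProcessor:
    """Illustrative sketch of a defer queue; names are hypothetical."""

    def __init__(self, handler):
        self.handler = handler      # user-supplied per-message handler
        self.defer_queue = deque()  # parked messages awaiting retry

    def process(self, message):
        """Try a message once; on failure, defer it and move on."""
        try:
            self.handler(message)
            return True
        except Exception:
            self.defer_queue.append(message)  # park it, don't block the stream
            return False

    def reprocess_deferred(self):
        """Retry parked messages when the cluster is idle.

        Returns the messages that succeeded this pass; anything still
        failing stays parked for a later pass.
        """
        succeeded = []
        for _ in range(len(self.defer_queue)):
            message = self.defer_queue.popleft()
            try:
                self.handler(message)
                succeeded.append(message)
            except Exception:
                self.defer_queue.append(message)  # still bad: keep it parked
        return succeeded
```

The key property is that `process` always returns promptly, so one malformed payload never stalls the messages behind it.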
2. Constraints scaling consumers
Kafka splits each topic's message stream across multiple partitions, each of which is assigned to a specific set of consumers. Adding partitions increases latency and resource usage and can cause temporary unavailability. Worse, because consumers are assigned to partitions in a fixed ratio, adding partitions requires adding a proportional number of consumers dedicated to only those partitions.
In addition, certain actions, such as adding a consumer or a partition, trigger a reassignment (also known as a rebalance) of partitions across every consumer in the group. This is a lengthy and expensive process, especially for larger consumer groups, during which all processing on the topic halts. If you're scaling to meet increased demand, the last thing you need is a sudden drop in throughput.
Instead of enforcing an n-consumer-per-partition policy, we needed a way to assign any number of consumers to any number of partitions while still ensuring that each message is only read once. Mezmo designed a different way to ensure total order across topics and that each message is only read once. This lets us add as many brokers, consumers, or nodes as necessary while scaling much more efficiently.
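One way to picture decoupling consumers from partitions is a broker-side cursor that atomically hands each message to exactly one of however many consumers happen to poll. The sketch below is illustrative only (the names and in-memory design are assumptions, not Mezmo's architecture), but it shows why no rebalance is needed: consumers share the stream instead of owning slices of it.

```python
import threading

class SharedPartition:
    """Illustrative sketch: any number of consumers poll one stream,
    and a broker-side cursor guarantees each message is read once."""

    def __init__(self, messages):
        self._messages = list(messages)
        self._cursor = 0                # position of the next unclaimed message
        self._lock = threading.Lock()   # makes claiming atomic

    def poll(self):
        """Atomically claim the next unread message, or None if drained."""
        with self._lock:
            if self._cursor >= len(self._messages):
                return None
            msg = self._messages[self._cursor]
            self._cursor += 1
            return msg
```

Because consumers claim work instead of being assigned it, adding or removing a consumer changes nothing on the broker side; there is no ownership to reassign.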
3. Kafka consumers don’t fit the containerization model
At Mezmo, our operations rely on Kubernetes for orchestration of our services. However, running Kafka on Kubernetes is a challenge.
Kafka consumers are stateful and responsible for tracking their own progress through a partition. When a consumer finishes processing a message, it commits its current position in the partition back to the broker. Beyond storing each consumer’s position and availability, the broker is entirely unaware of the consumer’s state or progress. Not only does Kafka rely on persistent consumers, it also relies on long-lived, direct connections between consumers and brokers. And when a consumer dies, Kafka halts processing in order to rebalance partitions and, if necessary, elect a new leader.
Our consumers run on Kubernetes, which assumes that they can be stopped and replaced anytime. Out of the box, Kafka isn’t designed to run as a containerized workload, which can lead to performance and stability problems. Instead of fitting a square peg into a round hole, we took the opportunity to design something that would work for us. Instead of the Kafka model of dumb brokers and smart consumers, we took the opposite approach. Consumers poll for messages, and the broker tracks which messages have been successfully delivered and processed. This lets us add or remove consumers on demand without the need to coordinate consumer groups or reassign partitions.
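The smart-broker model described above can be sketched as a lease-and-acknowledge protocol. Everything here is a hypothetical illustration under assumed names, not Mezmo's actual broker: the broker records which messages are in flight and which are acknowledged, so a consumer that Kubernetes kills mid-message loses nothing; its unacked leases simply return to the queue for the next poller.

```python
import itertools

class SmartBroker:
    """Illustrative sketch of a broker that tracks delivery state itself,
    letting stateless consumers come and go freely."""

    def __init__(self):
        self._next_id = itertools.count()
        self._ready = []         # (msg_id, payload) waiting for delivery
        self._in_flight = {}     # msg_id -> payload: delivered, not yet acked
        self._acked = set()      # msg_ids fully processed

    def publish(self, payload):
        self._ready.append((next(self._next_id), payload))

    def poll(self):
        """Lease the next message to whichever consumer asks."""
        if not self._ready:
            return None
        msg_id, payload = self._ready.pop(0)
        self._in_flight[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        """Consumer reports success; the broker records it."""
        self._in_flight.pop(msg_id, None)
        self._acked.add(msg_id)

    def requeue_expired(self):
        """A consumer vanished without acking: return its leases to the
        queue. (A real broker would trigger this on a lease timeout.)"""
        for msg_id, payload in sorted(self._in_flight.items()):
            self._ready.append((msg_id, payload))
        self._in_flight.clear()
```

Note that no consumer identity appears anywhere in the broker: any replica can poll, ack, or disappear, which is exactly the contract Kubernetes assumes.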
Ultimately, we wanted to share the scaling challenges we ran into using Kafka as our message broker, because we know there are many of you out there experiencing similar issues when scaling your ELK stacks. I would love to hear from you and learn the ways in which you’ve modified Kafka for your needs.