Kafka is not the best anymore. Meet Pulsar!

Anuradha Prasanna
9 min read · Jun 26, 2019


Apache Kafka, created by LinkedIn in 2011, was the long-standing decoupled messaging powerhouse, and for years practically the only option for performance-critical, large-scale messaging workloads that need to pass millions of messages per day. And that is a lot: for perspective, Twitter averaged around 500 million tweets per day in 2018, from roughly 100 million daily users. Back in those days there was no other MOM (message-oriented middleware) system that could stream at that scale to a large subscriber base, so Kafka became the choice of most big names: LinkedIn, Yahoo, Twitter, Netflix and Uber.

Things have changed a great deal by 2019: daily message counts now run into the billions, so platforms need to scale accordingly to support the ever-growing demand. To do this, messaging systems must scale continuously and seamlessly without impacting customers, and Kafka has a lot of problems with scaling and is a difficult system to manage. People who love Kafka may feel bad reading this, but it's nothing personal; I am a Kafka fan too. As the world invents new, more convenient tools, it is only natural that the older ones start to feel harder to manage and more troublesome. That's just the way it goes.

So a new guy comes to town, and his name is "Apache Pulsar"!

Pulsar was created by Yahoo back in 2013 and donated to Apache in 2016. It is now an Apache top-level project with proven success: both Yahoo and Twitter use Pulsar, and at Yahoo it has carried 100 billion messages per day across more than 2 million topics. That is unbelievably large.

Let's see what the problems with Kafka are, and how Pulsar takes those burdens off your shoulders.

  1. Scaling Kafka is difficult. This is due to the way Kafka stores data inside the brokers, as distributed logs that serve as the message persistence store. Spinning up another broker means replicating topic partitions and replicas in full, which takes time to complete.
  2. Changing partition counts to allow more storage can break message ordering by conflicting the message indexes. This is critical when message order is a prime concern.
  3. Leader election can go crazy if the partition replicas are not in an ISR (in-sync replica) state. Basically, there should be at least one ISR replica available to be elected leader when the original leader partition fails, but this cannot be guaranteed. There is a setting (unclean leader election) that, when enabled, lets a non-ISR replica become the leader, which is arguably even worse than a service outage without a leader partition, since it can silently lose messages.
  4. Ideally you must plan and calculate the number of brokers, topics, partitions and replicas up front (to fit your planned future growth) to avoid scaling problems, but nowadays, with unpredictable traffic demands and spikes, that is hard to plan.
  5. Kafka cluster rebalancing can impact the performance of connected producers and consumers.
  6. Kafka topics can lose messages in failure scenarios (especially in point 3).
  7. Working with offsets is a pain, because Kafka is dumb. (Don't be angry at me for saying that; the "dumb" comes from the architectural concept of a "dumb broker and intelligent clients".)
  8. Old messages have to be deleted fairly quickly under heavy usage to avoid running out of disk space.
  9. Kafka's native cross-region geo-replication mechanism (MirrorMaker) is notoriously problematic, even between just two data centers. Because of this, even Uber created its own solution, uReplicator, as a workaround.
  10. For real-time event analysis you must bring in another tool such as Apache Storm, Apache Heron or Apache Spark, and that tool has to be strong enough to keep up with the incoming traffic rates.
  11. There are no native multi-tenancy capabilities with complete isolation of tenants; it is approximated with security features such as topic authorization.

Of course, architects and engineers do solve the above problems in their production environments, but it costs the platform, solution or site reliability engineers some hairs and some headaches for sure, because these are not simple fixes like correcting a piece of logic in code and deploying the packaged binaries to production.

OK...

Now let's talk about our "guy", Pulsar. Because he leads the game.

What is Apache Pulsar?

"Apache Pulsar is an open-source distributed pub-sub messaging system originally created by Yahoo", as the website's intro puts it. Essentially, it is something like Kafka, for those who understand Kafka. :)

Pulsar Performance!

The best highlight is performance: Pulsar is much faster than Kafka, as proven by a performance comparison done by GigaOm, a Texas-based technology research and analysis firm.

Pulsar is approximately 2.5 times faster, with 40% lower latency, than Kafka. Isn't that great? Of course. (The source is here.)
Note that this performance comparison was run on one topic with one partition and 100-byte messages, where Pulsar maxed out at 220,000+ messages per second.

Wow, that's really awesome!

That alone is a definite reason to move away from Kafka and embrace Pulsar, and it is supported by the facts explained below.

Apache Pulsar advantages and features

Pulsar supports both Kafka-style, offset-based topic consumption (through its reader interface) and traditional pub-sub topic consumption (through its consumer interface), which is a great thing.
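As a rough sketch of the two consumption modes using the Pulsar Java client (assuming a broker at localhost:6650 and a topic and subscription named my-topic and my-subscription, which are illustrative names, not anything from the article):

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;

public class ConsumptionModes {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        // Traditional pub-sub: the broker tracks acknowledgements
        // per subscription, so the client never juggles offsets.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("my-topic")
                .subscriptionName("my-subscription")
                .subscribe();
        Message<byte[]> msg = consumer.receive();
        consumer.acknowledge(msg);

        // Kafka-style reader: the client picks its own start position,
        // similar to manually managing offsets in Kafka.
        Reader<byte[]> reader = client.newReader()
                .topic("my-topic")
                .startMessageId(MessageId.earliest)
                .create();
        while (reader.hasMessageAvailable()) {
            Message<byte[]> m = reader.readNext();
            System.out.println(new String(m.getData()));
        }

        client.close();
    }
}
```

Note how the consumer never touches message IDs at all, while the reader takes explicit control of the read position; that is the whole difference between the two modes.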

Pulsar has a different architecture for data persistence. Rather than keeping log files on the local broker as Kafka does, Pulsar stores all topic data as ledgers in a dedicated storage layer powered by Apache BookKeeper. Simply put, BookKeeper is a highly scalable, fault-tolerant, low-latency storage service optimized for real-time, durable data workloads. So BookKeeper guarantees the availability of the data, unlike Kafka log files, which live on individual brokers and are exposed to the problems that can arise there, including catastrophic server failures. Thanks to this guaranteed persistence layer, Pulsar gains another advantage: its brokers are stateless. Yes, you read that correctly, and it is a big difference compared to Kafka. The real benefit is that Pulsar brokers can scale horizontally, seamlessly and very quickly, to meet growing demand, because a newly spun-up broker has no topic partition data to load or synchronize the way a new Kafka broker does.

What if a Pulsar broker goes down? Its topics are immediately reassigned to another broker. This is possible because no topic data persists on the broker's disk, and service discovery takes care of redirecting the publishers and consumers.

Unlike in Kafka, where you need to purge old data to free up disk space, Pulsar stores its topic data in a tiered structure: you can attach additional disks or Amazon S3 to extend and offload topic storage practically without limit. The cool thing is that Pulsar presents the data to consumers seamlessly, as if it were all coming from a single drive.
Another valuable use case follows from never needing to purge old data: you can use these organized Pulsar topics as a "data lake".
But of course, you can still configure Pulsar to purge old data if needed.
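To make this concrete, here is a hedged sketch of triggering an offload to tiered storage through the Pulsar admin Java API. It assumes an admin endpoint at localhost:8080 and that a tiered-storage offloader (such as S3) has already been configured for the namespace; the topic name is illustrative:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.MessageId;

public class OffloadExample {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceUrl("http://localhost:8080") // assumed admin endpoint
                .build();

        String topic = "persistent://public/default/my-topic";

        // Offload everything up to the current end of the topic to the
        // configured tiered storage, leaving recent data in BookKeeper.
        MessageId until = admin.topics().getLastMessageId(topic);
        admin.topics().triggerOffload(topic, until);

        // Offloading runs asynchronously; this polls its current state.
        System.out.println("Offload status: "
                + admin.topics().offloadStatus(topic).getStatus());

        admin.close();
    }
}
```

In day-to-day operation you would normally let namespace-level offload threshold policies do this automatically rather than triggering it by hand.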

Pulsar natively supports multi-tenancy, with data isolation at the topic namespace level; that kind of isolation is not possible with Kafka.
On top of that, to make Pulsar applications more solid and secure, Pulsar supports fine-grained access control capabilities.
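A minimal sketch of setting up an isolated tenant and namespace with the Pulsar admin Java API might look like the following. The tenant, namespace, cluster and role names are all hypothetical, and the TenantInfo constructor shown is the one from the Pulsar 2.x era (newer releases use a builder instead):

```java
import java.util.Collections;
import java.util.EnumSet;
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.AuthAction;
import org.apache.pulsar.common.policies.data.TenantInfo;

public class TenantSetup {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceUrl("http://localhost:8080") // assumed admin endpoint
                .build();

        // Create a tenant restricted to one cluster, with its own admin role.
        TenantInfo info = new TenantInfo(
                Collections.singleton("acme-admin-role"),  // admin roles
                Collections.singleton("standalone"));      // allowed clusters
        admin.tenants().createTenant("acme", info);

        // Namespaces group topics and carry policies (retention, quotas, ACLs).
        admin.namespaces().createNamespace("acme/orders");

        // Fine-grained access control: only this role may produce/consume here.
        admin.namespaces().grantPermissionOnNamespace(
                "acme/orders", "order-service-role",
                EnumSet.of(AuthAction.produce, AuthAction.consume));

        admin.close();
    }
}
```

Every topic then lives under a tenant/namespace pair (e.g. persistent://acme/orders/new-orders), so tenant isolation is baked into the addressing model itself.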

Pulsar has client libraries available for Java, Go, Python, C++ and WebSocket.

Pulsar natively supports Function as a Service (FaaS), a very cool feature similar to Amazon Lambda functions, where real-time data streams can be analyzed, aggregated or summarized in real time. This is a great advantage compared to Kafka, where you need a separate stream-processing system such as Apache Storm, which is an additional cost and a pain to maintain. As of now, Pulsar Functions supports Java and Python, with other languages to follow in later releases.
Example use cases for Pulsar Functions are content-based routing, aggregation, message formatting and message cleansing.
Below is a sample function that calculates word counts:

package org.example.functions;

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

import java.util.Arrays;

public class WordCountFunction implements Function<String, Void> {
    // This is invoked every time a message is published to the input topic
    @Override
    public Void process(String input, Context context) throws Exception {
        Arrays.asList(input.split(" ")).forEach(word -> {
            String counterKey = word.toLowerCase();
            context.incrCounter(counterKey, 1);
        });
        return null;
    }
}

Pulsar supports a number of data sinks for routing processed messages to leading products, including Pulsar topics themselves, Cassandra, Kafka, AWS Kinesis, Elasticsearch, Redis, MongoDB, InfluxDB and many more.
There is also the ability to persist processed message streams to disk files.

Pulsar enables querying past messages in a SQL style with Pulsar SQL, which queries the BookKeeper data very efficiently using the Presto engine. Presto is a high-performance, distributed SQL query engine for big data that can query multiple data sources within a single query. Below is a sample Pulsar SQL query:

show tables in pulsar."public/default"

Another very important feature is the built-in, robust geo-replication mechanism, which synchronizes messages across clusters in different regions while maintaining message integrity. When messages are produced on a Pulsar topic, they are first persisted in the local cluster and then forwarded asynchronously to the remote clusters. Geo-replication must be enabled on a per-tenant basis: it can be enabled between clusters only when a tenant has been created with access to both of them.
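The per-tenant setup described above can be sketched with the admin Java API roughly as follows. The cluster names us-west and us-east, the tenant, the namespace and the role are all hypothetical, and both clusters are assumed to already know about each other in the cluster metadata:

```java
import java.util.Arrays;
import java.util.HashSet;
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TenantInfo;

public class GeoReplicationSetup {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceUrl("http://localhost:8080") // assumed admin endpoint
                .build();

        // The tenant must be allowed into every cluster that will replicate.
        TenantInfo info = new TenantInfo(
                new HashSet<>(Arrays.asList("replication-admin-role")),
                new HashSet<>(Arrays.asList("us-west", "us-east")));
        admin.tenants().createTenant("global-tenant", info);

        admin.namespaces().createNamespace("global-tenant/replicated-ns");

        // Topics in this namespace are persisted locally first, then
        // forwarded asynchronously to the other cluster.
        admin.namespaces().setNamespaceReplicationClusters(
                "global-tenant/replicated-ns",
                new HashSet<>(Arrays.asList("us-west", "us-east")));

        admin.close();
    }
}
```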

For messaging-channel security, TLS-based and JWT-token-based authorization mechanisms are supported natively, so you can specify who may publish to or consume from which topics. For additional security, Pulsar encryption allows applications to encrypt messages at the producer and decrypt them at the consumer, with only encrypted messages passing through Pulsar. Encryption is performed using a public/private key pair configured by the application, and encrypted messages can only be decrypted by consumers holding a valid key. But remember that this comes with a performance hit, since every message must be encrypted and decrypted for processing.
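A hedged sketch of the producer side of this, assuming a broker at localhost:6650; the key name, topic and PEM file paths are hypothetical, and FileKeyReader is a minimal custom implementation of the client's CryptoKeyReader interface, not a Pulsar-provided class:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import org.apache.pulsar.client.api.CryptoKeyReader;
import org.apache.pulsar.client.api.EncryptionKeyInfo;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class EncryptedProducer {
    // A minimal custom key reader that loads the application's key pair
    // from local PEM files (paths are hypothetical).
    static class FileKeyReader implements CryptoKeyReader {
        @Override
        public EncryptionKeyInfo getPublicKey(String keyName, Map<String, String> meta) {
            try {
                return new EncryptionKeyInfo(
                        Files.readAllBytes(Paths.get("/path/to/public.pem")), null);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public EncryptionKeyInfo getPrivateKey(String keyName, Map<String, String> meta) {
            try {
                return new EncryptionKeyInfo(
                        Files.readAllBytes(Paths.get("/path/to/private.pem")), null);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Messages are encrypted client-side under "my-app-key" before they
        // ever reach the broker; only consumers with the private key can read them.
        Producer<byte[]> producer = client.newProducer()
                .topic("secure-topic")
                .addEncryptionKey("my-app-key")
                .cryptoKeyReader(new FileKeyReader())
                .create();

        producer.send("sensitive payload".getBytes());
        producer.close();
        client.close();
    }
}
```

The consumer side mirrors this: it supplies its own CryptoKeyReader so the client can decrypt messages transparently on receive.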

Anyone using Kafka who wants to migrate to Pulsar gets a big relief: Pulsar natively supports working directly with Kafka data through a connector, or you have the option of importing existing Kafka application data into Pulsar very easily.
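Pulsar also ships a Kafka compatibility wrapper (the pulsar-client-kafka artifact) that implements the standard Kafka producer/consumer API, so existing Kafka application code can often run against Pulsar by swapping a dependency. A sketch, assuming that wrapper is on the classpath and a broker runs at localhost:6650 (the topic name is illustrative):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaApiOnPulsar {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // With the wrapper, the "bootstrap" address is a Pulsar service URL,
        // not a Kafka broker list.
        props.put("bootstrap.servers", "pulsar://localhost:6650");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.IntegerSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // This is the unmodified Kafka producer API, now backed by Pulsar.
        Producer<Integer, String> producer = new KafkaProducer<>(props);
        producer.send(new ProducerRecord<>("my-topic", 1, "hello pulsar"));
        producer.close();
    }
}
```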

It is also worth noting that both Kafka and Pulsar have enterprise support providers, Confluent and Streamlio respectively, which offer production support and consulting if you prefer.

Summary

This is not to say that Kafka is bad and that Pulsar is the only option for a large-scale messaging platform. What it does say is that some of the pain points that exist in Kafka are already handled for us by Pulsar, which is a great thing for any engineer or architect. The other most important point is that, due to its architecture, Pulsar is much, much faster for large messaging solutions; Yahoo and Twitter (and many others) are already running production message loads on it, which is proof that it is stable and production-ready for any business. There is a bit of a learning curve and a mindset shift in moving from Kafka (we were all very proud to say we were Kafka guys) to Pulsar-based solutions, but there is a definite ROI for it!
