Apache Kafka is a fault-tolerant, scalable messaging system built on the publish-subscribe model. It helps developers design distributed systems. Many major web companies, such as Airbnb, Twitter, and LinkedIn, use Kafka.
Need for Kafka
To design innovative digital services, developers need access to a wide stream of data, and that data has to be integrated. Typically, transactional data such as shopping carts, inventory, and orders is combined with searches, recommendations, likes, and page clicks. This data plays an important role in offering insight into customers’ purchasing behavior, and predictive analytics systems use it to forecast future trends. It is in this domain that Kafka gives companies a chance to gain an edge over their competitors.
How Was It Conceptualized?
Around 2010, a team comprising Neha Narkhede, Jun Rao, and Jay Kreps developed Apache Kafka at LinkedIn. At that time, they were trying to solve a complex problem: ingesting the voluminous event data generated by LinkedIn’s website and infrastructure with low latency. They planned to feed this data into a lambda architecture that combined Hadoop with real-time event processing systems. Back then, they had no access to any real-time applications that could solve their issues.
For data ingestion, there were solutions aimed at offline batch systems. However, using them risked exposing a lot of implementation details to downstream users. These solutions also used a push model, which could easily overwhelm consumers.
While the team had the option to use conventional messaging queues like RabbitMQ, these were deemed overkill for the problem at hand. Companies want to add machine learning, but when they cannot get the data, the algorithms are of no use. Extracting data from the source systems and moving it reliably was difficult, and the existing enterprise messaging and batch-based solutions did not resolve the issue.
Hence, Kafka was designed as the ingestion backbone for such issues. By 2011, Kafka’s data ingestion was close to 1 billion events per day. In less than 5 years, it reached 1 trillion messages per day.
How Does Kafka Work?
Kafka offers scalable, persistent, and in-order messaging. Like other publish-subscribe systems, it is built around topics, publishers, and subscribers. It supports highly parallel consumption via topic partitioning, with ordering maintained within each partition. Each message written to Kafka is persisted and replicated to peer brokers. You can configure how long these messages are retained; for instance, with a retention period of 30 days, messages expire after a month.
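To make topic partitioning concrete, here is a minimal sketch in plain Python rather than a real Kafka client. It assumes keyed messages are routed by hashing the key modulo the number of partitions. The names `choose_partition` and `route`, and the use of CRC32 as the hash, are illustrative assumptions and not Kafka's actual partitioner. The sketch only shows that messages with the same key land in the same partition, which keeps their relative order while different partitions can be consumed in parallel.

```python
import zlib
from collections import defaultdict


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition; the same key always maps to the same partition."""
    # CRC32 is a stand-in hash chosen for determinism, not the hash Kafka uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages, num_partitions):
    """Group (key, value) pairs into per-partition lists, preserving arrival order."""
    partitions = defaultdict(list)
    for key, value in messages:
        partitions[choose_partition(key, num_partitions)].append((key, value))
    return partitions


# Hypothetical keyed messages: the key identifies a shopping cart.
messages = [("cart-1", "add shirt"), ("cart-2", "add bag"), ("cart-1", "checkout")]
routed = route(messages, num_partitions=3)
```

Because both `cart-1` messages share a key, they end up in the same partition in the order they were produced.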
Kafka’s central abstraction is its log. Here, a log is an append-only data structure in which records are inserted in time order. Kafka does not constrain the payload, so you can store any type of data.
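The idea of an append-only, ordered log can be sketched in a few lines of Python. This is a toy in-memory model, not Kafka's storage engine; the class name `AppendOnlyLog` and its methods are assumptions made for illustration. Records are only ever added at the end, and each one is identified by the position (offset) at which it was appended.

```python
class AppendOnlyLog:
    """Toy in-memory log: records can be appended and read, never modified."""

    def __init__(self):
        self._records = []

    def append(self, record: bytes) -> int:
        """Add a record at the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int):
        """Return all records at or after the given offset, in append order."""
        return list(self._records[offset:])


log = AppendOnlyLog()
first = log.append(b"add product shirt")   # offset 0
second = log.append(b"add product bag")    # offset 1
```

The payload is treated as opaque bytes here, mirroring the point that the log does not care what kind of data it holds.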
Typically, a database writes change events to a log and derives its column values from that log. In Kafka, messages are written to a topic, which maintains the log. Subscribers read from these topics and derive their own representations of the data.
For instance, the activity log of a shopping cart might include: add product shirt, add product bag, remove product shirt, and checkout. The log presents this activity to downstream systems. When a shopping cart service reads the log, it can derive the shopping cart object that reflects the current contents of the cart: product bag, ready for checkout.
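The shopping cart example can be sketched as a small replay function. This is plain Python with no Kafka client involved; the event tuples and the name `replay_cart` are assumptions chosen to mirror the four activities listed above. The function walks the log from the beginning and folds each event into the current state.

```python
def replay_cart(events):
    """Derive the current cart state by replaying events in log order."""
    items = set()
    checked_out = False
    for action, product in events:
        if action == "add":
            items.add(product)
        elif action == "remove":
            items.discard(product)
        elif action == "checkout":
            checked_out = True
    return {"items": sorted(items), "checked_out": checked_out}


# The activity log from the example, oldest event first.
cart_log = [
    ("add", "shirt"),
    ("add", "bag"),
    ("remove", "shirt"),
    ("checkout", None),
]
state = replay_cart(cart_log)
# state == {"items": ["bag"], "checked_out": True}
```

Any number of downstream services could replay the same log and build a different view from it, for example a count of removed products.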
Since Apache Kafka can retain messages for long periods of time, applications can be rewound to previous log positions for reprocessing. For instance, consider a scenario in which you wish to test a new analytics algorithm or application against previous events.
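Rewinding can be illustrated with a toy consumer that keeps its own read position. The class `ReplayableConsumer` and its `poll` and `seek` methods are hypothetical and operate on a Python list standing in for a retained log; real Kafka clients offer a comparable seek operation, but this sketch does not use or reproduce their API.

```python
class ReplayableConsumer:
    """Toy consumer that tracks its own position in a retained log."""

    def __init__(self, log):
        self.log = log
        self.position = 0  # offset of the next record to read

    def poll(self, max_records=10):
        """Return up to max_records from the current position and advance."""
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def seek(self, offset):
        """Move the read position to an earlier (or later) offset."""
        if not 0 <= offset <= len(self.log):
            raise ValueError("offset out of range")
        self.position = offset


retained_log = ["e0", "e1", "e2"]
consumer = ReplayableConsumer(retained_log)
first_pass = consumer.poll()    # reads every retained event once
consumer.seek(0)                # rewind to the start of the log
second_pass = consumer.poll()   # the same events, ready for a new algorithm
```

Because the log itself is unchanged by reading, the second pass sees exactly the events the first pass saw.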
What Does Apache Kafka Not Do?
Apache Kafka offers blazing speed because it treats the log data structure as a first-class citizen. This makes it quite different from conventional message brokers.
It is important to note that Kafka does not assign individual IDs to messages; messages are addressed by their log offsets. It also does not track which consumers exist for a topic or which messages they have consumed. Consumers are responsible for tracking this themselves.
Because its design differs from that of conventional message brokers, it can offer the following optimizations.
- It reduces load by not maintaining indexes over the message records. Moreover, it does not offer random access; a consumer specifies an offset, and Kafka delivers messages in order starting from that offset.
- There are no per-message deletes. Kafka retains segments of the log for a specified time period.
- It can use kernel-level I/O to stream messages to consumers efficiently, without buffering them in the application.
- It can rely on the operating system for file page caching and for efficient writes to disk.
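The time-based retention mentioned in the list above can be sketched as follows. This is a simplified model under stated assumptions: the `Segment` dataclass, its fields, and `apply_retention` are hypothetical names, and details of real brokers (such as how the segment currently being written is handled) are left out. The sketch shows that whole segments are dropped once they age out, and that the offsets of the surviving records are not renumbered.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A contiguous chunk of the log, starting at base_offset."""
    base_offset: int
    last_write_ts: float  # timestamp of the newest record, in seconds
    records: list = field(default_factory=list)


def apply_retention(segments, now, retention_seconds):
    """Keep only segments whose newest record is within the retention window."""
    return [s for s in segments if now - s.last_write_ts <= retention_seconds]


segments = [
    Segment(base_offset=0, last_write_ts=100.0, records=["a", "b"]),
    Segment(base_offset=2, last_write_ts=500.0, records=["c"]),
    Segment(base_offset=3, last_write_ts=900.0, records=["d", "e"]),
]
kept = apply_retention(segments, now=1000.0, retention_seconds=600)
# The oldest segment is dropped as a unit; the log now starts at offset 2.
```

Deleting by segment rather than by message is what lets the broker avoid per-message bookkeeping.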
Kafka and Microservices
Due to Kafka’s robust performance for big data ingestion, it has a series of use cases for microservices. Microservices often depend on event sourcing, CQRS, and other domain-driven concepts for scalability; their backing store can be provided by Kafka.
Event sourcing applications often generate a large number of events, which makes them tricky to implement with conventional databases. By using Kafka’s log compaction feature, you can preserve events for as long as they are needed. With log compaction, the log is not discarded after a defined time period; instead, Kafka retains at least the most recent event for each key. As a result, the application gains loose coupling, since it can discard or lose logs and, at any point in time, use the preserved events to restore the domain state.
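The effect of log compaction can be sketched with a short function. This is an illustration of the idea only, assuming a log of (key, value) pairs held in a Python list; the name `compact` is hypothetical, and real compaction runs in the background on the broker with additional rules that are not modeled here. The sketch keeps the latest record for each key and leaves the surviving records at their original offsets.

```python
def compact(log):
    """Keep only the most recent record per key, preserving original offsets."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [
        (offset, key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset
    ]


# Hypothetical keyed events: the key is a cart ID, the value its contents.
event_log = [
    ("cart-1", "shirt"),
    ("cart-2", "bag"),
    ("cart-1", "shirt,bag"),
]
compacted = compact(event_log)
# compacted == [(1, "cart-2", "bag"), (2, "cart-1", "shirt,bag")]
```

A service that restarts with empty state can read the compacted log and recover the latest value for every key without replaying superseded events.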
When to Use Kafka?
Whether to use Apache Kafka depends on your use case. While it solves many modern-day problems for web enterprises, like conventional message brokers it does not perform well in every scenario. If your intention is to design a reliable set of data applications and services, then Apache Kafka can function as your source of truth, gathering and storing all the system events.