This first part of the Apache Kafka series explains what Kafka is: a publish-subscribe based durable messaging system rethought as a distributed commit log, exchanging data between systems, applications, and servers.
I'm going to give a brief overview of messaging systems and distributed logs, the Kafka ecosystem, Kafka's architecture, and its important core concepts. Before that, let's go back to 2011 for a brief history of Kafka, LinkedIn's messaging platform.
A Brief History of Kafka
Apache Kafka is a highly scalable messaging system that
plays a critical role in LinkedIn’s central data pipeline. But it was not
always this way.
Over the years, LinkedIn had to make hard architectural decisions as the company grew and scaled. The challenge was to accommodate LinkedIn's growing membership and increasing site complexity.
They had already migrated from a monolithic application infrastructure to microservices. This change allowed their search, profile, communications, and other platforms to scale more efficiently.
They initially developed several different custom data pipelines for their various streaming and queuing data. The use cases ranged from tracking site events like page views to gathering aggregated logs from other services. Other pipelines provided queuing functionality for their InMail messaging system, and so on. These data pipelines needed to scale along with the site.
Rather than maintaining and scaling each pipeline individually, they decided to develop a single, distributed publish-subscribe messaging platform. Thus they ended up creating Kafka.
Kafka was built with a few key design principles in mind: a simple API for both producers and consumers, a design for high throughput, and a scaled-out architecture from the beginning.
In 2011, LinkedIn open-sourced Kafka via the Apache Software Foundation, providing the world with a powerful open-source solution for managing streams of information.
Today, Apache Kafka is part of the Confluent Stream Platform and handles trillions of events every day.
Overview of Kafka
Apache Kafka is a publish-subscribe distributed streaming
platform. Kafka is run as a cluster on one or more servers that can span
multiple data centers.
Kafka stores messages in topics that are partitioned and
replicated across multiple brokers in a cluster. Producers send messages to
topics from which consumers read.
Messages are byte arrays (with String, JSON, and Avro being the most common formats). If a message has a key, Kafka ensures that all messages with the same key go to the same partition.
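As an illustration of keyed partitioning, here is a minimal producer sketch. The broker address (localhost:9092) and the topic name (page-views) are assumptions for this example, not part of any real deployment:

```java
// Minimal sketch of a keyed producer. Broker address and topic are hypothetical.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key ("user-42") always land in the same
            // partition, so per-key ordering is preserved.
            producer.send(new ProducerRecord<>("page-views", "user-42", "/home"));
            producer.send(new ProducerRecord<>("page-views", "user-42", "/profile"));
        } // closing the producer flushes any buffered records
    }
}
```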
Language agnostic: producers and consumers talk to a Kafka cluster using a binary protocol over TCP.
There are four main parts in a Kafka system:
- Broker: A broker is a server that handles all requests from clients (produce, consume, and metadata) and keeps data replicated within the cluster. There can be one or more brokers in a cluster.
- ZooKeeper: Keeps the state of the cluster (brokers, topics, users).
- Producer: An application that sends messages to a broker.
- Consumer: An application that reads data from Kafka (see the sketch below).
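To make the consumer role concrete, here is a minimal consumer sketch that reads back the records from the producer example above. The group id (page-view-readers) is illustrative:

```java
// Minimal consumer sketch, assuming the hypothetical "page-views" topic
// and localhost broker from the producer example above.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SimpleConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "page-view-readers");       // illustrative group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                // Poll the broker for new records, waiting up to 500 ms.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d key=%s value=%s%n",
                            record.partition(), record.key(), record.value());
                }
            }
        }
    }
}
```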
There are five core APIs in Kafka.
- The Producer API allows an application to publish a stream of records to one or more Kafka topics.
- The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them. The processed records may then be sent to Hadoop, Cassandra, or HBase, or pushed back into Kafka for other applications to read the modified and transformed messages.
- The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams into output streams (see the sketch after this list).
- The Connector API provides ready-to-use connectors to import data from databases into Kafka and export data from Kafka to other systems, as well as a framework for building specialized connectors for any application. For example, a connector to a relational database might capture every change to a table.
- The Admin API allows managing and inspecting topics, brokers, and other Kafka objects.
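To give a feel for the Streams API, here is a minimal sketch that consumes the hypothetical page-views topic, upper-cases each value, and produces the result to a hypothetical page-views-upper topic. The application id is illustrative:

```java
// Minimal Kafka Streams sketch: transform each value to upper case.
// Topic names, broker address, and application id are hypothetical.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class UppercaseStreamExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app"); // illustrative id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("page-views");
        // Consume the input stream, transform each record, produce to the output topic.
        source.mapValues(value -> value.toUpperCase())
              .to("page-views-upper");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

And a short Admin API sketch that creates the example topic; the partition count and replication factor are arbitrary values for a single-broker test setup:

```java
// Short Admin API sketch: create the hypothetical example topic.
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 1 (arbitrary example values).
            NewTopic topic = new NewTopic("page-views", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```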