This first part of the Apache Kafka series explains what Kafka is: a publish-subscribe based durable messaging system rethought as a distributed commit log, exchanging data between systems, applications, and servers.
I'm going to give a brief overview of messaging systems and distributed logs, the Kafka ecosystem, Kafka's architecture, and its important core concepts. Before that, let's go back to 2011 for a brief history of Kafka, LinkedIn's messaging platform.
A Brief History of Kafka
Apache Kafka is a highly scalable messaging system that
plays a critical role in LinkedIn’s central data pipeline. But it was not
always this way.
Over the years, LinkedIn had to make hard architectural decisions as the company grew and scaled. The challenge was to accommodate LinkedIn's growing membership and increasing site complexity.
They had already migrated from a monolithic application infrastructure to microservices. This change allowed their search, profile, communications, and other platforms to scale more efficiently.
They initially developed several different custom data pipelines for their various streaming and queuing data. The use cases ranged from tracking site events like page views to gathering aggregated logs from other services. Other pipelines provided queuing functionality for their InMail messaging system, and so on. These data pipelines needed to scale along with the site.
Rather than maintaining and scaling each pipeline individually, they decided to develop a single, distributed publish-subscribe messaging platform. Thus they ended up creating Kafka.
Kafka was built with a few key design principles in mind: a simple API for both producers and consumers, a design for high throughput, and a scaled-out architecture from the beginning.
In 2011, LinkedIn open-sourced Kafka via the Apache Software Foundation, providing the world with a powerful open-source solution for managing streams of information.
Today, Apache Kafka is part of the Confluent Stream Platform and handles trillions of events every day.
Overview of Kafka
Apache Kafka is a publish-subscribe distributed streaming
platform. Kafka is run as a cluster on one or more servers that can span
multiple data centers.
Kafka stores messages in topics that are partitioned and
replicated across multiple brokers in a cluster. Producers send messages to
topics from which consumers read.
Messages are byte arrays (with String, JSON, and Avro being the most common formats). If a message has a key, Kafka ensures that all messages with the same key go to the same partition.
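As an illustration of keyed partitioning, here is a minimal producer sketch. The broker address (localhost:9092) and the topic name (page-views) are assumptions for this example, not part of any real deployment:

```java
// Minimal sketch of a keyed producer. Broker address and topic are hypothetical.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key ("user-42") always land in the same
            // partition, so per-key ordering is preserved.
            producer.send(new ProducerRecord<>("page-views", "user-42", "/home"));
            producer.send(new ProducerRecord<>("page-views", "user-42", "/profile"));
        } // closing the producer flushes any buffered records
    }
}
```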
Language agnostic: producers and consumers talk to a Kafka cluster using a binary protocol over TCP.
There are four main parts in a Kafka system:
- Broker: A broker is a server that handles all requests from clients (produce, consume, and metadata) and keeps data replicated within the cluster. There can be one or more brokers in a cluster.
- ZooKeeper: Keeps the state of the cluster (brokers, topics, users).
- Producer: An application that sends messages to a broker.
- Consumer: An application that reads data from Kafka (see the sketch below).
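To make the consumer role concrete, here is a minimal consumer sketch that reads back the records from the producer example above. The group id (page-view-readers) is illustrative:

```java
// Minimal consumer sketch, assuming the hypothetical "page-views" topic
// and localhost broker from the producer example above.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SimpleConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "page-view-readers");       // illustrative group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                // Poll the broker for new records, waiting up to 500 ms.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d key=%s value=%s%n",
                            record.partition(), record.key(), record.value());
                }
            }
        }
    }
}
```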
There are five core APIs in Kafka.
- The Producer API allows an application to publish a stream of records to one or more Kafka topics.
- The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them. The processed records may then be sent to Hadoop, Cassandra, or HBase, or pushed back into Kafka for other applications to read the modified and transformed messages.
- The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams into output streams (see the sketch after this list).
- The Connector API provides ready-to-use connectors to import data from databases into Kafka and export data from Kafka to other systems, as well as a framework for building specialized connectors for any application. For example, a connector to a relational database might capture every change to a table.
- The Admin API allows managing and inspecting topics, brokers, and other Kafka objects.
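To give a feel for the Streams API, here is a minimal sketch that consumes the hypothetical page-views topic, upper-cases each value, and produces the result to a hypothetical page-views-upper topic. The application id is illustrative:

```java
// Minimal Kafka Streams sketch: transform each value to upper case.
// Topic names, broker address, and application id are hypothetical.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class UppercaseStreamExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app"); // illustrative id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("page-views");
        // Consume the input stream, transform each record, produce to the output topic.
        source.mapValues(value -> value.toUpperCase())
              .to("page-views-upper");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

And a short Admin API sketch that creates the example topic; the partition count and replication factor are arbitrary values for a single-broker test setup:

```java
// Short Admin API sketch: create the hypothetical example topic.
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 1 (arbitrary example values).
            NewTopic topic = new NewTopic("page-views", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```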