Kafka
- Horizontally scalable
- Millions of messages per second
- Terabytes of data
- A critical piece of the big data puzzle
- Kafka is for data collection, storage, and transport, not for transformation
- In Azure, Azure Event Hubs supports the Kafka protocol, but it is a different service from Kafka
- More to explore:
  - DLT (dead letter topic)
  - Retry
  - How partitions are decided
  - Consumer rate, producer rate
  - Back pressure
  - Committed offset
  - auto-offset-reset
  - Idempotent
  - Health checks
  - Noisy neighbor problem
- Kafka uses: large-scale processing, streaming data
- High throughput, Fault tolerance
Use Cases
- Messaging System
- Activity Tracking
  - How many people are watching a live cricket stream
  - User interactions on a website, such as clicks, page views, and sign-ups
- Gather metrics from many locations
- Application log gathering
- Stream processing
- Decoupling of system dependencies
- Integration with Spark, Flink, Storm, Hadoop, etc.
Compared to other messaging systems
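RabbitMQ is not the same as Kafka: RabbitMQ deletes a message as soon as it is consumed and acknowledged, while Kafka retains messages until the topic's retention period expires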
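The retention difference can be sketched with plain Python data structures. This is an illustrative toy, not the API of either system; `ToyQueue` and `ToyLog` are hypothetical names made up for this note.

```python
from collections import deque


class ToyQueue:
    """Queue-style broker: a message is gone once it has been consumed."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def consume(self):
        # Removing the message on delivery models delete-after-consume.
        return self._messages.popleft() if self._messages else None


class ToyLog:
    """Log-style broker: messages stay; each reader tracks its own position."""

    def __init__(self):
        self._records = []

    def publish(self, message):
        self._records.append(message)

    def read(self, offset):
        # Reading does not remove anything, so any reader can re-read.
        return self._records[offset] if 0 <= offset < len(self._records) else None


queue, log = ToyQueue(), ToyLog()
for msg in ("a", "b"):
    queue.publish(msg)
    log.publish(msg)

first_from_queue = queue.consume()   # "a" is removed from the queue
second_from_queue = queue.consume()  # the next call moves on to "b"
reader_one = log.read(0)             # "a" stays in the log
reader_two = log.read(0)             # a second reader still sees "a"
```

In the log model the reader, not the broker, decides where to continue, which is why offsets matter so much in Kafka.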
flowchart TD
Zookeeper-->Broker
Broker-->Topic
Topic-->Partition
Partition-->Message
Topic
- A particular stream of data, similar to a table in a database
- TTL (time to live): how long Kafka keeps data in the topic after receiving it
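A minimal sketch of time-based retention, assuming records are simple `(timestamp, value)` tuples. The function name `apply_retention` is hypothetical. Real Kafka applies retention to whole log segments rather than to individual records, so this only illustrates the idea.

```python
import time


def apply_retention(records, retention_seconds, now=None):
    """Return only the records that are still inside the retention window.

    `records` is a list of (timestamp_seconds, value) tuples.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_seconds
    return [(ts, value) for ts, value in records if ts >= cutoff]


records = [(100.0, "old"), (950.0, "recent"), (990.0, "newest")]
# With "now" fixed at 1000 and a 60 second TTL, the cutoff is 940.
kept = apply_retention(records, retention_seconds=60, now=1000.0)
```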
Partition
- A Topic is split into partitions.
- Messages in each partition are ordered; there is no ordering guarantee across partitions
- Its data structure is an append-only sequence of records called a log
- Uses:
  - Lets consumers read in parallel
  - Helps in replication
- The number of partitions for a topic determines the maximum parallelism for consuming and processing messages
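The parallelism limit can be sketched as follows, assuming a simple round-robin assignment inside one consumer group. `assign_partitions` is a hypothetical helper; Kafka's real assignors are more involved, but each partition still has only one owner per group.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin; one owner per partition."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


three_partitions = [0, 1, 2]
balanced = assign_partitions(three_partitions, ["c1", "c2", "c3"])
oversized = assign_partitions(three_partitions, ["c1", "c2", "c3", "c4"])

# Consumers that received no partition have nothing to read.
idle = [consumer for consumer, owned in oversized.items() if not owned]
```

With three partitions, a fourth consumer in the group stays idle, so adding consumers beyond the partition count does not add throughput.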
Message
- aka Record
- Each Record has the following:
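  - Topic
  - Partition
  - Offset
  - Key (optional)
  - Value
  - Timestamp
  - Headers (optional)
- Timestamp is either the time the producer created the record (`CreateTime`, event time) or the time the broker appended it (`LogAppendTime`), depending on the topic's `message.timestamp.type` setting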
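The record fields listed above can be modelled as a small immutable data class. This is an illustrative shape only: the class name `Record`, the field names, and the sample values are assumptions for this note, not a client library type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    topic: str
    partition: int
    offset: int
    value: bytes
    timestamp_ms: int
    key: Optional[bytes] = None  # optional: keyless records are allowed
    headers: tuple = ()          # optional: (name, value) pairs


record = Record(
    topic="page-views",
    partition=0,
    offset=42,
    value=b'{"page": "/home"}',
    timestamp_ms=1_700_000_000_000,
    key=b"user-17",
    headers=(("source", b"web"),),
)

# Key and headers can be left out entirely.
keyless = Record(topic="page-views", partition=1, offset=0,
                 value=b"{}", timestamp_ms=1_700_000_000_000)
```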
Message Keys
- Producers can choose to send a key with the message (string, number, etc.)
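- If `key=null`, data is sent round robin across partitions (newer clients use a sticky partitioner that fills a batch for one partition before moving to the next)
- If a key is sent, all messages for that key go to the same partition, as long as the number of partitions does not change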
- Partitioning is done by hashing the record's key (the default Java partitioner uses murmur2), so a producer's records with the same key are placed into the same partition
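The key-to-partition rule can be sketched like this. `ToyPartitioner` is a hypothetical class, and `zlib.crc32` is only a stand-in hash chosen because it ships with Python; it will not produce the same partition numbers as a real Kafka client.

```python
import itertools
import zlib


class ToyPartitioner:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition_for(self, key):
        if key is None:
            # No key: rotate through the partitions.
            return next(self._round_robin)
        # Stand-in hash (assumption): same key bytes always give the same result.
        return zlib.crc32(key) % self.num_partitions


partitioner = ToyPartitioner(num_partitions=3)
keyless = [partitioner.partition_for(None) for _ in range(4)]
same_key = {partitioner.partition_for(b"user-17") for _ in range(5)}
```

Because the partition is the hash modulo the partition count, changing the number of partitions can move a key to a different partition.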
Offset
- Each message in a partition gets an incremental id called the offset
- Offsets committed by consumers are stored in a special internal topic named `__consumer_offsets`
- It is managed and maintained by the Kafka brokers
- Each committed offset belongs to a consumer group
- The key is the concatenation of:
  - Consumer group
  - Topic
  - Partition within the topic
- The value is the offset
- If a consumer fails, another consumer in the group can resume processing from the last committed offset
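The key and value layout described above can be sketched as a dictionary keyed by `(group, topic, partition)`. `ToyOffsetStore` is a hypothetical stand-in for `__consumer_offsets`, and the group and topic names are made up. The `default=0` for a group with no committed offset models starting from the earliest record, which is an assumption; in Kafka that behaviour depends on the consumer's offset reset setting.

```python
class ToyOffsetStore:
    """Stand-in for __consumer_offsets: (group, topic, partition) -> offset."""

    def __init__(self):
        self._offsets = {}

    def commit(self, group, topic, partition, offset):
        self._offsets[(group, topic, partition)] = offset

    def fetch(self, group, topic, partition, default=0):
        return self._offsets.get((group, topic, partition), default)


log = ["m0", "m1", "m2", "m3", "m4"]
store = ToyOffsetStore()

# Consumer A processes m0 and m1, commits the next offset to read, then fails.
store.commit("billing", "orders", 0, 2)

# Consumer B in the same group takes over and resumes from the committed offset.
resume_at = store.fetch("billing", "orders", 0)
resumed = log[resume_at:]

# A different group has its own position and starts from the default.
other_group_start = store.fetch("analytics", "orders", 0)
```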
Log
- Kafka stores messages in an ordered data structure called a log
- This is different from a traditional application log file
- Kafka logs are named structures that hold records immutably and are distributed across servers
- Partitions are further split into segments, which are the actual files on disk that hold the messages
- Kafka stores its log files in folders organized by topics and partitions
- See _Kafka_Cleanup_Policies
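A partition split into segments can be sketched as a list of in-memory lists, each standing in for one segment file. `ToyPartitionLog` is a hypothetical class, and rolling a segment after a fixed record count is an assumption made to keep the example short; Kafka rolls segments based on size and time.

```python
import bisect


class ToyPartitionLog:
    def __init__(self, max_records_per_segment):
        self.max_records_per_segment = max_records_per_segment
        self.base_offsets = [0]  # first offset held by each segment
        self.segments = [[]]     # each inner list stands in for one file
        self.next_offset = 0

    def append(self, value):
        # Roll to a new segment when the active one is full.
        if len(self.segments[-1]) >= self.max_records_per_segment:
            self.base_offsets.append(self.next_offset)
            self.segments.append([])
        self.segments[-1].append(value)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def read(self, offset):
        if not 0 <= offset < self.next_offset:
            raise IndexError(f"offset {offset} out of range")
        # Find the last segment whose base offset is <= the requested offset.
        index = bisect.bisect_right(self.base_offsets, offset) - 1
        return self.segments[index][offset - self.base_offsets[index]]


partition_log = ToyPartitionLog(max_records_per_segment=2)
offsets = [partition_log.append(f"m{i}") for i in range(5)]
```

Records are only ever appended, and a read locates the segment by its base offset before indexing into it.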