Kafka

A critical piece of the big data puzzle
Kafka is for data collection, storage, and transport, but not for transformation
In Azure we have Azure Event Hubs, which supports the Kafka protocol but is a different service under the hood, not Kafka itself
More to explore:
- DLT (dead letter topic)
- retry
- how partitions are decided
- consumer rate, producer rate
- back pressure
- committed offset
- auto-offset-reset
- Idempotent
- Health Checks
- Noisy neighbor problem
Kafka is used for: large-scale processing, streaming data

Use Cases

Messaging System
Activity Tracking
- How many people are watching a live stream of a cricket match
- User interactions from a website, like clicks, page views, and sign-ups
Gather metrics from many locations
Application logs gathering
Stream processing
Decoupling of system dependencies
Integration with Spark, Flink, Storm, Hadoop etc.

flowchart TD
Zookeeper-->Broker
Broker-->Topic
Topic-->Partition
Partition-->Message

A topic is a particular stream of data, similar to a table in a database
TTL = Time to Live (retention), how long Kafka keeps data in the topic after receiving it
- default = 7 days
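
A minimal sketch of the retention arithmetic, assuming time-based retention only. The function name `segment_expired` and its parameters are made up for illustration; the only real value here is the 7-day default expressed in milliseconds.

```python
# 7 days in milliseconds, the unit used by the topic-level retention.ms setting
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000


def segment_expired(newest_record_ts_ms: int, now_ms: int,
                    retention_ms: int = SEVEN_DAYS_MS) -> bool:
    """Hypothetical helper: True once the newest record is older than the retention window."""
    return now_ms - newest_record_ts_ms > retention_ms


if __name__ == "__main__":
    now = 1_700_000_000_000  # arbitrary "current time" in epoch milliseconds
    print(SEVEN_DAYS_MS)
    print(segment_expired(now - SEVEN_DAYS_MS - 1, now))  # just past the window
    print(segment_expired(now - 1_000, now))              # one second old
```

Data is not removed the instant it passes the TTL; it only becomes eligible for cleanup at that point.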

Uses of partitions:
- Let consumers read in parallel
- Help with replication
The number of partitions for a topic determines the maximum parallelism for consuming and processing messages: within one consumer group, each partition is read by at most one consumer
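
A toy sketch of why the partition count caps parallelism. It is not one of Kafka's real assignors; `assign_partitions` and the consumer names are hypothetical, and the only rule modelled is that a partition has a single owner within a consumer group.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Hand out partition ids to consumers round-robin (illustrative only)."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    # 3 partitions, 4 consumers in the same group: one consumer gets nothing
    result = assign_partitions(3, ["c1", "c2", "c3", "c4"])
    idle = [c for c, parts in result.items() if not parts]
    print(result)
    print("idle consumers:", idle)
```

Adding consumers beyond the number of partitions does not add throughput; the extra consumers sit idle until a partition frees up.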

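Producers can choose to send a key with the message (string, number, etc.)
If key=null, data is spread across partitions (round robin in older clients; newer clients use a sticky partitioner that fills a batch for one partition before moving to the next)
If a key is sent, all messages for that key go to the same partition, as long as the number of partitions does not change
Partitioning is done by hashing the record's key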
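
A sketch of key-based partitioning as hash(key) modulo the number of partitions. `zlib.crc32` is a stand-in chosen only because it ships with Python; the Java client's default partitioner uses murmur2, so real partition numbers will differ. `partition_for` is a hypothetical name.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition: hash the key, then take it modulo the partition count."""
    # crc32 is an assumption for illustration, not Kafka's actual hash function
    return zlib.crc32(key) % num_partitions


if __name__ == "__main__":
    for key in (b"user-1", b"user-2", b"user-1"):
        # the same key always lands on the same partition for a fixed partition count
        print(key, "->", partition_for(key, 6))
    # changing the partition count can move a key to a different partition
    print(partition_for(b"user-1", 6), partition_for(b"user-1", 7))
```

This is also why per-key ordering is only guaranteed while the partition count stays fixed: the modulo changes when partitions are added.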

auto.offset.reset
- This is used when a consumer group has no committed offset for a partition, e.g. when a new consumer group joins
- latest: Consumer starts from the end of the partition and only reads new messages
- earliest: Consumer starts from the beginning of the partition
https://stackoverflow.com/questions/64426376/is-consumer-offset-managed-at-consumer-group-level-or-at-the-individual-consumer
https://docs.confluent.io/kafka/design/consumer-design.html#consumer-offsets
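
A sketch of how the starting position could be decided, assuming the rule described above: a valid committed offset always wins, and auto.offset.reset only applies when there is none (or it falls outside the log). `starting_offset` and its parameters are hypothetical; this is not the consumer client's actual code.

```python
from typing import Optional


def starting_offset(committed: Optional[int], log_start: int, log_end: int,
                    auto_offset_reset: str) -> int:
    """Pick where a consumer starts reading a partition (illustrative only)."""
    # A committed offset inside the log is used as-is; the reset policy is ignored
    if committed is not None and log_start <= committed <= log_end:
        return committed
    if auto_offset_reset == "earliest":
        return log_start  # replay everything still retained
    if auto_offset_reset == "latest":
        return log_end    # skip history, read only new messages
    if auto_offset_reset == "none":
        raise LookupError("no committed offset and auto.offset.reset=none")
    raise ValueError(f"unknown auto.offset.reset value: {auto_offset_reset}")


if __name__ == "__main__":
    print(starting_offset(None, 0, 100, "earliest"))  # new group, from the beginning
    print(starting_offset(None, 0, 100, "latest"))    # new group, from the end
    print(starting_offset(42, 0, 100, "latest"))      # existing group resumes at 42
```

The third call shows the common point of confusion: an existing group resumes from its committed offset regardless of the setting.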

Kafka stores messages in an ordered data structure called a log
This is different from traditional application logs
Kafka logs are named structures that hold records in an immutable, append-only manner, distributed across servers
Partitions are further split into segments, which are the actual files on the file system holding the messages
Kafka stores its log files in folders organized by topic and partition
See _Kafka_Cleanup_Policies
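
A sketch of how an offset maps to a segment file, assuming the usual on-disk layout: one folder per partition named `<topic>-<partition>`, and segment files named after their first (base) offset, zero-padded to 20 digits. `segment_file_for` and the sample base offsets are hypothetical.

```python
from bisect import bisect_right
from typing import List


def segment_file_for(offset: int, base_offsets: List[int]) -> str:
    """Return the segment file name whose range contains the offset.

    base_offsets must be sorted ascending. The upper end of the last
    segment is not checked in this sketch.
    """
    # Find the last segment whose base offset is <= the requested offset
    index = bisect_right(base_offsets, offset) - 1
    if index < 0:
        raise ValueError(f"offset {offset} is before the first retained segment")
    return f"{base_offsets[index]:020d}.log"


if __name__ == "__main__":
    bases = [0, 1000, 2000]  # three segments of a partition folder such as clicks-0
    print(segment_file_for(0, bases))
    print(segment_file_for(1500, bases))
    print(segment_file_for(2000, bases))
```

Because segments are separate files, cleanup can work on whole old segments rather than on individual messages.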