Kafka

  • Horizontal scalable
    • Millions of messages per second
    • Terabytes of data
  • Critical piece of puzzle in big data
  • Kafka is for Data Collection, Storage and Transport but not for transformation
  • In Azure we have Azure Event Hub which supports Kafka protocol, but actually is different from Kafka
  • More to explore:
    • DLT
    • retry
    • how partitions are decided
    • consumer rate, producer rate
    • back pressure
    • committed offset
    • auto-offset-reset
    • Idempotent
    • Health Checks
    • Noisy neighbor problem
  • Kafka uses Large Processing, Streaming Data
  • High throughput, Fault tolerance

Use Cases

  • Messaging System
  • Activity Tracking
    • How many people are watching live stream of cricket
    • user interactions from a website like clicks, page views, and sign-ups
  • Gather metrics from many locations
  • Application logs gathering
  • Stream processing
  • Decoupling of system dependencies
  • Integration with Spark, Flink, Storm, Hadoop etc.

Compared to other messaging systems

  • RabbitMQ is not same as Kafka as it immediately deletes the data

flowchart TD
Zookeeper-->Broker
Broker-->Topic
Topic-->Partition
Partition-->Message

Topic

  • A particular stream of data. Similar to table in database
  • TTL = Time to Live, keeps the data in topic after it receives
    • default = 7 days

Partition

  • A Topic is split into partitions.
  • Messages in each partition are ordered
  • Its data structure: append only sequence of records called Log
  • Uses:
    • Helps consumers read parallelly
    • Helps in replication
  • The number of partitions for a topic determines the maximum parallelism for consuming and processing messages

Message

  • aka Record
  • Each Record has the following:
    • Topic
    • Partition
    • Offset
    • Key (Optional)
    • Value
    • Timestamp
    • Headers (Optional)
  • Timestamp can be either Event time or Processing time based on the config

Message Keys

  • Producers can choose to send key with the message (string, number etc.)
  • If key=null data is sent in round robin
  • If a key is sent, then all messages for that key will always go to the same partition.
  • Partitioning is done using record’s key
  • For the same key, Producer’s record is placed into same partition

Offset

  • Each message in a partition gets an incremental id called offset
  • offsets are stored in special internal topics named __consumer_offsets
    • managed and maintained by Kafka brokers
    • Each offset is tagged to a consumer group
      • The key is concatenation of:
        • Consumer Group
        • Topic
        • Partition within Topic
      • The value is Offset
    • If consumer fails another consumer in the group can resume processing of messages

Log

  • Kafka stores messages in an ordered data structure called logs
  • This is different from traditional logs
  • Kafka logs are named structures that hold records in an immutable manner, distributed across servers
  • Partitions are further split into segments which are actual files in the file system holding logs, which contain messages
  • Kafka stores its log files in folders organized by topics and partitions
  • See _Kafka_Cleanup_Policies