Processing Streams

  • Write to Derived Systems like DB, Cache, Search Index
  • Push Events to users like Email, Push Notifications
  • Process one or more streams to produce one or more output streams
  • Uses
    • Fraud detection of Credit card by analyzing usage patterns
    • Trading Systems to analyze market and execute trades
    • Manufacturing systems need to monitor and quickly identify issues

Stream Processor

  • aka operator or job
  • Piece of code that processes stream
  • Stream Processor consumes input streams in a read-only fashion and writes its output to a different location in an append-only fashion

Stream Processing vs Batch Processing

  • Stream never ends unlike batch process
  • Sorting does not make sense
  • Fault Tolerance
    • Streams cannot be restarted unlike batch processes

Searching and Analytics in Streams

  • Complex Event Processing (CEP)
  • Materialized Views

Reasoning about time

  • Event time is different from Processing time
  • Processing time maybe delayed:
    • queueing
    • network faults
    • performance issue in broker or processor
    • consumer restart
    • reprocessing of past event while recovering from fault
  • Defining time window to be used in aggregation
    • Using Processing time, the processing time maybe delayed compared to event time and also events could be out of order
      • If using Event time, and the event comes later when window is calculated, the event is aka straggler event, You can:
        • ignore the straggler events, record dropped events as a metric
        • publish a correction, retracting the previous output
    • Using Event time, you can never be sure if you have received all the events for a particular window
  • Assigning timestamps can be tricky
    • Mobile devices sending usage metrics can use local clock while offline
    • But local clocks cannot be trusted
    • Using Server time of event is not useful
    • We can log three timestamps
      • Event time as per device clock
      • Time when event sent to server as per device clock
      • Time when event received by the server as per server clock
    • Device clock offset ignoring network delay:
      • Event received time (Server clock) - Event sent time (Device clock)

Types of Windows

  • Tumbling Window
    • Fixed length
    • Non-overlapping
    • Each event belongs to single window
    • Example: 5-min tumbling window
      • [0–5], [5–10], [10–15]
  • Hopping Window
    • Fixed length
    • Window overlaps
    • Hop size < window size
    • Example: 5-min hopping window with hop size of 2-min
      • [0–5], [2–7], [4–9]
  • Sliding Window
    • Fixed length
    • Window overlaps, Slide continuously
    • Maximum overlap
    • Sliding Window maybe of two types
      • Time sliding, has fixed duration of window
      • Eviction sliding, has fixed number of events in window
    • Example: 5-min sliding window
      • [0-5], [1-6], [2-7] (This is discrete example)
  • Session Window
    • No Fixed duration
    • Grouping together all events for the same user that occur closely together in time
    • Window ends when user is inactive for some time
  • In simple terms
    • If Hop size = window size, then tumbling window
    • If Hop size < window size, then hopping window
    • If Hop size is minimum possible, then sliding window

Stream Joins

  • Types
    • Stream-Stream Joins
      • Bring together search action event and click action event with same Session ID
    • Stream-Table Joins
      • User profiles
    • Table-Table Joins
      • Follower list
  • Time Dependence of Joins
    • Joins can become Non deterministic if ordering is undetermined

Fault Tolerance

  • Exactly-once semantics (EOS)
    • aka effectively-once
    • Visible effect of processing is as if the processing is done only once, even though it may have been processed multiple times due to failures
  • Achieving EOS
    • Microbatching
      • used in Apache Spark Streaming
      • Batch size is typically around 1s
    • Checkpointing
      • used in Apache Flink
    • Atomic Commit
      • used in Google DataCloud Flow, VoltDB and Apache Kafka
      • when we want to have side effects like sending notifications
      • Different from traditional XA, State changes of different systems are kept internally in the stream processor
    • Idempotence
      • Try to have logic which is idempotent naturally
      • Make operation idempotent
        • can include monotonically increasing offset, for example writing DB with the offset will always produce same result
  • Rebuilding state from Failure
    • Operation requiring state like Aggregations (counters, averages etc.), Indexes etc. need to be recovered in case of failures
    • Flink captures periodic snapshots and write to HDFS
    • Kafka Streams replicate changes to dedicated Kafka topic with log compaction