Processing Streams

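Once events are in a stream, there are three broad ways to process them:
- Write the data to derived systems such as a database, cache, or search index
- Push events to users, for example as emails or push notifications
- Process one or more input streams to produce one or more output streams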
Uses
- Credit card fraud detection by analyzing usage patterns
- Trading systems that analyze the market and execute trades
- Manufacturing systems that monitor equipment and quickly identify issues

Stream Processor

Also known as an operator or a job
A piece of code that processes a stream
A stream processor consumes its input streams in a read-only fashion and writes its output to a different location in an append-only fashion
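
The read-only input and append-only output can be illustrated with a minimal sketch. The names `run_processor` and `flag_large_payments`, the event fields, and the 1000 threshold are all hypothetical, chosen only for illustration:

```python
from typing import Callable, Iterable, List


def run_processor(input_stream: Iterable[dict],
                  transform: Callable[[dict], Iterable[dict]],
                  output_log: List[dict]) -> None:
    """Read each input event without modifying it and append results to output_log."""
    for event in input_stream:
        for result in transform(event):
            # Output is only ever appended, never updated in place.
            output_log.append(result)


def flag_large_payments(event: dict):
    """Hypothetical operator: emit an alert for payments above an assumed threshold."""
    if event["amount"] > 1000:
        yield {"card": event["card"], "alert": "large payment"}


# The input is a tuple, so the processor cannot append to or reorder it.
payments = (
    {"card": "A", "amount": 50},
    {"card": "B", "amount": 5000},
)
alerts: List[dict] = []
run_processor(payments, flag_large_payments, alerts)
```

Because the input is never modified, the same input can be reprocessed later (for example after a bug fix) to produce a fresh output stream.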

Tumbling Window
- Fixed length
- Non-overlapping
- Each event belongs to exactly one window
- Example: 5-min tumbling window
  - [0–5], [5–10], [10–15]
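
A minimal sketch of tumbling window assignment, assuming timestamps are measured in minutes and each window includes its start and excludes its end (so an event at minute 5 falls into the second window only). The function name `tumbling_window` is hypothetical:

```python
def tumbling_window(timestamp_min, size_min=5):
    """Return the (start, end) of the single window that contains the timestamp."""
    # Round the timestamp down to the nearest multiple of the window size.
    start = (timestamp_min // size_min) * size_min
    return (start, start + size_min)
```

For example, `tumbling_window(7)` returns `(5, 10)`.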
Hopping Window
- Fixed length
- Window overlaps
- Hop size < window size
- Example: 5-min hopping window with hop size of 2-min
  - [0–5], [2–7], [4–9]
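
With a hopping window, one event can belong to several windows. The sketch below lists them, assuming windows start at multiples of the hop size from time 0 and include their start but exclude their end. The function name `hopping_windows` is hypothetical:

```python
def hopping_windows(timestamp_min, size_min=5, hop_min=2):
    """Return every (start, end) window that contains the timestamp."""
    windows = []
    # Latest window start at or before the timestamp.
    start = (timestamp_min // hop_min) * hop_min
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp_min - size_min and start >= 0:
        windows.append((start, start + size_min))
        start -= hop_min
    return sorted(windows)
```

An event at minute 4 lands in all three example windows: `hopping_windows(4)` returns `[(0, 5), (2, 7), (4, 9)]`. Setting the hop size equal to the window size yields exactly one window per event, which is the tumbling case.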
Sliding Window
- Fixed length
- Windows overlap and slide continuously
- Maximum overlap
- A sliding window may be of two types
  - Time sliding: the window has a fixed duration
  - Eviction sliding: the window holds a fixed number of events
- Example: 5-min sliding window
  - [0–5], [1–6], [2–7] (a discrete illustration; the real window slides continuously)
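
Both sliding variants can be sketched with a double-ended queue. This is a simplified illustration that assumes events arrive in timestamp order; the class names `TimeSlidingWindow` and `CountSlidingWindow` are hypothetical:

```python
from collections import deque


class TimeSlidingWindow:
    """Keeps the events of the last `duration` time units (time sliding)."""

    def __init__(self, duration):
        self.duration = duration
        self.events = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        # Evict everything at or before (timestamp - duration).
        while self.events and self.events[0][0] <= timestamp - self.duration:
            self.events.popleft()
        return [v for _, v in self.events]


class CountSlidingWindow:
    """Keeps the last `max_events` events (eviction sliding)."""

    def __init__(self, max_events):
        # A bounded deque drops the oldest element automatically.
        self.events = deque(maxlen=max_events)

    def add(self, value):
        self.events.append(value)
        return list(self.events)
```

In the time-based variant the window contents depend on event timestamps, while in the count-based variant they depend only on how many events have arrived.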
Session Window
- No fixed duration
- Groups together all events for the same user that occur close together in time
- The window ends when the user has been inactive for some time
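
A minimal sketch of session grouping, assuming events are `(user, timestamp)` pairs sorted by timestamp and that a session closes after an inactivity gap of 30 minutes. Both the function name `sessionize` and the gap value are assumptions made for illustration:

```python
def sessionize(events, gap_min=30):
    """Group each user's event timestamps into sessions split by inactivity gaps."""
    sessions = {}  # user -> list of sessions, each a list of timestamps
    for user, ts in events:
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1] <= gap_min:
            # Still within the inactivity gap: extend the current session.
            user_sessions[-1].append(ts)
        else:
            # First event for this user, or the gap was exceeded: new session.
            user_sessions.append([ts])
    return sessions
```

Unlike the other window types, the window boundaries here are determined by the data itself, so the length of a session is not known in advance.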
In simple terms
- If hop size = window size, it is a tumbling window
- If hop size < window size, it is a hopping window
- If hop size is the minimum possible, it is a sliding window

Types of Joins
- Stream-Stream Joins
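  - Bring together the search event and the click event that share the same session ID
- Stream-Table Joins
  - Enrich a stream of activity events with data from a table, e.g. user profiles
- Table-Table Joins
  - Join two tables that are both kept up to date from streams, e.g. joining posts with a follower list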
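
A stream-stream join needs to keep state about recent events from one side. The sketch below joins clicks to the most recent search in the same session, assuming events arrive in timestamp order and using a hypothetical one-hour join window. The function name and event fields are illustrative, and a real processor would also expire old state and handle a click that arrives before its search:

```python
def join_search_and_clicks(events, window_s=3600):
    """Join each click with the latest search that has the same session ID."""
    searches = {}  # session_id -> most recent search event
    joined = []
    for event in events:
        if event["type"] == "search":
            searches[event["session_id"]] = event
        elif event["type"] == "click":
            search = searches.get(event["session_id"])
            # Only join if the click happened within the join window.
            if search is not None and event["ts"] - search["ts"] <= window_s:
                joined.append({
                    "session_id": event["session_id"],
                    "query": search["query"],
                    "url": event["url"],
                })
    return joined
```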
Time Dependence of Joins
- Joins can become nondeterministic if the ordering of events across streams is undetermined
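
The effect of ordering can be seen in a small stream-table join sketch. The function name `enrich` and the event layout are hypothetical. The same two events produce different output depending on which one the processor sees first:

```python
def enrich(events):
    """Join activity events against the profile table as it exists at that moment."""
    profiles = {}  # user -> latest known profile value
    output = []
    for kind, user, value in events:
        if kind == "profile":
            profiles[user] = value
        else:  # activity event
            output.append((user, value, profiles.get(user)))
    return output


profile_first = enrich([("profile", "u1", "Berlin"), ("activity", "u1", "login")])
activity_first = enrich([("activity", "u1", "login"), ("profile", "u1", "Berlin")])
# profile_first  == [("u1", "login", "Berlin")]
# activity_first == [("u1", "login", None)]
```

If the two inputs come from different streams or partitions with no defined relative order, rerunning the job may pick either interleaving.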

Exactly-once semantics (EOS)
- Also known as effectively-once
- The visible effect of processing is as if each event had been processed exactly once, even though it may actually have been processed multiple times because of failures
Achieving EOS
- Microbatching
  - used in Apache Spark Streaming
  - Batch size is typically around 1s
- Checkpointing
  - used in Apache Flink
- Atomic Commit
  - used in Google Cloud Dataflow, VoltDB and Apache Kafka
  - needed when processing has side effects such as sending notifications, so that outputs and side effects take effect only if processing succeeds
  - Different from traditional XA: state changes and messaging are kept internal to the stream processor rather than spanning different systems
- Idempotence
  - Prefer logic that is naturally idempotent
  - Otherwise, make the operation idempotent
    - e.g. store the monotonically increasing message offset with each DB write, so that repeating a write with the same offset always produces the same result
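
The offset technique can be sketched as follows. `OffsetTrackingStore` is a hypothetical in-memory stand-in for a database; in a real database the value and the offset would need to be updated in the same transaction:

```python
class OffsetTrackingStore:
    """Applies each update at most once by remembering the last applied offset."""

    def __init__(self):
        self.balance = 0
        self.last_offset = -1  # no message applied yet

    def apply(self, offset, amount):
        """Apply the update unless this offset has already been processed."""
        if offset <= self.last_offset:
            return False  # duplicate delivery after a retry: skip it
        self.balance += amount
        self.last_offset = offset
        return True


store = OffsetTrackingStore()
store.apply(0, 10)
store.apply(1, 5)
store.apply(1, 5)  # redelivered after a failure, has no further effect
```

Incrementing a counter is not idempotent on its own, but tracking the offset makes the redelivered message a no-op. This relies on messages being replayed in the same order with the same offsets.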
Rebuilding state from Failure
- Operations that require state, such as aggregations (counters, averages etc.) and indexes, need that state to be recovered after a failure
- Flink captures periodic snapshots of state and writes them to durable storage such as HDFS
- Kafka Streams replicates state changes to a dedicated Kafka topic with log compaction
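
The snapshot approach can be illustrated with a simplified sketch: state is saved together with the input offset it corresponds to, and after a crash the processor restores the snapshot and replays the input from that offset. The class `CountingProcessor` is hypothetical and does not reflect the actual Flink or Kafka Streams APIs:

```python
class CountingProcessor:
    """Counts occurrences of each key in a replayable input log."""

    def __init__(self):
        self.counts = {}
        self.next_offset = 0  # position of the next unprocessed log entry

    def process(self, log, upto):
        """Process log entries from next_offset up to (not including) upto."""
        while self.next_offset < upto:
            key = log[self.next_offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.next_offset += 1

    def snapshot(self):
        """Capture the state together with the offset it corresponds to."""
        return {"counts": dict(self.counts), "next_offset": self.next_offset}

    @classmethod
    def restore(cls, snap):
        processor = cls()
        processor.counts = dict(snap["counts"])
        processor.next_offset = snap["next_offset"]
        return processor


log = ["a", "b", "a", "c", "a"]
processor = CountingProcessor()
processor.process(log, 3)
snap = processor.snapshot()   # in practice written to durable storage
processor.process(log, 4)     # this progress is lost in the simulated crash

recovered = CountingProcessor.restore(snap)
recovered.process(log, len(log))  # replay everything after the snapshot
```

Storing the offset with the state is what prevents the entry at position 3 from being either skipped or counted twice after recovery.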