Processing Streams
- Write to Derived Systems like DB, Cache, Search Index
- Push Events to users like Email, Push Notifications
- Process one or more streams to produce one or more output streams
- Uses
- Fraud detection of Credit card by analyzing usage patterns
- Trading Systems to analyze market and execute trades
- Manufacturing systems need to monitor and quickly identify issues
Stream Processor
- aka operator or job
- Piece of code that processes stream
- Stream Processor consumes input streams in a read-only fashion and writes its output to a different location in an append-only fashion
Stream Processing vs Batch Processing
- Stream never ends unlike batch process
- Sorting does not make sense
- Fault Tolerance
- Streams cannot be restarted unlike batch processes
Searching and Analytics in Streams
- Complex Event Processing (CEP)
- Materialized Views
Reasoning about time
- Event time is different from Processing time
- Processing time maybe delayed:
- queueing
- network faults
- performance issue in broker or processor
- consumer restart
- reprocessing of past event while recovering from fault
- Defining time window to be used in aggregation
- Using Processing time, the processing time maybe delayed compared to event time and also events could be out of order
- If using Event time, and the event comes later when window is calculated, the event is aka straggler event, You can:
- ignore the straggler events, record dropped events as a metric
- publish a correction, retracting the previous output
- Using Event time, you can never be sure if you have received all the events for a particular window
- Assigning timestamps can be tricky
- Mobile devices sending usage metrics can use local clock while offline
- But local clocks cannot be trusted
- Using Server time of event is not useful
- We can log three timestamps
- Event time as per device clock
- Time when event sent to server as per device clock
- Time when event received by the server as per server clock
- Device clock offset ignoring network delay:
- Event received time (Server clock) - Event sent time (Device clock)
Types of Windows
- Tumbling Window
- Fixed length
- Non-overlapping
- Each event belongs to single window
- Example: 5-min tumbling window
- Hopping Window
- Fixed length
- Window overlaps
- Hop size < window size
- Example: 5-min hopping window with hop size of 2-min
- Sliding Window
- Fixed length
- Window overlaps, Slide continuously
- Maximum overlap
- Sliding Window maybe of two types
- Time sliding, has fixed duration of window
- Eviction sliding, has fixed number of events in window
- Example: 5-min sliding window
[0-5], [1-6], [2-7] (This is discrete example)
- Session Window
- No Fixed duration
- Grouping together all events for the same user that occur closely together in time
- Window ends when user is inactive for some time
- In simple terms
- If Hop size = window size, then tumbling window
- If Hop size < window size, then hopping window
- If Hop size is minimum possible, then sliding window
Stream Joins
- Types
- Stream-Stream Joins
- Bring together search action event and click action event with same Session ID
- Stream-Table Joins
- Table-Table Joins
- Time Dependence of Joins
- Joins can become Non deterministic if ordering is undetermined
Fault Tolerance
- Exactly-once semantics (EOS)
- aka effectively-once
- Visible effect of processing is as if the processing is done only once, even though it may have been processed multiple times due to failures
- Achieving EOS
- Microbatching
- used in Apache Spark Streaming
- Batch size is typically around 1s
- Checkpointing
- Atomic Commit
- used in Google DataCloud Flow, VoltDB and Apache Kafka
- when we want to have side effects like sending notifications
- Different from traditional XA, State changes of different systems are kept internally in the stream processor
- Idempotence
- Try to have logic which is idempotent naturally
- Make operation idempotent
- can include monotonically increasing offset, for example writing DB with the offset will always produce same result
- Rebuilding state from Failure
- Operation requiring state like Aggregations (counters, averages etc.), Indexes etc. need to be recovered in case of failures
- Flink captures periodic snapshots and write to HDFS
- Kafka Streams replicate changes to dedicated Kafka topic with log compaction