Faults in Distributed Systems
- Unreliable clocks
- Process pauses
- Partial Failures
Unreliable Clocks
- Clocks can move backwards in time
- Time on multiple nodes maybe quite different from other nodes
- machine’s CPU is defective
- Network is not configured properly
- Large Network delays
- If using distributed systems
- carefully monitor clocks on all nodes
- declare nodes dead if their clock drifts too much
- Types of Clocks
Physical Clock
- Uses device called Quartz crystal oscillator
- Can drift (faster/slower) up to several milliseconds
- depends on temperature of machine
- Google resynchronize multiple times in day for its servers
- sync via NTP servers
- Types
- Time of Day Clock
- measures time
- based on number of seconds since epoch
- Unix:
clock_gettime(CLOCK_REALTIME)
- Java:
System.currentTimeMillis()
- unsuitable for measuring durations since sync can reset the clock
- causes jump forward/backwards in time
- Monotonic Clock
- measures duration
- absolute value is meaningless
- based on something meaningless like nanoseconds since computer started
- guaranteed to move forward in time
- Unix:
clock_gettime(CLOCK_MONOTONIC)
- Java:
System.nanoTime()
- NTP can adjust speed of clock
- Resolution of time is good, in order of microseconds
- Distributed systems usually can use monotonic clock, for example to calculate timeouts
- since they don’t rely on synchronization of time on nodes
Logical Clocks
- measures only the relative ordering of events
- based on incrementing “counters”
Epoch
- Midnight UTC on Jan 1, 1970 on Gregorian Calendar not counting leap seconds
NTP
- Network Time Protocol
- Allows correction of time
- NTP servers get more accurate time information from GPS receiver
- Allows clock rate (speed of clock) to change by ± 0.05% by adjusting frequency of quartz crystal
- NTP adjusts leap seconds gradually over a day
- Clients can query multiple NTP servers ignore outliers for robustness
- Best possible accuracy ~ 10ms
- if network congestion ~ 100ms
PTP
- Precision Time Protocol
- Uses GPS receivers
Uncertainty Bound
- Time source: server
- Expected Quartz drift since last sync
- NTP server’s uncertainty
- Network round trip
- Time source: GPS receiver or atomic (cesium clock) attached to computer
- reported by the manufacturer
- For Snapshot Isolation
- on single node
- logical clocks can be used
- monotonically increasing transaction ID
- For distributed systems
Process Pauses
- Unexpected pause in the execution of program
- Stop the world GC pauses
- VMs can be suspended/resumed effectively stopping/resuming time
- End user can suspend execution of programs by closing laptop
- Context switching in threads
- Context switching multiple VMs
- Synchronous I/O like disk reading
- Paging if enabled, can cause page fault if require disk to be loaded into memory
- too much time spend on swapping pages b/w RAM and disk
- paging is often disabled on server machines
- In Unix sending signals like
SIGSTOP (Ctrl+Z) can cause program to stop using CPU cycles
- In multi-threading we have mutexes, semaphores, atomic counters, etc. but in distributed world, system has no shared memory
- A node must assume that its execution can be paused for significant time at any point
- Also See Realtime_OS
Partial Failures
- In distributed systems, when some parts are broken in unpredictable way, it is called partial failure
- Partial Failures are non-deterministic in nature
- time taken by message to travel in network is non-deterministic
Network Faults
- Network faults can happen
- triggered by software upgrade
- Sharks might bite undersea cable
- Faulty Network interface
- cluster could become deadlocked
- When one part of network is cutoff from the rest due to network fault, it is called Network Partition
- Testing Network Faults
- Deliberately trigger network problems and test response time
- Netflix Chaos Monkey is a tool to perform testing
- Detecting Network Faults
- Uncertainty about networks make it difficult
- If process in node is crashed, node can notify others
- Timeouts can be used although they don’t guarantee if node is really down
Timeouts
- If timeouts are long, it will take long time to declare nodes dead
- If timeouts are short, it will prematurely declare the node dead
- In Distributed Systems, we must assume that network congestion, queueing and unbounded delays will happen
- Hence no correct value of timeouts, need to be determined experimentally
Network congestion and Queueing
- On busy network link packet may have to wait to get a slot
- In public cloud, resources like network links, switches, CPUs, NICs, etc. are shared between two or more tenants
- Activity of one tenant can negatively affect another tenant’s use of the system.
- aka Noisy Neighbor Problem
- When packet reaches, if CPU is busy, OS will queue the request which may take a long time
Synchronous vs Asynchronous Networks
- Computer Networks are asynchronous in nature
- Suffer from queueing
- Hence they have unbounded delays
- Uses packet switching: opportunistically use whatever network bandwidth is available
- They are optimized for bursty traffic
- maximum utilization of the wire
- Telephone Network are synchronous in nature
- does not suffer from queueing
- maximum latency is fixed
- Hence they have bounded delays
- Uses circuit switching: fixed amount of bandwidth which nobody else can use while circuit is established
- Circuit switching if we want to send file, we will have to guess the bandwidth
- If guess is low, transfer will become very slow
- if guess is high, circuit cannot be setup
- Hence circuit switching cannot be used in variable sized data transfer
Byzantine Faults
- When systems intentionally or unintentionally act maliciously and arbitrarily not just crashing or stopping
- A system is Byzantine Fault tolerant (BFT) if it continues correctly when
- nodes are malfunctioning
- nodes not obeying protocols
- malicious attackers interfere with the network
- It is difficult and costly to design system as Byzantine Fault tolerant and require hardware level support
- BFT algorithms rely on supermajority of more than 2/3rd of nodes to be functioning correctly
- BFT uses
- Aerospace environment
- Radiation can corrupt system
- Flight system must work correctly
- Peer-to-Peer Networks
- In Distributed Systems, we can assume non-byzantine faults since nodes are deployed in our datacenter and radiations are low enough