Faults in Distributed Systems

Unreliable Clocks

Clocks can move backwards in time
Time on multiple nodes maybe quite different from other nodes
machine’s CPU is defective
Network is not configured properly
Large Network delays
If using distributed systems
- carefully monitor clocks on all nodes
- declare nodes dead if their clock drifts too much
Types of Clocks
- Physical
- Logical

Uses device called Quartz crystal oscillator
Can drift (faster/slower) up to several milliseconds
- depends on temperature of machine
- Google resynchronize multiple times in day for its servers
sync via NTP servers
Types
- Time of Day Clock
  - measures time
  - based on number of seconds since epoch
  - Unix: clock_gettime(CLOCK_REALTIME)
  - Java: System.currentTimeMillis()
  - unsuitable for measuring durations since sync can reset the clock
    - causes jump forward/backwards in time
- Monotonic Clock
  - measures duration
  - absolute value is meaningless
  - based on something meaningless like nanoseconds since computer started
  - guaranteed to move forward in time
  - Unix: clock_gettime(CLOCK_MONOTONIC)
  - Java: System.nanoTime()
  - NTP can adjust speed of clock
  - Resolution of time is good, in order of microseconds
  - Distributed systems usually can use monotonic clock, for example to calculate timeouts
    - since they don’t rely on synchronization of time on nodes

Network Time Protocol
Allows correction of time
NTP servers get more accurate time information from GPS receiver
Allows clock rate (speed of clock) to change by ± 0.05% by adjusting frequency of quartz crystal
- aka slewing of clock
NTP adjusts leap seconds gradually over a day
- aka smearing
Clients can query multiple NTP servers ignore outliers for robustness
Best possible accuracy ~ 10ms
- if network congestion ~ 100ms

Time source: server
- Expected Quartz drift since last sync
- NTP server’s uncertainty
- Network round trip
Time source: GPS receiver or atomic (cesium clock) attached to computer
- reported by the manufacturer
For Snapshot Isolation
- on single node
  - logical clocks can be used
  - monotonically increasing transaction ID
- For distributed systems
  - It is difficult to achieve
  - See how Google Spanner achieves it

Unexpected pause in the execution of program
- Stop the world GC pauses
- VMs can be suspended/resumed effectively stopping/resuming time
- End user can suspend execution of programs by closing laptop
- Context switching in threads
- Context switching multiple VMs
- Synchronous I/O like disk reading
- Paging if enabled, can cause page fault if require disk to be loaded into memory
  - too much time spend on swapping pages b/w RAM and disk
    - aka thrashing
  - paging is often disabled on server machines
- In Unix sending signals like SIGSTOP (Ctrl+Z) can cause program to stop using CPU cycles
  - program can be resumed with SIGCON signal
  - See terminal_signals
In multi-threading we have mutexes, semaphores, atomic counters, etc. but in distributed world, system has no shared memory
- A node must assume that its execution can be paused for significant time at any point
Also See Realtime_OS

In distributed systems, when some parts are broken in unpredictable way, it is called partial failure
Partial Failures are non-deterministic in nature
- time taken by message to travel in network is non-deterministic

Network faults can happen
- triggered by software upgrade
- Sharks might bite undersea cable
- Faulty Network interface
- cluster could become deadlocked
When one part of network is cutoff from the rest due to network fault, it is called Network Partition
- aka Netsplit
Testing Network Faults
- Deliberately trigger network problems and test response time
- Netflix Chaos Monkey is a tool to perform testing
Detecting Network Faults
- Uncertainty about networks make it difficult
- If process in node is crashed, node can notify others
- Timeouts can be used although they don’t guarantee if node is really down

If timeouts are long, it will take long time to declare nodes dead
If timeouts are short, it will prematurely declare the node dead
In Distributed Systems, we must assume that network congestion, queueing and unbounded delays will happen
- Hence no correct value of timeouts, need to be determined experimentally

On busy network link packet may have to wait to get a slot
- aka Network congestion
In public cloud, resources like network links, switches, CPUs, NICs, etc. are shared between two or more tenants
- Activity of one tenant can negatively affect another tenant’s use of the system.
- aka Noisy Neighbor Problem
When packet reaches, if CPU is busy, OS will queue the request which may take a long time

When systems intentionally or unintentionally act maliciously and arbitrarily not just crashing or stopping
A system is Byzantine Fault tolerant (BFT) if it continues correctly when
- nodes are malfunctioning
- nodes not obeying protocols
- malicious attackers interfere with the network
It is difficult and costly to design system as Byzantine Fault tolerant and require hardware level support
BFT algorithms rely on supermajority of more than 2/3rd of nodes to be functioning correctly
BFT uses
- Aerospace environment
  - Radiation can corrupt system
  - Flight system must work correctly
- Peer-to-Peer Networks
  - Bitcoin
  - Blockchain
In Distributed Systems, we can assume non-byzantine faults since nodes are deployed in our datacenter and radiations are low enough