Faults in Distributed Systems

  • Unreliable clocks
  • Process pauses
  • Partial Failures

Unreliable Clocks

  • Clocks can move backwards in time
  • Time on multiple nodes maybe quite different from other nodes
  • machine’s CPU is defective
  • Network is not configured properly
  • Large Network delays
  • If using distributed systems
    • carefully monitor clocks on all nodes
    • declare nodes dead if their clock drifts too much
  • Types of Clocks
    • Physical
    • Logical

Physical Clock

  • Uses device called Quartz crystal oscillator
  • Can drift (faster/slower) up to several milliseconds
    • depends on temperature of machine
    • Google resynchronize multiple times in day for its servers
  • sync via NTP servers
  • Types
    • Time of Day Clock
      • measures time
      • based on number of seconds since epoch
      • Unix: clock_gettime(CLOCK_REALTIME)
      • Java: System.currentTimeMillis()
      • unsuitable for measuring durations since sync can reset the clock
        • causes jump forward/backwards in time
    • Monotonic Clock
      • measures duration
      • absolute value is meaningless
      • based on something meaningless like nanoseconds since computer started
      • guaranteed to move forward in time
      • Unix: clock_gettime(CLOCK_MONOTONIC)
      • Java: System.nanoTime()
      • NTP can adjust speed of clock
      • Resolution of time is good, in order of microseconds
      • Distributed systems usually can use monotonic clock, for example to calculate timeouts
        • since they don’t rely on synchronization of time on nodes

Logical Clocks

  • measures only the relative ordering of events
  • based on incrementing “counters”

Epoch

  • Midnight UTC on Jan 1, 1970 on Gregorian Calendar not counting leap seconds

NTP

  • Network Time Protocol
  • Allows correction of time
  • NTP servers get more accurate time information from GPS receiver
  • Allows clock rate (speed of clock) to change by ± 0.05% by adjusting frequency of quartz crystal
    • aka slewing of clock
  • NTP adjusts leap seconds gradually over a day
    • aka smearing
  • Clients can query multiple NTP servers ignore outliers for robustness
  • Best possible accuracy ~ 10ms
    • if network congestion ~ 100ms

PTP

  • Precision Time Protocol
  • Uses GPS receivers

Uncertainty Bound

  • Time source: server
    • Expected Quartz drift since last sync
    • NTP server’s uncertainty
    • Network round trip
  • Time source: GPS receiver or atomic (cesium clock) attached to computer
    • reported by the manufacturer
  • For Snapshot Isolation
    • on single node
      • logical clocks can be used
      • monotonically increasing transaction ID
    • For distributed systems

Process Pauses

  • Unexpected pause in the execution of program
    • Stop the world GC pauses
    • VMs can be suspended/resumed effectively stopping/resuming time
    • End user can suspend execution of programs by closing laptop
    • Context switching in threads
    • Context switching multiple VMs
    • Synchronous I/O like disk reading
    • Paging if enabled, can cause page fault if require disk to be loaded into memory
      • too much time spend on swapping pages b/w RAM and disk
        • aka thrashing
      • paging is often disabled on server machines
    • In Unix sending signals like SIGSTOP (Ctrl+Z) can cause program to stop using CPU cycles
  • In multi-threading we have mutexes, semaphores, atomic counters, etc. but in distributed world, system has no shared memory
    • A node must assume that its execution can be paused for significant time at any point
  • Also See Realtime_OS

Partial Failures

  • In distributed systems, when some parts are broken in unpredictable way, it is called partial failure
  • Partial Failures are non-deterministic in nature
    • time taken by message to travel in network is non-deterministic

Network Faults

  • Network faults can happen
    • triggered by software upgrade
    • Sharks might bite undersea cable
    • Faulty Network interface
    • cluster could become deadlocked
  • When one part of network is cutoff from the rest due to network fault, it is called Network Partition
    • aka Netsplit
  • Testing Network Faults
    • Deliberately trigger network problems and test response time
    • Netflix Chaos Monkey is a tool to perform testing
  • Detecting Network Faults
    • Uncertainty about networks make it difficult
    • If process in node is crashed, node can notify others
    • Timeouts can be used although they don’t guarantee if node is really down

Timeouts

  • If timeouts are long, it will take long time to declare nodes dead
  • If timeouts are short, it will prematurely declare the node dead
  • In Distributed Systems, we must assume that network congestion, queueing and unbounded delays will happen
    • Hence no correct value of timeouts, need to be determined experimentally

Network congestion and Queueing

  • On busy network link packet may have to wait to get a slot
    • aka Network congestion
  • In public cloud, resources like network links, switches, CPUs, NICs, etc. are shared between two or more tenants
    • Activity of one tenant can negatively affect another tenant’s use of the system.
    • aka Noisy Neighbor Problem
  • When packet reaches, if CPU is busy, OS will queue the request which may take a long time

Synchronous vs Asynchronous Networks

  • Computer Networks are asynchronous in nature
    • Suffer from queueing
    • Hence they have unbounded delays
    • Uses packet switching: opportunistically use whatever network bandwidth is available
    • They are optimized for bursty traffic
    • maximum utilization of the wire
  • Telephone Network are synchronous in nature
    • does not suffer from queueing
    • maximum latency is fixed
    • Hence they have bounded delays
    • Uses circuit switching: fixed amount of bandwidth which nobody else can use while circuit is established
    • Circuit switching if we want to send file, we will have to guess the bandwidth
      • If guess is low, transfer will become very slow
      • if guess is high, circuit cannot be setup
      • Hence circuit switching cannot be used in variable sized data transfer

Byzantine Faults

  • When systems intentionally or unintentionally act maliciously and arbitrarily not just crashing or stopping
  • A system is Byzantine Fault tolerant (BFT) if it continues correctly when
    • nodes are malfunctioning
    • nodes not obeying protocols
    • malicious attackers interfere with the network
  • It is difficult and costly to design system as Byzantine Fault tolerant and require hardware level support
  • BFT algorithms rely on supermajority of more than 2/3rd of nodes to be functioning correctly
  • BFT uses
    • Aerospace environment
      • Radiation can corrupt system
      • Flight system must work correctly
    • Peer-to-Peer Networks
      • Bitcoin
      • Blockchain
  • In Distributed Systems, we can assume non-byzantine faults since nodes are deployed in our datacenter and radiations are low enough