Ingress and Egress

  • In networking, egress is traffic that leaves an entity or a network boundary, while ingress is traffic that enters it.

RAS (or RAM)

Scalability

  • It is used to describe a system’s ability to cope with increased load
  • Load can be described with a few numbers which we call Load Parameters
  • Example of Load Parameters:
    • Requests per second to a web server
    • Ratio of reads to writes in a database
    • Number of simultaneously active users in a chat room
    • Hit rate on a cache
  • Look at whichever case matters for your system: the average case or the extreme case
  • We monitor how performance changes as load increases
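
The load parameters listed above can be derived from a request log. The following is a minimal sketch, assuming a hypothetical `Request` record with a timestamp, a kind ("read" or "write") and a cache-hit flag; none of these names come from a real library.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float  # seconds since the start of the observation window
    kind: str         # "read" or "write" (hypothetical labels)
    cache_hit: bool   # whether the request was served from cache


def load_parameters(requests, window_seconds):
    """Summarise a list of requests into a few load parameters."""
    total = len(requests)
    reads = sum(1 for r in requests if r.kind == "read")
    writes = sum(1 for r in requests if r.kind == "write")
    hits = sum(1 for r in requests if r.cache_hit)
    return {
        "requests_per_second": total / window_seconds,
        # With no writes at all the ratio is unbounded.
        "read_write_ratio": reads / writes if writes else float("inf"),
        "cache_hit_rate": hits / total if total else 0.0,
    }


sample = [
    Request(0.1, "read", True),
    Request(0.4, "read", False),
    Request(0.9, "write", False),
    Request(1.5, "read", True),
]
params = load_parameters(sample, window_seconds=2.0)
print(params)
```

Which of these numbers matters most depends on the architecture, so the dictionary is only an illustration of the idea.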

Performance

  • Percentiles can be used to describe response time
    • p50 (median)
    • p95, p99, p999 (tail latencies, which show how bad the outliers are)
  • Example of measuring Performance:
    • Throughput
    • Response time
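
As an illustration of the percentiles above, here is a small sketch that computes p50, p95, p99 and p999 from a list of response times. It assumes the nearest-rank definition of a percentile (other definitions interpolate between samples), and the sample data is made up.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds, with two slow outliers.
response_times_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 900]

for label, p in [("p50", 50), ("p95", 95), ("p99", 99), ("p999", 99.9)]:
    print(label, percentile(response_times_ms, p))
```

With only ten samples the high percentiles all collapse onto the slowest request; in practice they only become meaningful with far more data points.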

Reliability

  • Probability that a system will continue to work correctly for a given period
  • About making the system work correctly even when faults occur
  • Fault tolerance can hide certain faults from the end users
  • A system can be operational but not working correctly; we then say it is available but not reliable
    • Example: an e-commerce website during a sale is accessible but not reliable, since users cannot complete checkout properly
  • https://www.wikiwand.com/en/articles/Reliability,_availability_and_serviceability
  • Faults:
    • Hardware
    • Software
    • Human
  • Measure reliability:
    • MTBF, where MTBF = Mean Time Between Failures
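
One common way to estimate Mean Time Between Failures is to divide the operating time (elapsed time minus downtime) by the number of failures. The sketch below assumes that formulation; the function name and the figures are hypothetical.

```python
def mtbf_hours(total_hours, outage_durations_hours):
    """Estimate MTBF as operating time divided by the number of failures.

    Each entry in outage_durations_hours is treated as one failure.
    """
    failures = len(outage_durations_hours)
    if failures == 0:
        raise ValueError("MTBF is undefined when no failure was observed")
    operating_hours = total_hours - sum(outage_durations_hours)
    return operating_hours / failures


# Hypothetical 30-day window (720 hours) with three outages.
estimate = mtbf_hours(720.0, [2.0, 1.0, 3.0])
print(estimate)
```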

Fault Tolerant

  • Keep the services functioning correctly, without any degradation or downtime, even if some internal component is faulty
  • Examples of faults:
    • packets can be lost, reordered, duplicated, arbitrarily delayed
    • clocks are approximate at best
    • nodes can pause (example: due to garbage collection)
    • nodes can crash

Faults vs Failure

  • A fault is one component of the system deviating from its spec
  • A failure is when the system as a whole stops providing the required service to the user
  • It is usually best to design fault-tolerance mechanisms that prevent faults from causing failures

Resiliency

Name            | How does it work?                            | Description
Retry           | repeats failed executions                    | Many faults are transient and may self-correct after a short delay
Circuit Breaker | temporarily blocks possible failures         | When a system is seriously struggling, failing fast is better than making clients wait
Rate Limiter    | limits executions/period                     | Limit the rate of incoming requests
Time Limiter    | limits duration of execution                 | Beyond a certain wait interval, a successful result is unlikely
Bulkhead        | limits concurrent executions                 | Resources are isolated into pools so that if one fails, the others will continue working
Cache           | memorizes a successful result                | Some proportion of requests may be similar
Fallback        | provides an alternative result for failures  | Things will still fail - plan what you will do when that happens
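
Two of the patterns above, Retry and Fallback, can be sketched in a few lines. This is an illustrative sketch only, not the API of any resilience library: the helper `retry_with_fallback`, its parameters and the exponential backoff policy are all assumptions.

```python
import time


def retry_with_fallback(operation, attempts=3, base_delay=0.01, fallback=None):
    """Call operation up to `attempts` times, doubling the wait after each
    failure. If every attempt fails, return fallback() when a fallback is
    given, otherwise re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:  # a real system would catch narrower types
            last_error = error
            if attempt < attempts - 1:
                time.sleep(base_delay * 2 ** attempt)
    if fallback is not None:
        return fallback()
    raise last_error


# A transient fault: fails twice, then succeeds on the third call.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient fault")
    return "ok"


result = retry_with_fallback(flaky)

# A permanent fault: retries are exhausted and the fallback is used.
degraded = retry_with_fallback(lambda: 1 / 0, attempts=2,
                               fallback=lambda: "cached value")
print(result, degraded)
```

Retrying only helps with transient faults. For a dependency that stays down, a circuit breaker or a fallback avoids piling more load onto it.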

Availability

  • Time a system remains operational to perform its required function in a specific period
    • Same as probability that a system is operational at a given time
  • Evaluated as percentage of time that a system, service or a machine remains operational under normal conditions
  • If service is down for maintenance then it is considered not available during that time
    • Downtime is the time when service is not available
  • If a system is reliable then it is available, but not vice versa
    • Reliability ⇒ Availability
  • Involves:
    • Maintainability
    • Repair time
    • spares availability
  • GitHub SLA provides 99.9% (three nines) availability in a year
  • Measure Availability:
    • Availability = Uptime / (Uptime + Downtime)
    • We will use uptime loosely here; it has a slightly different meaning and may or may not include maintenance time etc.
    • Availability = MTBF / (MTBF + MTTR)
    • Here:
      • MTBF = Mean Time Between Failures (average uptime)
        • hard to interpret
      • MTTR = Mean Time To Repair (average downtime)
        • tells you the average time your customers have been affected
Availability | Known as    | Downtime per year
90.0%        | one nine    | 36.53 days
99.0%        | two nines   | 3.65 days
99.9%        | three nines | 8.77 hours
99.99%       | four nines  | 52.60 minutes
99.999%      | five nines  | 5.26 minutes
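
The downtime figures in the table follow directly from the availability percentage. The sketch below assumes a year of 365.25 days and the common formulation Availability = MTBF / (MTBF + MTTR); the function names are illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average year length, 8766 hours


def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is up, from average uptime and downtime."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(availability_percent):
    """Hours per year a system is down at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for percent in [90.0, 99.0, 99.9, 99.99, 99.999]:
    hours = downtime_hours_per_year(percent)
    print(f"{percent}% -> {hours:.2f} hours/year ({hours * 60:.2f} minutes)")

# Hypothetical system: fails every 238 hours on average, 2 hours to repair.
print(f"{availability(238.0, 2.0):.4%}")
```

Because availability depends on both terms, shortening repair time (MTTR) improves it just as effectively as making failures rarer (MTBF).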

Serviceability or Maintainability or Manageability

  • Simplicity and speed with which a system can be repaired or maintained
  • Involves
    • Ease of diagnosing and understanding problems
    • Ease of making updates or modifications
    • Simplicity of operating the system

Other Terms

  • May not have good definitions in distributed systems
    • Robustness
    • Efficiency