Ingress and Egress
In networking, egress is traffic that exits an entity or a network boundary, while ingress is traffic that enters the boundary of a network.
RAS (or RAM)
Scalability
Describes a system's ability to cope with increased load
Load can be described with a few numbers, which we call load parameters
Examples of load parameters:
Requests per second to a web server
Ratio of reads to writes in a database
Number of simultaneously active users in a chat room
Hit rate on a cache
Look at whichever case you are interested in: the average case or the extreme case
We monitor how performance changes as load increases
Percentiles can be used to describe performance, such as response latency
p50 (Median)
p95, p99, p999 (95th, 99th, and 99.9th percentiles)
See p95, p99, p999 latency
Example of measuring Performance:
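A minimal sketch of summarizing latency with the percentiles listed above. The nearest-rank method, the function names `percentile` and `summarize`, and the synthetic log-normal samples are all assumptions made for illustration; real monitoring systems often use approximate or interpolated percentiles instead.

```python
import math
import random


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the percentiles named in these notes for a list of latencies."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "p999": percentile(latencies_ms, 99.9),
    }


if __name__ == "__main__":
    random.seed(42)
    # Synthetic data (assumption): mostly fast requests with a slow tail.
    latencies = [random.lognormvariate(3.0, 0.5) for _ in range(10_000)]
    for name, value in summarize(latencies).items():
        print(f"{name}: {value:.1f} ms")
```

The high percentiles (p99, p999) expose the slow tail that an average would hide, which is why they are usually tracked alongside the median.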
Reliability
Probability that a system will continue to work correctly over a given period
It is about making the system work correctly even when faults occur
Fault-tolerance can hide certain faults from the end users
A system can be operational but not working correctly; we then say it is available but not reliable
Example: during a sale, an e-commerce website is accessible but not reliable, since users cannot complete checkout properly
https://www.wikiwand.com/en/articles/Reliability,_availability_and_serviceability
Faults:
Measure reliability:
$R(t) = \exp\left(-\frac{t}{\text{MTBF}}\right)$, where MTBF = Mean Time Between Failures and $t$ is the period of operation
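A small sketch of evaluating the reliability formula above. It assumes the exponential model (a constant failure rate), which is the setting in which reliability equals exp(−t / MTBF); the function name `reliability` and the use of hours as the unit are illustrative choices.

```python
import math


def reliability(t_hours, mtbf_hours):
    """Probability of running for t_hours without a failure,
    assuming a constant failure rate of 1 / mtbf_hours."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    if t_hours < 0:
        raise ValueError("t must be non-negative")
    return math.exp(-t_hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = 1000.0  # hypothetical MTBF in hours
    for t in (10, 100, 1000):
        print(f"R({t} h) = {reliability(t, mtbf):.3f}")
```

Under this model, reliability over a period equal to the MTBF is only about 37% (1/e), so a high MTBF alone does not guarantee a failure-free run of that length.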
Fault Tolerant
Keep the service functioning correctly, without any degradation or downtime, even if some internal component is faulty
Examples of faults:
packets can be lost, reordered, duplicated, arbitrarily delayed
clocks are approximate at best
nodes can pause (example: due to garbage collection)
nodes can crash
Faults vs Failure
A fault is one component of the system deviating from its spec
A failure is when the system as a whole stops providing the required service to the user
It is usually best to design fault-tolerance mechanisms that prevent faults from causing failures
Resiliency
Ability of a system to adapt to failures, recover quickly, and continue delivering a reliable experience to users
In the event of an error, a resilient system may show some interruption in service or a graceful degradation of performance
Spring Cloud Hystrix project is deprecated
Reliability is the outcome and resilience is the way you achieve the outcome
Resilience Patterns:
| Name | How does it work? | Description |
| --- | --- | --- |
| Retry | repeats failed executions | Many faults are transient and may self-correct after a short delay |
| Circuit Breaker | temporarily blocks possible failures | When a system is seriously struggling, failing fast is better than making clients wait |
| Rate Limiter | limits executions/period | Limit the rate of incoming requests |
| Time Limiter | limits duration of execution | Beyond a certain wait interval, a successful result is unlikely |
| Bulkhead | limits concurrent executions | Resources are isolated into pools so that if one fails, the others will continue working |
| Cache | memorizes a successful result | Some proportion of requests may be similar |
| Fallback | provides an alternative result for failures | Things will still fail - plan what you will do when that happens |
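To make two of the patterns above concrete, here is a minimal sketch that combines Retry and Fallback. It is not the API of any particular resilience library; the function name `retry`, its parameters, and the exponential backoff with jitter are assumptions chosen for illustration.

```python
import random
import time


def retry(operation, max_attempts=3, base_delay=0.1, fallback=None,
          sleep=time.sleep):
    """Call operation(); if it raises, wait and try again.

    The wait doubles after each failed attempt and includes random jitter.
    If every attempt fails, return fallback() when a fallback is given,
    otherwise re-raise the last error.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as error:  # a real system would catch narrower types
            last_error = error
            if attempt < max_attempts - 1:
                delay = base_delay * (2 ** attempt)
                sleep(delay + random.uniform(0, delay))
    if fallback is not None:
        return fallback()
    raise last_error


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_call():
        # Hypothetical operation that fails twice, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("transient fault")
        return "fresh result"

    print(retry(flaky_call, fallback=lambda: "cached result"))
```

Retrying only helps with transient faults; for a dependency that is seriously struggling, repeated retries add load, which is the case the Circuit Breaker row addresses.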
Availability
The time a system remains operational and able to perform its required function during a specific period
Equivalently, the probability that a system is operational at a given time
Evaluated as the percentage of time that a system, service, or machine remains operational under normal conditions
If a service is down for maintenance, it is considered not available during that time
Downtime is the time when the service is not available
If a system is reliable then it is available, but not vice versa
Reliability ⟹ Availability
Involves:
Maintainability
Repair time
Spares availability
GitHub SLA provides 99.9% (3-nines) availability in a year
Measure Availability:
We use uptime loosely here; its exact meaning varies, and it might or might not include maintenance time, etc.
$$
\begin{aligned}
\text{Availability} &= \frac{\text{uptime}}{\text{total time}} \cdot 100\% \\
&= \frac{\text{uptime}}{\text{uptime} + \text{downtime}} \cdot 100\% \\
&= \frac{\text{uptime} / \#\text{failures}}{\text{uptime} / \#\text{failures} + \text{downtime} / \#\text{failures}} \cdot 100\% \\
&= \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \cdot 100\%
\end{aligned}
$$
Here:
MTBF = Mean Time Between Failures (average uptime)
MTTR = Mean Time To Repair (average downtime)
MTTR tells you the average time your customers have been affected per failure
| Availability | Known as | Downtime per year |
| --- | --- | --- |
| 90.0% | one nine | 36.53 days |
| 99.0% | two nines | 3.65 days |
| 99.9% | three nines | 8.77 hours |
| 99.99% | four nines | 52.60 minutes |
| 99.999% | five nines | 5.26 minutes |
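The availability formula and the downtime figures above can be checked with a short sketch. The function names are illustrative, and the 365.25-day year is an assumption inferred from the table (its figures are consistent with that year length).

```python
HOURS_PER_YEAR = 365.25 * 24  # assumed year length, matches the table above


def availability(mtbf_hours, mttr_hours):
    """Availability as a fraction: MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(availability_fraction):
    """Expected downtime per year, in hours, for a given availability."""
    return (1 - availability_fraction) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical service: fails on average every 999 hours, 1 hour to repair.
    a = availability(mtbf_hours=999, mttr_hours=1)
    print(f"availability: {a:.3%}")
    print(f"downtime per year: {downtime_hours_per_year(a):.2f} hours")
```

The same availability can come from rare but long outages or from frequent but short ones, so MTBF and MTTR are worth tracking separately.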
Serviceability or Maintainability or Manageability
Simplicity and speed with which a system can be repaired or maintained
Involves
Ease of diagnosing and understanding problems
Ease of making updates or modifications
Simplicity to operate the system
Other Terms
May not have good definitions in distributed systems