Ingress and Egress

In networking, egress is traffic that exits an entity or a network boundary, while ingress is traffic that enters that boundary.

RAS (or RAM)

Scalability describes a system's ability to cope with increased load.
Load can be described with a few numbers, which we call load parameters.
Examples of load parameters:
- Requests per second to a web server
- Ratio of reads to writes in a database
- Number of simultaneously active users in a chat room
- Hit rate on a cache
Depending on what you are interested in, look at the average case or the extreme case.
We then monitor how performance changes as load increases.
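
As a rough illustration of the load parameters listed above, the following sketch summarizes a few raw counters into average-case and extreme-case numbers. The function name `load_summary` and the sample figures are hypothetical, chosen only for this example.

```python
from statistics import mean


def load_summary(requests_per_second, reads, writes, cache_hits, cache_misses):
    """Summarize some load parameters from raw counters (illustrative only)."""
    return {
        # Average case: mean request rate over the sampled seconds.
        "avg_rps": mean(requests_per_second),
        # Extreme case: the busiest sampled second.
        "peak_rps": max(requests_per_second),
        # Ratio of reads to writes in the database.
        "read_write_ratio": reads / writes,
        # Fraction of cache lookups that were hits.
        "cache_hit_rate": cache_hits / (cache_hits + cache_misses),
    }


# Made-up sample: four one-second samples with one traffic spike.
summary = load_summary(
    [120, 150, 900, 130], reads=9000, writes=1000, cache_hits=800, cache_misses=200
)
print(summary)
```

The gap between `avg_rps` and `peak_rps` in this sample shows why the average case alone can hide the load that actually stresses the system.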

Reliability is the probability that a system will continue to work correctly over a given period.
It is about making the system work correctly even when faults occur.
Fault tolerance can hide certain faults from end users.
A system can be operational without working correctly; in that case it is available but not reliable.
- Example: an e-commerce website that is accessible during a sale but not reliable, since users cannot check out properly
https://www.wikiwand.com/en/articles/Reliability,_availability_and_serviceability
Faults:
- Hardware
- Software
- Human
Measure reliability:
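- $R(t) = \exp\left(-\frac{t}{MTBF}\right)$, where $MTBF$ is the mean time between failures and $t$ is the period of interest (this form assumes a constant failure rate)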
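
A minimal sketch of the reliability formula above, assuming a constant failure rate (the exponential model) and that `t` and `MTBF` use the same time unit. The function name `reliability` is an illustrative choice.

```python
import math


def reliability(t, mtbf):
    """Probability of running failure-free for duration t, given the MTBF.

    Assumes a constant failure rate; t and mtbf must share a time unit.
    """
    return math.exp(-t / mtbf)


# When the period equals the MTBF, reliability is exp(-1), roughly 0.37.
r_at_mtbf = reliability(1000, 1000)
print(round(r_at_mtbf, 4))
```

Under this model, a system is more likely than not to have failed at least once by the time one full MTBF has elapsed.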

Fault tolerance means keeping services functioning correctly, without degradation or downtime, even if some internal component is faulty.
Examples of faults:
- packets can be lost, reordered, duplicated, arbitrarily delayed
- clocks are approximate at best
- nodes can pause (example: due to garbage collection)
- nodes can crash

A fault is one component of the system deviating from its spec.
A failure is when the system as a whole stops providing the required service to the user.
It is usually best to design fault-tolerance mechanisms that prevent faults from causing failures.
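
To make the fault versus failure distinction concrete, here is a small sketch in which a transient fault in one component is absorbed by a retry loop, so the caller never sees a failure. `FlakyService` and `call_with_retry` are hypothetical names invented for this example.

```python
import time


def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry `operation` on any exception, doubling the delay between attempts."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: the fault surfaces as a failure.
            sleep(base_delay * 2 ** attempt)


class FlakyService:
    """Hypothetical component whose first two calls fail (a transient fault)."""

    def __init__(self):
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("transient fault")
        return "ok"


service = FlakyService()
# Sleeping is stubbed out here so the example runs instantly.
result = call_with_retry(service.fetch, sleep=lambda seconds: None)
print(result, service.calls)
```

The component faulted twice, but the system as a whole still delivered the required result, which is the sense in which the fault did not become a failure.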

Resilience is the ability of a system to adapt to failures, recover quickly, and continue delivering a reliable experience to users.
In the event of an error, a resilient system may show a brief interruption in service or a graceful degradation of performance.
The Spring Cloud Hystrix project is deprecated.
Reliability is the outcome, and resilience is the way you achieve that outcome.
Resilience Patterns:

| Name | How does it work? | Description |
| --- | --- | --- |
| Retry | Repeats failed executions | Many faults are transient and may self-correct after a short delay |
| Circuit Breaker | Temporarily blocks possible failures | When a system is seriously struggling, failing fast is better than making clients wait |
| Rate Limiter | Limits executions per period | Limit the rate of incoming requests |
| Time Limiter | Limits duration of execution | Beyond a certain wait interval, a successful result is unlikely |
| Bulkhead | Limits concurrent executions | Resources are isolated into pools so that if one fails, the others will continue working |
| Cache | Memorizes a successful result | Some proportion of requests may be similar |
| Fallback | Provides an alternative result for failures | Things will still fail, so plan what you will do when that happens |
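
As one possible illustration of the Circuit Breaker row, the sketch below fails fast once a failure threshold is reached and allows a trial call after a timeout. It is a simplified toy, not the API of any resilience library; the class names, the threshold and the timeout values are all assumptions made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed.

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of making the client wait.
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would typically wrap each remote call in `breaker.call(...)` and pair `CircuitOpenError` with a fallback result, combining two rows of the table.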

Availability is the time a system remains operational to perform its required function in a specific period.
- Equivalently, the probability that a system is operational at a given time
It is evaluated as the percentage of time that a system, service, or machine remains operational under normal conditions.
If a service is down for maintenance, it is considered not available during that time.
- Downtime is the time when the service is not available
If a system is reliable then it is available, but not vice versa.
- Reliability $⟹$ Availability
Involves:
- Maintainability
- Repair time
- Spares availability
The GitHub SLA provides 99.9% (three nines) availability in a year.
Measure Availability:
- We use "uptime" loosely here; its exact meaning varies and may or may not include maintenance time.
- Here:
  - $MTBF$ = mean time between failures (average uptime)
    - hard to interpret
  - $MTTR$ = mean time to repair (average downtime)
    - tells you the average time your customers have been affected

$$
\text{Availability} = \frac{\text{uptime}}{\text{total time}} \cdot 100\% = \frac{\text{uptime}}{\text{uptime} + \text{downtime}} \cdot 100\% = \frac{\frac{\text{uptime}}{\#\text{failures}}}{\frac{\text{uptime}}{\#\text{failures}} + \frac{\text{downtime}}{\#\text{failures}}} \cdot 100\% = \frac{MTBF}{MTBF + MTTR} \cdot 100\%
$$
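
The availability formula above can be evaluated directly. The sketch below also converts an availability percentage into a downtime budget, assuming a 365-day year; the function names and the sample MTBF and MTTR values are illustrative assumptions.

```python
def availability_percent(mtbf, mttr):
    """Availability as a percentage; mtbf and mttr must share a time unit."""
    return mtbf / (mtbf + mttr) * 100


def allowed_downtime_hours(availability, period_hours=365 * 24):
    """Downtime budget in hours for a given availability percentage."""
    return (1 - availability / 100) * period_hours


# Made-up figures: a failure every 999 hours on average, 1 hour to repair.
three_nines = availability_percent(mtbf=999, mttr=1)
budget = allowed_downtime_hours(99.9)
print(round(three_nines, 3), round(budget, 2))
```

With these sample values the availability works out to three nines, which over a 365-day year corresponds to a budget of about 8.76 hours of downtime.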

Maintainability is the simplicity and speed with which a system can be repaired or maintained.
Involves:
- Ease of diagnosing and understanding problems
- Ease of making updates or modifications
- Simplicity of operating the system