Disaster Recovery (DR)

A strategic, methodical approach to restoring systems, or critical parts of them, after a major failure event
Major Failure Events include:
- Regional Outages
- Corrupted environment due to malicious activity
- Severe infrastructure failures
- Natural disasters or geopolitical events causing extended service unavailability
Recovery Time Objective (RTO)
- How quickly a system must be restored after a disruption
Recovery Point Objective (RPO)
- How much data loss is acceptable, expressed as a window of time; it determines how frequently data must be backed up or replicated
https://learn.microsoft.com/en-us/azure/well-architected/design-guides/disaster-recovery
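
The two objectives above can be checked with simple arithmetic. The sketch below is illustrative only: the function names and the split of recovery time into detect / fail over / validate phases are assumptions made for this example, not terms from the linked guide. It assumes periodic backups, so the worst-case data loss equals the backup interval.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Worst case: the failure hits just before the next backup would run,
    so everything written since the previous backup is lost."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the backup schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(detect: timedelta, fail_over: timedelta,
              validate: timedelta, rto: timedelta) -> bool:
    """True if the summed recovery phases (hypothetical breakdown) fit the RTO."""
    return detect + fail_over + validate <= rto


# Hourly backups cannot satisfy a 15-minute RPO.
hourly_ok = meets_rpo(timedelta(hours=1), timedelta(minutes=15))      # False
# Backups every 5 minutes can.
frequent_ok = meets_rpo(timedelta(minutes=5), timedelta(minutes=15))  # True
```

A tighter RPO therefore means more frequent backups or continuous replication, and a tighter RTO leaves less time for each recovery phase.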

DR Strategies

Active-Active
- Two or more environments fully operational and serving live traffic simultaneously across multiple regions
- If one environment fails, others continue handling the load with zero or near-zero disruption.
Active-Passive (Warm Standby)
- Partially provisioned environment running minimal services that can scale up quickly during failures
Active-Passive (Cold Standby)
- Environment that isn’t running and requires provisioning and data restoration when activated. Lowest cost, longest recovery time.
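
The cost versus recovery-time trade-off between these strategies can be sketched as a small selection helper. The recovery times and cost ranks below are made-up placeholders for illustration, not figures from any provider; only the ordering (cold standby is cheapest and slowest) comes from the notes above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_recovery: timedelta  # illustrative assumption, not a measured value
    relative_cost: int           # 1 = cheapest; illustrative ranking only


STRATEGIES = [
    Strategy("Active-Active", timedelta(minutes=1), 3),
    Strategy("Active-Passive (Warm Standby)", timedelta(minutes=30), 2),
    Strategy("Active-Passive (Cold Standby)", timedelta(hours=8), 1),
]


def cheapest_strategy_for(rto: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost strategy whose assumed recovery time fits the RTO,
    or None if no listed strategy is fast enough."""
    candidates = [s for s in STRATEGIES if s.typical_recovery <= rto]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.relative_cost)
```

With these placeholder numbers, a 24-hour RTO selects cold standby, a 1-hour RTO selects warm standby, and a 5-minute RTO leaves active-active as the only option.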

Failover
- Process of shifting workloads from primary to standby environment during a disaster
Failback
- Process of returning workloads to the original primary environment after incident resolution
https://learn.microsoft.com/en-us/azure/reliability/concept-failover-failback
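
One possible way to model the failover and failback decisions is a small state machine. This is a minimal sketch under stated assumptions: the class name, the threshold of three consecutive failed health checks, and the rule that failback happens only after an explicit verification are all hypothetical choices for illustration, not behavior of any Azure service.

```python
class FailoverController:
    """Tracks which environment is active ("primary" or "standby")."""

    def __init__(self, failure_threshold: int = 3) -> None:
        # Assumed policy: fail over only after N consecutive failed checks,
        # so a single transient error does not trigger a failover.
        self.failure_threshold = failure_threshold
        self.active = "primary"
        self._consecutive_failures = 0

    def record_health_check(self, primary_healthy: bool) -> str:
        """Record one health check of the primary and return the active environment."""
        if self.active != "primary":
            # Already failed over; returning to primary is a separate, deliberate step.
            return self.active
        if primary_healthy:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self.failure_threshold:
                self.active = "standby"  # failover
        return self.active

    def fail_back(self, primary_verified: bool) -> str:
        """Return workloads to the primary once it has been verified as healthy."""
        if self.active == "standby" and primary_verified:
            self.active = "primary"  # failback
            self._consecutive_failures = 0
        return self.active
```

Keeping failback as an explicit call reflects the common practice of returning to the primary only after the incident is resolved and the environment has been checked, rather than as soon as a single health check passes.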

High Availability (HA)

Designing a solution to be resilient to day-to-day issues and to meet the business needs for availability
HA Design Elements
- Fault tolerance
- Redundancy
- Scalability and elasticity
- Automated testing using chaos engineering
- Monitoring and alerting
https://learn.microsoft.com/en-us/azure/reliability/concept-business-continuity-high-availability-disaster-recovery
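
To illustrate why redundancy appears among the design elements, the sketch below applies the standard composite-availability formulas. It assumes component failures are independent, which real systems only approximate; the function names are chosen for this example.

```python
def parallel_availability(instance_availability: float, replicas: int) -> float:
    """Availability of N redundant replicas where any one can serve traffic.
    The group is down only if every replica is down at the same time."""
    return 1 - (1 - instance_availability) ** replicas


def serial_availability(*components: float) -> float:
    """Availability of a chain of dependencies that must all be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in minutes over a 365-day year."""
    return (1 - availability) * 365 * 24 * 60
```

Under the independence assumption, two replicas at 99% availability each give roughly 99.99% combined, while chaining two 99.9% dependencies in series lowers the overall figure to about 99.8%.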