Disaster Recovery (DR)
- A strategic, methodical approach to restoring systems, or their critical parts, after a major failure event
- Major failure events include:
  - Regional outages
  - Corrupted environment due to malicious activity
  - Severe infrastructure failures
  - Natural disasters or geopolitical events causing extended service unavailability
- Recovery Time Objective (RTO)
  - How quickly a system must be restored after a disruption
- Recovery Point Objective (RPO)
  - How much data loss is acceptable, measured as a window of time; it determines how frequently data must be backed up or replicated
- https://learn.microsoft.com/en-us/azure/well-architected/design-guides/disaster-recovery
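A minimal sketch of how the two objectives can be checked, assuming periodic backups and a timed recovery drill. The function names and the target values (15-minute RPO, 1-hour RTO) are illustrative assumptions, not figures from the referenced guide.

```python
from datetime import datetime, timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    # Worst case: the failure strikes just before the next backup runs,
    # so up to one full backup interval of data is lost.
    return backup_interval <= rpo


def meets_rto(failed_at: datetime, restored_at: datetime, rto: timedelta) -> bool:
    # The outage duration must not exceed the recovery time objective.
    return (restored_at - failed_at) <= rto


# Assumed targets, for illustration only.
rpo = timedelta(minutes=15)
rto = timedelta(hours=1)

hourly_ok = meets_rpo(timedelta(hours=1), rpo)       # hourly backups vs 15-minute RPO
frequent_ok = meets_rpo(timedelta(minutes=5), rpo)   # 5-minute backups vs 15-minute RPO

# Hypothetical recovery drill: failure at 12:00, service restored at 12:45.
failed_at = datetime(2024, 1, 1, 12, 0)
restored_at = datetime(2024, 1, 1, 12, 45)
drill_ok = meets_rto(failed_at, restored_at, rto)
```

Here hourly backups fail the 15-minute RPO, 5-minute backups satisfy it, and the 45-minute drill fits inside the 1-hour RTO. Tighter objectives generally require more frequent backups or continuous replication, and more of the environment kept pre-provisioned.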
DR Strategies
- Active-Active (Hot Standby)
  - Two or more environments fully operational and serving live traffic simultaneously across multiple regions
  - If one environment fails, the others continue handling the load with zero or near-zero disruption.
- Active-Passive (Warm Standby)
  - Partially provisioned environment running minimal services that can scale up quickly during failures
- Active-Passive (Cold Standby)
  - Environment that isn’t running and requires provisioning and data restoration when activated. Lowest cost, longest recovery time.
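The routing difference between the strategies above can be sketched as follows. This is a simplified model, not a real traffic manager: the `Region` class, the function names, and the region names are hypothetical, and health is a plain flag rather than the result of an actual probe.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    name: str
    healthy: bool = True


def active_active_targets(regions: List[Region]) -> List[str]:
    # Every healthy region serves live traffic; a failed region simply
    # drops out and the remaining ones absorb its load.
    return [r.name for r in regions if r.healthy]


def active_passive_target(primary: Region, standby: Region) -> Optional[str]:
    # The primary serves alone. The standby receives traffic only after
    # the primary fails (and only once the standby itself is ready).
    if primary.healthy:
        return primary.name
    if standby.healthy:
        return standby.name
    return None  # nothing available to serve traffic


east = Region("east")
west = Region("west")

before = active_active_targets([east, west])   # both regions serve traffic
east.healthy = False                           # simulate a regional outage
after = active_active_targets([east, west])    # only the surviving region
failover = active_passive_target(east, west)   # standby takes over
```

In this model warm and cold standby route traffic the same way. They differ in how long the standby takes to become ready: a warm standby only needs to scale up, while a cold standby must first be provisioned and have its data restored.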
Failover and Failback
High Availability (HA)