Database downtime is expensive. A widely cited Gartner estimate puts the average cost of IT downtime at about $5,600 per minute for enterprises, though the real figure varies widely by business. For critical database systems, even minutes of unavailability can result in lost revenue, damaged reputation, and regulatory penalties.
In this article, we'll cover a comprehensive framework for designing and implementing highly available database architectures.
Understanding Availability Metrics
Before designing an HA solution, understand what level of availability you need:
- 99.9% (Three Nines) — Up to 8.76 hours downtime/year. Acceptable for internal tools.
- 99.99% (Four Nines) — Up to 52.56 minutes downtime/year. Standard for business-critical apps.
- 99.999% (Five Nines) — Up to 5.26 minutes downtime/year. Often targeted for financial and healthcare systems.
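The downtime figures above follow directly from the availability percentage. As a minimal sketch (the function name is illustrative, and a 365-day year is assumed), the budget for any target can be computed like this:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for a given target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% -> {minutes:.2f} min/year ({minutes / 60:.2f} h/year)")
```

Running the same calculation for an intermediate target such as 99.95% can help when a service level sits between two of the tiers listed above.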
HA Architecture Patterns
1. Active-Passive (Warm Standby)
One primary database handles all traffic while a standby replica stays synchronized. On primary failure, the standby is promoted.
- Oracle Data Guard — Physical or logical standby databases with automatic failover via Fast-Start Failover
- MySQL Replication — Semi-synchronous replication with automated failover via MHA or Orchestrator
- PostgreSQL Streaming Replication — Hot standby with Patroni or repmgr for failover management
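To make the promotion step concrete, here is a simplified sketch of how a failover manager might choose which standby to promote. The `Standby` class, the field names, and the lag limit are hypothetical and do not reflect how MHA, Orchestrator, Patroni, or repmgr are implemented; the point is only that a candidate should be healthy and not too far behind the failed primary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Standby:
    name: str
    healthy: bool
    lag_bytes: int  # how far this replica is behind the failed primary


def pick_promotion_candidate(
    standbys: List[Standby], max_lag_bytes: int
) -> Optional[Standby]:
    """Return the healthy standby with the least lag, or None if none qualifies."""
    eligible = [s for s in standbys if s.healthy and s.lag_bytes <= max_lag_bytes]
    return min(eligible, key=lambda s: s.lag_bytes, default=None)


replicas = [
    Standby("replica-a", healthy=True, lag_bytes=4096),
    Standby("replica-b", healthy=True, lag_bytes=512),
    Standby("replica-c", healthy=False, lag_bytes=0),
]
candidate = pick_promotion_candidate(replicas, max_lag_bytes=1_048_576)
print(candidate.name if candidate else "no eligible standby")
```

Returning `None` instead of promoting a stale replica is a deliberate choice in this sketch: it trades availability for a bound on data loss.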
2. Active-Active (Multi-Primary)
Multiple database nodes accept read-write traffic simultaneously. This is more complex to operate, because concurrent writes on different nodes can conflict, but it can scale better:
- Oracle RAC — Multiple instances sharing a single database on shared storage
- MySQL Group Replication / InnoDB Cluster — Multi-primary mode with built-in conflict detection
- PostgreSQL BDR — Bi-directional replication for multi-master setups
3. Distributed Database
Data is distributed across multiple nodes with built-in replication and failover:
- CockroachDB — PostgreSQL-compatible, globally distributed
- Amazon Aurora — MySQL/PostgreSQL-compatible with 6-way replication across 3 AZs
- Google Cloud Spanner — Globally consistent distributed database
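Many replicated systems stay available during node failures by requiring only a majority of replicas to acknowledge each operation. The sketch below shows that arithmetic in generic form; it is an assumption for illustration, and each product listed above defines its own quorum rules.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority is still reachable."""
    return replicas - majority(replicas)


for n in (3, 5, 6):
    print(f"{n} replicas: majority={majority(n)}, tolerates {tolerated_failures(n)} failures")
```

Note that going from 5 to 6 replicas does not increase the number of tolerated failures under a simple majority rule, which is one reason odd replica counts are common.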
Implementing Automatic Failover
Manual failover is error-prone under pressure. Automate your failover process:
Oracle Data Guard Broker
-- Enable Fast-Start Failover
DGMGRL> ENABLE FAST_START FAILOVER;
-- Configure failover conditions (both values are in seconds)
DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 30;
DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverLagLimit = 10;
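The two properties above express a general idea: wait long enough to rule out a brief network blip, and only fail over if the standby is close enough to the primary. The following is a hypothetical decision rule that illustrates that idea with the same numbers. It is not Oracle's observer logic, and the function and parameter names are invented for this example.

```python
def should_fail_over(
    seconds_since_primary_seen: float,
    standby_lag_seconds: float,
    threshold_seconds: float = 30.0,  # mirrors FastStartFailoverThreshold above
    lag_limit_seconds: float = 10.0,  # mirrors FastStartFailoverLagLimit above
) -> bool:
    """Hypothetical rule: fail over only after a sustained outage and with a fresh standby."""
    primary_presumed_down = seconds_since_primary_seen >= threshold_seconds
    standby_fresh_enough = standby_lag_seconds <= lag_limit_seconds
    return primary_presumed_down and standby_fresh_enough


print(should_fail_over(5, 0))    # brief blip
print(should_fail_over(45, 3))   # long outage, fresh standby
print(should_fail_over(45, 25))  # long outage, stale standby
```

A lower threshold shortens recovery time but raises the risk of failing over on a transient glitch, so the value is worth tuning against observed network behavior.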
PostgreSQL with Patroni
# patroni.yml configuration
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      parameters:
        max_connections: 200
        shared_buffers: 8GB
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 5
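The timing values in this configuration are related. Patroni's documentation, as we understand it, asks that `loop_wait + 2 * retry_timeout` not exceed `ttl`, and `maximum_lag_on_failover` is expressed in bytes. The helper below is a hypothetical sanity check for those two points (it is not part of Patroni); confirm the rule against the documentation for the version you run.

```python
from typing import List


def check_patroni_timing(ttl: int, loop_wait: int, retry_timeout: int) -> List[str]:
    """Return a list of problems with the DCS timing settings (empty if none)."""
    problems = []
    if loop_wait + 2 * retry_timeout > ttl:
        problems.append(
            f"loop_wait + 2 * retry_timeout = {loop_wait + 2 * retry_timeout} exceeds ttl = {ttl}"
        )
    return problems


# Values copied from the configuration above.
settings = {"ttl": 30, "loop_wait": 10, "retry_timeout": 10, "maximum_lag_on_failover": 1048576}

problems = check_patroni_timing(settings["ttl"], settings["loop_wait"], settings["retry_timeout"])
print(problems or "timing settings look consistent")
print(f"maximum_lag_on_failover = {settings['maximum_lag_on_failover'] / 1024 ** 2:.0f} MiB")
```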
Data Protection & Backup Strategy
HA protects against server failure, but you also need backups to protect against:
- Human error — Accidental DELETE/DROP statements
- Data corruption — Software bugs, storage failures
- Ransomware — Encrypted/destroyed data
- Site-level disaster — Fire, flood, power outage
3-2-1 Backup Rule
- 3 copies of your data
- 2 different storage media types
- 1 copy offsite (or in a different cloud region)
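The rule is easy to check mechanically against a backup inventory. This is a minimal sketch; the `BackupCopy` class and the sample inventory are hypothetical, and the copy count here includes every copy of the data that you list.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BackupCopy:
    label: str
    media: str     # e.g. "disk", "tape", "object-storage"
    offsite: bool  # stored at another site or in another cloud region


def satisfies_3_2_1(copies: Iterable[BackupCopy]) -> bool:
    """True if there are >= 3 copies, on >= 2 media types, with >= 1 offsite."""
    copies = list(copies)
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


inventory = [
    BackupCopy("production volume", media="disk", offsite=False),
    BackupCopy("nightly local backup", media="disk", offsite=False),
    BackupCopy("weekly archive, other region", media="object-storage", offsite=True),
]
print(satisfies_3_2_1(inventory))
```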
Monitoring & Alerting
You can't fix what you can't see. Implement comprehensive monitoring:
- Replication lag monitoring — Alert when lag exceeds your recovery point objective (RPO)
- Connection pool health — Track active connections, wait times, and rejected connections
- Storage capacity — Alert at 80% capacity, auto-expand if possible
- Failover testing — Regularly test automatic failover (chaos engineering style)
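Two of the checks above reduce to simple threshold comparisons. The sketch below assumes the lag and disk figures have already been collected by your monitoring agent; the function name and message wording are illustrative only.

```python
from typing import List


def evaluate_alerts(
    replication_lag_seconds: float,
    rpo_seconds: float,
    disk_used_bytes: int,
    disk_total_bytes: int,
    capacity_threshold: float = 0.80,  # the 80% figure from the list above
) -> List[str]:
    """Return alert messages for replication lag and storage capacity."""
    alerts = []
    if replication_lag_seconds > rpo_seconds:
        alerts.append(
            f"replication lag {replication_lag_seconds:.0f}s exceeds RPO of {rpo_seconds:.0f}s"
        )
    usage = disk_used_bytes / disk_total_bytes
    if usage >= capacity_threshold:
        alerts.append(f"storage at {usage:.0%} of capacity")
    return alerts


for message in evaluate_alerts(120, 60, 850 * 1024 ** 3, 1000 * 1024 ** 3):
    print("ALERT:", message)
```

In practice it is common to alert somewhat below the RPO itself, so that there is time to react before the objective is actually breached.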
Cloud-Native HA Options
- AWS RDS Multi-AZ — Automatic failover across availability zones, typically completing in 60 to 120 seconds
- Azure SQL Managed Instance — Built-in HA with auto-failover groups
- Google Cloud SQL — Regional instances with automatic failover
Conclusion
High availability is not a single technology; it is a combination of architecture, automation, monitoring, and tested procedures. Start by defining your recovery time and recovery point objectives (RTO/RPO), then choose the right replication technology, automate failover, and test regularly.