Database downtime is expensive. A widely cited Gartner estimate puts the average cost of IT downtime at $5,600 per minute for enterprises, and the real figure varies considerably by industry and company size. For critical database systems, even a few minutes of unavailability can mean lost revenue, reputational damage, and regulatory penalties.

This article covers a framework for designing and implementing highly available (HA) database architectures.

Understanding Availability Metrics

Before designing an HA solution, understand what level of availability you need:

  • 99.9% (Three Nines) — Up to 8.76 hours downtime/year. Acceptable for internal tools.
  • 99.99% (Four Nines) — Up to 52.56 minutes downtime/year. Standard for business-critical apps.
  • 99.999% (Five Nines) — Up to 5.26 minutes downtime/year. Required for financial and healthcare systems.

💡 Key Concepts: RTO (Recovery Time Objective) is how quickly you must restore service after a failure. RPO (Recovery Point Objective) is how much recent data, measured in time, you can afford to lose. Together they drive your HA architecture decisions.
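
The downtime figures above follow directly from the availability percentage. As a minimal sketch, assuming a 365-day year (the function name `downtime_budget_minutes` is only illustrative), the arithmetic looks like this:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```

Running the same calculation against your own target makes it easier to judge whether a given RTO is realistic: a single 10-minute failover already exceeds the yearly budget for five nines.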

HA Architecture Patterns

1. Active-Passive (Warm Standby)

One primary database handles all traffic while a standby replica stays synchronized. On primary failure, the standby is promoted.

  • Oracle Data Guard — Physical or logical standby databases with automatic failover via Fast-Start Failover
  • MySQL Replication — Semi-synchronous replication with automated failover via MHA or Orchestrator
  • PostgreSQL Streaming Replication — Hot standby with Patroni or repmgr for failover management

2. Active-Active (Multi-Primary)

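Multiple database nodes accept read-write traffic simultaneously. This is more complex to operate than active-passive, because concurrent writes on different nodes can conflict, but it provides better scalability: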

  • Oracle RAC — Multiple instances sharing a single database on shared storage
  • MySQL Group Replication / InnoDB Cluster — Multi-primary mode with built-in conflict detection
  • PostgreSQL BDR — Bi-directional replication for multi-master setups
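
To illustrate what conflict detection means when two nodes write the same row, here is a toy first-committer-wins check based on row versions. It is a simplified model, not the algorithm used by any of the products listed above, and `VersionedStore` and `WriteConflict` are hypothetical names.

```python
class WriteConflict(Exception):
    """Raised when a write is based on a stale row version."""


class VersionedStore:
    """Toy key-value store that rejects writes based on stale versions."""

    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        # Keys that were never written start at version 0 with no value.
        return self._rows.get(key, (0, None))

    def write(self, key, value, based_on_version):
        current_version, _ = self.read(key)
        if based_on_version != current_version:
            raise WriteConflict(
                f"{key}: expected v{based_on_version}, found v{current_version}"
            )
        self._rows[key] = (current_version + 1, value)
        return current_version + 1


store = VersionedStore()
snapshot_version, _ = store.read("account:42")

# Node A commits first, so its write is accepted.
store.write("account:42", 100, based_on_version=snapshot_version)

# Node B worked from the same snapshot, so its write is rejected.
try:
    store.write("account:42", 250, based_on_version=snapshot_version)
except WriteConflict as exc:
    print("rejected:", exc)
```

The losing transaction has to be retried by the application, which is one reason multi-primary setups are harder to operate than a single writer.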

3. Distributed Database

Data is distributed across multiple nodes with built-in replication and failover:

  • CockroachDB — PostgreSQL-compatible, globally distributed
  • Amazon Aurora — MySQL/PostgreSQL-compatible with 6-way replication across 3 AZs
  • Google Cloud Spanner — Globally consistent distributed database
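
Why spread six copies over three zones? A quorum model makes the trade-off concrete. The sketch below assumes two copies per zone, a write quorum of 4, and a read quorum of 3, which is how Aurora's storage layer is commonly described; those numbers are assumptions for illustration and are not stated elsewhere in this article.

```python
COPIES_PER_ZONE = 2  # assumption: 6 copies spread evenly over 3 zones
ZONES = 3
WRITE_QUORUM = 4     # assumption: copies that must acknowledge a write
READ_QUORUM = 3      # assumption: copies needed to serve a consistent read


def surviving_copies(failed_zones: int, extra_failed_copies: int = 0) -> int:
    """Count the copies left after whole-zone and individual copy failures."""
    total = COPIES_PER_ZONE * ZONES
    lost = failed_zones * COPIES_PER_ZONE + extra_failed_copies
    return max(total - lost, 0)


def can_write(copies: int) -> bool:
    return copies >= WRITE_QUORUM


def can_read(copies: int) -> bool:
    return copies >= READ_QUORUM


for zones_down, extra in [(0, 0), (1, 0), (1, 1)]:
    copies = surviving_copies(zones_down, extra)
    print(
        f"zones down={zones_down}, extra copies lost={extra}: "
        f"{copies} copies, write={can_write(copies)}, read={can_read(copies)}"
    )
```

Under these assumptions, losing an entire zone still leaves enough copies to accept writes, and losing a zone plus one more copy still allows reads.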

Implementing Automatic Failover

Manual failover is error-prone under pressure. Automate your failover process:

Oracle Data Guard Broker

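-- Configure failover conditions
-- Seconds the primary must be unreachable before failover is initiated
DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 30;
-- Maximum standby lag, in seconds, that still permits failover
DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverLagLimit = 10;

-- Enable Fast-Start Failover (an observer must be running for automatic failover to occur)
DGMGRL> ENABLE FAST_START FAILOVER;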
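
The two properties above interact: one decides when the primary counts as lost, the other decides whether the standby is fresh enough to take over. The following is a simplified model of that decision, not the broker's actual logic; `ClusterState` and `should_fail_over` are hypothetical names, and the real broker also depends on the observer and the configured protection mode.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    seconds_primary_unreachable: float
    standby_lag_seconds: float


def should_fail_over(
    state: ClusterState,
    threshold_seconds: float = 30,  # mirrors FastStartFailoverThreshold
    lag_limit_seconds: float = 10,  # mirrors FastStartFailoverLagLimit
) -> bool:
    """Fail over only if the primary has been gone long enough
    and the standby is within the tolerated lag."""
    primary_considered_down = state.seconds_primary_unreachable >= threshold_seconds
    standby_fresh_enough = state.standby_lag_seconds <= lag_limit_seconds
    return primary_considered_down and standby_fresh_enough


print(should_fail_over(ClusterState(45, 3)))   # outage long enough, standby fresh
print(should_fail_over(ClusterState(45, 60)))  # standby too far behind
print(should_fail_over(ClusterState(5, 0)))    # brief blip, below the threshold
```

A lower threshold shortens recovery time but raises the risk of failing over on a transient network blip, so the value should be chosen with your RTO in mind.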

PostgreSQL with Patroni

# patroni.yml configuration
bootstrap:
  dcs:
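    ttl: 30                           # leader lease lifetime, in seconds
    loop_wait: 10                     # seconds between runs of the HA loop
    retry_timeout: 10                 # seconds to keep retrying DCS and PostgreSQL operations
    maximum_lag_on_failover: 1048576  # bytes (1 MiB); replicas lagging further are not promoted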
    postgresql:
      parameters:
        max_connections: 200
        shared_buffers: 8GB
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 5
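
To show how these settings shape a failover, here is a simplified sketch of candidate selection and a sanity check on the timing values. It is an illustration, not Patroni's implementation; `Replica`, `eligible_candidates`, and `timings_are_consistent` are hypothetical names, and the timing rule reflects my reading of the Patroni documentation's guidance that `loop_wait + 2 * retry_timeout` should not exceed `ttl`.

```python
from dataclasses import dataclass

MAXIMUM_LAG_ON_FAILOVER = 1_048_576  # bytes, taken from the configuration above


@dataclass
class Replica:
    name: str
    lag_bytes: int
    healthy: bool = True


def eligible_candidates(replicas, max_lag=MAXIMUM_LAG_ON_FAILOVER):
    """Return healthy replicas within the lag limit, least-lagged first."""
    within_limit = [r for r in replicas if r.healthy and r.lag_bytes <= max_lag]
    return sorted(within_limit, key=lambda r: r.lag_bytes)


def timings_are_consistent(ttl, loop_wait, retry_timeout):
    """Check the assumed rule: loop_wait + 2 * retry_timeout <= ttl."""
    return loop_wait + 2 * retry_timeout <= ttl


replicas = [
    Replica("pg-node-2", lag_bytes=4_096),
    Replica("pg-node-3", lag_bytes=5_000_000),          # too far behind
    Replica("pg-node-4", lag_bytes=0, healthy=False),   # unreachable
]
print([r.name for r in eligible_candidates(replicas)])
print(timings_are_consistent(ttl=30, loop_wait=10, retry_timeout=10))
```

With the values from the configuration above, the timing check passes exactly at the boundary (10 + 2 × 10 = 30), so lowering `ttl` alone would call for lowering the other two values as well.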

Data Protection & Backup Strategy

HA protects against server failure, but you also need backups to protect against:

  • Human error — Accidental DELETE/DROP statements
  • Data corruption — Software bugs, storage failures
  • Ransomware — Encrypted/destroyed data
  • Site-level disaster — Fire, flood, power outage

3-2-1 Backup Rule

  • 3 copies of your data
  • 2 different storage media types
  • 1 copy offsite (or in a different cloud region)
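
The rule is easy to check mechanically against a backup inventory. The sketch below assumes the inventory includes the production copy and uses a hypothetical `BackupCopy` record; the location and media labels are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str  # hypothetical label, e.g. a data center or cloud region
    media: str     # hypothetical label, e.g. "disk", "object-storage", "tape"
    offsite: bool


def satisfies_3_2_1(copies) -> bool:
    """Check for 3 copies, 2 media types, and at least 1 offsite copy."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


inventory = [
    BackupCopy("primary-dc", "disk", offsite=False),            # production data
    BackupCopy("primary-dc", "disk", offsite=False),            # local backup
    BackupCopy("other-region", "object-storage", offsite=True), # remote backup
]
print(satisfies_3_2_1(inventory))
print(satisfies_3_2_1(inventory[:2]))  # no offsite copy, one media type
```

A check like this can run as part of a scheduled job so that a silently failing offsite copy is noticed before it is needed.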

Monitoring & Alerting

You can't fix what you can't see. Implement comprehensive monitoring:

  • Replication lag monitoring — Alert when lag exceeds your RPO threshold
  • Connection pool health — Track active connections, wait times, and rejected connections
  • Storage capacity — Alert at 80% capacity, auto-expand if possible
  • Failover testing — Regularly test automatic failover (chaos engineering style)

💡 Best Practice: Schedule quarterly DR drills where you intentionally fail over to your standby and run production from it for several hours. This validates your entire HA chain and builds team confidence.
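
Two of the checks above reduce to simple threshold comparisons. As a minimal sketch, assuming the metrics are already collected elsewhere (the function `check_alerts` and its parameter names are hypothetical), the alert logic might look like this:

```python
def check_alerts(
    replication_lag_seconds: float,
    rpo_seconds: float,
    storage_used_bytes: int,
    storage_total_bytes: int,
    storage_threshold: float = 0.80,  # alert at 80% capacity
) -> list:
    """Return a list of alert messages for the given metric values."""
    alerts = []
    if replication_lag_seconds > rpo_seconds:
        alerts.append(
            f"replication lag {replication_lag_seconds:.0f}s exceeds RPO {rpo_seconds:.0f}s"
        )
    if storage_total_bytes > 0:
        used_fraction = storage_used_bytes / storage_total_bytes
        if used_fraction >= storage_threshold:
            alerts.append(f"storage at {used_fraction:.0%} of capacity")
    return alerts


print(check_alerts(45, 30, 850, 1000))  # both conditions breached
print(check_alerts(5, 30, 100, 1000))   # healthy
```

Tying the lag threshold to the RPO rather than to an arbitrary number means an alert always corresponds to a real risk of exceeding your tolerated data loss.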

Cloud-Native HA Options

  • AWS RDS Multi-AZ — Automatic failover across availability zones, typically completing in 60 to 120 seconds
  • Azure SQL Managed Instance — Built-in HA with auto-failover groups
  • Google Cloud SQL — Regional instances with automatic failover
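
Even with managed failover, applications see dropped connections for the duration of the switch, so clients need to reconnect on their own. Here is a minimal sketch of retry with exponential backoff; `connect_with_retry` is a hypothetical helper, the `connect` argument stands in for whatever your database driver provides, and the timing defaults are assumptions to tune against your provider's failover time.

```python
import time


def connect_with_retry(connect, max_wait_seconds=120.0, initial_delay=1.0,
                       max_delay=15.0, sleep=time.sleep):
    """Call connect() until it succeeds or the wait budget is used up."""
    delay = initial_delay
    waited = 0.0
    while True:
        try:
            return connect()
        except ConnectionError:
            if waited >= max_wait_seconds:
                raise  # failover window exceeded; surface the error
            sleep(delay)
            waited += delay
            delay = min(delay * 2, max_delay)  # exponential backoff with a cap


# Demo with a fake connection that fails three times before succeeding.
attempts = {"count": 0}


def flaky_connect():
    attempts["count"] += 1
    if attempts["count"] <= 3:
        raise ConnectionError("primary unavailable")
    return "connected"


delays = []  # record the delays instead of actually sleeping
result = connect_with_retry(flaky_connect, sleep=delays.append)
print(result, delays)
```

Capping the delay keeps the client from waiting much longer than necessary once the new primary is available.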

Conclusion

High availability is not a single technology — it's a combination of architecture, automation, monitoring, and tested procedures. Start by defining your RTO/RPO requirements, choose the right replication technology, automate failover, and test regularly.

Need help designing your HA architecture? Talk to our DBA experts for a customized recommendation.
