Everything we’ve built across the first nine parts works beautifully in a single region. A single Availability Zone failure won’t take us down, because Aurora Serverless v2 replicates its storage across AZs automatically. But if us-east-1 has a bad day (and it does: three major outages in the last five years), everything goes down with it. Our frontend, API, database, and cache all sit in one region.
For a children’s educational platform used by schools across North America, a 4-hour outage during school hours means thousands of lessons missed, hundreds of frustrated teachers, and a conversation with parents that starts with “I thought you said this was reliable.”
This final part makes Kids Learn production-ready with multi-region deployment, automated failover, load testing, and a launch checklist that covers every production concern.
This is Part 10, the conclusion of the AWS Full-Stack Mastery series.
Multi-Region Architecture
Active-Passive vs. Active-Active
Two approaches for multi-region:
Active-Active: Both regions serve traffic simultaneously. Lower latency for geographically distributed users. Higher complexity and cost. DynamoDB Global Tables and Aurora Global Database handle replication. Conflict resolution is your problem.
Active-Passive: Primary region handles all traffic. Secondary region stays warm, receiving replicated data. Route 53 health checks detect a primary failure and automatically shift DNS to the secondary. Simpler, cheaper, and sufficient for most applications.
Kids Learn uses Active-Passive. Our users are primarily in North America (US + Canada). The latency difference between us-east-1 and us-west-2 is ~60ms, which is negligible for our use case. Active-Passive gives us disaster recovery without the complexity of conflict resolution.
Normal Operation:
                      ┌─────────────────────────────┐
Users ──→ Route53 ──→ │ us-east-1 (PRIMARY)         │
                      │ ├── Amplify                 │
                      │ ├── API Gateway + Lambda    │
                      │ ├── Aurora (Writer)         │
                      │ └── ElastiCache             │
                      └─────────────────────────────┘
                      ┌─────────────────────────────┐
                      │ us-west-2 (STANDBY)         │
                      │ ├── Amplify (deployed)      │
                      │ ├── API Gateway + Lambda    │
                      │ ├── Aurora (Reader replica) │
                      │ └── ElastiCache (cold)      │
                      └─────────────────────────────┘

Failover:
                      ┌─────────────────────────────┐
                      │ us-east-1 (DOWN)            │
                      │ ✗ Health check failing      │
                      └─────────────────────────────┘
                      ┌─────────────────────────────┐
Users ──→ Route53 ──→ │ us-west-2 (PROMOTED)        │
                      │ ├── Amplify (active)        │
                      │ ├── API Gateway + Lambda    │
                      │ ├── Aurora (promoted writer)│
                      │ └── ElastiCache (warming)   │
                      └─────────────────────────────┘
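The two diagrams boil down to one routing rule: answer with the primary until it has failed enough consecutive probes, then answer with the secondary. The sketch below is a deliberately simplified model of that rule, not how Route 53 is implemented (the real service aggregates results from many checkers). The `FailoverRouter` class, its method names, and the threshold of 3 are illustrative assumptions.

```typescript
// Simplified model of active-passive DNS failover (illustrative only).
type Region = "us-east-1" | "us-west-2";

class FailoverRouter {
  private consecutiveFailures = 0;
  private readonly primary: Region;
  private readonly secondary: Region;
  private readonly failureThreshold: number; // assumed value, e.g. 3

  constructor(primary: Region, secondary: Region, failureThreshold: number) {
    this.primary = primary;
    this.secondary = secondary;
    this.failureThreshold = failureThreshold;
  }

  // Record one health probe result for the primary region.
  // A healthy probe resets the failure streak.
  recordProbe(healthy: boolean): void {
    this.consecutiveFailures = healthy ? 0 : this.consecutiveFailures + 1;
  }

  // The region a DNS answer would point to right now.
  resolve(): Region {
    return this.consecutiveFailures >= this.failureThreshold
      ? this.secondary
      : this.primary;
  }
}

const router = new FailoverRouter("us-east-1", "us-west-2", 3);
router.recordProbe(false);
router.recordProbe(false);
console.log(router.resolve()); // two failures so far: still the primary
router.recordProbe(false);
console.log(router.resolve()); // third consecutive failure: the secondary
```

Note that the model only covers DNS. Promoting the Aurora reader and warming the cache are separate steps that happen alongside the DNS shift.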
Aurora Global Database CDK
// lib/stacks/multi-region-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as rds from 'aws-cdk-lib/aws-rds';
import * as route53 from 'aws-cdk-lib/aws-route53';
import { Construct } from 'constructs';

export class MultiRegionStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: cdk.StackProps) {
    // HostedZone.fromLookup below needs props.env (account and region) to be set.
    super(scope, id, props);

    // Aurora Global Database setup
    // Wraps the existing primary cluster (defined in the primary region stack)
    new rds.CfnGlobalCluster(this, 'GlobalDB', {
      globalClusterIdentifier: 'kidslearn-global',
      sourceDbClusterIdentifier: 'kidslearn-production', // Existing primary
      deletionProtection: true,
    });

    // Route 53 Health Check for the primary API
    // Caution: this probes the same name the failover records serve, so after a
    // failover it would be checking us-west-2. Pointing it at a hostname that
    // only ever resolves to the primary region avoids that.
    const primaryHealthCheck = new route53.CfnHealthCheck(this, 'PrimaryHealth', {
      healthCheckConfig: {
        type: 'HTTPS',
        fullyQualifiedDomainName: 'api.kidslearn.app',
        resourcePath: '/health',
        port: 443,
        requestInterval: 10, // seconds; Route 53 accepts 10 or 30
        failureThreshold: 3,
        enableSni: true,
        regions: ['us-east-1', 'us-west-2', 'eu-west-1'], // at least three checker regions
      },
    });

    // Route 53 Failover records
    const hostedZone = route53.HostedZone.fromLookup(this, 'Zone', {
      domainName: 'kidslearn.app',
    });

    // Primary record
    new route53.CfnRecordSet(this, 'PrimaryDNS', {
      hostedZoneId: hostedZone.hostedZoneId,
      name: 'api.kidslearn.app',
      type: 'A',
      aliasTarget: {
        // Placeholder: use the regional domain name of the API's custom domain
        dnsName: 'primary-api-gateway.execute-api.us-east-1.amazonaws.com',
        hostedZoneId: 'Z1UJRXOUMOOFQ8', // API Gateway regional hosted zone, us-east-1
        evaluateTargetHealth: true,
      },
      failover: 'PRIMARY',
      setIdentifier: 'primary',
      healthCheckId: primaryHealthCheck.attrHealthCheckId,
    });

    // Secondary record (no health check: it is the last resort)
    new route53.CfnRecordSet(this, 'SecondaryDNS', {
      hostedZoneId: hostedZone.hostedZoneId,
      name: 'api.kidslearn.app',
      type: 'A',
      aliasTarget: {
        // Placeholder: use the regional domain name of the API's custom domain
        dnsName: 'secondary-api-gateway.execute-api.us-west-2.amazonaws.com',
        // API Gateway regional hosted zone, us-west-2
        // (Z2FDTNDATAQYW2 is the CloudFront zone ID and does not match a regional endpoint)
        hostedZoneId: 'Z2OJLYMUO9EFXC',
        evaluateTargetHealth: true,
      },
      failover: 'SECONDARY',
      setIdentifier: 'secondary',
    });
  }
}
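The health check above probes `/health`, so what that endpoint reports decides when failover happens. One possible shape for it, sketched under assumptions: the handler runs a set of dependency checks with a timeout and returns 200 only if all pass, since Route 53 treats 2xx and 3xx responses as healthy. The `healthHandler` function, the `Check` type, and the 1500 ms timeout are hypothetical; the series does not show the real handler.

```typescript
// Hypothetical /health handler logic (names and timeout are assumptions).
type Check = () => Promise<void>; // resolves when the dependency is reachable

interface HealthResponse {
  statusCode: number;
  body: string;
}

// Rejects if the check has not settled within `ms` milliseconds.
function withTimeout(check: Check, ms: number): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout")), ms);
    check().then(
      () => { clearTimeout(timer); resolve(); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Runs all checks in parallel and reports which ones failed.
async function healthHandler(
  checks: Record<string, Check>,
  timeoutMs: number = 1500,
): Promise<HealthResponse> {
  const names = Object.keys(checks);
  const passed = await Promise.all(
    names.map((name) =>
      withTimeout(checks[name], timeoutMs).then(() => true, () => false),
    ),
  );
  const failed = names.filter((_, i) => !passed[i]);
  const healthy = failed.length === 0;
  return {
    statusCode: healthy ? 200 : 503,
    body: JSON.stringify({ status: healthy ? "ok" : "degraded", failed }),
  };
}

// Example with stand-in checks: the database is fine, the cache is down.
healthHandler({
  database: async () => {},
  cache: async () => { throw new Error("connection refused"); },
}).then((res) => console.log(res.statusCode, res.body));
```

Which dependencies belong in the check is a design choice. Including the cache, as in the example, means a cache outage alone would trigger a regional failover, which may be more aggressive than you want.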
Load Testing with Artillery
Before going live, we need to know our breaking point. We use Artillery to simulate realistic traffic patterns:
# load-tests/school-day-simulation.yml
config:
  target: "https://api.staging.kidslearn.app"
  phases:
    # Warm-up
    - duration: 60
      arrivalRate: 5
      name: "Warm up"
    # Morning ramp-up (students arriving)
    - duration: 120
      arrivalRate: 5
      rampTo: 50
      name: "Morning ramp"
    # Peak lesson time
    - duration: 300
      arrivalRate: 50
      name: "Peak load"
    # Spike (all classes starting at once)
    - duration: 60
      arrivalRate: 100
      name: "Traffic spike"
    # Return to normal
    - duration: 120
      arrivalRate: 50
      rampTo: 10
      name: "Cool down"
  plugins:
    expect: {}
  defaults:
    headers:
      Content-Type: "application/json"
scenarios:
  - name: "Browse lessons"
    weight: 40
    flow:
      - get:
          url: "/api/lessons?subject=math&grade=2"
          expect:
            - statusCode: 200
            - contentType: json
            - hasProperty: "data"
  - name: "Take a lesson"
    weight: 35
    flow:
      - get:
          url: "/api/lessons"
          capture:
            - json: "$.data[0].id"
              as: "lessonId"
      - get:
          url: "/api/lessons/{{ lessonId }}"
          expect:
            - statusCode: 200
      - post:
          url: "/api/progress"
          json:
            childId: "test-child-123"
            lessonId: "{{ lessonId }}"
            eventType: "lesson_start"
          expect:
            - statusCode: 200
  - name: "Check progress"
    weight: 25
    flow:
      - get:
          url: "/api/progress/test-child-123"
          expect:
            - statusCode: 200
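Artillery's `arrivalRate` counts new virtual users per second, not HTTP requests, so it helps to translate the phases above into an approximate request rate. The sketch below does that arithmetic from the phase durations, arrival rates, and scenario weights in the config. It assumes scenarios are picked in proportion to their weights and that a virtual user's requests land close together in time, so treat the result as an estimate.

```typescript
// Rough request-rate estimate for the Artillery phases and scenarios above.
interface Phase { name: string; duration: number; arrivalRate: number; rampTo?: number }
interface Scenario { name: string; weight: number; requests: number }

const phases: Phase[] = [
  { name: "Warm up", duration: 60, arrivalRate: 5 },
  { name: "Morning ramp", duration: 120, arrivalRate: 5, rampTo: 50 },
  { name: "Peak load", duration: 300, arrivalRate: 50 },
  { name: "Traffic spike", duration: 60, arrivalRate: 100 },
  { name: "Cool down", duration: 120, arrivalRate: 50, rampTo: 10 },
];

// `requests` is the number of HTTP calls in each scenario's flow.
const scenarios: Scenario[] = [
  { name: "Browse lessons", weight: 40, requests: 1 },
  { name: "Take a lesson", weight: 35, requests: 3 },
  { name: "Check progress", weight: 25, requests: 1 },
];

// Weighted average number of HTTP requests per virtual user.
function requestsPerUser(list: Scenario[]): number {
  const totalWeight = list.reduce((sum, s) => sum + s.weight, 0);
  return list.reduce((sum, s) => sum + (s.weight / totalWeight) * s.requests, 0);
}

// Virtual users started in a phase; a ramp is approximated by its average rate.
function usersInPhase(p: Phase): number {
  const endRate = p.rampTo ?? p.arrivalRate;
  return ((p.arrivalRate + endRate) / 2) * p.duration;
}

const perUser = requestsPerUser(scenarios);
const totalSeconds = phases.reduce((sum, p) => sum + p.duration, 0);
const totalUsers = phases.reduce((sum, p) => sum + usersInPhase(p), 0);
const peakArrivalRate = Math.max(
  ...phases.map((p) => Math.max(p.arrivalRate, p.rampTo ?? 0)),
);

console.log(`requests per virtual user: ${perUser.toFixed(2)}`);
console.log(`test length: ${totalSeconds}s, virtual users: ${totalUsers}`);
console.log(`approx. peak request rate: ${(peakArrivalRate * perUser).toFixed(0)} req/s`);
```

Under these assumptions the average virtual user makes 1.70 requests, so the 100 arrivals/sec spike is roughly 170 requests/sec, and the whole script runs for 660 seconds (11 minutes). A longer soak test needs longer phases.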
Running Load Tests
# Install Artillery
npm install -g artillery@latest
# Run the load test
artillery run load-tests/school-day-simulation.yml --output results.json
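# Generate HTML report (the report command is deprecated in recent Artillery releases; pin an older version if it is unavailable)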
artillery report results.json --output results.html
Key Metrics to Watch During Load Tests
| Metric | Target | Warning | Critical |
|---|---|---|---|
| p50 latency | < 200ms | < 500ms | > 1000ms |
| p99 latency | < 1000ms | < 3000ms | > 5000ms |
| Error rate | < 0.1% | < 1% | > 5% |
| Lambda concurrent executions | < 500 | < 800 | > 900 |
| Aurora ACU usage | < 50% | < 75% | > 90% |
| ElastiCache CPU | < 40% | < 60% | > 80% |
| API Gateway throttles | 0 | < 10/min | > 50/min |
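To make the table actionable during a test run, the thresholds can be encoded and applied to measured values. The sketch below covers the latency and error-rate rows only. The table leaves a gap between the Warning and Critical columns (for example, a p50 between 500 ms and 1000 ms), which this sketch labels "elevated"; that label, the `classify` function, and the sample measurements are assumptions, not part of the original tooling.

```typescript
// Classify a measured value against the Target / Warning / Critical columns.
type Level = "target" | "warning" | "elevated" | "critical";

interface Threshold { target: number; warning: number; critical: number }

// Values copied from the table above (latency in ms, error rate in percent).
const thresholds = {
  p50LatencyMs: { target: 200, warning: 500, critical: 1000 },
  p99LatencyMs: { target: 1000, warning: 3000, critical: 5000 },
  errorRatePct: { target: 0.1, warning: 1, critical: 5 },
};

function classify(value: number, t: Threshold): Level {
  if (value < t.target) return "target";
  if (value < t.warning) return "warning";
  if (value > t.critical) return "critical";
  return "elevated"; // between the Warning and Critical columns (assumed label)
}

// Made-up sample measurements, for illustration only.
const measured = { p50LatencyMs: 180, p99LatencyMs: 3500, errorRatePct: 0.4 };

console.log("p50:", classify(measured.p50LatencyMs, thresholds.p50LatencyMs));
console.log("p99:", classify(measured.p99LatencyMs, thresholds.p99LatencyMs));
console.log("errors:", classify(measured.errorRatePct, thresholds.errorRatePct));
```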
Production Launch Checklist
Infrastructure
- Multi-AZ enabled for Aurora, ElastiCache, and Fargate
- Aurora Global Database configured with secondary region
- Route 53 health checks configured with failover records
- CloudFront distribution uses all edge locations (PriceClass_All)
- WAF rules deployed and blocking common attacks
- VPC flow logs enabled
- S3 bucket versioning enabled, public access blocked
- KMS key rotation enabled
Security
- COPPA consent flow tested end-to-end
- All IAM roles use least-privilege policies
- No hardcoded secrets — Secrets Manager for everything
- GuardDuty enabled
- Security Hub enabled with CIS Benchmarks
- Cognito advanced security features enabled
- TLS 1.2+ enforced everywhere
- Penetration test completed
Reliability
- Load test passed: 100 req/sec sustained for 30 minutes
- Failover test: primary region killed, DNS shifts within 60 seconds
- Circuit breaker tested: Fargate deployment rolls back on error spike
- Database failover tested: Aurora promotes reader in < 60 seconds
- Backup restoration tested: full database restore from snapshot
- Runbooks written for top 5 incident scenarios
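The 60-second DNS failover target in the list above can be sanity-checked with simple arithmetic before running a drill. The sketch below adds the health check detection time (request interval times failure threshold, 10 s and 3 in the CDK stack) to the time resolvers may keep serving a cached answer. The 60-second TTL is an assumed value, and the function name is hypothetical.

```typescript
// Back-of-the-envelope DNS failover estimate. All inputs are assumptions to
// replace with your own values; dnsTtlSeconds in particular is a guess.
interface FailoverInputs {
  requestIntervalSeconds: number; // health check interval
  failureThreshold: number;       // consecutive failures before "unhealthy"
  dnsTtlSeconds: number;          // how long resolvers may cache the old answer
}

function worstCaseDnsFailoverSeconds(i: FailoverInputs): number {
  const detectionSeconds = i.requestIntervalSeconds * i.failureThreshold;
  return detectionSeconds + i.dnsTtlSeconds;
}

const estimate = worstCaseDnsFailoverSeconds({
  requestIntervalSeconds: 10,
  failureThreshold: 3,
  dnsTtlSeconds: 60,
});

console.log(`worst-case DNS failover: about ${estimate}s`);
```

With these inputs the worst case comes out at 90 seconds, and it excludes the time Aurora needs to promote the secondary cluster. That gap is a good reason to measure the real number in a failover drill rather than rely on the estimate.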
Observability
- CloudWatch dashboard configured with all critical metrics
- Alarms configured for error rate, latency, and resource utilization
- X-Ray tracing enabled across all Lambda and Fargate services
- Structured logging deployed with correlation IDs
- Cost anomaly detection enabled
- Budget alerts set at 80% and 100% of monthly estimate
Performance
- Lambda cold starts eliminated with provisioned concurrency (critical paths)
- CloudFront caching verified — cache hit ratio > 90% for static assets
- Database query performance validated — no queries > 500ms
- Redis cache hit rate > 80% for lesson content
- Image optimization pipeline verified — WebP/AVIF serving correctly
Series Conclusion
Over ten parts, we’ve built a complete, production-ready AWS platform:
| Part | Component | AWS Services |
|---|---|---|
| 1 | Strategy | Organizations, Control Tower, Well-Architected |
| 2 | Infrastructure | CDK v2, CloudFormation, VPC |
| 3 | Frontend | Amplify Gen 2, CloudFront, S3 |
| 4 | Backend | API Gateway HTTP, Lambda, ECS Fargate |
| 5 | Database | Aurora Serverless v2, DynamoDB, ElastiCache |
| 6 | AI/ML | Bedrock, SageMaker, Personalize |
| 7 | DevOps | CodePipeline, CodeBuild, ECR |
| 8 | Security | IAM, Cognito, WAF, KMS, GuardDuty |
| 9 | Observability | CloudWatch, X-Ray, Cost Explorer |
| 10 | Production | Route 53, Global Database, Auto Scaling |
The architecture has been in production for Kids Learn for six months. The platform serves 5,000 monthly active users, generates 2 million learning events per month, and has maintained 99.95% uptime since launch. The monthly AWS bill averages $380 — less than hiring one additional part-time developer.
Thank you for reading this series. Build something that matters.
This is Part 10 of a 10-part series: AWS Full-Stack Mastery for Technical Leads.
Series outline:
- Why AWS & Getting Started (Part 1)
- Infrastructure as Code (CDK) (Part 2)
- Frontend (Amplify + CloudFront) (Part 3)
- Backend (API Gateway + Lambda + Fargate) (Part 4)
- Database (Aurora + DynamoDB + ElastiCache) (Part 5)
- AI/ML (Bedrock + SageMaker) (Part 6)
- DevOps (CodePipeline + CodeBuild) (Part 7)
- Security (IAM + Cognito + WAF) (Part 8)
- Observability (CloudWatch + X-Ray) (Part 9)
- Production (Multi-Region + DR) (this post)