Everything we’ve built across the first nine parts works beautifully — in a single region. One availability zone failure won’t take us down because Aurora Serverless v2 replicates across AZs automatically. But if us-east-1 has a bad day (and it does — three major outages in the last five years), everything is gone. Our frontend, API, database, cache — all sitting in one region.

For a children’s educational platform used by schools across North America, a 4-hour outage during school hours means thousands of lessons missed, hundreds of frustrated teachers, and a conversation with parents that starts with “I thought you said this was reliable.”

This final part makes Kids Learn production-ready with multi-region deployment, automated failover, load testing, and a launch checklist that covers every production concern.

This is Part 10, the conclusion of the AWS Full-Stack Mastery series.

Production architecture — Multi-region deployment with Route 53 failover and Aurora Global Database

Multi-Region Architecture

Active-Passive vs. Active-Active

Two approaches for multi-region:

Active-Active: Both regions serve traffic simultaneously. Lower latency for geographically distributed users. Higher complexity and cost. DynamoDB Global Tables and Aurora Global Database handle replication. Conflict resolution is your problem.

Active-Passive: Primary region handles all traffic. Secondary region stays warm, receiving replicated data. Route 53 health checks detect primary failure and automatically shift DNS to the secondary. Simpler, cheaper, and sufficient for most applications.

Kids Learn uses Active-Passive. Our users are primarily in North America (US + Canada). The latency difference between us-east-1 and us-west-2 is ~60ms, which is negligible for our use case. Active-Passive gives us disaster recovery without the complexity of conflict resolution.

Normal Operation:
                     ┌─────────────────────────────┐
  Users ──→ Route53 ──→  us-east-1 (PRIMARY)       │
                     │  ├── Amplify                 │
                     │  ├── API Gateway + Lambda    │
                     │  ├── Aurora (Writer)         │
                     │  └── ElastiCache             │
                     └─────────────────────────────┘
                     
                     ┌─────────────────────────────┐
                     │  us-west-2 (STANDBY)        │
                     │  ├── Amplify (deployed)      │
                     │  ├── API Gateway + Lambda    │
                     │  ├── Aurora (Reader replica) │
                     │  └── ElastiCache (cold)      │
                     └─────────────────────────────┘

Failover:
                     ┌─────────────────────────────┐
                     │  us-east-1 (DOWN)           │
                     │  ✗ Health check failing      │
                     └─────────────────────────────┘
                     
                     ┌─────────────────────────────┐
  Users ──→ Route53 ──→  us-west-2 (PROMOTED)      │
                     │  ├── Amplify (active)        │
                     │  ├── API Gateway + Lambda    │
                     │  ├── Aurora (promoted writer)│
                     │  └── ElastiCache (warming)   │
                     └─────────────────────────────┘
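
How long users keep hitting the failed region between the two diagrams depends mostly on the health check settings and on DNS caching. The sketch below is a rough estimate only. It uses the 10-second interval and failure threshold of 3 from this post's health check, and it assumes a 60-second DNS TTL (a value not stated in this post). The function and field names are hypothetical.

```typescript
// Rough worst-case estimate of how long clients may keep resolving the failed
// primary. The TTL is an assumption; adjust every number for your own setup.
interface FailoverTiming {
  requestIntervalSeconds: number; // seconds between health check probes
  failureThreshold: number;       // consecutive failures before "unhealthy"
  dnsTtlSeconds: number;          // how long resolvers may cache the old answer
}

function estimateDnsFailoverSeconds(t: FailoverTiming): number {
  // Time for Route 53 to see enough consecutive failures
  const detectionSeconds = t.requestIntervalSeconds * t.failureThreshold;
  // Plus the time cached answers may still point at the primary
  return detectionSeconds + t.dnsTtlSeconds;
}

const estimate = estimateDnsFailoverSeconds({
  requestIntervalSeconds: 10,
  failureThreshold: 3,
  dnsTtlSeconds: 60, // assumed TTL, not taken from this post
});

console.log(`Estimated DNS failover window: ~${estimate}s`);
```

This covers DNS only. Route 53 does not promote the Aurora secondary, so database promotion is a separate step with its own duration and should be measured during the failover test.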

Aurora Global Database CDK

// lib/stacks/multi-region-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as rds from 'aws-cdk-lib/aws-rds';
import * as route53 from 'aws-cdk-lib/aws-route53';
import { Construct } from 'constructs';

export class MultiRegionStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: cdk.StackProps) {
    super(scope, id, props);

    // Aurora Global Database setup
    // Primary cluster (in primary region stack)
    const globalCluster = new rds.CfnGlobalCluster(this, 'GlobalDB', {
      globalClusterIdentifier: 'kidslearn-global',
      sourceDbClusterIdentifier: 'kidslearn-production', // Existing primary
      deletionProtection: true,
    });

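    // Route 53 Health Check for the primary API.
    // Caution: api.kidslearn.app is also the failover record name, so after a
    // failover this check would start probing the standby and report healthy.
    // In practice, point it at a hostname that always resolves to the primary region.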
    const primaryHealthCheck = new route53.CfnHealthCheck(this, 'PrimaryHealth', {
      healthCheckConfig: {
        type: 'HTTPS',
        fullyQualifiedDomainName: 'api.kidslearn.app',
        resourcePath: '/health',
        port: 443,
        requestInterval: 10,
        failureThreshold: 3,
        enableSni: true,
        regions: ['us-east-1', 'us-west-2', 'eu-west-1'],
      },
    });

    // Route 53 Failover records
    const hostedZone = route53.HostedZone.fromLookup(this, 'Zone', {
      domainName: 'kidslearn.app',
    });

    // Primary record
    new route53.CfnRecordSet(this, 'PrimaryDNS', {
      hostedZoneId: hostedZone.hostedZoneId,
      name: 'api.kidslearn.app',
      type: 'A',
      aliasTarget: {
        // Placeholder: use the regional domain name of the API Gateway custom domain
        dnsName: 'primary-api-gateway.execute-api.us-east-1.amazonaws.com',
        hostedZoneId: 'Z1UJRXOUMOOFQ8', // API Gateway regional hosted zone (us-east-1)
        evaluateTargetHealth: true,
      },
      failover: 'PRIMARY',
      setIdentifier: 'primary',
      healthCheckId: primaryHealthCheck.attrHealthCheckId,
    });

    // Secondary record
    new route53.CfnRecordSet(this, 'SecondaryDNS', {
      hostedZoneId: hostedZone.hostedZoneId,
      name: 'api.kidslearn.app',
      type: 'A',
      aliasTarget: {
        dnsName: 'secondary-api-gateway.execute-api.us-west-2.amazonaws.com',
        hostedZoneId: 'Z2OJLYMUO9EFXC', // API Gateway regional hosted zone (us-west-2)
        evaluateTargetHealth: true,
      },
      failover: 'SECONDARY',
      setIdentifier: 'secondary',
    });
  }
}
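
The health check above probes /health, so that endpoint decides when failover happens. Below is a minimal sketch of what such a handler could look like, not the actual Kids Learn implementation. The Check type, the healthHandler function, and the dependency names are hypothetical. The probes run in parallel with a short timeout because Route 53 expects an HTTP 2xx or 3xx response within two seconds of connecting.

```typescript
// Minimal /health handler sketch. The dependency checks are hypothetical
// stand-ins; wire in real probes (for example a `SELECT 1` against Aurora).
type Check = () => Promise<void>; // resolves if healthy, rejects otherwise

interface HealthResponse {
  statusCode: number;
  body: string;
}

// Reject if the check does not settle within `ms` milliseconds.
async function withTimeout(check: Check, ms: number): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  try {
    await Promise.race([check(), timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

async function healthHandler(
  checks: Record<string, Check>,
  timeoutMs = 1500, // assumed budget, kept under Route 53's two-second limit
): Promise<HealthResponse> {
  // Run all probes in parallel and record a per-dependency outcome
  const outcomes = await Promise.all(
    Object.keys(checks).map(async (name) => {
      try {
        await withTimeout(checks[name], timeoutMs);
        return { name, ok: true, detail: 'ok' };
      } catch (err) {
        const detail = err instanceof Error ? err.message : 'failed';
        return { name, ok: false, detail };
      }
    }),
  );

  const healthy = outcomes.every((o) => o.ok);
  const detail: Record<string, string> = {};
  for (const o of outcomes) detail[o.name] = o.detail;

  return {
    // Any non-2xx/3xx status counts as a failed probe for Route 53
    statusCode: healthy ? 200 : 503,
    body: JSON.stringify({ status: healthy ? 'ok' : 'unhealthy', checks: detail }),
  };
}

// Usage with placeholder probes that always succeed
healthHandler({ database: async () => {}, cache: async () => {} })
  .then((res) => console.log(res.statusCode, res.body));
```

Which dependencies to include is a design choice: a check that fails whenever the cache is cold would trigger failover more often than a check that only covers the database.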

Load Testing with Artillery

Before going live, we need to know our breaking point. We use Artillery to simulate realistic traffic patterns:

# load-tests/school-day-simulation.yml
config:
  target: "https://api.staging.kidslearn.app"
  phases:
    # Warm-up
    - duration: 60
      arrivalRate: 5
      name: "Warm up"
    
    # Morning ramp-up (students arriving)
    - duration: 120
      arrivalRate: 5
      rampTo: 50
      name: "Morning ramp"
    
    # Peak lesson time
    - duration: 300
      arrivalRate: 50
      name: "Peak load"
    
    # Spike (all classes starting at once)
    - duration: 60
      arrivalRate: 100
      name: "Traffic spike"
    
    # Return to normal
    - duration: 120
      arrivalRate: 50
      rampTo: 10
      name: "Cool down"

  plugins:
    expect: {}
    
  defaults:
    headers:
      Content-Type: "application/json"

scenarios:
  - name: "Browse lessons"
    weight: 40
    flow:
      - get:
          url: "/api/lessons?subject=math&grade=2"
          expect:
            - statusCode: 200
            - contentType: json
            - hasProperty: "data"

  - name: "Take a lesson"
    weight: 35
    flow:
      - get:
          url: "/api/lessons"
          capture:
            - json: "$.data[0].id"
              as: "lessonId"
      - get:
          url: "/api/lessons/{{ lessonId }}"
          expect:
            - statusCode: 200
      - post:
          url: "/api/progress"
          json:
            childId: "test-child-123"
            lessonId: "{{ lessonId }}"
            eventType: "lesson_start"
          expect:
            - statusCode: 200

  - name: "Check progress"
    weight: 25
    flow:
      - get:
          url: "/api/progress/test-child-123"
          expect:
            - statusCode: 200
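
Note that arrivalRate is new virtual users per second, not requests per second. To get a feel for the volume this config produces, here is a back-of-the-envelope sketch that mirrors the phases and scenario weights above. It assumes ramps are linear and that every virtual user completes every request in its flow. The type and function names are hypothetical.

```typescript
// Approximate the virtual users and HTTP requests generated by the test plan.
interface Phase { name: string; duration: number; arrivalRate: number; rampTo?: number }
interface Scenario { name: string; weight: number; requestsPerUser: number }

// Mirrors the `phases` section of school-day-simulation.yml
const phases: Phase[] = [
  { name: 'Warm up', duration: 60, arrivalRate: 5 },
  { name: 'Morning ramp', duration: 120, arrivalRate: 5, rampTo: 50 },
  { name: 'Peak load', duration: 300, arrivalRate: 50 },
  { name: 'Traffic spike', duration: 60, arrivalRate: 100 },
  { name: 'Cool down', duration: 120, arrivalRate: 50, rampTo: 10 },
];

// Mirrors the `scenarios` section: weight and number of requests per flow
const scenarios: Scenario[] = [
  { name: 'Browse lessons', weight: 40, requestsPerUser: 1 },
  { name: 'Take a lesson', weight: 35, requestsPerUser: 3 },
  { name: 'Check progress', weight: 25, requestsPerUser: 1 },
];

// Average of start and end rate times duration (linear ramp assumption)
function usersInPhase(p: Phase): number {
  const endRate = p.rampTo ?? p.arrivalRate;
  return (p.duration * (p.arrivalRate + endRate)) / 2;
}

const totalSeconds = phases.reduce((sum, p) => sum + p.duration, 0);
const totalUsers = phases.reduce((sum, p) => sum + usersInPhase(p), 0);

const totalWeight = scenarios.reduce((sum, s) => sum + s.weight, 0);
const weightedRequests = scenarios.reduce((sum, s) => sum + s.weight * s.requestsPerUser, 0);

// Expected requests = users x weighted average requests per user
const totalRequests = Math.round((totalUsers * weightedRequests) / totalWeight);

const peakArrivalRate = Math.max(...phases.map((p) => Math.max(p.arrivalRate, p.rampTo ?? 0)));
const peakRequestsPerSecond = (peakArrivalRate * weightedRequests) / totalWeight;

console.log({ totalSeconds, totalUsers, totalRequests, peakRequestsPerSecond });
```

Under these assumptions the plan runs for 660 seconds, starts roughly 28,200 virtual users, and sends about 47,940 requests, averaging around 170 requests per second during the spike phase.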

Running Load Tests

# Install Artillery
npm install -g artillery@latest

# Run the load test
artillery run load-tests/school-day-simulation.yml --output results.json

# Generate HTML report
artillery report results.json --output results.html

Key Metrics to Watch During Load Tests

┌───────────────────────────┬──────────┬──────────┬──────────┐
│ Metric                    │ Target   │ Warning  │ Critical │
├───────────────────────────┼──────────┼──────────┼──────────┤
│ p50 latency               │ < 200ms  │ < 500ms  │ > 1000ms │
│ p99 latency               │ < 1000ms │ < 3000ms │ > 5000ms │
│ Error rate                │ < 0.1%   │ < 1%     │ > 5%     │
│ Lambda concurrent exec    │ < 500    │ < 800    │ > 900    │
│ Aurora ACU usage          │ < 50%    │ < 75%    │ > 90%    │
│ ElastiCache CPU           │ < 40%    │ < 60%    │ > 80%    │
│ API Gateway throttles     │ 0        │ < 10/min │ > 50/min │
└───────────────────────────┴──────────┴──────────┴──────────┘
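
The thresholds in the table can be turned into a simple pass/fail gate for a load test run. The sketch below is one possible encoding, with hypothetical names throughout. The table leaves a gap between the Warning and Critical columns (for example, p50 between 500ms and 1000ms), and this sketch labels that band "degraded", which is an assumption rather than something the table defines. Rows with an exact target such as the API Gateway throttles row (target 0) would need separate handling.

```typescript
// Classify a measured value against the Target / Warning / Critical columns.
// 'degraded' covers the band between Warning and Critical (an assumed label).
type Level = 'ok' | 'warning' | 'degraded' | 'critical';

interface Thresholds {
  target: number;   // values below this meet the target
  warning: number;  // values below this fall in the warning band
  critical: number; // values above this are critical
}

function classify(value: number, t: Thresholds): Level {
  if (value < t.target) return 'ok';
  if (value < t.warning) return 'warning';
  if (value > t.critical) return 'critical';
  return 'degraded';
}

// Two rows from the table, expressed as thresholds
const p50LatencyMs: Thresholds = { target: 200, warning: 500, critical: 1000 };
const errorRatePercent: Thresholds = { target: 0.1, warning: 1, critical: 5 };

// Example measurements (made-up values for illustration)
console.log(classify(180, p50LatencyMs));     // below the 200ms target
console.log(classify(2.5, errorRatePercent)); // between warning and critical
```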

Production Launch Checklist

Infrastructure

  • Multi-AZ enabled for Aurora, ElastiCache, and Fargate
  • Aurora Global Database configured with secondary region
  • Route 53 health checks configured with failover records
  • CloudFront distribution uses all edge locations (PriceClass_All)
  • WAF rules deployed and blocking common attacks
  • VPC flow logs enabled
  • S3 bucket versioning enabled, public access blocked
  • KMS key rotation enabled

Security

  • COPPA consent flow tested end-to-end
  • All IAM roles use least-privilege policies
  • No hardcoded secrets — Secrets Manager for everything
  • GuardDuty enabled
  • Security Hub enabled with CIS Benchmarks
  • Cognito advanced security features enabled
  • TLS 1.2+ enforced everywhere
  • Penetration test completed

Reliability

  • Load test passed: 100 req/sec sustained for 30 minutes
  • Failover test: primary region killed, DNS shifts within 60 seconds
  • Circuit breaker tested: Fargate deployment rolls back on error spike
  • Database failover tested: Aurora promotes reader in < 60 seconds
  • Backup restoration tested: full database restore from snapshot
  • Runbooks written for top 5 incident scenarios

Observability

  • CloudWatch dashboard configured with all critical metrics
  • Alarms configured for error rate, latency, and resource utilization
  • X-Ray tracing enabled across all Lambda and Fargate services
  • Structured logging deployed with correlation IDs
  • Cost anomaly detection enabled
  • Budget alerts set at 80% and 100% of monthly estimate

Performance

  • Lambda cold starts eliminated with provisioned concurrency (critical paths)
  • CloudFront caching verified — cache hit ratio > 90% for static assets
  • Database query performance validated — no queries > 500ms
  • Redis cache hit rate > 80% for lesson content
  • Image optimization pipeline verified — WebP/AVIF serving correctly

Series Conclusion

Over ten parts, we’ve built a complete, production-ready AWS platform:

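┌──────┬────────────────┬────────────────────────────────────────────────┐
│ Part │ Component      │ AWS Services                                   │
├──────┼────────────────┼────────────────────────────────────────────────┤
│ 1    │ Strategy       │ Organizations, Control Tower, Well-Architected │
│ 2    │ Infrastructure │ CDK v2, CloudFormation, VPC                    │
│ 3    │ Frontend       │ Amplify Gen 2, CloudFront, S3                  │
│ 4    │ Backend        │ API Gateway HTTP, Lambda, ECS Fargate          │
│ 5    │ Database       │ Aurora Serverless v2, DynamoDB, ElastiCache    │
│ 6    │ AI/ML          │ Bedrock, SageMaker, Personalize                │
│ 7    │ DevOps         │ CodePipeline, CodeBuild, ECR                   │
│ 8    │ Security       │ IAM, Cognito, WAF, KMS, GuardDuty              │
│ 9    │ Observability  │ CloudWatch, X-Ray, Cost Explorer               │
│ 10   │ Production     │ Route 53, Global Database, Auto Scaling        │
└──────┴────────────────┴────────────────────────────────────────────────┘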

The architecture has been in production for Kids Learn for six months. The platform serves 5,000 monthly active users, generates 2 million learning events per month, and has maintained 99.95% uptime since launch. The monthly AWS bill averages $380 — less than hiring one additional part-time developer.

Thank you for reading this series. Build something that matters.


This is Part 10 of a 10-part series: AWS Full-Stack Mastery for Technical Leads.

Series outline:

  1. Why AWS & Getting Started (Part 1)
  2. Infrastructure as Code (CDK) (Part 2)
  3. Frontend (Amplify + CloudFront) (Part 3)
  4. Backend (API Gateway + Lambda + Fargate) (Part 4)
  5. Database (Aurora + DynamoDB + ElastiCache) (Part 5)
  6. AI/ML (Bedrock + SageMaker) (Part 6)
  7. DevOps (CodePipeline + CodeBuild) (Part 7)
  8. Security (IAM + Cognito + WAF) (Part 8)
  9. Observability (CloudWatch + X-Ray) (Part 9)
  10. Production (Multi-Region + DR) (this post)
