The conversation started at a whiteboard. Toan had drawn a box labeled “Kids Learn” and surrounded it with question marks. “We need a platform,” he said. “Not a hobby project. Not a prototype. A platform that scales to thousands of concurrent students, handles AI inference for personalized lessons, keeps children’s data safe under COPPA and GDPR, and doesn’t wake us up at 3 AM because a single server fell over.”

I looked at the whiteboard and started listing what we actually needed:

  • Compute that scales from zero to thousands of concurrent API requests during school hours, then back to near-zero at midnight
  • Database that handles relational data (users, subscriptions, progress) and vector search (curriculum embeddings, knowledge states) in a single transactional system
  • AI/ML services for content generation, adaptive difficulty, and personalized learning paths — without managing GPU instances
  • Storage for media assets (lesson images, audio instructions, celebration animations) with global edge delivery
  • Security that satisfies COPPA requirements: verifiable parental consent, zero data collection from children without permission, encrypted everything
  • CI/CD that lets four developers ship multiple times a day without stepping on each other
  • Observability that tells us when a lesson is loading slowly in rural Queensland before a parent files a support ticket
  • Cost model where we pay for what we use, not for idle capacity at 2 AM

That’s not a VPS. That’s not a PaaS with training wheels. That’s AWS.

This is Part 1 of a 10-part series where I document exactly how we built Kids Learn — an AI-powered educational SaaS platform for children aged 4–12 — entirely on AWS. Not theory. Not “here’s what you could do.” Here’s what we did, why we did it, and the exact commands, configurations, and CDK constructs you need to do it yourself.

AWS Full-Stack Overview — Complete service map for Kids Learn showing compute, database, AI/ML, security, DevOps, and observability layers

Why AWS Over the Alternatives

Let me be direct. There are excellent cloud providers. GCP has strong AI/ML services. Azure integrates well with enterprise Microsoft stacks. Cloudflare has remarkable edge computing at low cost. We evaluated all of them.

The Decision Matrix

┌───────────────────────────────┬──────────────┬────────────────────┬─────────────────┬──────────────┐
│ Requirement                   │ AWS          │ GCP                │ Azure           │ Cloudflare   │
├───────────────────────────────┼──────────────┼────────────────────┼─────────────────┼──────────────┤
│ Managed PostgreSQL + pgvector │ Aurora       │ Cloud SQL          │ Flexible Server │ ❌ (D1 only)  │
│ Serverless containers         │ Fargate      │ Cloud Run          │ Container Apps  │ ❌            │
│ AI/ML managed services        │ Bedrock      │ Vertex AI          │ Azure OpenAI    │ Workers AI   │
│ Child-safe auth (COPPA)       │ Cognito      │ Identity Platform  │ AD B2C          │ Access       │
│ WAF + DDoS protection         │ WAF + Shield │ Cloud Armor        │ Front Door      │ Built-in     │
│ IaC maturity                  │ CDK          │ Deployment Manager │ Bicep           │ Terraform    │
│ CI/CD native                  │ CodePipeline │ Cloud Build        │ Azure DevOps    │ Pages        │
│ Edge locations (global)       │ 600+         │ 200+               │ 300+            │ 300+         │
│ SaaS team experience          │ High         │ Medium             │ Low             │ Medium       │
│ Enterprise adoption           │ #1           │ #3                 │ #2              │ Growing      │
│ Compliance certifications     │ 143          │ 100+               │ 100+            │ Limited      │
└───────────────────────────────┴──────────────┴────────────────────┴─────────────────┴──────────────┘

AWS won on three decisive factors:

1. Service breadth for a full-stack SaaS. We need relational databases, vector search, serverless compute, container orchestration, AI inference, CDN, DNS, email sending, push notifications, real-time analytics, and compliance tooling. AWS has mature, production-proven services for every single one. No gaps that require third-party integrations.

2. Aurora PostgreSQL with pgvector. Our adaptive learning engine requires vector similarity search alongside traditional relational queries — in the same transaction. Aurora Serverless v2 gives us PostgreSQL 16 with pgvector, auto-scaling from 0.5 ACU to 128 ACU, and we only pay for what we consume. Cloud SQL on GCP comes close, but Aurora’s instant failover and read replica scaling are superior for our latency requirements.

3. Amazon Bedrock. We need AI content generation (creating personalized lessons), but we don’t want to manage model infrastructure. Bedrock gives us access to Claude, Titan, and Llama models through a unified API with no model hosting required. The pricing is per-token with no minimum commitment. For a startup-stage SaaS, this cost model is critical.
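To make factor 2 concrete, here is a minimal sketch of what a vector similarity lookup computes. In PostgreSQL, pgvector does this inside the database through its `<=>` cosine-distance operator in an `ORDER BY ... LIMIT k` query; the in-memory version below only illustrates the math. The `LessonEmbedding` type and `nearestLessons` helper are hypothetical names for illustration, not part of the Kids Learn codebase.

```typescript
// Cosine distance between two embedding vectors: 1 - cos(angle).
// pgvector exposes the same measure through its `<=>` operator.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('vectors must be non-empty and the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('cosine distance is undefined for a zero vector');
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical shape: real embeddings have hundreds of dimensions.
interface LessonEmbedding {
  lessonId: string;
  embedding: number[];
}

// Return the k lessons closest to the query vector, nearest first.
function nearestLessons(
  query: number[],
  lessons: LessonEmbedding[],
  k: number,
): LessonEmbedding[] {
  return [...lessons]
    .sort((x, y) => cosineDistance(query, x.embedding) - cosineDistance(query, y.embedding))
    .slice(0, k);
}
```

Doing this in application code means pulling every embedding out of the database first, which is why running the search next to the relational data, in the same transaction, matters for us.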

What About Cloudflare?

I have deep respect for Cloudflare — I’ve written extensively about their services and still use them for DNS and edge caching in front of AWS. But for a full-stack SaaS with databases, containers, and AI inference? Cloudflare’s platform isn’t there yet. D1 is SQLite-based (no pgvector), Workers have CPU time limits that kill long-running AI inference, and their compute options are limited to Workers and Pages Functions. For a portfolio site or an API proxy — Cloudflare is brilliant. For Kids Learn — we need AWS.

The AWS Well-Architected Framework

Before writing a single line of CDK, I made the team read the Well-Architected Framework. Not the marketing summary. The actual documentation. All six pillars. Here’s why each one matters for Kids Learn specifically.

Pillar 1: Operational Excellence

What it means for Kids Learn: Can we deploy a fix at 2 PM on a Tuesday without taking the platform down? Can we roll back a bad deployment in under 60 seconds? Do we know what’s happening in production right now?

Our practices:

  • Every deployment goes through CodePipeline with automated testing gates
  • Blue/green deployments on ECS Fargate — zero downtime, instant rollback
  • Runbooks for every operational scenario documented in our internal wiki
  • Post-incident reviews after every production issue, no matter how small

Pillar 2: Security

What it means for Kids Learn: We handle children’s personal data. COPPA violations carry civil penalties of more than $50,000 per violation. This isn’t “we should probably encrypt that” security. This is “our company’s existence depends on getting this right” security.

Our practices:

  • IAM roles follow least privilege — Lambda functions can only access the specific DynamoDB tables and Aurora databases they need
  • All data encrypted at rest (KMS) and in transit (TLS 1.3)
  • Cognito handles authentication with COPPA-compliant parental consent flows
  • WAF rules block SQL injection, XSS, and bot traffic at the edge
  • GuardDuty monitors for anomalous API calls 24/7
  • Quarterly security audits with automated scanning
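As a rough illustration of the least-privilege bullet, the sketch below builds an IAM policy document that grants a function access to exactly one DynamoDB table. The table name `kidslearn-session-events` is a made-up placeholder, and the account ID reuses the placeholder production account from this post; the IAM action names are the real DynamoDB ones.

```typescript
// Shape of an IAM policy document (only the fields used here).
interface PolicyStatement {
  Effect: 'Allow' | 'Deny';
  Action: string[];
  Resource: string[];
}

interface PolicyDocument {
  Version: '2012-10-17';
  Statement: PolicyStatement[];
}

// Build a policy that lets a function read and write items in ONE table only.
function sessionEventsPolicy(
  region: string,
  accountId: string,
  tableName: string,
): PolicyDocument {
  const tableArn = `arn:aws:dynamodb:${region}:${accountId}:table/${tableName}`;
  return {
    Version: '2012-10-17',
    Statement: [
      {
        Effect: 'Allow',
        Action: ['dynamodb:PutItem', 'dynamodb:GetItem', 'dynamodb:Query'],
        Resource: [tableArn], // no wildcard: every other table stays out of reach
      },
    ],
  };
}

// Placeholder values for illustration only.
const examplePolicy = sessionEventsPolicy(
  'ap-southeast-1',
  '333333333333',
  'kidslearn-session-events',
);
```

The point of the design is what is missing: no `dynamodb:*`, no `Resource: "*"`, and no delete or table-management actions.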

Pillar 3: Reliability

What it means for Kids Learn: When 500 students start their math lessons at 9 AM, every single one needs to see their personalized content load in under 2 seconds. When the AI generates a lesson, it cannot serve stale or corrupt content. When one Availability Zone has issues, the platform keeps running.

Our practices:

  • Multi-AZ deployment for Aurora, ElastiCache, and ECS Fargate
  • Health checks on every service with automatic replacement of unhealthy instances
  • Circuit breaker pattern on AI inference calls — if Bedrock is slow, serve cached lessons
  • Chaos engineering experiments quarterly (we intentionally kill services to verify recovery)
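The circuit breaker bullet can be sketched roughly as follows. This is a simplified illustration, not our production code: the class name, thresholds, and the `withTimeout` helper are all assumptions, and `primary` stands in for whatever function calls Bedrock.

```typescript
type BreakerState = 'closed' | 'open';

// Reject if the wrapped promise takes longer than `ms` milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    p.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

class CircuitBreaker<T> {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetAfterMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  // Run `primary`; on failure, or while the breaker is open, use `fallback`.
  async call(primary: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetAfterMs) {
        return fallback(); // still cooling down: skip the slow dependency
      }
      this.state = 'closed'; // cool-down elapsed: allow one trial call
      this.failures = this.failureThreshold - 1; // a single failure re-opens
    }
    try {
      const result = await primary();
      this.failures = 0;
      return result;
    } catch {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      return fallback();
    }
  }

  isOpen(): boolean {
    return this.state === 'open';
  }
}
```

A call site would look something like `breaker.call(() => withTimeout(generateLesson(id), 4000), () => cachedLesson(id))`, where `generateLesson` and `cachedLesson` are hypothetical functions. A slow Bedrock response becomes a rejection, and after a few of those the breaker stops calling Bedrock for the cool-down period.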

Pillar 4: Performance Efficiency

What it means for Kids Learn: AI content generation takes 2-4 seconds. Database vector similarity search takes 50-200ms. We need to make the user experience feel instant despite these latencies.

Our practices:

  • Pre-generate lessons during off-peak hours, serve from cache during peak
  • Aurora read replicas for analytics queries that don’t need real-time consistency
  • CloudFront caches static assets at 600+ edge locations globally
  • Lambda functions use provisioned concurrency for critical paths (lesson API, auth)

Pillar 5: Cost Optimization

What it means for Kids Learn: We’re a startup. Every dollar matters. But we also can’t cut corners on security or reliability for a kids’ platform.

Our practices:

  • Aurora Serverless v2 scales to zero-ish (0.5 ACU minimum) during off-hours
  • Lambda functions sized with AWS Lambda Power Tuning for cost-optimal memory/duration
  • S3 Intelligent Tiering for media assets — frequently accessed lessons stay hot, archived content moves to cheaper storage
  • Monthly cost reviews with alerts at 80% of budget thresholds
  • Reserved capacity for predictable baseline, on-demand for peak school hours

Pillar 6: Sustainability

What it means for Kids Learn: Serverless and auto-scaling mean we’re not running idle servers 24/7. Our compute consumes energy proportional to actual usage.

Our practices:

  • Serverless-first architecture — Lambda and Fargate consume compute only when processing requests
  • Right-sized containers based on actual resource utilization data
  • Region selection considers carbon intensity (us-west-2 runs on 90%+ renewable energy)

AWS Account Setup — The Right Way

Most tutorials start with “create an AWS account.” Most production teams discover six months later that their account structure is a mess. Let’s do it right from the start.

AWS Organizations and Multi-Account Strategy

A single AWS account for everything is how you end up with a developer accidentally deleting the production database because they thought they were in staging. We use AWS Organizations with separate accounts:

AWS Organization (Root)
├── Management Account (billing, organizations, SSO)
├── Security Account (GuardDuty, SecurityHub, CloudTrail logs)
├── Development Account
│   ├── Dev environment
│   └── Feature branch environments
├── Staging Account
│   └── Pre-production environment (mirrors prod)
└── Production Account
    ├── Primary region (ap-southeast-1)
    └── DR region (us-west-2)

Setting Up the Organization

# Install AWS CLI v2
# https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

# Configure SSO (recommended over long-lived access keys)
aws configure sso
# SSO session name: kidslearn
# SSO start URL: https://kidslearn.awsapps.com/start
# SSO Region: ap-southeast-1
# SSO registration scopes: sso:account:access

# Create the organization (from management account)
aws organizations create-organization --feature-set ALL

# Create child accounts
aws organizations create-account \
  --email "aws-security@kidslearn.app" \
  --account-name "KidsLearn-Security"

aws organizations create-account \
  --email "aws-dev@kidslearn.app" \
  --account-name "KidsLearn-Development"

aws organizations create-account \
  --email "aws-staging@kidslearn.app" \
  --account-name "KidsLearn-Staging"

aws organizations create-account \
  --email "aws-prod@kidslearn.app" \
  --account-name "KidsLearn-Production"

IAM Identity Center (SSO)

Nobody on the team uses root credentials. Nobody uses long-lived access keys. Everyone authenticates through IAM Identity Center with MFA enforced.

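# Enable IAM Identity Center.
# Note: if you enable Identity Center for the whole organization, do that from
# the management account's console. The CLI call below only creates an
# account-level instance in a member or standalone account.
aws sso-admin create-instance

# Create permission sets
# (--relay-state is optional; omit it rather than passing an empty string,
# which fails parameter validation)
aws sso-admin create-permission-set \
  --instance-arn "arn:aws:sso:::instance/ssoins-XXXX" \
  --name "DeveloperAccess" \
  --description "Developer access - no IAM or billing" \
  --session-duration "PT8H"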

# Attach managed policies
aws sso-admin attach-managed-policy-to-permission-set \
  --instance-arn "arn:aws:sso:::instance/ssoins-XXXX" \
  --permission-set-arn "arn:aws:sso:::permissionSet/ssoins-XXXX/ps-XXXX" \
  --managed-policy-arn "arn:aws:iam::aws:policy/PowerUserAccess"

Budget Alerts — Don’t Get Surprised

The most important thing you can do in the first 10 minutes of a new AWS account:

# Create a budget with alerts
aws budgets create-budget \
  --account-id "123456789012" \
  --budget '{
    "BudgetName": "KidsLearn-Monthly",
    "BudgetLimit": {
      "Amount": "500",
      "Unit": "USD"
    },
    "BudgetType": "COST",
    "TimeUnit": "MONTHLY"
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 50,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "EMAIL",
          "Address": "thuan@kidslearn.app"
        }
      ]
    },
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "EMAIL",
          "Address": "thuan@kidslearn.app"
        }
      ]
    },
    {
      "Notification": {
        "NotificationType": "FORECASTED",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 100,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "EMAIL",
          "Address": "thuan@kidslearn.app"
        }
      ]
    }
  ]'

This creates three alerts:

  • 50% actual: “You’ve spent half the budget mid-month — is this normal?”
  • 80% actual: “You’re approaching the limit — investigate now.”
  • 100% forecasted: “At this rate, you’ll exceed the budget — take action.”
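If it helps to see the threshold logic spelled out, here is a small sketch that mirrors the three notifications configured above and reports which ones would fire for a given month. It is only a model of the `GREATER_THAN` percentage comparison; the function and type names are mine, not part of the AWS Budgets API.

```typescript
type AlertKind = 'ACTUAL' | 'FORECASTED';

interface BudgetAlert {
  kind: AlertKind;
  thresholdPercent: number;
}

// The three alerts from the create-budget call above.
const budgetAlerts: BudgetAlert[] = [
  { kind: 'ACTUAL', thresholdPercent: 50 },
  { kind: 'ACTUAL', thresholdPercent: 80 },
  { kind: 'FORECASTED', thresholdPercent: 100 },
];

// Return the alerts that would fire for the given spend figures (USD).
function firedAlerts(
  budgetLimit: number,
  actualSpend: number,
  forecastSpend: number,
): BudgetAlert[] {
  return budgetAlerts.filter((alert) => {
    const spend = alert.kind === 'ACTUAL' ? actualSpend : forecastSpend;
    // GREATER_THAN: exactly hitting the threshold does not fire the alert.
    return (spend / budgetLimit) * 100 > alert.thresholdPercent;
  });
}
```

With the $500 budget above, `firedAlerts(500, 260, 480)` returns only the 50% actual alert: spend is at 52% of the limit, and the forecast is still under 100%.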

CloudTrail — Audit Everything from Day One

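# Create an organization-wide CloudTrail trail that logs to S3.
# Run this from the management account. Prerequisites: the bucket already
# exists with a bucket policy that lets CloudTrail write to it, and trusted
# access for cloudtrail.amazonaws.com is enabled in AWS Organizations.
aws cloudtrail create-trail \
  --name "kidslearn-audit" \
  --s3-bucket-name "kidslearn-cloudtrail-logs" \
  --is-multi-region-trail \
  --is-organization-trail \
  --enable-log-file-validation \
  --include-global-service-events

aws cloudtrail start-logging --name "kidslearn-audit"

Every management API call in every account of the organization is now logged (data events, such as individual S3 object reads, have to be enabled separately). When the inevitable “who changed the database security group?” question comes up, you have the answer.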

The Kids Learn Architecture on AWS

Here’s what we’re building across this 10-part series. Every service mentioned here gets deep, hands-on coverage in its dedicated part.

The Complete Service Map

┌─────────────────────────────────────────────────────────────────┐
│                        USERS (Global)                          │
│    Parents (Dashboard)  │  Children (Learning)  │  Teachers    │
└─────────────┬───────────────────────┬───────────────────────────┘
              │                       │
              ▼                       ▼
┌─────────────────────────────────────────────────────────────────┐
│  EDGE LAYER                                                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │ Route 53     │  │ CloudFront   │  │ WAF + Shield         │  │
│  │ DNS + Health │  │ CDN (600+    │  │ DDoS protection,     │  │
│  │ checks       │  │ edge locs)   │  │ SQL injection,       │  │
│  │              │  │              │  │ rate limiting        │  │
│  └──────────────┘  └──────────────┘  └──────────────────────┘  │
└─────────────┬───────────────────────┬───────────────────────────┘
              │                       │
              ▼                       ▼
┌─────────────────────────────────────────────────────────────────┐
│  FRONTEND                                                       │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ AWS Amplify Hosting (Next.js SSR + SSG)                   │   │
│  │ Static pages → S3, SSR → Lambda@Edge, API → Lambda       │   │
│  └──────────────────────────────────────────────────────────┘   │
│  ┌──────────────┐                                               │
│  │ S3 Bucket    │ (media assets: images, audio, animations)    │
│  └──────────────┘                                               │
└─────────────┬───────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────────┐
│  BACKEND COMPUTE                                                │
│  ┌──────────────────┐  ┌─────────────────────────────────────┐ │
│  │ API Gateway      │  │ ECS Fargate                         │ │
│  │ REST + HTTP APIs │  │ Adaptive learning engine (long-     │ │
│  │ → Lambda funcs   │  │ running AI inference, batch         │ │
│  │ (lessons, auth,  │  │ processing, content generation)     │ │
│  │ progress)        │  │                                     │ │
│  └──────────────────┘  └─────────────────────────────────────┘ │
└─────────────┬───────────────────────┬───────────────────────────┘
              │                       │
              ▼                       ▼
┌─────────────────────────────────────────────────────────────────┐
│  AI / ML SERVICES                                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │ Bedrock      │  │ SageMaker    │  │ Personalize          │  │
│  │ Claude/Titan │  │ Custom model │  │ Learning path        │  │
│  │ for content  │  │ fine-tuning  │  │ recommendations      │  │
│  │ generation   │  │ & evaluation │  │                      │  │
│  └──────────────┘  └──────────────┘  └──────────────────────┘  │
└─────────────┬───────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────────┐
│  DATA LAYER                                                     │
│  ┌──────────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │ Aurora Serverless│  │ DynamoDB     │  │ ElastiCache      │  │
│  │ v2 (PostgreSQL   │  │ Session      │  │ Redis            │  │
│  │ 16 + pgvector)   │  │ events,      │  │ Caching, rate    │  │
│  │ Users, lessons,  │  │ real-time    │  │ limiting, real-  │  │
│  │ progress, vectors│  │ analytics    │  │ time state       │  │
│  └──────────────────┘  └──────────────┘  └──────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  SECURITY & IDENTITY                                            │
│  ┌──────────┐ ┌──────────┐ ┌────────┐ ┌────────┐ ┌──────────┐ │
│  │ Cognito  │ │ IAM      │ │ KMS    │ │ Secrets│ │ GuardDuty│ │
│  │ User &   │ │ Least    │ │ Encrypt│ │ Manager│ │ Threat   │ │
│  │ Identity │ │ Privilege│ │ at rest│ │ Rotate │ │ Detection│ │
│  │ Pools    │ │ Roles    │ │ & TLS  │ │ keys   │ │ 24/7     │ │
│  └──────────┘ └──────────┘ └────────┘ └────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  DEVOPS & OBSERVABILITY                                         │
│  ┌──────────────┐ ┌──────────┐ ┌──────────┐ ┌───────────────┐ │
│  │ CodePipeline │ │ CodeBuild│ │CloudWatch│ │ X-Ray         │ │
│  │ CI/CD        │ │ Build &  │ │ Metrics, │ │ Distributed   │ │
│  │ pipeline     │ │ test     │ │ logs,    │ │ tracing       │ │
│  │              │ │          │ │ alarms   │ │               │ │
│  └──────────────┘ └──────────┘ └──────────┘ └───────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Why These Specific Services

Let me explain the non-obvious choices:

Aurora Serverless v2 over RDS PostgreSQL. Both run PostgreSQL. But Aurora Serverless v2 scales compute automatically based on load. During school hours (9 AM – 3 PM), we might need 16 ACUs. At midnight, we drop to 0.5 ACU. Traditional RDS would keep a db.r6g.xlarge running 24/7 at ~$400/month. Aurora Serverless v2 costs us ~$150/month for the same workload because it scales down when nobody’s learning.
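For readers who want to run their own numbers, the comparison boils down to ACU-hours times a price. The sketch below is an estimate under stated assumptions, none of which come from our bill: a placeholder price of $0.12 per ACU-hour (check current pricing for your region), a 30-day month, 20 school days with a 6-hour peak, and an 8 ACU peak average taken from the upper bound in the cost table later in this post. Storage and I/O charges are left out.

```typescript
interface LoadWindow {
  hoursPerMonth: number;
  averageAcu: number;
}

// Monthly compute cost = sum(hours x ACUs) x price per ACU-hour.
function auroraServerlessMonthlyCost(
  windows: LoadWindow[],
  pricePerAcuHour: number,
): number {
  const acuHours = windows.reduce(
    (sum, w) => sum + w.hoursPerMonth * w.averageAcu,
    0,
  );
  return Math.round(acuHours * pricePerAcuHour * 100) / 100; // round to cents
}

// ASSUMPTIONS (placeholders, not measured values): see the paragraph above.
const PRICE_PER_ACU_HOUR = 0.12;
const schoolHours: LoadWindow = { hoursPerMonth: 20 * 6, averageAcu: 8 };
const offHours: LoadWindow = { hoursPerMonth: 30 * 24 - 20 * 6, averageAcu: 0.5 };

const monthlyEstimate = auroraServerlessMonthlyCost(
  [schoolHours, offHours],
  PRICE_PER_ACU_HOUR,
);
```

Under those assumptions the estimate comes out at roughly $151 a month. Swap in your own peak ACU figure and regional price; the result is very sensitive to both.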

DynamoDB for session events over PostgreSQL. Session events — “child tapped option A at timestamp X” — are write-heavy, schema-flexible, and never need complex joins. DynamoDB handles millions of writes per second with single-digit millisecond latency. Putting this in PostgreSQL would mean constantly growing tables that slow down our relational queries.
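One possible way to shape such an event for DynamoDB is sketched below. The key layout (`pk`/`sk`, the `CHILD#` and `TS#` prefixes) and all type names are assumptions for illustration; the table design we actually use is covered in Part 5.

```typescript
interface LearningEvent {
  childId: string;
  sessionId: string;
  occurredAt: Date;
  eventType: string; // e.g. "option_tapped"
  payload: Record<string, unknown>; // the schema-flexible part
}

interface LearningEventItem {
  pk: string; // partition key: one session's events live together
  sk: string; // sort key: ISO timestamp, so items sort chronologically
  eventType: string;
  payload: Record<string, unknown>;
}

function toEventItem(e: LearningEvent): LearningEventItem {
  return {
    pk: `CHILD#${e.childId}#SESSION#${e.sessionId}`,
    // toISOString() is fixed-width UTC, so string order equals time order.
    sk: `TS#${e.occurredAt.toISOString()}`,
    eventType: e.eventType,
    payload: e.payload,
  };
}

// Hypothetical example event.
const tapItem = toEventItem({
  childId: 'c-42',
  sessionId: 's-7',
  occurredAt: new Date('2025-03-01T09:00:00.000Z'),
  eventType: 'option_tapped',
  payload: { option: 'A' },
});
```

Every write is a single put against one partition, and reading a session back is a single query on `pk`. Neither operation needs a join, which is the access pattern DynamoDB is built for.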

API Gateway + Lambda over just ECS Fargate. Most of our API endpoints are request-response: get a lesson, submit an answer, check progress. These are a good fit for Lambda: provisioned concurrency keeps the critical paths warm so they avoid cold starts, pricing is per request, and there are no servers to manage. Our adaptive learning engine is different. It runs complex inference that takes 5-15 seconds and requires persistent state, so it goes on Fargate.

Bedrock over self-hosted models. We tried running Llama on a SageMaker endpoint. It works. But managing GPU instances, handling scaling, optimizing inference — that’s a full-time job for a team our size. Bedrock gives us Claude and Titan through an API. We trade some cost efficiency for dramatically reduced operational complexity. When we have a dedicated ML ops team, we’ll revisit.

Cognito over Auth0/Firebase Auth. Cognito integrates natively with every other AWS service. IAM policies can reference Cognito user pool attributes directly. For our COPPA-compliant parental consent flow, we need tight integration between authentication and authorization at every layer. Third-party auth would mean building adapters everywhere.

Cost Analysis — What Does This Actually Cost?

The question every technical lead dreads from their CFO: “How much will AWS cost us?”

Here’s our actual monthly cost breakdown for Kids Learn at 5,000 monthly active users (our Year 1 target):

┌──────────────────────────────────┬──────────────┬────────────────┐
│ Service                          │ Monthly Cost │ Notes          │
├──────────────────────────────────┼──────────────┼────────────────┤
│ Aurora Serverless v2             │ $75 – $150   │ 0.5–8 ACU      │
│ DynamoDB (on-demand)             │ $15 – $30    │ ~2M writes/mo  │
│ ElastiCache Redis (t4g.micro)    │ $12          │ Single node    │
│ Lambda (lessons API)             │ $5 – $20     │ ~500K invoke   │
│ ECS Fargate (adaptive engine)    │ $30 – $60    │ 0.5 vCPU/1GB   │
│ API Gateway                      │ $5 – $10     │ HTTP API       │
│ CloudFront                       │ $10 – $25    │ ~100GB/mo      │
│ S3 (media assets)                │ $3 – $5      │ ~50GB stored   │
│ Amplify Hosting                  │ $0 – $15     │ Build minutes  │
│ Cognito                          │ $0 – $28     │ First 50K free │
│ Bedrock (AI inference)           │ $50 – $150   │ Varies by usage│
│ Route 53                         │ $2           │ 1 hosted zone  │
│ Secrets Manager                  │ $2           │ ~5 secrets     │
│ CloudWatch                       │ $5 – $15     │ Logs + metrics │
│ CodePipeline + CodeBuild         │ $5 – $15     │ ~100 builds/mo │
│ WAF                              │ $5 – $10     │ 2 web ACLs     │
│ GuardDuty                        │ $5 – $10     │ Event analysis │
│ KMS                              │ $1           │ 1 CMK          │
├──────────────────────────────────┼──────────────┼────────────────┤
│ TOTAL                            │ $230 – $560  │ 5K MAU         │
└──────────────────────────────────┴──────────────┴────────────────┘

Per-user cost: $0.05 – $0.11/month. At our subscription price of $9.99/month, that’s a healthy margin even at the low end of adoption.
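The arithmetic behind those figures is simple enough to keep as a helper. The sketch below uses only numbers from the table and the subscription price above; the function names are mine, and the break-even calculation looks at infrastructure cost alone (no salaries, support, or payment fees).

```typescript
// Infrastructure cost per monthly active user, in USD.
function costPerUser(monthlyCostUsd: number, monthlyActiveUsers: number): number {
  if (monthlyActiveUsers <= 0) {
    throw new Error('monthlyActiveUsers must be positive');
  }
  return monthlyCostUsd / monthlyActiveUsers;
}

// Smallest number of paying subscribers whose fees cover the AWS bill.
function breakEvenSubscribers(monthlyCostUsd: number, pricePerSubscriber: number): number {
  if (pricePerSubscriber <= 0) {
    throw new Error('pricePerSubscriber must be positive');
  }
  return Math.ceil(monthlyCostUsd / pricePerSubscriber);
}

const lowPerUser = costPerUser(230, 5000);   // about $0.05
const highPerUser = costPerUser(560, 5000);  // about $0.11
const breakEvenLow = breakEvenSubscribers(230, 9.99);
const breakEvenHigh = breakEvenSubscribers(560, 9.99);
```

At $9.99 a month, 24 paying subscribers cover the low end of the bill and 57 cover the high end, so the infrastructure pays for itself long before 5,000 active users.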

Comparison to alternatives:

  • Heroku with add-ons: Similar functionality would cost $400–$800/month with less flexibility
  • Self-hosted on EC2: We’d need at least 3 instances running 24/7 for HA: ~$300/month for compute alone, plus all the ops burden
  • Vercel + PlanetScale + separate AI: $200–$500/month but missing security services, requiring third-party integrations

Cost Optimization Strategies We Use

1. Serverless-first. Lambda and Aurora Serverless mean we pay proportional to actual traffic. A quiet Sunday morning costs pennies. School-day peak costs more, but we’re actually serving users.

2. Reserved capacity where predictable. Our ElastiCache Redis runs 24/7 — we buy a 1-year reserved instance and save 30%.

3. S3 Intelligent Tiering. Media assets for old lessons move to infrequent access automatically. We save ~40% on storage for historical content.

4. Right-sized Lambda functions. We used AWS Lambda Power Tuning to find the optimal memory setting for each function. Our lesson API runs at 512MB (not the default 128MB or an overkill 1024MB) — this is the sweet spot where it completes fastest per dollar.

5. Spot capacity for non-critical workloads. Our nightly batch jobs (content pre-generation, analytics aggregation) run on Fargate Spot at up to a 70% discount.
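To illustrate strategy 4, here is a rough sketch of the comparison a power-tuning run makes: compute cost per invocation is memory (in GB) times duration (in seconds) times the per-GB-second price. The price constant is an assumed x86 list price that may differ in your region, and the three measurements are invented numbers, not our results.

```typescript
interface TuningResult {
  memoryMb: number;
  avgDurationMs: number;
}

// ASSUMPTION: x86 price per GB-second; check current pricing for your region.
const PRICE_PER_GB_SECOND = 0.0000166667;

// Compute cost of one invocation (the flat per-request fee is excluded
// because it is the same for every memory setting).
function costPerInvocation(r: TuningResult): number {
  const gbSeconds = (r.memoryMb / 1024) * (r.avgDurationMs / 1000);
  return gbSeconds * PRICE_PER_GB_SECOND;
}

// Pick the memory setting with the lowest compute cost per invocation.
function cheapestSetting(results: TuningResult[]): TuningResult {
  if (results.length === 0) {
    throw new Error('at least one tuning result is required');
  }
  return results.reduce((best, r) =>
    costPerInvocation(r) < costPerInvocation(best) ? r : best,
  );
}

// Hypothetical measurements: more memory also means more CPU, so duration drops.
const measurements: TuningResult[] = [
  { memoryMb: 128, avgDurationMs: 900 },
  { memoryMb: 512, avgDurationMs: 180 },
  { memoryMb: 1024, avgDurationMs: 150 },
];

const bestSetting = cheapestSetting(measurements);
```

With these made-up numbers, 512MB wins: it uses 0.09 GB-seconds per call versus 0.1125 at 128MB and 0.15 at 1024MB. The cheapest setting is often not the smallest one, which is the whole reason to measure.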

What This Series Covers

This is a 10-part series. Each part is designed to be both a reference guide and a hands-on tutorial. You can read them in order for the full journey, or jump to the part that matches your current challenge.

Part 1 (this post): Why AWS & Getting Started. The decision framework, Well-Architected pillars, account setup, cost analysis, and the complete service map.

Part 2: Infrastructure as Code with AWS CDK. CDK v2 project structure, L1/L2/L3 constructs, multi-stack architecture, VPC design, and deploying to staging vs production.

Part 3: Frontend with Amplify, CloudFront & S3. Next.js SSR deployment on Amplify Gen 2, static assets on S3 with CloudFront OAC, custom domain setup, and cache invalidation strategies.

Part 4: Backend with API Gateway, Lambda & ECS Fargate. Building the lessons API with Lambda, running the adaptive engine on Fargate, API Gateway configuration, and VPC networking.

Part 5: Database with Aurora, DynamoDB & ElastiCache. Aurora Serverless v2 setup with pgvector, DynamoDB table design for session events, Redis caching patterns, and RDS Proxy.

Part 6: AI/ML with Bedrock, SageMaker & Personalize. AI content generation with Bedrock, custom model training on SageMaker, RAG pipelines with Knowledge Bases for Bedrock, and learning path recommendations with Personalize.

Part 7: DevOps with CodePipeline, CodeBuild & ECR. Complete CI/CD pipeline from GitHub push to production deployment, Docker image management, blue/green deployments, and GitOps patterns.

Part 8: Security with IAM, Cognito, WAF & Secrets Manager. Least-privilege IAM policies, Cognito user pools with COPPA-compliant parental consent, WAF rule groups, secrets rotation, and KMS encryption.

Part 9: Observability with CloudWatch, X-Ray & Cost Explorer. Custom metrics, structured logging, distributed tracing, alerting strategies, cost dashboards, and anomaly detection.

Part 10: Production with Multi-Region, DR & Scaling. Multi-region deployment with Route 53 failover, Aurora Global Database, disaster recovery strategies, auto-scaling policies, and load testing.

Getting Started — Your First CDK Project

Let’s end Part 1 with something hands-on. Here’s how to initialize the CDK project that we’ll build throughout this series.

Prerequisites

# Node.js 20+ (LTS)
node --version  # v20.x or higher

# AWS CLI v2
aws --version  # aws-cli/2.x

# AWS CDK v2
npm install -g aws-cdk
cdk --version  # 2.x

# Configure your AWS profile
aws configure sso

Initialize the Project

# Create the project directory
mkdir kids-learn-aws && cd kids-learn-aws

# Initialize CDK with TypeScript
cdk init app --language typescript

# Core dependencies. CDK v2 ships every construct library in a single package,
# so the per-service @aws-cdk/aws-* packages from CDK v1 are not needed.
# `cdk init` already adds both of these; running this is a harmless safeguard.
npm install aws-cdk-lib constructs

Project Structure

kids-learn-aws/
├── bin/
│   └── kids-learn.ts          # Entry point — creates the App
├── lib/
│   ├── stacks/
│   │   ├── network-stack.ts    # VPC, subnets, NAT gateway
│   │   ├── database-stack.ts   # Aurora, DynamoDB, ElastiCache
│   │   ├── compute-stack.ts    # Lambda, Fargate, API Gateway
│   │   ├── frontend-stack.ts   # Amplify, CloudFront, S3
│   │   ├── security-stack.ts   # Cognito, WAF, KMS
│   │   ├── ai-stack.ts         # Bedrock, SageMaker setup
│   │   ├── pipeline-stack.ts   # CodePipeline CI/CD
│   │   └── monitoring-stack.ts # CloudWatch, X-Ray, alerts
│   ├── constructs/
│   │   ├── aurora-serverless.ts
│   │   ├── lambda-api.ts
│   │   ├── fargate-service.ts
│   │   └── secure-bucket.ts
│   └── stages/
│       ├── dev-stage.ts
│       ├── staging-stage.ts
│       └── production-stage.ts
├── src/
│   ├── lambda/                 # Lambda function source code
│   │   ├── lessons/
│   │   ├── auth/
│   │   └── progress/
│   └── fargate/                # Fargate container source code
│       └── adaptive-engine/
├── test/
│   ├── stacks/
│   └── constructs/
├── parameters/
│   ├── dev.json
│   ├── staging.json
│   └── production.json
├── cdk.json
├── tsconfig.json
└── package.json

The Entry Point

// bin/kids-learn.ts
import * as cdk from 'aws-cdk-lib';
import { DevStage } from '../lib/stages/dev-stage';
import { StagingStage } from '../lib/stages/staging-stage';
import { ProductionStage } from '../lib/stages/production-stage';
import { PipelineStack } from '../lib/stacks/pipeline-stack';

const app = new cdk.App();

// Environment configurations
const devEnv = {
  account: '111111111111',
  region: 'ap-southeast-1',
};

const stagingEnv = {
  account: '222222222222',
  region: 'ap-southeast-1',
};

const prodEnv = {
  account: '333333333333',
  region: 'ap-southeast-1',
};

// CI/CD Pipeline (deployed to dev account, deploys to all environments)
new PipelineStack(app, 'KidsLearn-Pipeline', {
  env: devEnv,
  stages: [
    { stage: new DevStage(app, 'Dev', { env: devEnv }), name: 'Dev' },
    { stage: new StagingStage(app, 'Staging', { env: stagingEnv }), name: 'Staging' },
    { stage: new ProductionStage(app, 'Prod', { env: prodEnv }), name: 'Production' },
  ],
});

app.synth();

The Network Stack (Foundation)

Everything starts with the VPC. We’ll expand this in Part 2, but here’s the foundation:

// lib/stacks/network-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';

export interface NetworkStackProps extends cdk.StackProps {
  maxAzs: number;
  natGateways: number;
}

export class NetworkStack extends cdk.Stack {
  public readonly vpc: ec2.Vpc;

  constructor(scope: Construct, id: string, props: NetworkStackProps) {
    super(scope, id, props);

    this.vpc = new ec2.Vpc(this, 'KidsLearnVpc', {
      maxAzs: props.maxAzs,
      natGateways: props.natGateways,
      subnetConfiguration: [
        {
          name: 'Public',
          subnetType: ec2.SubnetType.PUBLIC,
          cidrMask: 24,
        },
        {
          name: 'Private',
          subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
          cidrMask: 24,
        },
        {
          name: 'Isolated',
          subnetType: ec2.SubnetType.PRIVATE_ISOLATED,
          cidrMask: 24,
        },
      ],
      // Enable flow logs for network monitoring
      flowLogs: {
        'VpcFlowLog': {
          destination: ec2.FlowLogDestination.toCloudWatchLogs(),
          trafficType: ec2.FlowLogTrafficType.REJECT,
        },
      },
    });

    // VPC Endpoints for AWS services (avoid NAT Gateway costs)
    this.vpc.addGatewayEndpoint('S3Endpoint', {
      service: ec2.GatewayVpcEndpointAwsService.S3,
    });

    this.vpc.addGatewayEndpoint('DynamoDBEndpoint', {
      service: ec2.GatewayVpcEndpointAwsService.DYNAMODB,
    });

    this.vpc.addInterfaceEndpoint('SecretsManagerEndpoint', {
      service: ec2.InterfaceVpcEndpointAwsService.SECRETS_MANAGER,
    });

    // Tags for cost allocation
    cdk.Tags.of(this).add('Project', 'KidsLearn');
    cdk.Tags.of(this).add('Environment', 'shared');
  }
}

Why three subnet tiers?

  • Public subnets: Load balancers and NAT gateways. Nothing else should have a public IP.
  • Private subnets: Lambda functions, Fargate tasks, ElastiCache. These can reach the internet through NAT gateways for downloading dependencies and calling AWS APIs.
  • Isolated subnets: Aurora database. No internet access at all — not even outbound. The database communicates only with other resources in the VPC through security groups.
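To get a feel for the address space this layout uses, the sketch below carves /24 subnets out of a VPC range, one per tier per Availability Zone. It assumes a `10.0.0.0/16` VPC and two AZs, and it allocates ranges in simple sequential order; the CIDRs CDK actually assigns may be ordered differently, so treat the output as illustrative. The helper names are hypothetical.

```typescript
// Convert a dotted IPv4 address to a number and back.
function ipToInt(ip: string): number {
  const parts = ip.split('.').map(Number);
  if (parts.length !== 4 || parts.some((p) => !Number.isInteger(p) || p < 0 || p > 255)) {
    throw new Error(`invalid IPv4 address: ${ip}`);
  }
  return parts.reduce((acc, p) => acc * 256 + p, 0);
}

function intToIp(n: number): string {
  return [24, 16, 8, 0].map((shift) => Math.floor(n / 2 ** shift) % 256).join('.');
}

interface PlannedSubnet {
  tier: string;
  az: number;
  cidr: string;
  usableIps: number;
}

function planSubnets(
  vpcCidr: string,
  tiers: string[],
  azCount: number,
  cidrMask: number,
): PlannedSubnet[] {
  const parts = vpcCidr.split('/');
  const vpcMask = Number(parts[1]);
  if (parts.length !== 2 || !Number.isInteger(vpcMask) || vpcMask > cidrMask) {
    throw new Error(`invalid VPC CIDR for /${cidrMask} subnets: ${vpcCidr}`);
  }
  const subnetSize = 2 ** (32 - cidrMask);
  if (tiers.length * azCount * subnetSize > 2 ** (32 - vpcMask)) {
    throw new Error('VPC range is too small for the requested subnets');
  }
  let next = ipToInt(parts[0]);
  const plan: PlannedSubnet[] = [];
  for (const tier of tiers) {
    for (let az = 0; az < azCount; az++) {
      // AWS reserves 5 addresses in every subnet.
      plan.push({ tier, az, cidr: `${intToIp(next)}/${cidrMask}`, usableIps: subnetSize - 5 });
      next += subnetSize;
    }
  }
  return plan;
}

// Assumed inputs: a 10.0.0.0/16 VPC and two AZs.
const subnetPlan = planSubnets('10.0.0.0/16', ['Public', 'Private', 'Isolated'], 2, 24);
```

Three tiers across two AZs produce six /24 subnets with 251 usable addresses each, which leaves most of a /16 free for future tiers or additional AZs.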

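Why VPC endpoints? Without them, every call from a Lambda function in a private subnet to S3 or DynamoDB goes through the NAT gateway, and you pay a per-gigabyte processing charge on that traffic. VPC endpoints keep the traffic on the AWS network instead, saving both latency and money. Gateway endpoints (S3, DynamoDB) are free. Interface endpoints, such as the Secrets Manager one, carry an hourly and per-gigabyte charge, so add them where they replace meaningful NAT traffic or where you need private access.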

The Bottom Line

AWS is not the simplest cloud. It’s not the cheapest for toy projects. It’s not the trendiest option at a hackathon. But for building a production SaaS that needs to be secure, scalable, observable, and cost-effective — it’s the most complete platform available.

In this series, we’re not just going to use AWS. We’re going to use it well. Following the Well-Architected Framework isn’t about checking compliance boxes. It’s about making decisions upfront that save you from 3 AM production incidents six months from now.

Kids Learn isn’t a demo. It’s a real product serving real children. Every architectural choice we make in this series reflects that responsibility.

In Part 2, we’ll dive deep into AWS CDK — the infrastructure-as-code tool that lets us define all of this as TypeScript, test it with unit tests, and deploy it with a single command. No clicking around the AWS Console. No “I think someone changed that security group last week.” Everything in code, everything in version control, everything reviewable.

See you in Part 2.


This is Part 1 of a 10-part series: AWS Full-Stack Mastery for Technical Leads.

Series outline:

  1. Why AWS & Getting Started — Decision framework, Well-Architected, account setup, cost analysis (this post)
  2. Infrastructure as Code (CDK) — CDK v2, constructs, multi-stack, VPC design (Part 2)
  3. Frontend (Amplify + CloudFront) — Next.js SSR, S3, CDN, custom domains (Part 3)
  4. Backend (API Gateway + Lambda + Fargate) — REST APIs, serverless compute, containers (Part 4)
  5. Database (Aurora + DynamoDB + ElastiCache) — PostgreSQL, NoSQL, caching (Part 5)
  6. AI/ML (Bedrock + SageMaker) — Content generation, custom models, RAG (Part 6)
  7. DevOps (CodePipeline + CodeBuild) — CI/CD, Docker, blue/green deploys (Part 7)
  8. Security (IAM + Cognito + WAF) — Least privilege, auth, COPPA compliance (Part 8)
  9. Observability (CloudWatch + X-Ray) — Metrics, tracing, cost management (Part 9)
  10. Production (Multi-Region + DR) — Scaling, disaster recovery, load testing (Part 10)
