AWS Full-Stack Mastery: Observability with CloudWatch, X-Ray & Cost Explorer (Part 9 of 10)

You can’t fix what you can’t see. Before we had observability, debugging production issues meant SSHing into servers and tailing log files while a parent was on the phone asking why their child’s lesson wasn’t loading. Now, when something goes wrong, I open the CloudWatch dashboard and within 30 seconds I know: the request hit API Gateway at 09:14:23, the Lambda function executed in 847ms, the database query took 340ms, and the Bedrock inference call timed out at 10 seconds because someone at AWS is having a worse day than I am.

Kids Learn’s observability stack covers three pillars: CloudWatch for metrics, logs, and alarms; X-Ray for distributed tracing across all services; and Cost Explorer for financial visibility. Together, they let us detect issues before users notice them, trace the exact cause in minutes, and avoid burning money on idle resources.

This is Part 9 of the AWS Full-Stack Mastery series.

Observability architecture — CloudWatch, X-Ray, and Cost Explorer providing full-stack visibility

CloudWatch — Metrics, Logs, and Alarms

Custom Metrics with Lambda Powertools

AWS Lambda Powertools provides a batteries-included approach to observability. We use three features: Logger for structured logging, Tracer for X-Ray integration, and Metrics for custom CloudWatch metrics.

// src/lambda/lessons/observability.ts
import { Logger } from '@aws-lambda-powertools/logger';
import { Tracer } from '@aws-lambda-powertools/tracer';
import { Metrics, MetricUnit } from '@aws-lambda-powertools/metrics';

// Initialize once per cold start
export const logger = new Logger({
  serviceName: 'lessons-api',
  logLevel: process.env.LOG_LEVEL || 'INFO',
  persistentLogAttributes: {
    environment: process.env.NODE_ENV,
    version: process.env.APP_VERSION || 'unknown',
  },
});

export const tracer = new Tracer({
  serviceName: 'lessons-api',
  captureHTTPsRequests: true,
});

export const metrics = new Metrics({
  namespace: 'KidsLearn',
  serviceName: 'lessons-api',
  defaultDimensions: {
    environment: process.env.NODE_ENV || 'unknown',
  },
});

// Custom metric helpers
export function recordLatency(operation: string, durationMs: number): void {
  metrics.addMetric(`${operation}Latency`, MetricUnit.Milliseconds, durationMs);
}

export function recordCacheHit(hit: boolean): void {
  metrics.addMetric('CacheHit', MetricUnit.Count, hit ? 1 : 0);
  metrics.addMetric('CacheMiss', MetricUnit.Count, hit ? 0 : 1);
}

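export function recordError(errorType: string): void {
  // Use a single-metric scope so the ErrorType dimension applies only to this
  // metric; metrics.addDimension() would attach it to every metric in the batch.
  const errorMetric = metrics.singleMetric();
  errorMetric.addDimension('ErrorType', errorType);
  errorMetric.addMetric('Errors', MetricUnit.Count, 1);
}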
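
One way these helpers might be called from a handler is through a small timing wrapper. The sketch below is an assumption, not code from the Kids Learn repository: `timed`, `LatencyRecorder`, and `ErrorRecorder` are hypothetical names, and the recorders are passed in as parameters so the snippet runs without Powertools installed. In the module above you would pass `recordLatency` and `recordError`.

```typescript
// Hypothetical helper: times an async operation and reports the outcome.
type LatencyRecorder = (operation: string, durationMs: number) => void;
type ErrorRecorder = (errorType: string) => void;

async function timed<T>(
  operation: string,
  fn: () => Promise<T>,
  onLatency: LatencyRecorder,
  onError: ErrorRecorder,
): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } catch (err) {
    // Use the error class name as the ErrorType dimension value.
    onError(err instanceof Error ? err.name : 'UnknownError');
    throw err;
  } finally {
    // Latency is recorded for both successes and failures.
    onLatency(operation, Date.now() - start);
  }
}
```

For example, `await timed('GetLesson', () => loadLesson(id), recordLatency, recordError)` would emit a `GetLessonLatency` metric, where `loadLesson` is a placeholder for whatever the handler actually calls.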

CloudWatch Dashboard CDK

// lib/stacks/monitoring-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as actions from 'aws-cdk-lib/aws-cloudwatch-actions';
import { Construct } from 'constructs';
import { EnvironmentConfig } from '../config/environments';

export class MonitoringStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: {
    config: EnvironmentConfig;
    lessonsFunction: cdk.aws_lambda.Function;
    progressFunction: cdk.aws_lambda.Function;
    auroraCluster: cdk.aws_rds.DatabaseCluster;
    apiId: string;
  } & cdk.StackProps) {
    super(scope, id, props);

    const { config } = props;

    // =========================================
    // Alarm Topic
    // =========================================
    const alarmTopic = new sns.Topic(this, 'AlarmTopic', {
      topicName: `kidslearn-alarms-${config.envName}`,
    });

    // =========================================
    // Lambda Alarms
    // =========================================
    
    // High error rate alarm
    const lessonsErrorAlarm = new cloudwatch.Alarm(this, 'LessonsErrorRate', {
      alarmName: `${config.envName}-lessons-error-rate`,
      metric: props.lessonsFunction.metricErrors({
        period: cdk.Duration.minutes(5),
        statistic: 'Sum',
      }),
      threshold: 10,
      evaluationPeriods: 2,
      comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
      treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
      alarmDescription: 'Lessons API logged more than 10 errors in each of 2 consecutive 5-minute periods',
    });
    lessonsErrorAlarm.addAlarmAction(new actions.SnsAction(alarmTopic));

    // High latency alarm
    const lessonsLatencyAlarm = new cloudwatch.Alarm(this, 'LessonsLatency', {
      alarmName: `${config.envName}-lessons-p99-latency`,
      metric: props.lessonsFunction.metricDuration({
        period: cdk.Duration.minutes(5),
        statistic: 'p99',
      }),
      threshold: 5000, // 5 seconds
      evaluationPeriods: 3,
      comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
      alarmDescription: 'Lessons API p99 latency exceeds 5 seconds',
    });
    lessonsLatencyAlarm.addAlarmAction(new actions.SnsAction(alarmTopic));

    // =========================================
    // Aurora Alarms
    // =========================================
    const dbCpuAlarm = new cloudwatch.Alarm(this, 'AuroraCPU', {
      alarmName: `${config.envName}-aurora-cpu`,
      metric: new cloudwatch.Metric({
        namespace: 'AWS/RDS',
        metricName: 'CPUUtilization',
        dimensionsMap: {
          DBClusterIdentifier: props.auroraCluster.clusterIdentifier,
        },
        statistic: 'Average',
        period: cdk.Duration.minutes(5),
      }),
      threshold: 80,
      evaluationPeriods: 3,
      alarmDescription: 'Aurora CPU utilization exceeds 80% for 15 minutes',
    });
    dbCpuAlarm.addAlarmAction(new actions.SnsAction(alarmTopic));

    // =========================================
    // Dashboard
    // =========================================
    const dashboard = new cloudwatch.Dashboard(this, 'KidsLearnDashboard', {
      dashboardName: `KidsLearn-${config.envName}`,
      periodOverride: cloudwatch.PeriodOverride.AUTO,
    });

    // Row 1: API Health
    dashboard.addWidgets(
      new cloudwatch.GraphWidget({
        title: 'API Requests',
        left: [
          new cloudwatch.Metric({
            namespace: 'AWS/ApiGateway',
            metricName: 'Count',
            dimensionsMap: { ApiId: props.apiId },
            statistic: 'Sum',
            period: cdk.Duration.minutes(1),
          }),
        ],
        width: 8,
      }),
      new cloudwatch.GraphWidget({
        title: 'API Latency (p50, p90, p99)',
        left: [
          new cloudwatch.Metric({
            namespace: 'AWS/ApiGateway',
            metricName: 'Latency',
            dimensionsMap: { ApiId: props.apiId },
            statistic: 'p50',
            period: cdk.Duration.minutes(1),
          }),
          new cloudwatch.Metric({
            namespace: 'AWS/ApiGateway',
            metricName: 'Latency',
            dimensionsMap: { ApiId: props.apiId },
            statistic: 'p90',
            period: cdk.Duration.minutes(1),
          }),
          new cloudwatch.Metric({
            namespace: 'AWS/ApiGateway',
            metricName: 'Latency',
            dimensionsMap: { ApiId: props.apiId },
            statistic: 'p99',
            period: cdk.Duration.minutes(1),
          }),
        ],
        width: 8,
      }),
      new cloudwatch.GraphWidget({
        title: 'API Errors (4xx, 5xx)',
        left: [
          // HTTP APIs (identified by the ApiId dimension) publish these metrics
          // as '4xx' and '5xx'; '4XXError' and '5XXError' are the REST API names.
          new cloudwatch.Metric({
            namespace: 'AWS/ApiGateway',
            metricName: '4xx',
            dimensionsMap: { ApiId: props.apiId },
            statistic: 'Sum',
            period: cdk.Duration.minutes(1),
          }),
          new cloudwatch.Metric({
            namespace: 'AWS/ApiGateway',
            metricName: '5xx',
            dimensionsMap: { ApiId: props.apiId },
            statistic: 'Sum',
            period: cdk.Duration.minutes(1),
          }),
        ],
        width: 8,
      })
    );

    // Row 2: Lambda Performance
    dashboard.addWidgets(
      new cloudwatch.GraphWidget({
        title: 'Lambda Invocations',
        left: [
          props.lessonsFunction.metricInvocations({
            period: cdk.Duration.minutes(1),
          }),
          props.progressFunction.metricInvocations({
            period: cdk.Duration.minutes(1),
          }),
        ],
        width: 8,
      }),
      new cloudwatch.GraphWidget({
        title: 'Lambda Duration',
        left: [
          props.lessonsFunction.metricDuration({
            period: cdk.Duration.minutes(1),
            statistic: 'Average',
          }),
          props.progressFunction.metricDuration({
            period: cdk.Duration.minutes(1),
            statistic: 'Average',
          }),
        ],
        width: 8,
      }),
      new cloudwatch.GraphWidget({
        title: 'Lambda Errors + Throttles',
        left: [
          props.lessonsFunction.metricErrors({
            period: cdk.Duration.minutes(1),
          }),
          props.lessonsFunction.metricThrottles({
            period: cdk.Duration.minutes(1),
          }),
        ],
        width: 8,
      })
    );

    // Row 3: Database Health
    dashboard.addWidgets(
      new cloudwatch.GraphWidget({
        title: 'Aurora Serverless Capacity (ACUs)',
        left: [
          new cloudwatch.Metric({
            namespace: 'AWS/RDS',
            metricName: 'ServerlessDatabaseCapacity',
            dimensionsMap: {
              DBClusterIdentifier: props.auroraCluster.clusterIdentifier,
            },
            statistic: 'Average',
            period: cdk.Duration.minutes(1),
          }),
        ],
        width: 8,
      }),
      new cloudwatch.GraphWidget({
        title: 'Aurora Connections',
        left: [
          new cloudwatch.Metric({
            namespace: 'AWS/RDS',
            metricName: 'DatabaseConnections',
            dimensionsMap: {
              DBClusterIdentifier: props.auroraCluster.clusterIdentifier,
            },
            statistic: 'Sum',
            period: cdk.Duration.minutes(1),
          }),
        ],
        width: 8,
      }),
      new cloudwatch.GraphWidget({
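        title: 'Aurora Query Latency',
        left: [
          // SelectLatency is published by Aurora MySQL. On Aurora PostgreSQL
          // (implied by the pgvector query) this widget may stay empty; swap in
          // a latency metric your engine reports.
          new cloudwatch.Metric({
            namespace: 'AWS/RDS',
            metricName: 'SelectLatency',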
            dimensionsMap: {
              DBClusterIdentifier: props.auroraCluster.clusterIdentifier,
            },
            statistic: 'Average',
            period: cdk.Duration.minutes(1),
          }),
        ],
        width: 8,
      })
    );

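    // Row 4: Custom Business Metrics
    // Powertools publishes these metrics with `service` and `environment`
    // dimensions, and CloudWatch does not aggregate across dimensions, so each
    // widget metric must name them. Assumes NODE_ENV in the function matches
    // config.envName.
    const customDimensions = {
      service: 'lessons-api',
      environment: config.envName,
    };
    dashboard.addWidgets(
      new cloudwatch.GraphWidget({
        title: 'Cache Hit Rate',
        left: [
          new cloudwatch.Metric({
            namespace: 'KidsLearn',
            metricName: 'CacheHit',
            dimensionsMap: customDimensions,
            statistic: 'Sum',
            period: cdk.Duration.minutes(5),
          }),
          new cloudwatch.Metric({
            namespace: 'KidsLearn',
            metricName: 'CacheMiss',
            dimensionsMap: customDimensions,
            statistic: 'Sum',
            period: cdk.Duration.minutes(5),
          }),
        ],
        width: 12,
      }),
      new cloudwatch.GraphWidget({
        title: 'Lessons Completed per Hour',
        left: [
          new cloudwatch.Metric({
            namespace: 'KidsLearn',
            metricName: 'LessonCompleted',
            dimensionsMap: customDimensions,
            statistic: 'Sum',
            period: cdk.Duration.hours(1),
          }),
        ],
        width: 12,
      })
    );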
  }
}
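
To make the alarm settings concrete, the sketch below models in plain TypeScript how `threshold`, `evaluationPeriods`, and `treatMissingData: NOT_BREACHING` interact for the error alarm. It is a deliberately simplified model for illustration: the real CloudWatch evaluator has additional rules (for example, datapoints-to-alarm and a wider evaluation range when data is missing), and `wouldAlarm` is a hypothetical name.

```typescript
// Simplified model: one entry per period, oldest first; null = no datapoint.
type Datapoint = number | null;

function wouldAlarm(
  datapoints: Datapoint[],
  threshold: number,
  evaluationPeriods: number,
): boolean {
  if (datapoints.length < evaluationPeriods) return false;
  // Only the most recent `evaluationPeriods` periods are considered.
  const recent = datapoints.slice(-evaluationPeriods);
  // GREATER_THAN_THRESHOLD on every period; missing data counts as not breaching.
  return recent.every((value) => value !== null && value > threshold);
}

// Error counts per 5-minute period for the lessons function.
console.log(wouldAlarm([2, 14, 11], 10, 2));    // true: two consecutive breaches
console.log(wouldAlarm([14, 3, 12], 10, 2));    // false: only the latest period breaches
console.log(wouldAlarm([14, null, 12], 10, 2)); // false: the gap is treated as healthy
console.log(wouldAlarm([10, 10], 10, 2));       // false: 10 is not greater than 10
```

With a 5-minute period and `evaluationPeriods: 2`, the error alarm therefore needs roughly ten minutes of sustained errors before it notifies the SNS topic.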

X-Ray — Distributed Tracing

X-Ray traces a request across every service it touches. When a parent reports “my child’s lesson took 10 seconds to load,” I open X-Ray and see:

Request trace: GET /api/lessons/abc123
├── API Gateway: 12ms
├── Lambda Cold Start: 450ms
├── Lambda Handler: 847ms
│   ├── Redis GET (cache miss): 3ms
│   ├── RDS Proxy → Aurora: 340ms
│   │   └── SQL Query: 285ms (SELECT with pgvector)
│   ├── Redis SET (cache write): 2ms
│   └── Response serialization: 2ms
└── API Gateway Response: 5ms
Total: 1314ms

Without tracing, I’d be guessing. With X-Ray, I know the vector similarity query took 285ms and the Lambda cold start added 450ms. The fix is clear: add provisioned concurrency to eliminate cold starts, and optimize the pgvector index.
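
The same reading of the trace can be done programmatically. The sketch below models the trace above as a tree and computes each segment's self time (its duration minus the durations of its children), which is where unexplained latency tends to hide. The `Segment` type and `selfTimes` function are hypothetical; this is not the X-Ray API or its segment document format.

```typescript
interface Segment {
  name: string;
  durationMs: number;
  children?: Segment[];
}

interface SelfTime {
  name: string;
  selfMs: number;
}

// Durations copied from the trace shown above.
const trace: Segment[] = [
  { name: 'API Gateway', durationMs: 12 },
  { name: 'Lambda Cold Start', durationMs: 450 },
  {
    name: 'Lambda Handler',
    durationMs: 847,
    children: [
      { name: 'Redis GET', durationMs: 3 },
      {
        name: 'RDS Proxy -> Aurora',
        durationMs: 340,
        children: [{ name: 'SQL Query', durationMs: 285 }],
      },
      { name: 'Redis SET', durationMs: 2 },
      { name: 'Response serialization', durationMs: 2 },
    ],
  },
  { name: 'API Gateway Response', durationMs: 5 },
];

// Walk the tree and collect each segment's time not covered by its children.
function selfTimes(segments: Segment[], out: SelfTime[] = []): SelfTime[] {
  for (const segment of segments) {
    const children = segment.children || [];
    const childTotal = children.reduce((sum, c) => sum + c.durationMs, 0);
    out.push({ name: segment.name, selfMs: segment.durationMs - childTotal });
    selfTimes(children, out);
  }
  return out;
}

const totalMs = trace.reduce((sum, s) => sum + s.durationMs, 0);
const ranked = selfTimes(trace).sort((a, b) => b.selfMs - a.selfMs);

console.log(totalMs); // 1314
console.log(ranked.slice(0, 3).map((s) => s.name + ': ' + s.selfMs + 'ms'));
// [ 'Lambda Handler: 500ms', 'Lambda Cold Start: 450ms', 'SQL Query: 285ms' ]
```

Ranked this way, the cold start and the SQL query stand out as expected, and so does about 500ms of handler time that no child segment accounts for, which may be worth instrumenting with additional subsegments.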

Structured Logging

// Structured log output
{
  "level": "INFO",
  "message": "Lesson retrieved",
  "timestamp": "2026-02-18T09:14:23.456Z",
  "service": "lessons-api",
  "environment": "production",
  "xray_trace_id": "1-abc123-def456",
  "cold_start": false,
  "function_name": "kidslearn-lessons-production",
  "function_memory_size": 512,
  "lesson_id": "abc123",
  "cache_hit": false,
  "db_query_ms": 285,
  "total_ms": 847
}

CloudWatch Logs Insights lets us query these structured logs:

# Find slowest lesson requests in the last hour
fields @timestamp, lesson_id, total_ms, cache_hit, db_query_ms
| filter total_ms > 2000
| sort total_ms desc
| limit 20

# Cache hit rate over time
# Relies on the JSON boolean cache_hit being exposed as 1/0
stats count(*) as total,
  sum(cache_hit) as hits,
  sum(cache_hit) / count(*) * 100 as hit_rate
by bin(5m)
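
To sanity-check a query like this, the same aggregation can be reproduced locally over a handful of parsed log entries. This is a sketch under assumptions: `LogEntry` keeps only two fields from the log format above, the sample entries are made up for illustration, and `hitRateByBucket` is a hypothetical helper, not part of any AWS SDK.

```typescript
interface LogEntry {
  timestamp: string; // ISO 8601, as in the structured log output
  cache_hit: boolean;
}

interface Bucket {
  bucketStart: string;
  total: number;
  hits: number;
  hitRate: number; // percentage
}

function hitRateByBucket(entries: LogEntry[], bucketMinutes = 5): Bucket[] {
  const bucketMs = bucketMinutes * 60 * 1000;
  const counts = new Map<number, { total: number; hits: number }>();
  for (const entry of entries) {
    // Round the timestamp down to the start of its bucket, like bin(5m).
    const start = Math.floor(Date.parse(entry.timestamp) / bucketMs) * bucketMs;
    const bucket = counts.get(start) || { total: 0, hits: 0 };
    bucket.total += 1;
    if (entry.cache_hit) bucket.hits += 1;
    counts.set(start, bucket);
  }
  return Array.from(counts.entries())
    .sort((a, b) => a[0] - b[0])
    .map(([start, c]) => ({
      bucketStart: new Date(start).toISOString(),
      total: c.total,
      hits: c.hits,
      hitRate: (c.hits / c.total) * 100,
    }));
}

// Illustrative sample data, not real production logs.
const sample: LogEntry[] = [
  { timestamp: '2026-02-18T09:14:23.456Z', cache_hit: false },
  { timestamp: '2026-02-18T09:12:00.000Z', cache_hit: true },
  { timestamp: '2026-02-18T09:16:05.000Z', cache_hit: true },
];
console.log(hitRateByBucket(sample));
```

The first two entries fall in the 09:10 bucket (one hit out of two, a 50% hit rate) and the third in the 09:15 bucket (100%).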

The Bottom Line

Observability is what separates a production platform from a demo:

CloudWatch dashboards give us instant visibility into platform health
Custom metrics track business KPIs alongside technical metrics
Alarms detect issues before users report them
X-Ray traces requests across every service, reducing debugging from hours to minutes
Structured logging makes log analysis possible at scale
Cost dashboards prevent budget surprises

In Part 10, we cover the final piece: production readiness with multi-region deployment, disaster recovery, auto-scaling, and load testing.

See you in Part 10.

This is Part 9 of a 10-part series: AWS Full-Stack Mastery for Technical Leads.

Series outline:

Why AWS & Getting Started (Part 1)
Infrastructure as Code (CDK) (Part 2)
Frontend (Amplify + CloudFront) (Part 3)
Backend (API Gateway + Lambda + Fargate) (Part 4)
Database (Aurora + DynamoDB + ElastiCache) (Part 5)
AI/ML (Bedrock + SageMaker) (Part 6)
DevOps (CodePipeline + CodeBuild) (Part 7)
Security (IAM + Cognito + WAF) (Part 8)
Observability (CloudWatch + X-Ray) (this post)
Production (Multi-Region + DR) (Part 10)

