You can’t fix what you can’t see. Before we had observability, debugging production issues meant SSHing into servers and tailing log files while a parent was on the phone asking why their child’s lesson wasn’t loading. Now, when something goes wrong, I open the CloudWatch dashboard and within 30 seconds I know: the request hit API Gateway at 09:14:23, the Lambda function executed in 847ms, the database query took 340ms, and the Bedrock inference call timed out at 10 seconds because someone at AWS is having a worse day than I am.
Kids Learn’s observability stack covers three pillars: CloudWatch for metrics, logs, and alarms; X-Ray for distributed tracing across all services; and Cost Explorer for financial visibility. Together, they let us detect issues before users notice them, trace the exact cause in minutes, and make sure we’re not burning money on idle resources.
This is Part 9 of the AWS Full-Stack Mastery series.
CloudWatch — Metrics, Logs, and Alarms
Custom Metrics with Lambda Powertools
AWS Lambda Powertools provides a batteries-included approach to observability. We use three features: Logger for structured logging, Tracer for X-Ray integration, and Metrics for custom CloudWatch metrics.
// src/lambda/lessons/observability.ts
import { Logger } from '@aws-lambda-powertools/logger';
import { Tracer } from '@aws-lambda-powertools/tracer';
import { Metrics, MetricUnit } from '@aws-lambda-powertools/metrics';
// Initialize once per cold start
export const logger = new Logger({
serviceName: 'lessons-api',
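// process.env values are plain strings, so narrow to the log-level union the Logger expects
logLevel: (process.env.LOG_LEVEL || 'INFO') as 'DEBUG' | 'INFO' | 'WARN' | 'ERROR',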
persistentLogAttributes: {
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION || 'unknown',
},
});
export const tracer = new Tracer({
serviceName: 'lessons-api',
captureHTTPsRequests: true,
});
export const metrics = new Metrics({
namespace: 'KidsLearn',
serviceName: 'lessons-api',
defaultDimensions: {
environment: process.env.NODE_ENV || 'unknown',
},
});
// Custom metric helpers
export function recordLatency(operation: string, durationMs: number): void {
metrics.addMetric(`${operation}Latency`, MetricUnit.Milliseconds, durationMs);
}
export function recordCacheHit(hit: boolean): void {
metrics.addMetric('CacheHit', MetricUnit.Count, hit ? 1 : 0);
metrics.addMetric('CacheMiss', MetricUnit.Count, hit ? 0 : 1);
}
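export function recordError(errorType: string): void {
// Emit through a single-metric record so the ErrorType dimension applies
// only to this metric, not to every other metric flushed in the same invocation.
const errorMetric = metrics.singleMetric();
errorMetric.addDimension('ErrorType', errorType);
errorMetric.addMetric('Errors', MetricUnit.Count, 1);
}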
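Under the hood, Powertools Metrics writes metrics as CloudWatch Embedded Metric Format (EMF) JSON lines on stdout, and CloudWatch extracts the metric values from the logs. The sketch below builds such a record by hand to show that the dimensions in a record attach to every metric in that record. The helper name `buildEmfRecord`, the `getLessonLatency` metric name, and the sample values are illustrative assumptions, not part of Powertools.

```typescript
// Hypothetical helper: builds one Embedded Metric Format (EMF) record by hand.
// Powertools Metrics produces records of this general shape when it flushes.
type MetricDatum = { name: string; unit: string; value: number };

function buildEmfRecord(
  namespace: string,
  dimensions: Record<string, string>,
  metricData: MetricDatum[],
  timestamp: number = Date.now(),
): Record<string, unknown> {
  const record: Record<string, unknown> = {
    _aws: {
      Timestamp: timestamp, // epoch milliseconds
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          // One dimension set: every metric below is published with ALL of these keys.
          Dimensions: [Object.keys(dimensions)],
          Metrics: metricData.map((m) => ({ Name: m.name, Unit: m.unit })),
        },
      ],
    },
    // Dimension values and metric values live at the root of the record.
    ...dimensions,
  };
  for (const m of metricData) {
    record[m.name] = m.value;
  }
  return record;
}

const emfRecord = buildEmfRecord(
  'KidsLearn',
  { service: 'lessons-api', environment: 'production' },
  [
    { name: 'CacheHit', unit: 'Count', value: 1 },
    { name: 'getLessonLatency', unit: 'Milliseconds', value: 847 },
  ],
  1700000000000, // arbitrary fixed timestamp so the output is reproducible
);
console.log(JSON.stringify(emfRecord));
```

Because CloudWatch treats each distinct dimension combination as a separate metric, any widget or alarm that reads these metrics has to name the same dimension keys and values.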
CloudWatch Dashboard CDK
// lib/stacks/monitoring-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as actions from 'aws-cdk-lib/aws-cloudwatch-actions';
import { Construct } from 'constructs';
import { EnvironmentConfig } from '../config/environments';
export class MonitoringStack extends cdk.Stack {
constructor(scope: Construct, id: string, props: {
config: EnvironmentConfig;
lessonsFunction: cdk.aws_lambda.Function;
progressFunction: cdk.aws_lambda.Function;
auroraCluster: cdk.aws_rds.DatabaseCluster;
apiId: string;
} & cdk.StackProps) {
super(scope, id, props);
const { config } = props;
// =========================================
// Alarm Topic
// =========================================
const alarmTopic = new sns.Topic(this, 'AlarmTopic', {
topicName: `kidslearn-alarms-${config.envName}`,
});
// =========================================
// Lambda Alarms
// =========================================
// High error rate alarm
const lessonsErrorAlarm = new cloudwatch.Alarm(this, 'LessonsErrorRate', {
alarmName: `${config.envName}-lessons-error-rate`,
metric: props.lessonsFunction.metricErrors({
period: cdk.Duration.minutes(5),
statistic: 'Sum',
}),
threshold: 10,
evaluationPeriods: 2,
comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
alarmDescription: 'Lessons API had more than 10 errors per 5-minute period for 2 consecutive periods',
});
lessonsErrorAlarm.addAlarmAction(new actions.SnsAction(alarmTopic));
// High latency alarm
const lessonsLatencyAlarm = new cloudwatch.Alarm(this, 'LessonsLatency', {
alarmName: `${config.envName}-lessons-p99-latency`,
metric: props.lessonsFunction.metricDuration({
period: cdk.Duration.minutes(5),
statistic: 'p99',
}),
threshold: 5000, // 5 seconds
evaluationPeriods: 3,
comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
alarmDescription: 'Lessons API p99 latency exceeds 5 seconds for 3 consecutive 5-minute periods',
});
lessonsLatencyAlarm.addAlarmAction(new actions.SnsAction(alarmTopic));
// =========================================
// Aurora Alarms
// =========================================
const dbCpuAlarm = new cloudwatch.Alarm(this, 'AuroraCPU', {
alarmName: `${config.envName}-aurora-cpu`,
metric: new cloudwatch.Metric({
namespace: 'AWS/RDS',
metricName: 'CPUUtilization',
dimensionsMap: {
DBClusterIdentifier: props.auroraCluster.clusterIdentifier,
},
statistic: 'Average',
period: cdk.Duration.minutes(5),
}),
threshold: 80,
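evaluationPeriods: 3,
// Set explicitly: the CDK default is GREATER_THAN_OR_EQUAL_TO_THRESHOLD
comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
alarmDescription: 'Aurora CPU utilization exceeds 80% for 15 minutes',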
});
dbCpuAlarm.addAlarmAction(new actions.SnsAction(alarmTopic));
// =========================================
// Dashboard
// =========================================
const dashboard = new cloudwatch.Dashboard(this, 'KidsLearnDashboard', {
dashboardName: `KidsLearn-${config.envName}`,
periodOverride: cloudwatch.PeriodOverride.AUTO,
});
// Row 1: API Health
dashboard.addWidgets(
new cloudwatch.GraphWidget({
title: 'API Requests',
left: [
new cloudwatch.Metric({
namespace: 'AWS/ApiGateway',
metricName: 'Count',
dimensionsMap: { ApiId: props.apiId },
statistic: 'Sum',
period: cdk.Duration.minutes(1),
}),
],
width: 8,
}),
new cloudwatch.GraphWidget({
title: 'API Latency (p50, p90, p99)',
left: [
new cloudwatch.Metric({
namespace: 'AWS/ApiGateway',
metricName: 'Latency',
dimensionsMap: { ApiId: props.apiId },
statistic: 'p50',
period: cdk.Duration.minutes(1),
}),
new cloudwatch.Metric({
namespace: 'AWS/ApiGateway',
metricName: 'Latency',
dimensionsMap: { ApiId: props.apiId },
statistic: 'p90',
period: cdk.Duration.minutes(1),
}),
new cloudwatch.Metric({
namespace: 'AWS/ApiGateway',
metricName: 'Latency',
dimensionsMap: { ApiId: props.apiId },
statistic: 'p99',
period: cdk.Duration.minutes(1),
}),
],
width: 8,
}),
new cloudwatch.GraphWidget({
title: 'API Errors (4xx, 5xx)',
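left: [
// The ApiId dimension belongs to HTTP APIs, whose error metrics are named
// '4xx' and '5xx'. (REST APIs use '4XXError'/'5XXError' with the ApiName dimension.)
// Assumption: this API is an HTTP API, as the ApiId dimension suggests.
new cloudwatch.Metric({
namespace: 'AWS/ApiGateway',
metricName: '4xx',
dimensionsMap: { ApiId: props.apiId },
statistic: 'Sum',
period: cdk.Duration.minutes(1),
}),
new cloudwatch.Metric({
namespace: 'AWS/ApiGateway',
metricName: '5xx',
dimensionsMap: { ApiId: props.apiId },
statistic: 'Sum',
period: cdk.Duration.minutes(1),
}),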
],
width: 8,
})
);
// Row 2: Lambda Performance
dashboard.addWidgets(
new cloudwatch.GraphWidget({
title: 'Lambda Invocations',
left: [
props.lessonsFunction.metricInvocations({
period: cdk.Duration.minutes(1),
}),
props.progressFunction.metricInvocations({
period: cdk.Duration.minutes(1),
}),
],
width: 8,
}),
new cloudwatch.GraphWidget({
title: 'Lambda Duration',
left: [
props.lessonsFunction.metricDuration({
period: cdk.Duration.minutes(1),
statistic: 'Average',
}),
props.progressFunction.metricDuration({
period: cdk.Duration.minutes(1),
statistic: 'Average',
}),
],
width: 8,
}),
new cloudwatch.GraphWidget({
title: 'Lambda Errors + Throttles',
left: [
props.lessonsFunction.metricErrors({
period: cdk.Duration.minutes(1),
}),
props.lessonsFunction.metricThrottles({
period: cdk.Duration.minutes(1),
}),
],
width: 8,
})
);
// Row 3: Database Health
dashboard.addWidgets(
new cloudwatch.GraphWidget({
title: 'Aurora Serverless Capacity (ACUs)',
left: [
new cloudwatch.Metric({
namespace: 'AWS/RDS',
metricName: 'ServerlessDatabaseCapacity',
dimensionsMap: {
DBClusterIdentifier: props.auroraCluster.clusterIdentifier,
},
statistic: 'Average',
period: cdk.Duration.minutes(1),
}),
],
width: 8,
}),
new cloudwatch.GraphWidget({
title: 'Aurora Connections',
left: [
new cloudwatch.Metric({
namespace: 'AWS/RDS',
metricName: 'DatabaseConnections',
dimensionsMap: {
DBClusterIdentifier: props.auroraCluster.clusterIdentifier,
},
statistic: 'Sum',
period: cdk.Duration.minutes(1),
}),
],
width: 8,
}),
new cloudwatch.GraphWidget({
title: 'Aurora Query Latency',
left: [
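new cloudwatch.Metric({
namespace: 'AWS/RDS',
// Note: SelectLatency is published for Aurora MySQL. On Aurora PostgreSQL
// (which the pgvector queries imply) this graph may stay empty.
metricName: 'SelectLatency',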
dimensionsMap: {
DBClusterIdentifier: props.auroraCluster.clusterIdentifier,
},
statistic: 'Average',
period: cdk.Duration.minutes(1),
}),
],
width: 8,
})
);
// Row 4: Custom Business Metrics
dashboard.addWidgets(
new cloudwatch.GraphWidget({
title: 'Cache Hit Rate',
left: [
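// Powertools publishes these metrics with `service` and `environment`
// dimensions, and CloudWatch only matches an exact dimension set.
// Assumption: config.envName equals the NODE_ENV the function runs with.
new cloudwatch.Metric({
namespace: 'KidsLearn',
metricName: 'CacheHit',
dimensionsMap: { service: 'lessons-api', environment: config.envName },
statistic: 'Sum',
period: cdk.Duration.minutes(5),
}),
new cloudwatch.Metric({
namespace: 'KidsLearn',
metricName: 'CacheMiss',
dimensionsMap: { service: 'lessons-api', environment: config.envName },
statistic: 'Sum',
period: cdk.Duration.minutes(5),
}),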
],
width: 12,
}),
new cloudwatch.GraphWidget({
title: 'Lessons Completed per Hour',
left: [
new cloudwatch.Metric({
namespace: 'KidsLearn',
metricName: 'LessonCompleted',
statistic: 'Sum',
period: cdk.Duration.hours(1),
}),
],
width: 12,
})
);
}
}
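To make the alarm settings in this stack concrete, here is a minimal sketch of how `threshold`, `evaluationPeriods`, and `treatMissingData` interact. It is a simplified model written for illustration: the function `evaluateAlarm` and its types are hypothetical, and real CloudWatch evaluation has more nuance (for example, datapoints-to-alarm and an INSUFFICIENT_DATA state). It does show why one bad 5-minute period alone does not fire an alarm with `evaluationPeriods: 2`.

```typescript
type MissingPolicy = 'notBreaching' | 'breaching';
type AlarmState = 'OK' | 'ALARM';

// Simplified model: ALARM only if each of the most recent `evaluationPeriods`
// datapoints is strictly greater than the threshold.
// `null` stands for a period with no data (for example, zero invocations).
function evaluateAlarm(
  datapoints: Array<number | null>,
  threshold: number,
  evaluationPeriods: number,
  missing: MissingPolicy = 'notBreaching',
): AlarmState {
  const recent = datapoints.slice(-evaluationPeriods);
  if (recent.length < evaluationPeriods) {
    return 'OK'; // not enough history yet in this simplified model
  }
  const allBreaching = recent.every((value) =>
    value === null ? missing === 'breaching' : value > threshold,
  );
  return allBreaching ? 'ALARM' : 'OK';
}

// Error counts per 5-minute period, oldest first (made-up sample data).
const oneSpike = evaluateAlarm([0, 2, 14, 3], 10, 2);
const sustained = evaluateAlarm([0, 2, 14, 22], 10, 2);
const quietPeriod = evaluateAlarm([14, null], 10, 2);
console.log({ oneSpike, sustained, quietPeriod });
```

In this model a single spike stays `OK`, two breaching periods in a row go to `ALARM`, and a missing datapoint treated as not breaching keeps a quiet function from paging anyone.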
X-Ray — Distributed Tracing
X-Ray traces a request across every service it touches. When a parent reports “my child’s lesson took 10 seconds to load,” I open X-Ray and see:
Request trace: GET /api/lessons/abc123
├── API Gateway: 12ms
├── Lambda Cold Start: 450ms
├── Lambda Handler: 847ms
│ ├── Redis GET (cache miss): 3ms
│ ├── RDS Proxy → Aurora: 340ms
│ │ └── SQL Query: 285ms (SELECT with pgvector)
│ ├── Redis SET (cache write): 2ms
│ └── Response serialization: 2ms
└── API Gateway Response: 5ms
Total: 1314ms
Without tracing, I’d be guessing. With X-Ray, I know the vector similarity query took 285ms and the Lambda cold start added 450ms. The fix is clear: add provisioned concurrency to eliminate cold starts, and optimize the pgvector index.
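The same reading of a trace can be done programmatically. The sketch below models the example trace as a tree and computes each segment's self time (its duration minus its direct children's durations), then ranks the segments. The `Segment` type and the `selfTimes` helper are hypothetical and unrelated to the X-Ray SDK; the durations are copied from the example trace.

```typescript
interface Segment {
  name: string;
  ms: number;
  children?: Segment[];
}

const trace: Segment[] = [
  { name: 'API Gateway', ms: 12 },
  { name: 'Lambda Cold Start', ms: 450 },
  {
    name: 'Lambda Handler',
    ms: 847,
    children: [
      { name: 'Redis GET (cache miss)', ms: 3 },
      { name: 'RDS Proxy -> Aurora', ms: 340, children: [{ name: 'SQL Query', ms: 285 }] },
      { name: 'Redis SET (cache write)', ms: 2 },
      { name: 'Response serialization', ms: 2 },
    ],
  },
  { name: 'API Gateway Response', ms: 5 },
];

// Self time = time spent in the segment itself, excluding its direct children.
function selfTimes(segments: Segment[]): Array<{ name: string; selfMs: number }> {
  const result: Array<{ name: string; selfMs: number }> = [];
  for (const segment of segments) {
    const kids = segment.children || [];
    const childMs = kids.reduce((sum, child) => sum + child.ms, 0);
    result.push({ name: segment.name, selfMs: segment.ms - childMs });
    result.push(...selfTimes(kids));
  }
  return result;
}

const totalMs = trace.reduce((sum, segment) => sum + segment.ms, 0);
const ranked = selfTimes(trace).sort((a, b) => b.selfMs - a.selfMs);
console.log(totalMs, ranked.slice(0, 3));
```

On this trace the top three are the handler's own time (500ms not attributed to any child segment), the cold start (450ms), and the SQL query (285ms). Unattributed handler time like this is usually a hint that a custom subsegment is missing.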
Structured Logging
// Structured log output
{
"level": "INFO",
"message": "Lesson retrieved",
"timestamp": "2026-02-18T09:14:23.456Z",
"service": "lessons-api",
"environment": "production",
"xray_trace_id": "1-abc123-def456",
"cold_start": false,
"function_name": "kidslearn-lessons-production",
"function_memory_size": 512,
"lesson_id": "abc123",
"cache_hit": false,
"db_query_ms": 285,
"total_ms": 847
}
CloudWatch Logs Insights lets us query these structured logs:
# Find slowest lesson requests in the last hour
fields @timestamp, lesson_id, total_ms, cache_hit, db_query_ms
| filter total_ms > 2000
| sort total_ms desc
| limit 20
# Cache hit rate over time
# Logs Insights has no CASE expression; this assumes the JSON boolean cache_hit is surfaced as 1/0
filter ispresent(cache_hit)
| stats count(*) as total,
sum(cache_hit) as hits,
(sum(cache_hit) / count(*)) * 100 as hit_rate
by bin(5m)
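For intuition, the hit-rate aggregation can be reproduced locally over a handful of structured log lines. This is a sketch for illustration only: the `LogLine` type, the `hitRateByBucket` helper, and the sample entries are hypothetical, and only the `timestamp` and `cache_hit` field names come from the log format shown earlier.

```typescript
interface LogLine {
  timestamp: string; // ISO 8601, as in the structured log output
  cache_hit: boolean;
}

// Groups log lines into fixed-size time buckets (aligned to the epoch)
// and computes the cache hit rate in percent for each bucket.
function hitRateByBucket(lines: LogLine[], bucketMinutes: number = 5): Map<string, number> {
  const bucketMs = bucketMinutes * 60000;
  const counts = new Map<number, { total: number; hits: number }>();
  for (const line of lines) {
    const bucket = Math.floor(Date.parse(line.timestamp) / bucketMs) * bucketMs;
    const entry = counts.get(bucket) || { total: 0, hits: 0 };
    entry.total += 1;
    if (line.cache_hit) {
      entry.hits += 1;
    }
    counts.set(bucket, entry);
  }
  const rates = new Map<string, number>();
  counts.forEach((entry, bucket) => {
    rates.set(new Date(bucket).toISOString(), (entry.hits / entry.total) * 100);
  });
  return rates;
}

// Made-up sample data: three requests in one bucket, one in the next.
const sampleLines: LogLine[] = [
  { timestamp: '2026-02-18T09:14:23.456Z', cache_hit: false },
  { timestamp: '2026-02-18T09:12:01.000Z', cache_hit: true },
  { timestamp: '2026-02-18T09:13:30.000Z', cache_hit: true },
  { timestamp: '2026-02-18T09:16:05.000Z', cache_hit: true },
];
const rates = hitRateByBucket(sampleLines);
rates.forEach((rate, bucketStart) => console.log(bucketStart, rate.toFixed(1)));
```

The bucket key is the start of each 5-minute window, which mirrors what `bin(5m)` groups by in the query.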
The Bottom Line
Observability is what separates a production platform from a demo:
- CloudWatch dashboards give us instant visibility into platform health
- Custom metrics track business KPIs alongside technical metrics
- Alarms detect issues before users report them
- X-Ray traces requests across every service, reducing debugging from hours to minutes
- Structured logging makes log analysis possible at scale
- Cost dashboards prevent budget surprises
In Part 10, we cover the final piece: production readiness with multi-region deployment, disaster recovery, auto-scaling, and load testing.
See you in Part 10.
This is Part 9 of a 10-part series: AWS Full-Stack Mastery for Technical Leads.
Series outline:
- Why AWS & Getting Started (Part 1)
- Infrastructure as Code (CDK) (Part 2)
- Frontend (Amplify + CloudFront) (Part 3)
- Backend (API Gateway + Lambda + Fargate) (Part 4)
- Database (Aurora + DynamoDB + ElastiCache) (Part 5)
- AI/ML (Bedrock + SageMaker) (Part 6)
- DevOps (CodePipeline + CodeBuild) (Part 7)
- Security (IAM + Cognito + WAF) (Part 8)
- Observability (CloudWatch + X-Ray) (this post)
- Production (Multi-Region + DR) (Part 10)
References
- CloudWatch User Guide — Metrics, logs, alarms, and dashboards.
- Lambda Powertools Logger — Structured logging.
- Lambda Powertools Metrics — Custom CloudWatch metrics.
- AWS X-Ray Developer Guide — Distributed tracing configuration.
- CloudWatch Logs Insights — Querying structured logs.
- CloudWatch Anomaly Detection — ML-based anomaly detection.
- CloudWatch Dashboard CDK — Dashboard constructs.
- AWS Cost Explorer — Spend analysis and forecasting.
- CloudWatch Alarms Best Practices — Effective alerting.
- X-Ray SDK for Node.js — Application instrumentation.