Introduction

Your multi-agent system works in development. It passes all eval tests. Now you need to run it at 5,000 queries per day with predictable latency, controlled costs, and zero downtime.

This final deep dive covers the production scaling layer: Step Functions for orchestration, Lambda auto-scaling, Bedrock provisioned throughput, cost optimization, and operational runbooks.


Table of Contents

  1. Scaling Challenges for Agent Systems
  2. AWS Step Functions Orchestration
  3. Lambda Configuration for Agents
  4. Bedrock Provisioned Throughput
  5. Auto-Scaling Patterns
  6. Cost Optimization Strategies
  7. Human-in-the-Loop at Scale
  8. Multi-Region Architecture
  9. Operational Runbooks
  10. Production Checklist

1. Scaling Challenges for Agent Systems

Agent systems have unique scaling characteristics:

Challenge               | Root Cause                                | Solution
Variable latency        | LLM response time varies 2-30s            | Async execution with Step Functions
Token throughput limits | Bedrock has per-model TPM limits          | Provisioned throughput or multi-region
Cost scaling            | More queries = linearly more LLM cost     | Model tier optimization (Haiku for workers)
State explosion         | Checkpoints grow with conversation length | S3 offloading, TTL cleanup
Cold starts             | Lambda cold start adds 1-3s               | Provisioned concurrency
Cascading failures      | One slow agent blocks the pipeline        | Timeouts, circuit breakers

Key principle: Scale the orchestration layer independently from the LLM layer. Step Functions handles flow control; Bedrock handles inference. Each scales differently.


2. AWS Step Functions Orchestration

2.1 Why Step Functions

LangGraph runs the agent graph in a single process. This works for development but limits production scalability. Step Functions provides:

  • Parallel execution: Fan out to multiple workers simultaneously
  • Built-in retries: Automatic retry with exponential backoff
  • Human-in-the-loop: Native task token support for approvals
  • Timeout handling: Per-step and total execution timeouts
  • Visual debugging: Execution history in AWS Console
  • Cost model: Pay per state transition (USD 0.025 per 1,000 transitions)

2.2 State Machine Design

# Step Functions state machine for agent pipeline:
#
# Start -> BrainPlan
#
# BrainPlan:
#   Type: Task
#   Resource: Lambda (brain-plan-function)
#   Timeout: 60 seconds
#   Retry: 2 attempts with backoff
#   Output: execution_plan, workers_needed
#
# BrainPlan -> ParallelWorkers
#
# ParallelWorkers:
#   Type: Parallel
#   Branches:
#     - SQLWriter (if needed)
#     - DataProcessor (if needed)
#     - WeightApplier (if needed)
#   Each branch:
#     Type: Task
#     Resource: Lambda (worker-function)
#     Timeout: 120 seconds
#     Retry: 1 attempt
#
# ParallelWorkers -> BrainSynthesise
#
# BrainSynthesise:
#   Type: Task
#   Resource: Lambda (brain-synthesise-function)
#   Timeout: 60 seconds
#   Output: final_answer, confidence
#
# BrainSynthesise -> ConfidenceCheck
#
# ConfidenceCheck:
#   Type: Choice
#   If confidence >= 0.8: -> Formatter
#   If confidence >= 0.5: -> HumanReview
#   Otherwise: -> ErrorHandler
#
# Formatter:
#   Type: Task
#   Resource: Lambda (formatter-function)
#   Timeout: 30 seconds
#   Next: End
#
# HumanReview:
#   Type: Task
#   Resource: Activity (human-review-activity)
#   Timeout: 3600 seconds (1 hour)
#   HeartbeatSeconds: 300
#   Next: End

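The outline above can be written down as an Amazon States Language definition. Here is a minimal sketch as a Python dict, ready to serialize with `json.dumps` and pass to Step Functions; the account ID, region, and Lambda/activity ARNs are placeholders, and the error-handling branch is reduced to a simple Fail state:

```python
ACCOUNT = "123456789012"  # hypothetical account id


def arn(name: str) -> str:
    # Placeholder Lambda ARN builder for the sketch.
    return f"arn:aws:lambda:us-east-1:{ACCOUNT}:function:{name}"


state_machine = {
    "StartAt": "BrainPlan",
    "States": {
        "BrainPlan": {
            "Type": "Task",
            "Resource": arn("brain-plan-function"),
            "TimeoutSeconds": 60,
            # Automatic retry with exponential backoff, as in the outline.
            "Retry": [{"ErrorEquals": ["States.ALL"],
                       "MaxAttempts": 2, "BackoffRate": 2.0}],
            "Next": "ParallelWorkers",
        },
        "ParallelWorkers": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "SQLWriter", "States": {
                    "SQLWriter": {"Type": "Task",
                                  "Resource": arn("worker-function"),
                                  "TimeoutSeconds": 120, "End": True}}},
                {"StartAt": "WeightApplier", "States": {
                    "WeightApplier": {"Type": "Task",
                                      "Resource": arn("worker-function"),
                                      "TimeoutSeconds": 120, "End": True}}},
            ],
            "Next": "BrainSynthesise",
        },
        "BrainSynthesise": {
            "Type": "Task",
            "Resource": arn("brain-synthesise-function"),
            "TimeoutSeconds": 60,
            "Next": "ConfidenceCheck",
        },
        "ConfidenceCheck": {
            "Type": "Choice",
            # Choices are evaluated in order, so >= 0.8 wins before >= 0.5.
            "Choices": [
                {"Variable": "$.confidence",
                 "NumericGreaterThanEquals": 0.8, "Next": "Formatter"},
                {"Variable": "$.confidence",
                 "NumericGreaterThanEquals": 0.5, "Next": "HumanReview"},
            ],
            "Default": "ErrorHandler",
        },
        "Formatter": {"Type": "Task", "Resource": arn("formatter-function"),
                      "TimeoutSeconds": 30, "End": True},
        "HumanReview": {
            "Type": "Task",
            "Resource": f"arn:aws:states:us-east-1:{ACCOUNT}"
                        ":activity:human-review-activity",
            "TimeoutSeconds": 3600, "HeartbeatSeconds": 300, "End": True},
        "ErrorHandler": {"Type": "Fail", "Error": "LowConfidence",
                         "Cause": "Confidence below 0.5"},
    },
}
```

The Choice state's ordered evaluation is what implements the confidence thresholds: the first matching rule wins, and anything below 0.5 falls through to the Default.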
2.3 Parallel Fan-Out

# Step Functions parallel execution:
#
# Without parallelism (sequential):
#   BrainPlan (5s) -> SQLWriter (10s) -> WeightApplier (8s)
#   -> BrainSynthesise (5s) -> Formatter (3s)
#   Total: 31 seconds
#
# With parallelism (Step Functions):
#   BrainPlan (5s) -> [SQLWriter (10s) | WeightApplier (8s)] parallel
#   -> BrainSynthesise (5s) -> Formatter (3s)
#   Total: 23 seconds (26% faster)
#
# For complex queries with 4+ workers:
#   Sequential: 40-60 seconds
#   Parallel: 20-30 seconds (50% faster)

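The arithmetic above is worth making explicit: the parallel branch costs only as much as its slowest worker. A back-of-envelope check with the same step timings:

```python
# Per-step latencies (seconds) from the example above.
steps = {"BrainPlan": 5, "SQLWriter": 10, "WeightApplier": 8,
         "BrainSynthesise": 5, "Formatter": 3}

# Sequential: every step runs in series.
sequential = sum(steps.values())

# Parallel: the two workers fan out, so their cost is max(), not sum().
parallel = (steps["BrainPlan"]
            + max(steps["SQLWriter"], steps["WeightApplier"])
            + steps["BrainSynthesise"] + steps["Formatter"])

speedup = 1 - parallel / sequential
print(f"sequential={sequential}s parallel={parallel}s ({speedup:.0%} faster)")
```

With more workers of similar duration, the sequential total grows linearly while the parallel total stays pinned to the slowest branch, which is where the 50% figure for 4+ workers comes from.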
3. Lambda Configuration for Agents

3.1 Memory and Timeout Settings

Function         | Memory | Timeout | Concurrency | Purpose
brain-plan       | 512 MB | 60s     | 50          | Orchestration planning
sql-writer       | 256 MB | 120s    | 100         | SQL generation and execution
weight-applier   | 512 MB | 120s    | 50          | Data processing (pandas)
formatter        | 256 MB | 30s     | 100         | Response formatting
brain-synthesise | 512 MB | 60s     | 50          | Final answer synthesis

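The table above maps directly onto two Lambda API calls per function. A sketch that builds the keyword payloads for `update_function_configuration` (memory, timeout) and `put_function_concurrency` (reserved concurrency) — function names follow the table, and actually applying them via `boto3.client("lambda")` is left to your deployment tooling:

```python
# (memory MB, timeout s, reserved concurrency) per function, from the table.
LAMBDA_CONFIG = {
    "brain-plan":       (512, 60, 50),
    "sql-writer":       (256, 120, 100),
    "weight-applier":   (512, 120, 50),
    "formatter":        (256, 30, 100),
    "brain-synthesise": (512, 60, 50),
}


def function_requests(name: str) -> tuple[dict, dict]:
    memory_mb, timeout_s, concurrency = LAMBDA_CONFIG[name]
    # Payload for lambda.update_function_configuration(**update).
    update = {"FunctionName": name, "MemorySize": memory_mb,
              "Timeout": timeout_s}
    # Payload for lambda.put_function_concurrency(**reserve); reserved
    # concurrency caps how far each function can scale out.
    reserve = {"FunctionName": name,
               "ReservedConcurrentExecutions": concurrency}
    return update, reserve


update, reserve = function_requests("weight-applier")
```

Keeping the settings in one dict like this makes the table the single source of truth, so drift between documentation and deployed configuration is easy to catch in review.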
3.2 Cold Start Mitigation

# Provisioned concurrency configuration:
#
# brain-plan: 5 instances (always warm)
# sql-writer: 10 instances (high volume)
# weight-applier: 5 instances
# formatter: 10 instances
# brain-synthesise: 5 instances
#
# Cost: ~USD 150/month for 35 provisioned instances
# Benefit: Eliminates 1-3s cold start for 95% of invocations
#
# Schedule-based scaling:
# - Business hours (9am-6pm): Full provisioned concurrency
# - Off-hours (6pm-9am): Reduce to 2 instances each
# - Weekends: Reduce to 1 instance each
# Savings: ~40% vs always-on provisioned concurrency

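The schedule-based scaling above can be driven by Application Auto Scaling scheduled actions. A sketch that builds one scale-up and one scale-down action per function as keyword payloads for `boto3.client("application-autoscaling").put_scheduled_action(...)`; the alias name, cron times (UTC), and off-hours floor of 2 are assumptions from the plan above:

```python
# Business-hours warm-instance counts from the plan above.
BUSINESS_HOURS = {"brain-plan": 5, "sql-writer": 10, "weight-applier": 5,
                  "formatter": 10, "brain-synthesise": 5}


def scheduled_action(function: str, alias: str, capacity: int,
                     name: str, cron: str) -> dict:
    # Provisioned concurrency is scaled per function *alias*, hence the
    # "function:name:alias" resource id.
    return {
        "ServiceNamespace": "lambda",
        "ScheduledActionName": name,
        "ResourceId": f"function:{function}:{alias}",
        "ScalableDimension": "lambda:function:ProvisionedConcurrency",
        "Schedule": f"cron({cron})",
        "ScalableTargetAction": {"MinCapacity": capacity,
                                 "MaxCapacity": capacity},
    }


actions = []
for fn, warm in BUSINESS_HOURS.items():
    # Warm up for business hours, drop to 2 instances overnight
    # (hypothetical 9am/6pm UTC weekday schedule).
    actions.append(scheduled_action(fn, "live", warm, f"{fn}-scale-up",
                                    "0 9 ? * MON-FRI *"))
    actions.append(scheduled_action(fn, "live", 2, f"{fn}-scale-down",
                                    "0 18 ? * MON-FRI *"))
```

A weekend schedule dropping to 1 instance would be two more actions per function in the same shape.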
3.3 Lambda Layers for Dependencies

# Lambda layers to reduce deployment size:
#
# Layer 1: langchain-core (shared across all functions)
#   - langchain-core
#   - langgraph
#   - langgraph-checkpoint-aws
#   Size: ~50 MB
#
# Layer 2: data-processing (for weight-applier only)
#   - pandas
#   - numpy
#   Size: ~80 MB
#
# Layer 3: bedrock-runtime (shared across all functions)
#   - boto3 (latest)
#   - botocore (latest)
#   Size: ~30 MB
#
# Total layer size: 160 MB (within 250 MB Lambda limit)
# Benefit: Each function deployment is only 1-5 MB of custom code

4. Bedrock Provisioned Throughput

4.1 When to Use Provisioned Throughput

Workload   | Queries/Day  | Recommended                | Monthly Cost
POC        | Under 100    | On-demand                  | Pay per token
Production | 1,000-10,000 | Provisioned Throughput     | USD 1,000-5,000
Enterprise | Above 10,000 | Provisioned + Cross-region | USD 5,000+

4.2 Capacity Planning

# Bedrock token throughput calculation:
#
# Query profile:
#   Brain (Sonnet): 2 calls * 2,500 tokens = 5,000 tokens
#   Workers (Haiku): 3 calls * 1,800 tokens = 5,400 tokens
#   Total per query: ~10,400 tokens
#
# Daily throughput:
#   5,000 queries * 10,400 tokens = 52,000,000 tokens/day
#   = 2,167,000 tokens/hour
#   = 36,100 tokens/minute (average)
#
# Peak throughput (4x average during business hours):
#   144,400 tokens/minute
#
# Provisioned throughput needed:
#   Sonnet: 1 model unit (covers ~100K tokens/minute)
#   Haiku: 1 model unit (covers ~200K tokens/minute)

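The capacity arithmetic above, as a small script you can rerun when the query profile changes (the 4x peak factor is the assumption stated above):

```python
# Per-query token profile from the capacity plan above.
SONNET_TOKENS = 2 * 2_500        # brain: plan + synthesise calls
HAIKU_TOKENS = 3 * 1_800         # three worker calls
per_query = SONNET_TOKENS + HAIKU_TOKENS        # ~10,400 tokens

QUERIES_PER_DAY = 5_000
daily = QUERIES_PER_DAY * per_query             # 52,000,000 tokens/day
per_minute = daily / (24 * 60)                  # ~36,100 tokens/min average
peak = 4 * per_minute                           # ~144,400 tokens/min at peak

print(f"{daily:,} tokens/day, {per_minute:,.0f}/min avg, {peak:,.0f}/min peak")
```

Comparing the peak figure against the per-model-unit throughput (here ~100K tokens/minute for Sonnet, ~200K for Haiku) gives the model-unit count; note the peak is what you size for, not the average.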
4.3 Cross-Region Inference

# Bedrock cross-region inference for higher throughput:
#
# Primary region: us-east-1
# Secondary regions: us-west-2, eu-west-1
#
# Configuration:
#   inference_profile_id = "us.anthropic.claude-sonnet-v2"
#   # The "us." prefix enables cross-region routing
#   # Bedrock automatically routes to the least-loaded region
#
# Benefits:
#   - 3x token throughput (3 regions)
#   - Automatic failover if one region is throttled
#   - No code changes needed (just change model ID prefix)
#
# Note: Data may cross region boundaries
#   - Ensure compliance with data residency requirements
#   - Use region-specific profiles for regulated workloads

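Since routing is controlled entirely by the model ID prefix, the data-residency concern above reduces to picking the right profile per workload. A sketch building the keyword payload for `boto3.client("bedrock-runtime").converse(...)`; the profile IDs echo the illustrative ID above and are assumptions — check the inference profiles actually enabled in your account:

```python
# Geography-specific inference profiles (illustrative IDs).
REGION_PROFILES = {
    "us": "us.anthropic.claude-sonnet-v2",   # routes across US regions
    "eu": "eu.anthropic.claude-sonnet-v2",   # keeps data within EU regions
}


def converse_request(question: str, geography: str = "us") -> dict:
    # Keyword payload for bedrock-runtime's converse() call. Only the
    # modelId changes between single-region and cross-region inference.
    return {
        "modelId": REGION_PROFILES[geography],
        "messages": [{"role": "user", "content": [{"text": question}]}],
    }


# A regulated EU workload picks the EU profile so data never crosses
# the residency boundary.
req = converse_request("Total sample size by region?", geography="eu")
```

This is the "no code changes" property in practice: the routing decision lives in configuration, not in call sites.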
5. Auto-Scaling Patterns

5.1 Lambda Concurrency Scaling

# Auto-scaling based on query volume:
#
# Metric: IncomingQueryCount (custom CloudWatch metric)
#
# Scale-up rules:
#   If IncomingQueryCount > 50/minute for 3 minutes:
#     Increase provisioned concurrency by 50%
#   If IncomingQueryCount > 100/minute for 3 minutes:
#     Increase provisioned concurrency by 100%
#
# Scale-down rules:
#   If IncomingQueryCount is under 20/minute for 10 minutes:
#     Decrease provisioned concurrency by 25%
#   Minimum: 2 instances per function
#
# Application Auto Scaling target:
#   Resource: Lambda function provisioned concurrency
#   Min capacity: 2
#   Max capacity: 100
#   Target tracking: 70% utilization

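The Application Auto Scaling target above takes two calls: register the scalable target (min/max capacity), then attach a target-tracking policy against the provisioned-concurrency utilization metric. A sketch building both keyword payloads for `boto3.client("application-autoscaling")` — `register_scalable_target(...)` and `put_scaling_policy(...)`; the alias name is an assumption:

```python
def scaling_config(function: str, alias: str = "live") -> tuple[dict, dict]:
    resource_id = f"function:{function}:{alias}"
    # Payload for register_scalable_target: the 2..100 capacity bounds.
    target = {
        "ServiceNamespace": "lambda",
        "ResourceId": resource_id,
        "ScalableDimension": "lambda:function:ProvisionedConcurrency",
        "MinCapacity": 2,
        "MaxCapacity": 100,
    }
    # Payload for put_scaling_policy: hold utilization near 70%
    # (this predefined metric takes a fraction, not a percentage).
    policy = {
        "ServiceNamespace": "lambda",
        "PolicyName": f"{function}-target-tracking",
        "ResourceId": resource_id,
        "ScalableDimension": "lambda:function:ProvisionedConcurrency",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": 0.70,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType":
                    "LambdaProvisionedConcurrencyUtilization"},
        },
    }
    return target, policy


target, policy = scaling_config("brain-plan")
```

Target tracking replaces the hand-written threshold rules above for the common case; the custom IncomingQueryCount rules remain useful for pre-emptive scaling before utilization actually rises.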
5.2 Queue-Based Scaling

# SQS queue for async query processing:
#
# Architecture:
#   API Gateway -> SQS Queue -> Lambda (poller) -> Step Functions
#
# Benefits:
#   - Absorbs traffic spikes without throttling
#   - Natural backpressure mechanism
#   - Dead letter queue for failed queries
#   - Visibility timeout prevents duplicate processing
#
# Queue configuration:
#   Visibility timeout: 300 seconds (5 minutes)
#   Message retention: 14 days
#   Receive wait time: 20 seconds (long polling)
#   Dead letter queue: after 3 failed attempts
#
# Scaling trigger:
#   If ApproximateNumberOfMessagesVisible > 100:
#     Scale up Step Functions executions

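The queue configuration above maps onto a single `create_queue` call. A sketch building the keyword payload for `boto3.client("sqs").create_queue(...)` — SQS wants every attribute as a string, and the DLQ wiring goes in a JSON-encoded RedrivePolicy; queue and DLQ names are assumptions:

```python
import json


def queue_request(name: str, dlq_arn: str) -> dict:
    # Keyword payload for sqs.create_queue(**...).
    return {
        "QueueName": name,
        "Attributes": {
            "VisibilityTimeout": "300",    # 5 min: longer than any pipeline run
            "MessageRetentionPeriod": str(14 * 24 * 3600),  # 14 days
            "ReceiveMessageWaitTimeSeconds": "20",          # long polling
            "RedrivePolicy": json.dumps({
                "deadLetterTargetArn": dlq_arn,
                "maxReceiveCount": 3,      # 3 failed attempts, then DLQ
            }),
        },
    }


req = queue_request(
    "agent-queries",
    "arn:aws:sqs:us-east-1:123456789012:agent-queries-dlq")  # placeholder ARN
```

The visibility timeout should comfortably exceed the slowest end-to-end execution; if it doesn't, a slow query becomes visible again mid-flight and gets processed twice.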
6. Cost Optimization Strategies

6.1 Model Tier Optimization

The single most effective cost optimization: use the cheapest model that meets quality requirements for each agent.

Agent              | Default Model | Optimized Model | Cost Reduction
Brain (plan)       | Claude Sonnet | Claude Sonnet   | 0% (needs intelligence)
Brain (synthesise) | Claude Sonnet | Claude Sonnet   | 0% (needs intelligence)
SQL Writer         | Claude Sonnet | Claude Haiku    | 90%
Data Processor     | Claude Sonnet | Claude Haiku    | 90%
Rim Weighting      | Claude Sonnet | Claude Haiku    | 90%
Formatter          | Claude Sonnet | Claude Haiku    | 90%

Result: 60-70% total cost reduction by using Haiku for all workers.

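One way to wire this tiering in code is a per-agent model map with Sonnet reserved for the brain roles. The model IDs below are illustrative placeholders — substitute the exact Bedrock IDs enabled in your account:

```python
SONNET = "anthropic.claude-sonnet"   # assumed ID, substitute your own
HAIKU = "anthropic.claude-haiku"     # assumed ID, substitute your own

AGENT_MODELS = {
    "brain_plan": SONNET,        # planning needs the stronger model
    "brain_synthesise": SONNET,  # so does final synthesis
    "sql_writer": HAIKU,         # narrow, template-driven worker tasks:
    "data_processor": HAIKU,     # the cheap tier is fast and good enough
    "rim_weighting": HAIKU,
    "formatter": HAIKU,
}


def model_for(agent: str) -> str:
    # Default unknown agents to the cheap tier; promote an agent to
    # Sonnet only when eval scores show the cheap model failing.
    return AGENT_MODELS.get(agent, HAIKU)
```

Defaulting new agents to Haiku and promoting on eval evidence keeps cost optimization the path of least resistance rather than an afterthought.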
6.2 Prompt Caching

# Bedrock prompt caching for repeated context:
#
# Without caching:
#   Every call sends full system prompt + data dictionary
#   System prompt: 2,000 tokens
#   Data dictionary: 3,000 tokens
#   = 5,000 tokens of repeated context per call
#
# With caching:
#   First call: Full 5,000 tokens (cache miss)
#   Subsequent calls: Cache hit (90% cheaper)
#
# Cost impact (pricing all calls at the Sonnet input rate for simplicity):
#   5 calls per query * 5,000 cached tokens = 25,000 tokens
#   Without cache: 25,000 * USD 3.00/M = USD 0.075
#   With cache:    25,000 * USD 0.30/M = USD 0.0075
#   Theoretical savings per query: up to USD 0.0675
#   Realized savings are lower: cache entries expire after a few minutes,
#   and the 3 Haiku worker calls are billed at much lower rates. Budget
#   conservatively, on the order of USD 1,000/month at 5,000 queries/day.

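In the Bedrock Converse API, caching is opted into by marking where the reusable prefix ends. A sketch of the request payload with a cache point after the static system prompt and data dictionary — treat the exact `cachePoint` field shape as an assumption to verify against the Bedrock API reference for your model:

```python
# Static context that repeats on every call (sizes per the example above).
SYSTEM_PROMPT = "You are the SQL writer agent..."   # ~2,000 tokens in practice
DATA_DICTIONARY = "table respondents: ..."          # ~3,000 tokens in practice


def cached_request(model_id: str, question: str) -> dict:
    # Keyword payload for boto3.client("bedrock-runtime").converse(...).
    # Everything *before* the cachePoint block is the cacheable prefix;
    # only the per-query question below it changes between calls.
    return {
        "modelId": model_id,
        "system": [
            {"text": SYSTEM_PROMPT},
            {"text": DATA_DICTIONARY},
            {"cachePoint": {"type": "default"}},
        ],
        "messages": [{"role": "user", "content": [{"text": question}]}],
    }


req = cached_request("anthropic.claude-sonnet",  # placeholder model ID
                     "Average age by region?")
```

The design constraint is that the prefix must be byte-identical across calls: put anything per-query (timestamps, user IDs, the question itself) after the cache point, never before it.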
6.3 Result Caching

# Cache frequent query results in DynamoDB:
#
# Cache key: normalized query hash
# Cache value: agent result
# TTL: 1 hour (for real-time data), 24 hours (for historical data)
#
# Implementation:
#   1. Normalize user query (lowercase, remove whitespace)
#   2. Hash the normalized query
#   3. Check DynamoDB cache table
#   4. If hit: Return cached result (zero LLM cost)
#   5. If miss: Run agent pipeline, cache result
#
# Expected hit rate: 15-30% (many analysts ask similar questions)
# Cost savings at 25% hit rate:
#   5,000 queries * 25% = 1,250 cached responses
#   1,250 * USD 0.029/query = USD 36/day = USD 1,087/month

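The five steps above fit in a few lines. An in-memory sketch of the cache — in production the dict becomes a DynamoDB table (`get_item`/`put_item` with a TTL attribute), and the table and attribute names would be your own:

```python
import hashlib
import time

# Stand-in for the DynamoDB cache table: key -> (stored_at, result).
_cache: dict[str, tuple[float, str]] = {}


def normalize(query: str) -> str:
    # Step 1: lowercase and collapse whitespace so trivially different
    # phrasings of the same question share a cache entry.
    return " ".join(query.lower().split())


def cache_key(query: str) -> str:
    # Step 2: hash the normalized query.
    return hashlib.sha256(normalize(query).encode()).hexdigest()


def get_or_run(query: str, run_pipeline, ttl_seconds: int = 3600) -> str:
    # Steps 3-5: check the cache, fall back to the agent pipeline on a miss.
    key = cache_key(query)
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < ttl_seconds:
        return hit[1]                      # cache hit: zero LLM cost
    result = run_pipeline(query)           # cache miss: run the agents
    _cache[key] = (time.time(), result)
    return result


calls = []
answer = get_or_run("  Total Sample by REGION ",
                    lambda q: calls.append(q) or "42")
again = get_or_run("total sample by region",
                   lambda q: calls.append(q) or "42")
# The second call hits the cache: the pipeline ran only once.
```

The 1-hour vs 24-hour TTL split above maps to the `ttl_seconds` argument; a real implementation would also want a cap on entry size and a way to bypass the cache for queries that must reflect live data.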
6.4 Monthly Cost Summary

Component              | POC (100 q/day) | Production (5K q/day) | Enterprise (50K q/day)
Bedrock (Sonnet)       | USD 80          | USD 4,000             | USD 40,000
Bedrock (Haiku)        | USD 3           | USD 150               | USD 1,500
Prompt caching savings | -USD 20         | -USD 1,000            | -USD 10,000
Result caching savings | -USD 5          | -USD 1,000            | -USD 10,000
DynamoDB               | USD 5           | USD 50                | USD 500
Lambda                 | USD 2           | USD 100               | USD 1,000
Step Functions         | USD 1           | USD 15                | USD 150
S3                     | USD 1           | USD 10                | USD 100
Net Total              | USD 67          | ~USD 2,325            | ~USD 23,250

Savings rows assume conservative, steady cache hit rates (see 6.2 and 6.3); treat all figures as planning estimates, not quotes.

7. Human-in-the-Loop at Scale

7.1 When to Escalate

# Escalation rules:
#
# Rule 1: Low confidence
#   If brain confidence score is under 0.5: escalate