Introduction
Your multi-agent system works in development. It passes all eval tests. Now you need to run it at 5,000 queries per day with predictable latency, controlled costs, and zero downtime.
This final deep dive covers the production scaling layer: Step Functions for orchestration, Lambda auto-scaling, Bedrock provisioned throughput, cost optimization, and operational runbooks.
Table of Contents
- Scaling Challenges for Agent Systems
- AWS Step Functions Orchestration
- Lambda Configuration for Agents
- Bedrock Provisioned Throughput
- Auto-Scaling Patterns
- Cost Optimization Strategies
- Human-in-the-Loop at Scale
- Multi-Region Architecture
- Operational Runbooks
- Production Checklist
1. Scaling Challenges for Agent Systems
Agent systems have unique scaling characteristics:
| Challenge | Root Cause | Solution |
|---|---|---|
| Variable latency | LLM response time varies 2-30s | Async execution with Step Functions |
| Token throughput limits | Bedrock has per-model TPM limits | Provisioned throughput or multi-region |
| Cost scaling | More queries = linearly more LLM cost | Model tier optimization (Haiku for workers) |
| State explosion | Checkpoints grow with conversation length | S3 offloading, TTL cleanup |
| Cold starts | Lambda cold start adds 1-3s | Provisioned concurrency |
| Cascading failures | One slow agent blocks the pipeline | Timeouts, circuit breakers |
Key principle: Scale the orchestration layer independently from the LLM layer. Step Functions handle flow control; Bedrock handles inference. Each scales differently.
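The timeout and circuit-breaker row in the table can be sketched as a small in-process wrapper. This is an illustrative sketch, not an AWS SDK feature; the class name and thresholds are assumptions:

```python
import time

class CircuitBreaker:
    """Stops calling a slow or failing agent after repeated failures,
    so one bad dependency cannot stall the whole pipeline."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None = circuit closed (calls allowed)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: skipping call")
            # Half-open: the reset window elapsed, allow one trial call
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

In a Step Functions world the retry and timeout half of this pattern moves into the state machine definition; the breaker remains useful inside a Lambda that fans out to external dependencies.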
2. AWS Step Functions Orchestration
2.1 Why Step Functions
LangGraph runs the agent graph in a single process. This works for development but limits production scalability. Step Functions provides:
- Parallel execution: Fan out to multiple workers simultaneously
- Built-in retries: Automatic retry with exponential backoff
- Human-in-the-loop: Native task token support for approvals
- Timeout handling: Per-step and total execution timeouts
- Visual debugging: Execution history in AWS Console
- Cost model: Pay per state transition (USD 0.025 per 1,000 transitions)
2.2 State Machine Design
# Step Functions state machine for agent pipeline:
#
# Start -> BrainPlan
#
# BrainPlan:
# Type: Task
# Resource: Lambda (brain-plan-function)
# Timeout: 60 seconds
# Retry: 2 attempts with backoff
# Output: execution_plan, workers_needed
#
# BrainPlan -> ParallelWorkers
#
# ParallelWorkers:
# Type: Parallel
# Branches:
# - SQLWriter (if needed)
# - DataProcessor (if needed)
# - WeightApplier (if needed)
# Each branch:
# Type: Task
# Resource: Lambda (worker-function)
# Timeout: 120 seconds
# Retry: 1 attempt
#
# ParallelWorkers -> BrainSynthesise
#
# BrainSynthesise:
# Type: Task
# Resource: Lambda (brain-synthesise-function)
# Timeout: 60 seconds
# Output: final_answer, confidence
#
# BrainSynthesise -> ConfidenceCheck
#
# ConfidenceCheck:
# Type: Choice
# If confidence >= 0.8: -> Formatter
# If confidence >= 0.5: -> HumanReview
# Otherwise: -> ErrorHandler
#
# Formatter:
# Type: Task
# Resource: Lambda (formatter-function)
# Timeout: 30 seconds
# Next: End
#
# HumanReview:
# Type: Task
# Resource: Activity (human-review-activity)
# Timeout: 3600 seconds (1 hour)
# HeartbeatSeconds: 300
# Next: End
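The ConfidenceCheck and Formatter states above might look like this in Amazon States Language (an abridged sketch; the state names match the outline, but the Lambda ARN is a placeholder):

```json
{
  "ConfidenceCheck": {
    "Type": "Choice",
    "Choices": [
      {
        "Variable": "$.confidence",
        "NumericGreaterThanEquals": 0.8,
        "Next": "Formatter"
      },
      {
        "Variable": "$.confidence",
        "NumericGreaterThanEquals": 0.5,
        "Next": "HumanReview"
      }
    ],
    "Default": "ErrorHandler"
  },
  "Formatter": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:formatter-function",
    "TimeoutSeconds": 30,
    "End": true
  }
}
```

Choice rules are evaluated top to bottom, so the 0.8 threshold must come before the 0.5 threshold for the routing to match the outline.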
2.3 Parallel Fan-Out
# Step Functions parallel execution:
#
# Without parallelism (sequential):
# BrainPlan (5s) -> SQLWriter (10s) -> WeightApplier (8s)
# -> BrainSynthesise (5s) -> Formatter (3s)
# Total: 31 seconds
#
# With parallelism (Step Functions):
# BrainPlan (5s) -> [SQLWriter (10s) | WeightApplier (8s)] parallel
# -> BrainSynthesise (5s) -> Formatter (3s)
# Total: 23 seconds (26% faster)
#
# For complex queries with 4+ workers:
# Sequential: 40-60 seconds
# Parallel: 20-30 seconds (50% faster)
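The arithmetic above generalizes: parallel latency is the slowest branch rather than the sum of branches. A quick sketch using the illustrative durations from the comment:

```python
def pipeline_latency(pre, branches, post):
    """Sequential runs everything in a row; parallel pays only for
    the slowest branch. Durations are per-step seconds."""
    sequential = sum(pre) + sum(branches) + sum(post)
    parallel = sum(pre) + max(branches) + sum(post)
    return sequential, parallel

# BrainPlan (5s), then SQLWriter (10s) | WeightApplier (8s),
# then BrainSynthesise (5s) and Formatter (3s)
seq, par = pipeline_latency(pre=[5], branches=[10, 8], post=[5, 3])
speedup = (seq - par) / seq  # fraction of latency removed
```

The gain grows with the number of branches: with four workers of similar duration, only the slowest one counts, which is where the 50% figure for complex queries comes from.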
3. Lambda Configuration for Agents
3.1 Memory and Timeout Settings
| Function | Memory | Timeout | Concurrency | Purpose |
|---|---|---|---|---|
| brain-plan | 512 MB | 60s | 50 | Orchestration planning |
| sql-writer | 256 MB | 120s | 100 | SQL generation and execution |
| weight-applier | 512 MB | 120s | 50 | Data processing (pandas) |
| formatter | 256 MB | 30s | 100 | Response formatting |
| brain-synthesise | 512 MB | 60s | 50 | Final answer synthesis |
3.2 Cold Start Mitigation
# Provisioned concurrency configuration:
#
# brain-plan: 5 instances (always warm)
# sql-writer: 10 instances (high volume)
# weight-applier: 5 instances
# formatter: 10 instances
# brain-synthesise: 5 instances
#
# Cost: ~USD 150/month for 35 provisioned instances
# Benefit: Eliminates 1-3s cold start for 95% of invocations
#
# Schedule-based scaling:
# - Business hours (9am-6pm): Full provisioned concurrency
# - Off-hours (6pm-9am): Reduce to 2 instances each
# - Weekends: Reduce to 1 instance each
# Savings: ~40% vs always-on provisioned concurrency
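The schedule above reduces to a pure function that a scheduled updater (e.g. an EventBridge-triggered Lambda) could evaluate before applying the value with a separate API call. A sketch; the capacity numbers follow the comment above:

```python
def target_concurrency(full_capacity, hour, is_weekend):
    """Provisioned-concurrency target for one function.

    Business hours (9am-6pm, weekdays): full provisioned concurrency.
    Off-hours on weekdays: 2 instances. Weekends: 1 instance.
    """
    if is_weekend:
        return 1
    if 9 <= hour < 18:
        return full_capacity
    return 2
```

Keeping the policy in one pure function makes the schedule trivially testable and keeps the AWS API call (applying the returned number) at the edge of the code.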
3.3 Lambda Layers for Dependencies
# Lambda layers to reduce deployment size:
#
# Layer 1: langchain-core (shared across all functions)
# - langchain-core
# - langgraph
# - langgraph-checkpoint-aws
# Size: ~50 MB
#
# Layer 2: data-processing (for weight-applier only)
# - pandas
# - numpy
# Size: ~80 MB
#
# Layer 3: bedrock-runtime (shared across all functions)
# - boto3 (latest)
# - botocore (latest)
# Size: ~30 MB
#
# Total layer size: 160 MB (within 250 MB Lambda limit)
# Benefit: Each function deployment is only 1-5 MB of custom code
4. Bedrock Provisioned Throughput
4.1 When to Use Provisioned Throughput
| Workload | Queries/Day | Recommended | Monthly Cost |
|---|---|---|---|
| POC | Under 100 | On-demand | Pay per token |
| Production | 1,000-10,000 | Provisioned Throughput | USD 1,000-5,000 |
| Enterprise | Above 10,000 | Provisioned + Cross-region | USD 5,000+ |
4.2 Capacity Planning
# Bedrock token throughput calculation:
#
# Query profile:
# Brain (Sonnet): 2 calls * 2,500 tokens = 5,000 tokens
# Workers (Haiku): 3 calls * 1,800 tokens = 5,400 tokens
# Total per query: ~10,400 tokens
#
# Daily throughput:
# 5,000 queries * 10,400 tokens = 52,000,000 tokens/day
# = 2,167,000 tokens/hour
# = 36,100 tokens/minute (average)
#
# Peak throughput (4x average during business hours):
# 144,400 tokens/minute
#
# Provisioned throughput needed:
# Sonnet: 1 model unit (covers ~100K tokens/minute)
# Haiku: 1 model unit (covers ~200K tokens/minute)
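The capacity arithmetic above can be double-checked in a few lines. The per-model-unit capacities vary by model and are taken from the assumptions in the comment, not from published figures:

```python
import math

def tokens_per_minute(queries_per_day, tokens_per_query, peak_factor=4):
    """Average and peak token throughput per minute."""
    avg = queries_per_day * tokens_per_query / (24 * 60)
    return avg, avg * peak_factor

def model_units_needed(peak_tpm, unit_capacity_tpm):
    """Smallest whole number of provisioned model units covering peak load."""
    return math.ceil(peak_tpm / unit_capacity_tpm)

# Figures from the query profile above, split per model:
sonnet_avg, sonnet_peak = tokens_per_minute(5_000, 5_000)  # Brain calls
haiku_avg, haiku_peak = tokens_per_minute(5_000, 5_400)    # Worker calls
```

Sizing against the 4x peak rather than the average is what keeps a single model unit sufficient here; sizing against the average would under-provision during business hours.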
4.3 Cross-Region Inference
# Bedrock cross-region inference for higher throughput:
#
# Primary region: us-east-1
# Secondary regions: us-west-2, eu-west-1
#
# Configuration:
# inference_profile_id = "us.anthropic.claude-3-5-sonnet-20241022-v2:0"
# # The "us." prefix selects a cross-region inference profile
# # Bedrock routes each request to a region in that geography with capacity
#
# Benefits:
# - Higher effective token throughput (load spread across 3 regions)
# - Automatic failover if one region is throttled
# - No code changes needed (just change model ID prefix)
#
# Note: Data may cross region boundaries
# - Ensure compliance with data residency requirements
# - Use region-specific profiles for regulated workloads
5. Auto-Scaling Patterns
5.1 Lambda Concurrency Scaling
# Auto-scaling based on query volume:
#
# Metric: IncomingQueryCount (custom CloudWatch metric)
#
# Scale-up rules:
# If IncomingQueryCount > 50/minute for 3 minutes:
# Increase provisioned concurrency by 50%
# If IncomingQueryCount > 100/minute for 3 minutes:
# Increase provisioned concurrency by 100%
#
# Scale-down rules:
# If IncomingQueryCount is under 20/minute for 10 minutes:
# Decrease provisioned concurrency by 25%
# Minimum: 2 instances per function
#
# Application Auto Scaling target:
# Resource: Lambda function provisioned concurrency
# Min capacity: 2
# Max capacity: 100
# Target tracking: 70% utilization
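The step rules above can be expressed as one decision function. This is illustrative; in practice Application Auto Scaling's target tracking replaces hand-rolled step rules, and the sustained-for-N-minutes debouncing is handled by the CloudWatch alarm evaluation window, not by this function:

```python
def next_concurrency(current, qpm, minimum=2, maximum=100):
    """Apply the step-scaling rules from the section above.

    qpm is the sustained IncomingQueryCount per minute (already
    debounced over the alarm's evaluation period).
    """
    if qpm > 100:
        target = current * 2          # increase by 100%
    elif qpm > 50:
        target = int(current * 1.5)   # increase by 50%
    elif qpm < 20:
        target = int(current * 0.75)  # decrease by 25%
    else:
        target = current
    return max(minimum, min(maximum, target))
```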
5.2 Queue-Based Scaling
# SQS queue for async query processing:
#
# Architecture:
# API Gateway -> SQS Queue -> Lambda (poller) -> Step Functions
#
# Benefits:
# - Absorbs traffic spikes without throttling
# - Natural backpressure mechanism
# - Dead letter queue for failed queries
# - Visibility timeout prevents duplicate processing
#
# Queue configuration:
# Visibility timeout: 300 seconds (5 minutes)
# Message retention: 14 days
# Receive wait time: 20 seconds (long polling)
# Dead letter queue: after 3 failed attempts
#
# Scaling trigger:
# If ApproximateNumberOfMessagesVisible > 100:
# Scale up Step Functions executions
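A minimal poller handler for this architecture might look like the following. A sketch: the state machine ARN and event shape are assumptions, and the Step Functions client is passed in so the logic is testable without AWS. Returning `batchItemFailures` uses Lambda's partial-batch-response feature so only failed records are retried:

```python
def handle_sqs_batch(event, sfn_client, state_machine_arn):
    """Start one Step Functions execution per SQS record."""
    failures = []
    for record in event["Records"]:
        try:
            sfn_client.start_execution(
                stateMachineArn=state_machine_arn,
                input=record["body"],  # the query payload, already JSON
            )
        except Exception:
            # Report this record as failed; SQS will redeliver it,
            # and after 3 attempts it lands in the dead letter queue.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Enabling partial batch responses requires `ReportBatchItemFailures` on the event source mapping; without it, one bad record forces the whole batch to be retried.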
6. Cost Optimization Strategies
6.1 Model Tier Optimization
The single most effective cost optimization: use the cheapest model that meets quality requirements for each agent.
| Agent | Default Model | Optimized Model | Cost Reduction |
|---|---|---|---|
| Brain (plan) | Claude Sonnet | Claude Sonnet | 0% (needs intelligence) |
| Brain (synthesise) | Claude Sonnet | Claude Sonnet | 0% (needs intelligence) |
| SQL Writer | Claude Sonnet | Claude Haiku | 90% |
| Data Processor | Claude Sonnet | Claude Haiku | 90% |
| Rim Weighting | Claude Sonnet | Claude Haiku | 90% |
| Formatter | Claude Sonnet | Claude Haiku | 90% |
Result: 60-70% total cost reduction by using Haiku for all workers.
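In code, the tiering is just a routing table consulted when each agent builds its Bedrock call. The tier names and prices below are illustrative assumptions, not published rate cards:

```python
# Model tier per agent role; every worker gets the cheap tier.
MODEL_FOR_AGENT = {
    "brain_plan": "sonnet",
    "brain_synthesise": "sonnet",
    "sql_writer": "haiku",
    "data_processor": "haiku",
    "rim_weighting": "haiku",
    "formatter": "haiku",
}

# Rough on-demand input price per million tokens (assumed figures).
PRICE_PER_M = {"sonnet": 3.00, "haiku": 0.25}

def query_cost(calls):
    """calls: list of (agent, tokens). USD cost under the tiering above."""
    return sum(
        tokens * PRICE_PER_M[MODEL_FOR_AGENT[agent]] / 1_000_000
        for agent, tokens in calls
    )
```

Centralizing the mapping makes the optimization reversible: if a worker's quality drops on the cheap tier, one line moves it back.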
6.2 Prompt Caching
# Bedrock prompt caching for repeated context:
#
# Without caching:
# Every call resends the full system prompt + data dictionary
# System prompt: 2,000 tokens
# Data dictionary: 3,000 tokens
# = 5,000 tokens of repeated context per call
#
# With caching:
# First call: full 5,000 tokens written to the cache (cache miss)
# Subsequent calls: cached tokens read back at ~90% discount
#
# Cost impact (Sonnet calls only; the Haiku workers' savings are ~10x smaller):
# 2 Sonnet calls per query * 5,000 cached tokens = 10,000 tokens
# Without cache: 10,000 * USD 3.00/M = USD 0.030
# With cache: 10,000 * USD 0.30/M = USD 0.003
# Savings per query: ~USD 0.027
# At 5,000 queries/day: ~USD 135/day = ~USD 4,000/month
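With the Converse API, a cache checkpoint is a block placed after the stable prefix. A sketch of the request shape; verify that `cachePoint` is supported for your model and region, and the prompt contents here are placeholders:

```python
def build_converse_request(model_id, system_prompt, data_dictionary, user_query):
    """Put the cache checkpoint after the large, repeated context so
    Bedrock can reuse everything above it across calls."""
    return {
        "modelId": model_id,
        "system": [
            {"text": system_prompt},
            {"text": data_dictionary},
            {"cachePoint": {"type": "default"}},  # cache everything above
        ],
        "messages": [
            {"role": "user", "content": [{"text": user_query}]},
        ],
    }
```

Only the content before the checkpoint is cached, so the stable system prompt and data dictionary must come first and the per-query user message last.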
6.3 Result Caching
# Cache frequent query results in DynamoDB:
#
# Cache key: normalized query hash
# Cache value: agent result
# TTL: 1 hour (for real-time data), 24 hours (for historical data)
#
# Implementation:
# 1. Normalize user query (lowercase, remove whitespace)
# 2. Hash the normalized query
# 3. Check DynamoDB cache table
# 4. If hit: Return cached result (zero LLM cost)
# 5. If miss: Run agent pipeline, cache result
#
# Expected hit rate: 15-30% (many analysts ask similar questions)
# Cost savings at 25% hit rate:
# 5,000 queries * 25% = 1,250 cached responses
# 1,250 * USD 0.029/query = USD 36/day = USD 1,087/month
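The lookup flow is a few lines. An in-memory sketch: production would swap the dict for a DynamoDB table with a TTL attribute, and the function names are illustrative:

```python
import hashlib
import time

_cache = {}  # normalized-query hash -> (expires_at, result)

def normalize(query):
    """Lowercase and collapse whitespace so trivially different
    phrasings of the same query share a cache key."""
    return " ".join(query.lower().split())

def cached_run(query, run_pipeline, ttl_s=3600):
    key = hashlib.sha256(normalize(query).encode()).hexdigest()
    hit = _cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]             # cache hit: zero LLM cost
    result = run_pipeline(query)  # cache miss: run the agent pipeline
    _cache[key] = (time.time() + ttl_s, result)
    return result
```

The TTL is the correctness knob: one hour for real-time data bounds staleness, while historical data can safely use 24 hours.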
6.4 Monthly Cost Summary
| Component | POC (100 q/day) | Production (5K q/day) | Enterprise (50K q/day) |
|---|---|---|---|
| Bedrock Sonnet (before caching) | USD 80 | USD 8,000 | USD 80,000 |
| Bedrock Haiku (before caching) | USD 3 | USD 150 | USD 1,500 |
| Prompt caching savings | -USD 20 | -USD 4,000 | -USD 40,000 |
| Result caching savings | -USD 5 | -USD 1,000 | -USD 10,000 |
| DynamoDB | USD 5 | USD 50 | USD 500 |
| Lambda | USD 2 | USD 100 | USD 1,000 |
| Step Functions | USD 1 | USD 15 | USD 150 |
| S3 | USD 1 | USD 10 | USD 100 |
| Net Total | USD 67 | USD 3,325 | USD 33,250 |
7. Human-in-the-Loop at Scale
7.1 When to Escalate
# Escalation rules:
#
# Rule 1: Low confidence
# If brain confidence score is under 0.5: escalate
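Rule 1 in code, combined with the ConfidenceCheck thresholds from the state machine in section 2.2 (a sketch; the step names are illustrative):

```python
def route(confidence, high=0.8, low=0.5):
    """Map the brain's confidence score to the next pipeline step."""
    if confidence >= high:
        return "formatter"      # auto-approve and format the answer
    if confidence >= low:
        return "human_review"   # escalate to a human reviewer
    return "error_handler"      # too uncertain even for review
```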