The Client Who Wanted $20/Month Hosting (And the One Who Needed 99.99% SLA)

I had two client conversations in the same week that perfectly illustrated why MarketingOS needs to run anywhere.

Monday morning. A bakery owner named Lisa sat across from me. She had a marketing site with a hero section, a menu page, a catering inquiry form, and a blog where she posted seasonal specials. Traffic was maybe 2,000 visits a month, all local. She was paying $180/month for managed WordPress hosting because her previous developer told her she “needed enterprise-grade security.” She looked at me and said, “I just need it to work. Can we get the hosting under $30?” Absolutely we can.

Thursday afternoon. A VP of Marketing at a fintech SaaS sat across from me (well, across a Zoom call). They needed the marketing site to handle product launch days with 500K+ concurrent visitors, serve content from CDN edges in 14 countries, maintain 99.99% uptime because marketing downtime meant lost revenue during ad campaigns, and pass a SOC 2 audit. Their budget for infrastructure was $3,000/month, and they considered it cheap compared to the cost of a single hour of downtime during a product launch.

Same template. Same codebase. Same Docker images built by the CI/CD pipeline we set up in Part 7. But the infrastructure underneath those images needs to be radically different.

This is the part of MarketingOS that I spent the most time on and rewrote twice. The first version was AWS-only with Terraform. The second was “deploy anywhere” with a 2,000-line shell script that tried to detect the environment and configure itself. Both were bad. What I landed on is three distinct deployment paths, each with its own Terraform module, each optimized for a different cost/complexity/reliability trade-off.

Let’s walk through all three.

The Infrastructure Decision Framework

Before diving into Terraform files, here’s the decision tree I give clients:

Choose Self-Hosted Ubuntu if:

  • Monthly budget is under $50
  • Traffic is under 50K visits/month
  • A few minutes of downtime during deployment is acceptable
  • You (or your agency) can SSH into a server to debug issues
  • You don’t need geographic redundancy

Choose AWS if:

  • You need auto-scaling for traffic spikes
  • You need multi-region availability
  • You’re already in the AWS ecosystem
  • Compliance requirements mandate specific certifications (SOC 2, HIPAA BAA)
  • Budget is $100-500/month

Choose Azure if:

  • You need native .NET hosting optimization (App Service is hard to beat for .NET)
  • You’re in the Microsoft ecosystem (Azure AD, Office 365)
  • You want deployment slots for zero-downtime swaps out of the box
  • Application Insights APM matters to you
  • Budget is $80-400/month

Now let’s build all three.

Option 1: Self-Hosted Ubuntu with Docker Compose

This is the Lisa option. One VPS, Docker Compose, Traefik for reverse proxy and automatic SSL, and automated backups. It handles far more traffic than people expect — I’ve run sites with 30K monthly visitors on a $20/month Hetzner box without breaking a sweat.

Server Setup

We start with a fresh Ubuntu 24.04 LTS server. I use Hetzner (CPX21: 3 vCPUs, 4GB RAM, 80GB SSD, $7.50/month) or DigitalOcean ($24/month for a comparable droplet). Here’s the initial setup script:

#!/bin/bash
# server-setup.sh — Initial Ubuntu 24.04 server configuration
# Run as root on a fresh server

set -euo pipefail

echo "=== MarketingOS Server Setup ==="

# Update system
apt update && apt upgrade -y

# Install essential packages
apt install -y \
  curl \
  wget \
  git \
  ufw \
  fail2ban \
  unattended-upgrades \
  apt-transport-https \
  ca-certificates \
  gnupg \
  lsb-release \
  htop \
  ncdu

# Configure automatic security updates
cat > /etc/apt/apt.conf.d/20auto-upgrades << 'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
APT::Periodic::AutocleanInterval "7";
EOF

# Install Docker (official repository)
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
  -o /etc/apt/keyrings/docker.asc
chmod a+r /etc/apt/keyrings/docker.asc

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] \
  https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  tee /etc/apt/sources.list.d/docker.list > /dev/null

apt update
apt install -y docker-ce docker-ce-cli containerd.io \
  docker-buildx-plugin docker-compose-plugin

# Enable Docker service
systemctl enable docker
systemctl start docker

# Configure UFW firewall
ufw default deny incoming
ufw default allow outgoing
ufw allow 22/tcp    # SSH
ufw allow 80/tcp    # HTTP (Traefik)
ufw allow 443/tcp   # HTTPS (Traefik)
ufw --force enable

# Configure Fail2ban for SSH protection
cat > /etc/fail2ban/jail.local << 'EOF'
[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
findtime = 600

# fail2ban 1.0+ (what Ubuntu 24.04 ships) removed the separate sshd-ddos
# filter; the sshd filter's aggressive mode covers the same patterns
[sshd-aggressive]
enabled = true
port = ssh
filter = sshd[mode=aggressive]
logpath = /var/log/auth.log
maxretry = 6
bantime = 86400
findtime = 600
EOF

systemctl enable fail2ban
systemctl restart fail2ban

# Create deploy user
adduser --disabled-password --gecos "" deploy
usermod -aG docker deploy
mkdir -p /home/deploy/.ssh
cp /root/.ssh/authorized_keys /home/deploy/.ssh/
chown -R deploy:deploy /home/deploy/.ssh
chmod 700 /home/deploy/.ssh
chmod 600 /home/deploy/.ssh/authorized_keys

# Create application directory
mkdir -p /opt/marketingos
chown deploy:deploy /opt/marketingos

# Configure Docker logging to prevent disk fill
cat > /etc/docker/daemon.json << 'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF

systemctl restart docker

echo "=== Server setup complete ==="
echo "SSH as 'deploy' user for application deployment"

A few notes on this script. The fail2ban configuration bans IP addresses after 3 failed SSH attempts for one hour, and bans DDoS-style rapid attempts for 24 hours. The Docker logging configuration caps log files at 10MB with 3 rotations — without this, I’ve seen SQL Server containers generate 20GB of logs in a month and fill the disk. The deploy user has Docker permissions but no sudo — deployments happen through this restricted account.

Production Docker Compose with Traefik

In Part 7, we built Docker images and pushed them to GitHub Container Registry. Now we pull those images and run them with Traefik handling SSL and reverse proxying. This is the docker-compose.prod.yml that lives on the server:

# /opt/marketingos/docker-compose.prod.yml
# (no top-level "version" key; Compose v2 treats it as obsolete)

services:
  traefik:
    image: traefik:v3.2
    container_name: marketingos-traefik
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik/acme.json:/acme.json
      - ./traefik/traefik.yml:/etc/traefik/traefik.yml:ro
      - ./traefik/dynamic:/etc/traefik/dynamic:ro
    networks:
      - web
      - internal
    labels:
      - "traefik.enable=true"
      # Dashboard (optional, remove in production if not needed)
      - "traefik.http.routers.dashboard.rule=Host(`traefik.${DOMAIN}`)"
      - "traefik.http.routers.dashboard.tls.certresolver=letsencrypt"
      - "traefik.http.routers.dashboard.service=api@internal"
      - "traefik.http.routers.dashboard.middlewares=auth"
      - "traefik.http.middlewares.auth.basicauth.users=${TRAEFIK_AUTH}"

  umbraco:
    image: ghcr.io/${GITHUB_ORG}/marketingos-backend:${IMAGE_TAG:-latest}
    container_name: marketingos-umbraco
    restart: unless-stopped
    environment:
      - ASPNETCORE_ENVIRONMENT=Production
      - ASPNETCORE_URLS=http://+:5000
      - ConnectionStrings__umbracoDbDSN=Server=sqlserver;Database=MarketingOS;User Id=sa;Password=${SQL_PASSWORD};TrustServerCertificate=true
      - Umbraco__CMS__DeliveryApi__Enabled=true
      - Umbraco__CMS__DeliveryApi__ApiKey=${DELIVERY_API_KEY}
      - Redis__ConnectionString=redis:6379
    depends_on:
      sqlserver:
        condition: service_healthy
      redis:
        condition: service_healthy
    networks:
      - web
      - internal
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.umbraco.rule=Host(`cms.${DOMAIN}`)"
      - "traefik.http.routers.umbraco.tls.certresolver=letsencrypt"
      - "traefik.http.services.umbraco.loadbalancer.server.port=5000"
      - "traefik.http.routers.umbraco.middlewares=umbraco-headers"
      - "traefik.http.middlewares.umbraco-headers.headers.stsSeconds=31536000"
      - "traefik.http.middlewares.umbraco-headers.headers.stsIncludeSubdomains=true"
      - "traefik.http.middlewares.umbraco-headers.headers.contentTypeNosniff=true"
      - "traefik.http.middlewares.umbraco-headers.headers.frameDeny=true"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:5000/umbraco/ping"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

  nextjs:
    image: ghcr.io/${GITHUB_ORG}/marketingos-frontend:${IMAGE_TAG:-latest}
    container_name: marketingos-nextjs
    restart: unless-stopped
    environment:
      - NODE_ENV=production
      - UMBRACO_API_URL=http://umbraco:5000
      - UMBRACO_API_KEY=${DELIVERY_API_KEY}
      - REDIS_URL=redis://redis:6379
      - REVALIDATION_SECRET=${REVALIDATION_SECRET}
    depends_on:
      umbraco:
        condition: service_healthy
    networks:
      - web
      - internal
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.nextjs.rule=Host(`${DOMAIN}`) || Host(`www.${DOMAIN}`)"
      - "traefik.http.routers.nextjs.tls.certresolver=letsencrypt"
      - "traefik.http.services.nextjs.loadbalancer.server.port=3000"
      # Redirect www to non-www
      - "traefik.http.routers.www-redirect.rule=Host(`www.${DOMAIN}`)"
      - "traefik.http.routers.www-redirect.middlewares=www-to-nonwww"
      - "traefik.http.middlewares.www-to-nonwww.redirectregex.regex=^https?://www\\.(.+)"
      - "traefik.http.middlewares.www-to-nonwww.redirectregex.replacement=https://$${1}"
      - "traefik.http.middlewares.www-to-nonwww.redirectregex.permanent=true"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  sqlserver:
    image: mcr.microsoft.com/mssql/server:2022-latest
    container_name: marketingos-sql
    restart: unless-stopped
    environment:
      - ACCEPT_EULA=Y
      - SA_PASSWORD=${SQL_PASSWORD}
      - MSSQL_PID=Express
    volumes:
      - sqldata:/var/opt/mssql
    networks:
      - internal
    healthcheck:
      test: /opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P "$${SA_PASSWORD}" -C -Q "SELECT 1" || exit 1
      interval: 15s
      timeout: 10s
      retries: 5
      start_period: 30s

  redis:
    image: redis:7-alpine
    container_name: marketingos-redis
    restart: unless-stopped
    command: redis-server --appendonly yes --maxmemory 256mb --maxmemory-policy allkeys-lru
    volumes:
      - redisdata:/data
    networks:
      - internal
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3

volumes:
  sqldata:
  redisdata:

networks:
  web:
    external: true
  internal:
    driver: bridge

And the Traefik static configuration:

# /opt/marketingos/traefik/traefik.yml
api:
  dashboard: true

entryPoints:
  web:
    address: ":80"
    http:
      redirections:
        entryPoint:
          to: websecure
          scheme: https
  websecure:
    address: ":443"
    http:
      tls:
        certResolver: letsencrypt

certificatesResolvers:
  letsencrypt:
    acme:
      email: admin@yourdomain.com
      storage: /acme.json
      httpChallenge:
        entryPoint: web

providers:
  docker:
    endpoint: "unix:///var/run/docker.sock"
    exposedByDefault: false
    network: web
  file:
    directory: /etc/traefik/dynamic
    watch: true

log:
  level: WARN

accessLog:
  filePath: /dev/stdout
  filters:
    statusCodes:
      - "400-599"

Important details: Traefik automatically obtains and renews Let’s Encrypt certificates. The exposedByDefault: false setting means only containers with traefik.enable=true labels are exposed. All HTTP traffic is redirected to HTTPS. The SQL Server and Redis containers are on the internal network only — they’re not accessible from the internet.

Before first deployment, create the acme.json file and the external Docker network:

# On the server, as the deploy user
cd /opt/marketingos
mkdir -p traefik/dynamic
touch traefik/acme.json
chmod 600 traefik/acme.json
docker network create web
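
The compose file interpolates ${DOMAIN}, ${SQL_PASSWORD}, and friends from an .env file in the same directory. Here's a minimal sketch: the variable names come from docker-compose.prod.yml above, but every value is a placeholder you must replace.

```shell
# /opt/marketingos/.env (chmod 600; every value below is a placeholder)
DOMAIN=example.com
GITHUB_ORG=your-org
IMAGE_TAG=latest
SQL_PASSWORD=replace-with-a-strong-password
DELIVERY_API_KEY=replace-with-a-generated-key
REVALIDATION_SECRET=replace-with-a-generated-secret
# htpasswd-format user:hash for the Traefik dashboard basic auth
TRAEFIK_AUTH=replace-with-htpasswd-output
```

This one file holds every secret in the stack, which is why the backup scripts later source it instead of duplicating credentials.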

Automated Database Backups

This is the part people skip, and it’s the part that matters most when something goes wrong. I’ve had a client’s VPS provider lose a disk. I’ve had a Docker volume get corrupted after a kernel update. Backups are not optional.

#!/bin/bash
# /opt/marketingos/scripts/backup-db.sh
# Automated SQL Server backup to S3-compatible storage
# Run via cron: 0 2 * * * /opt/marketingos/scripts/backup-db.sh

set -euo pipefail

# Configuration
BACKUP_DIR="/opt/marketingos/backups"
S3_BUCKET="s3://marketingos-backups"
S3_ENDPOINT="https://s3.us-east-1.amazonaws.com"  # or Backblaze, Wasabi, etc.
RETENTION_DAILY=7
RETENTION_WEEKLY=4
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
DAY_OF_WEEK=$(date +%u)  # 1=Monday, 7=Sunday

# Load environment variables
source /opt/marketingos/.env

mkdir -p "${BACKUP_DIR}"

echo "[$(date)] Starting database backup..."

# Create backup inside the SQL Server container
# (COMPRESSION is omitted: SQL Server Express doesn't support backup
# compression, and we gzip the file below anyway)
docker exec marketingos-sql mkdir -p /var/opt/mssql/backup
docker exec marketingos-sql /opt/mssql-tools18/bin/sqlcmd \
  -S localhost -U sa -P "${SQL_PASSWORD}" -C \
  -Q "BACKUP DATABASE [MarketingOS] TO DISK = N'/var/opt/mssql/backup/MarketingOS_${TIMESTAMP}.bak' WITH FORMAT, STATS = 10"

# Copy backup from container to host
docker cp marketingos-sql:/var/opt/mssql/backup/MarketingOS_${TIMESTAMP}.bak \
  "${BACKUP_DIR}/MarketingOS_${TIMESTAMP}.bak"

# Remove backup from inside the container
docker exec marketingos-sql rm -f "/var/opt/mssql/backup/MarketingOS_${TIMESTAMP}.bak"

# Compress the backup
gzip "${BACKUP_DIR}/MarketingOS_${TIMESTAMP}.bak"
BACKUP_FILE="${BACKUP_DIR}/MarketingOS_${TIMESTAMP}.bak.gz"

echo "[$(date)] Backup created: $(du -h ${BACKUP_FILE} | cut -f1)"

# Upload to S3 — daily folder
aws s3 cp "${BACKUP_FILE}" \
  "${S3_BUCKET}/daily/MarketingOS_${TIMESTAMP}.bak.gz" \
  --endpoint-url "${S3_ENDPOINT}" \
  --storage-class STANDARD_IA

# On Sundays, also copy to weekly folder
if [ "${DAY_OF_WEEK}" -eq 7 ]; then
  aws s3 cp "${BACKUP_FILE}" \
    "${S3_BUCKET}/weekly/MarketingOS_${TIMESTAMP}.bak.gz" \
    --endpoint-url "${S3_ENDPOINT}" \
    --storage-class STANDARD_IA
  echo "[$(date)] Weekly backup uploaded"
fi

# Clean up local backups older than 2 days
find "${BACKUP_DIR}" -name "*.bak.gz" -mtime +2 -delete

# Clean up remote daily backups older than RETENTION_DAILY days
CUTOFF_DAILY=$(date -d "-${RETENTION_DAILY} days" +%Y%m%d)
aws s3 ls "${S3_BUCKET}/daily/" --endpoint-url "${S3_ENDPOINT}" | \
  while read -r line; do
    FILE=$(echo "$line" | awk '{print $4}')
    FILE_DATE=$(echo "$FILE" | grep -oP '\d{8}')
    if [ -n "${FILE_DATE}" ] && [ "${FILE_DATE}" -lt "${CUTOFF_DAILY}" ]; then
      aws s3 rm "${S3_BUCKET}/daily/${FILE}" --endpoint-url "${S3_ENDPOINT}"
      echo "[$(date)] Deleted old daily backup: ${FILE}"
    fi
  done

# Clean up remote weekly backups older than RETENTION_WEEKLY weeks
CUTOFF_WEEKLY=$(date -d "-$((RETENTION_WEEKLY * 7)) days" +%Y%m%d)
aws s3 ls "${S3_BUCKET}/weekly/" --endpoint-url "${S3_ENDPOINT}" | \
  while read -r line; do
    FILE=$(echo "$line" | awk '{print $4}')
    FILE_DATE=$(echo "$FILE" | grep -oP '\d{8}')
    if [ -n "${FILE_DATE}" ] && [ "${FILE_DATE}" -lt "${CUTOFF_WEEKLY}" ]; then
      aws s3 rm "${S3_BUCKET}/weekly/${FILE}" --endpoint-url "${S3_ENDPOINT}"
      echo "[$(date)] Deleted old weekly backup: ${FILE}"
    fi
  done

echo "[$(date)] Backup complete"
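
The two cleanup loops boil down to one comparison: pull the YYYYMMDD stamp out of a filename and compare it numerically against a cutoff date. Here's a standalone sketch of just that logic (is_expired is a helper name invented for illustration, not part of the script above):

```shell
#!/bin/sh
# is_expired FILE CUTOFF: succeeds when FILE's embedded YYYYMMDD stamp is
# strictly older than CUTOFF. A numeric compare works because the date
# format sorts chronologically.
is_expired() {
  stamp=$(echo "$1" | grep -oE '[0-9]{8}' | head -1)
  [ -n "$stamp" ] && [ "$stamp" -lt "$2" ]
}

is_expired "MarketingOS_20240101_020000.bak.gz" "20240501" && echo "expired"   # prints: expired
is_expired "MarketingOS_20240601_020000.bak.gz" "20240501" || echo "kept"      # prints: kept
```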

And a restore testing script, because a backup you’ve never tested restoring is not a backup:

#!/bin/bash
# /opt/marketingos/scripts/test-restore.sh
# Tests that the latest backup can be restored successfully
# Run monthly: 0 4 1 * * /opt/marketingos/scripts/test-restore.sh

set -euo pipefail

source /opt/marketingos/.env

BACKUP_DIR="/opt/marketingos/backups"
# "|| true" keeps set -e/pipefail from killing the script before the
# friendly error message when no backups exist
LATEST_BACKUP=$(ls -t ${BACKUP_DIR}/*.bak.gz 2>/dev/null | head -1 || true)

if [ -z "${LATEST_BACKUP}" ]; then
  echo "ERROR: No backup files found!"
  exit 1
fi

echo "[$(date)] Testing restore of: ${LATEST_BACKUP}"

# Decompress
TEMP_BAK="/tmp/restore_test.bak"
gunzip -c "${LATEST_BACKUP}" > "${TEMP_BAK}"

# Copy into SQL Server container
docker cp "${TEMP_BAK}" marketingos-sql:/var/opt/mssql/backup/restore_test.bak

# Restore to a test database
docker exec marketingos-sql /opt/mssql-tools18/bin/sqlcmd \
  -S localhost -U sa -P "${SQL_PASSWORD}" -C \
  -Q "RESTORE DATABASE [MarketingOS_RestoreTest] FROM DISK = N'/var/opt/mssql/backup/restore_test.bak' WITH MOVE 'MarketingOS' TO '/var/opt/mssql/data/MarketingOS_RestoreTest.mdf', MOVE 'MarketingOS_log' TO '/var/opt/mssql/data/MarketingOS_RestoreTest_log.ldf', REPLACE"

# Verify the restored database (SET NOCOUNT ON suppresses the
# "(1 rows affected)" line so RESULT is just the number)
RESULT=$(docker exec marketingos-sql /opt/mssql-tools18/bin/sqlcmd \
  -S localhost -U sa -P "${SQL_PASSWORD}" -C \
  -Q "SET NOCOUNT ON; SELECT COUNT(*) FROM [MarketingOS_RestoreTest].[dbo].[umbracoNode]" \
  -h -1 -W)

echo "[$(date)] Restore test: ${RESULT} nodes found in restored database"

# Drop test database and clean up
docker exec marketingos-sql /opt/mssql-tools18/bin/sqlcmd \
  -S localhost -U sa -P "${SQL_PASSWORD}" -C \
  -Q "DROP DATABASE [MarketingOS_RestoreTest]"

docker exec marketingos-sql rm -f /var/opt/mssql/backup/restore_test.bak
rm -f "${TEMP_BAK}"

if [ "${RESULT}" -gt 0 ]; then
  echo "[$(date)] RESTORE TEST PASSED"
else
  echo "[$(date)] RESTORE TEST FAILED - database appears empty"
  exit 1
fi

Self-Hosted Monitoring with Uptime Kuma

Paid monitoring services work, but for the self-hosted path we’re optimizing for cost. Uptime Kuma is a self-hosted monitoring tool that does HTTP checks, keyword monitoring, SSL certificate expiry alerts, and sends notifications to Slack, Discord, email, or Telegram.

Add it to the Docker Compose stack:

# Add to docker-compose.prod.yml services section
  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: marketingos-monitor
    restart: unless-stopped
    volumes:
      - uptimekuma:/app/data
    networks:
      - web
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.monitor.rule=Host(`monitor.${DOMAIN}`)"
      - "traefik.http.routers.monitor.tls.certresolver=letsencrypt"
      - "traefik.http.services.monitor.loadbalancer.server.port=3001"

# Add to volumes section
  uptimekuma:

Once deployed, configure monitors for:

  1. Frontend health — https://yourdomain.com (expected 200, check every 60s)
  2. CMS health — https://cms.yourdomain.com/umbraco/ping (expected 200, check every 60s)
  3. API health — https://cms.yourdomain.com/umbraco/delivery/api/v2/content (expected 200, check every 5 min)
  4. SSL certificate — alert when certificate expires in less than 14 days
  5. Docker containers — use the Docker socket monitor to check container status

For resource monitoring (disk space, CPU, memory), add a simple script to cron:

#!/bin/bash
# /opt/marketingos/scripts/check-resources.sh
# Alerts if disk or memory usage is high
# Run every 15 minutes: */15 * * * * /opt/marketingos/scripts/check-resources.sh

DISK_THRESHOLD=85
MEMORY_THRESHOLD=90
WEBHOOK_URL="${SLACK_WEBHOOK_URL:-}"

DISK_USAGE=$(df / | tail -1 | awk '{print $5}' | tr -d '%')
MEMORY_USAGE=$(free | awk '/Mem:/ {printf "%.0f", $3/$2 * 100}')

ALERT=""

if [ "${DISK_USAGE}" -gt "${DISK_THRESHOLD}" ]; then
  ALERT="${ALERT}Disk usage: ${DISK_USAGE}% (threshold: ${DISK_THRESHOLD}%)\n"
fi

if [ "${MEMORY_USAGE}" -gt "${MEMORY_THRESHOLD}" ]; then
  ALERT="${ALERT}Memory usage: ${MEMORY_USAGE}% (threshold: ${MEMORY_THRESHOLD}%)\n"
fi

if [ -n "${ALERT}" ] && [ -n "${WEBHOOK_URL}" ]; then
  curl -s -X POST "${WEBHOOK_URL}" \
    -H 'Content-type: application/json' \
    -d "{\"text\": \"[MarketingOS] Resource Alert on $(hostname):\n${ALERT}\"}"
fi
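
The three scripts in this section all assume cron entries. Here's a consolidated crontab for the deploy user; the log directory and the webhook placeholder are assumptions, not something the earlier scripts create for you.

```shell
# crontab -e (as the deploy user); create /opt/marketingos/logs first
SLACK_WEBHOOK_URL=replace-with-your-slack-webhook-url
0 2 * * * /opt/marketingos/scripts/backup-db.sh >> /opt/marketingos/logs/backup.log 2>&1
0 4 1 * * /opt/marketingos/scripts/test-restore.sh >> /opt/marketingos/logs/restore-test.log 2>&1
*/15 * * * * /opt/marketingos/scripts/check-resources.sh >> /opt/marketingos/logs/resources.log 2>&1
```

Cron runs with a near-empty environment, which is why SLACK_WEBHOOK_URL is set at the top of the crontab rather than inherited from a shell profile.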

Deployment from GitHub Actions

The CI/CD pipeline from Part 7 builds and pushes Docker images. Now we add a deploy job that SSHs into the server and pulls the new images:

# .github/workflows/deploy-self-hosted.yml
name: Deploy to Self-Hosted

on:
  workflow_dispatch:
    inputs:
      image_tag:
        description: 'Image tag to deploy'
        required: true
        default: 'latest'
  workflow_run:
    workflows: ["Build and Push"]
    types: [completed]
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    if: ${{ github.event_name == 'workflow_dispatch' || github.event.workflow_run.conclusion == 'success' }}
    environment: production-selfhosted

    steps:
      - name: Set image tag
        id: tag
        run: |
          if [ "${{ github.event_name }}" == "workflow_dispatch" ]; then
            echo "tag=${{ github.event.inputs.image_tag }}" >> $GITHUB_OUTPUT
          else
            echo "tag=latest" >> $GITHUB_OUTPUT
          fi

      - name: Deploy via SSH
        uses: appleboy/ssh-action@v1.0.3
        with:
          host: ${{ secrets.SERVER_HOST }}
          username: deploy
          key: ${{ secrets.SERVER_SSH_KEY }}
          script: |
            cd /opt/marketingos

            # Login to GitHub Container Registry
            echo "${{ secrets.GHCR_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin

            # Set the image tag
            export IMAGE_TAG=${{ steps.tag.outputs.tag }}

            # Pull new images
            docker compose -f docker-compose.prod.yml pull umbraco nextjs

            # Rolling update — restart one service at a time
            docker compose -f docker-compose.prod.yml up -d --no-deps umbraco
            echo "Waiting for Umbraco to be healthy..."
            timeout 120 bash -c 'until docker inspect --format="{{.State.Health.Status}}" marketingos-umbraco | grep -q healthy; do sleep 5; done'

            docker compose -f docker-compose.prod.yml up -d --no-deps nextjs
            echo "Waiting for Next.js to be healthy..."
            timeout 60 bash -c 'until docker inspect --format="{{.State.Health.Status}}" marketingos-nextjs | grep -q healthy; do sleep 5; done'

            # Clean up old images
            docker image prune -f

            echo "Deployment complete"

      - name: Verify deployment
        run: |
          sleep 10
          STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://${{ secrets.SITE_DOMAIN }})
          if [ "$STATUS" != "200" ]; then
            echo "Site returned status $STATUS — deployment may have failed"
            exit 1
          fi
          echo "Site is responding with 200 OK"

The rolling update strategy restarts Umbraco first, waits for its health check to pass, then restarts Next.js. This isn’t true zero-downtime (there’s a brief moment during container restart), but for a marketing site with <50K monthly visits, the 2-3 second gap is invisible.
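
Because the workflow accepts an image tag input, rollback is usually just re-running workflow_dispatch with the previous tag. For a manual rollback over SSH, here's a sketch. The .deploy-history file is a convention invented for illustration (one tag per line, appended after each successful deploy), not something the pipeline above writes:

```shell
#!/bin/sh
# Sketch: redeploy the second-to-last tag from a (hypothetical) history file.
HISTORY="${HISTORY:-/tmp/deploy-history.example}"
printf 'sha-1a2b3c\nsha-4d5e6f\n' > "$HISTORY"   # sample data for the sketch

PREV_TAG=$(tail -n 2 "$HISTORY" | head -n 1)     # second-to-last entry
echo "Rolling back to image tag: $PREV_TAG"      # prints: sha-1a2b3c

# On the server, the actual rollback would then be:
#   IMAGE_TAG="$PREV_TAG" docker compose -f docker-compose.prod.yml pull umbraco nextjs
#   IMAGE_TAG="$PREV_TAG" docker compose -f docker-compose.prod.yml up -d --no-deps umbraco nextjs
```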

Self-Hosted Cost Estimate

Component                 Provider                  Monthly Cost
VPS (3 vCPU, 4GB RAM)     Hetzner CPX21             $7.50
Backup storage (10GB)     Backblaze B2              $0.50
Domain name               Cloudflare                ~$1 (amortized)
Uptime monitoring         Self-hosted (included)    $0
SSL certificates          Let’s Encrypt (free)      $0
Total                                               ~$10-20/month

For DigitalOcean or AWS Lightsail, budget $24-40/month for comparable specs. Still well under Lisa’s $30 target.

Option 2: AWS Deployment

When clients need auto-scaling, geographic distribution, or compliance certifications, AWS is my default recommendation. The architecture uses managed services wherever possible — I don’t want to be woken up at 3 AM because a container ran out of memory on a self-managed EC2 instance.

Architecture Overview

Internet
  └─→ CloudFront (CDN + SSL termination)
       ├─→ S3 Bucket (static assets, media files)
       └─→ Application Load Balancer
            ├─→ ECS Fargate — Next.js service (2-10 tasks)
            └─→ ECS Fargate — Umbraco service (2-4 tasks)
                 ├─→ RDS SQL Server (Multi-AZ)
                 ├─→ ElastiCache Redis (cluster mode)
                 └─→ S3 Bucket (media storage via Umbraco)

Next.js runs on Fargate because we need server-side rendering for ISR revalidation and preview mode. If the Next.js site were purely static, I’d use S3 + CloudFront alone. But ISR needs a running server, so Fargate it is. Alternatively, you could deploy Next.js to Vercel and only run Umbraco on AWS — that’s a perfectly valid hybrid approach, and sometimes cheaper.

Terraform Configuration

I structure the Terraform as a module that can be instantiated per client:

# infrastructure/aws/main.tf
terraform {
  required_version = ">= 1.7.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "marketingos-terraform-state"
    key            = "aws/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Project     = "MarketingOS"
      Environment = var.environment
      ManagedBy   = "Terraform"
      Client      = var.client_name
    }
  }
}
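
# --- Input variables (sketch) ---
# The module references these inputs but the post doesn't show variables.tf;
# the declarations below are a minimal sketch, and the defaults are
# illustrative assumptions rather than recommendations.
variable "project_name" { default = "marketingos" }
variable "client_name"  { type = string }
variable "environment"  { default = "production" }
variable "aws_region"   { default = "us-east-1" }
variable "domain_name"  { type = string }
variable "image_tag"    { default = "latest" }

variable "ecr_repository_url"        { type = string }
variable "ecr_nextjs_repository_url" { type = string }

variable "umbraco_cpu"           { default = 1024 }
variable "umbraco_memory"        { default = 2048 }
variable "umbraco_desired_count" { default = 2 }
variable "nextjs_cpu"            { default = 512 }
variable "nextjs_memory"         { default = 1024 }
variable "nextjs_desired_count"  { default = 2 }

variable "db_instance_class" { default = "db.t3.medium" }
variable "redis_node_type"   { default = "cache.t4g.micro" }

variable "db_password" {
  type      = string
  sensitive = true
}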

# --- VPC ---
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "${var.project_name}-${var.environment}"
  cidr = "10.0.0.0/16"

  azs             = ["${var.aws_region}a", "${var.aws_region}b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = var.environment != "production"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

# --- ECS Cluster ---
resource "aws_ecs_cluster" "main" {
  name = "${var.project_name}-${var.environment}"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  configuration {
    execute_command_configuration {
      logging = "DEFAULT"
    }
  }
}

# --- ECS Task Definition: Umbraco ---
resource "aws_ecs_task_definition" "umbraco" {
  family                   = "${var.project_name}-umbraco-${var.environment}"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = var.umbraco_cpu
  memory                   = var.umbraco_memory
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([
    {
      name  = "umbraco"
      image = "${var.ecr_repository_url}:${var.image_tag}"

      portMappings = [
        {
          containerPort = 5000
          protocol      = "tcp"
        }
      ]

      environment = [
        { name = "ASPNETCORE_ENVIRONMENT", value = "Production" },
        { name = "ASPNETCORE_URLS", value = "http://+:5000" },
      ]

      secrets = [
        {
          name      = "ConnectionStrings__umbracoDbDSN"
          valueFrom = aws_ssm_parameter.db_connection_string.arn
        },
        {
          name      = "Umbraco__CMS__DeliveryApi__ApiKey"
          valueFrom = aws_ssm_parameter.delivery_api_key.arn
        },
        {
          name      = "Redis__ConnectionString"
          valueFrom = aws_ssm_parameter.redis_connection_string.arn
        }
      ]

      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost:5000/umbraco/ping || exit 1"]
        interval    = 30
        timeout     = 10
        retries     = 3
        startPeriod = 90
      }

      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.umbraco.name
          "awslogs-region"        = var.aws_region
          "awslogs-stream-prefix" = "umbraco"
        }
      }
    }
  ])
}

# --- ECS Service: Umbraco ---
resource "aws_ecs_service" "umbraco" {
  name            = "umbraco"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.umbraco.arn
  desired_count   = var.umbraco_desired_count
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = module.vpc.private_subnets
    security_groups  = [aws_security_group.umbraco.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.umbraco.arn
    container_name   = "umbraco"
    container_port   = 5000
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  deployment_minimum_healthy_percent = 100
  deployment_maximum_percent         = 200
}

# --- ECS Task Definition & Service: Next.js ---
resource "aws_ecs_task_definition" "nextjs" {
  family                   = "${var.project_name}-nextjs-${var.environment}"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = var.nextjs_cpu
  memory                   = var.nextjs_memory
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([
    {
      name  = "nextjs"
      image = "${var.ecr_nextjs_repository_url}:${var.image_tag}"

      portMappings = [
        {
          containerPort = 3000
          protocol      = "tcp"
        }
      ]

      environment = [
        { name = "NODE_ENV", value = "production" },
        { name = "UMBRACO_API_URL", value = "http://umbraco.${var.project_name}.local:5000" },
      ]

      secrets = [
        {
          name      = "UMBRACO_API_KEY"
          valueFrom = aws_ssm_parameter.delivery_api_key.arn
        },
        {
          name      = "REVALIDATION_SECRET"
          valueFrom = aws_ssm_parameter.revalidation_secret.arn
        },
        {
          name      = "REDIS_URL"
          valueFrom = aws_ssm_parameter.redis_url.arn
        }
      ]

      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
        interval    = 30
        timeout     = 5
        retries     = 3
        startPeriod = 30
      }

      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.nextjs.name
          "awslogs-region"        = var.aws_region
          "awslogs-stream-prefix" = "nextjs"
        }
      }
    }
  ])
}

resource "aws_ecs_service" "nextjs" {
  name            = "nextjs"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.nextjs.arn
  desired_count   = var.nextjs_desired_count
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = module.vpc.private_subnets
    security_groups  = [aws_security_group.nextjs.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.nextjs.arn
    container_name   = "nextjs"
    container_port   = 3000
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}
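
# --- Auto Scaling: Next.js (sketch) ---
# The architecture above calls for 2-10 Next.js tasks, but an ECS service
# with a fixed desired_count never scales on its own; it needs an
# Application Auto Scaling target and policy. Target-tracking on average
# CPU is the usual approach; the 60% target and cooldowns here are
# illustrative assumptions.
resource "aws_appautoscaling_target" "nextjs" {
  min_capacity       = 2
  max_capacity       = 10
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.nextjs.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "nextjs_cpu" {
  name               = "${var.project_name}-nextjs-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.nextjs.resource_id
  scalable_dimension = aws_appautoscaling_target.nextjs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.nextjs.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = 60
    scale_in_cooldown  = 300
    scale_out_cooldown = 60

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}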

# --- RDS SQL Server ---
resource "aws_db_instance" "sqlserver" {
  identifier     = "${var.project_name}-${var.environment}"
  # Express ("sqlserver-ex") doesn't support Multi-AZ; production (which
  # sets multi_az = true below) needs at least Standard Edition
  engine         = var.environment == "production" ? "sqlserver-se" : "sqlserver-ex"
  engine_version = "16.00"
  instance_class = var.db_instance_class

  allocated_storage     = 20
  max_allocated_storage = 100
  storage_encrypted     = true

  username = "umbraco_admin"
  password = var.db_password

  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = [aws_security_group.rds.id]

  multi_az            = var.environment == "production"
  skip_final_snapshot = var.environment != "production"

  backup_retention_period = 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "sun:04:00-sun:05:00"

  performance_insights_enabled = true

  tags = {
    Name = "${var.project_name}-sqlserver-${var.environment}"
  }
}

# --- ElastiCache Redis ---
resource "aws_elasticache_replication_group" "redis" {
  replication_group_id = "${var.project_name}-${var.environment}"
  description          = "Redis for MarketingOS ${var.environment}"

  node_type            = var.redis_node_type
  num_cache_clusters   = var.environment == "production" ? 2 : 1
  port                 = 6379
  engine_version       = "7.1"
  parameter_group_name = "default.redis7"

  subnet_group_name  = aws_elasticache_subnet_group.main.name
  security_group_ids = [aws_security_group.redis.id]

  at_rest_encryption_enabled = true
  transit_encryption_enabled = true

  automatic_failover_enabled = var.environment == "production"

  snapshot_retention_limit = 3
  snapshot_window          = "02:00-03:00"
}

# --- S3 for Media ---
resource "aws_s3_bucket" "media" {
  bucket = "${var.project_name}-media-${var.environment}"
}

resource "aws_s3_bucket_versioning" "media" {
  bucket = aws_s3_bucket.media.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "media" {
  bucket = aws_s3_bucket.media.id

  rule {
    id     = "transition-to-ia"
    status = "Enabled"

    transition {
      days          = 90
      storage_class = "STANDARD_IA"
    }

    noncurrent_version_expiration {
      noncurrent_days = 30
    }
  }
}

# --- CloudFront ---
resource "aws_cloudfront_distribution" "main" {
  enabled             = true
  is_ipv6_enabled     = true
  default_root_object = ""
  aliases             = [var.domain_name, "www.${var.domain_name}"]
  price_class         = var.environment == "production" ? "PriceClass_All" : "PriceClass_100"

  # Origin: Next.js via ALB
  origin {
    domain_name = aws_lb.main.dns_name
    origin_id   = "nextjs-alb"

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  # Origin: S3 for media files
  origin {
    domain_name              = aws_s3_bucket.media.bucket_regional_domain_name
    origin_id                = "media-s3"
    origin_access_control_id = aws_cloudfront_origin_access_control.s3.id
  }

  # Default behavior — Next.js
  default_cache_behavior {
    allowed_methods  = ["DELETE", "GET", "HEAD", "OPTIONS", "PATCH", "POST", "PUT"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "nextjs-alb"

    cache_policy_id          = aws_cloudfront_cache_policy.nextjs.id
    origin_request_policy_id = aws_cloudfront_origin_request_policy.nextjs.id

    viewer_protocol_policy = "redirect-to-https"
    compress               = true
  }

  # Media files behavior — S3
  ordered_cache_behavior {
    path_pattern     = "/media/*"
    allowed_methods  = ["GET", "HEAD"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "media-s3"

    cache_policy_id = aws_cloudfront_cache_policy.media.id

    viewer_protocol_policy = "redirect-to-https"
    compress               = true
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  viewer_certificate {
    acm_certificate_arn      = aws_acm_certificate.main.arn
    ssl_support_method       = "sni-only"
    minimum_protocol_version = "TLSv1.2_2021"
  }
}
# infrastructure/aws/variables.tf
variable "project_name" {
  description = "Project name used in resource naming"
  type        = string
  default     = "marketingos"
}

variable "client_name" {
  description = "Client name for tagging"
  type        = string
}

variable "environment" {
  description = "Environment (staging, production)"
  type        = string
  validation {
    condition     = contains(["staging", "production"], var.environment)
    error_message = "Environment must be staging or production."
  }
}

variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "us-east-1"
}

variable "domain_name" {
  description = "Primary domain name"
  type        = string
}

variable "image_tag" {
  description = "Docker image tag to deploy"
  type        = string
  default     = "latest"
}

# --- Compute sizing ---
variable "umbraco_cpu" {
  description = "Umbraco task CPU units (1024 = 1 vCPU)"
  type        = number
  default     = 1024
}

variable "umbraco_memory" {
  description = "Umbraco task memory in MB"
  type        = number
  default     = 2048
}

variable "umbraco_desired_count" {
  description = "Number of Umbraco tasks"
  type        = number
  default     = 2
}

variable "nextjs_cpu" {
  description = "Next.js task CPU units"
  type        = number
  default     = 512
}

variable "nextjs_memory" {
  description = "Next.js task memory in MB"
  type        = number
  default     = 1024
}

variable "nextjs_desired_count" {
  description = "Number of Next.js tasks"
  type        = number
  default     = 2
}

variable "db_instance_class" {
  description = "RDS instance class"
  type        = string
  default     = "db.t3.small"
}

variable "db_password" {
  description = "Database password"
  type        = string
  sensitive   = true
}

variable "redis_node_type" {
  description = "ElastiCache node type"
  type        = string
  default     = "cache.t4g.micro"
}

variable "ecr_repository_url" {
  description = "ECR repository URL for Umbraco image"
  type        = string
}

variable "ecr_nextjs_repository_url" {
  description = "ECR repository URL for Next.js image"
  type        = string
}
# infrastructure/aws/outputs.tf
output "cloudfront_distribution_id" {
  description = "CloudFront distribution ID for cache invalidation"
  value       = aws_cloudfront_distribution.main.id
}

output "cloudfront_domain" {
  description = "CloudFront domain name"
  value       = aws_cloudfront_distribution.main.domain_name
}

output "alb_dns_name" {
  description = "ALB DNS name"
  value       = aws_lb.main.dns_name
}

output "rds_endpoint" {
  description = "RDS endpoint"
  value       = aws_db_instance.sqlserver.endpoint
  sensitive   = true
}

output "redis_endpoint" {
  description = "Redis primary endpoint"
  value       = aws_elasticache_replication_group.redis.primary_endpoint_address
  sensitive   = true
}

output "media_bucket" {
  description = "S3 media bucket name"
  value       = aws_s3_bucket.media.bucket
}

output "ecs_cluster_name" {
  description = "ECS cluster name"
  value       = aws_ecs_cluster.main.name
}
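
Both the ECS container health check earlier and the Azure App Service health check later in this part assume a `/api/health` route in the Next.js app. Here is a minimal sketch; `healthPayload` is a hypothetical helper, and the real route file would export `GET` from `app/api/health/route.js`:

```javascript
// Hypothetical /api/health handler logic. Kept dependency-free on purpose:
// if the health check also pinged the database, a slow DB would recycle
// perfectly healthy web containers and make an incident worse.
function healthPayload(now = Date.now(), uptimeSeconds = process.uptime()) {
  return {
    status: 'ok',
    uptime: Math.floor(uptimeSeconds),
    timestamp: new Date(now).toISOString(),
  };
}

// In app/api/health/route.js this becomes:
//   export function GET() {
//     return Response.json(healthPayload(), { status: 200 });
//   }
```

The orchestrator only cares about the 200 status; the payload is for humans debugging with curl.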

AWS-Specific Optimizations

CloudFront cache invalidation on content publish. When a content editor publishes a page in Umbraco, we need CloudFront to serve the fresh version. The webhook from Umbraco triggers Next.js ISR revalidation (covered in Part 3), but CloudFront might still serve a stale cached copy. Rather than issuing an invalidation on every publish, we cap how stale HTML is allowed to get with a CloudFront function that sets a short s-maxage:

// infrastructure/aws/cloudfront-functions/cache-control.js
// CloudFront Function — runs at the edge, manipulates cache headers

function handler(event) {
  var response = event.response;
  var headers = response.headers;

  // HTML pages: short CDN cache, let ISR handle freshness
  var uri = event.request.uri;
  if (uri.endsWith('/') || uri.endsWith('.html') || !uri.includes('.')) {
    headers['cache-control'] = { value: 's-maxage=60, stale-while-revalidate=86400' };
  }

  // Static assets: long cache with immutable
  if (uri.match(/\.(js|css|woff2|png|jpg|webp|avif|svg|ico)$/)) {
    headers['cache-control'] = { value: 'public, max-age=31536000, immutable' };
  }

  return response;
}
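
For launch days where even a 60-second stale window matters, the publish webhook can also issue an explicit invalidation against the distribution (its ID is exposed as the cloudfront_distribution_id output). Here is a sketch using the AWS SDK v3; toInvalidationPaths and invalidateOnPublish are hypothetical helpers, not part of the MarketingOS codebase:

```javascript
// Hypothetical publish hook: invalidate the CloudFront paths touched by an
// Umbraco publish. Assumes @aws-sdk/client-cloudfront is installed and the
// distribution ID comes from the Terraform output.

// Pure helper: map published page URLs to deduplicated invalidation paths,
// covering the page itself plus anything nested under it.
function toInvalidationPaths(urls) {
  const paths = new Set();
  for (const url of urls) {
    const path = new URL(url).pathname;
    paths.add(path);
    paths.add(path.endsWith('/') ? `${path}*` : `${path}/*`);
  }
  return [...paths];
}

async function invalidateOnPublish(distributionId, publishedUrls) {
  // Lazy require keeps the pure helper usable without the SDK installed.
  const { CloudFrontClient, CreateInvalidationCommand } =
    require('@aws-sdk/client-cloudfront');
  const items = toInvalidationPaths(publishedUrls);
  const client = new CloudFrontClient({});
  await client.send(new CreateInvalidationCommand({
    DistributionId: distributionId,
    InvalidationBatch: {
      CallerReference: `publish-${Date.now()}`,
      Paths: { Quantity: items.length, Items: items },
    },
  }));
}
```

Batch publish events before calling this: CloudFront includes 1,000 free invalidation paths per month and bills per path after that, and a wildcard counts as a single path.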

Auto-scaling for traffic spikes. Configure ECS auto-scaling based on CPU and request count:

# infrastructure/aws/autoscaling.tf
resource "aws_appautoscaling_target" "nextjs" {
  max_capacity       = 10
  min_capacity       = var.nextjs_desired_count
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.nextjs.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "nextjs_cpu" {
  name               = "nextjs-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.nextjs.resource_id
  scalable_dimension = aws_appautoscaling_target.nextjs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.nextjs.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = 70.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

resource "aws_appautoscaling_policy" "nextjs_requests" {
  name               = "nextjs-request-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.nextjs.resource_id
  scalable_dimension = aws_appautoscaling_target.nextjs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.nextjs.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${aws_lb.main.arn_suffix}/${aws_lb_target_group.nextjs.arn_suffix}"
    }
    target_value       = 1000
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
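
A useful mental model for target tracking is that capacity scales roughly in proportion to how far the metric sits above its target. This is a simplification for capacity planning, not the exact CloudWatch algorithm:

```javascript
// Simplified model of ECS target-tracking scaling: desired count grows
// proportionally with the metric, clamped to the configured min/max capacity.
// Not the exact CloudWatch math; useful for back-of-envelope planning only.
function estimateDesiredCount(current, metricValue, targetValue, min, max) {
  const proportional = Math.ceil(current * (metricValue / targetValue));
  return Math.min(max, Math.max(min, proportional));
}

// Launch-day spike: 2 tasks each seeing 3,000 requests against the 1,000
// target scales toward 6 tasks; a larger spike hits the 10-task ceiling.
```

This also explains the asymmetric cooldowns above: scale out fast (60s) because a launch spike is urgent, scale in slowly (300s) because flapping capacity is worse than briefly over-provisioning.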

CloudWatch alarms for the things that actually matter:

# infrastructure/aws/monitoring.tf
resource "aws_cloudwatch_metric_alarm" "umbraco_unhealthy" {
  alarm_name          = "${var.project_name}-umbraco-unhealthy-${var.environment}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "UnHealthyHostCount"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Maximum"
  threshold           = 0
  alarm_description   = "Umbraco has unhealthy targets"

  dimensions = {
    TargetGroup  = aws_lb_target_group.umbraco.arn_suffix
    LoadBalancer = aws_lb.main.arn_suffix
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_cpu" {
  alarm_name          = "${var.project_name}-rds-cpu-${var.environment}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/RDS"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "RDS CPU utilization is high"

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.sqlserver.identifier
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

resource "aws_cloudwatch_metric_alarm" "rds_storage" {
  alarm_name          = "${var.project_name}-rds-storage-${var.environment}"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 1
  metric_name         = "FreeStorageSpace"
  namespace           = "AWS/RDS"
  period              = 300
  statistic           = "Minimum"
  threshold           = 5000000000  # 5 GB
  alarm_description   = "RDS free storage is below 5GB"

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.sqlserver.identifier
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

resource "aws_cloudwatch_metric_alarm" "error_rate" {
  alarm_name          = "${var.project_name}-5xx-rate-${var.environment}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5

  metric_query {
    id          = "error_rate"
    expression  = "errors/requests*100"
    label       = "5xx Error Rate"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "HTTPCode_Target_5XX_Count"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_lb.main.arn_suffix
      }
    }
  }

  metric_query {
    id = "requests"
    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"
      dimensions = {
        LoadBalancer = aws_lb.main.arn_suffix
      }
    }
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

AWS Cost Estimate

| Component | Small Site | Medium Site |
|---|---|---|
| ECS Fargate (2 tasks) | $30 | $60 |
| RDS SQL Server Express (db.t3.small) | $25 | $50 |
| ElastiCache Redis (cache.t4g.micro) | $12 | $24 |
| CloudFront (100GB transfer) | $10 | $25 |
| S3 (10GB media) | $1 | $3 |
| ALB | $18 | $18 |
| NAT Gateway | $32 | $32 |
| CloudWatch | $5 | $10 |
| Total | $133/month | $222/month |

The NAT Gateway is the sneaky cost that catches people off guard. $0.045/hour plus data processing charges adds up to $32+/month just for the gateway. For staging environments, consider using a NAT instance (t3.micro) instead, or placing Fargate tasks in public subnets with assign_public_ip = true (less secure, but $32/month cheaper).
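
The arithmetic behind that number, with the hourly rate assumed at the us-east-1 price (data processing is billed separately per GB):

```javascript
// NAT Gateway baseline cost: hourly rate times hours in an average month.
// Assumes us-east-1 pricing; per-GB data processing charges come on top.
function natGatewayBaseMonthly(hourlyRate = 0.045, hoursPerMonth = 730) {
  return hourlyRate * hoursPerMonth;
}
// 0.045 * 730 comes to roughly $32.85 before any data charges
```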

Option 3: Azure Deployment

Azure is the natural home for .NET applications. App Service for Linux provides native .NET 10 hosting with deployment slots, built-in health checks, and Application Insights integration. For MarketingOS, Azure often ends up cheaper than AWS for the same workload because App Service is more cost-effective than ECS Fargate for always-on containers.

Architecture Overview

Internet
  └─→ Azure CDN / Front Door
       ├─→ Azure Blob Storage (media files)
       └─→ App Service — Next.js (Linux, Node 22)
            └─→ App Service — Umbraco (Linux, .NET 10)
                 ├─→ Azure SQL Database
                 ├─→ Azure Cache for Redis
                 └─→ Azure Blob Storage (media via Umbraco)

Terraform Configuration

# infrastructure/azure/main.tf
terraform {
  required_version = ">= 1.7.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }

  backend "azurerm" {
    resource_group_name  = "marketingos-tfstate"
    storage_account_name = "marketingostfstate"
    container_name       = "tfstate"
    key                  = "azure.terraform.tfstate"
  }
}

provider "azurerm" {
  features {
    resource_group {
      prevent_deletion_if_contains_resources = true
    }
  }
}

# --- Resource Group ---
resource "azurerm_resource_group" "main" {
  name     = "rg-${var.project_name}-${var.environment}"
  location = var.azure_location

  tags = {
    Project     = "MarketingOS"
    Environment = var.environment
    Client      = var.client_name
    ManagedBy   = "Terraform"
  }
}

# --- App Service Plan ---
resource "azurerm_service_plan" "main" {
  name                = "asp-${var.project_name}-${var.environment}"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  os_type             = "Linux"
  sku_name            = var.app_service_sku

  tags = azurerm_resource_group.main.tags
}

# --- App Service: Umbraco ---
resource "azurerm_linux_web_app" "umbraco" {
  name                = "app-${var.project_name}-umbraco-${var.environment}"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  service_plan_id     = azurerm_service_plan.main.id

  site_config {
    always_on = true

    application_stack {
      dotnet_version = "10.0"
    }

    health_check_path                 = "/umbraco/ping"
    health_check_eviction_time_in_min = 5

    ip_restriction_default_action = "Allow"
  }

  app_settings = {
    "ASPNETCORE_ENVIRONMENT"               = "Production"
    "Umbraco__CMS__DeliveryApi__Enabled"   = "true"
    "Umbraco__CMS__DeliveryApi__ApiKey"    = var.delivery_api_key
    "Redis__ConnectionString"              = "${azurerm_redis_cache.main.hostname}:${azurerm_redis_cache.main.ssl_port},password=${azurerm_redis_cache.main.primary_access_key},ssl=True,abortConnect=False"
    "APPLICATIONINSIGHTS_CONNECTION_STRING" = azurerm_application_insights.main.connection_string
  }

  connection_string {
    name  = "umbracoDbDSN"
    type  = "SQLAzure"
    value = "Server=tcp:${azurerm_mssql_server.main.fully_qualified_domain_name},1433;Initial Catalog=${azurerm_mssql_database.main.name};Persist Security Info=False;User ID=${var.sql_admin_username};Password=${var.sql_admin_password};MultipleActiveResultSets=True;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;"
  }

  identity {
    type = "SystemAssigned"
  }

  tags = azurerm_resource_group.main.tags
}

# --- Deployment Slots for Zero-Downtime ---
resource "azurerm_linux_web_app_slot" "umbraco_staging" {
  name           = "staging"
  app_service_id = azurerm_linux_web_app.umbraco.id

  site_config {
    always_on = true

    application_stack {
      dotnet_version = "10.0"
    }

    health_check_path = "/umbraco/ping"
  }

  app_settings = azurerm_linux_web_app.umbraco.app_settings

  connection_string {
    name  = "umbracoDbDSN"
    type  = "SQLAzure"
    value = "Server=tcp:${azurerm_mssql_server.main.fully_qualified_domain_name},1433;Initial Catalog=${azurerm_mssql_database.main.name};Persist Security Info=False;User ID=${var.sql_admin_username};Password=${var.sql_admin_password};MultipleActiveResultSets=True;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;"
  }

  tags = azurerm_resource_group.main.tags
}

# --- App Service: Next.js ---
resource "azurerm_linux_web_app" "nextjs" {
  name                = "app-${var.project_name}-nextjs-${var.environment}"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  service_plan_id     = azurerm_service_plan.main.id

  site_config {
    always_on = true

    application_stack {
      node_version = "22-lts"
    }

    health_check_path = "/api/health"
  }

  app_settings = {
    "NODE_ENV"            = "production"
    "UMBRACO_API_URL"     = "https://${azurerm_linux_web_app.umbraco.default_hostname}"
    "UMBRACO_API_KEY"     = var.delivery_api_key
    "REVALIDATION_SECRET" = var.revalidation_secret
    "REDIS_URL"           = "rediss://:${azurerm_redis_cache.main.primary_access_key}@${azurerm_redis_cache.main.hostname}:${azurerm_redis_cache.main.ssl_port}"
    "APPLICATIONINSIGHTS_CONNECTION_STRING" = azurerm_application_insights.main.connection_string
  }

  identity {
    type = "SystemAssigned"
  }

  tags = azurerm_resource_group.main.tags
}

# --- Azure SQL Database ---
resource "azurerm_mssql_server" "main" {
  name                         = "sql-${var.project_name}-${var.environment}"
  resource_group_name          = azurerm_resource_group.main.name
  location                     = azurerm_resource_group.main.location
  version                      = "12.0"
  administrator_login          = var.sql_admin_username
  administrator_login_password = var.sql_admin_password
  minimum_tls_version          = "1.2"

  tags = azurerm_resource_group.main.tags
}

resource "azurerm_mssql_database" "main" {
  name        = "sqldb-${var.project_name}-${var.environment}"
  server_id   = azurerm_mssql_server.main.id
  sku_name    = var.sql_sku
  max_size_gb = var.sql_max_size_gb

  short_term_retention_policy {
    retention_days = 7
  }

  long_term_retention_policy {
    weekly_retention = "P4W"
  }

  tags = azurerm_resource_group.main.tags
}

# Allow Azure services to access the SQL server
resource "azurerm_mssql_firewall_rule" "azure_services" {
  name             = "AllowAzureServices"
  server_id        = azurerm_mssql_server.main.id
  start_ip_address = "0.0.0.0"
  end_ip_address   = "0.0.0.0"
}

# --- Azure Cache for Redis ---
resource "azurerm_redis_cache" "main" {
  name                = "redis-${var.project_name}-${var.environment}"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  capacity             = var.redis_capacity
  family               = var.redis_family
  sku_name             = var.redis_sku
  non_ssl_port_enabled = false
  minimum_tls_version  = "1.2"

  redis_configuration {
    maxmemory_policy = "allkeys-lru"
  }

  tags = azurerm_resource_group.main.tags
}

# --- Blob Storage for Media ---
resource "azurerm_storage_account" "media" {
  name                     = "${replace(var.project_name, "-", "")}media${var.environment}"
  resource_group_name      = azurerm_resource_group.main.name
  location                 = azurerm_resource_group.main.location
  account_tier             = "Standard"
  account_replication_type = var.environment == "production" ? "GRS" : "LRS"
  min_tls_version          = "TLS1_2"

  blob_properties {
    versioning_enabled = true

    delete_retention_policy {
      days = 30
    }
  }

  tags = azurerm_resource_group.main.tags
}

resource "azurerm_storage_container" "media" {
  name                  = "media"
  storage_account_id    = azurerm_storage_account.media.id
  container_access_type = "blob"
}

# --- Azure CDN ---
resource "azurerm_cdn_profile" "main" {
  name                = "cdn-${var.project_name}-${var.environment}"
  location            = "global"
  resource_group_name = azurerm_resource_group.main.name
  sku                 = "Standard_Microsoft"

  tags = azurerm_resource_group.main.tags
}

resource "azurerm_cdn_endpoint" "media" {
  name                = "cdn-media-${var.project_name}-${var.environment}"
  profile_name        = azurerm_cdn_profile.main.name
  location            = "global"
  resource_group_name = azurerm_resource_group.main.name

  origin {
    name      = "media-blob"
    host_name = azurerm_storage_account.media.primary_blob_host
  }

  is_compression_enabled = true
  content_types_to_compress = [
    "text/css",
    "text/javascript",
    "application/javascript",
    "application/json",
    "image/svg+xml",
  ]

  tags = azurerm_resource_group.main.tags
}

# --- Application Insights ---
resource "azurerm_application_insights" "main" {
  name                = "ai-${var.project_name}-${var.environment}"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  application_type    = "web"
  retention_in_days   = 30

  tags = azurerm_resource_group.main.tags
}
# infrastructure/azure/variables.tf
variable "project_name" {
  type    = string
  default = "marketingos"
}

variable "client_name" {
  type = string
}

variable "environment" {
  type = string
  validation {
    condition     = contains(["staging", "production"], var.environment)
    error_message = "Environment must be staging or production."
  }
}

variable "azure_location" {
  type    = string
  default = "East US"
}

variable "app_service_sku" {
  description = "App Service plan SKU (B1, S1, P1v3, etc.)"
  type        = string
  default     = "B1"
}

variable "sql_admin_username" {
  type      = string
  sensitive = true
}

variable "sql_admin_password" {
  type      = string
  sensitive = true
}

variable "sql_sku" {
  description = "Azure SQL Database SKU (Basic, S0, S1, etc.)"
  type        = string
  default     = "Basic"
}

variable "sql_max_size_gb" {
  type    = number
  default = 2
}

variable "redis_capacity" {
  type    = number
  default = 0
}

variable "redis_family" {
  type    = string
  default = "C"
}

variable "redis_sku" {
  type    = string
  default = "Basic"
}

variable "delivery_api_key" {
  type      = string
  sensitive = true
}

variable "revalidation_secret" {
  type      = string
  sensitive = true
}

Azure-Specific Advantages

Deployment slots are the killer feature for Azure App Service. In the Terraform above, we created a staging slot for the Umbraco app. The deployment workflow is:

  1. Deploy new code to the staging slot
  2. Azure warms up the staging slot (runs health checks)
  3. Swap staging and production slots (instant, no downtime)
  4. If something is wrong, swap back

This is genuinely zero-downtime deployment, and it’s simpler than the ECS rolling update or Kubernetes blue-green patterns. Here are the GitHub Actions steps:

# Deploy to Azure App Service with slot swap
- name: Deploy to staging slot
  uses: azure/webapps-deploy@v3
  with:
    app-name: app-marketingos-umbraco-production
    slot-name: staging
    images: ghcr.io/${{ github.repository_owner }}/marketingos-backend:${{ github.sha }}

- name: Wait for staging health check
  run: |
    for i in {1..30}; do
      STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
        https://app-marketingos-umbraco-production-staging.azurewebsites.net/umbraco/ping)
      if [ "$STATUS" == "200" ]; then
        echo "Staging slot is healthy"
        exit 0
      fi
      echo "Waiting for staging slot... (attempt $i/30)"
      sleep 10
    done
    echo "Staging slot never became healthy; aborting before swap"
    exit 1

- name: Swap slots
  uses: azure/CLI@v2
  with:
    inlineScript: |
      az webapp deployment slot swap \
        --resource-group rg-marketingos-production \
        --name app-marketingos-umbraco-production \
        --slot staging \
        --target-slot production

Application Insights gives you distributed tracing, performance monitoring, and failure diagnostics out of the box. For .NET applications, it’s the best APM option — it understands .NET internals, SQL queries, HTTP calls, and dependency chains natively. And it’s included in the Application Insights resource we created in Terraform.

Azure Cost Estimate

| Component | Small Site | Medium Site |
|---|---|---|
| App Service Plan | $13 (B1) | $54 (S1) |
| Azure SQL Database | $5 (Basic) | $15 (S0) |
| Azure Cache for Redis | $16 (Basic C0) | $40 (Standard C1) |
| Azure CDN (Standard) | $3 | $10 |
| Blob Storage (10GB) | $1 (LRS) | $3 (GRS) |
| Application Insights (5GB/mo) | $0 (free tier) | $0 (free tier) |
| Total | $38/month | $122/month |

Azure is notably cheaper than AWS for this workload, primarily because App Service doesn’t have the NAT Gateway tax, and the Basic tier of Azure SQL is cheaper than the smallest RDS SQL Server instance. The trade-off is less granular auto-scaling — App Service auto-scale requires Standard tier or higher.

Infrastructure as Code Best Practices

After managing Terraform for a dozen client deployments, here are the patterns that saved me.

Module Structure

infrastructure/
├── modules/
│   ├── self-hosted/       # Ubuntu + Docker Compose
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── aws/               # ECS + RDS + CloudFront
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── autoscaling.tf
│   │   └── monitoring.tf
│   └── azure/             # App Service + SQL + CDN
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── environments/
│   ├── client-bakery/
│   │   ├── main.tf        # Uses self-hosted module
│   │   └── terraform.tfvars
│   ├── client-fintech/
│   │   ├── main.tf        # Uses aws module
│   │   ├── staging.tfvars
│   │   └── production.tfvars
│   └── client-startup/
│       ├── main.tf        # Uses azure module
│       ├── staging.tfvars
│       └── production.tfvars
└── shared/
    ├── terraform-state/   # Bootstrap for remote state
    └── dns/               # Cloudflare DNS management

Each client gets a directory in environments/. The main.tf file instantiates the appropriate module:

# infrastructure/environments/client-fintech/main.tf
module "infrastructure" {
  source = "../../modules/aws"

  project_name  = "fintech-marketing"
  client_name   = "FinTech Corp"
  environment   = terraform.workspace  # staging or production
  domain_name   = "marketing.fintechcorp.com"
  aws_region    = "us-east-1"

  umbraco_cpu           = 1024
  umbraco_memory        = 2048
  umbraco_desired_count = 2

  nextjs_cpu           = 512
  nextjs_memory        = 1024
  nextjs_desired_count = 3

  db_instance_class = "db.t3.medium"
  db_password       = var.db_password

  redis_node_type = "cache.t4g.small"

  ecr_repository_url        = "123456789.dkr.ecr.us-east-1.amazonaws.com/marketingos-backend"
  ecr_nextjs_repository_url = "123456789.dkr.ecr.us-east-1.amazonaws.com/marketingos-frontend"
  image_tag                 = var.image_tag
}

Remote State Management

Never store Terraform state locally. For AWS, use S3 + DynamoDB locking. For Azure, use Azure Storage. Here’s the bootstrap:

# infrastructure/shared/terraform-state/aws-backend.tf
resource "aws_s3_bucket" "terraform_state" {
  bucket = "marketingos-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
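
Each client environment then points its backend at this bucket and lock table. A sketch (the per-client state key is a convention, and the file name is illustrative):

```hcl
# infrastructure/environments/client-fintech/backend.tf (sketch)
terraform {
  backend "s3" {
    bucket         = "marketingos-terraform-state"
    key            = "client-fintech/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```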

Environment Management with Workspaces

Use Terraform workspaces to manage staging and production from the same configuration:

# Create and switch workspaces
cd infrastructure/environments/client-fintech
terraform workspace new staging
terraform workspace new production

# Deploy to staging
terraform workspace select staging
terraform apply -var-file="staging.tfvars"

# Deploy to production
terraform workspace select production
terraform apply -var-file="production.tfvars"

The workspace name is available as terraform.workspace in your configuration, which we use for environment-specific settings like Multi-AZ on RDS, CloudFront price class, and auto-scaling minimums.
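
If you prefer not to duplicate environment values across tfvars files, the workspace name can drive them directly. A sketch mirroring the production conditionals used in the modules above:

```hcl
# Sketch: derive environment settings from the workspace instead of tfvars.
locals {
  environment   = terraform.workspace
  is_production = terraform.workspace == "production"
}

module "infrastructure" {
  source      = "../../modules/aws"
  environment = local.environment

  # Inside the module, var.environment == "production" then switches
  # Multi-AZ, CloudFront price class, and scaling minimums.
  # ... remaining inputs as in the client-fintech example ...
}
```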

Monitoring and Observability

Infrastructure is only half the picture. Once the application is running, you need to know when things go wrong before your clients do.

OpenTelemetry in Umbraco

We instrument the .NET backend with OpenTelemetry for distributed tracing and metrics:

// backend/src/MarketingOS.Web/Program.cs
using System.Reflection; // needed for the GetCustomAttribute<T>() extension below
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;
using Serilog;
using Serilog.Events;

var builder = WebApplication.CreateBuilder(args);

// --- Serilog Configuration ---
Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Information()
    .MinimumLevel.Override("Microsoft", LogEventLevel.Warning)
    .MinimumLevel.Override("Microsoft.EntityFrameworkCore", LogEventLevel.Warning)
    .MinimumLevel.Override("Umbraco", LogEventLevel.Warning)
    .Enrich.FromLogContext()
    .Enrich.WithMachineName()
    .Enrich.WithEnvironmentName()
    .Enrich.WithProperty("Application", "MarketingOS.Backend")
    .WriteTo.Console(outputTemplate:
        "[{Timestamp:HH:mm:ss} {Level:u3}] {SourceContext}: {Message:lj}{NewLine}{Exception}")
    .WriteTo.Seq(
        builder.Configuration["Seq:ServerUrl"] ?? "http://localhost:5341",
        apiKey: builder.Configuration["Seq:ApiKey"])
    .CreateLogger();

builder.Host.UseSerilog();

// --- OpenTelemetry Configuration ---
var serviceName = "MarketingOS.Backend";
var serviceVersion = typeof(Program).Assembly
    .GetCustomAttribute<System.Reflection.AssemblyInformationalVersionAttribute>()?
    .InformationalVersion ?? "1.0.0";

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService(
            serviceName: serviceName,
            serviceVersion: serviceVersion,
            serviceInstanceId: Environment.MachineName))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation(options =>
        {
            // Filter out health check and static file requests
            options.Filter = httpContext =>
                !httpContext.Request.Path.StartsWithSegments("/umbraco/ping") &&
                !httpContext.Request.Path.StartsWithSegments("/umbraco/lib");
        })
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation(options =>
        {
            options.SetDbStatementForText = true;
            options.RecordException = true;
        })
        .AddSource("MarketingOS.*")
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri(
                builder.Configuration["OpenTelemetry:Endpoint"]
                ?? "http://localhost:4317");
        }))
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddProcessInstrumentation()
        .AddMeter("MarketingOS.*")
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri(
                builder.Configuration["OpenTelemetry:Endpoint"]
                ?? "http://localhost:4317");
        }));

// --- Custom Metrics ---
builder.Services.AddSingleton(serviceProvider =>
{
    var meter = new System.Diagnostics.Metrics.Meter("MarketingOS.DeliveryApi");
    return meter;
});

// Add Umbraco and other services
builder.CreateUmbracoBuilder()
    .AddBackOffice()
    .AddWebsite()
    .AddDeliveryApi()
    .AddComposers()
    .Build();

var app = builder.Build();

app.UseSerilogRequestLogging(options =>
{
    options.EnrichDiagnosticContext = (diagnosticContext, httpContext) =>
    {
        diagnosticContext.Set("RequestHost", httpContext.Request.Host.Value);
        diagnosticContext.Set("UserAgent", httpContext.Request.Headers["User-Agent"].ToString());
    };
});

await app.BootUmbracoAsync();
app.UseUmbraco()
    .WithMiddleware(u =>
    {
        u.UseBackOffice();
        u.UseWebsite();
    })
    .WithEndpoints(u =>
    {
        u.UseBackOfficeEndpoints();
        u.UseWebsiteEndpoints();
    });

await app.RunAsync();

Key decisions in this configuration:

  1. Serilog for structured logging. Every log entry is a structured event with properties, not a flat string. This makes log searching and dashboards dramatically better. We send logs to Seq (self-hosted) or to the cloud provider’s native logging.

  2. SQL Client instrumentation with statement capture. SetDbStatementForText = true captures the actual SQL query text in traces. This is invaluable for debugging slow queries but should be disabled if your queries contain sensitive data.

  3. Filter health checks from traces. Without this filter, your tracing dashboard is 90% health check noise. The /umbraco/ping and /umbraco/lib exclusions keep traces focused on real user requests.

  4. Custom metrics. The MarketingOS.DeliveryApi meter lets us track custom business metrics like content API response times, cache hit rates, and content type request distributions.
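To make that concrete, here is a sketch of how a service might consume that meter and record custom instruments. The class, instrument names, and tags below are illustrative examples, not part of the template itself:

```csharp
using System.Diagnostics.Metrics;

// Hypothetical metrics helper for the Delivery API.
// Instrument names here are examples — pick names that fit your conventions.
public class DeliveryApiMetrics
{
    private readonly Counter<long> _contentRequests;
    private readonly Histogram<double> _responseDuration;

    public DeliveryApiMetrics(Meter meter)
    {
        // Counter: total content requests, tagged by content type
        _contentRequests = meter.CreateCounter<long>(
            "marketingos.content.requests",
            description: "Number of Delivery API content requests");

        // Histogram: response duration in milliseconds
        _responseDuration = meter.CreateHistogram<double>(
            "marketingos.content.duration",
            unit: "ms",
            description: "Delivery API response duration");
    }

    public void RecordRequest(string contentType, double elapsedMs)
    {
        var tag = new KeyValuePair<string, object?>("content.type", contentType);
        _contentRequests.Add(1, tag);
        _responseDuration.Record(elapsedMs, tag);
    }
}
```

Because the OpenTelemetry configuration above calls `.AddMeter("MarketingOS.*")`, anything recorded on the `MarketingOS.DeliveryApi` meter flows to the OTLP exporter automatically.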

Structured Logging Conventions

I enforce a logging convention across the team. Every log statement follows this pattern:

// Good - structured, searchable, useful
_logger.LogInformation(
    "Content published: {ContentType} '{ContentName}' (ID: {ContentId}) by {UserName}",
    content.ContentType.Alias,
    content.Name,
    content.Id,
    currentUser.Name);

// Bad - string interpolation, unsearchable
_logger.LogInformation(
    $"Content published: {content.ContentType.Alias} '{content.Name}'");

The structured approach means you can search for “all publishes of contentType:landingPage” or “all actions by userName:editor@client.com” in your log aggregator. The string interpolation approach gives you a wall of text that only grep can sort through.
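If you want the convention enforced at compile time rather than by code review, .NET's `LoggerMessage` source generator is one option. The helper class and event ID below are hypothetical, shown only to illustrate the pattern:

```csharp
using Microsoft.Extensions.Logging;

// Hypothetical compile-time-generated log method.
// The source generator fills in the partial method body.
public static partial class ContentLog
{
    [LoggerMessage(
        EventId = 1001,
        Level = LogLevel.Information,
        Message = "Content published: {ContentType} '{ContentName}' (ID: {ContentId}) by {UserName}")]
    public static partial void ContentPublished(
        this ILogger logger,
        string contentType,
        string contentName,
        int contentId,
        string userName);
}

// Usage:
// _logger.ContentPublished(content.ContentType.Alias, content.Name,
//     content.Id, currentUser.Name);
```

Besides keeping the message template in one place, the generated code avoids boxing and string formatting when the log level is disabled.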

Cost Comparison Matrix

Here’s the full comparison across all three options at different traffic levels:

| Factor | Self-Hosted Ubuntu | AWS | Azure |
|---|---|---|---|
| **Small site (<10K visits/mo)** | | | |
| Monthly cost | $10-20 | $100-150 | $40-80 |
| Setup time | 2-3 hours | 4-6 hours | 3-5 hours |
| SSL | Let’s Encrypt (auto) | ACM (free) | App Service managed |
| **Medium site (10K-100K visits/mo)** | | | |
| Monthly cost | $30-50 | $200-400 | $120-250 |
| Auto-scaling | Manual (resize VPS) | Automatic (ECS) | Automatic (App Service S1+) |
| CDN | CloudFlare free tier | CloudFront | Azure CDN |
| **Large site (100K+ visits/mo)** | | | |
| Monthly cost | $50-100* | $400-800 | $250-500 |
| Multi-region | Not practical | Multi-region ECS | App Service multi-region |
| SLA | Best-effort | 99.99% (composite) | 99.95% (App Service) |
| **Operational** | | | |
| Deployment | SSH + docker compose | ECS rolling update | Deployment slot swap |
| Monitoring | Uptime Kuma (self-hosted) | CloudWatch + X-Ray | Application Insights |
| Backups | Cron + S3 | RDS automated | Azure SQL automated |
| Disaster recovery | Manual | Multi-AZ + cross-region | Geo-replication option |
| Team skill required | Linux, Docker | AWS, Terraform | Azure, Terraform |

*Self-hosted at 100K+ visits works, but it requires careful tuning and accepting the risk of a single point of failure.

My recommendation for most MarketingOS clients: Start with self-hosted for the first site. When a client needs SLA guarantees, compliance, or global distribution, deploy to Azure if they’re a .NET shop or AWS if they’re already in that ecosystem. The Docker images are identical in all three paths — the only thing that changes is the infrastructure underneath.

What’s Next

We now have three deployment paths, each with Infrastructure as Code, monitoring, backups, and CI/CD integration. The MarketingOS template can serve a $20/month bakery website and a $3,000/month enterprise marketing platform from the same codebase.

In Part 9, we’ll tie it all together: the template onboarding automation that spins up a new client site in under an hour, the real cost analysis from 12 months of running multiple client sites, and the honest retrospective — what worked, what I’d do differently, and the lessons learned from building a reusable marketing website template with Umbraco 17 and Next.js.

The hardest part of this project wasn’t any single technical decision. It was making sure all the pieces — content model, rendering, SEO, AI, testing, CI/CD, and infrastructure — work together as a coherent system. Part 9 is where we see if they do.


This is Part 8 of a 9-part series on building a reusable marketing website template with Umbraco 17 and Next.js. The series follows the development of MarketingOS, a template that reduces marketing website delivery from weeks to under an hour.

Series outline:

  1. Architecture & Setup — Why this stack, ADRs, solution structure, Docker Compose
  2. Content Modeling — Document types, compositions, Block List page builder, Content Delivery API
  3. Next.js Rendering — Server Components, ISR, block renderer, component library, multi-tenant
  4. SEO & Performance — Metadata, JSON-LD, sitemaps, Core Web Vitals optimization
  5. AI Content with Gemini — Content generation, translation, SEO optimization, review workflow
  6. Testing — xUnit, Jest, Playwright, Pact contract tests, visual regression
  7. Docker & CI/CD — Multi-stage builds, GitHub Actions, environment promotion
  8. Infrastructure — Self-hosted Ubuntu, AWS, Azure, Terraform, monitoring (this post)
  9. Template & Retrospective — Onboarding automation, cost analysis, lessons learned