Six posts of architecture. Time to ship it.
Taking a Clean Architecture .NET 10 application from localhost to production involves decisions the architecture books don’t cover: Docker images, CI pipelines, observability, and the uncomfortable realization that some of your architectural choices were over-engineered. This is the part where theory meets reality — and reality has opinions.
Throughout this series we’ve built Kids Learn from the Domain layer up: rich entities with domain events, CQRS handlers in the Application layer, Infrastructure implementations, Presentation with Minimal APIs, a vertical slices hybrid, and a thorough testing strategy. Now we’re going to containerize it, set up a CI/CD pipeline that enforces the Dependency Rule automatically, wire up observability so we actually know what’s happening in production, tune performance until it’s fast enough for impatient 8-year-olds, and then I’m going to be honest about what was worth the effort and what wasn’t.
This is the final post. Let’s make it count.
Docker and Local Development
Every developer on the Kids Learn team needs to run the full system locally — the API, PostgreSQL with pgvector for embedding search, and Redis for caching. Docker Compose makes this reproducible. No more “works on my machine” when someone forgets to install the pgvector extension.
Here’s our multi-stage Dockerfile for the API:
# Stage 1: Build
FROM mcr.microsoft.com/dotnet/sdk:10.0 AS build
WORKDIR /src

# Copy solution and project files first (layer caching)
COPY ["KidsLearn.sln", "."]
COPY ["src/KidsLearn.Domain/KidsLearn.Domain.csproj", "src/KidsLearn.Domain/"]
COPY ["src/KidsLearn.Application/KidsLearn.Application.csproj", "src/KidsLearn.Application/"]
COPY ["src/KidsLearn.Infrastructure/KidsLearn.Infrastructure.csproj", "src/KidsLearn.Infrastructure/"]
COPY ["src/KidsLearn.Api/KidsLearn.Api.csproj", "src/KidsLearn.Api/"]
RUN dotnet restore

# Copy everything else and build
COPY . .
RUN dotnet publish "src/KidsLearn.Api/KidsLearn.Api.csproj" \
    -c Release \
    -o /app/publish \
    --no-restore

# Stage 2: Runtime
FROM mcr.microsoft.com/dotnet/aspnet:10.0 AS runtime
WORKDIR /app

# The aspnet base image does not include curl, which the HEALTHCHECK below needs
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

# Create non-root user
RUN addgroup --system appgroup && adduser --system --ingroup appgroup appuser
COPY --from=build /app/publish .

# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD curl -f http://localhost:8080/health/live || exit 1

USER appuser
EXPOSE 8080
ENTRYPOINT ["dotnet", "KidsLearn.Api.dll"]
The multi-stage build is important. The SDK image is ~900MB. The runtime image is ~220MB. In production, you don’t need the compiler, NuGet packages, or source code sitting in your container. The layer caching strategy matters too — we copy .csproj files and restore first, so NuGet restore only re-runs when dependencies actually change. On a typical code-only change, the build starts from the COPY . . layer.
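One companion file worth mentioning: a .dockerignore keeps bin/, obj/, and other local artifacts out of the build context, so COPY . . stays fast and doesn't invalidate cache layers with local noise. A minimal version (adjust the entries to your repo) might be:

```
.git/
.vs/
**/bin/
**/obj/
**/*.user
TestResults/
docker-compose*.yml
```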
Here’s the Docker Compose for local development:
# docker-compose.yml
services:
  api:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8080:8080"
    environment:
      - ASPNETCORE_ENVIRONMENT=Development
      - ConnectionStrings__DefaultConnection=Host=postgres;Database=kidslearn;Username=postgres;Password=devpassword
      - ConnectionStrings__Redis=redis:6379
      - GeminiAi__ApiKey=${GEMINI_API_KEY}
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy

  postgres:
    image: pgvector/pgvector:pg17
    ports:
      - "5432:5432"
    environment:
      POSTGRES_DB: kidslearn
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: devpassword
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

volumes:
  pgdata:
Notice we’re using pgvector/pgvector:pg17 instead of the standard PostgreSQL image. This comes with the pgvector extension pre-installed, which we need for embedding similarity search on lesson content. The health checks on depends_on ensure the API doesn’t start until both PostgreSQL and Redis are actually ready to accept connections — not just that the containers are running.
For day-to-day development, though, we actually prefer .NET Aspire over raw Docker Compose. Aspire gives us a dashboard with logs, traces, and metrics for every service, plus automatic service discovery:
// AppHost/Program.cs (.NET Aspire orchestrator)
var builder = DistributedApplication.CreateBuilder(args);

var postgres = builder.AddPostgres("postgres")
    .WithImage("pgvector/pgvector", "pg17") // swap in the pgvector-enabled Postgres image
    .WithDataVolume()
    .WithPgAdmin();

var kidslearnDb = postgres.AddDatabase("kidslearn");

var redis = builder.AddRedis("redis")
    .WithRedisInsight();

var api = builder.AddProject<Projects.KidsLearn_Api>("api")
    .WithReference(kidslearnDb)
    .WithReference(redis)
    .WithExternalHttpEndpoints();

builder.Build().Run();
Aspire is genuinely excellent for development. You run dotnet run on the AppHost project and everything starts up with a nice dashboard at localhost:15888. Connection strings are injected automatically. When you’re debugging a slow request, you can trace it through the Aspire dashboard without setting up Jaeger locally. We still use Docker Compose for CI and production-like environments, but for daily development, Aspire wins.
CI/CD with GitHub Actions
Our CI/CD pipeline has one job I’m particularly proud of: architecture tests that fail the build if anyone violates the Dependency Rule. You can talk about Clean Architecture all day in code reviews, but automated enforcement is what actually keeps the architecture clean over time.
Here’s the complete GitHub Actions workflow:
# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  DOTNET_VERSION: "10.0.x"
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: ${{ env.DOTNET_VERSION }}
      - name: Restore dependencies
        run: dotnet restore
      - name: Build
        run: dotnet build --no-restore --configuration Release
      - name: Run Domain unit tests
        run: dotnet test tests/KidsLearn.Domain.Tests --no-build -c Release --logger "trx;LogFileName=domain-results.trx"
      - name: Run Application unit tests
        run: dotnet test tests/KidsLearn.Application.Tests --no-build -c Release --logger "trx;LogFileName=application-results.trx"
      - name: Run Architecture tests
        run: dotnet test tests/KidsLearn.Architecture.Tests --no-build -c Release --logger "trx;LogFileName=arch-results.trx"
      - name: Upload test results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: test-results
          path: "**/TestResults/*.trx"

  integration-tests:
    runs-on: ubuntu-latest
    needs: build-and-test
    steps:
      - uses: actions/checkout@v4
      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: ${{ env.DOTNET_VERSION }}
      - name: Run integration tests (Testcontainers)
        run: dotnet test tests/KidsLearn.Integration.Tests -c Release --logger "trx;LogFileName=integration-results.trx"
        env:
          TESTCONTAINERS_RYUK_DISABLED: "false"
      - name: Upload integration test results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: integration-test-results
          path: "**/TestResults/*.trx"

  docker-build:
    runs-on: ubuntu-latest
    needs: [build-and-test, integration-tests]
    if: github.ref == 'refs/heads/main'
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha
            type=raw,value=latest
      - name: Build and push Docker image
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

  deploy-staging:
    runs-on: ubuntu-latest
    needs: docker-build
    environment: staging
    steps:
      - name: Deploy to staging
        run: |
          # Deploy using your preferred method: kubectl, docker compose, etc.
          echo "Deploying ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:sha-${GITHUB_SHA::7} to staging"
          # kubectl set image deployment/kidslearn-api api=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:sha-${GITHUB_SHA::7}
      - name: Run smoke tests against staging
        run: |
          # Quick health check against staging
          curl -f https://staging.kidslearn.app/health/ready || exit 1

  deploy-production:
    runs-on: ubuntu-latest
    needs: deploy-staging
    environment:
      name: production
      url: https://kidslearn.app
    steps:
      - name: Deploy to production
        run: |
          echo "Deploying ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:sha-${GITHUB_SHA::7} to production"
          # kubectl set image deployment/kidslearn-api api=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:sha-${GITHUB_SHA::7}
      - name: Verify production health
        run: |
          sleep 15
          curl -f https://kidslearn.app/health/ready || exit 1
The pipeline flow is deliberate. Unit tests and architecture tests run first because they’re fast — under 30 seconds total. If the Dependency Rule is violated, we fail immediately without wasting time on integration tests. Integration tests use Testcontainers to spin up a real PostgreSQL instance with pgvector, so they test actual database queries and EF Core migrations. No mocking the database in integration tests — that defeats the purpose.
The architecture tests deserve a closer look. These use NetArchTest to enforce our Clean Architecture rules at the CI level:
[Fact]
public void Domain_Should_Not_Reference_Any_Other_Project()
{
    var result = Types.InAssembly(DomainAssembly)
        .ShouldNot()
        .HaveDependencyOnAny(
            "KidsLearn.Application",
            "KidsLearn.Infrastructure",
            "KidsLearn.Api")
        .GetResult();

    result.IsSuccessful.Should().BeTrue(
        $"Domain layer has forbidden dependencies: {string.Join(", ", result.FailingTypeNames ?? [])}");
}

[Fact]
public void Application_Should_Not_Reference_Infrastructure_Or_Presentation()
{
    var result = Types.InAssembly(ApplicationAssembly)
        .ShouldNot()
        .HaveDependencyOnAny(
            "KidsLearn.Infrastructure",
            "KidsLearn.Api")
        .GetResult();

    result.IsSuccessful.Should().BeTrue(
        $"Application layer has forbidden dependencies: {string.Join(", ", result.FailingTypeNames ?? [])}");
}
These tests have caught three real violations in pull requests. Once, a developer imported an EF Core extension method directly in an Application layer handler — convenient, but it would have coupled Application to Infrastructure. The architecture test caught it, the PR failed, the developer moved the logic to an Infrastructure service behind an interface. That’s the system working as designed.
The staging environment uses GitHub’s environment protection rules — you can require manual approval before production deployment if your team prefers that. We run smoke tests against staging automatically, then a team lead approves the production deploy. The whole pipeline from push to production takes about 8 minutes when everything passes.
Observability with OpenTelemetry
Clean Architecture gives you well-defined layers. OpenTelemetry lets you see requests flowing through those layers in production. When a parent reports that lesson generation is slow, I want to see exactly where the time is going: was it the Gemini API call, the pgvector similarity search, the Redis cache miss, or the EF Core query?
We use three pillars of observability: structured logging with Serilog, distributed tracing with OpenTelemetry, and custom metrics.
First, structured logging. In a containerized environment, you want JSON to stdout. No file-based logs, no text formatting — just structured JSON that your log aggregator can parse:
// Program.cs — Serilog configuration
builder.Host.UseSerilog((context, loggerConfig) =>
{
    loggerConfig
        .ReadFrom.Configuration(context.Configuration)
        .Enrich.FromLogContext()
        .Enrich.WithProperty("Application", "KidsLearn.Api")
        .Enrich.WithProperty("Environment", context.HostingEnvironment.EnvironmentName)
        .WriteTo.Console(new RenderedCompactJsonFormatter());
});
Every log entry includes the application name and environment. When you’re aggregating logs from multiple services, you’ll thank yourself for this. We also enrich logs with custom properties at the handler level — the child’s ID, the lesson subject, the curriculum standard being targeted:
public async Task<LessonResult> Handle(GenerateLessonCommand request, CancellationToken ct)
{
    using var _ = _logger.BeginScope(new Dictionary<string, object>
    {
        ["ChildId"] = request.ChildId,
        ["Subject"] = request.Subject,
        ["MasteryLevel"] = request.CurrentMasteryLevel
    });

    _logger.LogInformation("Generating adaptive lesson for child");
    // ... handler logic
}
For distributed tracing, we configure OpenTelemetry in the DI setup:
// Infrastructure/DependencyInjection.cs
builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService("KidsLearn.Api", serviceVersion: "1.0.0"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation(opt =>
        {
            opt.SetDbStatementForText = true; // See actual SQL in traces
        })
        .AddRedisInstrumentation()
        .AddSource("KidsLearn.Application")       // Custom activity sources
        .AddSource("KidsLearn.Infrastructure.Ai")
        .AddOtlpExporter(opt =>
        {
            opt.Endpoint = new Uri(builder.Configuration["Otlp:Endpoint"]!);
        }))
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddMeter("KidsLearn.Lessons")
        .AddMeter("KidsLearn.Ai")
        .AddOtlpExporter());
The key insight is adding custom activity sources for our application code. The built-in ASP.NET Core and EF Core instrumentation gives you the outer request and database queries, but the interesting stuff happens in between. We create custom activities in our handlers:
public class GenerateLessonCommandHandler
{
    private static readonly ActivitySource ActivitySource = new("KidsLearn.Application");

    public async Task<LessonResult> Handle(GenerateLessonCommand request, CancellationToken ct)
    {
        using var activity = ActivitySource.StartActivity("GenerateLesson");
        activity?.SetTag("child.id", request.ChildId.ToString());
        activity?.SetTag("subject", request.Subject);

        // 1. Fetch child's mastery data
        using (var fetchActivity = ActivitySource.StartActivity("FetchMasteryData"))
        {
            var mastery = await _masteryRepository.GetByChildIdAsync(request.ChildId, ct);
            fetchActivity?.SetTag("mastery.level", mastery.CurrentLevel.ToString());
        }

        // 2. Find similar content via pgvector ('embedding' is computed earlier; elided here)
        using (var searchActivity = ActivitySource.StartActivity("SimilaritySearch"))
        {
            var similar = await _contentRepository.FindSimilarAsync(embedding, topK: 5, ct);
            searchActivity?.SetTag("results.count", similar.Count);
        }

        // 3. Generate with Gemini ('prompt' assembly elided)
        using (var aiActivity = ActivitySource.StartActivity("GeminiGeneration"))
        {
            var lesson = await _aiService.GenerateLessonAsync(prompt, ct);
            aiActivity?.SetTag("ai.tokens.input", lesson.InputTokens);
            aiActivity?.SetTag("ai.tokens.output", lesson.OutputTokens);
            aiActivity?.SetTag("ai.model", "gemini-2.0-flash");
        }

        // Mapping the generated lesson to a LessonResult is elided.
        return result;
    }
}
Now when I look at a trace in Jaeger or Grafana Tempo, I see the full picture: the HTTP request comes into the Minimal API endpoint (about 1.25 seconds total), the handler starts (2ms for MediatR dispatch), mastery data is fetched from PostgreSQL (8ms), pgvector similarity search runs (15ms), and the Gemini API call takes 1.2 seconds. Instantly I know that 96% of the time is in the AI call, and optimization efforts should focus there — or on caching the result.
For custom metrics, we track the things that matter for Kids Learn specifically:
public class LessonMetrics
{
    private readonly Histogram<double> _generationLatency;
    private readonly Counter<long> _lessonsGenerated;
    private readonly Histogram<double> _aiTokenCost;
    private readonly Histogram<double> _masteryProgression;

    public LessonMetrics(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("KidsLearn.Lessons");

        _generationLatency = meter.CreateHistogram<double>(
            "kidslearn.lesson.generation_latency",
            unit: "ms",
            description: "Time to generate an adaptive lesson");

        _lessonsGenerated = meter.CreateCounter<long>(
            "kidslearn.lesson.generated_total",
            description: "Total lessons generated");

        _aiTokenCost = meter.CreateHistogram<double>(
            "kidslearn.ai.token_cost",
            unit: "usd",
            description: "Estimated cost per AI generation");

        _masteryProgression = meter.CreateHistogram<double>(
            "kidslearn.mastery.progression_rate",
            description: "Mastery level change per lesson");
    }

    public void RecordGeneration(double latencyMs, string subject, double tokenCost)
    {
        _generationLatency.Record(latencyMs, new KeyValuePair<string, object?>("subject", subject));
        _lessonsGenerated.Add(1, new KeyValuePair<string, object?>("subject", subject));
        _aiTokenCost.Record(tokenCost);
    }
}
The aiTokenCost metric is particularly useful. Gemini charges per token, and we want to catch it early if a prompt change causes token usage to spike. We set up a Grafana alert that fires if the average cost per generation exceeds $0.003 — that’s happened twice, both times because someone changed the system prompt to include more context than necessary.
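For context, the alert threshold comes from a simple cost model. The sketch below shows the shape of that calculation; the per-1K-token rates are placeholders for illustration, not Gemini's actual pricing:

```csharp
using System;

public static class TokenCostEstimator
{
    // Placeholder rates in USD per 1,000 tokens. These are NOT Gemini's real
    // prices; substitute the current rates for whichever model you call.
    private const double InputRatePer1K = 0.0001;
    private const double OutputRatePer1K = 0.0004;

    // Estimated cost of one generation, as recorded into the token-cost histogram.
    public static double Estimate(long inputTokens, long outputTokens)
        => (inputTokens / 1000.0) * InputRatePer1K
         + (outputTokens / 1000.0) * OutputRatePer1K;
}
```

With these placeholder rates, a generation that consumes 10,000 input and 2,000 output tokens costs an estimated $0.0018, comfortably under the $0.003 alert threshold; doubling the prompt size is exactly the kind of change that pushes it over.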
Health checks are the final piece. We expose two endpoints: /health/live for Kubernetes liveness probes (is the process running?) and /health/ready for readiness probes (can we serve traffic?):
builder.Services.AddHealthChecks()
    .AddNpgSql(connectionString, name: "postgresql", tags: ["ready"])
    .AddRedis(redisConnectionString, name: "redis", tags: ["ready"])
    .AddUrlGroup(new Uri("https://generativelanguage.googleapis.com/"),
        name: "gemini-api", tags: ["ready"]);

app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false // Just checks if the app responds
});

app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
});
The readiness check verifies PostgreSQL, Redis, and the Gemini API are reachable. If PostgreSQL goes down, Kubernetes stops routing traffic to that pod until it recovers. The liveness check is intentionally minimal — if the process can respond to HTTP, it’s alive. Don’t put database checks in the liveness probe or you’ll get cascading restarts when the database has a brief hiccup.
Performance Optimization
Kids Learn serves children. Children are not patient. If a page takes more than a couple seconds to load, they’ll switch to YouTube. We set strict performance budgets: standard API endpoints must respond in under 200ms, and AI-powered lesson generation must complete in under 2 seconds.
Meeting those targets on a Clean Architecture codebase required specific optimizations.
EF Core compiled queries for hot paths. Every time a child opens the app, we fetch their progress data. That’s our hottest query. EF Core compiles LINQ expressions to SQL on every execution by default. Compiled queries eliminate that overhead:
public class ChildProgressRepository : IChildProgressRepository
{
    private readonly KidsLearnDbContext _context;

    public ChildProgressRepository(KidsLearnDbContext context) => _context = context;

    // Compiled query — the expression tree is compiled once, reused forever
    private static readonly Func<KidsLearnDbContext, Guid, Task<ChildProgress?>> GetByChildIdQuery =
        EF.CompileAsyncQuery((KidsLearnDbContext context, Guid childId) =>
            context.ChildProgress
                .Include(p => p.MasteryRecords)
                .Include(p => p.RecentLessons.OrderByDescending(l => l.CompletedAt).Take(10))
                .AsSplitQuery()
                .FirstOrDefault(p => p.ChildId == childId));

    public Task<ChildProgress?> GetByChildIdAsync(Guid childId, CancellationToken ct)
        => GetByChildIdQuery(_context, childId);
}
Notice the AsSplitQuery() — when you have multiple Include() calls, EF Core generates a single query with JOINs by default, which can produce a cartesian explosion. Split queries execute separate SQL statements for each include, which is often faster for complex object graphs. For this specific query, split queries reduced response time from 45ms to 12ms.
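To make that cartesian explosion concrete, here's a back-of-the-envelope row count; the record counts are illustrative, not taken from the real schema:

```csharp
using System;

// Illustrative counts for one child: 12 mastery records and 10 recent lessons.
int masteryRecords = 12;
int recentLessons = 10;

// A single query JOINs both collections against the root row, so rows multiply
// (and every row repeats the parent's columns).
int singleQueryRows = masteryRecords * recentLessons;

// Split queries run one SELECT per include, so rows add: root + each collection.
int splitQueryRows = 1 + masteryRecords + recentLessons;

Console.WriteLine($"single query: {singleQueryRows} rows, split queries: {splitQueryRows} rows");
```

For this shape, the JOIN version ships 120 wide, duplicated rows while split queries ship 23 narrow ones, which is why split queries tend to win for multi-collection graphs.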
Redis caching for repeated data. Curriculum standards don’t change often. A child’s progress data is read far more than it’s written. Generated lessons can be cached for children at the same mastery level studying the same topic. We use a simple caching decorator pattern:
public class CachedChildProgressRepository : IChildProgressRepository
{
    private readonly IChildProgressRepository _inner;
    private readonly IDistributedCache _cache;
    private static readonly TimeSpan CacheDuration = TimeSpan.FromMinutes(5);

    public CachedChildProgressRepository(IChildProgressRepository inner, IDistributedCache cache)
        => (_inner, _cache) = (inner, cache);

    public async Task<ChildProgress?> GetByChildIdAsync(Guid childId, CancellationToken ct)
    {
        var cacheKey = $"progress:{childId}";

        // IDistributedCache stores bytes/strings, so we serialize with System.Text.Json
        var cachedJson = await _cache.GetStringAsync(cacheKey, ct);
        if (cachedJson is not null)
            return JsonSerializer.Deserialize<ChildProgress>(cachedJson);

        var progress = await _inner.GetByChildIdAsync(childId, ct);
        if (progress is not null)
        {
            await _cache.SetStringAsync(
                cacheKey,
                JsonSerializer.Serialize(progress),
                new DistributedCacheEntryOptions { AbsoluteExpirationRelativeToNow = CacheDuration },
                ct);
        }

        return progress;
    }
}

// Registration — decorator pattern with Scrutor
services.AddScoped<IChildProgressRepository, ChildProgressRepository>();
services.Decorate<IChildProgressRepository, CachedChildProgressRepository>();
This is where Clean Architecture pays off beautifully. The Application layer depends on IChildProgressRepository. It has no idea whether it’s hitting the database directly or going through a cache. We added caching without changing a single line in any handler.
Connection pooling with Npgsql. PostgreSQL connections are expensive to create. Npgsql’s connection pooling is enabled by default, but the defaults aren’t always right. For Kids Learn, we tuned the pool based on our load patterns:
// Pool sizing lives in the connection string, not on the options builder:
// "...;Maximum Pool Size=100;Minimum Pool Size=10;Connection Idle Lifetime=300"
builder.Services.AddDbContext<KidsLearnDbContext>(options =>
{
    options.UseNpgsql(connectionString, npgsqlOptions =>
    {
        npgsqlOptions.EnableRetryOnFailure(
            maxRetryCount: 3,
            maxRetryDelay: TimeSpan.FromSeconds(5),
            errorCodesToAdd: null);
    });
});
pgvector HNSW index tuning. The embedding similarity search is critical for finding relevant lesson content. pgvector supports two index types: IVFFlat and HNSW. We use HNSW because it has better query performance at the cost of slower index building and more memory:
-- Create HNSW index on lesson content embeddings
CREATE INDEX idx_lesson_content_embedding ON lesson_contents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 128);
-- At query time, set the search breadth
SET hnsw.ef_search = 64;
The m parameter controls how many connections each node has in the graph (higher = better recall, more memory). ef_construction controls index build quality. ef_search controls query-time accuracy. We started with defaults and tuned based on recall measurements — for our dataset of ~50,000 lesson content embeddings, m=16 and ef_search=64 give us 98.5% recall@10 with a P95 query time of 8ms. Good enough.
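For reference, vector_cosine_ops orders results by cosine distance, which is 1 minus cosine similarity. A minimal sketch of the exact computation the HNSW index approximates:

```csharp
using System;

public static class VectorMath
{
    // Cosine distance = 1 - (a·b) / (|a| * |b|).
    // 0 means the vectors point the same way; 1 means orthogonal; 2 means opposite.
    public static double CosineDistance(double[] a, double[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("dimension mismatch");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        return 1.0 - dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

The index trades exactness for speed: instead of computing this distance against all ~50,000 embeddings, HNSW walks a layered graph and inspects only the ef_search most promising candidates.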
The results. Here’s what these optimizations achieved:
| Endpoint | Before | After |
|---|---|---|
| GET /api/children/{id}/progress | 95ms | 18ms (cache hit: 3ms) |
| GET /api/lessons/recommended | 180ms | 45ms |
| POST /api/lessons/generate | 3.2s | 1.6s |
| GET /api/curriculum/standards | 120ms | 8ms (cache hit: 2ms) |
| pgvector similarity search (top 10) | 35ms | 8ms |
The lesson generation endpoint is still the slowest at 1.6 seconds, but most of that is the Gemini API call. We can’t optimize that away — we can only cache aggressively and use streaming responses so the UI feels responsive while the AI is generating.
Native AOT Considerations
.NET 10 brings significant improvements to Native AOT (Ahead-of-Time compilation). Instead of JIT-compiling code at runtime, AOT compiles everything to native machine code at build time. The benefits are real: cold start times drop from 500ms to 5-40ms, memory usage decreases by 30-50%, and the published binary is smaller because it doesn’t include the JIT compiler.
For a container-based deployment where pods scale up and down frequently, those cold start numbers matter. But AOT comes with trade-offs that interact with Clean Architecture patterns in interesting ways.
What breaks with AOT: Anything that relies on runtime reflection. MediatR uses reflection to discover and resolve handlers. If you’re using MediatR with AOT, you need to use its source generator mode or pre-register all handler types explicitly. Some EF Core features that rely on runtime code generation also need attention — though EF Core 10 has significantly improved AOT support with compiled models.
What works well: Wolverine, which we discussed in Part 5, uses source generators instead of reflection. This makes it naturally AOT-compatible. Minimal APIs with source generators also work well. The basic dependency injection container works fine — it’s resolved at build time, not runtime.
Here’s how we configured AOT for the API project:
<!-- KidsLearn.Api.csproj -->
<PropertyGroup>
  <PublishAot>true</PublishAot>
  <TrimMode>full</TrimMode>
  <JsonSerializerIsReflectionEnabledByDefault>false</JsonSerializerIsReflectionEnabledByDefault>
  <InvariantGlobalization>false</InvariantGlobalization>
</PropertyGroup>

<!-- Source generators for JSON serialization -->
<ItemGroup>
  <PackageReference Include="System.Text.Json" Version="10.0.0" />
</ItemGroup>
With AOT, you need explicit JSON serialization contexts instead of relying on reflection-based serialization:
[JsonSerializable(typeof(LessonResponse))]
[JsonSerializable(typeof(ChildProgressResponse))]
[JsonSerializable(typeof(List<CurriculumStandardResponse>))]
[JsonSerializable(typeof(ProblemDetails))]
[JsonSerializable(typeof(ValidationProblemDetails))]
public partial class AppJsonContext : JsonSerializerContext { }

// In Program.cs
builder.Services.ConfigureHttpJsonOptions(options =>
{
    options.SerializerOptions.TypeInfoResolverChain.Insert(0, AppJsonContext.Default);
});
Our pragmatic decision: We use AOT for the main API container and standard runtime for background job processing. The API benefits most from fast cold starts — when Kubernetes scales up a new pod during a traffic spike, 30ms startup versus 500ms startup means users don’t notice. Background jobs don’t have cold start pressure, and they use some reflection-heavy libraries for job scheduling that aren’t worth making AOT-compatible.
# AOT-optimized Dockerfile
FROM mcr.microsoft.com/dotnet/sdk:10.0 AS build
WORKDIR /src

# Native AOT compilation needs a native toolchain in the build image
RUN apt-get update \
    && apt-get install -y --no-install-recommends clang zlib1g-dev \
    && rm -rf /var/lib/apt/lists/*

COPY . .
RUN dotnet publish "src/KidsLearn.Api/KidsLearn.Api.csproj" \
    -c Release \
    -r linux-x64 \
    -o /app/publish

# AOT doesn't need the .NET runtime — just the OS
FROM mcr.microsoft.com/dotnet/runtime-deps:10.0 AS runtime
WORKDIR /app
COPY --from=build /app/publish .
USER $APP_UID
EXPOSE 8080
ENTRYPOINT ["./KidsLearn.Api"]
Notice we use runtime-deps instead of aspnet for the runtime image. AOT-compiled applications don’t need the .NET runtime — they only need the native OS libraries. This brings the final image size down from ~220MB to ~80MB.
The Honest Retrospective
I promised honesty in this series, and this is the section that delivers on that promise. After building Kids Learn with Clean Architecture over several months, shipping it to production, and maintaining it through feature additions and infrastructure changes, here’s what I’d tell my past self.
What Was Worth Every Minute
The separate Domain layer. This is the single best decision we made. The Domain layer has zero dependencies on frameworks, databases, or external services. Testing it is a dream — pure C# objects with business logic, tested with xUnit and nothing else. When we added new curriculum standards for mathematics, we wrote the domain logic and tests first, deployed with confidence, and never worried about database schemas or API contracts until later. The Domain layer has 94% test coverage and those tests run in under 2 seconds. That feedback loop is addictive.
CQRS separation. Having separate command and query models was worth it from month two onward. Our read models are optimized for the UI — flat DTOs with exactly the fields the frontend needs. Our command handlers enforce business rules through the Domain layer. When we needed to add a denormalized read model for the parent dashboard (showing aggregated progress across all children), we added a new query handler without touching any existing command logic. The separation also made it natural to add Redis caching on the read side without affecting writes.
Architecture tests in CI. I keep coming back to this. Three violations caught in pull requests. Three times a developer took a shortcut that would have introduced coupling, and the automated test said no. One was a direct EF Core reference in an Application handler. One was an Infrastructure project referencing the API project (circular dependency). One was a Domain entity importing a System.Text.Json attribute. Each one was a 5-minute fix in the PR, but would have been a painful refactoring if discovered months later.
What Was Over-Engineered
Generic repository abstractions. We started with an IRepository<T> base interface with methods like GetByIdAsync, GetAllAsync, AddAsync, UpdateAsync, DeleteAsync. It looked clean and DRY. In practice, every concrete repository needed different query methods — GetByChildIdWithMasteryRecords, FindSimilarByEmbedding, GetByCurriculumStandardAndGrade. The generic interface was used by maybe 20% of our actual data access patterns. We eventually moved to specific repository interfaces per aggregate root (like IChildProgressRepository with methods that match actual use cases) and the generic base became dead code. Should have started with specific interfaces from day one.
Too many pipeline behaviors. We had a logging behavior, a validation behavior, a performance monitoring behavior, an authorization behavior, a transaction behavior, and a caching behavior — all as MediatR pipeline behaviors. For a team of four developers, this was machinery for machinery’s sake. The logging behavior duplicated what OpenTelemetry already gave us. The performance monitoring behavior duplicated what our metrics already tracked. The authorization behavior added complexity when ASP.NET Core’s built-in authorization policies would have been simpler. We ended up keeping validation and transactions, and removing the rest. Two pipeline behaviors, not six.
Custom exception types for every error. We had ChildNotFoundException, LessonGenerationFailedException, CurriculumStandardNotSupportedException, MasteryLevelOutOfRangeException, AiQuotaExceededException, and about fifteen more. Each had custom properties and custom handling in the exception middleware. The Result pattern (which we discussed in Part 3) replaced most of these. Now we have domain exceptions for truly exceptional situations and Result objects for expected failures. The exception hierarchy was deleted, and nobody missed it.
What I’d Do Differently
Start with the vertical slices hybrid from day one. In Part 5, we showed how to organize by feature instead of by layer. We migrated to this structure three months in, and it was the best refactoring we did. But the migration itself took a week and touched every file. If I started over, I’d use the feature-based structure from the first commit — Clean Architecture layers as the dependency direction, feature folders as the organization.
Use Wolverine instead of MediatR from the start. MediatR is fine. It works. But Wolverine’s source generators give you better AOT support, built-in message bus integration, and no need for separate pipeline behaviors. We migrated incrementally (running both side by side), but it would have been cleaner to start with Wolverine. If you’re starting a new .NET 10 project in 2026, seriously evaluate Wolverine before defaulting to MediatR.
Invest in integration tests earlier. We had excellent unit test coverage on Domain and Application layers from the beginning. But we didn’t write integration tests until month two. The integration tests — using Testcontainers with a real PostgreSQL instance — caught bugs that unit tests never would: incorrect EF Core mappings, missing database indexes, wrong pgvector distance functions. If I could redo it, integration tests would be part of the first sprint, not the third.
When Clean Architecture Is Overkill
I’ve been advocating for Clean Architecture for seven posts, so let me be clear about when you should not use it.
CRUD applications. If your app is fundamentally forms-over-data with little business logic, Clean Architecture adds layers without adding value. A simple Minimal API that uses EF Core directly is the right choice. The architectural overhead of interfaces, handlers, and separate projects is not justified when your "business logic" is "save this to the database."
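For contrast, here is roughly what that "no layers" alternative looks like — a hedged sketch with illustrative types (`AppDb`, `Todo`), not a recommendation for Kids Learn itself:

```csharp
using Microsoft.EntityFrameworkCore;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddDbContext<AppDb>(o =>
    o.UseNpgsql(builder.Configuration.GetConnectionString("Db")));
var app = builder.Build();

// Endpoint talks straight to EF Core. No handler, no repository,
// no MediatR — and for forms-over-data, nothing is lost.
app.MapGet("/todos/{id:int}", async (int id, AppDb db) =>
    await db.Todos.FindAsync(id) is { } todo
        ? Results.Ok(todo)
        : Results.NotFound());

app.MapPost("/todos", async (Todo todo, AppDb db) =>
{
    db.Todos.Add(todo);
    await db.SaveChangesAsync();
    return Results.Created($"/todos/{todo.Id}", todo);
});

app.Run();

public class Todo { public int Id { get; set; } public string Title { get; set; } = ""; }
public class AppDb : DbContext
{
    public AppDb(DbContextOptions<AppDb> options) : base(options) { }
    public DbSet<Todo> Todos => Set<Todo>();
}
```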
Small prototypes and MVPs. When you’re validating a product idea, speed of iteration matters more than architectural purity. Use a single-project Minimal API. If the product succeeds and the complexity grows, you can introduce Clean Architecture layers then. Premature architecture is just as real as premature optimization.
Microservices with fewer than 5 endpoints. If a service has 3 API endpoints and straightforward logic, a feature-folder Minimal API project is plenty. Clean Architecture’s value comes from managing complexity — if there isn’t much complexity, there isn’t much value.
The core insight is this: Clean Architecture is not about the folder structure. It’s not about having four projects named Domain, Application, Infrastructure, and Presentation. It’s about the Dependency Rule — source code dependencies can only point inward, toward higher-level policies. If you understand that one rule, the rest follows naturally. You can apply it in a single project with namespaces. You can apply it in a modular monolith. You can apply it in microservices. The Dependency Rule is the principle. Everything else is an implementation detail.
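And because the Dependency Rule is about source dependencies, you can enforce it mechanically regardless of whether you use one project or four. A sketch using NetArchTest.Rules (the same tool from Part 6); `Child` here stands in for any type anchoring the Domain assembly:

```csharp
using NetArchTest.Rules;
using Xunit;

public class DependencyRuleTests
{
    [Fact]
    public void Domain_should_not_depend_on_outer_layers()
    {
        var result = Types.InAssembly(typeof(Child).Assembly)
            .ShouldNot()
            .HaveDependencyOnAny(
                "KidsLearn.Application",
                "KidsLearn.Infrastructure",
                "KidsLearn.Api")
            .GetResult();

        // Failing type names make the offending inward-pointing
        // dependency obvious in the test output.
        Assert.True(result.IsSuccessful,
            string.Join(", ", result.FailingTypeNames ?? Array.Empty<string>()));
    }
}
```

The same style of test works against namespaces in a single project, which is exactly the point: the rule, not the project layout, is what you protect.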
Series Recap
We’ve covered a lot of ground across seven posts. Here’s the journey:
Part 1: Foundations and the Dependency Rule — We established why Clean Architecture exists, what problem it solves, and the one rule that governs everything: dependencies point inward. We set up the Kids Learn solution structure and established the project references that enforce the Dependency Rule at compile time.
Part 2: Rich Domain Layer — We built the core of Kids Learn: Child and Lesson aggregate roots with value objects, domain events, and encapsulated business rules. The Domain layer depends on nothing and contains the most important logic in the system — how children progress through curriculum, how mastery is calculated, how lesson difficulty adapts.
Part 3: Application Layer with CQRS — We added the orchestration layer with commands, queries, handlers, and the MediatR pipeline. Validation with FluentValidation. The Result pattern for clean error handling. This layer coordinates the use cases without containing business rules or knowing about databases.
Part 4: Infrastructure and Presentation — We implemented the outer layers: EF Core 10 with PostgreSQL, Redis caching, Gemini AI integration in Infrastructure; Minimal APIs with Carter in Presentation. These layers depend inward and can be swapped without touching Domain or Application.
Part 5: Vertical Slices Hybrid — We reorganized from layers-first to features-first, getting the best of both Clean Architecture (dependency direction) and Vertical Slice Architecture (feature cohesion). We introduced Wolverine as an alternative to MediatR and showed how both approaches coexist.
Part 6: Testing Strategy — We built a comprehensive testing approach: unit tests for Domain logic, handler tests for Application behavior, integration tests with Testcontainers for Infrastructure, architecture tests with NetArchTest for the Dependency Rule, and E2E tests for critical user flows.
Part 7: Production and Retrospective — This post. We containerized, built the CI/CD pipeline, wired up observability, optimized performance, considered Native AOT, and got honest about what worked and what didn’t.
Closing
Clean Architecture didn’t make Kids Learn perfect. It made it maintainable.
When we added the teacher portal three months after launch, it took a week instead of a month. The Domain layer already had the concepts we needed — we added a Teacher aggregate, new Application handlers for teacher-specific use cases, and new Minimal API endpoints. The existing child progress tracking, lesson generation, and mastery calculation worked without modification.
When EF Core 10 shipped with vector search improvements, we swapped the implementation in Infrastructure without touching a single line in Domain or Application. The IContentRepository interface still had the same FindSimilarAsync method — only the implementation behind it changed to use EF Core’s new built-in vector search instead of raw SQL.
When Gemini updated their API and our lesson generation broke, the fix was isolated to one class in Infrastructure: GeminiLessonGenerationService. The Application layer’s ILessonGenerationService interface didn’t change. No handler was modified. No test outside of Infrastructure tests needed updating.
That’s the promise of Clean Architecture — and for Kids Learn, it delivered.
Is it more upfront work than throwing everything in one project? Absolutely. Is it worth it for a system you plan to maintain and evolve for years? In my experience, yes. Every time.
But remember: the architecture is a means, not an end. The goal was never to have a perfect folder structure or the most abstract interfaces. The goal was to build a platform where children learn effectively, where teachers get useful tools, and where the development team can ship features confidently. Clean Architecture was the vehicle. The product is what matters.
Ship it. Monitor it. Iterate on it. And don’t be afraid to admit when something you built was over-engineered — that honesty is how you get better.