I built an entire SaaS product — from the first idea to paying customers — using AI agents as my co-pilots at every stage. Not just for writing code. For market analysis, architecture decisions, database schema design, test strategy, deployment pipelines, and even copywriting.

The product is Kids Learn, an AI-powered educational platform that creates personalized learning paths for children aged 4-12. It uses Gemini to generate adaptive content, a RAG knowledge base for curriculum alignment, and real-time progress tracking for parents. The tech stack: Next.js, PostgreSQL with pgvector, Google Gemini, and Cloudflare for edge delivery.

Here’s how I went from a napkin idea to production, and how Claude and agentic AI transformed every phase of the journey.

SaaS Development Lifecycle: From Idea to Production — 5 phases showing the iterative journey from ideation through design, implementation, testing, and production deployment

Phase 1: Ideation and Market Analysis

Before writing a single line of code, I spent two weeks validating the idea. This is where most side projects die — you skip validation, build for three months, then discover nobody wants it.

Using Claude for Market Research

I used Claude as a research analyst. Not “give me a business plan” prompts — specific, targeted research questions:

  • “What are the top 10 complaints parents have about existing educational apps on the App Store? Cite specific review patterns.”
  • “What’s the difference between adaptive learning and personalized learning in EdTech? Which has better retention data?”
  • “Analyze the pricing models of ABCmouse, Khan Academy Kids, and Homer. What’s the sweet spot for a new entrant?”

Claude’s analysis surfaced three key insights that shaped the product:

  1. Parents hate subscription fatigue — but they’ll pay for visible progress. The pricing model needed to tie payment to outcomes, not just access.
  2. Most apps are content dumps — they throw videos and worksheets at kids without adapting to what the child actually struggles with.
  3. Curriculum alignment matters — parents want to know “this maps to what my kid is learning in school,” not just “educational games.”

The Value Proposition Canvas

I had Claude help me build a Value Proposition Canvas. Not as a one-shot generation — as an iterative conversation where I pushed back on assumptions and Claude challenged my biases.

The result: Kids Learn would be an adaptive learning platform that creates personalized daily lessons aligned to school curricula, uses AI to identify knowledge gaps, and gives parents a clear dashboard showing what their child knows vs. what they need to work on.

Competitive Analysis Framework

┌─────────────────────┬──────────┬───────────┬──────────┬───────────┐
│ Feature             │ ABCmouse │ Khan Kids │ Homer    │ Kids Learn│
├─────────────────────┼──────────┼───────────┼──────────┼───────────┤
│ Adaptive difficulty │ Basic    │ None      │ Basic    │ AI-driven │
│ Curriculum aligned  │ Partial  │ Yes       │ No       │ Full      │
│ Parent dashboard    │ Minimal  │ Basic     │ Good     │ Detailed  │
│ Content generation  │ Static   │ Static    │ Static   │ AI + RAG  │
│ Knowledge gap ID    │ No       │ No        │ No       │ Yes       │
│ Multi-language      │ Limited  │ Yes       │ No       │ Yes       │
│ Pricing model       │ Sub      │ Free      │ Sub      │ Outcome   │
└─────────────────────┴──────────┴───────────┴──────────┴───────────┘

Phase 2: System Design and Architecture

This is where I leaned hardest on Claude. Architecture decisions are expensive to reverse. Getting them right upfront saves months.

The Architecture Decision Records (ADRs)

I wrote ADRs for every significant choice, using Claude to argue both sides. Here’s the format:

ADR-001: Next.js over Remix/Astro for the application layer

  • Context: Need SSR for SEO (landing pages), CSR for the learning interface, API routes for backend logic, and ISR for content pages.
  • Decision: Next.js 15 with App Router. The combination of Server Components, Server Actions, and edge middleware covers all rendering patterns in one framework.
  • Rejected: Astro (great for content sites, but the interactive learning UI needs React’s component model). Remix (better data loading patterns, but smaller ecosystem for the AI integrations we need).

ADR-002: PostgreSQL + pgvector over dedicated vector DB

  • Context: Need vector storage for curriculum embeddings, lesson content retrieval, and knowledge gap analysis. Also need relational data for users, subscriptions, progress.
  • Decision: Single PostgreSQL instance with pgvector extension. One database to manage, one backup strategy, transactional consistency between vector ops and relational data.
  • Rejected: Pinecone (adds operational complexity and cost for a managed vector DB when pgvector handles our scale). ChromaDB (great for prototyping, but we need production ACID guarantees).

ADR-003: Gemini as primary AI provider

  • Context: Need multimodal AI for content generation (text + images), content analysis, and adaptive learning algorithms.
  • Decision: Google Gemini 2.0 Flash for real-time adaptive features (fast, cheap), Gemini 2.0 Pro for content generation (higher quality). Claude for development assistance, code review, and complex analysis tasks.
  • Rejected: OpenAI GPT-4o (more expensive at our projected scale, no significant quality advantage for our use case). Local models (too early — need the quality of frontier models for educational content).

High-Level Architecture

The system follows a clean separation between the presentation layer, business logic, AI services, and data layer:

Kids Learn System Architecture — 4-layer architecture showing Presentation (Next.js), API, AI Services (Gemini, RAG), and Data (PostgreSQL, pgvector, Redis, R2) layers

Presentation Layer (Next.js)

  • Landing pages (SSG for performance)
  • Learning interface (Client Components with real-time state)
  • Parent dashboard (Server Components with streaming)
  • Admin panel (protected routes)

API Layer (Next.js API Routes + Server Actions)

  • Authentication (NextAuth.js with Google/email providers)
  • Learning session management
  • Progress tracking and analytics
  • Content delivery and caching

AI Services Layer

  • Gemini content generation service
  • RAG retrieval pipeline (pgvector + hybrid search)
  • Adaptive difficulty engine
  • Knowledge gap analyzer

Data Layer

  • PostgreSQL (users, subscriptions, progress, content metadata)
  • pgvector (curriculum embeddings, lesson embeddings)
  • Redis (session cache, rate limiting, real-time state)
  • Cloudflare R2 (media assets, generated images)
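The rate-limiting piece of the Redis layer is simple counter keys. A minimal fixed-window sketch — an in-memory `Map` stands in for Redis here so the example is self-contained; in production the counter would be an `INCR` + `EXPIRE` on a key like `rl:${userId}:${windowStart}`:

```typescript
// Fixed-window rate limiter. The Map is a stand-in for Redis
// (INCR + EXPIRE per window key) so the sketch runs on its own.
type Clock = () => number;

export function createRateLimiter(
  limit: number,     // max requests per window
  windowMs: number,  // window length in milliseconds
  now: Clock = Date.now,
) {
  const counters = new Map<string, number>();

  return {
    allow(key: string): boolean {
      // Bucket requests by the window they fall into
      const windowStart = Math.floor(now() / windowMs) * windowMs;
      const bucket = `${key}:${windowStart}`;
      const count = (counters.get(bucket) ?? 0) + 1;
      counters.set(bucket, count);
      return count <= limit;
    },
  };
}
```

The injectable clock keeps the limiter testable without sleeping through real windows.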

Database Schema Design

Entity Relationship Diagram — Users, Topics, Lessons, Learning Progress, Session Events, and Curriculum Standards with vector columns for AI features

I had Claude help design the schema with specific constraints in mind: multi-tenancy (each school/family is a tenant), soft deletes, audit trails, and vector columns for AI features.

-- Core user model with role-based access
CREATE TABLE users (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  email VARCHAR(255) UNIQUE NOT NULL,
  name VARCHAR(255) NOT NULL,
  role VARCHAR(20) CHECK (role IN ('parent', 'child', 'teacher', 'admin')),
  parent_id UUID REFERENCES users(id),
  avatar_url TEXT,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- Curriculum-aligned learning topics
CREATE TABLE topics (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  subject VARCHAR(50) NOT NULL,          -- math, reading, science
  grade_level INTEGER NOT NULL,           -- 1-6
  curriculum_standard VARCHAR(100),       -- e.g., "CCSS.MATH.1.OA.A.1"
  title VARCHAR(255) NOT NULL,
  description TEXT,
  prerequisites UUID[] DEFAULT '{}',
  embedding VECTOR(768),                  -- Gemini text-embedding-004
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- AI-generated lessons with difficulty tracking
CREATE TABLE lessons (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  topic_id UUID REFERENCES topics(id) NOT NULL,
  difficulty_level DECIMAL(3,2) CHECK (difficulty_level BETWEEN 0 AND 1),
  content JSONB NOT NULL,                 -- structured lesson content
  generated_by VARCHAR(50) DEFAULT 'gemini-2.0-flash',
  content_embedding VECTOR(768),
  quality_score DECIMAL(3,2),
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Per-child progress tracking
CREATE TABLE learning_progress (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  child_id UUID REFERENCES users(id) NOT NULL,
  topic_id UUID REFERENCES topics(id) NOT NULL,
  mastery_level DECIMAL(3,2) DEFAULT 0,   -- 0.0 to 1.0
  attempts INTEGER DEFAULT 0,
  correct_answers INTEGER DEFAULT 0,
  time_spent_seconds INTEGER DEFAULT 0,
  last_activity_at TIMESTAMPTZ,
  knowledge_state VECTOR(768),            -- embedding of child's understanding
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(child_id, topic_id)
);

-- Learning session events for analytics
CREATE TABLE session_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  child_id UUID REFERENCES users(id) NOT NULL,
  lesson_id UUID REFERENCES lessons(id),
  event_type VARCHAR(50) NOT NULL,
  event_data JSONB DEFAULT '{}',
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Vector similarity index for content retrieval
CREATE INDEX idx_topics_embedding ON topics
  USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

CREATE INDEX idx_lessons_embedding ON lessons
  USING ivfflat (content_embedding vector_cosine_ops) WITH (lists = 100);

CREATE INDEX idx_progress_knowledge ON learning_progress
  USING ivfflat (knowledge_state vector_cosine_ops) WITH (lists = 50);

Key design decisions:

  • knowledge_state as a vector column on learning_progress — this represents what the child knows as an embedding, enabling vector similarity to find related topics they might struggle with
  • content JSONB on lessons — structured content (questions, explanations, hints) stored as JSON for flexible rendering
  • curriculum_standard on topics — maps directly to Common Core or local standards, so parents see alignment
  • IVFFlat indexes with appropriate list counts for our data scale (~10K topics, ~100K lessons)
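The knowledge_state column turns "which related topics might this child struggle with?" into a single query. The SQL below is illustrative (column names from the schema above, query shape mine); `<=>` is pgvector's cosine-distance operator, mirrored by a plain TypeScript helper for testing ranking logic without a database:

```typescript
// pgvector's `<=>` returns cosine distance (1 - cosine similarity).
// Same math in TypeScript, useful for unit tests without Postgres:
export function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Illustrative query: other topics whose embeddings sit close to the
// knowledge_state vectors of topics this child has low mastery on —
// i.e. likely trouble spots worth reviewing next.
export const likelyStruggleTopicsSQL = `
  SELECT t.id, t.title,
         1 - (t.embedding <=> lp.knowledge_state) AS similarity
  FROM learning_progress lp
  JOIN topics t ON t.id <> lp.topic_id
  WHERE lp.child_id = $1
    AND lp.mastery_level < 0.5
  ORDER BY t.embedding <=> lp.knowledge_state
  LIMIT 5`;
```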

Phase 3: Implementation with Agentic AI

Here’s where the agentic workflow changed everything. Instead of using Claude as a fancy autocomplete, I used it as a development partner through every implementation phase.

Project Setup

# Initialize Next.js with TypeScript
npx create-next-app@latest kids-learn --typescript --tailwind \
  --eslint --app --src-dir --import-alias "@/*"

# Core dependencies
npm install @google/generative-ai pgvector pg drizzle-orm
npm install next-auth @auth/drizzle-adapter
npm install zod react-hook-form @tanstack/react-query

# AI and vector search
npm install ai @ai-sdk/google          # Vercel AI SDK with Gemini

# Development
npm install -D drizzle-kit vitest @testing-library/react
npm install -D playwright @playwright/test

The AI Content Generation Pipeline

The core feature — generating personalized lessons — required careful prompt engineering and a multi-step pipeline.

// src/lib/ai/lesson-generator.ts
import { google } from "@ai-sdk/google";
import { generateObject } from "ai";
import { z } from "zod";
import { db } from "@/lib/db";
import { lessons } from "@/lib/db/schema";
import { findSimilarTopics, getChildKnowledgeState, embedContent } from "@/lib/db/vectors";
import { getTopic, getTopicProgress } from "@/lib/db/queries";
import { calculateTargetDifficulty } from "./adaptive-engine";

const LessonSchema = z.object({
  title: z.string(),
  introduction: z.string(),
  concepts: z.array(z.object({
    explanation: z.string(),
    example: z.string(),
    visual_description: z.string(),
  })),
  exercises: z.array(z.object({
    question: z.string(),
    type: z.enum(["multiple_choice", "fill_blank", "drag_drop", "open_ended"]),
    options: z.array(z.string()).optional(),
    correct_answer: z.string(),
    hint: z.string(),
    difficulty: z.number().min(0).max(1),
  })),
  summary: z.string(),
});

export async function generateLesson(
  childId: string,
  topicId: string,
) {
  // Step 1: Get child's current knowledge state
  const knowledgeState = await getChildKnowledgeState(childId);
  const progress = await getTopicProgress(childId, topicId);
  const topic = await getTopic(topicId);

  // Step 2: Find related topics the child has mastered (for building connections)
  const relatedMastered = await findSimilarTopics(
    topic.embedding,
    { minMastery: 0.7, childId, limit: 3 }
  );

  // Step 3: Determine target difficulty
  const targetDifficulty = calculateTargetDifficulty(
    progress.mastery_level,
    progress.recent_window,          // PerformanceWindow: attempts, correct, timing, streak
    progress.historical_accuracy,
  );

  // Step 4: Generate with Gemini
  const { object: lesson } = await generateObject({
    model: google("gemini-2.0-flash"),
    schema: LessonSchema,
    prompt: `You are creating a lesson for a ${topic.grade_level}th grade student.

Topic: ${topic.title} (${topic.curriculum_standard})
Description: ${topic.description}
Target difficulty: ${targetDifficulty} (0=beginner, 1=advanced)
Current mastery: ${progress.mastery_level}

The student has already mastered these related topics:
${relatedMastered.map(t => `- ${t.title} (mastery: ${t.mastery})`).join("\n")}

Guidelines:
- Use age-appropriate language for grade ${topic.grade_level}
- Build on concepts from mastered topics when possible
- Include exactly 5 exercises with increasing difficulty
- Each exercise should have a helpful hint that guides without giving the answer
- Make examples relatable to children's daily experiences
- Keep the introduction under 3 sentences to maintain attention`,
  });

  // Step 5: Store with embedding for future retrieval
  const contentEmbedding = await embedContent(JSON.stringify(lesson));

  const saved = await db.insert(lessons).values({
    topic_id: topicId,
    difficulty_level: targetDifficulty,
    content: lesson,
    content_embedding: contentEmbedding,
    quality_score: null,  // Set after child completes lesson
  }).returning();

  return saved[0];
}

The RAG Knowledge Base for Curriculum Alignment

The knowledge base ensures every generated lesson aligns with actual curriculum standards. We ingest curriculum documents, chunk them semantically, and use hybrid retrieval.

// src/lib/ai/curriculum-rag.ts
import { google } from "@ai-sdk/google";
import { embed } from "ai";
import { db } from "@/lib/db";
import { sql } from "drizzle-orm";

interface RetrievalResult {
  content: string;
  standard_id: string;
  similarity: number;
  source: "vector" | "keyword" | "both";
}

export async function retrieveCurriculumContext(
  query: string,
  gradeLevel: number,
  subject: string,
  limit: number = 5,
): Promise<RetrievalResult[]> {
  // Generate query embedding
  const { embedding } = await embed({
    model: google.textEmbeddingModel("text-embedding-004"),
    value: query,
  });

  // Hybrid retrieval: vector similarity + full-text search
  const results = await db.execute(sql`
    WITH vector_results AS (
      SELECT
        id, content, standard_id,
        1 - (embedding <=> ${sql.raw(`'[${embedding.join(",")}]'::vector`)}) as similarity,
        'vector' as source
      FROM curriculum_standards
      WHERE grade_level = ${gradeLevel}
        AND subject = ${subject}
      ORDER BY embedding <=> ${sql.raw(`'[${embedding.join(",")}]'::vector`)}
      LIMIT ${limit * 2}
    ),
    keyword_results AS (
      SELECT
        id, content, standard_id,
        ts_rank(search_vector, plainto_tsquery('english', ${query})) as similarity,
        'keyword' as source
      FROM curriculum_standards
      WHERE grade_level = ${gradeLevel}
        AND subject = ${subject}
        AND search_vector @@ plainto_tsquery('english', ${query})
      ORDER BY ts_rank(search_vector, plainto_tsquery('english', ${query})) DESC
      LIMIT ${limit * 2}
    ),
    combined AS (
      SELECT DISTINCT ON (id)
        id, content, standard_id,
        GREATEST(v.similarity, k.similarity) as similarity,
        CASE
          WHEN v.id IS NOT NULL AND k.id IS NOT NULL THEN 'both'
          WHEN v.id IS NOT NULL THEN 'vector'
          ELSE 'keyword'
        END as source
      FROM vector_results v
      FULL OUTER JOIN keyword_results k USING (id, content, standard_id)
      ORDER BY id, similarity DESC
    )
    SELECT * FROM combined
    ORDER BY
      CASE source WHEN 'both' THEN 0 WHEN 'vector' THEN 1 ELSE 2 END,
      similarity DESC
    LIMIT ${limit}
  `);

  return results.rows as RetrievalResult[];
}
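The ingestion side of the pipeline — chunking curriculum documents before embedding — isn't shown above. A sketch of sentence-aware chunking (the function name and character budget are mine): split on sentence boundaries, then pack sentences into chunks so no embedding input cuts a sentence in half:

```typescript
// Sentence-aware chunking: pack whole sentences into chunks of at most
// maxChars, so every embedded chunk ends on a sentence boundary.
export function chunkText(text: string, maxChars = 1000): string[] {
  // Split into sentences, keeping terminal punctuation attached
  const sentences = text.match(/[^.!?]+[.!?]+(\s+|$)|[^.!?]+$/g) ?? [];
  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    if (current && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Each chunk would then be embedded with `text-embedding-004` and inserted into `curriculum_standards` alongside its `search_vector` for the hybrid retrieval shown above.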

Adaptive Difficulty Engine

The adaptive engine is the secret sauce. It adjusts difficulty in real-time based on the child’s performance using a modified version of item response theory (IRT).

// src/lib/ai/adaptive-engine.ts

interface PerformanceWindow {
  attempts: number;
  correct: number;
  avgTimeSeconds: number;
  streakLength: number;
  streakType: "correct" | "incorrect";
}

export function calculateTargetDifficulty(
  currentMastery: number,
  recentWindow: PerformanceWindow,
  historicalAccuracy: number,
): number {
  // Zone of Proximal Development (ZPD) targeting
  // We want the child slightly challenged but not frustrated
  const ZPD_LOW = 0.65;   // Below this: too easy, child gets bored
  const ZPD_HIGH = 0.85;  // Above this: too hard, child gets frustrated
  const ZPD_TARGET = 0.75; // Sweet spot: challenging but achievable

  const recentAccuracy = recentWindow.correct / Math.max(recentWindow.attempts, 1);

  // Base difficulty from mastery level
  let targetDifficulty = currentMastery;

  // Adjust based on recent performance
  if (recentAccuracy > ZPD_HIGH) {
    // Too easy — increase difficulty
    targetDifficulty += 0.1 * (recentAccuracy - ZPD_TARGET);
  } else if (recentAccuracy < ZPD_LOW) {
    // Too hard — decrease difficulty
    targetDifficulty -= 0.15 * (ZPD_TARGET - recentAccuracy);
    // Decrease faster than increase to prevent frustration
  }

  // Streak detection
  if (recentWindow.streakType === "incorrect" && recentWindow.streakLength >= 3) {
    // Child is struggling — significant step back
    targetDifficulty -= 0.2;
  } else if (recentWindow.streakType === "correct" && recentWindow.streakLength >= 5) {
    // Child is cruising — push harder
    targetDifficulty += 0.15;
  }

  // Time-based signal: if answers are very fast, it might be too easy
  const expectedTime = 30 + (targetDifficulty * 60); // 30-90 seconds
  if (recentWindow.avgTimeSeconds < expectedTime * 0.5) {
    targetDifficulty += 0.05; // Slightly harder
  }

  // Clamp to valid range
  return Math.max(0.05, Math.min(0.95, targetDifficulty));
}

Learning Session Flow

Here’s the complete sequence of what happens when a child starts a learning session — from opening the app to celebrating completion:

Learning Session Sequence Diagram — shows interaction between Child, Frontend, API, Adaptive Engine, Gemini AI, RAG Pipeline, and PostgreSQL during a complete learning session
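Server-side, the lifecycle in that diagram reduces to a small state machine over session events. A sketch with illustrative event shapes (not the production schema) — note that stray events after a session ends are ignored rather than corrupting state:

```typescript
// Session lifecycle as a reducer: each event from the frontend advances
// a server-side state the adaptive engine and progress tracker read.
type SessionState = {
  phase: "idle" | "in_lesson" | "complete";
  attempts: number;
  correct: number;
};

type SessionEvent =
  | { type: "lesson_started" }
  | { type: "answer_submitted"; correct: boolean }
  | { type: "lesson_completed" };

export function reduceSession(
  state: SessionState,
  event: SessionEvent,
): SessionState {
  switch (event.type) {
    case "lesson_started":
      return { phase: "in_lesson", attempts: 0, correct: 0 };
    case "answer_submitted":
      if (state.phase !== "in_lesson") return state; // ignore stray events
      return {
        ...state,
        attempts: state.attempts + 1,
        correct: state.correct + (event.correct ? 1 : 0),
      };
    case "lesson_completed":
      return { ...state, phase: "complete" };
  }
}
```

Modeling the session as replayable events also means a child closing the browser mid-lesson loses nothing: the state is rebuilt from `session_events` on reconnect.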

Parent Dashboard with Real-Time Analytics

The parent dashboard uses Next.js Server Components with streaming for fast initial load and progressive data hydration.

// src/app/dashboard/page.tsx
import { Suspense } from "react";
import { redirect } from "next/navigation";
import { auth } from "@/lib/auth";
import { ChildProgressCard } from "@/components/dashboard/child-progress-card";
import { KnowledgeMap, KnowledgeMapSkeleton } from "@/components/dashboard/knowledge-map";
import { WeeklyReport, WeeklyReportSkeleton } from "@/components/dashboard/weekly-report";
import { getChildren } from "@/lib/db/queries";

export default async function DashboardPage() {
  const session = await auth();
  if (!session?.user) redirect("/login");  // unauthenticated: back to login

  const children = await getChildren(session.user.id);

  return (
    <div className="dashboard-grid">
      <h1>Learning Dashboard</h1>

      {children.map((child) => (
        <section key={child.id}>
          <h2>{child.name}'s Progress</h2>

          {/* Fast: basic stats from cache */}
          <ChildProgressCard childId={child.id} />

          {/* Streamed: knowledge map requires vector computation */}
          <Suspense fallback={<KnowledgeMapSkeleton />}>
            <KnowledgeMap childId={child.id} />
          </Suspense>

          {/* Streamed: weekly report requires AI summarization */}
          <Suspense fallback={<WeeklyReportSkeleton />}>
            <WeeklyReport childId={child.id} />
          </Suspense>
        </section>
      ))}
    </div>
  );
}

Phase 4: Testing Strategy — The AI-Augmented Approach

This is where most SaaS products cut corners. Testing an AI-powered product is uniquely challenging because the outputs are non-deterministic. Here’s the testing pyramid I used.

Testing Strategy — Testing pyramid with Unit, AI Quality, Integration, and E2E layers, plus AI-specific test strategies including fixture system, property-based testing, safety gates, and CI/CD quality gates

Level 1: Unit Tests with Vitest

// src/lib/ai/__tests__/adaptive-engine.test.ts
import { describe, it, expect } from "vitest";
import { calculateTargetDifficulty } from "../adaptive-engine";

describe("Adaptive Difficulty Engine", () => {
  it("increases difficulty when accuracy is above ZPD high threshold", () => {
    const result = calculateTargetDifficulty(0.5, {
      attempts: 10,
      correct: 9,       // 90% accuracy — too easy
      avgTimeSeconds: 20,
      streakLength: 5,
      streakType: "correct",
    }, 0.85);

    expect(result).toBeGreaterThan(0.5);  // Should increase from current
  });

  it("decreases difficulty faster than it increases (frustration prevention)", () => {
    const increase = calculateTargetDifficulty(0.5, {
      attempts: 10, correct: 9,  // 90% — too easy
      avgTimeSeconds: 45, streakLength: 2, streakType: "correct",
      // normal pace, so only the accuracy signal fires (no fast-answer bonus)
    }, 0.85);

    const decrease = calculateTargetDifficulty(0.5, {
      attempts: 10, correct: 5,  // 50% — too hard
      avgTimeSeconds: 60, streakLength: 2, streakType: "incorrect",
    }, 0.55);

    const increaseAmount = increase - 0.5;
    const decreaseAmount = 0.5 - decrease;
    expect(decreaseAmount).toBeGreaterThan(increaseAmount);
  });

  it("significantly reduces difficulty after 3 consecutive failures", () => {
    const result = calculateTargetDifficulty(0.6, {
      attempts: 5,
      correct: 1,
      avgTimeSeconds: 45,
      streakLength: 3,
      streakType: "incorrect",
    }, 0.4);

    expect(result).toBeLessThan(0.4);  // Big step back
  });

  it("clamps difficulty between 0.05 and 0.95", () => {
    const low = calculateTargetDifficulty(0.05, {
      attempts: 10, correct: 0,
      avgTimeSeconds: 60, streakLength: 10, streakType: "incorrect",
    }, 0.1);

    const high = calculateTargetDifficulty(0.95, {
      attempts: 10, correct: 10,
      avgTimeSeconds: 10, streakLength: 10, streakType: "correct",
    }, 0.99);

    expect(low).toBeGreaterThanOrEqual(0.05);
    expect(high).toBeLessThanOrEqual(0.95);
  });
});

Level 2: AI Output Quality Tests

For AI-generated content, I built a quality gate system that evaluates outputs against rubrics.

// src/lib/ai/__tests__/lesson-quality.test.ts
import { describe, it, expect } from "vitest";
import { generateLesson } from "../lesson-generator";
import { evaluateLessonQuality, checkCurriculumAlignment } from "../quality-evaluator";

describe("Lesson Generation Quality", () => {
  it("generates age-appropriate content for grade 2", async () => {
    const lesson = await generateLesson(
      "test-child-grade2",
      "topic-addition-basics"
    );

    const quality = await evaluateLessonQuality(lesson, {
      gradeLevel: 2,
      subject: "math",
    });

    // Flesch-Kincaid grade level should match target
    expect(quality.readabilityGrade).toBeLessThanOrEqual(3);
    expect(quality.readabilityGrade).toBeGreaterThanOrEqual(1);

    // No inappropriate content
    expect(quality.contentSafety.flagged).toBe(false);

    // Exercises match difficulty expectations
    expect(quality.exerciseDifficultyRange.min).toBeGreaterThanOrEqual(0);
    expect(quality.exerciseDifficultyRange.max).toBeLessThanOrEqual(0.5);
  }, 30000);  // Extended timeout for AI generation

  it("aligns generated content with curriculum standards", async () => {
    const lesson = await generateLesson(
      "test-child-grade3",
      "topic-multiplication-intro"
    );

    const alignment = await checkCurriculumAlignment(
      lesson,
      "CCSS.MATH.3.OA.A.1"  // Interpret products of whole numbers
    );

    expect(alignment.score).toBeGreaterThan(0.7);
    expect(alignment.coveredObjectives).not.toHaveLength(0);
  }, 30000);
});

Level 3: Integration Tests

// src/tests/integration/learning-flow.test.ts
import { describe, it, expect, beforeAll, afterAll } from "vitest";
import { setupTestDB, teardownTestDB, createTestUser } from "../helpers/db";
import { generateLesson } from "@/lib/ai/lesson-generator";
import { getTopicProgress } from "@/lib/db/queries";
import { getNextRecommendation, submitLessonAnswers } from "@/lib/learning/session";
import type { User } from "@/lib/db/schema";

describe("Learning Flow Integration", () => {
  let testParent: User;
  let testChild: User;

  beforeAll(async () => {
    await setupTestDB();
    testParent = await createTestUser({ role: "parent" });
    testChild = await createTestUser({
      role: "child",
      parentId: testParent.id,
    });
  });

  afterAll(async () => {
    await teardownTestDB();
  });

  it("completes a full learning session", async () => {
    // 1. Get recommended topic
    const recommendation = await getNextRecommendation(testChild.id);
    expect(recommendation.topicId).toBeDefined();

    // 2. Start a lesson
    const lesson = await generateLesson(testChild.id, recommendation.topicId);
    expect(lesson.content.exercises).toHaveLength(5);

    // 3. Submit answers
    const answers = lesson.content.exercises.map((ex, i) => ({
      exerciseIndex: i,
      answer: ex.correct_answer,  // All correct for this test
      timeSeconds: 20 + (i * 5),
    }));

    const result = await submitLessonAnswers(testChild.id, lesson.id, answers);

    // 4. Verify progress updated
    expect(result.score).toBe(1.0);
    const progress = await getTopicProgress(testChild.id, recommendation.topicId);
    expect(progress.mastery_level).toBeGreaterThan(0);
    expect(progress.attempts).toBe(5);

    // 5. Verify knowledge state embedding updated
    expect(progress.knowledge_state).toBeDefined();
    expect(progress.knowledge_state.length).toBe(768);
  });
});

Level 4: End-to-End Tests with Playwright

// e2e/learning-session.spec.ts
import { test, expect } from "@playwright/test";

test.describe("Kids Learn - Learning Session", () => {
  test.beforeEach(async ({ page }) => {
    await page.goto("/login");
    await page.fill("[data-testid=email]", "test-parent@example.com");
    await page.fill("[data-testid=password]", "test-password");
    await page.click("[data-testid=login-button]");
    await expect(page.locator("h1")).toContainText("Dashboard");
  });

  test("parent can switch to child profile and start a lesson", async ({ page }) => {
    // Switch to child profile
    await page.click("[data-testid=child-avatar-emma]");
    await expect(page.locator("h2")).toContainText("Emma's Learning");

    // Start recommended lesson
    await page.click("[data-testid=start-lesson]");
    await expect(page.locator("[data-testid=lesson-title]")).toBeVisible();

    // Answer first question
    await page.click("[data-testid=option-a]");
    await page.click("[data-testid=submit-answer]");

    // Verify feedback shown
    await expect(
      page.locator("[data-testid=feedback]")
    ).toBeVisible({ timeout: 5000 });
  });

  test("parent dashboard shows progress after lesson completion", async ({ page }) => {
    // Navigate to dashboard
    await page.goto("/dashboard");

    // Check progress card
    const progressCard = page.locator("[data-testid=progress-card-emma]");
    await expect(progressCard).toBeVisible();
    await expect(progressCard.locator(".mastery-score")).not.toHaveText("0%");

    // Check knowledge map loaded
    await expect(
      page.locator("[data-testid=knowledge-map]")
    ).toBeVisible({ timeout: 10000 });
  });
});

Testing AI Systems: Lessons Learned

The non-determinism problem. AI outputs vary between runs. You can’t assert expect(lesson.title).toBe("Adding Numbers"). Instead, assert on properties: the title is non-empty, under 100 characters, and contains keywords related to the topic.
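Concretely, the property checks can live in a small validator that tests (and production gates) share — the thresholds here are illustrative:

```typescript
// Property-based checks for non-deterministic AI output: assert on
// shape and bounds, never on exact strings. Thresholds are illustrative.
interface GeneratedLesson {
  title: string;
  exercises: { question: string; difficulty: number }[];
}

export function lessonPropertyIssues(
  lesson: GeneratedLesson,
  topicKeywords: string[],
): string[] {
  const issues: string[] = [];
  if (lesson.title.trim().length === 0) issues.push("empty title");
  if (lesson.title.length >= 100) issues.push("title too long");

  const lower = lesson.title.toLowerCase();
  if (!topicKeywords.some((k) => lower.includes(k.toLowerCase()))) {
    issues.push("title unrelated to topic");
  }

  lesson.exercises.forEach((ex, i) => {
    if (ex.difficulty < 0 || ex.difficulty > 1) {
      issues.push(`exercise ${i}: difficulty out of range`);
    }
  });
  return issues;
}
```

A test then asserts `lessonPropertyIssues(lesson, keywords)` is empty, which stays green across runs even as the generated wording changes.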

The cost problem. Running Gemini for every test run gets expensive fast. Solution: use a fixture system that records AI responses and replays them in CI. Only run live AI tests in a nightly job.

// test/helpers/ai-fixtures.ts
import { readFile, writeFile } from "node:fs/promises";

export function withAIFixture<T>(
  name: string,
  generator: () => Promise<T>,
): () => Promise<T> {
  const fixturePath = `test/fixtures/${name}.json`;

  return async () => {
    if (process.env.AI_FIXTURE_MODE === "record") {
      const result = await generator();
      await writeFile(fixturePath, JSON.stringify(result, null, 2));
      return result;
    }

    // Replay mode (default in CI)
    const fixture = JSON.parse(await readFile(fixturePath, "utf-8"));
    return fixture as T;
  };
}

The safety problem. AI-generated content for kids needs extra safety checks. Every lesson goes through a content safety evaluator before being shown:

import { google } from "@ai-sdk/google";
import { generateObject } from "ai";
import { z } from "zod";

async function contentSafetyCheck(lesson: Lesson): Promise<boolean> {
  const { object: safety } = await generateObject({
    model: google("gemini-2.0-flash"),
    schema: z.object({
      safe: z.boolean(),
      concerns: z.array(z.string()),
    }),
    prompt: `Evaluate this educational content for children ages 4-12.
Flag any content that is:
- Violent, scary, or anxiety-inducing
- Contains inappropriate language
- Culturally insensitive or biased
- Factually incorrect for the stated grade level
- Potentially confusing or misleading

Content: ${JSON.stringify(lesson.content)}`,
  });

  if (!safety.safe) {
    await logSafetyConcern(lesson.id, safety.concerns);
  }

  return safety.safe;
}

Phase 5: Deployment and Production

Infrastructure as Code

# docker-compose.yml
services:
  app:
    build: .
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://kidslearn:${DB_PASS}@db:5432/kidslearn
      - GOOGLE_AI_API_KEY=${GEMINI_API_KEY}
      - NEXTAUTH_SECRET=${AUTH_SECRET}
    depends_on:
      db:
        condition: service_healthy

  db:
    image: pgvector/pgvector:pg16
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: kidslearn
      POSTGRES_USER: kidslearn
      POSTGRES_PASSWORD: ${DB_PASS}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U kidslearn"]
      interval: 5s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    volumes:
      - redisdata:/data

volumes:
  pgdata:
  redisdata:

CI/CD Pipeline

# .github/workflows/ci.yml
name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: pgvector/pgvector:pg16
        env:
          POSTGRES_DB: kidslearn_test
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: "npm"

      - run: npm ci
      - run: npm run db:migrate
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/kidslearn_test

      - name: Unit & Integration Tests
        run: npm run test:ci
        env:
          AI_FIXTURE_MODE: replay

      - name: Build
        run: npm run build

      - name: E2E Tests
        run: npx playwright test
        env:
          AI_FIXTURE_MODE: replay

  ai-quality:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: "npm"

      - run: npm ci
      - name: AI Quality Tests (Live)
        run: npm run test:ai-quality
        env:
          AI_FIXTURE_MODE: record
          GOOGLE_AI_API_KEY: ${{ secrets.GEMINI_API_KEY }}

  deploy:
    needs: [test]
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Vercel
        uses: amondnet/vercel-action@v25
        with:
          vercel-token: ${{ secrets.VERCEL_TOKEN }}
          vercel-org-id: ${{ secrets.VERCEL_ORG_ID }}
          vercel-project-id: ${{ secrets.VERCEL_PROJECT_ID }}
          vercel-args: "--prod"

Monitoring and Observability

In production, we track three categories of metrics:

Application metrics: Response times, error rates, concurrent users.

AI metrics: Generation latency, token costs, content quality scores, safety flag rates.

Learning metrics: Engagement time per session, mastery progression rate, retention (do kids come back?), parent dashboard usage.

// src/lib/monitoring/ai-metrics.ts
import { metrics } from "./metrics-client"; // internal metrics client (paths illustrative)
import { calculateCost } from "./pricing"; // per-model token pricing helper

export async function trackAIGeneration(params: {
  model: string;
  operation: "lesson" | "evaluation" | "embedding";
  latencyMs: number;
  inputTokens: number;
  outputTokens: number;
  qualityScore?: number;
  safetyFlagged: boolean;
}) {
  await metrics.record({
    name: "ai.generation",
    tags: {
      model: params.model,
      operation: params.operation,
      safety_flagged: String(params.safetyFlagged),
    },
    fields: {
      latency_ms: params.latencyMs,
      input_tokens: params.inputTokens,
      output_tokens: params.outputTokens,
      quality_score: params.qualityScore ?? 0,
      cost_usd: calculateCost(params.model, params.inputTokens, params.outputTokens),
    },
  });
}
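The `calculateCost` helper referenced above isn't shown; a sketch of the shape it might take follows. The rate table here uses placeholder numbers, not real Gemini pricing — actual rates vary by model and change over time, so they belong in config, not code:

```typescript
// Illustrative per-million-token rates; real Gemini pricing differs and changes.
const RATES_PER_MILLION: Record<string, { input: number; output: number }> = {
  "gemini-1.5-flash": { input: 0.075, output: 0.3 },
  "gemini-1.5-pro": { input: 1.25, output: 5.0 },
};

export function calculateCost(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const rate = RATES_PER_MILLION[model];
  if (!rate) return 0; // unknown model: record zero rather than guess
  return (inputTokens * rate.input + outputTokens * rate.output) / 1_000_000;
}
```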

What Claude Changed About My Development Process

Using Claude as an agentic development partner wasn’t about writing code faster. It was about thinking more rigorously at every stage.

Design phase: Claude forced me to articulate “why” for every decision. Writing ADRs as conversations with Claude meant every assumption got questioned.

Implementation phase: Claude caught edge cases I would have missed. “What happens when a child closes the browser mid-lesson? What about race conditions in progress tracking?” These questions came from Claude analyzing my code in context.

Testing phase: Claude generated test scenarios I wouldn’t have thought of. “What if a child rapidly clicks through all answers in under 2 seconds? Your adaptive engine would misinterpret that as easy material.” That became a test case and a UI fix (answer cooldown timer).

Production phase: Claude helped me think about operational concerns early. “Your pgvector IVFFlat index will degrade as you approach 100K rows without rebuilding. Consider a HNSW index or scheduled REINDEX.” That became a production runbook item.
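That runbook item can be captured as a migration. A hedged sketch, assuming a raw-SQL migration runner and a `content_chunks.embedding` column — both the table name and the index parameters are illustrative, not the project's actual schema:

```typescript
// migrations/00x_hnsw_index.ts — swap the IVFFlat index for HNSW on the RAG embeddings.
// Assumes pgvector >= 0.5.0 and a `content_chunks.embedding vector(...)` column.
export const up = `
  DROP INDEX IF EXISTS content_chunks_embedding_ivfflat;
  CREATE INDEX content_chunks_embedding_hnsw
    ON content_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
`;
```

Unlike IVFFlat, an HNSW index doesn't depend on cluster centroids computed at build time, so it doesn't degrade as the table grows past its original size.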

Key Takeaways

  1. Validate before you build. Two weeks of research saved months of building the wrong thing. Use AI to accelerate research, not skip it.

  2. ADRs are worth it. Every architecture decision is a bet. Writing it down with alternatives and tradeoffs means you can revisit decisions intelligently when context changes.

  3. PostgreSQL + pgvector is production-ready. For most AI applications, you don’t need a dedicated vector database. One database, one backup strategy, transactional consistency.

  4. Test AI outputs on properties, not values. Assert structure, safety, and quality ranges — never exact content.

  5. AI content for children needs safety gates. Every generated piece of content goes through safety evaluation. No exceptions. The quality gate is a hard requirement, not a nice-to-have.

  6. Agentic AI is a thinking partner, not a code generator. The biggest value wasn’t in lines of code written. It was in assumptions challenged, edge cases caught, and decisions made with more information.

  7. Cost management is a feature. Track AI token usage and costs from day one. The fixture system for tests alone saved us hundreds of dollars per month in Gemini API calls.

  8. Ship the learning loop early. The adaptive engine started simple (just accuracy-based difficulty adjustment) and got sophisticated over time based on actual child interaction data. Don’t over-engineer before you have users.

The full codebase for Kids Learn uses approximately 15,000 lines of TypeScript across 120 files. Claude was involved in roughly 80% of those files — not writing every line, but reviewing, questioning, and improving every significant piece of logic. That’s what agentic AI development looks like in practice: not replacing the developer, but making every developer decision more informed.
