Here’s the problem: my application has an AI chat agent that answers user questions using a RAG pipeline — it retrieves context from a PostgreSQL-backed vector database, feeds it to an LLM, and generates a response. How do you write tests for something whose output is non-deterministic by design?

You can’t assert expect(response).toBe("The return policy is 30 days") because the LLM might phrase it differently every time. Traditional testing doesn’t work here. After months of trial and error, I found the right tools and patterns. This is Part 2 of my series on AI-powered quality engineering.

The Three Layers You Need to Test

An AI application has distinct layers, each requiring different testing strategies:

  1. Retrieval layer — Does the vector search return the right documents?
  2. Generation layer — Does the LLM produce faithful, relevant responses?
  3. Data layer — Is the database schema correct? Do migrations work?

Testing them independently reveals exactly where failures occur. A wrong answer might be a retrieval problem (wrong docs fetched), not a generation problem (LLM hallucinating).

Testing LLM Outputs with DeepEval

DeepEval is a Python framework that provides pytest-style assertions for LLM outputs. Think of it as “pytest for LLMs.” Install it:

pip install deepeval

The core concept is an LLMTestCase evaluated against a set of metrics. Here’s how I test my customer support agent:

from deepeval import assert_test
from deepeval.metrics import (
    HallucinationMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric,
)
from deepeval.test_case import LLMTestCase

def test_agent_no_hallucination():
    context = [
        "All customers are eligible for a 30-day full refund at no extra cost."
    ]
    test_case = LLMTestCase(
        input="What is the refund policy?",
        actual_output="You can get a full refund within 30 days of purchase.",
        # HallucinationMetric evaluates against `context`, while
        # FaithfulnessMetric evaluates against `retrieval_context`,
        # so the test case needs both
        context=context,
        retrieval_context=context,
    )

    hallucination = HallucinationMetric(threshold=0.5)
    faithfulness = FaithfulnessMetric(threshold=0.7)
    relevancy = AnswerRelevancyMetric(threshold=0.7)

    assert_test(test_case, [hallucination, faithfulness, relevancy])

This test passes if the response is faithful to the provided context (doesn’t make things up), relevant to the question, and below the hallucination threshold.
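A detail that tripped me up: hallucination is a “lower is better” score, while faithfulness and relevancy are “higher is better.” Conceptually, the pass logic looks like this — a sketch for illustration, not DeepEval’s actual implementation:

```python
# Sketch of the pass/fail direction per metric type — an illustration,
# not DeepEval internals.

LOWER_IS_BETTER = {"hallucination"}  # score measures how much is made up

def metric_passes(name: str, score: float, threshold: float) -> bool:
    """Hallucination passes at or below its threshold; the rest at or above."""
    if name in LOWER_IS_BETTER:
        return score <= threshold
    return score >= threshold
```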

Running tests with parallel execution:

deepeval test run tests/test_ai_quality.py -n 4

Testing Multiple Scenarios with Datasets

For production, I build evaluation datasets — a set of question-answer pairs with expected retrieval contexts:

import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Custom correctness metric
correctness = GEval(
    name="Correctness",
    criteria="Determine if the actual output is factually correct based on the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.5,
)

# Build test cases from your golden dataset
test_cases = [
    LLMTestCase(
        input="What programming languages do you support?",
        actual_output=my_agent.invoke("What programming languages do you support?"),
        expected_output="We support Python, TypeScript, and Go.",
    ),
    LLMTestCase(
        input="How do I reset my password?",
        actual_output=my_agent.invoke("How do I reset my password?"),
        expected_output="Click 'Forgot Password' on the login page and follow the email instructions.",
    ),
]

dataset = EvaluationDataset(test_cases=test_cases)

@pytest.mark.parametrize("test_case", dataset)
def test_agent_correctness(test_case: LLMTestCase):
    assert_test(test_case, [correctness])

DeepEval provides 50+ metrics including ToxicityMetric, BiasMetric, and ContextualPrecisionMetric. For my use case, these four matter most: hallucination, faithfulness, relevancy, and correctness.

Testing RAG Retrieval Quality with Ragas

Ragas is specifically designed for evaluating RAG pipelines. While DeepEval tests the final answer, Ragas tests whether the retrieval step fetched the right documents.

pip install ragas

The key metrics for retrieval:

from ragas import evaluate
from ragas.metrics import (
    Faithfulness,
    ResponseRelevancy,
    LLMContextPrecisionWithReference,
    LLMContextRecall,
)
from datasets import Dataset

# Your test dataset
eval_data = Dataset.from_dict({
    "question": [
        "What is the pricing for the Pro plan?",
        "How do I integrate with Slack?",
    ],
    "answer": [
        my_rag.query("What is the pricing for the Pro plan?"),
        my_rag.query("How do I integrate with Slack?"),
    ],
    "contexts": [
        my_rag.retrieve("What is the pricing for the Pro plan?"),
        my_rag.retrieve("How do I integrate with Slack?"),
    ],
    "ground_truth": [
        "The Pro plan costs $49/month per user with unlimited projects.",
        "Go to Settings > Integrations > Slack and click Connect.",
    ],
})

result = evaluate(
    eval_data,
    metrics=[
        Faithfulness(),
        ResponseRelevancy(),
        LLMContextPrecisionWithReference(),
        LLMContextRecall(),
    ],
)

df = result.to_pandas()
print(df)

What each metric tells you:

| Metric | Question It Answers | Target |
|---|---|---|
| Faithfulness | Is the answer supported by the retrieved context? | > 0.85 |
| Response Relevancy | Does the answer address the question? | > 0.80 |
| Context Precision | Are the retrieved docs actually relevant? | > 0.75 |
| Context Recall | Did retrieval find all necessary information? | > 0.75 |

When context recall is low but faithfulness is high, your retrieval is the bottleneck — the LLM is doing fine with what it gets, but it’s not getting the right documents.
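That triage rule can be written down directly. Here’s a sketch that operates on plain per-query score dicts pulled out of result.to_pandas() — the key names and thresholds mirror the targets above and are my assumptions, not a Ragas API:

```python
# Map per-query Ragas scores to the most likely failing layer.
# Thresholds are the targets from the table above — tune for your system.

def diagnose(scores: dict[str, float]) -> str:
    """Return a short label for the layer most likely causing a bad answer."""
    if scores["context_recall"] < 0.75 and scores["faithfulness"] >= 0.85:
        # LLM is faithful to what it sees, but retrieval missed documents
        return "retrieval: recall bottleneck (missing documents)"
    if scores["context_precision"] < 0.75:
        return "retrieval: precision bottleneck (noisy context)"
    if scores["faithfulness"] < 0.85:
        return "generation: answer not grounded in context"
    if scores["response_relevancy"] < 0.80:
        return "generation: answer off-topic"
    return "ok"
```

I run this over every row of the results DataFrame to get a per-query failure label instead of a single aggregate score.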

Vector Search Quality Metrics

Beyond Ragas, I track classical information retrieval metrics for the vector search itself. These tell you whether your embeddings and index configuration are working:

import numpy as np

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """What fraction of top-k results are relevant?"""
    top_k = retrieved[:k]
    return len(set(top_k) & relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """What fraction of all relevant docs appear in top-k?"""
    top_k = retrieved[:k]
    return len(set(top_k) & relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """How high is the first relevant result ranked?"""
    for i, doc_id in enumerate(retrieved):
        if doc_id in relevant:
            return 1.0 / (i + 1)
    return 0.0

I run these against a golden evaluation set — 50 queries with human-labeled relevant documents. If Precision@5 drops below 0.7 after an embedding model change, I know something broke.
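The gate itself is a one-assert pytest test. A sketch — the golden_set entries and the search_fn signature are hypothetical placeholders for your own labeled queries and vector search call, and precision_at_k is repeated from above so the snippet stands alone:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """What fraction of top-k results are relevant?"""
    return len(set(retrieved[:k]) & relevant) / k

def mean_precision_at_k(search_fn, golden_set, k: int = 5) -> float:
    """Average Precision@K over every labeled query in the golden set."""
    scores = [
        precision_at_k(search_fn(case["query"], k), case["relevant"], k)
        for case in golden_set
    ]
    return sum(scores) / len(scores)

def test_retrieval_quality_gate():
    # Hypothetical golden set — use your own human-labeled queries
    golden_set = [
        {"query": "refund policy", "relevant": {"doc-12", "doc-47"}},
        {"query": "pro plan pricing", "relevant": {"doc-3"}},
    ]
    # fake_search stands in for the real vector search in this sketch
    fake_search = lambda q, k: ["doc-12", "doc-47", "doc-3", "doc-9", "doc-1"]
    assert mean_precision_at_k(fake_search, golden_set, k=5) >= 0.3
```

In the real suite the threshold is my 0.7 floor, so any embedding or index change that degrades retrieval fails CI.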

Practical signals from these metrics:

  • Low Recall@K despite relevant content in the database → your embedding model is poor for this domain
  • Low Precision@K → your index is returning noise; tune the HNSW ef_search parameter
  • Low MRR → relevant docs exist but are ranked too low, consider adding a reranker

Database Testing with Testcontainers

For the PostgreSQL layer (with pgvector for embeddings), I use Testcontainers to spin up a real Postgres instance in Docker for each test run:

npm install --save-dev testcontainers @testcontainers/postgresql pg

import {
  PostgreSqlContainer,
  StartedPostgreSqlContainer,
} from '@testcontainers/postgresql';
import { Client } from 'pg';

describe('Database Integration', () => {
  let container: StartedPostgreSqlContainer;
  let client: Client;

  beforeAll(async () => {
    container = await new PostgreSqlContainer('pgvector/pgvector:pg16')
      .withDatabase('testdb')
      .withUsername('test')
      .withPassword('test')
      .start();

    client = new Client({
      connectionString: container.getConnectionUri(),
    });
    await client.connect();

    // Run your migrations
    await client.query('CREATE EXTENSION IF NOT EXISTS vector');
    await client.query(`
      CREATE TABLE documents (
        id SERIAL PRIMARY KEY,
        content TEXT NOT NULL,
        embedding vector(1536),
        metadata JSONB DEFAULT '{}'
      )
    `);
  }, 60000);

  afterAll(async () => {
    await client.end();
    await container.stop();
  });

  beforeEach(async () => {
    await client.query('DELETE FROM documents');
  });

  test('stores and retrieves embeddings', async () => {
    const embedding = Array(1536).fill(0).map(() => Math.random());
    const vectorStr = `[${embedding.join(',')}]`;

    await client.query(
      'INSERT INTO documents (content, embedding) VALUES ($1, $2::vector)',
      ['Test document', vectorStr]
    );

    const result = await client.query(`
      SELECT content, 1 - (embedding <=> $1::vector) as similarity
      FROM documents
      ORDER BY embedding <=> $1::vector
      LIMIT 5
    `, [vectorStr]);

    expect(result.rows).toHaveLength(1);
    // float4 storage introduces tiny rounding error, so don't demand exact 1.0
    expect(result.rows[0].similarity).toBeCloseTo(1.0, 3);
  });

  test('queries documents by JSONB metadata', async () => {
    await client.query(
      "INSERT INTO documents (content, metadata) VALUES ($1, $2)",
      ['Doc 1', '{"source": "test"}']
    );

    const result = await client.query(
      "SELECT * FROM documents WHERE metadata->>'source' = 'test'"
    );
    expect(result.rows).toHaveLength(1);
  });
});

Key benefits over mocking: you test against real PostgreSQL with real pgvector, real SQL, and real constraint enforcement. Testcontainers handles cleanup — each test run gets a fresh database.
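The <=> operator in the similarity test is pgvector’s cosine distance, which is why the query converts it with 1 - distance. In plain Python the math looks like this — a sketch of the formula, not pgvector’s implementation:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance, as pgvector's <=> operator: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# A vector compared with itself has distance 0, so similarity
# (1 - distance) is 1 — which is what the test asserts.
```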

Migration Testing

I also test that migrations apply cleanly:

test('migrations apply on empty database', async () => {
  const container = await new PostgreSqlContainer('pgvector/pgvector:pg16')
    .start();

  // This should not throw
  await runMigrations(container.getConnectionUri());

  const client = new Client({
    connectionString: container.getConnectionUri(),
  });
  await client.connect();

  const tables = await client.query(`
    SELECT table_name FROM information_schema.tables
    WHERE table_schema = 'public'
    ORDER BY table_name
  `);

  expect(tables.rows.map(r => r.table_name)).toContain('documents');
  expect(tables.rows.map(r => r.table_name)).toContain('users');

  await client.end();
  await container.stop();
}, 60000);

The Combined Testing Strategy

Here’s how all three layers fit together in practice:

| Layer | Tool | What You Catch |
|---|---|---|
| Database | Testcontainers + pgvector | Schema errors, migration failures, query bugs |
| Retrieval | Ragas + custom IR metrics | Wrong documents fetched, poor embedding quality |
| Generation | DeepEval | Hallucinations, irrelevant answers, unfaithful responses |
| End-to-End | Playwright | Broken UI, API integration failures, user flow regressions |

Test the layers independently first. When all layers pass individually but E2E fails, the bug is in the glue code — the API route that connects retrieval to generation, or the frontend that renders the response.

Series Navigation

In Part 3, I’ll show the CI/CD pipeline that runs all of this automatically on every PR — GitHub Actions with quality gates for test coverage, performance budgets, AI response quality checks, and visual regression tests. If any gate fails, the PR can’t merge.
