RAG Knowledge Base System
Production RAG system that turns internal documentation into a searchable, conversational knowledge base. Built with LlamaIndex, ChromaDB, and Claude.
Every company I’ve worked at has the same problem: documentation exists, but nobody can find anything. This project was my attempt to fix that.
The Problem
At NashTech, we had documentation spread across Confluence, Google Docs, markdown files in repos, and people’s heads. New developers would ask the same questions repeatedly, and experienced developers would spend 20 minutes searching for something they knew existed somewhere.
What We Built
A RAG (Retrieval-Augmented Generation) system that:
- Ingests documents from multiple sources — Confluence, GitHub repos, PDF specs, internal wikis
- Chunks and embeds the content using different strategies per document type
- Stores vectors in ChromaDB with metadata for filtering
- Retrieves relevant context using hybrid search (semantic + keyword BM25)
- Generates answers via Claude with source citations
The user-facing interface is a chat-like web app where people can ask questions in natural language and get answers grounded in actual company documentation.
Architecture Decisions That Mattered
Chunking strategy ended up being more important than the embedding model or the LLM choice. We tried fixed-size chunks first and the answers were terrible — important context was getting split across boundaries. We switched to semantic chunking with overlap, plus parent-document retrieval for longer technical docs.
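The production system uses LlamaIndex's splitters for this, but the two ideas are easy to show in miniature: chunks share sentences across boundaries so context isn't cut in half, and each chunk keeps a pointer back to its parent document so the full doc can be retrieved when a chunk matches. This is a simplified sentence-window sketch, not the actual semantic splitter, and the sample doc is made up:

```python
import re

def chunk_with_overlap(doc_id: str, text: str, size: int = 3, overlap: int = 1) -> list[dict]:
    """Split a document into sentence-based chunks of `size` sentences that
    share `overlap` sentences across boundaries, each tagged with its parent."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, step = [], size - overlap
    for start in range(0, len(sentences), step):
        window = sentences[start:start + size]
        chunks.append({"parent_id": doc_id, "text": " ".join(window)})
        if start + size >= len(sentences):
            break
    return chunks

doc = ("The deploy pipeline has four stages. Stage one runs unit tests. "
       "Stage two builds the image. Stage three pushes to ECR. "
       "Stage four rolls out to ECS.")
chunks = chunk_with_overlap("deploy-guide", doc)
# "Stage two builds the image." lands in both chunks, so a question about
# the build step matches either side of the boundary.
```

At query time, `parent_id` is what enables parent-document retrieval: when a small chunk of a long technical doc matches, the system can pull in the whole parent for context instead of answering from the fragment alone.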
Hybrid retrieval was a game-changer. Pure vector search missed exact terms (like project names, specific error codes). Adding BM25 keyword matching alongside vector similarity improved answer quality significantly.
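One simple way to combine the two result lists, and the one sketched here, is reciprocal rank fusion: each retriever contributes a score based only on where it ranked a document, so exact-term BM25 hits and semantic hits can be merged without comparing their incompatible raw scores. The doc IDs below are made up; `k=60` is the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs. Each list adds 1 / (k + rank)
    to a document's score; documents ranked well by both lists rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic-similarity order
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # exact-term match order
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# doc_a and doc_c appear in both lists, so they outrank docs found by only one.
```

This is exactly the failure mode hybrid search fixes: a query containing a literal error code ranks highly in the BM25 list even when the embedding model treats it as an unremarkable token.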
Query expansion — before searching, we use Claude to generate 2-3 alternative phrasings of the user’s question. This catches cases where the documentation uses different terminology than the question.
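A sketch of that step, with the Claude call injected as a plain callable so the example is self-contained. The prompt wording and helper names are illustrative, not the production prompt, and the canned response stands in for a real API reply:

```python
EXPANSION_PROMPT = (
    "Rewrite the question below in {n} different ways, using alternative "
    "terminology a technical document might use. Return one rewrite per line.\n\n"
    "Question: {question}"
)

def expand_query(question: str, llm, n: int = 3) -> list[str]:
    """Return the original question plus up to n LLM-generated rephrasings.
    `llm` is any callable mapping a prompt string to a completion string,
    so a real Claude client (or a test stub) can be dropped in."""
    reply = llm(EXPANSION_PROMPT.format(n=n, question=question))
    # Strip list numbering/bullets the model may add to each line.
    rewrites = [line.strip(" -0123456789.") for line in reply.splitlines() if line.strip()]
    return [question] + rewrites[:n]

# A canned response stands in for a real Claude call.
fake_llm = lambda prompt: "1. How do I reset my auth token?\n2. What fixes an expired session?"
queries = expand_query("my login stopped working", fake_llm)
```

Each phrasing is then run through retrieval and the results are merged, which is what catches the vocabulary mismatch between how people ask and how the docs are written.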
Tech Stack
- Python with LlamaIndex for the orchestration layer
- ChromaDB (self-hosted) for vector storage, with Pinecone under evaluation as a managed alternative
- Voyage AI embeddings (better than OpenAI for technical content in our testing)
- Claude for generation
- React frontend, FastAPI backend
- Docker on AWS ECS
Results
- Developers find answers 3x faster than searching Confluence manually
- New hire onboarding questions dropped by about 40%
- The system handles ~200 queries per day
- Answer accuracy (based on user feedback) sits around 85%
The remaining 15% is mostly questions about very recent changes that haven’t been re-indexed yet, or questions that require reasoning across multiple documents.
Timeline: Prototype in mid-2024, production since Q4 2024. Continuous improvement on chunking and retrieval quality.