RAG Knowledge Base System
Production RAG system that turns internal documentation into a searchable, conversational knowledge base. Built with LlamaIndex, ChromaDB, and Claude.
Every company I’ve worked at has the same problem: documentation exists, but nobody can find anything. This project was my attempt to fix that.
The Problem
At NashTech, we had documentation spread across Confluence, Google Docs, markdown files in repos, and people’s heads. New developers would ask the same questions repeatedly, and experienced developers would spend 20 minutes searching for something they knew existed somewhere.
What We Built
A RAG (Retrieval-Augmented Generation) system that:
- Ingests documents from multiple sources — Confluence, GitHub repos, PDF specs, internal wikis
- Chunks and embeds the content using different strategies per document type
- Stores vectors in ChromaDB with metadata for filtering
- Retrieves relevant context using hybrid search (semantic + keyword BM25)
- Generates answers via Claude with source citations
The user-facing interface is a chat-like web app where people can ask questions in natural language and get answers grounded in actual company documentation.
Architecture Decisions That Mattered
Chunking strategy ended up being more important than the embedding model or the LLM choice. We tried fixed-size chunks first and the answers were terrible — important context was getting split across boundaries. We switched to semantic chunking with overlap, plus parent-document retrieval for longer technical docs.
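The production system uses LlamaIndex's splitters for this, but the two ideas are easy to show in miniature: chunks share sentences across boundaries so context isn't cut in half, and each chunk keeps a pointer back to its parent document so the full doc can be retrieved when a chunk matches. This is a simplified sentence-window sketch, not the actual semantic splitter, and the sample doc is made up:

```python
import re

def chunk_with_overlap(doc_id: str, text: str, size: int = 3, overlap: int = 1) -> list[dict]:
    """Split a document into sentence-based chunks of `size` sentences that
    share `overlap` sentences across boundaries, each tagged with its parent."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, step = [], size - overlap
    for start in range(0, len(sentences), step):
        window = sentences[start:start + size]
        chunks.append({"parent_id": doc_id, "text": " ".join(window)})
        if start + size >= len(sentences):
            break
    return chunks

doc = ("The deploy pipeline has four stages. Stage one runs unit tests. "
       "Stage two builds the image. Stage three pushes to ECR. "
       "Stage four rolls out to ECS.")
chunks = chunk_with_overlap("deploy-guide", doc)
# "Stage two builds the image." lands in both chunks, so a question about
# the build step matches either side of the boundary.
```

At query time, `parent_id` is what enables parent-document retrieval: when a small chunk of a long technical doc matches, the system can pull in the whole parent for context instead of answering from the fragment alone.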
Hybrid retrieval was a game-changer. Pure vector search missed exact terms (like project names, specific error codes). Adding BM25 keyword matching alongside vector similarity improved answer quality significantly.
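One simple way to combine the two result lists, and the one sketched here, is reciprocal rank fusion: each retriever contributes a score based only on where it ranked a document, so exact-term BM25 hits and semantic hits can be merged without comparing their incompatible raw scores. The doc IDs below are made up; `k=60` is the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs. Each list adds 1 / (k + rank)
    to a document's score; documents ranked well by both lists rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic-similarity order
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # exact-term match order
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# doc_a and doc_c appear in both lists, so they outrank docs found by only one.
```

This is exactly the failure mode hybrid search fixes: a query containing a literal error code ranks highly in the BM25 list even when the embedding model treats it as an unremarkable token.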
Query expansion — before searching, we use Claude to generate 2-3 alternative phrasings of the user’s question. This catches cases where the documentation uses different terminology than the question.
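A sketch of that step, with the Claude call injected as a plain callable so the example is self-contained. The prompt wording and helper names are illustrative, not the production prompt, and the canned response stands in for a real API reply:

```python
EXPANSION_PROMPT = (
    "Rewrite the question below in {n} different ways, using alternative "
    "terminology a technical document might use. Return one rewrite per line.\n\n"
    "Question: {question}"
)

def expand_query(question: str, llm, n: int = 3) -> list[str]:
    """Return the original question plus up to n LLM-generated rephrasings.
    `llm` is any callable mapping a prompt string to a completion string,
    so a real Claude client (or a test stub) can be dropped in."""
    reply = llm(EXPANSION_PROMPT.format(n=n, question=question))
    # Strip list numbering/bullets the model may add to each line.
    rewrites = [line.strip(" -0123456789.") for line in reply.splitlines() if line.strip()]
    return [question] + rewrites[:n]

# A canned response stands in for a real Claude call.
fake_llm = lambda prompt: "1. How do I reset my auth token?\n2. What fixes an expired session?"
queries = expand_query("my login stopped working", fake_llm)
```

Each phrasing is then run through retrieval and the results are merged, which is what catches the vocabulary mismatch between how people ask and how the docs are written.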
Tech Stack
- Python with LlamaIndex for the orchestration layer
- ChromaDB (self-hosted) for vector storage, with Pinecone under evaluation as a managed alternative
- Voyage AI embeddings (better than OpenAI for technical content in our testing)
- Claude for generation
- React frontend, FastAPI backend
- Docker on AWS ECS
Results
- Developers find answers 3x faster than searching Confluence manually
- New hire onboarding questions dropped by about 40%
- The system handles ~200 queries per day
- Answer accuracy (based on user feedback) sits around 85%
The remaining 15% is mostly questions about very recent changes that haven’t been re-indexed yet, or questions that require reasoning across multiple documents.
Timeline: Prototype in mid-2024, production since Q4 2024. Continuous improvement on chunking and retrieval quality.