A local RAG assistant for asking natural-language questions about a codebase or documentation folder.
The MVP uses:
- Ollama for local embeddings and chat
- Weaviate in Docker for vector search
- Python CLIs for ingestion and chat
- FastAPI for the first API surface
The application has two main workflows: indexing and asking.
During indexing, ingest.py receives a local repo or docs folder path. The walker scans that folder, skips noisy directories like .git, .venv, node_modules, dist, and build, and keeps useful source/documentation files. The chunker then splits those files into smaller records. Python files are split around top-level functions and classes, Markdown files are split by headings, JavaScript and TypeScript files are split around common declarations, and other supported text files are split by size.
Each chunk is sent to Ollama's embedding API using the nomic-embed-text model. The resulting vector plus metadata such as filepath, language, chunk type, line range, and symbol name are stored in Weaviate. Weaviate is configured with vectorizer: none, which means this app supplies its own embeddings instead of asking Weaviate to generate them.
During question answering, chat.py or the FastAPI /ask endpoint receives a user question. The question is embedded with the same Ollama embedding model, then Weaviate searches for the closest stored chunks using vector similarity. Those chunks are formatted into a grounded prompt and sent to Ollama's chat model, currently llama3.2. The final answer is returned with source references so the user can inspect which files supported the response.
In junior-dev terms: the app turns code files into searchable numeric fingerprints, finds the fingerprints most similar to your question, and gives only those relevant snippets to the LLM so the answer is based on the indexed repo instead of generic guesses.
This project is a local RAG pipeline for querying codebases and documentation in natural language. I built the ingestion, retrieval, and answer-generation flow manually in Python so the architecture is easy to inspect and explain. It uses Ollama for private local embeddings and chat inference, Weaviate for vector storage/search, and FastAPI for a simple API layer.
The key design decision is separating the system into clear stages: file discovery, chunking, embedding, indexing, retrieval, prompt construction, and answer generation. That keeps the app modular and makes it straightforward to improve retrieval quality, add more file parsers, swap models, or build a frontend later.
Run these from C:\src\projects\rag-codebase-assistant.
uv sync
docker compose up -d
ollama listIf the models are not listed yet:
ollama pull nomic-embed-text
ollama pull llama3.2Point the ingester at any local repo or docs folder. It does not need to live inside this project.
uv run python ingest.py C:\src\projects\some-other-repo --resetUse --reset when you want to clear the existing Weaviate collection before indexing.
uv run python chat.pyRetrieve more or fewer chunks per answer:
uv run python chat.py --top-k 10uv run uvicorn rag_assistant.api.main:app --reloadThen call:
Invoke-RestMethod `
-Method Post `
-Uri http://127.0.0.1:8000/ask `
-ContentType application/json `
-Body '{"question":"How does authentication work?","top_k":5}'The frontend uses the newline-delimited JSON stream at /ask/events.
In a second terminal:
cd frontend
npm.cmd install
npm.cmd run devThen open http://localhost:3000.
If your API is not running on http://127.0.0.1:8000, copy frontend/.env.example to frontend/.env.local and update NEXT_PUBLIC_API_URL.
Copy .env.example to .env if you want to customize endpoints or model names.
WEAVIATE_URL=http://localhost:8080
OLLAMA_URL=http://localhost:11434
OLLAMA_EMBED_MODEL=nomic-embed-text
OLLAMA_CHAT_MODEL=llama3.2
WEAVIATE_CLASS=CodeChunk
uv run python -m unittest discover -s tests