A locally-runnable RAG (Retrieval-Augmented Generation) chatbot that lets you ask natural language questions about any codebase or documentation folder. You point it at a repo or docs directory, it indexes everything, and you can ask things like:
- "How does authentication work in this project?"
- "Where is the database connection initialized?"
- "What does the `processPayment` function do and where is it called?"
This mirrors what NPX does with their proposal generation tool: ingest a corpus of documents, embed them into a vector store, and use an LLM to answer questions grounded in that content.
| Layer | Tool | Why |
|---|---|---|
| Embeddings + LLM | Ollama (local) | Free, private, matches NPX's stack exactly |
| Vector Database | Weaviate (Docker) | NPX's actual stack |
| Ingestion + orchestration | Python (LangChain or manual) | Simple, you already know it |
| Frontend (optional) | React + Next.js | NPX's stack, makes it demo-able |
| File parsing | Python (ast, pathlib, tiktoken) | Parse code + docs into chunks |
You can swap Ollama for the OpenAI API if you want faster or better responses during development; just keep both behind the same interface.
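One way to keep the backends swappable is a small interface that the rest of the pipeline depends on. The sketch below is an assumption about how that could look, not part of the plan above: `ChatBackend`, `EchoBackend`, and `answer` are hypothetical names, and `EchoBackend` is a stand-in that makes no network calls.

```python
from typing import Iterator, Protocol


class ChatBackend(Protocol):
    """Anything that can embed text and stream a chat answer."""

    def embed(self, text: str) -> list[float]: ...

    def chat(self, prompt: str) -> Iterator[str]: ...


class EchoBackend:
    """Stand-in backend for local testing; a real one would call Ollama or OpenAI."""

    def embed(self, text: str) -> list[float]:
        # Toy embedding: character count and word count.
        return [float(len(text)), float(len(text.split()))]

    def chat(self, prompt: str) -> Iterator[str]:
        # Stream the prompt back one word at a time.
        yield from prompt.split()


def answer(backend: ChatBackend, prompt: str) -> str:
    """Collect a streamed answer from whichever backend was injected."""
    return " ".join(backend.chat(prompt))
```

An Ollama-backed and an OpenAI-backed class would each implement the same two methods, so `answer` and the retrieval code would not need to change when you switch.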
[ Codebase / Docs Folder ]
|
v
[ Ingestion Pipeline ] <-- Python script
- Walk directory tree
- Parse .py, .ts, .md, .txt, .json files
- Chunk by file / function / heading
- Generate embeddings (Ollama: nomic-embed-text)
|
v
[ Weaviate Vector Store ] <-- Docker container
- Store chunks + metadata (filename, line range, language)
|
v
[ Query Pipeline ] <-- Python / API
- Take user question
- Embed the question
- Retrieve top-k relevant chunks from Weaviate
- Build prompt: "Given this context: {chunks} — answer: {question}"
- Send to LLM (Ollama: llama3 or mistral)
|
v
[ Response ] <-- streamed answer with source file references
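In the real pipeline Weaviate performs the nearest-neighbour search. The sketch below only reproduces the "retrieve top-k relevant chunks" step in plain Python to show what ranking by cosine similarity means. The function names and the `vector` key are illustrative assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(question_vec: list[float], chunks: list[dict], k: int = 5) -> list[dict]:
    """Return the k chunks whose 'vector' is most similar to question_vec."""
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(question_vec, chunk["vector"]),
        reverse=True,
    )
    return ranked[:k]
```

A brute-force scan like this is fine for a few thousand chunks; the vector store earns its place once the corpus grows and you also want metadata filtering.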
rag-codebase-assistant/
├── ingestion/
│ ├── walker.py # Recursively walk and filter files
│ ├── chunker.py # Split files into meaningful chunks
│ ├── embedder.py # Generate embeddings via Ollama
│ └── indexer.py # Push chunks + embeddings into Weaviate
├── retrieval/
│ ├── query.py # Embed question, query Weaviate, return top-k
│ └── prompt.py # Build prompt with retrieved context
├── llm/
│ └── ollama_client.py # Wrapper for Ollama chat completions
├── api/
│ └── main.py # FastAPI server exposing /ask endpoint
├── frontend/ # Optional React/Next.js chat UI
│ └── ...
├── docker-compose.yml # Weaviate + optional Ollama container
├── ingest.py # CLI entrypoint: python ingest.py ./my-repo
├── chat.py # CLI entrypoint: python chat.py
└── README.md
- Set up Weaviate locally via Docker (`docker-compose up`)
- Write `walker.py`: recursively collect files, filter by extension (`.py`, `.ts`, `.md`, `.txt`; ignore `node_modules`, `.git`, build dirs)
- Write `chunker.py`: split files into chunks:
  - For code: chunk by function/class using AST parsing (Python) or regex (TS)
  - For markdown/docs: chunk by heading sections
  - Max chunk size: ~500 tokens with ~50 token overlap
- Write `embedder.py`: call Ollama's embedding endpoint (`nomic-embed-text` model)
- Write `indexer.py`: create Weaviate schema and upsert chunks with metadata
- Wire together in `ingest.py` CLI
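The walker and the Python half of the chunker can be sketched with the standard library alone. This is a minimal sketch under assumptions: the extension and ignore sets are illustrative defaults, only top-level functions and classes are chunked, and the ~500-token size limit is not enforced here.

```python
import ast
from pathlib import Path

# Illustrative defaults; adjust to the repo being indexed.
ALLOWED_EXTENSIONS = {".py", ".ts", ".md", ".txt"}
IGNORED_DIRS = {"node_modules", ".git", "__pycache__", "dist", "build"}


def walk_files(root: str) -> list[Path]:
    """Recursively collect files with an allowed extension, skipping ignored dirs."""
    found = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in ALLOWED_EXTENSIONS:
            continue
        # Skip anything that lives under an ignored directory.
        if IGNORED_DIRS.intersection(path.relative_to(root).parts):
            continue
        found.append(path)
    return sorted(found)


def chunk_python(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            chunk_type = "class"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunk_type = "function"
        else:
            continue  # imports, constants, and other module-level statements
        # lineno and end_lineno are 1-based and inclusive (end_lineno needs Python 3.8+).
        start, end = node.lineno, node.end_lineno
        chunks.append({
            "content": "\n".join(lines[start - 1:end]),
            "start_line": start,
            "end_line": end,
            "chunk_type": chunk_type,
        })
    return chunks
```

Module-level code that is not inside a function or class is dropped by this sketch; a fuller chunker would likely gather it into a separate `file`-type chunk so imports and constants stay searchable.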
- Write `query.py`: embed incoming question, query Weaviate for top 5 chunks by cosine similarity
- Write `prompt.py`: build a prompt like:

      You are a helpful assistant for a software codebase.
      Use only the following context to answer the question.
      If the answer isn't in the context, say so.

      Context: {retrieved_chunks}

      Question: {user_question}

      Answer:

- Write `ollama_client.py`: call Ollama chat endpoint, stream response
- Wire together in `chat.py` CLI with a simple input loop
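The prompt-building and stream-handling steps can be sketched without any network calls. The code below is an illustration under assumptions: `build_prompt` and `iter_stream_text` are hypothetical names, each chunk is assumed to carry `filepath` and `content` keys, and the streamed lines are assumed to be newline-delimited JSON shaped like `{"message": {"content": "..."}, "done": false}`, which is what Ollama's streaming chat endpoint is expected to emit.

```python
import json
from typing import Iterable, Iterator

PROMPT_TEMPLATE = (
    "You are a helpful assistant for a software codebase.\n"
    "Use only the following context to answer the question.\n"
    "If the answer isn't in the context, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(chunks: list[dict], question: str) -> str:
    """Join retrieved chunks, each labelled with its source file, into one prompt."""
    context = "\n\n".join(
        f"# {chunk['filepath']}\n{chunk['content']}" for chunk in chunks
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON chat events."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank keep-alive lines
        event = json.loads(line)
        yield event.get("message", {}).get("content", "")
        if event.get("done"):
            break  # the final event marks the end of the answer
```

Labelling each chunk with its file path inside the context is what lets the model cite sources in its answer; the CLI loop in `chat.py` would print each fragment from `iter_stream_text` as it arrives.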
- Wrap query pipeline in a FastAPI `/ask` endpoint
- Build a minimal React chat UI (Next.js):
  - Text input for question
  - Streamed response display
  - Source file references shown under each answer
- Connect frontend to FastAPI backend
- Add a README with setup instructions and a demo GIF
- Test it against a real open source repo (e.g. your FindIT project)
- Add a `--repo` flag that auto-clones a GitHub URL and ingests it
- Deploy Weaviate + API to Azure (matches NPX's cloud stack)
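The `--repo` flag could sit alongside the existing positional path argument of `ingest.py`. The sketch below is one possible shape, assuming `git` is on the PATH and that a shallow clone into a temporary directory is acceptable; `parse_args`, `is_github_url`, and `resolve_source` are hypothetical helper names.

```python
import argparse
import subprocess
import tempfile
from urllib.parse import urlparse


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Accept either a local path or --repo URL, but not both."""
    parser = argparse.ArgumentParser(description="Ingest a codebase into the vector store.")
    parser.add_argument("path", nargs="?", help="local directory to ingest")
    parser.add_argument("--repo", help="GitHub URL to clone and ingest")
    args = parser.parse_args(argv)
    if bool(args.path) == bool(args.repo):
        parser.error("pass either a local path or --repo URL")
    return args


def is_github_url(url: str) -> bool:
    """Only allow https://github.com/... URLs to be cloned."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.netloc == "github.com"


def resolve_source(args: argparse.Namespace) -> str:
    """Return the directory to ingest, cloning the repo first if needed."""
    if args.path:
        return args.path
    if not is_github_url(args.repo):
        raise ValueError(f"not a GitHub https URL: {args.repo}")
    dest = tempfile.mkdtemp(prefix="ingest-")
    # Shallow clone: history is not needed for indexing.
    subprocess.run(["git", "clone", "--depth", "1", args.repo, dest], check=True)
    return dest
```

With this in place, `python ingest.py ./my-repo` keeps working as before, and `python ingest.py --repo <url>` would clone first and then ingest the resulting directory.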
schema = {
    "class": "CodeChunk",
    "properties": [
        {"name": "content", "dataType": ["text"]},    # the actual code/text
        {"name": "filepath", "dataType": ["text"]},   # relative file path
        {"name": "language", "dataType": ["text"]},   # python, typescript, markdown
        {"name": "chunkType", "dataType": ["text"]},  # function, class, section, file
        {"name": "startLine", "dataType": ["int"]},   # line number start
        {"name": "endLine", "dataType": ["int"]},     # line number end
    ],
    "vectorizer": "none",  # we supply our own embeddings
}

version: '3.8'
services:
  weaviate:
    image: semitechnologies/weaviate:latest
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 20
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'none'
    volumes:
      - weaviate_data:/var/lib/weaviate
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
volumes:
  weaviate_data:
  ollama_data:

Use these when working through implementation:
- "Implement `walker.py`: recursively walk a directory, return all files with extensions in a given allowlist, skip common ignore patterns like node_modules, .git, __pycache__, dist"
- "Implement `chunker.py`: for Python files use the ast module to split by function and class definitions. For markdown split by ## headings. For other text files split by character count with overlap. Return a list of dicts with keys: content, start_line, end_line, chunk_type"
- "Implement `embedder.py`: call Ollama's POST /api/embeddings endpoint with model nomic-embed-text and return the embedding vector"
- "Implement `indexer.py`: connect to Weaviate at localhost:8080, create the CodeChunk schema if it doesn't exist, batch upsert a list of chunk dicts with their embedding vectors"
- "Implement `query.py`: embed a question string using Ollama, query Weaviate for the top 5 nearest CodeChunk objects by vector similarity, return their content and filepath"
- "Implement a FastAPI app in `api/main.py` with a POST /ask endpoint that accepts a JSON body with a 'question' field and returns a streamed response"
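One detail these prompts leave open: the chunker returns snake_case keys (`start_line`, `chunk_type`), while the `CodeChunk` schema uses camelCase properties (`startLine`, `chunkType`). The sketch below shows one way to bridge the two before indexing, plus a deterministic ID so that re-ingesting the same chunk overwrites it instead of duplicating it. It assumes the caller has already added `filepath` and `language` to each chunk dict, and that the indexer passes the ID along when it creates the object; both helper names are hypothetical and no Weaviate client call is shown.

```python
import uuid

# Maps the chunker's snake_case keys to the camelCase CodeChunk property names.
KEY_TO_PROPERTY = {
    "content": "content",
    "filepath": "filepath",
    "language": "language",
    "chunk_type": "chunkType",
    "start_line": "startLine",
    "end_line": "endLine",
}


def to_properties(chunk: dict) -> dict:
    """Rename chunk keys to match the CodeChunk schema; raises KeyError if one is missing."""
    return {prop: chunk[key] for key, prop in KEY_TO_PROPERTY.items()}


def chunk_uuid(chunk: dict) -> str:
    """Derive a stable ID from file path and line range."""
    name = f"{chunk['filepath']}:{chunk['start_line']}-{chunk['end_line']}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, name))
```

An ID based on path and line range stays stable across runs, but it will change when lines shift after an edit, so stale chunks for a modified file would still need to be deleted before re-indexing it.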
In an interview at NPX:
"I built a RAG pipeline that lets you query any codebase or documentation in natural language. It uses Weaviate for vector storage and Ollama for local LLM inference — which I chose specifically because they're in your stack. The core idea is the same as your proposal generation tool: ingest a document corpus, embed it, and use retrieval to ground the LLM's answers in real content rather than hallucinations."
That's a pitch that will land.