site2bot: Any Website to an Offline Chatbot
One command turns any website into a fully local, no-API-key RAG chatbot that answers only from the crawled site. Crawl, chunk, local vector search, grounded answers via Ollama, in about 600 lines of readable Python.
System architecture
Build spec
- Embeddings: all-MiniLM-L6-v2 (~90 MB)
- Index: plain numpy cosine, no vector DB
- Retrieval: top-k 4 · refuse below 0.25
- Generation: Ollama (llama3.2) or OpenAI-compatible
- Distribution: pip install site2bot · about 600 LOC
Problem
Most chat-with-your-data tools require an OpenAI key, a cloud vector database, and uploading your data to someone else's server. They also pull in heavy frameworks and frequently hallucinate answers the site never contained.
Approach
A roughly 600-line, dependency-light pipeline: a polite same-domain crawler strips noise tags and harvests content, a recursive chunker splits text into overlapping chunks keeping source URLs, and a MiniLM embedder builds a plain numpy index cached locally. At query time it does top-k cosine search and builds a guardrailed grounded prompt; generation streams from local Ollama or any OpenAI-compatible backend. A relevance threshold makes the bot refuse rather than hallucinate when no chunk is relevant.
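The chunking step can be sketched as follows. This is a minimal illustration of recursive splitting with overlap, not site2bot's actual code: the names `Chunk`, `split_recursive`, and `chunk_page`, the separator order, and the `max_chars`/`overlap` defaults are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    url: str  # source page, kept so answers can point back to it

# Coarse-to-fine separators (assumed order): paragraphs, lines, sentences, words.
SEPARATORS = ["\n\n", "\n", ". ", " "]

def split_recursive(text, max_chars, separators=SEPARATORS):
    """Split text into pieces of at most max_chars, preferring coarse separators."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        # Greedily pack parts into pieces that stay under the limit.
        pieces, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    pieces.append(current)
                current = part
        if current:
            pieces.append(current)
        # Any piece that is still too long is split again with finer separators.
        out = []
        for piece in pieces:
            out.extend(split_recursive(piece, max_chars, separators[i + 1:]))
        return out
    # No separator left: fall back to a hard cut.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def chunk_page(text, url, max_chars=500, overlap=50):
    """Chunk one page; each chunk after the first is prefixed with the tail of
    the previous piece, so a chunk may exceed max_chars by about `overlap`."""
    pieces = split_recursive(text, max_chars)
    chunks = []
    for i, piece in enumerate(pieces):
        prefix = pieces[i - 1][-overlap:] if i > 0 and overlap > 0 else ""
        body = (prefix + " " + piece).strip() if prefix else piece
        chunks.append(Chunk(text=body, url=url))
    return chunks
```

Calling `chunk_page(page_text, page_url)` once per crawled page yields chunks that each carry their source URL, which is what lets the grounded prompt cite where an answer came from.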
Impact
It delivers a private, offline-capable doc chatbot with a Gradio UI and zero infrastructure: no vector DB, no LangChain. It shows that a single-website RAG system needs nothing more than numpy cosine search, which takes milliseconds even over 50k chunks, and that grounded refusal removes a common class of hallucination.
Decisions & tradeoffs
numpy index instead of a vector database
For a single website the corpus is small, so brute-force cosine in numpy is instant and needs zero infra. A hosted vector DB would add deployment complexity for no measurable latency benefit.
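To make the size argument concrete: all-MiniLM-L6-v2 produces 384-dimensional vectors, so 50,000 chunks stored as float32 take 50,000 × 384 × 4 bytes ≈ 77 MB, and one query is a single matrix-vector product of about 19 million multiply-adds. The sketch below shows brute-force cosine top-k under those assumptions; `normalize` and `top_k` are hypothetical names, not site2bot's API.

```python
import numpy as np

def normalize(matrix):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)

def top_k(query_vec, index, k=4):
    """Return (row indices, cosine scores) of the k best rows, best first.

    `index` is assumed to be an (n_chunks, dim) array of unit-length rows.
    """
    if len(index) == 0:
        return np.array([], dtype=int), np.array([], dtype=index.dtype)
    q = query_vec / max(float(np.linalg.norm(query_vec)), 1e-12)
    scores = index @ q                      # one matrix-vector product
    k = min(k, len(scores))
    # argpartition finds the k largest in O(n); only those k are then sorted.
    best = np.argpartition(-scores, k - 1)[:k]
    best = best[np.argsort(-scores[best])]
    return best, scores[best]
```

In use, the index would be built once with `normalize(embeddings)` and cached to disk (for example with `np.save`), then loaded and queried with `top_k(query_embedding, index)`.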
No LangChain or LlamaIndex
Those frameworks add hundreds of dependencies and deprecation churn for a pipeline that fits in about 600 readable lines. Staying framework-free keeps the whole RAG flow auditable and stable.
Refuse below a relevance threshold
Rather than letting the model freestyle on weak context, any question whose best chunk scores under the threshold returns a fixed "I don't have that" answer. This trades the occasional missed answer for eliminating confident hallucinations.
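A refusal gate of this kind might look like the sketch below. The 0.25 cutoff comes from the build spec; the refusal wording, the prompt text, and the names `grounded_prompt`, `answer`, and `generate` are assumptions for illustration. The sketch passes every retrieved chunk into the prompt once the best one clears the threshold; whether site2bot also filters the weaker ones is not stated.

```python
REFUSAL = "I don't have that information on this site."  # hypothetical wording
THRESHOLD = 0.25  # cosine cutoff quoted in the build spec

def grounded_prompt(question, hits, threshold=THRESHOLD):
    """Return a grounded prompt, or None when the bot should refuse.

    hits: list of (score, text, url) tuples sorted by descending score.
    """
    if not hits or hits[0][0] < threshold:
        return None
    context = "\n\n".join(
        f"[{i}] {url}\n{text}" for i, (_, text, url) in enumerate(hits, start=1)
    )
    return (
        "Answer using only the numbered sources below. "
        f'If they do not contain the answer, reply exactly: "{REFUSAL}"\n\n'
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def answer(question, hits, generate):
    """`generate` is any callable that maps a prompt string to model output."""
    prompt = grounded_prompt(question, hits)
    if prompt is None:
        return REFUSAL  # the model is never called on weak context
    return generate(prompt)
```

Because the refusal happens before generation, an off-topic question costs no model call at all, and the fixed answer cannot drift.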
System notes
- No API keys, no cloud, no vector database: plain numpy cosine search over local embeddings
- Grounded-only answers: refuses below a 0.25 cosine relevance threshold instead of hallucinating
- Polite crawler: same-domain only, declared user agent, skips binary extensions
- Pluggable backends: local Ollama by default or any OpenAI-compatible API
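The crawler's politeness rules can be expressed as a small link filter. This is a sketch using only the standard library; the extension list, the user-agent string, and the function name `crawlable` are assumptions rather than site2bot's actual values.

```python
from urllib.parse import urldefrag, urljoin, urlparse

# Hypothetical values for illustration.
HEADERS = {"User-Agent": "site2bot-example/0.1"}  # would be sent with every request
SKIP_EXTENSIONS = (".pdf", ".zip", ".png", ".jpg", ".jpeg", ".gif", ".mp4", ".exe")

def crawlable(link, page_url, root_url):
    """Resolve `link` found on `page_url`; return the absolute URL if it should
    be crawled, or None if it is off-domain, non-HTTP, or a binary file."""
    absolute, _fragment = urldefrag(urljoin(page_url, link))
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None                      # mailto:, javascript:, etc.
    if parsed.netloc != urlparse(root_url).netloc:
        return None                      # same-domain only
    if parsed.path.lower().endswith(SKIP_EXTENSIONS):
        return None                      # skip binary downloads
    return absolute
```

Dropping the fragment before the check means `/docs/intro#setup` and `/docs/intro` count as the same page, so a visited set keyed on the returned URL avoids fetching it twice.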
Stack
sentence-transformers · numpy · Ollama · Gradio · BeautifulSoup · PyPI