2025 · Local RAG · Published on PyPI

site2bot · Any Website to an Offline Chatbot

One command turns any website into a fully local, no-API-key RAG chatbot that answers only from the crawled site. Crawl, chunk, local vector search, grounded answers via Ollama, in about 600 lines of readable Python.

System architecture

Build spec

Embeddings: all-MiniLM-L6-v2 (~90 MB)
Index: Plain numpy cosine, no vector DB
Retrieval: top-k 4 · refuse below 0.25
Generation: Ollama (llama3.2) or OpenAI-compatible
Distribution: pip install site2bot · about 600 LOC

Problem

Most chat-with-your-data tools require an OpenAI key, a cloud vector database, and uploading your data to someone else's server. They also pull in heavy frameworks and frequently hallucinate answers the site never contained.

Approach

A roughly 600-line, dependency-light pipeline: a polite same-domain crawler strips noise tags and harvests content, a recursive chunker splits text into overlapping chunks keeping source URLs, and a MiniLM embedder builds a plain numpy index cached locally. At query time it does top-k cosine search and builds a guardrailed grounded prompt; generation streams from local Ollama or any OpenAI-compatible backend. A relevance threshold makes the bot refuse rather than hallucinate when no chunk is relevant.
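
The chunking step can be sketched as follows. This is a minimal illustration, not the package's actual code: the separator order, the `Chunk` dataclass, the function names, and the `max_chars` and `overlap` defaults are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Assumed separator order, coarse to fine (hypothetical, not taken from site2bot).
SEPARATORS = ["\n\n", "\n", ". ", " "]


@dataclass
class Chunk:
    text: str
    url: str  # source page, kept so answers can point back to it


def split_recursive(text, max_chars, seps=SEPARATORS):
    """Split text into pieces of at most max_chars, trying coarse separators first."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not seps:
        # No separator left to try: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    pieces = []
    for part in text.split(seps[0]):
        pieces.extend(split_recursive(part, max_chars, seps[1:]))
    return pieces


def chunk_page(text, url, max_chars=500, overlap=50):
    """Merge small pieces into chunks, carrying a tail of each chunk into the next.

    A chunk that starts with carried-over text can reach
    max_chars + overlap + 1 characters.
    """
    chunks, current = [], ""
    for piece in split_recursive(text, max_chars):
        candidate = (current + " " + piece).strip()
        if current and len(candidate) > max_chars:
            chunks.append(Chunk(current, url))
            current = (current[-overlap:] + " " + piece).strip()
        else:
            current = candidate
    if current:
        chunks.append(Chunk(current, url))
    return chunks
```

Calling `chunk_page(page_text, page_url)` returns a list of `Chunk` objects whose `url` field travels with the text through embedding and retrieval.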

Impact

It delivers a private, offline-capable doc chatbot with a Gradio UI and zero infrastructure: no vector DB, no LangChain. It shows that a single-website RAG system needs nothing more than numpy cosine search, which takes milliseconds even over 50k chunks, and that grounded refusal eliminates a common class of hallucination.

Decisions & tradeoffs

numpy index instead of a vector database

For a single website the corpus is small, so brute-force cosine in numpy is instant and needs zero infra. A hosted vector DB would add deployment complexity for no measurable latency benefit.
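
A brute-force index of this kind might look like the sketch below. The function names are hypothetical, and random vectors stand in for real MiniLM embeddings (384 dimensions for all-MiniLM-L6-v2); only the numpy calls are real.

```python
import numpy as np


def normalize(vectors):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def top_k(index, query, k=4):
    """Return (ids, scores) of the k most similar rows, best first.

    index: (n, d) array of unit-normalized chunk embeddings.
    query: (d,) embedding of the question.
    """
    k = min(k, index.shape[0])
    if k == 0:
        return np.array([], dtype=int), np.array([])
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = index @ q  # one matrix-vector product over the whole corpus
    top = np.argpartition(-scores, k - 1)[:k]  # k largest, unordered
    top = top[np.argsort(-scores[top])]        # order just those k
    return top, scores[top]


# Stand-in data: random vectors instead of real embeddings.
rng = np.random.default_rng(0)
index = normalize(rng.normal(size=(1000, 384)))
query = index[42] + 0.01 * rng.normal(size=384)
ids, scores = top_k(index, query, k=4)
```

The whole index is a single array, so caching it locally can be as simple as `np.save` and `np.load`.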

No LangChain or LlamaIndex

Those frameworks add hundreds of dependencies and deprecation churn for a pipeline that fits in about 600 readable lines. Staying framework-free keeps the whole RAG flow auditable and stable.

Refuse below a relevance threshold

Rather than letting the model freestyle on weak context, any question whose best chunk scores under the threshold returns a fixed I-don't-have-that answer. It trades the occasional missed answer for eliminating confident hallucinations.
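
The refusal gate can be illustrated in a few lines. This is a sketch under assumptions: the refusal wording, the prompt template, and the `build_prompt` name are hypothetical; only the 0.25 threshold comes from the build spec.

```python
REFUSAL = "I don't have that information on this site."  # hypothetical wording
THRESHOLD = 0.25  # cosine relevance threshold from the build spec


def build_prompt(question, hits, threshold=THRESHOLD):
    """Gate generation on retrieval quality.

    hits: list of (score, text, url) tuples, best first.
    Returns (True, prompt) when the best hit clears the threshold,
    otherwise (False, REFUSAL) and the model is never called.
    """
    if not hits or hits[0][0] < threshold:
        return False, REFUSAL
    context = "\n\n".join(
        f"[{i}] {url}\n{text}" for i, (_, text, url) in enumerate(hits, 1)
    )
    prompt = (
        "Answer using only the sources below. "
        "If they do not contain the answer, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return True, prompt
```

Because the check runs before generation, a refusal costs one vector search and no model call.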

System notes

No API keys, no cloud, no vector database: plain numpy cosine search over local embeddings
Grounded-only answers: refuses below a 0.25 cosine relevance threshold instead of hallucinating
Polite crawler: same-domain only, declared user agent, skips binary extensions
Pluggable backends: local Ollama by default or any OpenAI-compatible API
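
The crawler's link filter could be sketched with the standard library alone. The user agent string, the extension list, and both function names are assumptions for illustration, not the values site2bot ships with.

```python
import posixpath
from urllib.parse import urldefrag, urljoin, urlparse

USER_AGENT = "site2bot-example/0.1"  # hypothetical; sent as the User-Agent header

# Assumed list of binary extensions to skip.
BINARY_EXTENSIONS = {
    ".pdf", ".zip", ".gz", ".png", ".jpg", ".jpeg", ".gif",
    ".svg", ".ico", ".mp3", ".mp4", ".woff", ".woff2",
}


def normalize_link(page_url, href):
    """Resolve a possibly relative href and drop its #fragment."""
    return urldefrag(urljoin(page_url, href)).url


def should_visit(url, root_url):
    """Allow only http(s) pages on the root's host that are not binary files."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if parsed.netloc.lower() != urlparse(root_url).netloc.lower():
        return False
    extension = posixpath.splitext(parsed.path)[1].lower()
    return extension not in BINARY_EXTENSIONS
```

Each link found on a page would pass through `normalize_link` and then `should_visit` before joining the crawl queue.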

Stack

sentence-transformers · numpy · Ollama · Gradio · BeautifulSoup · PyPI
