2025 · Local RAG · Published on PyPI

site2bot: Any Website to an Offline Chatbot

One command turns any website into a fully local, no-API-key RAG chatbot that answers only from the crawled site. Crawl, chunk, local vector search, grounded answers via Ollama, in about 600 lines of readable Python.

System architecture

Crawl (same-domain) → Chunk + embed (MiniLM) → numpy index (no vector DB) → Cosine top-k (refuse < 0.25) → Ollama (grounded)

Build spec

Embeddings: all-MiniLM-L6-v2 (~90 MB)
Index: Plain numpy cosine, no vector DB
Retrieval: top-k 4 · refuse below 0.25
Generation: Ollama (llama3.2) or OpenAI-compatible
Distribution: pip install site2bot · about 600 LOC

Problem

Most chat-with-your-data tools require an OpenAI key, a cloud vector database, and uploading your data to someone else's server. They also pull in heavy frameworks and frequently hallucinate answers the site never contained.

Approach

A roughly 600-line, dependency-light pipeline: a polite same-domain crawler strips noise tags and harvests content, a recursive chunker splits text into overlapping chunks keeping source URLs, and a MiniLM embedder builds a plain numpy index cached locally. At query time it does top-k cosine search and builds a guardrailed grounded prompt; generation streams from local Ollama or any OpenAI-compatible backend. A relevance threshold makes the bot refuse rather than hallucinate when no chunk is relevant.
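The chunking step can be pictured with a small sketch. This is a simplified fixed-window stand-in, not the project's recursive chunker: the `Chunk` class, the `chunk_page` name, and the 500/100 character sizes are all assumptions made for illustration. What it shares with the description above is the overlap between neighbouring chunks and the source URL carried on every chunk.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    url: str  # source page, kept so an answer can point back to where it came from


def chunk_page(text: str, url: str, size: int = 500, overlap: int = 100) -> list[Chunk]:
    """Split one page into overlapping character windows tagged with its URL.

    Hypothetical helper: sizes are illustrative, not site2bot's defaults.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    text = " ".join(text.split())  # collapse whitespace left over from HTML
    chunks: list[Chunk] = []
    step = size - overlap  # each window starts `step` characters after the last
    for start in range(0, len(text), step):
        chunks.append(Chunk(text[start:start + size], url))
        if start + size >= len(text):
            break  # this window already reached the end of the page
    return chunks
```

Calling `chunk_page(page_text, page_url)` on every crawled page and concatenating the results would give the list of chunks that gets embedded.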

Impact

It delivers a private, offline-capable doc chatbot with a Gradio UI and zero infrastructure: no vector DB, no LangChain. It shows that a single-website RAG system needs nothing more than numpy cosine search, which takes milliseconds even over 50k chunks, and that grounded refusal eliminates a common class of hallucination.

Decisions & tradeoffs

numpy index instead of a vector database

For a single website the corpus is small, so brute-force cosine in numpy is instant and needs zero infra. A hosted vector DB would add deployment complexity for no measurable latency benefit.
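A minimal sketch of what brute-force cosine search in numpy can look like. The function names `build_index` and `top_k` are assumptions for illustration and are not taken from the site2bot source; the default `k=4` matches the build spec.

```python
import numpy as np


def build_index(embeddings: np.ndarray) -> np.ndarray:
    """L2-normalise rows once so cosine similarity becomes a plain dot product."""
    emb = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.clip(norms, 1e-12, None)  # clip avoids dividing by zero


def top_k(index: np.ndarray, query: np.ndarray, k: int = 4):
    """Return (row ids, cosine scores) of the k most similar rows, best first."""
    q = np.asarray(query, dtype=np.float32)
    q = q / max(float(np.linalg.norm(q)), 1e-12)
    scores = index @ q  # one matrix-vector product scores every chunk
    k = min(k, len(scores))
    if k == 0:
        return np.array([], dtype=int), scores
    best = np.argpartition(-scores, k - 1)[:k]  # unordered top k
    best = best[np.argsort(-scores[best])]      # order those k by score
    return best, scores[best]
```

For scale: all-MiniLM-L6-v2 produces 384-dimensional vectors, so 50k chunks is a 50,000 × 384 float32 matrix of roughly 77 MB, and each query is a single pass over it.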

No LangChain or LlamaIndex

Those frameworks add hundreds of dependencies and deprecation churn for a pipeline that fits in about 600 readable lines. Staying framework-free keeps the whole RAG flow auditable and stable.

Refuse below a relevance threshold

Rather than letting the model freestyle on weak context, any question whose best chunk scores under the threshold returns a fixed "I don't have that" answer. It trades the occasional missed answer for eliminating confident hallucinations.
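The refusal gate can be sketched in a few lines. Everything named here is hypothetical: the refusal wording, the prompt text, and the `build_prompt` and `respond` helpers are illustrations, and only the 0.25 cutoff comes from the build spec. The `generate` argument stands in for whichever backend produces the text.

```python
REFUSAL = "I don't have that information in the crawled site."  # hypothetical wording
THRESHOLD = 0.25  # cosine cutoff quoted in the build spec


def build_prompt(question, hits, threshold=THRESHOLD):
    """hits: list of (score, text, url), best first. Returns None to signal refusal."""
    if not hits or hits[0][0] < threshold:
        return None  # best chunk is too weak, so do not call the model at all
    context = "\n\n".join(
        f"[{i}] {url}\n{text}" for i, (_, text, url) in enumerate(hits, 1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def respond(question, hits, generate):
    """Return the fixed refusal, or whatever `generate` produces for the grounded prompt."""
    prompt = build_prompt(question, hits)
    return REFUSAL if prompt is None else generate(prompt)
```

Checking the score before the model is called means a refusal costs no generation time, and the model never sees a question it has no context for.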

System notes

  • No API keys, no cloud, no vector database: plain numpy cosine search over local embeddings
  • Grounded-only answers: refuses below a 0.25 cosine relevance threshold instead of hallucinating
  • Polite crawler: same-domain only, declared user agent, skips binary extensions
  • Pluggable backends: local Ollama by default or any OpenAI-compatible API
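The polite-crawler rules in the list above (same-domain only, skip binary extensions) come down to a link filter. The sketch below uses only the standard library; the `normalise_link` name and the extension list are assumptions for illustration, not the project's actual values.

```python
from urllib.parse import urldefrag, urljoin, urlparse

# Illustrative list only; the real crawler's skip list is not given in this write-up.
SKIP_EXTENSIONS = (".pdf", ".zip", ".png", ".jpg", ".jpeg", ".gif", ".mp4", ".exe")


def normalise_link(base_url: str, href: str, root_host: str):
    """Resolve href against base_url; return a crawlable URL or None to skip it."""
    url, _fragment = urldefrag(urljoin(base_url, href))  # drop #anchors
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return None  # mailto:, javascript:, tel: and similar
    if parsed.netloc.lower() != root_host.lower():
        return None  # stay on the site being crawled
    if parsed.path.lower().endswith(SKIP_EXTENSIONS):
        return None  # binary file, nothing to chunk
    return url
```

Comparing the full host means subdomains are treated as off-site in this sketch; whether site2bot does the same is not stated here.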

Stack

sentence-transformers · numpy · Ollama · Gradio · BeautifulSoup · PyPI

View source on GitHub