SEC RAG Analyst: Hybrid Retrieval over 10-K Filings
A production-style RAG assistant over SEC 10-K filings. Section-aware chunking feeds hybrid BM25-plus-dense retrieval, RRF fusion, and a cross-encoder rerank. The assistant returns grounded answers with inline citations and is measured with a labeled eval.
System architecture
Build spec
- Source: SEC EDGAR 10-K, ticker to CIK
- Retrieval: BM25 + BGE dense in FAISS, RRF-fused
- Rerank: BAAI/bge-reranker-base cross-encoder
- Generation: Claude with inline [n] citations
- Eval: Section-recall@k + reranker ablation
Problem
10-K filings are the hard case for RAG: 100-plus pages, dense tables, repetitive boilerplate, and cross-references. On such documents, naive embed-everything plus top-k cosine retrieval returns plausible but wrong passages. Answers without provenance or measurement cannot be trusted in financial analysis.
Approach
An ingestion step resolves a ticker to its CIK, fetches the latest 10-K from SEC EDGAR, strips HTML, and segments the text into Item sections. Section-aware chunking slides word windows that never straddle an Item boundary. A hybrid index combines BM25 lexical retrieval with BGE-small dense embeddings in FAISS, fuses the two result lists with Reciprocal Rank Fusion, and reranks the fused candidates with a BGE cross-encoder. Generation produces grounded answers with inline citations via Claude, with an extractive fallback when no API key is present. The system is served through a FastAPI /ask endpoint and a Streamlit app, with an eval that measures section-recall@k and the reranker's lift.
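The Item segmentation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the regular expression, the function name split_items, and the sample text are all hypothetical, and the sketch assumes HTML has already been stripped.

```python
import re

# Hypothetical heading pattern: a line that starts with "Item 1.", "ITEM 1A.", "Item 7:" etc.
ITEM_RE = re.compile(r"^[ \t]*item\s+(\d{1,2}[a-c]?)\s*[.:]", re.IGNORECASE | re.MULTILINE)


def split_items(text: str) -> list[tuple[str, str]]:
    """Return (item_label, section_text) pairs in document order."""
    matches = list(ITEM_RE.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        # A section runs from its own heading up to the next heading (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        label = "Item " + m.group(1).upper()
        sections.append((label, text[m.start():end].strip()))
    return sections


# Made-up sample text standing in for a cleaned filing.
sample = """Item 1. Business
We make widgets.
Item 1A. Risk Factors
Demand may fall.
Item 7. Management's Discussion
Revenue grew."""

sections = split_items(sample)
for label, body in sections:
    print(label, "->", len(body.split()), "words")
```

A real filing also lists the same Item headings in its table of contents, so a pattern like this would likely need an extra rule (for example, keeping only the last match per label) before it is usable on EDGAR text.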
Impact
It does the production-grade RAG parts that tutorials skip (section-aware chunking, hybrid retrieval, reranking, and citations) and then measures them. A no-rerank ablation shows whether the cross-encoder earns its latency, and the whole pipeline runs offline end to end via a synthetic filing and the extractive fallback, with no network or API keys.
Decisions & tradeoffs
RRF fusion over score normalization
Reciprocal Rank Fusion merges BM25 and dense results by rank, avoiding fragile cross-scorer normalization. The hybrid merge stays robust when lexical and dense scores live on incomparable scales.
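A minimal sketch of rank-based fusion, for illustration only. The function name rrf_fuse, the chunk ids, and the constant k = 60 (a commonly used default for RRF) are assumptions; the project's actual constant and tie-breaking are not stated here.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical chunk ids, best first, as each retriever might return them.
bm25_ranking = ["c3", "c1", "c7"]
dense_ranking = ["c1", "c9", "c3"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
print([doc_id for doc_id, _ in fused])  # ['c1', 'c3', 'c9', 'c7']
```

Only ranks enter the formula, so the raw BM25 and cosine scores never need to be placed on a common scale.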
Extractive fallback when no API key
When the Anthropic key is absent, the system returns top passages extractively instead of failing. The whole pipeline and its eval run offline and key-free for reproducibility.
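The fallback logic might look roughly like the sketch below. It is an assumption-laden illustration: the function name answer, the generate callable (a stand-in for whatever wraps the Claude call), the top_n parameter, and the check on the ANTHROPIC_API_KEY environment variable are hypothetical choices, not the project's confirmed interface.

```python
import os
from typing import Callable, Optional


def answer(
    question: str,
    passages: list[str],
    generate: Optional[Callable[[str, list[str]], str]] = None,
    top_n: int = 3,
) -> dict:
    """Use the LLM when a key and a generator are available, else return top passages."""
    top = passages[:top_n]
    # Assumption: key presence is detected through this environment variable.
    has_key = bool(os.environ.get("ANTHROPIC_API_KEY"))
    if has_key and generate is not None:
        return {"mode": "generative", "answer": generate(question, top)}
    # Extractive fallback: return the reranked passages verbatim, numbered like citations.
    cited = [f"[{i}] {p}" for i, p in enumerate(top, start=1)]
    return {"mode": "extractive", "answer": "\n".join(cited)}


# Made-up passages standing in for reranked chunks.
result = answer(
    "What are the main risks?",
    ["Demand may fall.", "Supply costs may rise.", "Rates may change.", "Unused."],
)
print(result["mode"])
```

Numbering the fallback passages the same way as the inline [n] citations keeps the two modes comparable in the eval.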
Section-aware chunking
Chunks are bounded to Item sections rather than fixed character counts, so a chunk never straddles a boundary. Retrieved context stays coherent for a document type full of cross-references and boilerplate.
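One way to picture this is a sliding word window that restarts at every section. The sketch below is illustrative only: the function name chunk_sections and the window and stride values are assumptions, and the input is a list of hypothetical (label, text) pairs.

```python
def chunk_sections(
    sections: list[tuple[str, str]], window: int = 200, stride: int = 150
) -> list[dict]:
    """Slide a word window inside each section; windows never span two sections."""
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    chunks = []
    for label, text in sections:
        words = text.split()
        start = 0
        while start < len(words):
            piece = words[start:start + window]
            chunks.append({"section": label, "text": " ".join(piece)})
            if start + window >= len(words):
                break  # this window already reached the end of the section
            start += stride
    return chunks


# Tiny made-up sections; window and stride are shrunk so the overlap is visible.
demo = [("Item 1", "a b c d e"), ("Item 1A", "f g h")]
chunks = chunk_sections(demo, window=4, stride=2)
for c in chunks:
    print(c["section"], "|", c["text"])
```

Because the window is reset per section, the last chunk of a section may be shorter than the window rather than borrowing words from the next Item.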
System notes
- Four-stage retrieval (BM25, dense, RRF fusion, cross-encoder rerank) where each stage's contribution is independently measurable
- A no-rerank ablation flag proves the cross-encoder earns its added latency via section-recall@k lift
- Section-aware chunking never crosses an Item boundary, avoiding cross-reference contamination
- A fully offline path with a synthetic 10-K and extractive fallback runs end to end with no network and no key
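The eval metric and the ablation comparison can be sketched as below. The definition used here is an assumption: a question counts as a hit when any of its top-k retrieved chunks comes from the labeled gold section. The function name and the toy rankings are made up to show the mechanics and are not results from the project.

```python
def section_recall_at_k(
    retrieved_sections: list[list[str]], gold_sections: list[str], k: int
) -> float:
    """Fraction of questions whose gold section appears among the top-k retrieved chunks."""
    if len(retrieved_sections) != len(gold_sections):
        raise ValueError("one retrieved list is required per gold label")
    if not gold_sections:
        return 0.0
    hits = sum(
        1 for got, gold in zip(retrieved_sections, gold_sections) if gold in got[:k]
    )
    return hits / len(gold_sections)


# Toy data only: section labels of retrieved chunks per question, best first.
gold = ["Item 1A", "Item 7", "Item 1"]
no_rerank = [["Item 7", "Item 1A"], ["Item 1", "Item 7"], ["Item 7", "Item 1"]]
with_rerank = [["Item 1A", "Item 7"], ["Item 7", "Item 1"], ["Item 7", "Item 1"]]

for k in (1, 2):
    base = section_recall_at_k(no_rerank, gold, k)
    reranked = section_recall_at_k(with_rerank, gold, k)
    print(f"k={k} no-rerank={base:.2f} rerank={reranked:.2f} lift={reranked - base:+.2f}")
```

Running the same questions with and without the rerank stage, and differencing the two scores at each k, is what turns the ablation flag into a lift number.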
Stack
BM25 · FAISS · BGE · Cross-encoder · Claude · FastAPI