RAG Evaluation Framework
Before/After Numbers for Every Change
A RAG evaluation framework with frozen synthetic test sets and three-metric decomposition, turning pipeline tuning from guesswork into measured experiments.
System architecture
Build spec
- Test set: 100–200 synthetic QA pairs · frozen + committed
- Metrics: Faithfulness · answer relevancy · context recall
- Pipeline: LangChain + ChromaDB · temperature 0
- Experiments: YAML configs (chunking, top-k, prompts)
- Analysis: Before/after tables · per-question diffs
Problem
Most RAG builders test on a handful of questions by hand and ship. Then every change to chunking, retrieval depth, or prompts is a guess, because nothing is measured.
Approach
The framework generates a synthetic test set (100 to 200 QA pairs) from the corpus with RAGAS, freezes and commits it, then scores every pipeline configuration on three orthogonal metrics: faithfulness, answer relevancy, and context recall. Each experiment is a YAML config (chunk size, overlap, top-k, prompt template); a compare tool produces before/after tables with per-question diffs sliced by question type.
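As a rough illustration of the before/after table, here is a minimal sketch of a compare step. It assumes each run is stored as a mapping from question id to the three metric scores; the function names, the storage shape, and the sample numbers are hypothetical and not taken from the project.

```python
from statistics import mean

METRICS = ("faithfulness", "answer_relevancy", "context_recall")


def compare_runs(before, after):
    """Return one row per metric: (metric, before mean, after mean, delta)."""
    # Only questions scored in both runs are comparable.
    shared = sorted(set(before) & set(after))
    rows = []
    for metric in METRICS:
        b = mean(before[q][metric] for q in shared)
        a = mean(after[q][metric] for q in shared)
        rows.append((metric, round(b, 3), round(a, 3), round(a - b, 3)))
    return rows


def format_table(rows):
    """Render the rows as a fixed-width text table."""
    lines = [f"{'metric':<18}{'before':>8}{'after':>8}{'delta':>8}"]
    for metric, b, a, d in rows:
        lines.append(f"{metric:<18}{b:>8.3f}{a:>8.3f}{d:>+8.3f}")
    return "\n".join(lines)


# Made-up scores for two questions, before and after a config change.
baseline = {
    "q1": {"faithfulness": 1.0, "answer_relevancy": 0.75, "context_recall": 0.5},
    "q2": {"faithfulness": 0.5, "answer_relevancy": 0.75, "context_recall": 0.5},
}
candidate = {
    "q1": {"faithfulness": 1.0, "answer_relevancy": 0.75, "context_recall": 1.0},
    "q2": {"faithfulness": 0.5, "answer_relevancy": 0.75, "context_recall": 0.5},
}

rows = compare_runs(baseline, candidate)
print(format_table(rows))
```

In this toy data only context recall moves (0.5 to 0.75), which is the kind of signal a retrieval-side change such as a larger top-k would be expected to produce.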
Impact
Failures localize instead of blurring: low context recall on long documents points to a retrieval problem, while low faithfulness means generation is hallucinating past its context. Per-question scores survive every run, so finding which questions broke is one command.
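A minimal sketch of what "finding which questions broke" could look like, assuming per-question scores are kept as a mapping from question id to metric scores. The function name, the `min_drop` parameter, and the sample scores are illustrative assumptions, not the project's actual command.

```python
def find_regressions(before, after, metric, min_drop=0.1):
    """List (question_id, before, after) where `metric` fell by at least
    `min_drop`, largest drop first."""
    drops = []
    for qid in before.keys() & after.keys():
        b, a = before[qid][metric], after[qid][metric]
        if b - a >= min_drop:
            drops.append((qid, b, a))
    # Most negative change first; question id breaks ties deterministically.
    return sorted(drops, key=lambda row: (row[2] - row[1], row[0]))


# Made-up context recall scores from two runs.
run_a = {
    "q1": {"context_recall": 1.0},
    "q2": {"context_recall": 0.75},
    "q3": {"context_recall": 0.5},
}
run_b = {
    "q1": {"context_recall": 0.5},
    "q2": {"context_recall": 0.75},
    "q3": {"context_recall": 0.25},
}

broken = find_regressions(run_a, run_b, "context_recall", min_drop=0.25)
print(broken)  # [('q1', 1.0, 0.5), ('q3', 0.5, 0.25)]
```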
Decisions & tradeoffs
Freeze the test set
Regenerating questions per run silently shifts the baseline and makes comparisons meaningless. Generate once, commit, and every experiment answers against the same exam.
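One way to make the freeze enforceable is to fingerprint the committed test set and refuse to run if it changes. The sketch below assumes the test set is serialized as JSONL and the hash is stored alongside it; the function names and record fields are hypothetical, and the example works on in-memory text where a real run would read the committed file.

```python
import hashlib
import json


def serialize_test_set(qa_pairs):
    """Canonical JSONL: one sorted-key JSON object per line, so the same
    pairs always produce the same bytes."""
    return "".join(
        json.dumps(pair, sort_keys=True, ensure_ascii=False) + "\n"
        for pair in qa_pairs
    )


def fingerprint(text):
    """SHA-256 hex digest of the serialized test set."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def verify_frozen(text, expected_hash):
    """Raise if the test set no longer matches the committed fingerprint."""
    actual = fingerprint(text)
    if actual != expected_hash:
        raise ValueError(
            f"test set changed: expected {expected_hash[:12]}, got {actual[:12]}"
        )


# Made-up QA pairs standing in for the generated test set.
pairs = [
    {"id": "q1", "question": "What is the refund window?", "answer": "30 days"},
    {"id": "q2", "question": "Who approves exceptions?", "answer": "The team lead"},
]
frozen_text = serialize_test_set(pairs)
frozen_hash = fingerprint(frozen_text)  # commit this next to the JSONL file

verify_frozen(frozen_text, frozen_hash)  # passes silently when unchanged
```

Checking the hash at the start of every run turns an accidental regeneration into a loud error rather than a quietly shifted baseline.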
Three metrics over one score
A single RAG score can't say whether retrieval or generation failed. Faithfulness, answer relevancy, and context recall decompose the pipeline, so a low score points to the stage that needs fixing.
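The decomposition can be read as a small triage table. The sketch below maps each low metric to the stage worth inspecting first; the 0.7 threshold, the function name, and the hint wording are assumptions chosen for illustration, not values from the project.

```python
def diagnose(scores, threshold=0.7):
    """Return a hint for every metric that falls below `threshold`.

    Retrieval is checked first because generation can only be as good as
    the context it was given.
    """
    hints = {
        "context_recall": "retrieval: needed evidence was not in the retrieved chunks",
        "faithfulness": "generation: answer makes claims the context does not support",
        "answer_relevancy": "relevancy: answer does not address the question asked",
    }
    return [hint for metric, hint in hints.items() if scores[metric] < threshold]


# Made-up scores: good retrieval and relevancy, poor faithfulness.
example = {"faithfulness": 0.4, "answer_relevancy": 0.9, "context_recall": 0.95}
print(diagnose(example))
```

A single blended score for the same example would hide that retrieval is fine and only generation needs attention.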
Configs as experiments
Chunk size, overlap, top-k, and prompt template all live in YAML. Changing an experiment is a diff, and reproducing one is a filename.
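To show how a config file could map onto a validated experiment, here is a minimal sketch. The key names (`name`, `chunk_size`, `chunk_overlap`, `top_k`, `prompt_template`) and the sample values are hypothetical; the dictionary stands in for what a YAML parser such as PyYAML's `yaml.safe_load` would return, so the example runs with the standard library only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """One experiment: everything that varies between runs."""
    name: str
    chunk_size: int
    chunk_overlap: int
    top_k: int
    prompt_template: str

    def __post_init__(self):
        # Reject configs that cannot produce a sensible chunking or retrieval.
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")
        if self.top_k <= 0:
            raise ValueError("top_k must be positive")


def config_from_mapping(raw):
    """Build a config from a parsed file; unknown or missing keys raise TypeError."""
    return ExperimentConfig(**raw)


# Stand-in for the parsed contents of a hypothetical experiment file.
raw = {
    "name": "chunk512_top5",
    "chunk_size": 512,
    "chunk_overlap": 64,
    "top_k": 5,
    "prompt_template": "Answer using only the context.\n\n{context}\n\nQ: {question}",
}
config = config_from_mapping(raw)
print(config.name, config.chunk_size, config.top_k)
```

Failing fast on typos and impossible values keeps a bad config from producing a run that looks valid but measures something else.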
System notes
- RAGAS-generated synthetic test set: built once from the corpus, frozen, committed
- Three orthogonal metrics localize failure: retrieval vs generation vs relevancy
- Config-driven experiments: one YAML is one reproducible run
- Per-question scores preserved, enabling slice analysis by doc length and question type
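Slice analysis over preserved per-question scores might look like the sketch below. It assumes each record carries the source document's length and a question type; the field names (`doc_tokens`, `question_type`), the 2000-token cutoff, and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def length_bucket(n_tokens, cutoff=2000):
    """Coarse document-length bucket; the cutoff is an arbitrary example."""
    return "long" if n_tokens >= cutoff else "short"


def slice_means(records, metric):
    """Mean of `metric` per (length bucket, question type) slice."""
    groups = defaultdict(list)
    for record in records:
        key = (length_bucket(record["doc_tokens"]), record["question_type"])
        groups[key].append(record[metric])
    return {key: mean(values) for key, values in sorted(groups.items())}


# Made-up per-question records from one run.
records = [
    {"doc_tokens": 500, "question_type": "factual", "context_recall": 1.0},
    {"doc_tokens": 500, "question_type": "factual", "context_recall": 0.5},
    {"doc_tokens": 5000, "question_type": "factual", "context_recall": 0.25},
    {"doc_tokens": 5000, "question_type": "multi_hop", "context_recall": 0.25},
]

by_slice = slice_means(records, "context_recall")
for key, value in by_slice.items():
    print(key, value)
```

In this toy data the long-document slices score lowest on context recall, the pattern that would suggest looking at chunking or retrieval depth before touching the prompt.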
Stack
RAGAS · LangChain · ChromaDB · AWS Bedrock · Python · YAML