2026 · Eval methodology · Experiment infrastructure

RAG Evaluation Framework: Before/After Numbers for Every Change

A RAG evaluation framework with frozen synthetic test sets and three-metric decomposition, turning pipeline tuning from guesswork into measured experiments.

System architecture

Build spec

Test set: 100–200 synthetic QA pairs · frozen + committed
Metrics: Faithfulness · answer relevancy · context recall
Pipeline: LangChain + ChromaDB · temperature 0
Experiments: YAML configs (chunking, top-k, prompts)
Analysis: Before/after tables · per-question diffs

Problem

Most RAG builders test on a handful of questions by hand and ship. Then every change to chunking, retrieval depth, or prompts is a guess, because nothing is measured.

Approach

The framework generates a synthetic test set (100 to 200 QA pairs) from the corpus with RAGAS, freezes and commits it, then scores every pipeline configuration on three orthogonal metrics: faithfulness, answer relevancy, and context recall. Each experiment is a YAML config (chunk size, overlap, top-k, prompt template), and a compare tool produces before/after tables with per-question diffs sliced by question type.
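As a rough sketch of the before/after table, assuming each run is stored as a list of per-question score dicts keyed by metric name. The storage layout, the function name, and the sample numbers are illustrative assumptions, not taken from the project:

```python
from statistics import mean

METRICS = ("faithfulness", "answer_relevancy", "context_recall")


def before_after_table(baseline, candidate):
    """Return rows of (metric, before_mean, after_mean, delta) for two runs.

    Each run is a list of dicts, one per question, keyed by metric name.
    This layout is an assumption made for illustration.
    """
    rows = []
    for metric in METRICS:
        before = mean(q[metric] for q in baseline)
        after = mean(q[metric] for q in candidate)
        rows.append(
            (metric, round(before, 3), round(after, 3), round(after - before, 3))
        )
    return rows


# Made-up scores for two questions, only to exercise the function.
baseline = [
    {"faithfulness": 0.8, "answer_relevancy": 0.9, "context_recall": 0.5},
    {"faithfulness": 0.6, "answer_relevancy": 0.7, "context_recall": 0.5},
]
candidate = [
    {"faithfulness": 0.8, "answer_relevancy": 0.9, "context_recall": 0.75},
    {"faithfulness": 0.6, "answer_relevancy": 0.7, "context_recall": 0.75},
]

for metric, before, after, delta in before_after_table(baseline, candidate):
    print(f"{metric:<18}{before:>8.3f}{after:>8.3f}{delta:>+8.3f}")
```

In this toy data only context recall moves, which is the pattern a retrieval-side change (for example a larger top-k) would be expected to produce.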

Impact

Failures localize instead of blurring: low context recall on long documents points to a retrieval problem, while low faithfulness means generation is hallucinating past its context. Per-question scores are kept for every run, so finding which questions broke is one command.
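A minimal sketch of that "which questions broke" lookup, assuming each run maps a question ID to its metric scores. The function name, the data layout, and the 0.05 tolerance are hypothetical choices for illustration:

```python
def find_regressions(baseline, candidate, metric, tolerance=0.05):
    """List questions whose score on `metric` dropped by more than `tolerance`.

    `baseline` and `candidate` map question ID -> {metric name: score}.
    The layout and the default tolerance are illustrative assumptions.
    """
    regressions = []
    for qid, before in baseline.items():
        after = candidate.get(qid)
        if after is None:
            continue  # question missing from the candidate run; skip it
        if before[metric] - after[metric] > tolerance:
            regressions.append((qid, before[metric], after[metric]))
    # Largest drops first so the worst breakages surface at the top.
    return sorted(regressions, key=lambda r: r[1] - r[2], reverse=True)


# Made-up per-question scores for two runs.
baseline_run = {
    "q1": {"faithfulness": 0.9},
    "q2": {"faithfulness": 0.8},
    "q3": {"faithfulness": 0.7},
}
candidate_run = {
    "q1": {"faithfulness": 0.9},
    "q2": {"faithfulness": 0.4},
    "q3": {"faithfulness": 0.6},
}

for qid, before, after in find_regressions(baseline_run, candidate_run, "faithfulness"):
    print(f"{qid}: {before:.2f} -> {after:.2f}")
```

Because the test set is frozen, question IDs line up across runs, which is what makes this kind of direct comparison meaningful.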

Decisions & tradeoffs

Freeze the test set

Regenerating questions per run silently shifts the baseline and makes comparisons meaningless. Generate once, commit, and every experiment answers against the same exam.

Three metrics over one score

A single RAG score can't say whether retrieval or generation failed. Faithfulness, answer relevancy, and context recall decompose the pipeline, so a failing metric points to the stage that needs fixing.
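The decomposition can be read as a simple triage rule. The sketch below is one hypothetical way to encode it; the 0.7 threshold and the order of the checks are assumptions, chosen so that a retrieval miss is reported before a generation problem it may have caused:

```python
def triage(scores, threshold=0.7):
    """Map one question's metric scores to the stage most likely at fault.

    The threshold and check order are illustrative assumptions.
    """
    if scores["context_recall"] < threshold:
        return "retrieval"  # the needed context was never fetched
    if scores["faithfulness"] < threshold:
        return "generation"  # the answer goes beyond the retrieved context
    if scores["answer_relevancy"] < threshold:
        return "relevancy"  # grounded, but does not address the question
    return "ok"


# Made-up scores for a single question.
example = {"context_recall": 0.9, "faithfulness": 0.4, "answer_relevancy": 0.8}
print(triage(example))
```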

Configs as experiments

Chunk size, overlap, top-k, and prompt template all live in YAML. Changing an experiment is a diff, and reproducing one is a filename.
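As a sketch of what one experiment config might look like once parsed, assuming the YAML file has already been loaded into a plain dict (for example by a YAML parser's safe-load call). The field names and values are hypothetical; the source only names the four knobs:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """One experiment's settings; field names are illustrative assumptions."""

    chunk_size: int
    chunk_overlap: int
    top_k: int
    prompt_template: str

    def __post_init__(self):
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")
        if self.top_k <= 0:
            raise ValueError("top_k must be positive")


def config_from_mapping(raw):
    """Build a config from a parsed mapping; unknown keys raise TypeError."""
    return ExperimentConfig(**raw)


# Stand-in for the dict a YAML parser would return for one experiment file.
raw = {
    "chunk_size": 512,
    "chunk_overlap": 64,
    "top_k": 5,
    "prompt_template": "default",
}
config = config_from_mapping(raw)
print(config)
```

Making the dataclass frozen and validating on construction means a typo in a config fails before any scoring run starts, rather than producing a misleading row in the results.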

System notes

RAGAS-generated synthetic test set: built once from the corpus, frozen, committed
Three orthogonal metrics localize failure: retrieval vs generation vs relevancy
Config-driven experiments: one YAML is one reproducible run
Per-question scores preserved, enabling slice analysis by doc length and question type
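Slice analysis over the preserved per-question scores could look roughly like the following, assuming each saved row carries a slice label such as a question type. The row layout, the label values, and the scores are made up for illustration:

```python
from collections import defaultdict
from statistics import mean


def slice_means(rows, slice_key, metric):
    """Return the mean of `metric` for each distinct value of `slice_key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[slice_key]].append(row[metric])
    return {name: mean(values) for name, values in groups.items()}


# Hypothetical per-question rows from one run.
rows = [
    {"question_type": "factual", "context_recall": 1.0},
    {"question_type": "factual", "context_recall": 0.5},
    {"question_type": "multi_hop", "context_recall": 0.25},
]

for name, value in slice_means(rows, "question_type", "context_recall").items():
    print(f"{name:<12}{value:.2f}")
```

The same function works for a document-length slice if each row also carries a length bucket.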

Stack

RAGAS · LangChain · ChromaDB · AWS Bedrock · Python · YAML

