2026 · Eval methodology · Experiment infrastructure

RAG Evaluation Framework: Before/After Numbers for Every Change

A RAG evaluation framework with frozen synthetic test sets and three-metric decomposition, turning pipeline tuning from guesswork into measured experiments.

System architecture

  • Corpus: documents
  • Chunk + embed: config-driven
  • Test set: RAGAS · frozen
  • ChromaDB: top-k retrieve
  • Generate: temperature 0
  • RAGAS metrics: 3-way decomposition

Build spec

Test set
100–200 synthetic QA pairs · frozen + committed
Metrics
Faithfulness · answer relevancy · context recall
Pipeline
LangChain + ChromaDB · temperature 0
Experiments
YAML configs: chunking, top-k, prompts
Analysis
Before/after tables · per-question diffs

Problem

Most RAG builders test on a handful of questions by hand and ship. Then every change to chunking, retrieval depth, or prompts is a guess, because nothing is measured.

Approach

Generates a synthetic test set (100 to 200 QA pairs) from the corpus with RAGAS, freezes and commits it, then scores every pipeline configuration on three orthogonal metrics: faithfulness, answer relevancy, and context recall. Experiments are YAML configs (chunk size, overlap, top-k, prompt template); a compare tool produces before/after tables with per-question diffs sliced by question type.
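
The compare step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual tool: the function name `compare_runs`, the record layout (`type` plus one key per metric), and the scores are all assumptions invented for the example.

```python
from collections import defaultdict
from statistics import mean


def compare_runs(before, after, metric):
    """Return per-question deltas and the mean delta per question type."""
    diffs = []
    for qid, b in before.items():
        a = after.get(qid)
        if a is None:
            continue  # question missing from the second run; skip rather than guess
        diffs.append({
            "id": qid,
            "type": b["type"],
            "before": b[metric],
            "after": a[metric],
            "delta": round(a[metric] - b[metric], 4),
        })

    # Slice by question type so a regression in one slice is not averaged away.
    by_type = defaultdict(list)
    for d in diffs:
        by_type[d["type"]].append(d["delta"])
    summary = {t: round(mean(v), 4) for t, v in by_type.items()}

    diffs.sort(key=lambda d: d["delta"])  # largest regressions first
    return diffs, summary


# Made-up scores for illustration only; not measured results.
before = {
    "q1": {"type": "factual", "context_recall": 0.50},
    "q2": {"type": "factual", "context_recall": 0.75},
    "q3": {"type": "multi_hop", "context_recall": 0.75},
}
after = {
    "q1": {"type": "factual", "context_recall": 0.75},
    "q2": {"type": "factual", "context_recall": 0.75},
    "q3": {"type": "multi_hop", "context_recall": 0.25},
}

diffs, summary = compare_runs(before, after, "context_recall")
for d in diffs:
    print(f'{d["id"]:<4} {d["type"]:<10} {d["before"]:.2f} -> {d["after"]:.2f} ({d["delta"]:+.2f})')
print(summary)
```

Sorting by delta puts the worst regression at the top of the table, which is usually the first row worth reading.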

Impact

Failures localize instead of blurring: low context recall on long documents points to a retrieval problem, while low faithfulness means generation is hallucinating past its context. Per-question scores survive every run, so finding which questions broke is one command.

Decisions & tradeoffs

Freeze the test set

Regenerating questions per run silently shifts the baseline and makes comparisons meaningless. Generate once, commit, and every experiment answers against the same exam.
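
One way to enforce the freeze, sketched under assumptions: fingerprint the committed test set and refuse to run if the fingerprint changes. The helper names `fingerprint` and `assert_frozen` and the record fields are hypothetical, not taken from the project.

```python
import hashlib
import json


def fingerprint(test_set):
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(
        test_set, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_frozen(test_set, expected):
    """Raise if the test set no longer matches the committed fingerprint."""
    actual = fingerprint(test_set)
    if actual != expected:
        raise ValueError(
            f"test set changed: expected {expected[:12]}, got {actual[:12]}"
        )


# Placeholder QA pair; field names are assumptions for the example.
test_set = [
    {"id": "q1", "question": "example question", "ground_truth": "example answer"}
]

frozen = fingerprint(test_set)   # would be committed next to the test set
assert_frozen(test_set, frozen)  # passes while the test set is unchanged
```

Storing the fingerprint with each run's results also makes it possible to check later that two runs answered the same exam.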

Three metrics over one score

A single RAG score can't say whether retrieval or generation failed. Faithfulness, answer relevancy, and context recall decompose the pipeline so the fix is obvious from the failure.
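
The decomposition can be read as a simple triage rule. The sketch below is illustrative only: the 0.7 threshold and the stage labels are assumptions, and a real cutoff would need to be chosen from baseline runs.

```python
# Which pipeline stage each metric points at when it is low.
STAGE_FOR_METRIC = {
    "context_recall": "retrieval",    # needed context was never retrieved
    "faithfulness": "generation",     # answer claims more than the context supports
    "answer_relevancy": "relevancy",  # answer does not address the question
}


def localize_failure(scores, threshold=0.7):
    """Return the stages whose metric falls below the (assumed) threshold."""
    return [
        stage
        for metric, stage in STAGE_FOR_METRIC.items()
        if scores[metric] < threshold
    ]


# Made-up scores for illustration only.
print(localize_failure(
    {"faithfulness": 0.9, "answer_relevancy": 0.9, "context_recall": 0.4}
))
```

A single blended score for the same question would only say that something went wrong, not which stage to look at.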

Configs as experiments

Chunk size, overlap, top-k, and prompt template all live in YAML. Changing an experiment is a diff, and reproducing one is a filename.
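
A rough sketch of how such a config might look once loaded into Python. The class `ExperimentConfig`, its field names, and the values are assumptions for illustration; the dict passed to it stands in for what a YAML loader (for example PyYAML's `yaml.safe_load`) would return.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    name: str
    chunk_size: int
    chunk_overlap: int
    top_k: int
    prompt_template: str

    def __post_init__(self):
        # Fail early on configs that cannot produce a valid run.
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")


def changed_fields(a, b):
    """Return {field: (a_value, b_value)} for settings that differ, ignoring the name."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if k != "name" and da[k] != db[k]}


# Illustrative values only; these dicts mimic parsed YAML files.
baseline = ExperimentConfig(**{
    "name": "baseline", "chunk_size": 512, "chunk_overlap": 64,
    "top_k": 4, "prompt_template": "default",
})
variant = ExperimentConfig(**{
    "name": "small-chunks", "chunk_size": 256, "chunk_overlap": 64,
    "top_k": 4, "prompt_template": "default",
})

print(changed_fields(baseline, variant))
```

Keeping the config immutable means a run cannot drift from the file that describes it, and the field-level diff states which variable an experiment changed.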

System notes

  • RAGAS-generated synthetic test set: built once from the corpus, frozen, committed
  • Three orthogonal metrics localize failure: retrieval vs generation vs relevancy
  • Config-driven experiments: one YAML is one reproducible run
  • Per-question scores preserved, enabling slice analysis by doc length and question type

Stack

RAGAS · LangChain · ChromaDB · AWS Bedrock · Python · YAML

View source on GitHub