LLM-as-Judge System Calibrated Against Blind Human Labels
A calibrated LLM-as-judge evaluation system: versioned rubrics, persisted chain-of-thought for every score, and a blind human-labeling protocol that measures judge reliability instead of assuming it.
System architecture
Build spec
- Dataset: AlpacaEval · 200 stratified outputs
- Rubrics: 4 G-Eval rubrics · anchored 1–5 scales
- Judge: Claude Sonnet · temperature 0 · CoT persisted
- Calibration: Blind human labels · Spearman ρ per rubric
- Scale: 800 judged scores · cached by rubric version
Problem
An unmeasured judge is just one model's opinion. Teams adopt LLM-as-judge for scale, then never check whether the judge agrees with humans, where it's biased, or why it disagrees.
Approach
The pipeline scores 200 Claude Haiku outputs (stratified from AlpacaEval's 805 instructions) with a Sonnet judge on four versioned G-Eval rubrics (coherence, factuality, tone, and safety), each with an anchored 1-to-5 scale and pinned evaluation steps. All 800 scores persist with full chain-of-thought. A blind annotation sheet, with judge scores hidden by construction, collects human labels for Spearman agreement analysis and a disagreement gallery.
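The stratified draw of 200 from 805 can be sketched as proportional allocation with largest-remainder rounding. This is a minimal illustration, not the project's sampler: the source does not say which field the strata come from, so the `key` function and the `src` field in the usage example are assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, n, seed=0):
    """Pick n items so each stratum's share matches its share of the pool."""
    if n > len(items):
        raise ValueError("n cannot exceed the number of items")
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    # Proportional allocation: floor each quota, then hand the leftover
    # slots to the strata with the largest fractional remainders so the
    # counts sum to exactly n.
    exact = {g: n * len(members) / len(items) for g, members in groups.items()}
    counts = {g: int(quota) for g, quota in exact.items()}
    leftover = n - sum(counts.values())
    by_remainder = sorted(exact, key=lambda g: exact[g] - counts[g], reverse=True)
    for g in by_remainder[:leftover]:
        counts[g] += 1

    # A seeded generator keeps the draw reproducible across runs.
    rng = random.Random(seed)
    sample = []
    for g in sorted(groups):
        sample.extend(rng.sample(groups[g], counts[g]))
    return sample


# Hypothetical pool of 805 instructions split across three made-up strata.
pool = [{"id": i, "src": s}
        for i, s in enumerate(["a"] * 400 + ["b"] * 300 + ["c"] * 105)]
picked = stratified_sample(pool, key=lambda r: r["src"], n=200, seed=7)
```

Fixing the seed matters for the same reason the rubrics are versioned: the 200 items are part of the experiment's definition, so they should be recoverable later.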
Impact
800 judged scores, each with auditable reasoning. Factuality emerged as the weakest rubric (mean 4.51, versus 4.70 for coherence), and the blind protocol measures per-rubric agreement with humans, so how far the judge can be trusted is quantified before it replaces human review.
Decisions & tradeoffs
Pin the evaluation steps
G-Eval normally invents evaluation steps per run, which makes scores drift. Freezing rubric versions with pinned steps makes every score reproducible and every rubric change auditable.
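One way to make "frozen" enforceable is to derive the version from the rubric's content, so any edit to the steps or anchors produces a new version and a new cache key. The sketch below is an assumption about how this could be done; the `Rubric` class, its field names, and `cache_key` are hypothetical and not taken from the project or from DeepEval.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Rubric:
    """Hypothetical rubric record; frozen so it cannot be mutated in place."""
    name: str
    criteria: str
    steps: tuple    # pinned evaluation steps, in order
    anchors: tuple  # (score, description) pairs for the 1-5 scale

    @property
    def version(self) -> str:
        # Hash a canonical JSON form so identical content always yields
        # the same version string, and any change yields a different one.
        payload = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def cache_key(rubric: Rubric, output_id: str) -> str:
    """Key a cached score by rubric name, rubric version, and judged output."""
    return f"{rubric.name}:{rubric.version}:{output_id}"


# Illustrative content only; the real rubric text is not in the source.
factuality = Rubric(
    name="factuality",
    criteria="Claims in the response are accurate and verifiable.",
    steps=("List each factual claim.", "Check each claim.", "Assign a score."),
    anchors=((1, "Mostly false"), (3, "Mixed"), (5, "Fully accurate")),
)
key = cache_key(factuality, "item-0001")
```

Because the version is part of the key, scores produced under an older rubric are never silently reused after the rubric changes.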
Blind labels or no labels
Humans who can see the judge's score anchor to it, inflating agreement. The annotation sheet is built without any judge output, so the measured agreement reflects independent human judgment rather than anchoring.
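"Blind by construction" can be sketched as an allow-list: the sheet builder copies only named fields, so judge output cannot leak even if new judge fields are added to the records later. The field names and the CSV layout here are assumptions for illustration.

```python
import csv
import io
import random

# Only these columns ever reach annotators (hypothetical field names).
BLIND_FIELDS = ("item_id", "rubric", "instruction", "output", "human_score")


def build_blind_sheet(records, seed=0):
    """Return CSV text for annotators, with no judge score or reasoning."""
    rows = [
        {
            "item_id": r["item_id"],
            "rubric": r["rubric"],
            "instruction": r["instruction"],
            "output": r["output"],
            "human_score": "",  # left empty for the annotator to fill in
        }
        for r in records
    ]
    # Shuffle so row order carries no hint about judge scores.
    random.Random(seed).shuffle(rows)

    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=BLIND_FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


records = [
    {"item_id": "item-0001", "rubric": "tone", "instruction": "Say hi.",
     "output": "Hello!", "judge_score": 5, "judge_reasoning": "Friendly."},
]
sheet = build_blind_sheet(records)
```

Copying an allow-list rather than deleting known judge columns is the safer direction: a forgotten field is omitted instead of exposed.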
Persist the chain-of-thought
A bare 3/5 is undebuggable. Storing the judge's reasoning for all 800 scores turns every disagreement into a diagnosable case instead of a mystery.
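A minimal sketch of what a persisted score might look like, assuming one JSON object per line. The `JudgeRecord` fields and the JSONL storage format are illustrative assumptions; the source only states that reasoning is stored with every score.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class JudgeRecord:
    """Hypothetical record: a score is never stored without its reasoning."""
    item_id: str
    rubric: str
    rubric_version: str
    score: int
    reasoning: str

    def __post_init__(self):
        if not 1 <= self.score <= 5:
            raise ValueError("score must be on the 1-5 scale")
        if not self.reasoning.strip():
            raise ValueError("reasoning must not be empty")


def append_record(path, record):
    """Append one record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")


def load_records(path):
    """Read all records back, re-validating each one on load."""
    with open(path, encoding="utf-8") as f:
        return [JudgeRecord(**json.loads(line)) for line in f if line.strip()]
```

Rejecting an empty `reasoning` at construction time means a bare score cannot enter the store in the first place.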
System notes
- 800 judged scores: 200 outputs by 4 rubrics, every score with stored chain-of-thought
- Versioned, frozen rubrics with pinned evaluation steps for reproducible scoring
- Blind calibration: annotation sheets contain no judge output, preventing agreement contamination
- Disagreement gallery pairs human labels with judge reasoning for bias diagnosis
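The calibration step can be sketched in pure Python: rank both score lists with average ranks for ties, take the Pearson correlation of the ranks, and flag large human-judge gaps for the gallery. The stack lists SciPy, whose `scipy.stats.spearmanr` computes the same statistic; the standard-library version below is a self-contained stand-in, and the `human`, `judge`, and `rubric` keys and the `min_gap` threshold are assumptions.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 2:
        return float("nan")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side never varies
    return cov / sqrt(vx * vy)


def rho_per_rubric(pairs):
    """Group paired scores by rubric and compute rho for each group."""
    grouped = {}
    for p in pairs:
        grouped.setdefault(p["rubric"], []).append(p)
    return {
        rubric: spearman_rho([p["human"] for p in ps], [p["judge"] for p in ps])
        for rubric, ps in grouped.items()
    }


def disagreement_gallery(pairs, min_gap=2):
    """Return pairs whose human and judge scores differ by at least min_gap."""
    flagged = [p for p in pairs if abs(p["human"] - p["judge"]) >= min_gap]
    return sorted(flagged, key=lambda p: abs(p["human"] - p["judge"]), reverse=True)
```

Tie handling is not a detail here: on a 1-to-5 scale with means near the top, most scores are tied, and a rubric where the judge gives every item the same score has no defined correlation at all, which the `nan` return makes visible.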
Stack
DeepEval · G-Eval · AlpacaEval · AWS Bedrock · SciPy · Python