2026 · Evaluation methodology · Human calibration

LLM-as-Judge System: Calibrated Against Blind Human Labels

A calibrated LLM-as-judge evaluation system: versioned rubrics, persisted chain-of-thought for every score, and a blind human-labeling protocol that measures judge reliability instead of assuming it.

System architecture

AlpacaEval (200 stratified) → Haiku outputs (cached) → Sonnet judge (4 rubrics · CoT kept) → 800 scores (+ reasoning) → Blind labels (human · no judge) → Agreement (Spearman ρ)

Build spec

Dataset: AlpacaEval · 200 stratified outputs
Rubrics: 4 G-Eval rubrics · anchored 1–5 scales
Judge: Claude Sonnet · temperature 0 · CoT persisted
Calibration: Blind human labels · Spearman ρ per rubric
Scale: 800 judged scores · cached by rubric version

Problem

An unmeasured judge is just one model's opinion. Teams adopt LLM-as-judge for scale, then never check whether the judge agrees with humans, where it's biased, or why it disagrees.

Approach

The system scores 200 Claude Haiku outputs (stratified from AlpacaEval's 805 instructions) with a Sonnet judge on four versioned G-Eval rubrics (coherence, factuality, tone, and safety), each with an anchored 1-to-5 scale and pinned evaluation steps. All 800 scores persist with full chain-of-thought. A blind annotation sheet (judge scores hidden by construction) collects human labels for Spearman agreement analysis and a disagreement gallery.
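A minimal sketch of how a 200-of-805 stratified draw could work. The source does not state the stratification key or the allocation rule, so the `category` field, the proportional (largest-remainder) allocation, and the fixed seed are all assumptions, and the data below is synthetic.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(items, key, n, seed=0):
    """Draw n items, giving each stratum a share proportional to its size."""
    if n > len(items):
        raise ValueError("cannot sample more items than exist")
    rng = random.Random(seed)  # fixed seed keeps the draw repeatable
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    # Largest-remainder allocation: floor each exact share, then hand the
    # leftover slots to the strata with the biggest fractional parts.
    exact = {k: n * len(v) / len(items) for k, v in strata.items()}
    quota = {k: int(share) for k, share in exact.items()}
    leftover = n - sum(quota.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - quota[k], reverse=True)
    for k in by_remainder[:leftover]:
        quota[k] += 1

    sample = []
    for k in sorted(strata):
        sample.extend(rng.sample(strata[k], quota[k]))
    return sample


# Synthetic stand-in for the 805 instructions; category names and sizes are made up.
sizes = {"cat_a": 300, "cat_b": 250, "cat_c": 150, "cat_d": 80, "cat_e": 25}
instructions = [
    {"id": f"{cat}-{i}", "category": cat}
    for cat, size in sizes.items()
    for i in range(size)
]
sample = stratified_sample(instructions, key=lambda r: r["category"], n=200)
counts = Counter(r["category"] for r in sample)
```

Proportional allocation keeps small categories represented without letting them dominate; a different rule (for example, equal counts per category) would be a drop-in change to the quota step.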

Impact

800 judged scores with auditable reasoning. Factuality emerged as the lowest-scoring rubric (mean 4.51 vs 4.70 for coherence), and the blind protocol measures per-rubric agreement with humans, so how far the judge can be trusted is established before it replaces human review.

Decisions & tradeoffs

Pin the evaluation steps

G-Eval normally generates its evaluation steps from the criteria on each run, so the same output can end up scored against different steps and scores drift. Freezing rubric versions with pinned steps makes every score reproducible and every rubric change auditable.
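One way to make a pinned rubric concrete is an immutable record whose content hash feeds the score cache key. This is a sketch under assumptions: the `Rubric` fields, the `fingerprint` and `cache_key` helpers, and the example rubric wording are hypothetical, not the project's actual schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Rubric:
    name: str
    version: str
    criteria: str
    evaluation_steps: tuple  # pinned text, never regenerated at run time
    anchors: tuple           # one description per score, 1 through 5

    def fingerprint(self) -> str:
        """Short content hash; any wording change yields a different value."""
        payload = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def cache_key(rubric: Rubric, output_id: str) -> str:
    """Scores are cached per rubric version, so an edited rubric misses the cache."""
    return f"{rubric.name}@{rubric.version}:{rubric.fingerprint()}:{output_id}"


# Illustrative wording only; the real rubric text is not given in the source.
factuality_v1 = Rubric(
    name="factuality",
    version="1.0.0",
    criteria="Claims in the output are accurate and supported.",
    evaluation_steps=(
        "List each factual claim in the output.",
        "Check each claim against the instruction and general knowledge.",
        "Assign a score from the anchored scale.",
    ),
    anchors=("1: mostly false", "2: several errors", "3: one clear error",
             "4: minor imprecision", "5: fully accurate"),
)

# Changing a step produces a new fingerprint, so old cached scores are not reused.
factuality_v2 = replace(
    factuality_v1,
    version="1.1.0",
    evaluation_steps=factuality_v1.evaluation_steps + ("Flag unverifiable claims.",),
)
```

Including the fingerprint in the key guards against a rubric being edited without its version string being bumped.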

Blind labels or no labels

Humans who can see the judge's score anchor to it, inflating agreement. The annotation sheet is built without judge output, so measured agreement reflects independent human judgment rather than anchoring.
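"Blind by construction" can be approximated by building the sheet from a whitelist of columns rather than by deleting judge columns afterwards, so a newly added judge field cannot leak in. The sketch below assumes a CSV sheet and hypothetical field names; shuffling row order is an extra precaution not described in the source.

```python
import csv
import io
import random

# Only these columns ever reach annotators; judge fields are not on the list.
SHEET_COLUMNS = ["item_id", "rubric", "instruction", "output", "human_score"]


def build_blind_sheet(records, seed=0):
    """Return CSV text with whitelisted columns and an empty human_score cell."""
    rows = [
        {
            "item_id": r["item_id"],
            "rubric": r["rubric"],
            "instruction": r["instruction"],
            "output": r["output"],
            "human_score": "",  # filled in by the annotator
        }
        for r in records
    ]
    random.Random(seed).shuffle(rows)  # row order carries no judge signal
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=SHEET_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Toy records; the judge fields are present upstream but never copied.
records = [
    {"item_id": "a1", "rubric": "tone", "instruction": "Greet a customer.",
     "output": "Hello, how can I help?", "judge_score": 5,
     "judge_reasoning": "TOY-REASONING-1"},
    {"item_id": "a2", "rubric": "tone", "instruction": "Decline a refund.",
     "output": "No.", "judge_score": 2,
     "judge_reasoning": "TOY-REASONING-2"},
]
sheet = build_blind_sheet(records)
```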

Persist the chain-of-thought

A bare 3/5 is undebuggable. Storing the judge's reasoning for all 800 scores turns every disagreement into a diagnosable case instead of a mystery.
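A sketch of what a persisted score might look like, assuming one JSON object per line and a hypothetical `JudgedScore` record; the source does not describe the actual storage format. The record refuses to exist without reasoning, which is the point of the decision above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JudgedScore:
    item_id: str
    rubric: str
    rubric_version: str
    score: int
    reasoning: str  # the judge's chain-of-thought, stored verbatim

    def __post_init__(self):
        if not 1 <= self.score <= 5:
            raise ValueError(f"score {self.score} is outside the 1-5 scale")
        if not self.reasoning.strip():
            raise ValueError("refusing to store a score without reasoning")


def to_jsonl(scores):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(asdict(s), ensure_ascii=False) for s in scores)


def from_jsonl(text):
    """Rebuild records, re-running the same validation on load."""
    return [JudgedScore(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Toy record with placeholder reasoning text.
stored = [
    JudgedScore("a1", "factuality", "1.0.0", 3,
                "(toy) One claim is unsupported; the rest checks out."),
]
restored = from_jsonl(to_jsonl(stored))
```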

System notes

  • 800 judged scores: 200 outputs by 4 rubrics, every score with stored chain-of-thought
  • Versioned, frozen rubrics with pinned evaluation steps for reproducible scoring
  • Blind calibration: annotation sheets contain no judge output, preventing agreement contamination
  • Disagreement gallery pairs human labels with judge reasoning for bias diagnosis
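The agreement and gallery steps can be sketched as follows. The stack lists SciPy, where `scipy.stats.spearmanr` would be the natural call; this version computes the same statistic (Pearson correlation on average ranks) in plain Python so it runs without dependencies. The row fields, the `min_gap` threshold of 2 points, and the toy data are assumptions.

```python
from collections import defaultdict
from math import isnan, sqrt


def average_ranks(values):
    """1-based ranks where tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho; NaN when either side has no variation."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def agreement_by_rubric(rows):
    """One rho per rubric, comparing judge scores with human labels."""
    grouped = defaultdict(list)
    for r in rows:
        grouped[r["rubric"]].append(r)
    return {
        name: spearman_rho([r["judge"] for r in g], [r["human"] for r in g])
        for name, g in grouped.items()
    }


def disagreement_gallery(rows, min_gap=2):
    """Rows where judge and human differ by at least min_gap, largest gap first."""
    picked = [r for r in rows if abs(r["judge"] - r["human"]) >= min_gap]
    return sorted(picked, key=lambda r: abs(r["judge"] - r["human"]), reverse=True)


# Toy rows only; these are not the project's results.
rows = [
    {"item_id": "a1", "rubric": "factuality", "judge": 5, "human": 2,
     "reasoning": "(toy) claims look plausible"},
    {"item_id": "a2", "rubric": "factuality", "judge": 4, "human": 4,
     "reasoning": "(toy) minor hedging"},
    {"item_id": "a3", "rubric": "factuality", "judge": 5, "human": 5,
     "reasoning": "(toy) fully supported"},
    {"item_id": "a4", "rubric": "factuality", "judge": 3, "human": 3,
     "reasoning": "(toy) one unsupported claim"},
]
rho = agreement_by_rubric(rows)
gallery = disagreement_gallery(rows)
```

With judge means above 4.5, most scores are tied near the ceiling. Average ranks handle ties, but if either the judge or the human column is constant for a rubric, ρ is undefined and is returned as NaN rather than a misleading number.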

Stack

DeepEval · G-Eval · AlpacaEval · AWS Bedrock · SciPy · Python
