2026 · Benchmark engineering · Model comparison

Hallucination Detection Pipeline · Per-Topic Failure Rates on TruthfulQA

A benchmark pipeline measuring LLM hallucination rates by topic category on TruthfulQA, comparing models and evaluation modes with an LLM judge stronger than every candidate.

System architecture

Build spec

Benchmark: TruthfulQA · 817 questions · 38 categories
Candidates: Claude Haiku + Sonnet (Bedrock)
Judging: MC exact-match + DeepEval judge
Output: Per-bucket failure rates + model deltas
Cost: About $19 full run · resume-safe caching

Problem

Hallucination isn't random: models fail far more on misconception-prone domains like health and law. Aggregate scores hide exactly the failures that matter, so deployed systems need per-topic rates, not one number.

Approach

The pipeline runs 817 adversarial TruthfulQA questions across 38 categories through Claude Haiku and Sonnet in two modes: multiple-choice with exact-match scoring, and free-form generation judged by a strictly stronger model via DeepEval. Results are bucketed into seven topic groups, each reported with a per-bucket hallucination rate and the delta between models.
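A minimal sketch of the multiple-choice exact-match step, assuming each question has a single correct option letter and the model is prompted to reply with a letter only. The function names and the strict normalization rule are illustrative assumptions, not the pipeline's actual code.

```python
from typing import List


def normalize_choice(response: str) -> str:
    """Reduce a reply such as ' (b). ' to a bare uppercase letter."""
    return response.strip().strip("().:").strip().upper()


def is_mc_failure(response: str, correct_letter: str) -> bool:
    """Exact match: anything other than the correct letter counts as a failure."""
    return normalize_choice(response) != correct_letter.upper()


def failure_rate(failures: List[bool]) -> float:
    """Fraction of responses flagged as failures."""
    if not failures:
        raise ValueError("no results to score")
    return sum(failures) / len(failures)
```

Under this strict rule a verbose reply like "The answer is B" is scored as a failure, so the prompt has to constrain the output format. A looser parser would trade that strictness for the risk of crediting ambiguous answers.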

Impact

Pilot runs surfaced a mode gap invisible to standard leaderboards: Haiku hallucinated on 10% of multiple-choice questions but on 20% of free-form answers. Per-response caching means interrupted runs resume with zero re-spend; the full 817-question, four-pass run (two models × two modes) costs about $19.
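The resume behavior can be sketched as a small on-disk cache: one file per response, consulted before any paid call. This assumes a JSON-file layout and a key that includes the evaluation mode alongside model and question; both are assumptions, as are the names `cache_key` and `cached_call`.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_key(model: str, mode: str, question: str) -> str:
    """Stable key for one (model, mode, question) response."""
    raw = json.dumps([model, mode, question], ensure_ascii=False)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def cached_call(cache_dir: Path, model: str, mode: str, question: str,
                call: Callable[[], str]) -> str:
    """Return the cached response if present; otherwise call once and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(model, mode, question)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["response"]

    response = call()  # the only place a paid request happens
    record = {"model": model, "mode": mode, "question": question,
              "response": response}
    # Write to a temp file first so an interrupted run never leaves
    # a half-written cache entry behind.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
    tmp.replace(path)
    return response
```

Because each response is written as soon as it arrives, a run killed partway through pays again only for the requests that never completed.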

Decisions & tradeoffs

Category buckets over aggregate scores

One hallucination number tells you nothing actionable. Bucketing 38 categories into seven domains turns the result into a deployment decision: which topics this model can't be trusted on.
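A sketch of the bucketing step, assuming results arrive as (category, failed) pairs. The bucket names and the partial mapping below are hypothetical; the source states only that 38 categories collapse into seven domains, not which ones.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical, partial mapping from TruthfulQA category to domain bucket.
BUCKETS = {
    "Health": "health",
    "Nutrition": "health",
    "Law": "law_and_politics",
    "Politics": "law_and_politics",
    "Finance": "money",
    "Economics": "money",
}
DEFAULT_BUCKET = "other"  # assumption: unmapped categories fall through here


def bucket_failure_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate per-question outcomes into a failure rate per bucket."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for category, failed in results:
        bucket = BUCKETS.get(category, DEFAULT_BUCKET)
        totals[bucket] += 1
        failures[bucket] += int(failed)
    return {bucket: failures[bucket] / totals[bucket] for bucket in totals}
```

Reporting the question count next to each rate is worth considering, since some TruthfulQA categories are small and a bucket built from few questions yields a noisy rate.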

Two evaluation modes, deliberately

Multiple-choice and free-form measure different things, and the gap between them is itself a finding. A model that picks true answers but generates false ones fails differently in production.
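The gap itself is simple arithmetic: the free-form rate minus the multiple-choice rate, in percentage points, per bucket. A minimal sketch, with `mode_gap` as an illustrative name and the rate dictionaries assumed to share bucket keys:

```python
from typing import Dict


def mode_gap(mc_rates: Dict[str, float],
             free_rates: Dict[str, float]) -> Dict[str, float]:
    """Free-form minus multiple-choice failure rate, in percentage points."""
    shared = sorted(set(mc_rates) & set(free_rates))
    return {b: round((free_rates[b] - mc_rates[b]) * 100, 1) for b in shared}
```

With the pilot figures for Haiku (10% multiple-choice, 20% free-form), `mode_gap({"overall": 0.10}, {"overall": 0.20})` gives a gap of 10 percentage points, meaning the free-form rate is double the multiple-choice rate.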

Judge stronger than judged

Using a stronger model as judge reduces the risk that judge errors masquerade as candidate hallucinations. Judge reasoning is persisted for every verdict, so any disputed call can be audited afterward.
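Persisting judge reasoning could look like the sketch below: one JSON line per verdict, appended as each judgment completes. The `Verdict` fields and the JSONL layout are assumptions for illustration and say nothing about how DeepEval itself stores results.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class Verdict:
    model: str          # candidate model that produced the answer
    question_id: int    # index of the question in the benchmark
    hallucinated: bool  # the judge's decision
    reasoning: str      # the judge's explanation, kept for later audit


def append_verdict(path: Path, verdict: Verdict) -> None:
    """Append one verdict as a JSON line so earlier records are never rewritten."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(verdict), ensure_ascii=False) + "\n")


def load_verdicts(path: Path) -> List[Verdict]:
    """Read every stored verdict back for review."""
    with path.open(encoding="utf-8") as fh:
        return [Verdict(**json.loads(line)) for line in fh if line.strip()]
```

An append-only log keeps the audit trail intact across resumed runs, and a reviewer can filter it for verdicts where the stated reasoning looks weak.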

System notes

817 adversarial questions, 38 categories, 2 models, 2 evaluation modes
Judge strictly stronger than candidates to reduce judge-capability bias
Per-response caching keyed by model and question: interrupted runs resume at no extra cost
Throttle-resilient Bedrock client: exponential backoff through sustained 429 storms
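The backoff behavior in the last note can be sketched without any AWS dependency. `ThrottledError` is a hypothetical stand-in for whatever throttling exception the real client raises on a 429, and the full-jitter delay, attempt limit, and cap are assumptions rather than the pipeline's actual settings.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for a 429 / throttling error from the API."""


def call_with_backoff(call: Callable[[], T],
                      max_attempts: int = 8,
                      base_delay: float = 1.0,
                      max_delay: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry a throttled call with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The ceiling doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random time below the ceiling so parallel
            # workers do not all retry at the same instant.
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting. The cap matters during a sustained throttling storm, where uncapped doubling would otherwise stall a worker for minutes.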

Stack

TruthfulQA · DeepEval · AWS Bedrock · Claude · HuggingFace · Python
