Hallucination Detection Pipeline
Per-Topic Failure Rates on TruthfulQA
A benchmark pipeline measuring LLM hallucination rates by topic category on TruthfulQA, comparing models and evaluation modes with an LLM judge stronger than every candidate.
System architecture
Build spec
- Benchmark: TruthfulQA · 817 questions · 38 categories
- Candidates: Claude Haiku + Sonnet (Bedrock)
- Judging: MC exact-match + DeepEval judge
- Output: Per-bucket failure rates + model deltas
- Cost: About $19 full run · resume-safe caching
Problem
Hallucination isn't random: models fail far more on misconception-prone domains like health and law. Aggregate scores hide exactly the failures that matter, so deployed systems need per-topic rates, not one number.
Approach
The pipeline runs 817 adversarial TruthfulQA questions across 38 categories through Claude Haiku and Sonnet in two modes: multiple-choice with exact-match scoring, and free-form generation judged by a strictly stronger model via DeepEval. Results are grouped into seven topic buckets, each reported with a hallucination rate and the delta between models.
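The bucketing step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the bucket names, the partial CATEGORY_TO_BUCKET mapping, and the result-record fields ("model", "category", "hallucinated") are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical mapping: the real pipeline folds 38 TruthfulQA categories into
# seven buckets; only a few illustrative entries are shown here.
CATEGORY_TO_BUCKET = {
    "Health": "health",
    "Nutrition": "health",
    "Law": "law",
    "Misconceptions": "misconceptions",
}


def bucket_failure_rates(results):
    """Return {model: {bucket: failure_rate}} from per-question results.

    Each result is a dict with keys "model", "category", and "hallucinated"
    (True when the answer was scored as a hallucination).
    """
    counts = defaultdict(lambda: [0, 0])  # (model, bucket) -> [failures, total]
    for r in results:
        bucket = CATEGORY_TO_BUCKET.get(r["category"], "other")
        tally = counts[(r["model"], bucket)]
        tally[0] += int(r["hallucinated"])
        tally[1] += 1
    rates = defaultdict(dict)
    for (model, bucket), (failures, total) in counts.items():
        rates[model][bucket] = failures / total
    return {m: dict(b) for m, b in rates.items()}


def model_deltas(rates, model_a, model_b):
    """Per-bucket rate difference (model_a minus model_b) on shared buckets."""
    shared = rates[model_a].keys() & rates[model_b].keys()
    return {b: rates[model_a][b] - rates[model_b][b] for b in sorted(shared)}
```

A positive delta for a bucket means model_a hallucinated more often than model_b on that topic group.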
Impact
Pilot runs surfaced a mode gap invisible to standard leaderboards: Haiku's hallucination rate was 10% on multiple-choice but 20% on free-form generation. Per-response caching means interrupted runs resume with zero re-spend; the full 817-question run, four passes in total (two models × two modes), costs about $19.
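One way the resume-safe caching could work is a file-per-response cache, sketched below. The on-disk layout, the function names, and the generate callable are assumptions for illustration; the mode is included in the key here as an extra assumption, since the same question is asked in both evaluation modes.

```python
import hashlib
import json
from pathlib import Path


def cache_path(cache_dir, model, mode, question):
    """Build a stable file path from the (model, mode, question) key."""
    key = json.dumps([model, mode, question], ensure_ascii=False)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.json"


def cached_call(cache_dir, model, mode, question, generate):
    """Return the cached response, calling generate() only on a miss."""
    path = cache_path(cache_dir, model, mode, question)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["response"]
    response = generate(model, question)  # the only step that costs money
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"response": response}), encoding="utf-8")
    tmp.replace(path)  # rename into place so a crash never leaves a partial file
    return response
```

Because each response is written as soon as it arrives, a run killed partway through only pays for the questions it has not yet answered when it restarts.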
Decisions & tradeoffs
Category buckets over aggregate scores
One hallucination number tells you nothing actionable. Bucketing 38 categories into seven domains turns the result into a deployment decision: which topics this model can't be trusted on.
Two evaluation modes, deliberately
Multiple-choice and free-form measure different things, and the gap between them is itself a finding. A model that picks true answers but generates false ones fails differently in production.
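For the multiple-choice mode, exact-match scoring might look like the sketch below. It assumes the prompt asks the model to reply with a single option letter, which the source does not specify; extract_choice and mc_hallucinated are hypothetical helper names.

```python
import re


def extract_choice(response):
    """Pull a single answer letter out of a model response, or None."""
    match = re.fullmatch(r"\s*\(?([A-Za-z])\)?[.:]?\s*", response)
    return match.group(1).upper() if match else None


def mc_hallucinated(response, correct_letter):
    """Exact-match scoring: anything but the correct letter counts as a failure."""
    return extract_choice(response) != correct_letter.upper()
```

Under this strict rule, a response that does not reduce to one letter is scored as a failure rather than being guessed at, which keeps the multiple-choice pass free of judge calls.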
Judge stronger than judged
Using a stronger model as the judge reduces the risk that judge errors masquerade as candidate hallucinations. Judge reasoning is persisted for every verdict.
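Persisting the judge's reasoning alongside each verdict could be done with an append-only JSONL log, as in this sketch. The Verdict fields, the log format, and the judge callable (assumed to return a (hallucinated, reasoning) pair) are illustrative assumptions; the actual DeepEval call is deliberately left behind that callable rather than guessed at.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Verdict:
    model: str
    question: str
    answer: str
    hallucinated: bool
    reasoning: str  # the judge's explanation, kept with the verdict


def judge_and_persist(model, question, answer, judge, log_path):
    """Run the judge once and append its verdict and reasoning to a JSONL file."""
    hallucinated, reasoning = judge(question, answer)
    verdict = Verdict(model, question, answer, bool(hallucinated), reasoning)
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(verdict), ensure_ascii=False) + "\n")
    return verdict
```

One line per verdict makes it straightforward to re-read disputed judgments later without re-running the judge.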
System notes
- 817 adversarial questions, 38 categories, 2 models, 2 evaluation modes
- Judge strictly stronger than candidates to reduce judge-capability bias
- Per-response caching keyed by model and question: interrupted runs resume at no extra cost
- Throttle-resilient Bedrock client: exponential backoff through sustained 429 storms
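The throttle handling in the last note can be illustrated with a generic retry wrapper. This is a sketch under stated assumptions: ThrottledError is a stand-in for whatever throttling error the Bedrock client raises, and the attempt count, delay bounds, and use of full jitter are example choices, not values taken from the pipeline.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for the client's throttling (HTTP 429) error."""


def call_with_backoff(fn, max_attempts=8, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(), retrying throttled calls with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))  # full jitter spreads out retries
```

The sleep parameter is injectable so the retry schedule can be exercised in tests without real waiting.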
Stack
TruthfulQA · DeepEval · AWS Bedrock · Claude · HuggingFace · Python