Prompt Regression Testing: CI Quality Gates for LLM Behavior
Unit testing for LLM behavior: a golden suite runs on every PR and blocks the merge when a prompt or model change pushes output quality below thresholds calibrated from a measured noise floor.
System architecture
Build spec
- Golden suite: 55 cases · 5 domains · trap-seeded
- Gate: 90% of cases at 0.60+ · hard floor 0.40
- Baseline: 3-run noise floor, mean 0.757, spread 0.085
- CI: GitHub Actions · full vs 12-case smoke scoping
- Stack: pytest + DeepEval · Claude Haiku (Bedrock)
Problem
The silent failure mode of LLM products: tweak a prompt, it looks fine on three examples, ship it, and quietly break five other behaviors. Spot-checking doesn't scale and nobody notices until a user does.
Approach
A golden suite of 55 hand-written cases across five domains, each seeded with deliberate traps (buried ledes, jargon, financial figures, tone attribution). Every PR touching prompts or models runs the suite in GitHub Actions through a two-layer gate: deterministic asserts (word caps, required facts, no AI boilerplate) plus a G-Eval judge for faithfulness and coverage. Merge is blocked unless 90% of cases score 0.60 or higher with no case below 0.40.
Impact
Thresholds are calibrated, not invented: a 3-run baseline established the noise floor (mean 0.757, spread 0.085) before any gate was set. CI is cost-scoped: prompt-touching PRs run all 55 cases, everything else runs a 12-case smoke set. Every blocked merge writes a regression log naming what changed and which cases broke.
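The cost scoping described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual CI logic: the path patterns in `PROMPT_PATTERNS` and the function name `select_suite` are assumptions, since the source does not describe the repository layout.

```python
from fnmatch import fnmatch

# Hypothetical path patterns; the real repository layout is not given.
PROMPT_PATTERNS = ("prompts/*", "models/*.yaml")


def select_suite(changed_files):
    """Pick which suite a PR should run.

    Returns "full" (all 55 cases) when any changed file matches a
    prompt or model pattern, otherwise "smoke" (the 12-case subset).
    """
    for path in changed_files:
        if any(fnmatch(path, pattern) for pattern in PROMPT_PATTERNS):
            return "full"
    return "smoke"


if __name__ == "__main__":
    print(select_suite(["prompts/summarize/system.txt", "README.md"]))
    print(select_suite(["README.md", "src/app.py"]))
```

A CI step could feed this function the PR's changed file list and pass the result to pytest as a marker or environment variable. That wiring is likewise an assumption.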
Decisions & tradeoffs
Calibrate thresholds from measured noise
An invented threshold either blocks good PRs or passes bad ones. Three identical baseline runs established the real run-to-run spread first; the gate threshold sits far enough below the baseline that ordinary noise does not trip it, yet high enough that a genuine regression does.
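One way to compute such a baseline is sketched below. The source does not say how "spread" was defined, so this assumes it is the range (max minus min) of the per-run mean scores. The function name `noise_floor` and the sample scores are hypothetical and are not the project's data.

```python
from statistics import mean


def noise_floor(run_scores):
    """Summarize run-to-run noise from repeated identical runs.

    run_scores: one list of per-case scores for each baseline run.
    Assumption: "spread" is the range of the per-run mean scores.
    """
    run_means = [mean(scores) for scores in run_scores]
    return {
        "mean": mean(run_means),
        "spread": max(run_means) - min(run_means),
    }


if __name__ == "__main__":
    # Illustrative scores only: three runs over the same two cases.
    baseline = noise_floor([[0.8, 0.7], [0.7, 0.6], [0.9, 0.8]])
    print(baseline)
```

With a baseline like this in hand, a candidate threshold can be checked against the observed spread before it is allowed to block merges.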
Fraction gate over mean gate
A mean can hide one catastrophic case behind nine good ones. Requiring 90% of cases to individually pass, with a hard floor, makes single-behavior regressions un-hideable.
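The difference between the two gates can be illustrated with a small sketch. The thresholds come from the gate spec above. The function names and the `mean_gate` comparison are illustrative assumptions, not the project's code.

```python
from statistics import mean

PASS_SCORE = 0.60     # per-case pass mark
PASS_FRACTION = 0.90  # share of cases that must reach PASS_SCORE
HARD_FLOOR = 0.40     # no single case may score below this


def fraction_gate(scores):
    """Pass only if enough cases clear PASS_SCORE and none is below HARD_FLOOR."""
    if not scores:
        return False
    passing = sum(1 for s in scores if s >= PASS_SCORE)
    enough_pass = passing / len(scores) >= PASS_FRACTION
    return enough_pass and min(scores) >= HARD_FLOOR


def mean_gate(scores, threshold=PASS_SCORE):
    """Hypothetical alternative gate, shown for comparison only."""
    return mean(scores) >= threshold


if __name__ == "__main__":
    # Nine good cases and one catastrophic one.
    scores = [0.85] * 9 + [0.10]
    print("mean gate passes:", mean_gate(scores))
    print("fraction gate passes:", fraction_gate(scores))
```

In this example the mean stays well above 0.60, so a mean gate would let the change through, while the hard floor in the fraction gate blocks it.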
Deterministic asserts before the judge
Word caps, required facts, and boilerplate bans don't need an LLM. The judge only spends tokens on what actually requires judgment: faithfulness and coverage.
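A rough sketch of this ordering follows. The boilerplate patterns, the function names, and the choice to score a failed case as 0.0 are all assumptions made for illustration. The `judge` argument is a stand-in callable for the G-Eval metric, not DeepEval's API.

```python
import re

# Hypothetical boilerplate phrases; the project's real ban list is not given.
BOILERPLATE = (r"\bas an ai\b", r"\bi hope this helps\b")


def deterministic_failures(output, max_words, required_facts):
    """Return a list of failure messages from the cheap, non-LLM checks."""
    failures = []
    word_count = len(output.split())
    if word_count > max_words:
        failures.append(f"word cap exceeded: {word_count} > {max_words}")
    lowered = output.lower()
    for fact in required_facts:
        if fact.lower() not in lowered:
            failures.append(f"missing required fact: {fact}")
    for pattern in BOILERPLATE:
        if re.search(pattern, lowered):
            failures.append(f"boilerplate matched: {pattern}")
    return failures


def score_case(output, max_words, required_facts, judge):
    """Run the deterministic layer first; call the judge only if it passes.

    Assumption: a case that fails any assert is scored 0.0 without
    spending judge tokens.
    """
    failures = deterministic_failures(output, max_words, required_facts)
    if failures:
        return 0.0, failures
    return judge(output), []
```

Passing the judge in as a callable keeps the deterministic layer testable without any model access.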
System notes
- 55 hand-written golden cases across 5 domains, each with deliberate traps
- Two-layer gate: deterministic asserts plus a G-Eval judge scoring faithfulness and coverage
- Thresholds calibrated from a 3-run noise floor (mean 0.757, spread 0.085), not guessed
- Fraction-based gate (90% of cases must pass) resists single-case judge noise
Stack
pytest · DeepEval · GitHub Actions · AWS Bedrock · Pydantic · Python