2026 · Eval infrastructure · CI engineering

Prompt Regression Testing: CI Quality Gates for LLM Behavior

Unit testing for LLM behavior: a calibrated golden suite runs on every PR and blocks merge when a prompt or model change drops output quality below a gate calibrated against the measured noise floor.

System architecture

Build spec

Golden suite: 55 cases · 5 domains · trap-seeded
Gate: 90% of cases at 0.60+ · hard floor 0.40
Baseline: 3-run noise floor · mean 0.757 · spread 0.085
CI: GitHub Actions · full 55-case suite vs 12-case smoke set
Stack: pytest + DeepEval · Claude Haiku (Bedrock)

Problem

The silent failure mode of LLM products: tweak a prompt, it looks fine on three examples, ship it, and quietly break five other behaviors. Spot-checking doesn't scale and nobody notices until a user does.

Approach

A golden suite of 55 hand-written cases across five domains, each seeded with deliberate traps (buried ledes, jargon, financial figures, tone attribution). Every PR touching prompts or models runs the suite in GitHub Actions through a two-layer gate: deterministic asserts (word caps, required facts, no AI boilerplate) plus a G-Eval judge for faithfulness and coverage. Merge is blocked unless 90% of cases score 0.60 or higher with no case below 0.40.
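The merge rule above can be sketched as a small pure function. This is a minimal illustration, not the project's actual code: the thresholds come from the gate spec, while `GateResult`, `evaluate_gate`, and the case-id keys are hypothetical names introduced here.

```python
from dataclasses import dataclass

PASS_SCORE = 0.60     # per-case pass mark from the gate spec
PASS_FRACTION = 0.90  # share of cases that must reach PASS_SCORE
HARD_FLOOR = 0.40     # no single case may score below this


@dataclass
class GateResult:
    passed: bool
    pass_fraction: float
    below_pass: list[str]   # case ids scoring under PASS_SCORE
    below_floor: list[str]  # case ids scoring under HARD_FLOOR


def evaluate_gate(scores: dict[str, float]) -> GateResult:
    """Apply the fraction gate and the hard floor to per-case scores."""
    if not scores:
        raise ValueError("no scores to evaluate")
    below_pass = sorted(cid for cid, s in scores.items() if s < PASS_SCORE)
    below_floor = sorted(cid for cid, s in scores.items() if s < HARD_FLOOR)
    pass_fraction = (len(scores) - len(below_pass)) / len(scores)
    # Both conditions must hold: enough cases pass, and none is catastrophic.
    passed = pass_fraction >= PASS_FRACTION and not below_floor
    return GateResult(passed, pass_fraction, below_pass, below_floor)
```

With 55 cases, this rule tolerates up to five cases between 0.40 and 0.60 (50/55 is about 90.9%); a sixth miss, or any single case under 0.40, blocks the merge.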

Impact

Thresholds are calibrated, not invented: a 3-run baseline established the noise floor (mean 0.757, spread 0.085) before any gate was set. CI is cost-scoped: prompt-touching PRs run all 55 cases, everything else runs a 12-case smoke set. Every blocked merge writes a regression log naming what changed and which cases broke.
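The cost scoping could be implemented as a path check on the files a PR changes. The sketch below assumes prompts and model config live under recognizable paths; the patterns in `PROMPT_PATTERNS` and the function name `select_suite` are hypothetical, since the repository layout is not described here.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns for prompt and model files (assumed layout).
PROMPT_PATTERNS = ("prompts/*", "*/prompts/*", "config/models*.yaml")


def select_suite(changed_files: list[str]) -> str:
    """Return "full" if any changed file touches prompts or models, else "smoke"."""
    for path in changed_files:
        if any(fnmatchcase(path, pattern) for pattern in PROMPT_PATTERNS):
            return "full"
    return "smoke"
```

In CI, the list of changed files could come from something like `git diff --name-only origin/main...HEAD`, with the result deciding whether all 55 cases or only the 12-case smoke set is passed to pytest.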

Decisions & tradeoffs

Calibrate thresholds from measured noise

An invented threshold either blocks good PRs or passes bad ones. Three baseline runs with the prompt and model held fixed established the real run-to-run spread first; the gate is then set so that normal noise passes and only drops larger than the noise block a merge.
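A minimal sketch of the calibration step, assuming "spread" means the gap between the highest and lowest run-level mean (the write-up does not define it). `noise_floor` is a hypothetical helper name, and any scores passed to it in examples are illustrative rather than the project's measurements.

```python
from statistics import mean


def noise_floor(run_scores: list[list[float]]) -> tuple[float, float]:
    """Return (mean, spread) across repeated runs of an unchanged suite.

    Assumption: spread = max run mean - min run mean.
    """
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to measure run-to-run spread")
    run_means = [mean(run) for run in run_scores]
    return mean(run_means), max(run_means) - min(run_means)
```

Each inner list holds the per-case judge scores from one baseline run. A score drop smaller than the returned spread cannot be distinguished from judge noise, so gate thresholds are only meaningful when they leave at least that much margin.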

Fraction gate over mean gate

A mean can hide one catastrophic case behind nine good ones. Requiring 90% of cases to individually pass, with a hard floor, makes single-behavior regressions un-hideable.
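A small worked comparison makes the difference concrete. The scores are illustrative, and the mean gate threshold of 0.60 is an assumption chosen to mirror the per-case pass mark; only the 90% fraction and the 0.40 floor come from the gate spec.

```python
from statistics import mean

# Illustrative scores: nine good cases and one catastrophic case.
scores = [0.9] * 9 + [0.1]

# Hypothetical mean gate: the average hides the broken case.
mean_gate_passes = mean(scores) >= 0.60

# Fraction gate with hard floor: each case is judged individually.
fraction_ok = sum(s >= 0.60 for s in scores) / len(scores) >= 0.90
floor_ok = min(scores) >= 0.40
fraction_gate_passes = fraction_ok and floor_ok

print(f"mean gate passes: {mean_gate_passes}")
print(f"fraction gate passes: {fraction_gate_passes}")
```

Here nine of ten cases still clear 0.60, so the 90% condition alone would pass; it is the hard floor that catches the single collapsed case, which is why the gate needs both parts.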

Deterministic asserts before the judge

Word caps, required facts, and boilerplate bans don't need an LLM. The judge only spends tokens on what actually requires judgment: faithfulness and coverage.
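One way to order the two layers is to run the cheap checks first and call the judge only when they pass. The sketch below is an assumption about that flow, not the project's implementation: the phrase list, the function names, and the choice to score a failed assert as 0.0 without calling the judge are all hypothetical, and `judge` stands in for whatever G-Eval call the suite uses.

```python
from typing import Callable

# Hypothetical examples of banned AI boilerplate (lowercase for matching).
BOILERPLATE = ("as an ai", "i hope this helps", "certainly!")


def deterministic_failures(
    output: str, max_words: int, required_facts: list[str]
) -> list[str]:
    """Run the cheap checks: word cap, required facts, boilerplate ban."""
    failures = []
    word_count = len(output.split())
    if word_count > max_words:
        failures.append(f"word cap exceeded: {word_count} > {max_words}")
    lowered = output.lower()
    for fact in required_facts:
        if fact.lower() not in lowered:
            failures.append(f"missing fact: {fact}")
    for phrase in BOILERPLATE:
        if phrase in lowered:
            failures.append(f"boilerplate: {phrase}")
    return failures


def score_case(
    output: str,
    max_words: int,
    required_facts: list[str],
    judge: Callable[[str], float],
) -> tuple[float, list[str]]:
    """Return (score, failures); the judge runs only if the asserts pass."""
    failures = deterministic_failures(output, max_words, required_facts)
    if failures:
        # Assumption: a failed assert scores 0.0 and skips the judge call.
        return 0.0, failures
    return judge(output), []
```

Because the deterministic layer returns named failures, the same list can feed the regression log without any extra judge tokens.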

System notes

55 hand-written golden cases across 5 domains, each with deliberate traps
Two-layer gate: deterministic asserts plus a G-Eval judge scoring faithfulness and coverage
Thresholds calibrated from a 3-run noise floor (mean 0.757, spread 0.085), not guessed
Fraction-based gate (90% of cases must pass) resists single-case judge noise

Stack

pytest · DeepEval · GitHub Actions · AWS Bedrock · Pydantic · Python
