LLM Red-Teaming FrameworkAdversarial Safety Evaluation
An automated adversarial evaluation framework probing LLM applications for jailbreaks, prompt injection, PII leakage, and toxicity across 40+ categories, with canary-based deterministic scoring and measurable before/after hardening.
System architecture
Build spec
- Attack corpus
- 520 AdvBench + DeepEval generated · 40+ categories
- Scoring
- Canary ground truth + LLM judge hybrid
- Protocol
- Frozen v1 baseline, harden, replay on v2
- Focus
- OWASP LLM Top 10 · application-layer injection
- Models
- Claude Haiku target + judge (AWS Bedrock)
Problem
Modern aligned models refuse blunt jailbreaks, so the real attack surface is the application layer: indirect prompt injection, context leakage, PII extraction. Teams harden blindly, with no number proving the hardening worked.
Approach
Replays 520 AdvBench prompts plus DeepEval-generated attacks across 40+ vulnerability categories against a frozen v1 of the target app, scores violations with a hybrid system (planted canary secrets for deterministic leak detection, LLM judge for model-safety categories), then hardens prompts and guardrails into a frozen v2 and replays the identical attack set. The deliverable is a before/after violation table per category.
Impact
Hardening stops being a vibe. Every guardrail decision is justified by a per-category violation-rate delta on an immutable, reproducible attack set, with redacted attack examples in the safety report.
Decisions & tradeoffs
Canaries over judges for leak detection
Planting synthetic secrets in context makes leakage binary: the canary string either appears in output or it doesn't. Ground truth where ground truth is possible; the LLM judge is reserved for categories that genuinely need judgment.
Application-layer attacks over blunt jailbreaks
Single-turn 'ignore your instructions' fails against modern aligned models. The exploitable surface is indirect injection through retrieved documents and tool outputs, so that's where the attack budget goes.
Immutable versions, identical replay
v1 and v2 are frozen, and the exact same attack set runs against both. Any delta in the violation table is attributable to the hardening, nothing else.
System notes
- 520 AdvBench prompts and 40+ DeepEval vulnerability categories run in parallel, reported separately
- Canary-based scoring: planted synthetic secrets make context-leak detection deterministic, no judge needed
- Frozen v1/v2 versioning: identical attack set replayed against naive and hardened builds
- Hybrid judging: canaries for application-layer attacks, LLM judge with stored verdicts for safety categories
Stack
DeepEval · AdvBench · AWS Bedrock · Claude · OWASP LLM Top 10 · Python