2026Security evaluation · Measurable hardening

LLM Red-Teaming FrameworkAdversarial Safety Evaluation

An automated adversarial evaluation framework probing LLM applications for jailbreaks, prompt injection, PII leakage, and toxicity across 40+ categories, with canary-based deterministic scoring and measurable before/after hardening.

System architecture

Build spec

Attack corpus: 520 AdvBench + DeepEval generated · 40+ categories
Scoring: Canary ground truth + LLM judge hybrid
Protocol: Frozen v1 baseline, harden, replay on v2
Focus: OWASP LLM Top 10 · application-layer injection
Models: Claude Haiku target + judge (AWS Bedrock)

Problem

Modern aligned models refuse blunt jailbreaks, so the real attack surface is the application layer: indirect prompt injection, context leakage, PII extraction. Teams harden blindly, with no number proving the hardening worked.

Approach

Replays 520 AdvBench prompts plus DeepEval-generated attacks across 40+ vulnerability categories against a frozen v1 of the target app, scores violations with a hybrid system (planted canary secrets for deterministic leak detection, LLM judge for model-safety categories), then hardens prompts and guardrails into a frozen v2 and replays the identical attack set. The deliverable is a before/after violation table per category.

Impact

Hardening stops being a vibe. Every guardrail decision is justified by a per-category violation-rate delta on an immutable, reproducible attack set, with redacted attack examples in the safety report.

Decisions & tradeoffs

Canaries over judges for leak detection

Planting synthetic secrets in context makes leakage binary: the canary string either appears in output or it doesn't. Ground truth where ground truth is possible; the LLM judge is reserved for categories that genuinely need judgment.

Application-layer attacks over blunt jailbreaks

Single-turn 'ignore your instructions' fails against modern aligned models. The exploitable surface is indirect injection through retrieved documents and tool outputs, so that's where the attack budget goes.

Immutable versions, identical replay

v1 and v2 are frozen, and the exact same attack set runs against both. Any delta in the violation table is attributable to the hardening, nothing else.

System notes

520 AdvBench prompts and 40+ DeepEval vulnerability categories run in parallel, reported separately
Canary-based scoring: planted synthetic secrets make context-leak detection deterministic, no judge needed
Frozen v1/v2 versioning: identical attack set replayed against naive and hardened builds
Hybrid judging: canaries for application-layer attacks, LLM judge with stored verdicts for safety categories

Stack

DeepEval · AdvBench · AWS Bedrock · Claude · OWASP LLM Top 10 · Python

View source on GitHub

Next project

Hallucination Detection Pipeline · Per-Topic Failure Rates on TruthfulQA