Skip to work
All work
2026Security evaluation · Measurable hardening

LLM Red-Teaming FrameworkAdversarial Safety Evaluation

An automated adversarial evaluation framework probing LLM applications for jailbreaks, prompt injection, PII leakage, and toxicity across 40+ categories, with canary-based deterministic scoring and measurable before/after hardening.

System architecture

AdvBench520 promptsDeepEval40+ categoriesAttack runnerparallel replayTarget v1frozen baselineTarget v2hardenedHybrid scorercanary + judgebefore / afterviolation table

Build spec

Attack corpus
520 AdvBench + DeepEval generated · 40+ categories
Scoring
Canary ground truth + LLM judge hybrid
Protocol
Frozen v1 baseline, harden, replay on v2
Focus
OWASP LLM Top 10 · application-layer injection
Models
Claude Haiku target + judge (AWS Bedrock)

Problem

Modern aligned models refuse blunt jailbreaks, so the real attack surface is the application layer: indirect prompt injection, context leakage, PII extraction. Teams harden blindly, with no number proving the hardening worked.

Approach

Replays 520 AdvBench prompts plus DeepEval-generated attacks across 40+ vulnerability categories against a frozen v1 of the target app, scores violations with a hybrid system (planted canary secrets for deterministic leak detection, LLM judge for model-safety categories), then hardens prompts and guardrails into a frozen v2 and replays the identical attack set. The deliverable is a before/after violation table per category.

Impact

Hardening stops being a vibe. Every guardrail decision is justified by a per-category violation-rate delta on an immutable, reproducible attack set, with redacted attack examples in the safety report.

Decisions & tradeoffs

Canaries over judges for leak detection

Planting synthetic secrets in context makes leakage binary: the canary string either appears in output or it doesn't. Ground truth where ground truth is possible; the LLM judge is reserved for categories that genuinely need judgment.

Application-layer attacks over blunt jailbreaks

Single-turn 'ignore your instructions' fails against modern aligned models. The exploitable surface is indirect injection through retrieved documents and tool outputs, so that's where the attack budget goes.

Immutable versions, identical replay

v1 and v2 are frozen, and the exact same attack set runs against both. Any delta in the violation table is attributable to the hardening, nothing else.

System notes

  • 520 AdvBench prompts and 40+ DeepEval vulnerability categories run in parallel, reported separately
  • Canary-based scoring: planted synthetic secrets make context-leak detection deterministic, no judge needed
  • Frozen v1/v2 versioning: identical attack set replayed against naive and hardened builds
  • Hybrid judging: canaries for application-layer attacks, LLM judge with stored verdicts for safety categories

Stack

DeepEval · AdvBench · AWS Bedrock · Claude · OWASP LLM Top 10 · Python

View source on GitHub
Next project
Hallucination Detection Pipeline · Per-Topic Failure Rates on TruthfulQA