Clinical LLM Bias Audit
The Geographic Disparity Index
A reproducible fairness-audit framework for clinical LLMs. It measures whether a model changes its care recommendation when only the patient's perceived geography or name is perturbed, with everything clinically relevant held fixed.
System architecture
Build spec
- Origin: LUMS CS-5312 research, then productionized
- Cases: 300 Global-South cases per model
- Providers: OpenAI · Groq · AWS Bedrock
- Statistics: Wilcoxon + BCa bootstrap + Cohen's h, no scipy
- Result: GDI near zero, reported with confidence intervals
Problem
A clinical assistant that tells a Boston patient to come in for a visit but tells an identical Lagos patient to manage it at home is an equity failure with real stakes. Standard accuracy benchmarks cannot see geography-driven or name-driven disparity, so it goes unmeasured.
Approach
A deterministic perturbation engine rewrites clinical vignettes by name, geography, or both from a fixed seed. A multi-provider harness (OpenAI, Groq, AWS Bedrock) generates care recommendations with rate limiting and an idempotency cache. An LLM annotator then maps each completion to a manage/visit/resource care axis, and a from-scratch stats layer computes the Geographic Disparity Index with paired Wilcoxon signed-rank tests, BCa bootstrap confidence intervals, and Cohen's h. Everything surfaces through a FastAPI /audit service and a Streamlit dashboard.
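To make the fixed-seed rewrite concrete, here is a minimal sketch of a name/geography perturbation. The name pools, city labels, the `perturb` signature, and the `{name}` / `{city}` template slots are all hypothetical stand-ins; the project's actual engine is not shown in this write-up.

```python
import hashlib
import random

# Hypothetical pools for illustration only; not the audit's real lists.
NAMES = {
    "boston": ["Emily Carter", "James Walsh"],
    "lagos": ["Chidi Okafor", "Amina Bello"],
}
CITIES = {"boston": "Boston", "lagos": "Lagos"}


def perturb(template: str, case_id: str, target: str, mode: str,
            seed: int, baseline: str = "boston") -> str:
    """Fill the {name} and {city} slots of a vignette template.

    mode is "name", "geo", or "both": it selects which slot moves to the
    target region while the other stays at the baseline region.
    """
    if mode not in {"name", "geo", "both"}:
        raise ValueError(f"unknown mode: {mode}")

    # Derive the RNG from a hash of (seed, case, target, mode) so the choice
    # does not depend on the order in which cases are processed.
    digest = hashlib.sha256(f"{seed}:{case_id}:{target}:{mode}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    name_region = target if mode in {"name", "both"} else baseline
    geo_region = target if mode in {"geo", "both"} else baseline
    name = rng.choice(NAMES[name_region])
    return template.format(name=name, city=CITIES[geo_region])


if __name__ == "__main__":
    vignette = "{name}, 54, presents in {city} with a three-day cough."
    print(perturb(vignette, "case-001", "lagos", "geo", seed=7))
    print(perturb(vignette, "case-001", "lagos", "both", seed=7))
```

Because the RNG is re-derived per case instead of shared across the run, adding or reordering cases would not change the text generated for any existing case, which is one way to keep downstream cache keys stable.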
Impact
On 300 Global-South cases per model (Claude Haiku 4.5 and Llama 3.3 70B), the audit reports a null result with confidence intervals: GDI near zero and no statistically significant geographic disparity. The point is the instrument. It is a reproducible, byte-identical audit that reports null results honestly with intervals and a power analysis, which is the rigor responsible-AI teams hire for.
Decisions & tradeoffs
Stdlib-only statistics, no scipy
Reimplementing Wilcoxon, BCa bootstrap, and Cohen's h by hand keeps the audit instrument auditable line by line and dependency-light. It also gives the test suite a clean target for correctness.
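As an illustration of how small a stdlib-only version can be, the sketch below implements Cohen's h and a paired Wilcoxon signed-rank test using only `math` and `statistics`. It is not the project's code. It assumes the normal approximation with a tie correction and no continuity correction, and it drops zero differences; the source does not say which variant the audit uses.

```python
import math
from statistics import NormalDist


def cohens_h(p1: float, p2: float) -> float:
    """Effect size for two proportions: difference of arcsine-transformed values."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def wilcoxon_signed_rank(x, y):
    """Return (W+, two-sided p) for paired samples x and y.

    Assumptions of this sketch: zero differences are dropped, tied absolute
    differences get average ranks, and the p-value uses the normal
    approximation (reasonable for large n, rough for small n).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # no non-zero differences: nothing to test

    # Rank |d| ascending, assigning the average rank within each tie group.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean_w) / math.sqrt(var_w)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, p


if __name__ == "__main__":
    print(cohens_h(0.5, 0.5))  # identical proportions give 0.0
    print(wilcoxon_signed_rank([1, 2, 3, 4, 5], [0, 0, 0, 0, 0]))
```

Functions this short are easy to pin down with hand-computed fixtures, which is the sense in which a from-scratch layer gives the test suite a clean target.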
Report the null result honestly
A fairness audit that only ever finds bias is a broken instrument. Publishing a GDI near zero with intervals and a power analysis demonstrates the instrument is calibrated, instead of fishing for a positive finding.
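For readers who want to see what an interval around a near-zero estimate involves, here is a hedged sketch of a BCa (bias-corrected and accelerated) bootstrap interval in pure stdlib. The function name, defaults, and the toy paired-difference data are assumptions for illustration; the audit's own implementation and its GDI statistic are not reproduced here.

```python
import random
from statistics import NormalDist, mean


def bca_interval(data, stat=mean, n_boot=2000, alpha=0.05, seed=0):
    """BCa bootstrap confidence interval for stat(data); data must be a list."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(data)
    theta_hat = stat(data)
    boots = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    nd = NormalDist()

    # Bias correction z0: where the point estimate sits in the bootstrap
    # distribution. Clamp the proportion so inv_cdf never sees 0 or 1.
    below = sum(b < theta_hat for b in boots) / n_boot
    below = min(max(below, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    z0 = nd.inv_cdf(below)

    # Acceleration a from jackknife (leave-one-out) estimates.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jbar = mean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    def endpoint(q):
        z = nd.inv_cdf(q)
        adjusted = nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
        idx = min(max(int(adjusted * n_boot), 0), n_boot - 1)
        return boots[idx]

    return endpoint(alpha / 2), endpoint(1 - alpha / 2)


if __name__ == "__main__":
    # Toy paired differences (made-up values, not audit results): mostly no
    # change, with a few shifts in each direction.
    diffs = [0] * 30 + [1] * 5 + [-1] * 5
    print(bca_interval(diffs, seed=1))
```

An interval that straddles zero and is narrow is what separates "no detectable disparity" from "not enough data to tell", which is why the interval matters as much as the point estimate.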
Deterministic seed plus manifest provenance
Caching responses on a hash of model, prompt, seed, and temperature makes reruns byte-identical and free, and a SHA-256 manifest captures every input. A rerun reproduces the original outputs exactly and makes no redundant paid API calls.
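The caching idea can be sketched in a few lines. The `cache_key` and `ResponseCache` names, the JSON serialization, and the in-memory dict are assumptions made for this example; only the four key fields (model, prompt, seed, temperature) and the use of SHA-256 come from the description above.

```python
import hashlib
import json


def cache_key(model: str, prompt: str, seed: int, temperature: float) -> str:
    """SHA-256 over a canonical JSON encoding of the four request fields."""
    payload = json.dumps(
        {
            "model": model,
            "prompt": prompt,
            "seed": seed,
            # Normalize so 0 and 0.0 produce the same key.
            "temperature": float(temperature),
        },
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory idempotency cache; a real harness would persist this to disk."""

    def __init__(self):
        self._store = {}

    def get_or_call(self, model, prompt, seed, temperature, call):
        key = cache_key(model, prompt, seed, temperature)
        if key not in self._store:
            # Only a cache miss reaches the (paid) provider call.
            self._store[key] = call(model, prompt, seed, temperature)
        return self._store[key]


if __name__ == "__main__":
    def fake_provider(model, prompt, seed, temperature):
        return "visit"  # stand-in for a real completion

    cache = ResponseCache()
    print(cache.get_or_call("demo-model", "vignette text", 7, 0.0, fake_provider))
    print(cache_key("demo-model", "vignette text", 7, 0.0))
```

Serializing with sorted keys and fixed separators matters here: any variation in the encoding would change the digest and silently turn a cache hit into a new API call.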
System notes
- Wilcoxon, BCa bootstrap, and Cohen's h implemented from scratch in stdlib, no scipy, fully unit-tested
- Deterministic perturbation plus an idempotency cache keyed on a SHA-256 of model, prompt, seed, and temperature makes reruns byte-identical and free
- Pre-registered Bonferroni-corrected alpha of 0.005 across 3 regions and 3 care axes
- Ships name-only, geo-only, and combined ablations plus gender-by-geography intersectional panels
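For context on the corrected alpha in the notes above, the sketch below enumerates the 3 regions by 3 care axes grid and computes a plain Bonferroni threshold. The family-wise rate of 0.05 and the region labels are assumptions, since the write-up states neither. Under that assumption, the stated 0.005 is slightly stricter than 0.05 divided by 9.

```python
from itertools import product

# Placeholder region labels; the care axes come from the write-up.
REGIONS = ("region_a", "region_b", "region_c")
AXES = ("manage", "visit", "resource")

ALPHA_PREREGISTERED = 0.005  # value stated in the system notes


def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test threshold that bounds the family-wise error rate."""
    return family_alpha / n_tests


def is_significant(p_value: float, alpha: float = ALPHA_PREREGISTERED) -> bool:
    """Apply the pre-registered per-test threshold to one cell's p-value."""
    return p_value < alpha


cells = list(product(REGIONS, AXES))

if __name__ == "__main__":
    print(len(cells))                                   # 9
    print(round(bonferroni_alpha(0.05, len(cells)), 4))  # 0.0056
    print(is_significant(0.004), is_significant(0.03))   # True False
```

Fixing the threshold before looking at any results is what makes it pre-registered: every region/axis cell is judged against the same number regardless of what the data show.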
Stack
AWS Bedrock · OpenAI · Groq · FastAPI · Streamlit · Wilcoxon / BCa bootstrap