Flagship · 2026 · LUMS graduate research · Productionized

Clinical LLM Bias Audit: The Geographic Disparity Index

A reproducible fairness-audit framework for clinical LLMs. It measures whether a model changes its care recommendation when only the patient's perceived geography or name is perturbed, with everything clinically relevant held fixed.

System architecture

Build spec

Origin: LUMS CS-5312 research, then productionized
Cases: 300 Global-South cases per model
Providers: OpenAI · Groq · AWS Bedrock
Statistics: Wilcoxon + BCa bootstrap + Cohen's h, no scipy
Result: GDI near zero, reported with confidence intervals

Problem

A clinical assistant that tells a Boston patient to come in for a visit but tells an identical Lagos patient to manage it at home is an equity failure with real stakes. Standard accuracy benchmarks cannot see geography-driven or name-driven disparity, so it goes unmeasured.

Approach

A deterministic perturbation engine rewrites clinical vignettes by name, geography, or both from a fixed seed. A multi-provider harness (OpenAI, Groq, AWS Bedrock) generates care recommendations with rate limiting and an idempotency cache. An LLM annotator maps each completion to a manage/visit/resource care axis. A from-scratch stats layer then computes the Geographic Disparity Index with paired Wilcoxon signed-rank tests, BCa bootstrap confidence intervals, and Cohen's h. Everything surfaces through a FastAPI /audit service and a Streamlit dashboard.
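As a rough illustration of the seeded perturbation step, here is a minimal sketch. The `Vignette` class, the `perturb` function, the `{name}`/`{place}` template slots, and the substitution pools are all hypothetical names chosen for this example; the project's actual data model and name or city lists are not shown here.

```python
import random
from dataclasses import dataclass

# Hypothetical substitution pools, for illustration only.
NAMES = ("Emily Carter", "Adaeze Okafor", "Ayesha Khan")
PLACES = ("Boston", "Lagos", "Lahore")


@dataclass(frozen=True)
class Vignette:
    template: str  # clinical text with {name} and {place} slots
    name: str
    place: str

    def render(self) -> str:
        return self.template.format(name=self.name, place=self.place)


def perturb(v: Vignette, mode: str, seed: int) -> Vignette:
    """Swap the name, the place, or both, leaving the clinical text untouched."""
    if mode not in ("name", "geo", "both"):
        raise ValueError(f"unknown mode: {mode!r}")
    # random.Random hashes a string seed deterministically, so the same
    # (seed, mode, vignette) triple always produces the same swap.
    rng = random.Random(f"{seed}|{mode}|{v.template}|{v.name}|{v.place}")
    name, place = v.name, v.place
    if mode in ("name", "both"):
        name = rng.choice([n for n in NAMES if n != v.name])
    if mode in ("geo", "both"):
        place = rng.choice([p for p in PLACES if p != v.place])
    return Vignette(v.template, name, place)


base = Vignette(
    "{name}, 54, from {place}, reports chest tightness on exertion.",
    "Emily Carter",
    "Boston",
)
variant = perturb(base, "geo", seed=7)
print(variant.render())
```

Because only the `name` and `place` fields change, any difference in the model's recommendation between `base` and `variant` can be attributed to the perturbed attribute rather than to the clinical content.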

Impact

On 300 Global-South cases per model (Claude Haiku 4.5 and Llama 3.3 70B), the audit reports a null result with confidence intervals: GDI near zero and no statistically significant geographic disparity. The point is the instrument: a reproducible audit with byte-identical reruns that reports null results honestly, with intervals and a power analysis. That is the rigor responsible-AI teams hire for.

Decisions & tradeoffs

Stdlib-only statistics, no scipy

Reimplementing Wilcoxon, BCa bootstrap, and Cohen's h by hand keeps the audit instrument auditable line by line and dependency-light. It also gives the test suite a clean target for correctness.
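To give a sense of what a stdlib-only test looks like, the following is a minimal sketch of a two-sided paired Wilcoxon signed-rank test. It assumes the common large-sample normal approximation with a tie correction and no continuity correction; the project's own implementation may differ (for example by using exact small-sample tables), and the function name `wilcoxon_signed_rank` is chosen for this example.

```python
import math
from statistics import NormalDist


def wilcoxon_signed_rank(x, y):
    """Return (W_plus, two-sided p) for paired samples x and y.

    Normal approximation with tie correction; zero differences are dropped.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # no non-zero differences, nothing to test

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, min(1.0, p)
```

Each step (differences, ranks, tie term, z-score) is a few lines of plain Python, which is what makes a hand-written version easy to unit-test against known values.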

Report the null result honestly

A fairness audit that only ever finds bias is a broken instrument. Publishing a GDI near zero with intervals and a power analysis demonstrates the instrument is calibrated, instead of fishing for a positive finding.
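The intervals mentioned above are BCa (bias-corrected and accelerated) bootstrap intervals. The sketch below shows one standard way to compute them with only the standard library; the function name `bca_interval`, its defaults (2000 resamples, 95% level), and the choice of the mean as the default statistic are assumptions for illustration, not the project's actual settings.

```python
import random
from statistics import NormalDist, fmean

_ND = NormalDist()


def bca_interval(data, stat=fmean, n_boot=2000, alpha=0.05, seed=0):
    """Return a (lower, upper) BCa bootstrap interval for stat(data)."""
    data = list(data)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    theta = stat(data)
    boots = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))

    # Bias correction z0: where the observed statistic falls in the
    # bootstrap distribution (clamped so inv_cdf stays finite).
    below = sum(b < theta for b in boots)
    prop = min(max(below / n_boot, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    z0 = _ND.inv_cdf(prop)

    # Acceleration a from leave-one-out (jackknife) estimates.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jmean = fmean(jack)
    num = sum((jmean - j) ** 3 for j in jack)
    den = 6.0 * sum((jmean - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0

    def adjusted(q):
        # Map a nominal quantile to its bias- and skew-adjusted quantile.
        z = z0 + _ND.inv_cdf(q)
        return _ND.cdf(z0 + z / (1.0 - a * z))

    def pick(q):
        return boots[min(n_boot - 1, max(0, int(q * n_boot)))]

    return pick(adjusted(alpha / 2)), pick(adjusted(1 - alpha / 2))
```

An interval that is narrow and straddles zero is what turns "we found nothing" into a bounded statement about how large any disparity could plausibly be.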

Deterministic seed plus manifest provenance

Caching responses on a hash of model, prompt, seed, and temperature makes reruns byte-identical and free, and a SHA-256 manifest captures every input. A rerun reads from the cache rather than re-sampling the provider, so results reproduce exactly and no paid API call is repeated.
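A cache keyed this way can be sketched in a few lines. The names `cache_key` and `ResponseCache`, the JSON canonicalization, and the in-memory dictionary are assumptions made for this example; the real cache presumably persists to disk and may serialize its key differently.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(model: str, prompt: str, seed: int, temperature: float) -> str:
    """SHA-256 over a canonical JSON encoding of the four request fields."""
    payload = json.dumps(
        {
            "model": model,
            "prompt": prompt,
            "seed": int(seed),
            "temperature": float(temperature),  # so 0 and 0.0 hash alike
        },
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory idempotency cache: each distinct request is sent once."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get_or_call(
        self,
        model: str,
        prompt: str,
        seed: int,
        temperature: float,
        call: Callable[[str, str, int, float], str],
    ) -> str:
        key = cache_key(model, prompt, seed, temperature)
        if key not in self._store:
            # Only a cache miss reaches the (paid) provider call.
            self._store[key] = call(model, prompt, seed, temperature)
        return self._store[key]
```

Sorting keys and fixing the separators keeps the serialized payload stable, so the same four inputs always hash to the same key regardless of argument order at the call site.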

System notes

Wilcoxon, BCa bootstrap, and Cohen's h implemented from scratch in stdlib, no scipy, fully unit-tested
Deterministic perturbation plus an idempotency cache keyed on a SHA-256 of model, prompt, seed, and temperature makes reruns byte-identical and free
Pre-registered Bonferroni-corrected alpha of 0.005 across 3 regions and 3 care axes
Ships name-only, geo-only, and combined ablations plus gender-by-geography intersectional panels
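Two of the quantities in these notes are simple enough to show directly. The sketch below gives Cohen's h for a pair of proportions and the plain Bonferroni per-test threshold; the function names and the family-wise level of 0.05 used in the example are assumptions, since the notes state only the corrected alpha.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Effect size for two proportions: difference of arcsine-transformed values."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a plain Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests


# Example: share of "visit" recommendations in two perturbation arms
# (illustrative numbers, not project results).
h = cohens_h(0.42, 0.40)

# 3 regions x 3 care axes = 9 tests, at an assumed family-wise level of 0.05.
per_test = bonferroni_alpha(0.05, 3 * 3)
```

Under that assumed family-wise level, nine tests give a per-test bound of roughly 0.0056, so a pre-registered 0.005 sits slightly on the conservative side of it.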

Stack

AWS Bedrock · OpenAI · Groq · FastAPI · Streamlit · Wilcoxon / BCa bootstrap

View source on GitHub
