Methodology

How we know what we know

LLM-as-a-judge with clinical rubrics, validated against expert baselines, built by a practising clinical psychologist. Here's exactly how it works — and where the limits are.

Part 1

LLM-as-a-Judge

Frontier language models score conversations against structured clinical rubrics — returning not just a number, but the reasoning and evidence behind it.

Why this approach

✓

Scalable

Analyses thousands of conversations in real-time without human bottlenecks.

✓

Consistent

The same rubric applied every time — no inter-rater variability across shifts or reviewers.

✓

Explainable

Returns reasoning and evidence quotes alongside the score. Not a black box.

✓

Cost-effective

$75/month AWS infrastructure handles 10K+ conversations — safety at scale is accessible.

JSON output per evaluation

scoresNumeric score per metric (0–10) or flag (none / indirect / direct)
reasoningStep-by-step clinical rationale for the score
evidenceQuoted excerpts from the AI message that drove the score
confidenceModel confidence level — low confidence flags are surfaced for priority review

Limitations — we're explicit

Not 100% accurate — no AI system is, and we don't claim otherwise.

Intentionally biased toward false positives — missing a real crisis is worse than over-alerting.

Human review required for final verification of every flagged incident.

Designed to assist human judgement, not replace it.

Transparency about limitations isn't a weakness — it's what makes the system trustworthy in clinical and legal contexts.

Part 2

Validation Protocol

Ongoing. Transparent. Iterative.

Protocol

1
Expert panel scoring
Clinical psychologist panel (3+ experts) scores 100 sample conversations independently.
2
LLM scoring
LLM-as-a-judge scores the same set using production rubrics.
3
Inter-rater reliability
Pearson r calculated between expert and model scores. Target: r > 0.80.
4
Transparent publication
Results published openly — including where the model underperforms.
5
Rubric iteration
Expert feedback used to refine rubrics. Repeat.

What this means for you

✓

High-sensitivity detection with known, documented failure modes.

✓

Accuracy limits disclosed — not selling magic, selling a calibrated instrument.

✓

You verify alerts before acting. We provide signal. You decide.

Status

Validation study in progress.

Results will be published openly when complete. Target: Pearson r > 0.80 across all six metrics.

Part 3

Why you can trust this

Clinical foundation

Dr. Michael Keeman

Clinical Psychologist · PhD Applied Data Science (Healthcare)

→15 years clinical psychology — direct work with suicidal and self-harm patients
→COVID frontline worker support platform — 320 workers, zero PTSD cases in validation cohort
→Research focus: applied AI in clinical and crisis contexts

Scientific rigour

17+ peer-reviewed papers · 2024–2026 · JMIR, ACL, Nature, arXiv

01
Al-Samaraee (JMIR 2024) — LLM empathy systematic review
02
Chen et al. (ACL 2024) — EmotionQueen benchmark
03
Hilbert et al. (JMIR 2024) — GPT-4 vs clinicians in crisis prediction
04
Ng et al. (JMIR 2025) — LLMs for suicide detection scoping review
05
van de Poel (Nature Human Behaviour 2025) — human-AI accountability
06
W&B Agent Evaluation Metrics (2025)
07
Ada Lovelace Institute (2025) — UK AI Bill harmful outputs
08
Tech Policy Press (2024) — ethical risks of AI assistants
09
9+ additional arXiv, Dialzara, and clinical AI papers (2024–2026)

Regulatory awareness

UK Online Safety Act
Safety monitoring infrastructure aligned with Act requirements for harmful AI interactions.
EU AI Act
High-risk system classification for mental health AI — compliance documentation available.
GDPR
Zero PII collection. Processor role. Data sovereignty retained by the client.
CCPA / CPRA
Service provider role, not a business. No personal information held in readable form — structurally compliant by architecture. Conversation IDs are opaque and only linkable to a person inside the client's own system.