Methodology
How we know what we know
LLM-as-a-judge with clinical rubrics, validated against expert baselines, built by a practising clinical psychologist. Here's exactly how it works — and where the limits are.
Part 1
LLM-as-a-Judge
Frontier language models score conversations against structured clinical rubrics — returning not just a number, but the reasoning and evidence behind it.
Why this approach
Scalable
Analyses thousands of conversations in real-time without human bottlenecks.
Consistent
The same rubric applied every time — no inter-rater variability across shifts or reviewers.
Explainable
Returns reasoning and evidence quotes alongside the score. Not a black box.
Cost-effective
$75/month AWS infrastructure handles 10K+ conversations — safety at scale is accessible.
JSON output per evaluation
- scoresNumeric score per metric (0–10) or flag (none / indirect / direct)
- reasoningStep-by-step clinical rationale for the score
- evidenceQuoted excerpts from the AI message that drove the score
- confidenceModel confidence level — low confidence flags are surfaced for priority review
Limitations — we're explicit
Not 100% accurate — no AI system is, and we don't claim otherwise.
Intentionally biased toward false positives — missing a real crisis is worse than over-alerting.
Human review required for final verification of every flagged incident.
Designed to assist human judgement, not replace it.
Transparency about limitations isn't a weakness — it's what makes the system trustworthy in clinical and legal contexts.
Part 2
Validation Protocol
Ongoing. Transparent. Iterative.
Protocol
- 1
Expert panel scoring
Clinical psychologist panel (3+ experts) scores 100 sample conversations independently.
- 2
LLM scoring
LLM-as-a-judge scores the same set using production rubrics.
- 3
Inter-rater reliability
Pearson r calculated between expert and model scores. Target: r > 0.80.
- 4
Transparent publication
Results published openly — including where the model underperforms.
- 5
Rubric iteration
Expert feedback used to refine rubrics. Repeat.
What this means for you
High-sensitivity detection with known, documented failure modes.
Accuracy limits disclosed — not selling magic, selling a calibrated instrument.
You verify alerts before acting. We provide signal. You decide.
Status
Validation study in progress.
Results will be published openly when complete. Target: Pearson r > 0.80 across all six metrics.
Part 3
Why you can trust this
Clinical foundation

Dr. Michael Keeman
Clinical Psychologist · PhD Applied Data Science (Healthcare)
- →15 years clinical psychology — direct work with suicidal and self-harm patients
- →COVID frontline worker support platform — 320 workers, zero PTSD cases in validation cohort
- →Research focus: applied AI in clinical and crisis contexts
Scientific rigour
17+ peer-reviewed papers · 2024–2026 · JMIR, ACL, Nature, arXiv
- 01
Al-Samaraee (JMIR 2024) — LLM empathy systematic review
- 02
Chen et al. (ACL 2024) — EmotionQueen benchmark
- 03
Hilbert et al. (JMIR 2024) — GPT-4 vs clinicians in crisis prediction
- 04
Ng et al. (JMIR 2025) — LLMs for suicide detection scoping review
- 05
van de Poel (Nature Human Behaviour 2025) — human-AI accountability
- 06
W&B Agent Evaluation Metrics (2025)
- 07
Ada Lovelace Institute (2025) — UK AI Bill harmful outputs
- 08
Tech Policy Press (2024) — ethical risks of AI assistants
- 09
9+ additional arXiv, Dialzara, and clinical AI papers (2024–2026)
Regulatory awareness
UK Online Safety Act
Safety monitoring infrastructure aligned with Act requirements for harmful AI interactions.
EU AI Act
High-risk system classification for mental health AI — compliance documentation available.
GDPR
Zero PII collection. Processor role. Data sovereignty retained by the client.
CCPA / CPRA
Service provider role, not a business. No personal information held in readable form — structurally compliant by architecture. Conversation IDs are opaque and only linkable to a person inside the client's own system.