EmpathyC

Methodology

How we know what we know

LLM-as-a-judge with clinical rubrics, validated against expert baselines, built by a practising clinical psychologist. Here's exactly how it works — and where the limits are.

Part 1

LLM-as-a-Judge

Frontier language models score conversations against structured clinical rubrics — returning not just a number, but the reasoning and evidence behind it.

Why this approach

Scalable

Analyses thousands of conversations in real time without human bottlenecks.

Consistent

The same rubric applied every time — no inter-rater variability across shifts or reviewers.

Explainable

Returns reasoning and evidence quotes alongside the score. Not a black box.

Cost-effective

$75/month AWS infrastructure handles 10K+ conversations — safety at scale is accessible.

JSON output per evaluation

  • scores: Numeric score per metric (0–10) or flag (none / indirect / direct)
  • reasoning: Step-by-step clinical rationale for the score
  • evidence: Quoted excerpts from the AI message that drove the score
  • confidence: Model confidence level — low-confidence flags are surfaced for priority review
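For illustration, a single evaluation might look like this. The field names come from the list above; the metric names and values are hypothetical:

```json
{
  "scores": { "empathy": 7, "crisis_flag": "indirect" },
  "reasoning": "The AI acknowledged distress but did not follow up on the user's statement of hopelessness.",
  "evidence": ["That sounds really hard. I'm sorry you're going through this."],
  "confidence": "low"
}
```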

Limitations — we're explicit

  • Not 100% accurate — no AI system is, and we don't claim otherwise.
  • Intentionally biased toward false positives — missing a real crisis is worse than over-alerting.
  • Human review required for final verification of every flagged incident.
  • Designed to assist human judgement, not replace it.

Transparency about limitations isn't a weakness — it's what makes the system trustworthy in clinical and legal contexts.
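Routing under this policy can be sketched in a few lines. This is a minimal illustration, not the production logic; the field names follow the JSON output described above, and the function name is hypothetical:

```python
def needs_human_review(evaluation: dict) -> bool:
    """Decide whether an evaluation enters the human review queue.

    Mirrors the stated policy: bias toward false positives
    (any indirect or direct flag goes to review), and surface
    low-confidence results for priority review.
    """
    flagged = evaluation.get("flag") in ("indirect", "direct")
    low_confidence = evaluation.get("confidence") == "low"
    return flagged or low_confidence


# Hypothetical evaluations:
print(needs_human_review({"flag": "direct", "confidence": "high"}))  # True
print(needs_human_review({"flag": "none", "confidence": "low"}))     # True
print(needs_human_review({"flag": "none", "confidence": "high"}))    # False
```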

Part 2

Validation Protocol

Ongoing. Transparent. Iterative.

Protocol

  1. Expert panel scoring: Clinical psychologist panel (3+ experts) scores 100 sample conversations independently.

  2. LLM scoring: LLM-as-a-judge scores the same set using production rubrics.

  3. Inter-rater reliability: Pearson r calculated between expert and model scores. Target: r > 0.80.

  4. Transparent publication: Results published openly — including where the model underperforms.

  5. Rubric iteration: Expert feedback used to refine rubrics. Repeat.
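The reliability check in step 3 can be sketched as follows. The score lists below are hypothetical; in practice the inputs would be the expert panel's and the model's scores for the same 100 conversations:

```python
from math import sqrt

def pearson_r(expert, model):
    """Pearson correlation between two equal-length score lists."""
    n = len(expert)
    mx = sum(expert) / n
    my = sum(model) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(expert, model))
    sx = sqrt(sum((x - mx) ** 2 for x in expert))
    sy = sqrt(sum((y - my) ** 2 for y in model))
    return cov / (sx * sy)

# Hypothetical 0-10 scores for the same eight conversations.
expert_scores = [8, 3, 9, 5, 7, 2, 6, 8]
llm_scores    = [7, 4, 9, 5, 6, 2, 7, 8]

r = pearson_r(expert_scores, llm_scores)
print(f"r = {r:.3f}", "meets target" if r > 0.80 else "needs rubric iteration")
```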

What this means for you

High-sensitivity detection with known, documented failure modes.

Accuracy limits disclosed — not selling magic, selling a calibrated instrument.

You verify alerts before acting. We provide signal. You decide.

Status

Validation study in progress.

Results will be published openly when complete. Target: Pearson r > 0.80 across all six metrics.

Part 3

Why you can trust this

Clinical foundation

Dr. Michael Keeman

Clinical Psychologist · PhD Applied Data Science (Healthcare)

  • 15 years clinical psychology — direct work with suicidal and self-harm patients
  • COVID frontline worker support platform — 320 workers, zero PTSD cases in validation cohort
  • Research focus: applied AI in clinical and crisis contexts

Scientific rigour

17+ peer-reviewed papers · 2024–2026 · JMIR, ACL, Nature, arXiv

  • Al-Samaraee (JMIR 2024) — LLM empathy systematic review
  • Chen et al. (ACL 2024) — EmotionQueen benchmark
  • Hilbert et al. (JMIR 2024) — GPT-4 vs clinicians in crisis prediction
  • Ng et al. (JMIR 2025) — LLMs for suicide detection scoping review
  • van de Poel (Nature Human Behaviour 2025) — human-AI accountability
  • W&B Agent Evaluation Metrics (2025)
  • Ada Lovelace Institute (2025) — UK AI Bill harmful outputs
  • Tech Policy Press (2024) — ethical risks of AI assistants
  • 9+ additional arXiv, Dialzara, and clinical AI papers (2024–2026)

Regulatory awareness

  • UK Online Safety Act

    Safety monitoring infrastructure aligned with Act requirements for harmful AI interactions.

  • EU AI Act

    High-risk system classification for mental health AI — compliance documentation available.

  • GDPR

    Zero PII collection. Processor role. Data sovereignty retained by the client.

  • CCPA / CPRA

    Service provider role, not a business. No personal information held in readable form — structurally compliant by architecture. Conversation IDs are opaque and only linkable to a person inside the client's own system.