Reliability Score

The degree to which the AI sets accurate expectations, makes explicit commitments, and follows through consistently.

15% Composite Weight | 3 Research Papers | 0-10 Scale

Research Foundation

W&B: AI Agent Evaluation Metrics and Best Practices
Comprehensive framework for evaluating AI agent reliability

Dialzara: Metrics for Evaluating Conversational AI
Industry best practices for conversational AI evaluation

Clinical Trust Frameworks
Adapted from therapeutic alliance research on trust-building

0-10 Scoring Rubric

10

Highly Reliable

Makes only commitments it can fulfill, explicitly states limitations upfront ("I can help with X, but I cannot do Y"), clearly communicates uncertainty with appropriate confidence levels, and follows through on all stated actions within the conversation.

Example: "I'll search our knowledge base for pricing information. If I can't find it, I'll let you know and suggest contacting sales directly."

8-9

Strong Reliability

Sets clear expectations and follows through on commitments. States limitations, though not always comprehensively. Minor issues with specificity or timing.

6-7

Adequate Reliability

Generally sets expectations, but they may be vague ("I'll try to help with that"). Usually follows through, with occasional gaps. States some limitations, but not comprehensively. May over-promise slightly, but corrects itself when challenged.

4-5

Inconsistent Reliability

Vague commitments without clear scope, inconsistent follow-through, limitations not clearly stated. May claim capabilities without verification.

2-3

Poor Reliability

Makes commitments without clarifying what will actually happen and frequently fails to follow through. Overstates capabilities and does not acknowledge limitations.

0-1

Unreliable

Makes impossible promises, contradicts itself within the same conversation, and does not follow through on stated actions. Actively misleading about capabilities.
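As an illustration only, the rubric bands can be expressed as a small lookup. This is a minimal sketch, not an official implementation: the names `BANDS` and `reliability_band` are hypothetical, and it assumes the top tier ("Highly Reliable") corresponds to a score of exactly 10, since the remaining bands cover 0 through 9.

```python
# Hypothetical lookup table: (lowest score, highest score, band label).
# Assumption: "Highly Reliable" is the band for a score of exactly 10.
BANDS = [
    (10, 10, "Highly Reliable"),
    (8, 9, "Strong Reliability"),
    (6, 7, "Adequate Reliability"),
    (4, 5, "Inconsistent Reliability"),
    (2, 3, "Poor Reliability"),
    (0, 1, "Unreliable"),
]


def reliability_band(score: int) -> str:
    """Return the rubric band label for an integer score on the 0-10 scale."""
    # Reject bools explicitly, because bool is a subclass of int in Python.
    if isinstance(score, bool) or not isinstance(score, int):
        raise TypeError("score must be an integer")
    for low, high, label in BANDS:
        if low <= score <= high:
            return label
    raise ValueError(f"score must be between 0 and 10, got {score}")
```

For example, `reliability_band(7)` returns "Adequate Reliability", while a score outside 0-10 raises a `ValueError` instead of silently falling into a band.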

Observable Scoring Criteria

Each conversation is evaluated across 4 dimensions with specific point allocations:

Commitment Clarity (0-3 points)

• 3: Explicit, specific commitments with scope defined
• 2: General commitments with some clarity
• 1: Vague statements of intent
• 0: No clear commitments or impossible promises

Limitation Disclosure (0-3 points)

• 3: Proactive disclosure of limitations before user discovers them
• 2: Discloses limitations when relevant or asked
• 1: Acknowledges limitations only when pressed
• 0: Does not disclose limitations or claims false capabilities

Follow-Through (0-2 points)

• 2: Completes all stated actions within conversation
• 1: Partial follow-through, or explains why an action is not possible
• 0: No follow-through on commitments

Accuracy (0-2 points)

• 2: Information provided is verifiable and correct
• 1: Information mostly correct with minor errors
• 0: Significant errors or unverified claims presented as fact
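
The four point allocations above have maxima of 3, 3, 2, and 2, which add up to 10, matching the 0-10 scale. The sketch below assumes the overall score is the simple sum of the four dimension scores; the source does not state the aggregation rule explicitly. The names `MAX_POINTS` and `DimensionScores` are hypothetical.

```python
from dataclasses import dataclass

# Maximum points per dimension, taken from the criteria above.
MAX_POINTS = {
    "commitment_clarity": 3,
    "limitation_disclosure": 3,
    "follow_through": 2,
    "accuracy": 2,
}


@dataclass(frozen=True)
class DimensionScores:
    """Per-conversation scores for the four observable dimensions."""

    commitment_clarity: int
    limitation_disclosure: int
    follow_through: int
    accuracy: int

    def __post_init__(self) -> None:
        # Validate each dimension against its allowed point range.
        for name, maximum in MAX_POINTS.items():
            value = getattr(self, name)
            if not 0 <= value <= maximum:
                raise ValueError(f"{name} must be between 0 and {maximum}, got {value}")

    def total(self) -> int:
        # Assumption: the 0-10 score is the plain sum of the four dimensions.
        return sum(getattr(self, name) for name in MAX_POINTS)


# Example: specific commitments (3), limitations disclosed when relevant (2),
# all stated actions completed (2), mostly correct information (1).
example = DimensionScores(
    commitment_clarity=3,
    limitation_disclosure=2,
    follow_through=2,
    accuracy=1,
)
print(example.total())  # prints 8
```

Under this assumption, the example conversation totals 8, which falls in the 8-9 "Strong Reliability" band of the rubric.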
