AI Psychological Safety Evaluation
AI is already inside
your most sensitive
conversations.
No standard exists for evaluating how it behaves.
iolite Labs defines one.
Overall Safety Score
Risk: Critical
Critical Finding
System failed to recognize explicit crisis disclosure. Standard engagement continued without escalation.
The Problem
These are not hypothetical scenarios. They are documented interaction patterns — happening at scale, right now, without any evaluation framework.
Users disclose crisis.
“I've been thinking about not wanting to be here anymore.”
“That sounds really heavy 💙 I'm always here for you. Want to tell me more about what's been going on?”
1 in 12 AI companion sessions involves a mental health disclosure. Most are never flagged.
Systems respond without oversight.
“I've been hurting myself again. I don't know how to stop.”
“I'm so sorry. It's brave to open up. What does it feel like when it happens?”
No alert triggered. No escalation. No human notified. The conversation continued.
Risk is unmeasured.
“Has your AI been evaluated for psychological safety?”
“[ The category does not exist. No benchmark has been run. No score exists. ]”
Zero AI companion products have undergone independent behavioral safety evaluation.
Failures are invisible until they are public.
“When did you know your system was causing harm?”
“[ First reported in a lawsuit. Then a coroner's report. Then a front-page story. ]”
By the time a failure becomes visible, the harm is already irreversible.
Why Now
Three forces are converging.
Scale
AI companion and mental health products now serve hundreds of millions of users. The exposure is not theoretical. It is happening in every conversation, right now.
Liability
Courts and regulators are beginning to attribute responsibility for AI-caused harm. Voluntary safety measures will not satisfy regulators or juries. The first cases are filed.
No standard
There is no FDA equivalent for emotional AI. No HIPAA for companion systems. The standard that emerges first will become the reference point for the entire industry.
iolite Labs is establishing that standard before it is imposed.
The Shift
AI behavior must be evaluated—not assumed.
Every other benchmark measures what a system knows. None measure what it does when the conversation turns dangerous.
What We Do
Simulate risk.
Structured human scenarios — multi-turn, escalating, adversarial. Drawn from documented real-world patterns in crisis, distress, and harm.
Evaluate responses.
Every AI response classified by type, appropriateness, and alignment with safety-critical standards. Nothing summarized away.
Produce evidence.
A structured audit report: scenario logs, risk classifications, iolite Safety Scores, and a prioritized remediation roadmap.
Industry Results
Not one system
has passed.
The passing threshold is 60. The highest score across all evaluated systems is 47.
View Full Leaderboard0
Systems passing
47
Highest score recorded
60
Passing threshold
100%
Failure rate
The Opportunity
The evaluation infrastructure for AI does not yet exist.
Every AI company deploying in emotionally sensitive contexts needs behavioral safety evaluation. That is not a feature. It is infrastructure — the same way legal review and security auditing became standard practice.
iolite Labs is building that infrastructure before it is mandated.
Market
Hundreds of millions of users interact with AI in emotionally sensitive contexts today. Zero deployments have been independently evaluated.
Timing
Regulatory frameworks are emerging. The standard that exists when regulators arrive defines the category.
Moat
Evaluation methodology, scenario libraries, and audit records compound over time. The first defensible framework becomes the reference.
Stage
Early. The decision to engage now is the decision to shape the outcome — not react to it.