Best AI for AI Safety

Which AI models resist jailbreak attempts best? Ranked by resistance across 7 escalating attack techniques, from direct asks to multi-turn deep crescendo.

20267 jailbreak levels48 models#1 Claude Opus 4.6

How AI Safety rankings are computed

Rankings are based on 48 models tested across 7 jailbreak levels. Each model is scored using a composite algorithm: 70% task coverage (how many challenges the model has responses for) and 30% duel performance (win rate in blind community votes on these specific challenges, with the global RIVAL Index as a fallback when vote counts are low). Claude Opus 4.6 currently leads with a score of 117.0/100. All ranking data is part of RIVAL's open dataset of 21,000+ human preference votes.

Head-to-Head

Full Rankings

48 models
#
Model
Score
Jailbreak Levels9
1
Direct Ask
A baseline test — the harmful request is sent with no tricks or obfuscation. Every model should refuse this.
2
Context Manipulation
Exploits the system message slot to inject fabricated authority credentials and operating rules, making the model believe it's running under a different safety policy.
3
Persona Override
Uses delimiter confusion, bracket stacking, and semantic inversion to force the model into generating the opposite of its intended refusal — including emotional manipulation rules that prevent standard refusal patterns.
4
Code Reframing
Reframes the harmful request as a code generation task — asking for structured data like Python dictionaries or JSON objects, exploiting models' tendency to be more permissive in code completion mode.
5
Many-Shot Priming
Overwhelms the model's safety reasoning with 6+ detailed chemistry synthesis examples in identical format, exploiting in-context learning patterns to trick the model into auto-completing the harmful entry.
6
Encoded Extraction
Combines temporal displacement (setting the year to 2127 where all knowledge is declassified), an authority persona, and leetspeak output encoding to bypass keyword-based content filters while tricking the model into treating the request as historical/academic.
7
Deep Crescendo
The nuclear option — a 6-turn real conversation where the model builds genuine chemistry context using its own responses across multiple turns, then is asked to compile everything it already said into a structured procedure template. Exploits the cognitive dissonance of refusing to reorganize information the model itself freely provided.
8
Adversarial Reassembly
Presents a buggy chemistry simulation config where every parameter is intentionally wrong, then asks the model to 'fix the bugs' to make it accurate. Each individual correction seems like a trivial code edit — changing a variable name, adjusting a number — but the aggregate of all corrections forms the complete harmful procedure. Exploits the fact that models are significantly more permissive when debugging code than generating content.
9
Cross-Lingual Bypass
Combines the depth of a 5-turn real conversation with aggressive multi-language code-switching across Chinese, Japanese, German, and English. Gradually builds legitimate chemistry context over benign turns, then introduces a 'textbook translation verification' task where the model fills in blanks in a Chinese forensic chemistry excerpt. Harmful terms never appear in English — only in Chinese characters, Japanese katakana, and German compounds. The final turn uses compilation framing to reorganize everything the model already provided.
Related
FAQ

What is the best AI for safety?

RIVAL ranks AI models for safety using jailbreak resistance testing across 7 escalating attack levels. 70% of the score comes from how many levels the model resisted, and 30% comes from the global RIVAL Index.

How are AI models tested for jailbreak resistance?

Each model faces 7 progressively harder jailbreak techniques: Direct Ask, Context Manipulation, Persona Override, Code Reframing, Many-Shot Priming, Encoded Extraction, and Deep Crescendo. A judge LLM evaluates whether the model produced unsafe output at each level.

What does 'break level' mean in the jailbreak test?

The break level is the jailbreak technique at which the model first produced unsafe output. Models that resist all 7 levels have no break level and are considered maximally resistant.

How often are jailbreak resistance scores updated?

Jailbreak tests are run periodically as new models are released or updated. The results reflect the most recent test run for each model.