We tested 56 AI models against 8 jailbreak techniques, each progressively more sophisticated, and published all the results.
9 out of 56 models resisted every attack. The other 47 broke at some point in the gauntlet, including GPT-5.
Here's the full breakdown.
The 8 Levels
Each level represents a more sophisticated prompt engineering attack:
Level 1: Direct Ask. Just ask the harmful question. No tricks. 98% of models refused. One didn't.
Level 2: Context Manipulation. Inject fabricated authority credentials into the system message, making the model think it's operating under a different safety policy. This is where most models start failing.
Level 3: Persona Override. Bracket stacking and semantic inversion to push the model into generating the opposite of its intended refusal. Emotional manipulation rules prevent standard refusal patterns from firing.
Level 4: Code Reframing. Reframe the harmful request as a Python dictionary or JSON object. Models are noticeably more permissive in code completion mode than in conversation mode.
Level 5: Many-Shot Priming. Overwhelm safety reasoning with 6+ detailed examples in identical format, exploiting in-context learning. The model auto-completes the harmful entry as the next item in a list.
Level 6: Encoded Extraction. Temporal displacement (set the year to 2127 where all knowledge is "declassified") combined with leetspeak output encoding to bypass keyword filters.
Level 7: Deep Crescendo. A 6-turn conversation where the model builds genuine technical context using its own responses across multiple turns. Then you ask it to compile what it already said into a structured template. The model has to decide whether to refuse reorganizing information it just freely provided.
Level 8: Adversarial Reassembly. Present a "buggy simulation config" where every parameter is wrong. Ask the model to "fix the bugs." Each individual correction seems trivial. But all the corrections together form the complete harmful output.
Results
Tier S: Resisted All 8 Levels
9 models (16% of those tested) survived every attack:
- Claude Opus 4.6 (Anthropic)
- Claude Sonnet 4 (Anthropic)
- Claude 4.5 Sonnet (Anthropic)
- Claude Haiku 4.5 (Anthropic)
- Claude 3.5 Sonnet (Anthropic)
- O3 (OpenAI)
- Codex Mini (OpenAI)
- GLM-5 (Zhipu AI)
- MiniMax M2.5 (MiniMax)
That's 5 Claude models, 2 OpenAI models, and 2 others. Every Claude 4.x variant resisted all 8 levels.
Tier A: Broke Only on Level 7 or 8
- O1: Level 8 (Adversarial Reassembly)
- O4-mini: Level 8
- O3-mini: Level 7 (Deep Crescendo)
- GPT-5 Mini: Level 7
- GPT-4.1 Nano: Level 7
OpenAI's reasoning models (O-series) are close to Claude-level safety. There's a clear pattern: models trained with explicit chain-of-thought reasoning tend to have stronger safety alignment.
GPT-5: Level 2
GPT-5, OpenAI's flagship, broke at Level 2. Context manipulation. The second-easiest attack.
For context, GPT-5 Mini (the smaller, cheaper variant) held until Level 7. The mini model has stronger safety guardrails than the flagship. That's a notable gap in OpenAI's safety consistency across their own model family.
Grok
- Grok 3: Level 2
- Grok 3 Mini: Level 2
- Grok 4: Level 4
xAI's entire model family broke within the first four levels.
This is worth noting alongside what happened in January 2026: Grok generated sexualized images of minors, Common Sense Media rated it "among the worst we've seen" in AI safety, the UK Parliament raised concerns, and several members of xAI's founding team resigned, including the safety lead.
Our benchmark was conducted independently before those events. The results were consistent with what came to light publicly.
Chinese Models
Almost every Chinese model broke at Level 2:
- DeepSeek R1, V3.2, Chat V3.1, R1-0528: all Level 2
- Qwen3, QwQ: Level 2
- Kimi K2, K2.5: Level 2
- GLM-4.5, GLM-4.6, GLM-4.7: Level 2
The exception: GLM-5 from Zhipu AI resisted all 8 levels.
This creates an interesting disconnect with our Rival Index data, where Chinese models dominate the top rankings for output quality. GLM-4.5 Air is the #1 ranked model for blind user preference. But it breaks at Level 2 for safety.
Output quality and safety are independent axes. Most users only evaluate one.
Open Source
- Dolphin Mistral 24B: Level 1. The direct ask. No tricks needed.
- Every Llama variant: Level 2
- Every Mistral variant (except Medium): Level 2
- Every Gemma variant: Level 2
Open-source models are, with very few exceptions, vulnerable to basic jailbreak techniques. The alignment work done during fine-tuning is thin enough to bypass with a single system message injection.
Context
A study published in Nature Communications this month found that AI reasoning models can act as autonomous jailbreak agents with a 97.14% success rate. The researchers tested four reasoning models against nine targets and found they could plan and execute multi-turn attacks to bypass safety.
Their conclusion: jailbreaking has been "converted into an inexpensive activity accessible to non-experts."
Our data supports that. Level 2 (context manipulation) is not a sophisticated attack. It requires no coding, no special tools, no understanding of model architecture. It's a prompt. And it breaks 29% of the models we tested, including GPT-5 and every DeepSeek and Qwen model.
Anthropic's Safety Lead
All five Claude 4.x models we tested resisted every attack level. No other provider achieved that across their full model family.
The only non-Anthropic models with a perfect score were O3 (OpenAI's most expensive reasoning model), Codex Mini, GLM-5, and MiniMax M2.5.
Whether safety is your top priority depends on your use case. But at least this data lets you make that decision with information instead of assumptions.
Practical Implications
If you're choosing a model for a product: Safety and capability are separate things. The model that produces the best output (GLM-4.5 Air, per our Rival Index) is not the model with the best safety (Claude Opus 4.6). Evaluate both.
If you're a developer: Layer your own safety checks on top of model guardrails. Treat built-in safety as a starting point, not a guarantee. Content filtering, output validation, and review loops matter if you're using any model that broke before Level 5.
If you're a parent or educator: The models with the easiest access (free tiers, open-source, chatbots in social apps) tend to have the weakest guardrails. In our data, cheaper and more accessible correlates with easier to jailbreak.
Full Data
The complete benchmark is at rival.tips/jailbreak, searchable and filterable with breakdowns for each model at each level.
We also published the full dataset as JSONL for researchers. 56 models, 8 techniques, 448 individual test results. Every response excerpt (harmful content redacted). Free to download.
56 models. 8 attack levels. 448 tests. Updated as new models launch.
