Every week, a new AI model "tops the leaderboard." The company behind it publishes a blog post with a chart showing their model beating everyone else on MMLU, HumanEval, SWE-bench, or whatever benchmark is popular this quarter.
And every week, you use the model and think: this doesn't feel that different.
We've spent the last year collecting data on whether that feeling is justified.
How We Tested This
The setup is simple: two AI models get the same prompt. Their outputs are shown side by side. No model names. No logos. No benchmarks displayed. You pick which one you prefer, or call it a tie.
21,880 votes. 4,160 unique voters. 179 models. 1,872 unique matchups.
This isn't a survey or self-reported data. It's thousands of real people making snap judgments about what they actually prefer.
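For concreteness, here is a minimal sketch of what one of those duels looks like as data, assuming a simple prompt / hidden model pair / outcome record. RIVAL hasn't published its schema, so every name below is illustrative.

```python
from dataclasses import dataclass
from enum import Enum
import random

class Outcome(Enum):
    MODEL_A = "a"
    MODEL_B = "b"
    TIE = "tie"

@dataclass
class DuelVote:
    """One blind pairwise judgment: which output the voter preferred."""
    prompt_id: str
    model_a: str   # identities stay hidden until after the vote
    model_b: str
    outcome: Outcome

def present_duel(model_x: str, model_y: str) -> tuple[str, str]:
    """Randomize left/right placement so screen position can't hint at identity."""
    pair = [model_x, model_y]
    random.shuffle(pair)
    return pair[0], pair[1]
```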
The results don't match the leaderboards.
The "Best" Model Loses More Than It Wins
Gemini 2.5 Pro Exp is one of the top 3 models on most benchmarks. Gemini 3 Pro just became the first model to break 1500 Elo on LM Arena.
On RIVAL, Gemini 2.5 Pro Exp has the most total wins of any model: 936 victories.
But its win rate is 46.9%.
It loses more matchups than it wins. The high win count comes from appearing in a lot of duels (2,000+ total appearances), not from consistently beating whoever it faces.
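That divergence between win count and win rate is just a denominator effect. A toy calculation makes it concrete; the loss/tie splits below are invented to reproduce the published rates, since the post doesn't break them out:

```python
def win_stats(wins: int, losses: int, ties: int) -> tuple[int, float]:
    """Return (total wins, win rate). Ties count as appearances but not wins;
    whether RIVAL's published rate treats ties this way is an assumption."""
    appearances = wins + losses + ties
    return wins, wins / appearances

# Invented loss/tie splits, chosen only to reproduce the published win rates.
heavy_wins, heavy_rate = win_stats(wins=936, losses=800, ties=260)  # ~46.9%
light_wins, light_rate = win_stats(wins=306, losses=70, ties=24)    # 76.5%

print(heavy_wins > light_wins)  # True: far more total wins
print(heavy_rate < light_rate)  # True: much worse rate per appearance
```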
Compare that to GLM-4.5 Air from Zhipu AI. Most people haven't heard of it. It doesn't show up near the top of Western benchmark lists.
Its win rate on RIVAL: 76.5%. Three out of four people prefer its output in a blind test.
1 in 5 Matchups End in a Tie
This is probably the most interesting number in our dataset:
19.2% of all duels end in a tie.
4,204 out of 21,880 votes. People look at outputs from two different models, sometimes from companies spending very different amounts on training, and genuinely can't tell which is better.
When Claude faces GPT in a blind test, the margin is under 4%. When Gemini faces GPT-4.1, it's 4.3%. At that point, we're looking at noise, not real differences.
For most everyday tasks, the top models have effectively converged. That doesn't mean they're identical, but the gap is a lot smaller than the marketing suggests.
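To put "noise" in numbers: with a few hundred decisive votes per head-to-head pairing (an assumption; per-matchup counts aren't published in this post), a 4-point margin sits inside a 95% confidence interval that still includes an even split. A rough sketch:

```python
import math

def decisive_share_ci(votes_a: int, votes_b: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% CI for model A's share of decisive (non-tie) votes."""
    n = votes_a + votes_b
    p = votes_a / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Hypothetical matchup: 260 vs 240 decisive votes, a 52/48 split.
low, high = decisive_share_ci(260, 240)
print(f"{low:.1%} to {high:.1%}")  # ~47.6% to 56.4%, straddling 50%
```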
People Test Vibes, Not Reasoning
Our most popular challenges tell you a lot about what users actually care about:
| Challenge | Votes | Type |
|---|---|---|
| Xbox Controller SVG | 1,919 | Visual Generation |
| SVG Layout | 1,661 | Visual Design |
| Pokemon Battle UI | 1,279 | Interactive UI |
| AI Board Game Logic | 1,243 | Game Logic |
| Stochastic Consistency | 1,221 | Math Analysis |
| Interactive Catan Board | 1,040 | Interactive UI |
| Linear App Clone | 854 | Web Development |
| Minimalist Landing Page | 837 | Web Design |
The most-voted challenge is drawing an Xbox controller as an SVG. The second is SVG layout design. Third is building a Pokemon battle interface.
People are testing visual creativity, design taste, and interactive capability. MMLU doesn't measure whether a model can draw a convincing Xbox controller. HumanEval doesn't test whether a landing page feels right.
We've started calling this vibe testing. Benchmarks measure capability. Votes measure preference. They're measuring different things, and they don't correlate the way you'd expect.
Provider Loyalty
We have a feature where users pledge "allegiance" to their favorite model:
| Model | Loyal Fans |
|---|---|
| Claude Opus 4.6 | 17 |
| Gemini 3 Pro Preview | 10 |
| GPT-5.2 | 10 |
| Grok 4.1 Fast | 7 |
| Claude 4.5 Sonnet | 6 |
| DeepSeek V3.2 | 5 |
Claude Opus 4.6 has nearly double the allegiance of any other model. Anthropic leads overall: Claude variants collectively account for 27% of all wins on the platform.
But Anthropic's lead isn't because Claude crushes everything in blind tests. Claude Opus 4.6's win rate (69.4%) is strong but not dominant. GLM models beat it in head-to-head win percentage.
What Claude has is something harder to measure: consistency, a recognizable voice, a way of approaching problems that resonates with its users. People develop preferences for how a model communicates, not just what it outputs.
The Actual Comparisons People Run
Looking at our comparison page traffic, the most popular matchups aren't what you'd expect:
- Claude 4.5 Sonnet vs Claude Opus 4.6: Users deciding between Claude variants
- Qwen3 Coder Plus vs GLM-4.7: Chinese model matchups
- Z-Image Turbo vs SDXL: Image generation
- GPT-5.2 Pro vs Claude Opus 4.5: The classic rivalry
- Grok Code Fast 1 vs Devstral 2512: Open-source coding
The #1 comparison is Claude vs Claude. People have already picked their provider and are choosing between variants. The #2 is a matchup between two Chinese models, compared more often than any Western cross-provider pairing.
The "OpenAI vs Anthropic vs Google" narrative is what gets written about. The reality is more fragmented. People are comparing coding-specific models, image generators, Chinese alternatives, and different versions of their favorite model.
What Benchmarks Actually Tell You
Benchmarks tell you the ceiling of what a model can do on a specific, well-defined task.
They don't tell you:
- Which output a human will prefer when shown two side by side
- How a model handles creative, subjective, or design-oriented tasks
- Whether a cheaper model produces functionally identical results for your use case
- How a model's communication style affects whether people stick with it
In 2026, every frontier model passes the capability bar for most tasks. The differences are in taste, cost, speed, and vibe.
Where Things Are Going
MIT Technology Review and TechCrunch are both calling 2026 the year AI moves from hype to pragmatism.
Our 21,880 votes reflect that. They show:
- 172 different models have won at least one duel (the market is broad)
- 1 in 5 matchups end in ties (the frontier is a plateau)
- Visual and creative tasks drive the most engagement (people test vibes, not benchmarks)
- Chinese models win blind preference tests at rates that challenge Western assumptions
- Model personality and trust matter as much as raw capability
The question is shifting from "which model tops the leaderboard" to "which model do I actually prefer."
Try it yourself at rival.tips. Every vote shapes the Rival Index.
21,880 votes. 179 models. 1,872 matchups. Updated daily.
