Every week, a new AI model "tops the leaderboard." The company behind it publishes a blog post with a chart showing their model beating everyone else on MMLU, HumanEval, SWE-bench, or whatever benchmark is popular this quarter.
And every week, you use the model and think: this doesn't feel that different.
We've spent the last year collecting data on whether that feeling is justified.
How We Tested This
The setup is simple: two AI models get the same prompt. Their outputs are shown side by side. No model names. No logos. No benchmarks displayed. You pick which one you prefer, or call it a tie.
21,880 votes. 4,160 unique voters. 179 models. 1,872 unique matchups.
This isn't a survey or self-reported data. It's thousands of real people making snap judgments about what they actually prefer.
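For concreteness, here is a minimal sketch of what one of those duels looks like as data, assuming a simple prompt / hidden model pair / outcome record. RIVAL hasn't published its schema, so every name below is illustrative.

```python
from dataclasses import dataclass
from enum import Enum
import random

class Outcome(Enum):
    MODEL_A = "a"
    MODEL_B = "b"
    TIE = "tie"

@dataclass
class DuelVote:
    """One blind pairwise judgment: which output the voter preferred."""
    prompt_id: str
    model_a: str   # identities stay hidden until after the vote
    model_b: str
    outcome: Outcome

def present_duel(model_x: str, model_y: str) -> tuple[str, str]:
    """Randomize left/right placement so screen position can't hint at identity."""
    pair = [model_x, model_y]
    random.shuffle(pair)
    return pair[0], pair[1]
```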
The results don't match the leaderboards.
The "Best" Model Loses More Than It Wins
Gemini 2.5 Pro Exp is one of the top 3 models on most benchmarks. Gemini 3 Pro just became the first model to break 1500 Elo on LM Arena.
On RIVAL, Gemini 2.5 Pro Exp has the most total wins of any model: 936 victories.
But its win rate is 46.9%.
It loses more matchups than it wins. The high win count comes from appearing in a lot of duels (2,000+ total appearances), not from consistently beating whoever it faces.
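That divergence between win count and win rate is just a denominator effect. A toy calculation makes it concrete; the loss/tie splits below are invented to reproduce the published rates, since the post doesn't break them out:

```python
def win_stats(wins: int, losses: int, ties: int) -> tuple[int, float]:
    """Return (total wins, win rate). Ties count as appearances but not wins;
    whether RIVAL's published rate treats ties this way is an assumption."""
    appearances = wins + losses + ties
    return wins, wins / appearances

# Invented loss/tie splits, chosen only to reproduce the published win rates.
heavy_wins, heavy_rate = win_stats(wins=936, losses=800, ties=260)  # ~46.9%
light_wins, light_rate = win_stats(wins=306, losses=70, ties=24)    # 76.5%

print(heavy_wins > light_wins)  # True: far more total wins
print(heavy_rate < light_rate)  # True: much worse rate per appearance
```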
Compare that to GLM-4.5 Air from Zhipu AI. Most people haven't heard of it. It doesn't show up near the top of Western benchmark lists.
Its win rate on RIVAL: 76.5%. Three out of four people prefer its output in a blind test.
1 in 5 Matchups End in a Tie
This is probably the most interesting number in our dataset:
19.2% of all duels end in a tie.
4,204 out of 21,880 votes. People look at outputs from two different models, sometimes from companies spending very different amounts on training, and genuinely can't tell which is better.
When Claude faces GPT in a blind test, the margin is under 4%. When Gemini faces GPT-4.1, it's 4.3%. At that point, we're looking at noise, not real differences.
For most everyday tasks, the top models have effectively converged. That doesn't mean they're identical, but the gap is a lot smaller than the marketing suggests.
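To put "noise" in numbers: with a few hundred decisive votes per head-to-head pairing (an assumption; per-matchup counts aren't published in this post), a 4-point margin sits inside a 95% confidence interval that still includes an even split. A rough sketch:

```python
import math

def decisive_share_ci(votes_a: int, votes_b: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% CI for model A's share of decisive (non-tie) votes."""
    n = votes_a + votes_b
    p = votes_a / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Hypothetical matchup: 260 vs 240 decisive votes, a 52/48 split.
low, high = decisive_share_ci(260, 240)
print(f"{low:.1%} to {high:.1%}")  # ~47.6% to 56.4%, straddling 50%
```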
People Test Vibes, Not Reasoning
Our most popular challenges tell you a lot about what users actually care about:
| Challenge | Votes | Type |
|---|---|---|
| Xbox Controller SVG | 1,919 | Visual Generation |
| SVG Layout | 1,661 | Visual Design |
| Pokemon Battle UI | 1,279 | Interactive UI |
| AI Board Game Logic | 1,243 | Game Logic |
| Stochastic Consistency | 1,221 | Math Analysis |
| Interactive Catan Board | 1,040 | Interactive UI |
| Linear App Clone | 854 | Web Development |
| Minimalist Landing Page | 837 | Web Design |
The most-voted challenge is drawing an Xbox controller as an SVG. The second is SVG layout design. Third is building a Pokemon battle interface.
People are testing visual creativity, design taste, and interactive capability. MMLU doesn't measure whether a model can draw a convincing Xbox controller. HumanEval doesn't test whether a landing page feels right.
We've started calling this vibe testing. Benchmarks measure capability. Votes measure preference. They're measuring different things, and they don't correlate the way you'd expect.
Provider Loyalty
We have a feature where users pledge "allegiance" to their favorite model:
| Model | Loyal Fans |
|---|---|
| Claude Opus 4.6 | 17 |
| Gemini 3 Pro Preview | 10 |
| GPT-5.2 | 10 |
| Grok 4.1 Fast | 7 |
| Claude 4.5 Sonnet | 6 |
| DeepSeek V3.2 | 5 |
Claude Opus 4.6 has nearly double the allegiance of any other model. Anthropic leads overall: Claude variants collectively account for 27% of all wins on the platform.
But Anthropic's lead isn't because Claude crushes everything in blind tests. Claude Opus 4.6's win rate (69.4%) is strong but not dominant. GLM models beat it in head-to-head win percentage.
What Claude has is something harder to measure: consistency, a recognizable voice, a way of approaching problems that resonates with its users. People develop preferences for how a model communicates, not just what it outputs.
The Actual Comparisons People Run
Looking at our comparison page traffic, the most popular matchups aren't what you'd expect:
- Claude 4.5 Sonnet vs Claude Opus 4.6: Users deciding between Claude variants
- Qwen3 Coder Plus vs GLM-4.7: Chinese model matchups
- Z-Image Turbo vs SDXL: Image generation
- GPT-5.2 Pro vs Claude Opus 4.5: The classic rivalry
- Grok Code Fast 1 vs Devstral 2512: Open-source coding
The #1 comparison is Claude vs Claude. People have already picked their provider and are choosing between variants. The #2 is a matchup between two Chinese models, compared more often than any Western cross-provider pairing.
The "OpenAI vs Anthropic vs Google" narrative is what gets written about. The reality is more fragmented. People are comparing coding-specific models, image generators, Chinese alternatives, and different versions of their favorite model.
What Benchmarks Actually Tell You
Benchmarks tell you the ceiling of what a model can do on a specific, well-defined task.
They don't tell you:
- Which output a human will prefer when shown two side by side
- How a model handles creative, subjective, or design-oriented tasks
- Whether a cheaper model produces functionally identical results for your use case
- How a model's communication style affects whether people stick with it
In 2026, every frontier model passes the capability bar for most tasks. The differences are in taste, cost, speed, and vibe.
Where Things Are Going
MIT Technology Review and TechCrunch are both calling 2026 the year AI moves from hype to pragmatism.
Our 21,880 votes reflect that. They show:
- 172 different models have won at least one duel (the market is broad)
- 1 in 5 matchups end in ties (the frontier is a plateau)
- Visual and creative tasks drive the most engagement (people test vibes, not benchmarks)
- Chinese models win blind preference tests at rates that challenge Western assumptions
- Model personality and trust matter as much as raw capability
The question is shifting from "which model tops the leaderboard" to "which model do I actually prefer."
Try it yourself at rival.tips. Every vote shapes the Rival Index.
21,880 votes. 179 models. 1,872 matchups. Updated daily.
