Research · 7 min read

Benchmarks Don't Match What People Actually Prefer

We collected 21,880 blind votes across 179 AI models. The results don't line up with the leaderboards. Here's what we found.

Every week, a new AI model "tops the leaderboard." The company behind it publishes a blog post with a chart showing their model beating everyone else on MMLU, HumanEval, SWE-bench, or whatever benchmark is popular this quarter.

And every week, you use the model and think: this doesn't feel that different.

We've spent the last year collecting data on whether that feeling is justified.


How We Tested This

The setup is simple: two AI models get the same prompt. Their outputs are shown side by side. No model names. No logos. No benchmarks displayed. You pick which one you prefer, or call it a tie.

21,880 votes. 4,160 unique voters. 179 models. 1,872 unique matchups.

This isn't a survey or self-reported data. It's thousands of real people making snap judgments about what they actually prefer.
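
If you want to see how the headline numbers fall out of the raw votes, here is a minimal sketch. The (model_a, model_b, winner) record format is a made-up simplification, not our actual schema; win rate is counted as wins over total appearances, which lines up with the figures quoted below (936 wins across 2,000+ appearances is roughly 47%).

```python
from collections import defaultdict

# Minimal sketch of tallying blind duels. The record format below is a
# hypothetical simplification, not RIVAL's real schema.
votes = [
    ("model-a", "model-b", "model-a"),
    ("model-a", "model-c", "tie"),
    ("model-b", "model-c", "model-c"),
]

appearances = defaultdict(int)
wins = defaultdict(int)
ties = 0

for model_a, model_b, winner in votes:
    appearances[model_a] += 1
    appearances[model_b] += 1
    if winner == "tie":
        ties += 1
    else:
        wins[winner] += 1

# Win rate = wins / total appearances, so ties count against both models.
for model in sorted(appearances):
    rate = wins[model] / appearances[model]
    print(f"{model}: {wins[model]}/{appearances[model]} duels won ({rate:.1%})")

print(f"tie rate: {ties / len(votes):.1%}")
```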

The results don't match the leaderboards.


The "Best" Model Loses More Than It Wins

Gemini 2.5 Pro Exp is one of the top 3 models on most benchmarks. Gemini 3 Pro just became the first model to break 1500 Elo on LM Arena.

On RIVAL, Gemini 2.5 Pro Exp has the most total wins of any model: 936 victories.

But its win rate is 46.9%.

It loses more matchups than it wins. The high win count comes from appearing in a lot of duels (2,000+ total appearances), not from consistently beating whoever it faces.
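
To make the arithmetic concrete, the sketch below contrasts a high-volume, sub-50% model with a low-volume, high-win-rate one. Only Gemini's 936 wins, its 2,000+ appearances, and GLM-4.5 Air's 76.5% win rate come from our data; the GLM appearance count is a placeholder chosen for illustration.

```python
# Win count vs. win rate. Gemini's figures are the ones quoted above;
# GLM-4.5 Air's appearance count is a placeholder picked to match its
# quoted 76.5% win rate, not a number from the dataset.
records = {
    "Gemini 2.5 Pro Exp": {"wins": 936, "appearances": 2000},
    "GLM-4.5 Air":        {"wins": 306, "appearances": 400},
}

for name, r in records.items():
    print(f"{name}: {r['wins']} wins, {r['wins'] / r['appearances']:.1%} win rate")
```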

Compare that to GLM-4.5 Air from Zhipu AI. Most people haven't heard of it. It doesn't show up near the top of Western benchmark lists.

Its win rate on RIVAL: 76.5%. Three out of four people prefer its output in a blind test.


1 in 5 Matchups End in a Tie

This is probably the most interesting number in our dataset:

19.2% of all duels end in a tie.

4,204 out of 21,880 votes. People look at outputs from two different models, sometimes from companies spending very different amounts on training, and genuinely can't tell which is better.

When Claude faces GPT in a blind test, the margin is under 4%. When Gemini faces GPT-4.1, it's 4.3%. At that point, we're looking at noise, not real differences.
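
A quick sanity check on why a sub-4-point margin reads as noise: with a few hundred votes per matchup, the sampling error on a head-to-head win share is already in that range. The per-matchup vote count below is an assumption for illustration, not a figure from our data.

```python
import math

# Back-of-envelope: how big is sampling noise on a head-to-head win share?
n_votes = 500   # assumed votes for a single head-to-head pairing
p = 0.52        # observed win share, i.e. a 4-point margin

stderr = math.sqrt(p * (1 - p) / n_votes)
margin_95 = 1.96 * stderr

print(f"win share: {p:.0%} ± {margin_95:.1%} (95% CI)")
# With ~500 votes the 95% interval is about ±4.4 points, so a margin
# under 4 points is indistinguishable from a coin flip.
```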

For most everyday tasks, the top models have effectively converged. That doesn't mean they're identical, but the gap is a lot smaller than the marketing suggests.


People Test Vibes, Not Reasoning

Our most popular challenges tell you a lot about what users actually care about:

Challenge               | Votes | Type
Xbox Controller SVG     | 1,919 | Visual Generation
SVG Layout              | 1,661 | Visual Design
Pokemon Battle UI       | 1,279 | Interactive UI
AI Board Game Logic     | 1,243 | Game Logic
Stochastic Consistency  | 1,221 | Math Analysis
Interactive Catan Board | 1,040 | Interactive UI
Linear App Clone        |   854 | Web Development
Minimalist Landing Page |   837 | Web Design

The most-voted challenge is drawing an Xbox controller as an SVG. The second is SVG layout design. Third is building a Pokemon battle interface.

People are testing visual creativity, design taste, and interactive capability. MMLU doesn't measure whether a model can draw a convincing Xbox controller. HumanEval doesn't test whether a landing page feels right.

We've started calling this vibe testing. Benchmarks measure capability. Votes measure preference. They're measuring different things, and they don't correlate the way you'd expect.
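
If you want to test that claim against your own numbers, the check is simple: rank models by benchmark score and by blind-vote win rate, then compute a rank correlation. The values below are placeholders to show the shape of the comparison, not RIVAL or benchmark data, and scipy is just one convenient way to get Spearman's rho.

```python
# Sketch: how well do benchmark scores track blind-vote win rates?
# All numbers here are placeholders for illustration, not real data.
from scipy.stats import spearmanr

benchmark_score = {"model-a": 88.1, "model-b": 86.4, "model-c": 79.0, "model-d": 71.2}
vote_win_rate   = {"model-a": 0.47, "model-b": 0.52, "model-c": 0.77, "model-d": 0.55}

models = list(benchmark_score)
rho, p_value = spearmanr(
    [benchmark_score[m] for m in models],
    [vote_win_rate[m] for m in models],
)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```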


Provider Loyalty

We have a feature where users pledge "allegiance" to their favorite model:

Model                | Loyal Fans
Claude Opus 4.6      | 17
Gemini 3 Pro Preview | 10
GPT-5.2              | 10
Grok 4.1 Fast        | 7
Claude 4.5 Sonnet    | 6
DeepSeek V3.2        | 5

Claude Opus 4.6 has nearly double the allegiance of any other model. Anthropic leads overall with 27% of all wins across Claude variants.

But Anthropic's lead isn't because Claude crushes everything in blind tests. Claude Opus 4.6's win rate (69.4%) is strong but not dominant. GLM models beat it in head-to-head win percentage.

What Claude has is something harder to measure: consistency, a recognizable voice, a way of approaching problems that resonates with its users. People develop preferences for how a model communicates, not just what it outputs.


The Actual Comparisons People Run

Looking at our comparison page traffic, the most popular matchups aren't what you'd expect:

  1. Claude 4.5 Sonnet vs Claude Opus 4.6: Users deciding between Claude variants
  2. Qwen3 Coder Plus vs GLM-4.7: Chinese model matchups
  3. Z-Image Turbo vs SDXL: Image generation
  4. GPT-5.2 Pro vs Claude Opus 4.5: The classic rivalry
  5. Grok Code Fast 1 vs Devstral 2512: Open-source coding

The #1 comparison is Claude vs Claude: people have already picked their provider and are choosing between variants. The #2 is a matchup between two Chinese models, and it draws more traffic than any Western cross-provider pairing.

The "OpenAI vs Anthropic vs Google" narrative is what gets written about. The reality is more fragmented. People are comparing coding-specific models, image generators, Chinese alternatives, and different versions of their favorite model.


What Benchmarks Actually Tell You

Benchmarks tell you the ceiling of what a model can do on a specific, well-defined task.

They don't tell you:

  • Which output a human will prefer when shown two side by side
  • How a model handles creative, subjective, or design-oriented tasks
  • Whether a cheaper model produces functionally identical results for your use case
  • How a model's communication style affects whether people stick with it

In 2026, every frontier model passes the capability bar for most tasks. The differences are in taste, cost, speed, and vibe.


Where Things Are Going

MIT Technology Review and TechCrunch are both calling 2026 the year AI moves from hype to pragmatism.

Our 21,880 votes reflect that. They show:

  • 172 different models have won at least one duel (the market is broad)
  • 1 in 5 matchups end in ties (the frontier is a plateau)
  • Visual and creative tasks drive the most engagement (people test vibes, not benchmarks)
  • Chinese models win blind preference tests at rates that challenge Western assumptions
  • Model personality and trust matter as much as raw capability

The question is shifting from "which model tops the leaderboard" to "which model do I actually prefer."


Try it yourself at rival.tips. Every vote shapes the Rival Index.


21,880 votes. 179 models. 1,872 matchups. Updated daily.

RIVAL

Tracking 200+ AI models. 21,880+ community votes. Our methodology