Benchmarks vs Vibes
AI labs publish benchmark scores to prove their model is the best. We collected 21,686 blind preference votes to see if users agree. Mostly, they don't.
- 21,686 blind votes
- 159 models tested
- 1,872 unique matchups
- 19.2% of matchups end in ties
The Disconnect
Every major AI lab publishes benchmark scores when they launch a new model: MMLU, HumanEval, MATH, ARC. The pitch is straightforward: a higher score means a better model.
But when we show people real AI outputs side by side, with no model names and no scores visible, they consistently prefer models that don't top those leaderboards. The #1 model on the RIVAL Index (GLM-4.5 Air, 76.5% win rate) doesn't lead any traditional benchmark.
What Each Approach Measures
Benchmarks
- Factual recall on multiple-choice tests
- Code completion on defined problems
- Math reasoning under controlled conditions
- Reproducible, automated scoring
Good at measuring the ceiling of narrow capabilities. Less useful for predicting which model you'll prefer in practice.
Vibe Testing
- Human preference on real, open-ended tasks
- Design taste, tone, and aesthetics
- Creative and subjective output quality
- Blind comparison, no brand bias
Good at measuring what users actually prefer when they see the output. Harder to automate, since it requires real human votes.
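To make the blind-comparison setup concrete, here is a minimal sketch of how a matchup could be served with model names hidden and left/right placement randomized. This is an illustration of the general idea, not RIVAL's actual code; the function and field names are hypothetical.

```python
# Sketch of a blind matchup: the voter sees two outputs in random order
# with no model names attached; names are only written into the vote
# record after the choice is made. Names and structure are illustrative.
import random

def make_blind_matchup(outputs: dict[str, str]) -> dict:
    """outputs maps model name -> that model's response to the same prompt."""
    model_a, model_b = random.sample(list(outputs), 2)
    # Randomize which output appears on the left so position can't leak identity.
    left, right = random.sample([model_a, model_b], 2)
    return {
        "left": {"model": left, "text": outputs[left]},
        "right": {"model": right, "text": outputs[right]},
    }

def record_vote(matchup: dict, choice: str) -> dict:
    """choice is 'left', 'right', or 'tie' -- whatever the voter clicked."""
    winner = "tie" if choice == "tie" else matchup[choice]["model"]
    return {
        "model_a": matchup["left"]["model"],
        "model_b": matchup["right"]["model"],
        "winner": winner,
    }
```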
What We Found
19.2% of all matchups end in ties. Nearly 1 in 5 times, users can't tell the difference between two models. For most everyday tasks, the top models have effectively converged.
159 different models have won at least one duel. The market isn't a few leaders and everyone else. Hundreds of models are competitive on real-world tasks.
76.5% win rate for GLM-4.5 Air, the #1 model on the RIVAL Index. It doesn't lead any traditional benchmark. Benchmark ranking and user preference are measuring different things.
The most-voted challenge is drawing an Xbox controller as SVG. People test visual creativity, design taste, and interactive capability. Not math proofs.
We're Not the Only Ones Saying This
Andrej Karpathy (formerly of OpenAI and Tesla AI) has repeatedly advocated for “vibe checks” over benchmarks. Stanford HAI's AI Index Report has documented benchmark saturation across frontier models. Artificial Analysis has found that benchmark scores diverge from real-world preference.
RIVAL's contribution is an open dataset: 21,686 votes across 1,872 matchups, covering everything from landing page design to SVG art to code generation. The methodology is transparent and the data is free to download.
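If you want to reproduce the headline numbers yourself, a rough sketch of the analysis might look like the following. The filename rival_votes.csv and the column names model_a, model_b, and winner are assumptions; substitute whatever schema the actual export uses.

```python
# Sketch: compute the overall tie rate and per-model win rates from the
# downloaded vote data. Filename and column names are assumed, not official.
from collections import defaultdict

import pandas as pd

votes = pd.read_csv("rival_votes.csv")  # hypothetical export filename

wins = defaultdict(int)
games = defaultdict(int)
ties = 0

for row in votes.itertuples(index=False):
    games[row.model_a] += 1
    games[row.model_b] += 1
    if row.winner == "tie":
        ties += 1
    else:
        wins[row.winner] += 1

print(f"Tie rate: {ties / len(votes):.1%}")

# Win rate here counts ties as neither a win nor a loss; the exact
# RIVAL Index formula may weight them differently.
win_rates = {m: wins[m] / games[m] for m in games}
for model, rate in sorted(win_rates.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{model}: {rate:.1%}")
```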
See for Yourself
Pick any two models. See their real outputs on the same prompt. Vote on which you prefer. Then check if the benchmarks agree.