Ask GPT to build you a landing page. Then ask Claude. Then ask Gemini.
You'll get three pages that all work. They'll all have headers, buttons, copy, and responsive layouts. If you benchmark them on correctness, they'll score within a few percentage points of each other.
But you'll prefer one. Maybe it's the copy tone. Maybe it's how the layout breathes. Maybe it's a design choice that feels intentional rather than generic.
That preference is hard to quantify, yet it turns out to be one of the biggest factors in deciding which model people stick with.
The Setup
RIVAL started as a simple experiment: show people raw AI outputs side by side and let them decide which is better.
We now have responses from 200+ AI models across 25+ challenge prompts. Everything from "build a minimalist landing page" to "draw an Xbox controller as SVG" to "write the logic for an AI board game."
Every response is pre-generated and displayed in a sandboxed iframe so you see the actual rendered output, not just code. HTML pages render as real web pages. SVGs display as actual graphics. Interactive elements work.
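To make that rendering setup concrete, here's a minimal sketch of the idea: a stored, pre-generated response dropped into a sandboxed iframe via `srcdoc`, so scripts can run (keeping interactive elements working) without touching the parent page. The function and the response shape are illustrative assumptions, not RIVAL's actual code.

```ts
// Hypothetical sketch: render a pre-generated model response inside a sandboxed iframe.
// The `ChallengeResponse` shape and `renderResponse` are assumptions for illustration.
interface ChallengeResponse {
  modelId: string;              // hidden from the voter in blind duels
  contentType: "html" | "svg";
  content: string;              // the raw, pre-generated output
}

function renderResponse(container: HTMLElement, response: ChallengeResponse): void {
  const frame = document.createElement("iframe");

  // Sandbox the output: scripts may run so interactive elements work,
  // but the frame can't navigate the parent page or access its origin.
  frame.setAttribute("sandbox", "allow-scripts");

  // HTML renders as a real page; SVG is wrapped so it displays as a graphic.
  frame.srcdoc =
    response.contentType === "svg"
      ? `<!doctype html><body style="margin:0">${response.content}</body>`
      : response.content;

  container.appendChild(frame);
}
```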
Then we let people vote in blind duels. No model names shown. Just Output A vs Output B.
After 21,880 votes, the clearest pattern isn't which model is "smarter." It's that each model has a recognizable style.
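For a rough picture of how blind-duel votes roll up into the numbers discussed below, here's a small sketch of tallying wins and ties per model. The types and logic are assumptions for illustration; they aren't RIVAL's implementation, and the Rival Index itself isn't defined here.

```ts
// Illustrative sketch only: how blind-duel votes could be tallied into per-model
// win rates. The types and logic are assumptions, not RIVAL's actual scoring.
type DuelResult = "A" | "B" | "tie";

interface DuelVote {
  modelA: string;
  modelB: string;
  result: DuelResult;  // voters only ever see Output A vs Output B
}

function winRates(votes: DuelVote[]): Map<string, number> {
  const wins = new Map<string, number>();
  const duels = new Map<string, number>();

  const bump = (map: Map<string, number>, key: string) =>
    map.set(key, (map.get(key) ?? 0) + 1);

  for (const vote of votes) {
    bump(duels, vote.modelA);
    bump(duels, vote.modelB);
    if (vote.result === "A") bump(wins, vote.modelA);
    if (vote.result === "B") bump(wins, vote.modelB);
    // A tie counts as a duel for both models but a win for neither.
  }

  const rates = new Map<string, number>();
  for (const [model, total] of duels) {
    rates.set(model, (wins.get(model) ?? 0) / total);
  }
  return rates;
}
```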
The Xbox Controller Test
Our most-voted challenge (1,919 votes) asks models to generate an SVG of an Xbox controller.
There's no "correct" answer here. It's a creative task, and it reveals a lot about how different models approach open-ended problems.
Some models produce technically accurate but flat diagrams. Others create stylized, artistic versions. Some obsess over button placement. Others prioritize visual appeal and simplicity.
When users vote, they're responding to design sensibility, something that comes from training data, fine-tuning, and whatever combination of decisions produced that particular model's aesthetic instincts.
This is what we mean by vibe testing. It's the category of tasks where models diverge the most.
How Each Provider Approaches Design
Ask 10 frontier models to build a minimalist landing page. You'll get noticeably different results:
- Claude models tend toward clean, restrained design with thoughtful copy. Subtle gradients. Deliberate whitespace. Headlines that hint rather than shout. It reads like it was made by someone who's spent too much time on design Twitter.
- GPT models produce competent, well-structured pages that feel optimized. The copy is punchy, the CTAs are prominent, the layout follows conversion best practices. GPT builds pages that want to perform.
- Gemini models often take bigger visual swings. Bolder color palettes, more experimental layouts. Less constrained by convention, more willing to try something different.
- Grok models have a directness to them. Less polished but more personality. The copy is sharper, sometimes irreverent. It feels like a page built by someone with opinions.
- GLM and Qwen models tend to be technically sophisticated but follow different design conventions, sometimes reflecting East Asian design sensibilities around information density and visual hierarchy.
None of these are objectively "better." But each attracts different preferences, and those preferences are consistent enough to form clear patterns in the data.
What the Numbers Show
After 21,880 votes across 179 models:
Visual and creative challenges get the most engagement:
| Challenge | Votes |
|---|---|
| Xbox Controller SVG | 1,919 |
| SVG Layout | 1,661 |
| Pokemon Battle UI | 1,279 |
| Interactive Catan Board | 1,040 |
| Linear App Clone | 854 |
| Minimalist Landing Page | 837 |
| Framer-Style Animation | 657 |
| Dark Mode Dashboard | 559 |
People come to RIVAL to see which model has better taste, not to test math skills.
Provider profiles:
Anthropic (Claude): 27% of all wins. The most consistent performer. Six Claude variants appear in the top 20 of the Rival Index. More users pledge allegiance to Claude than to any other model.
OpenAI (GPT): 21.5% of all wins. Competitive across the board, not dominant in any single category. GPT-5 Mini (56.8% win rate, 1,084 duels) is the definition of solid all-around performance.
Google (Gemini): 15.9% of all wins. Gemini 3 Flash Preview has an 80.6% win rate (the highest of any model with enough data), but it's only been in 67 duels. When it shows up, it wins big.
Chinese Labs (Zhipu + Qwen + DeepSeek): 18.3% of all wins combined. GLM-4.5 Air holds the #1 Rival Index position at 76.5% win rate.
xAI (Grok): 4% of all wins. Small share, but Grok 4.1 Fast has the 4th-highest allegiance count. The users who like Grok really like Grok.
Why Personality Wins When Capability Converges
19.2% of all duels on RIVAL end in a tie.
In nearly one in five matchups, users look at outputs from two different models and can't tell the difference. For roughly 20% of duels, the capability gap between models is imperceptible.
So what decides which model people use?
Price. DeepSeek V3.2 does what GPT-5 does for 3% of the cost. When outputs are indistinguishable, economics wins.
Speed. Flash and Mini variants keep getting more popular. People increasingly prefer a fast "good enough" over a slow "marginally better."
Personality. The model that communicates in a way that matches how you think. The model whose design choices align with your taste. The model whose approach feels right for your workflow.
This is why Claude Opus 4.6 has the most loyal fans despite not having the highest win rate. This is why Grok has a following that outpaces its market share. This is why people compare Claude vs Claude more than Claude vs GPT. They've already picked the brand and are choosing the variant.
The Convergence Reality
Our data shows 172 different models have won at least one duel. There are hundreds of competent models, and the gap between #1 and #50 in real-world user preference is smaller than you'd think.
The same dynamic played out with smartphones. Apple didn't win by having the best specs. They won by having the strongest product identity.
Claude is the "thoughtful one." GPT is the "reliable one." Grok is the "unfiltered one." These aren't just brand perceptions. They're emergent properties of the training process, and they're the main reason users develop long-term preferences.
Practical Takeaways
If you're choosing a model for your product: Run your actual prompts through multiple models and pick the one whose outputs match your product's tone. The capability gap between top models is small. The personality gap is not.
If you're building with AI: Test beyond the Big 3. GLM, Qwen, and DeepSeek models are winning blind preference tests at rates that make them real contenders. The cost savings alone justify evaluating them.
If you're an AI company: Your model already has a personality. Figure out what it is. The era of competing on "we score 0.3% higher on MMLU" is fading. The era of competing on how your model feels is starting.
Find your model at rival.tips. Compare any two models in a blind test.
200+ models. 25+ challenges. 21,880 community votes. The Rival Index is built on what people actually prefer, not what scores highest on a test.
