Analysis · 6 min read

Chinese AI Models Are Outperforming Western Ones in Blind Votes

Zhipu AI's GLM family holds 4 of the top 7 spots on our community ranking. Here's the data from 21,880 blind votes.

Most of the AI conversation is about OpenAI vs Anthropic vs Google. That's where the attention goes.

But in our blind voting data, a Chinese AI lab that most people haven't heard of holds 4 of the top 7 spots. Their models beat Claude, GPT, and Gemini in user preference when the labels are stripped away.

Here's what 21,880 votes from 4,160 people actually look like.


How We Measure This

At RIVAL, we show people the same prompt answered by two different AI models, side by side. No model names. No logos. Just raw output. You pick which one is better, or call it a tie.

We've been running these blind duels across 179 models for almost a year. The dataset includes 1,872 unique head-to-head matchups. Every vote feeds into what we call the Rival Index, a ranking weighted by win percentage and volume of duels.
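The exact weighting behind the Rival Index is our own methodology, but a minimal sketch of a volume-weighted win-rate ranking looks roughly like the Python below. The shrinkage constant k, the field names, and the sample counts are illustrative assumptions, not the production formula:

```python
from dataclasses import dataclass

@dataclass
class ModelRecord:
    name: str
    wins: int
    losses: int
    ties: int

def volume_weighted_score(m: ModelRecord, k: float = 50.0) -> float:
    """Win rate shrunk toward 50% when a model has few duels, so a tiny
    sample with a lucky streak can't jump to the top of the board.
    (Illustrative only; the real Rival Index weighting is different.)"""
    duels = m.wins + m.losses + m.ties
    if duels == 0:
        return 50.0
    win_rate = m.wins / duels            # ties count as non-wins here
    weight = duels / (duels + k)         # more duels -> trust the raw rate more
    return 100 * (weight * win_rate + (1 - weight) * 0.5)

# Hypothetical records: a high-volume model vs. a tiny-sample newcomer.
board = [
    ModelRecord("Model A", wins=320, losses=110, ties=70),  # 64% over 500 duels
    ModelRecord("Model B", wins=9, losses=1, ties=0),       # 90% over 10 duels
]
for m in sorted(board, key=volume_weighted_score, reverse=True):
    print(f"{m.name}: {volume_weighted_score(m):.2f}")
```

With this toy weighting, the 500-duel model outranks the 10-duel model despite a lower raw win rate, which is the point of blending win percentage with duel volume.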


The Current Top 10

Here's the Rival Index as of February 12, 2026:

| Rank | Model | Provider | Rival Score | Win Rate |
|------|-------|----------|-------------|----------|
| #1 | GLM-4.5 Air | Zhipu AI | 70.92 | 76.5% |
| #2 | Gemini 3 Flash Preview | Google | 68.59 | 80.6% |
| #3 | GLM-4.6 | Zhipu AI | 67.99 | 70.6% |
| #4 | GLM-4.7 | Zhipu AI | 67.70 | 74.4% |
| #5 | Claude Opus 4.6 | Anthropic | 66.84 | 69.4% |
| #6 | Gemini 3 Pro Preview | Google | 66.64 | 66.9% |
| #7 | GLM-4.5 | Zhipu AI | 64.28 | 67.8% |
| #8 | Claude 3.7 Sonnet Thinking | Anthropic | 64.11 | 64.4% |
| #9 | Claude Opus 4 | Anthropic | 64.09 | 63.6% |
| #10 | Nano Banana Pro | - | 63.64 | 70.5% |

Zhipu AI's GLM family holds #1, #3, #4, and #7. Four of the top seven positions belong to a lab headquartered in Beijing.

GLM-4.5 Air wins 76.5% of its blind duels. When someone sees its output next to a competitor's with no branding or context, they pick GLM three out of four times.

Claude Opus 4.6, which many consider the best overall model right now, sits at #5.


Benchmarks Tell a Different Story

Gemini 3 Pro just broke 1500 on LM Arena ELO. Claude Opus 4.5 leads SWE-bench coding at 74.2%. GPT-5.3-Codex set a new SWE-Bench Pro record.

These are real accomplishments. But models are also tuned specifically to perform well on the tests they're measured by; optimizing for benchmarks is baked into the development process itself.

The question is whether benchmark performance translates to what users actually prefer in practice.

Based on our data: not consistently.

Gemini 2.5 Pro Exp leads in raw win count on our platform with 936 total wins. But its win rate is just 46.9%. It appears in a lot of duels, so it accumulates wins through volume. Per-matchup, it loses more than it wins.

GLM-4.5 Air wins at 67.6% with 457 appearances. Fewer duels, much better outcomes.
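The gap between raw win counts and per-duel win rates is easy to check from those two sets of numbers. A quick back-of-the-envelope sketch (the implied duel count for Gemini is inferred from the figures above and ignores ties, so it's approximate, not an exact platform stat):

```python
# Raw win totals reward volume; win rates reward per-matchup quality.
gemini_wins, gemini_win_rate = 936, 0.469
gemini_duels = round(gemini_wins / gemini_win_rate)     # ~1996 duels implied

glm_duels, glm_win_rate = 457, 0.676
glm_wins = round(glm_duels * glm_win_rate)              # ~309 wins implied

print(f"Gemini 2.5 Pro Exp: ~{gemini_duels} duels, {gemini_wins} wins ({gemini_win_rate:.1%})")
print(f"GLM-4.5 Air:        ~{glm_duels} duels, ~{glm_wins} wins ({glm_win_rate:.1%})")
# Gemini appears in roughly four times as many duels, so it piles up wins
# while losing most matchups; GLM wins far more often per appearance.
```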


Provider Breakdown

When we aggregate all 21,880 votes by provider:

| Provider | Total Wins | Share |
|----------|------------|-------|
| Anthropic (Claude) | 4,770 | 27.0% |
| OpenAI (GPT) | 3,801 | 21.5% |
| Google (Gemini) | 2,807 | 15.9% |
| Alibaba (Qwen) | 1,496 | 8.5% |
| Zhipu AI (GLM) | 1,155 | 6.5% |
| xAI (Grok) | 701 | 4.0% |
| DeepSeek | 583 | 3.3% |

Anthropic leads overall with 27%. But group the Chinese labs together:

Zhipu AI + Alibaba/Qwen + DeepSeek = 18.3% of all wins.

That's close to OpenAI's 21.5%. Five of the top nine comparison pages on RIVAL involve Chinese models. Qwen3 Coder Plus vs GLM-4.7 is the #2 most-viewed comparison on the site.


Why This Is Worth Paying Attention To

A few things happening at once:

Cost. DeepSeek V3.2 costs $0.27/$1.10 per million tokens. A task that costs $15 on GPT-5 costs about $0.50 on DeepSeek (a rough arithmetic sketch follows at the end of this section). When outputs are comparable (and our voting data shows they often are), cost becomes the deciding factor for production use.

The tie rate. 19.2% of all duels on RIVAL end in a tie. Nearly 1 in 5. Users can't tell the difference between the two outputs. When models from different continents produce indistinguishable results, the competition shifts to price, speed, and ecosystem.

Multi-model is already here. Perplexity launched Model Council, running Claude, GPT, and Gemini in parallel on the same query because "every AI model has blind spots." No single model wins everything. The future is multi-model, and geography doesn't limit that.
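Here's the rough cost arithmetic behind the point above. The DeepSeek V3.2 prices are the ones quoted earlier; the GPT-5 prices and the token counts are illustrative assumptions chosen to reproduce the $15-vs-$0.50 example, not published figures:

```python
def task_cost(input_tokens: int, output_tokens: int,
              price_in_per_m: float, price_out_per_m: float) -> float:
    """USD cost of one task, given per-million-token input/output prices."""
    return (input_tokens / 1e6) * price_in_per_m + (output_tokens / 1e6) * price_out_per_m

# DeepSeek V3.2 pricing quoted above: $0.27 input / $1.10 output per million tokens.
# GPT-5 pricing and the token volume below are assumptions for illustration only.
tokens_in, tokens_out = 600_000, 300_000                   # a long, multi-step task
deepseek = task_cost(tokens_in, tokens_out, 0.27, 1.10)    # ~= $0.49
gpt5 = task_cost(tokens_in, tokens_out, 10.00, 30.00)      # ~= $15.00 (assumed prices)

print(f"DeepSeek V3.2:   ${deepseek:.2f}")
print(f"GPT-5 (assumed): ${gpt5:.2f}")
```

At comparable output quality, a roughly 30x price gap like this is what pushes production workloads toward the cheaper model.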


The Closest Matchups

These are the duels where our community was most evenly split:

  • Claude Sonnet 4 vs GPT-5 Codex: 3.8% margin (101 votes)
  • GPT-4o Mini vs GPT-5 Nano: 3.9% margin (106 votes)
  • Gemini 2.5 Pro Exp vs GPT-4.1: 4.3% margin (162 votes)
  • Claude Opus 4 vs Gemini 2.5 Pro Exp: 8.9% margin (117 votes)

When Claude meets GPT in a blind test, it's basically a coin flip. Same for Gemini vs GPT. The frontier models are very close to each other.

Add Chinese models that are already winning blind preference votes at high rates, and the idea that American labs have a clear lead gets harder to support with data.


What's Next

DeepSeek V4 is expected soon. MiniMax just launched M2.5, claiming programming capabilities matching Claude Opus 4.6. Qwen3 variants are showing up across comparison charts.

The AI landscape is more global than the media narrative suggests. Our 21,880 votes show a race that's closer than most people realize.


See the full rankings and vote at rival.tips

The Rival Index updates daily. Every vote counts.


RIVAL tracks 200+ AI models across every major provider. 21,880 votes across 1,872 unique matchups.
