What is SubjectiveBench?

A taste benchmark. Every other benchmark measures whether a model is competent. This one measures whether it has taste: craft, originality, and whether it escapes the answer every other model gives. One uncapped score per output and per model, judged by humans.

Why is the scale uncapped, and why does nothing reach 100?

Because taste has headroom and competence does not. 100 is where genuinely original, tasteful work would sit. No model reaches it yet. The best lands well below, and most cluster near the floor making the same default choices. The scale runs past 100 because when a model finally gets there, taste keeps going. A 0 to 100 percentage would pretend there is a ceiling. There is not.

Isn't taste just your opinion?

Yes, and we say so out loud. There is no objective ground truth for taste. So we do two things. A human reads and scores every output, originality first, because the homogeneous default is the thing to punish. And we put every scored output on the site. Read them and disagree. A benchmark you can audit beats a number you have to trust.


SubjectiveBench

Does it have taste?

10 models, one question, one uncapped score judged by humans. Turns out most of them draw the same seagull.

v1 · 10,758 outputs · 226 models · independent, no paid placement

Every scored model on one uncapped scale. The whole field still sits short of 100.

The shape of the field

Competent, identical, and nowhere near 100

Stats: spread · top · median · headroom

The rankings

Every model, by taste

Read the output behind any score. Near-equal scores share a rank.

Capable

| # | Model | n | Craft | Orig | Taste Index |
|---|---|---|---|---|---|
| 1 | Claude Fable 5 | 56 | 69 | 69 | 60 |
| 2 | OpenRouter Fusion · Quality (Jun 2026) | 58 | 59 | 43 | 49 |
| =3 | GPT-5.4 Pro | 18 | 52 | 39 | 44 |
| =3 | OpenRouter Fusion · Budget (Jun 2026) | 49 | 54 | 39 | 44 |
| =3 | Claude Opus 4.7 | 57 | 51 | 38 | 43 |
| =3 | NVIDIA: Nemotron 3 Ultra | 58 | 52 | 37 | 42 |
| =3 | Claude Opus 4.6 | 56 | 52 | 33 | 41 |
| =3 | Z.ai: GLM 5.2 | 58 | 50 | 37 | 41 |

Generic

| # | Model | n | Craft | Orig | Taste Index |
|---|---|---|---|---|---|
| =9 | MiniMax M3 | 57 | 50 | 33 | 39 |
| =9 | DeepSeek V4 Pro | 58 | 51 | 30 | 39 |
| =9 | Polaris Alpha | 35 | 47 | 33 | 38 |
| =9 | Kimi K2.6 | 58 | 47 | 34 | 38 |
| =9 | GLM 5 Turbo | 53 | 49 | 33 | 38 |
| =9 | GPT-5.5 | 58 | 50 | 30 | 38 |
| =9 | Z.ai: GLM 5.1 | 57 | 49 | 31 | 38 |
| =9 | Claude Opus 4.8 | 57 | 48 | 30 | 37 |
| =9 | Qwen: Qwen3.7 Max | 58 | 50 | 29 | 36 |
| =9 | Gemini 3.5 Flash | 58 | 49 | 28 | 36 |
| =9 | Claude Sonnet 4.6 | 52 | 49 | 26 | 36 |
| =9 | Kimi K2.7 Code | 58 | 48 | 28 | 36 |
| =9 | DeepSeek V4 Flash | 58 | 47 | 28 | 35 |
| =9 | Gemini 3.1 Pro Preview | 53 | 47 | 27 | 35 |
| =9 | Hunter Alpha | 38 | 45 | 28 | 34 |
| =9 | Pony Alpha | 47 | 43 | 28 | 34 |
| =9 | GPT-5.4 | 53 | 48 | 25 | 33 |
| =9 | Qwen: Qwen3.6 Max Preview | 53 | 47 | 25 | 33 |
| =9 | Kimi K2.5 | 58 | 41 | 28 | 33 |
| =9 | GPT-5.3-Codex | 53 | 45 | 25 | 33 |
| =9 | Z.ai: GLM 5 | 53 | 42 | 27 | 32 |
| =9 | Gemini 3 Pro Preview | 51 | 43 | 25 | 32 |
| =9 | Qwen: Qwen3.5 Plus 2026-04-20 | 56 | 45 | 24 | 31 |
| =9 | Grok 4.20 Multi-Agent Beta | 53 | 43 | 25 | 31 |
| =9 | Ling 2.6 1T | 58 | 41 | 26 | 31 |
| =9 | Qwen: Qwen3.6 Plus Preview (free) | 58 | 44 | 23 | 31 |
| =9 | Qwen: Qwen3.6 27B | 55 | 44 | 24 | 31 |
| =9 | Kimi K2 | 59 | 39 | 27 | 31 |
| =9 | MiMo-V2.5-Pro | 57 | 44 | 23 | 31 |
| =9 | Horizon Beta | 41 | 44 | 23 | 31 |
| =9 | Claude Opus 4.5 | 58 | 42 | 23 | 31 |
| =9 | xAI: Grok 4.3 | 58 | 42 | 24 | 30 |
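
The page says near-equal scores share a rank but does not state the rule. As a rough sketch of one rule that would behave this way, the function below chains neighbours: a score joins the rank group above it when the gap to the previous score is at most a tolerance. The name `shared_ranks` and the tolerance `tol=1.0` are assumptions for illustration, not the site's published method. With the rounded scores shown in the rankings, this rule does produce the labels 1, 2, =3 and =9.

```python
from collections import Counter
from typing import List, Tuple


def shared_ranks(scores: List[float], tol: float = 1.0) -> List[Tuple[str, float]]:
    """Return (rank label, score) pairs, highest score first.

    Assumed rule: a score stays in the current group while the gap to the
    previous score is at most `tol`. Groups of more than one get an "=" prefix.
    """
    ordered = sorted(scores, reverse=True)
    starts: List[int] = []
    start = 1
    for i, score in enumerate(ordered):
        # A gap larger than tol opens a new group, ranked by its 1-based position.
        if i == 0 or ordered[i - 1] - score > tol:
            start = i + 1
        starts.append(start)
    sizes = Counter(starts)
    return [
        (f"={rank}" if sizes[rank] > 1 else str(rank), score)
        for rank, score in zip(starts, ordered)
    ]


if __name__ == "__main__":
    # First eleven Taste Index values from the rankings.
    for label, score in shared_ranks([60, 49, 44, 44, 43, 42, 41, 41, 39, 39, 38]):
        print(label, score)
```

Because the groups are chained, two scores in the same group can differ by more than the tolerance, which is how a run from 39 down to 30 can sit under a single rank.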

100 is the bar for genuinely original, tasteful work. Nothing reaches it yet. SubjectiveBench v1 · June 2026.

Method

How it's scored

Every output read

Each output is judged against the rest on the same prompt, originality first.

A human decides

A person scores the work and adjusts the scores until the ranking matches their taste.

One number, uncapped

A Taste Index per output and per model. Higher is rarer. Most sit below 100.

Cite this

SubjectiveBench v1 (June 2026). rival.tips. https://www.rival.tips/subjectivebench

Download dataset (JSONL)
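
The dataset's schema is not described on this page, so the following is only a sketch of how a JSONL export could be turned into a per-model Taste Index. The field names `model` and `taste_index`, the sample records, and the use of a plain mean over per-output scores are all assumptions for illustration.

```python
import json
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Hypothetical records: "model" and "taste_index" are assumed field names,
# not the published schema.
SAMPLE_JSONL = """\
{"model": "model-a", "taste_index": 62}
{"model": "model-a", "taste_index": 58}
{"model": "model-b", "taste_index": 31}
"""


def per_model_index(jsonl_text: str) -> Dict[str, dict]:
    """Group per-output scores by model and average them (assumed aggregation)."""
    by_model: Dict[str, List[float]] = defaultdict(list)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)  # JSONL: one JSON object per line
        by_model[record["model"]].append(record["taste_index"])
    return {
        model: {"n": len(scores), "taste_index": mean(scores)}
        for model, scores in by_model.items()
    }


if __name__ == "__main__":
    for model, row in per_model_index(SAMPLE_JSONL).items():
        print(model, row["n"], row["taste_index"])
```

To run this against the real download, check the actual field names in the file first and swap them in.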



