SubjectiveBench
Does it have taste?
Every other benchmark measures whether AI is smart. This one measures whether it has taste. One uncapped score, judged by humans, originality first. Turns out most models draw the same seagull.
SubjectiveBench v1 · 10,758 outputs · 226 models · updated June 2026
Independent. We sell nothing to the labs we rank.
The scale
100 is the reference. Nothing reaches it yet.
- Frontier
- Capable
- Generic
The number is uncapped; near-equal scores share a rank. SubjectiveBench v1 · calibrated June 2026.
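One way "near-equal scores share a rank" could work is sketched below. The page does not say how near is near enough, so the `tolerance` value, the function name `rank_with_ties`, and the choice of competition-style ranking (1, 1, 3) are all assumptions made for illustration, not the site's published method.

```python
def rank_with_ties(scores: dict[str, float], tolerance: float = 0.5) -> dict[str, int]:
    """Rank models from highest to lowest score, sharing a rank when near-equal.

    Hypothetical sketch: `tolerance` is an assumed cutoff, not a published value.
    A model shares the current rank while it is within `tolerance` of the
    first (highest) score in its tie group.
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks: dict[str, int] = {}
    rank = 0
    leader = None  # highest score in the current tie group
    for position, (name, score) in enumerate(ordered, start=1):
        if leader is None or leader - score > tolerance:
            # Start a new tie group; competition ranking skips used positions.
            rank = position
            leader = score
        ranks[name] = rank
    return ranks


# Made-up scores, for illustration only.
example = {"a": 61.2, "b": 61.0, "c": 48.3, "d": 48.3, "e": 12.0}
print(rank_with_ties(example))
```

Measuring each score against the group leader, rather than against its immediate neighbor, keeps a long chain of small gaps from collapsing very different scores into one rank.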
01
AI seeds the score
A model scores every output against the same prompt's reference answer, originality first.
02
A human decides
The machine under-rewards originality, so a person re-checks and tweaks. The human is the point, not a formality.
03
One number, uncapped
You get a Taste Index per output and per model. Higher is rarer. Every model today sits below 100.
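The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the site's actual pipeline: the `ScoredOutput` type, the field names, and the example scores are hypothetical, and the page does not say how per-output scores roll up into a per-model Taste Index, so a plain mean is assumed here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class ScoredOutput:
    model: str
    ai_seed: float                       # step 01: machine-seeded score
    human_score: Optional[float] = None  # step 02: human adjustment, if any

    @property
    def final(self) -> float:
        # The human decision wins whenever one exists.
        return self.ai_seed if self.human_score is None else self.human_score


def taste_index(outputs: list[ScoredOutput], model: str) -> float:
    # Assumption: the per-model index is the mean of that model's final scores.
    return mean(o.final for o in outputs if o.model == model)


def override_rate(outputs: list[ScoredOutput]) -> float:
    # Share of outputs where the human changed the machine's seed score.
    changed = sum(
        1 for o in outputs
        if o.human_score is not None and o.human_score != o.ai_seed
    )
    return changed / len(outputs)


# Made-up data, for illustration only.
outputs = [
    ScoredOutput("model-a", ai_seed=40.0, human_score=52.0),  # human raised it
    ScoredOutput("model-a", ai_seed=30.0),                    # seed left as-is
    ScoredOutput("model-b", ai_seed=20.0, human_score=20.0),  # human confirmed
    ScoredOutput("model-b", ai_seed=25.0, human_score=18.0),  # human lowered it
]

print(taste_index(outputs, "model-a"))
print(taste_index(outputs, "model-b"))
print(override_rate(outputs))
```

Nothing in the sketch clamps scores to a maximum, which matches the uncapped scale: 100 is a reference point, not a ceiling.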
How we keep it honest
- Anchored
- Every score is relative to one frozen reference set to 100.
- Originality first
- We punish the homogeneous default. Polish does not rescue sameness.
- Shown, not asserted
- Every scored output is on the site. Read them and disagree.
- Human-curated
- An AI seeds, a human decides. The override rate is public.
- No paid placement
- No model pays to rank. We run Compare too. That is the only conflict, and now you know it.
Taste has no objective ground truth. SubjectiveBench scores against one curated reference, on a fixed scale, originality first. The honest move is to read the outputs yourself and disagree. Every score links to the real output behind it.
SubjectiveBench v1 (June 2026). rival.tips. https://www.rival.tips/subjectivebench
Questions, answered
- What is SubjectiveBench?
- A taste benchmark. Every other benchmark measures whether a model is competent. This one measures whether it has taste: craft, originality, and whether it escapes the answer every other model gives. One uncapped score per output and per model, judged by humans.
- Why is the scale uncapped, and why does nothing reach 100?
- Because taste has headroom and competence does not. 100 is the reference: the level of genuinely original, tasteful work the scale is anchored to. No model reaches it yet. The best sits well below, and most cluster near the floor making the same default choices. The scale runs past 100 to infinity because when a model finally gets there, taste keeps going. A 0 to 100 percentage would pretend there is a ceiling. There is not.
- What does "originality first" mean?
- A polished, generic answer scores low. The clean purple landing page everyone generates, the seagull drawn from the one angle every model picks, the joke that is technically a joke but not actually new: all of it sits near the floor, no matter how competent. We reward a point of view, not homework.
- Isn't taste just your opinion?
- Yes, and we say so out loud. There is no objective ground truth for taste. So we do two things. An AI pass seeds every score against a fixed reference for the same prompt, then a human re-checks and adjusts, because the machine systematically under-rewards originality. And we put every scored output on the site. Read them and disagree. A benchmark you can audit beats a number you have to trust.
- Can a model game it?
- Eventually, like any benchmark. The defenses: scores are anchored to a fixed reference, the rubric punishes the homogeneous default rather than rewarding polish, the prompt set rotates between versions, and no model can submit its own best run or pay for placement. When a model games one version, the next version is built to expose it.