Skip to content
Transparent Process

Our Methodology

How Rival generates and ranks AI outputs. Controlled, reproducible, no cherry-picking.

No Cherry-PickingOne shot per model, no re-rolls
Real Votes OnlyNo synthetic benchmarks
Open DataDatasets publicly available
01

The Rival Score

Every model gets a Rival Score: 0 to 1000, from real human preference. No LLM-as-judge nonsense. Three components:

Formula

Rival Score =
Win Rate% of duels won
×
ConsistencyAcross challenges
×
CredibilityLow sample penalty
×1000
Win Rate

Share of head-to-head duels users voted for this model. Real humans, real chaos.

Consistency

How uniformly a model performs across challenges. Wins often but faceplants often? Lower score. Reliability isn't sexy, but it's honest.

Credibility

A multiplier that ramps from 0 to 1 as duels increase. Three lucky votes won't claim the throne.

Credibility Factor Curve

00.250.500.751020406080100Number of duelsCredibilityUnreliableGrowingStable
02

How We Generate Comparisons

Every comparison runs the same automated pipeline. No manual intervention, no curation.

  • Same prompt, every model. Each model in a challenge receives the exact same input. No special treatment, no warm-up round.
  • Automated pipeline. Responses are generated via OpenRouter, Replicate,and direct provider APIs via scripts. No human selection bias.
  • Controlled parameters. Pre-generated responses use temperature=0.7 and effectively unlimited output tokens via OpenRouter. GPT-5.1 models run at provider default. Every response ships a reproducibility card, so you can call us out. Full breakdown below.
  • No system prompts for pre-generated content. User prompt only. What the model sees is exactly what you see in the generation details.
  • One shot, no cherry-picking. One generation attempt per model. Never re-rolled. What the API coughed up is what ships.

API Parameter Reference

Every parameter we send to OpenRouter, documented. Unset ones inherit OpenRouter defaults. No hidden knobs.

Pre-generated Showcase Responses

Static responses behind model pages, challenges, and comparisons, generated via the OpenRouter Chat Completions API.

ParameterValue
temperature0.7
max_tokens100,000
top_p1.0
top_k0
frequency_penalty0.0
presence_penalty0.0
repetition_penalty1.0
min_p0.0
top_a0.0
seednone
system_promptnone
messages1 user message

Parameters not listed (logit_bias, logprobs, response_format, tools, stop) are never sent. OpenRouter parameter docs

03

Challenge Design

Challenges are the prompts behind every comparison, built to surface real capability gaps.

  • Capability-specific. Each targets one skill: coding, reasoning, creative writing, analysis, instruction following, visual output. One-trick ponies get exposed fast.
  • Model-agnostic prompts. No provider-specific keywords, no prompt tricks that advantage one model family.
  • High quality ceiling. Open-ended by design, so there's always room for a better answer. A model ten years from now should still beat today's.
04

Ranking Methodology

Rankings on the arena use a weighted score from real user activity, not synthetic benchmarks. No GPT-4 grading other models.

  • Challenge win rates. The primary signal. Each duel win adds to the model's score in that challenge's category.
  • Category coverage. Models rank within each capability category. A stable ranking needs participation across challenges. No ducking the hard ones.
  • Consistency matters. Steady across many challenges beats one standout. One viral moment doesn't make a career.
  • One vote per user per duel. Anonymous and deduplicated. No ballot stuffing.
  • Blind voting available. Blind mode hides names and providers until after the vote. People vote very differently when they can't see the logo.
05

Vote Integrity

Community votes power every ranking. We keep them clean. Trust issues are a feature, not a bug.

  • Deduplication. One vote per model pair per challenge, enforced by a unique DB constraint. The database said no.
  • Rate limiting. Authenticated users: 200 votes/hour. Anonymous: 30/hour, keyed by IP hash. Touch grass between votes.
  • Voter fingerprinting. Anonymous votes tie to a hashed IP. Client IDs are hashed server-side too. We trust you. Just not that much.
  • Minimum sample size. Models need 10 duels before they hit the leaderboard. Credibility further down-weights small vote counts.
  • Server-side normalization. All IDs are validated and normalized on insert. Malformed data gets rejected. We've seen what the internet sends us.
06

Category Arenas

Beyond the overall score, models compete in 15 category arenas, each scoped to one capability. Find the best model for your exact use case.

WebsiteSVGCreative WritingCode GenerationReasoningAnalysisInstruction FollowingData ProcessingMultilingualMathImage GenerationAudioSummarizationConversationResearch

Category rankings use the same Rival Score formula, scoped to that category's challenges. Explore them on the arena

07

7-Day Trend Tracking

Rival Score is snapshotted daily. The leaderboard shows a 7-day trend per model. See who's heating up and who's quietly losing ground.

RisingScore improved over 7 days
FallingScore declined over 7 days
StableScore unchanged
newNewLess than 7 days of data
08

Data Freshness

Models move fast. We track versions so comparisons stay meaningful. Nobody gets credit for last quarter's weights.

  • Point-in-time captures. Responses are recorded when a challenge is created, with the model version stored alongside. Receipts kept.
  • Version tracking. New challenges use the latest version. Old responses are never overwritten. We don't rewrite history.
  • Historical preservation. Older outputs stay, so you can compare months ago versus today. Growth arc or regression arc, you decide.
  • Pricing data. Model pricing on the prices page comes from LiteLLM, updated to current provider rates. Brace your wallet.
09

What We Don't Do

What's out of scope:

  • No automated benchmarks. No MMLU, no HumanEval. Real people judge, not machines grading themselves.
  • No cherry-picking. Outputs ship exactly as the API returns them. If a model fumbles, the fumble ships.
  • No pay-for-ranking. Providers can't pay for higher rankings. Sponsored placements are clearly labeled and sponsorship never moves a duel. Sponsored models are labeled.
  • No synthetic evaluations. Scores come from real votes. We don't let AI grade AI.
10

Model Coverage

Rival tracks 200+ models across OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek, xAI, and more. If it has an API, we're probably already yelling at it.

  • Broad coverage. New models go in as soon as they ship a public API. Full roster on the models page.
  • Coverage varies. New models have fewer challenges. Rankings firm up as votes accumulate. The data will come.
11

Data Transparency

Aggregated, anonymized response and voting data is public. Take it. Build on it. Prove us wrong.

Rival Datasets

Model responses, voting data, and challenge metadata in JSONL. For researchers and developers building on real-world AI preference data.

Explore datasets
12

Reproducibility

Every response ships generation metadata. Hit the terminal icon on any card for the exact API params and a copy-paste command. Trust, but verify.

  • One-click reproduction. Copy a cURL, Python, or raw JSON payload that mirrors the exact call. No detective work.
  • Full parameter transparency. Temperature, system prompt, max tokens, model ID, and provider show on every response. We show our hand.
  • OpenRouter as common layer. All text models are called through OpenRouter, giving one consistent API surface to reproduce against, whatever the original provider.

Questions about our methodology? Reach out on X @rival_tips or open an issue on GitHub. See also our Privacy Policy and Terms of Service.

Frequently asked questions

How is the Rival Score calculated?

A single number from 0 to 1000: win rate times consistency times a credibility factor, scaled by 1000. Win rate is the share of head-to-head duels won. Consistency measures how evenly a model performs across challenges. Credibility ramps from 0 to 1 as duel count grows, so three lucky votes can't top the board.

What parameters does Rival use to generate model responses?

Pre-generated showcase responses use temperature 0.7, top_p 1.0, max_tokens 100,000, and zero frequency, presence, and repetition penalties, sent through the OpenRouter Chat Completions API with no system prompt and a single user message. The GPT-5.x series omits temperature and runs at the provider default. Every response ships with a reproducibility card listing the exact parameters.

Does Rival use an LLM as a judge?

No. Rankings come entirely from real, anonymous human votes in head-to-head duels, deduplicated to one vote per user per duel. Blind mode hides model names until after the vote. No GPT-4 grading other models, no synthetic benchmarks.

Does Rival re-roll outputs to make a model look better?

No. Each model gets one generation attempt per prompt, never re-rolled for a nicer result. Every model in a challenge gets the exact same input, so what you see is the first response the API returned.

Sign in