Our Methodology
How Rival generates and ranks AI outputs. Controlled, reproducible, no cherry-picking.
01The Rival Score
Every model gets a Rival Score: 0 to 1000, from real human preference. No LLM-as-judge nonsense. Three components:
Formula
Share of head-to-head duels users voted for this model. Real humans, real chaos.
How uniformly a model performs across challenges. Wins often but faceplants often? Lower score. Reliability isn't sexy, but it's honest.
A multiplier that ramps from 0 to 1 as duels increase. Three lucky votes won't claim the throne.
Credibility Factor Curve
02How We Generate Comparisons
Every comparison runs the same automated pipeline. No manual intervention, no curation.
- Same prompt, every model. Each model in a challenge receives the exact same input. No special treatment, no warm-up round.
- Automated pipeline. Responses are generated via OpenRouter, Replicate,and direct provider APIs via scripts. No human selection bias.
- Controlled parameters. Pre-generated responses use
temperature=0.7and effectively unlimited output tokens via OpenRouter. GPT-5.1 models run at provider default. Every response ships a reproducibility card, so you can call us out. Full breakdown below. - No system prompts for pre-generated content. User prompt only. What the model sees is exactly what you see in the generation details.
- One shot, no cherry-picking. One generation attempt per model. Never re-rolled. What the API coughed up is what ships.
API Parameter Reference
Every parameter we send to OpenRouter, documented. Unset ones inherit OpenRouter defaults. No hidden knobs.
Static responses behind model pages, challenges, and comparisons, generated via the OpenRouter Chat Completions API.
| Parameter | Value |
|---|---|
| temperature | 0.7 |
| max_tokens | 100,000 |
| top_p | 1.0 |
| top_k | 0 |
| frequency_penalty | 0.0 |
| presence_penalty | 0.0 |
| repetition_penalty | 1.0 |
| min_p | 0.0 |
| top_a | 0.0 |
| seed | none |
| system_prompt | none |
| messages | 1 user message |
Parameters not listed (logit_bias, logprobs, response_format, tools, stop) are never sent. OpenRouter parameter docs
03Challenge Design
Challenges are the prompts behind every comparison, built to surface real capability gaps.
- Capability-specific. Each targets one skill: coding, reasoning, creative writing, analysis, instruction following, visual output. One-trick ponies get exposed fast.
- Model-agnostic prompts. No provider-specific keywords, no prompt tricks that advantage one model family.
- High quality ceiling. Open-ended by design, so there's always room for a better answer. A model ten years from now should still beat today's.
04Ranking Methodology
Rankings on the arena use a weighted score from real user activity, not synthetic benchmarks. No GPT-4 grading other models.
- Challenge win rates. The primary signal. Each duel win adds to the model's score in that challenge's category.
- Category coverage. Models rank within each capability category. A stable ranking needs participation across challenges. No ducking the hard ones.
- Consistency matters. Steady across many challenges beats one standout. One viral moment doesn't make a career.
- One vote per user per duel. Anonymous and deduplicated. No ballot stuffing.
- Blind voting available. Blind mode hides names and providers until after the vote. People vote very differently when they can't see the logo.
05Vote Integrity
Community votes power every ranking. We keep them clean. Trust issues are a feature, not a bug.
- Deduplication. One vote per model pair per challenge, enforced by a unique DB constraint. The database said no.
- Rate limiting. Authenticated users: 200 votes/hour. Anonymous: 30/hour, keyed by IP hash. Touch grass between votes.
- Voter fingerprinting. Anonymous votes tie to a hashed IP. Client IDs are hashed server-side too. We trust you. Just not that much.
- Minimum sample size. Models need 10 duels before they hit the leaderboard. Credibility further down-weights small vote counts.
- Server-side normalization. All IDs are validated and normalized on insert. Malformed data gets rejected. We've seen what the internet sends us.
06Category Arenas
Beyond the overall score, models compete in 15 category arenas, each scoped to one capability. Find the best model for your exact use case.
Category rankings use the same Rival Score formula, scoped to that category's challenges. Explore them on the arena
077-Day Trend Tracking
Rival Score is snapshotted daily. The leaderboard shows a 7-day trend per model. See who's heating up and who's quietly losing ground.
08Data Freshness
Models move fast. We track versions so comparisons stay meaningful. Nobody gets credit for last quarter's weights.
- Point-in-time captures. Responses are recorded when a challenge is created, with the model version stored alongside. Receipts kept.
- Version tracking. New challenges use the latest version. Old responses are never overwritten. We don't rewrite history.
- Historical preservation. Older outputs stay, so you can compare months ago versus today. Growth arc or regression arc, you decide.
- Pricing data. Model pricing on the prices page comes from LiteLLM, updated to current provider rates. Brace your wallet.
09What We Don't Do
What's out of scope:
- No automated benchmarks. No MMLU, no HumanEval. Real people judge, not machines grading themselves.
- No cherry-picking. Outputs ship exactly as the API returns them. If a model fumbles, the fumble ships.
- No pay-for-ranking. Providers can't pay for higher rankings. Sponsored placements are clearly labeled and sponsorship never moves a duel. Sponsored models are labeled.
- No synthetic evaluations. Scores come from real votes. We don't let AI grade AI.
10Model Coverage
Rival tracks 200+ models across OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek, xAI, and more. If it has an API, we're probably already yelling at it.
- Broad coverage. New models go in as soon as they ship a public API. Full roster on the models page.
- Coverage varies. New models have fewer challenges. Rankings firm up as votes accumulate. The data will come.
11Data Transparency
Aggregated, anonymized response and voting data is public. Take it. Build on it. Prove us wrong.
Rival Datasets
Model responses, voting data, and challenge metadata in JSONL. For researchers and developers building on real-world AI preference data.
12Reproducibility
Every response ships generation metadata. Hit the terminal icon on any card for the exact API params and a copy-paste command. Trust, but verify.
- One-click reproduction. Copy a cURL, Python, or raw JSON payload that mirrors the exact call. No detective work.
- Full parameter transparency. Temperature, system prompt, max tokens, model ID, and provider show on every response. We show our hand.
- OpenRouter as common layer. All text models are called through OpenRouter, giving one consistent API surface to reproduce against, whatever the original provider.
Questions about our methodology? Reach out on X @rival_tips or open an issue on GitHub. See also our Privacy Policy and Terms of Service.
Frequently asked questions
How is the Rival Score calculated?
A single number from 0 to 1000: win rate times consistency times a credibility factor, scaled by 1000. Win rate is the share of head-to-head duels won. Consistency measures how evenly a model performs across challenges. Credibility ramps from 0 to 1 as duel count grows, so three lucky votes can't top the board.
What parameters does Rival use to generate model responses?
Pre-generated showcase responses use temperature 0.7, top_p 1.0, max_tokens 100,000, and zero frequency, presence, and repetition penalties, sent through the OpenRouter Chat Completions API with no system prompt and a single user message. The GPT-5.x series omits temperature and runs at the provider default. Every response ships with a reproducibility card listing the exact parameters.
Does Rival use an LLM as a judge?
No. Rankings come entirely from real, anonymous human votes in head-to-head duels, deduplicated to one vote per user per duel. Blind mode hides model names until after the vote. No GPT-4 grading other models, no synthetic benchmarks.
Does Rival re-roll outputs to make a model look better?
No. Each model gets one generation attempt per prompt, never re-rolled for a nicer result. Every model in a challenge gets the exact same input, so what you see is the first response the API returned.