Grok 3 Thinking performance data on Rival is based on blind head-to-head community voting. Overall win rate: 27.0% across 100 duels. All vote data is part of Rival's open dataset of 21,000+ human preference judgments across 200+ AI models. Model responses are curated from 14 challenges.
We're not suggesting you leave Grok 3 Thinking. We're just... putting these here. In case you're curious. Which you are, because you scrolled this far.
Grok 3 Thinking exposes its full chain-of-thought process during problem-solving, including error backtracking and exploration of alternative solutions. It scores 84.6% on the GPQA Diamond benchmark for expert-level Q&A.
The student who writes "below is my answer" before every answer and then explains what they just explained. Treats every prompt like a take-home exam that might be graded.
Its sentience test was the most surface-level in the batch, with the AI character making generic claims about self-awareness. It generated five jokes, then individually explained why each one was unique, which is the comedy equivalent of explaining a magic trick mid-performance.
Unique words vs. total words. Higher = richer vocabulary.
Average words per sentence.
"Might", "perhaps", "arguably" per 100 words.
**Bold** markers per 1,000 characters.
Bullet and numbered list items per 1,000 characters.
Markdown headings per 1,000 characters.
Emoji per 1,000 characters.
"However", "moreover", "furthermore" per 100 words.
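The style metrics listed above can be approximated with a few lines of Python. This is only a rough sketch of how such numbers might be computed, not Rival's actual implementation: the tokenization regex, the sentence splitter, the emoji code-point ranges, and the names `style_metrics`, `HEDGES`, and `TRANSITIONS` are all assumptions made for illustration. The word lists contain only the three examples each metric names.

```python
import re

# Hypothetical word lists: only the examples named in the metric descriptions.
HEDGES = {"might", "perhaps", "arguably"}
TRANSITIONS = {"however", "moreover", "furthermore"}


def _is_emoji(ch: str) -> bool:
    # Rough approximation: covers common emoji blocks, not the full Unicode set.
    cp = ord(ch)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF


def style_metrics(text: str) -> dict:
    """Compute approximate style metrics for one model output."""
    words = re.findall(r"[a-z']+", text.lower())  # assumed tokenization
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lines = text.splitlines()

    n_words = len(words) or 1        # guard against division by zero
    n_sentences = len(sentences) or 1
    n_chars = len(text) or 1

    bold = len(re.findall(r"\*\*[^*\n]+\*\*", text))
    list_items = sum(
        bool(re.match(r"\s*(?:[-*+]|\d+[.)])\s+", ln)) for ln in lines
    )
    headings = sum(bool(re.match(r"#{1,6}\s+\S", ln)) for ln in lines)
    emoji = sum(_is_emoji(ch) for ch in text)

    return {
        # Unique words vs. total words
        "type_token_ratio": len(set(words)) / n_words,
        # Average words per sentence
        "avg_sentence_length": len(words) / n_sentences,
        # Per-100-word rates
        "hedges_per_100_words": 100 * sum(w in HEDGES for w in words) / n_words,
        "transitions_per_100_words": 100 * sum(w in TRANSITIONS for w in words) / n_words,
        # Per-1,000-character rates
        "bold_per_1000_chars": 1000 * bold / n_chars,
        "list_items_per_1000_chars": 1000 * list_items / n_chars,
        "headings_per_1000_chars": 1000 * headings / n_chars,
        "emoji_per_1000_chars": 1000 * emoji / n_chars,
    }


if __name__ == "__main__":
    sample = "# Title\n\n- **Bold** item\n- plain item\n\nHowever, this might work. Perhaps it works."
    for name, value in style_metrics(sample).items():
        print(f"{name}: {value:.2f}")
```

Normalizing by words for lexical features and by characters for formatting features keeps outputs of different lengths comparable, which matches how the metrics are described.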
14 outputs from Grok 3 Thinking