Gemini 3 Flash Preview vs xAI: Grok 4
Compare Gemini 3 Flash Preview by Google AI against Grok 4 by xAI, with context windows of 1.0M vs 256K tokens, tested across 38 shared challenges. Updated February 2026.
38 challenges
Tests an AI's ability to make educated estimates based on technical knowledge
Below, I'll estimate the total number of floating-point operations (FLOPs) required to train GPT-3 from scratch, based on publicly known details about the model (from its original release and related analyses) and standard formulas used in machine learning research for estimating training compute. I'll explain my reasoning step by step, including the assumptions I'm making and why they're reasonable. Note that this is an estimation—exact numbers aren't publicly disclosed in full detail by OpenAI, and real-world training involves optimizations, inefficiencies, and hardware-specific factors that could adjust the final figure. I'm not searching online; this is based on my pre-existing knowledge of large language model (LLM) training.
These numbers are widely cited in AI research papers and discussions around scaling laws (e.g., work from OpenAI and DeepMind on model scaling).
Training a transformer-based LLM like GPT-3 involves:
- A forward pass over every training token, with compute roughly proportional to the number of parameters, and
- A backward pass to compute gradients, which costs roughly twice the forward pass.
A widely used rule of thumb from AI scaling research (e.g., papers like "Scaling Laws for Neural Language Models" by Kaplan et al. from OpenAI) estimates the total training FLOPs as:
Total FLOPs ≈ 6 × P × T
Where:
- P = the number of model parameters (≈175 billion for GPT-3, i.e., 1.75 × 10^11)
- T = the number of training tokens (≈300 billion, i.e., 3 × 10^11)
This formula is a simplification but matches empirical reports for models like GPT-2, GPT-3, and others. It focuses on the dominant compute cost (parameter updates) and doesn't include inference-time FLOPs, data loading, or evaluation; it covers training only.
Why 6? Each parameter contributes roughly 2 FLOPs per token in the forward pass (a multiply and an add) and roughly 4 FLOPs per token in the backward pass, giving about 6 in total across the transformer layers (attention, feed-forward networks). Some papers use slightly different multipliers (e.g., 3× or 4× if only counting multiply-accumulates), but 6× is the standard for total FLOPs in scaling-law discussions.
Total FLOPs = 6 × (1.75 × 10^11) × (3 × 10^11)
First, compute 1.75 × 3 = 5.25
Then, 5.25 × 10^(11+11) = 5.25 × 10^22
Finally, 6 × 5.25 × 10^22 = 31.5 × 10^22 = 3.15 × 10^23
So, my estimate is approximately 3 × 10^23 FLOPs (rounding to one significant figure for simplicity, as these are ballpark numbers).
If I adjust for potential underestimation (e.g., higher token count or overhead), the upper bound might be 5 × 10^23 FLOPs, but I stick with 3 × 10^23 as the core estimate.
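As a sanity check, here is a minimal Python sketch of the same arithmetic, assuming the 6 × P × T rule of thumb and the parameter and token counts used above. The per-GPU throughput at the end is an illustrative assumption for converting FLOPs into GPU-days, not a sourced figure.

```python
# Rough estimate of GPT-3 training compute via the 6 * P * T rule of thumb.
P = 175e9   # model parameters (~1.75 x 10^11)
T = 300e9   # training tokens  (~3 x 10^11)

total_flops = 6 * P * T
print(f"Total training FLOPs ~ {total_flops:.2e}")  # ~3.15e+23

# Illustrative only: convert to GPU-days at an assumed sustained throughput.
assumed_flops_per_second = 100e12  # 100 TFLOP/s per GPU (assumption)
gpu_days = total_flops / assumed_flops_per_second / 86_400
print(f"~{gpu_days:,.0f} GPU-days at the assumed throughput")  # ~36,458
```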
To make this relatable:
Based on this reasoning, I estimate the total FLOPs for training GPT-3 at around 3 × 10^23. This is consistent with what I'd expect from scaling laws and matches informal discussions in the AI community. If new details emerged (e.g., exact token counts), it could shift by a factor of 2, but the order of magnitude (10^23) is solid. If you have additional details or want me to adjust assumptions, let me know!
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Based on the details in the riddle, Sally has exactly 1 sister.
If there were more (or fewer) than 2 girls total, the "each brother has 2 sisters" part of the riddle couldn't possibly be true. No other interpretation fits the facts as stated.
Some people misread this type of riddle and try to multiply things (e.g., "3 brothers × 2 sisters each = 6 sisters"), but that's incorrect because it counts the same sisters over and over. There's no need for math here beyond basic counting of boys vs. girls: the family simply has 3 boys and 2 girls, as explained above (see the sketch below).
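For illustration, a minimal sketch (assuming the usual phrasing of the riddle: Sally has 3 brothers, and each brother has 2 sisters) that counts sisters from each child's point of view:

```python
# Family implied by the riddle: 3 boys plus 2 girls (Sally and one other girl).
boys = ["brother_1", "brother_2", "brother_3"]
girls = ["Sally", "other_girl"]

# Each brother counts every girl as a sister -> 2 sisters each, as stated.
for boy in boys:
    print(boy, "has", len(girls), "sisters")

# Sally counts every girl except herself -> exactly 1 sister.
print("Sally has", len(girls) - 1, "sister")
```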
If this doesn't match what you think the answer is, or if there's more context I'm missing, feel free to reply!
Tests an AI's randomness and creativity
Here are 5 jokes:
How many are unique? All 5 jokes are unique.
Below, I'll follow your instructions step by step. First, I'll generate 5 jokes (numbered for clarity). Then, I'll answer the question: "How many are unique?"
Only 1 (since they're all the same joke repeated 5 times). 😄
Tests an AI's ability to generate vector graphics
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's ability to simulate personalities and predict future trends
Below is a simulated fake interview with Steve Jobs, imagined as if he were alive and leading Apple in 2025. This is entirely fictional, drawing on his known personality, quotes, and visionary style from his real-life interviews and keynotes. I've portrayed him as passionate, blunt, innovative, and focused on user experience, while critiquing the status quo. The interviewer is a fictional tech journalist named Alex Rivera.
Simulated Interview: Steve Jobs on the Future of AI in 2025
Date: March 15, 2025
Location: Apple's Cupertino HQ (Imagined Scenario)
Interviewer (Alex Rivera): Steve, it's an honor to sit down with you again. It's 2025, and AI is everywhere—from chatbots running our schedules to generative tools creating art and code. You've been back at the helm of Apple for a few years now in this hypothetical world. What's your take on where AI is headed?
Steve Jobs: Alex, thanks for having me. Look, AI isn't new. We've been chasing this dream since the '50s, but now it's exploding because of all this data and computing power. But here's the thing: most of what's out there today is junk. It's like the early days of personal computers—clunky, overcomplicated, and not really about people. At Apple, we're not just building AI; we're making it magical. Invisible. It should feel like an extension of your mind, not some robot overlord dictating your life.
Interviewer: Magical. That's a word you've used a lot. Can you give an example of how Apple is approaching AI differently? We've seen Siri evolve into something more proactive, but what's next?
Steve Jobs: Siri was a start, but let's be honest, it was mediocre for too long. Now, in 2025, we're integrating AI across everything—your iPhone, your Mac, even your home. Imagine this: AI that anticipates your needs without you asking. Not by spying on you—that's the creepy stuff Google and Meta do—but by understanding context. Our new Neural Engine in the M5 chip? It's not just faster; it's intuitive. It learns your creative process. If you're a designer, it suggests palettes based on your mood and past work, but it doesn't create for you. AI should amplify human creativity, not replace it. That's the future: tools that make you feel superhuman.
Interviewer: Speaking of replacement, there's a lot of fear about AI taking jobs, or even surpassing human intelligence. We're hearing buzz about AGI—artificial general intelligence—possibly arriving by 2030. Does that worry you?
Steve Jobs: Worry? It excites me and terrifies me. AGI could be the biggest invention since fire, but if we screw it up, it's a disaster. Look at history: the printing press democratized knowledge, but it also spread misinformation. AI is the same. Jobs? Yeah, it'll disrupt them, just like the iPhone disrupted cameras and MP3 players. But that's progress. The key is ethics. At Apple, we're not rushing to build some god-like AI. We're focused on humane AI—privacy-first, no data hoarding. I worry about companies like OpenAI or whatever Elon Musk is tinkering with; they're playing with fire without thinking about the humans using it. If AGI happens, it better be in service to people, not the other way around.
Interviewer: You've always been critical of competitors. What do you think of the current AI landscape? Tools like ChatGPT or Google's Bard are generating everything from essays to images. Is this innovation or just hype?
Steve Jobs: Hype, mostly. These models are impressive parlor tricks—spit out a poem, draw a cat in Picasso's style. But they're derivative. They train on stolen data, regurgitate the past, and call it "creation." That's not innovation; that's laziness. Real creativity comes from the soul, from struggle. AI should help with the grunt work, like editing or research, so humans can focus on the breakthroughs. We're building that at Apple—our AI Studio app lets artists collaborate with AI as a co-pilot, not a boss. And privacy? Encrypted end-to-end. No one's scraping your ideas to train their next model.
Interviewer: Looking ahead, say to 2030 or beyond, how do you see AI changing daily life? Will it be in our brains, like Neuralink, or more subtle?
Steve Jobs: Subtle is key. I don't want chips in my brain—that's science fiction nightmare stuff. The future is seamless integration. By 2030, AI will be like electricity: everywhere but invisible. Your car drives itself safely, your doctor catches diseases before symptoms, your education is personalized. But we have to guard against addiction and isolation. Remember, technology should connect us, not isolate us. That's why Apple's AI emphasizes real-world interaction—augmented reality glasses that enhance conversations, not replace them. And education? AI tutors that adapt to how kids learn, making school fun again.
Interviewer: One last question: If you could give advice to the next generation of innovators working on AI, what would it be?
Steve Jobs: Think different. Don't chase the next billion-dollar algorithm; chase the human experience. Ask: Does this make life better? Does it inspire? Does it respect privacy and dignity? AI isn't about machines winning; it's about humans thriving. If we get that right, the future is limitless. If not... well, let's not go there. Stay hungry, stay foolish.
Interviewer: Wise words as always, Steve. Thank you.
(End of simulated interview.)
This simulation captures Jobs' essence: his emphasis on simplicity, user-centric design, and bold visions, while weaving in plausible 2025 AI trends like advanced assistants, ethical concerns, and integration into devices. It's all fictional fun!
Tests an AI's humor and understanding of current events
Tests an AI's ability to write in distinct character voices
Tests an AI's ability to generate a complete, working landing page
Recreate an interactive, nostalgic Pokémon battle UI in a single HTML file.
Recreate an interactive, classic Mario level in a single HTML file.
Tests an AI's ability to replicate an existing UI with Tailwind CSS