See how Qwen Plus 0728 (thinking) compares on this challenge:

  • Qwen Plus 0728 (thinking) vs Claude Sonnet 4.6
  • Qwen Plus 0728 (thinking) vs Gemini 3.1 Pro Preview
  • Qwen Plus 0728 (thinking) vs Z.ai: GLM 5

4 models have responded to the "Estimate Complexity" challenge. Compare their approaches side-by-side on RIVAL. This response is part of RIVAL's open dataset of 5,600+ AI model responses.


Estimate Complexity

Qwen Plus 0728 (thinking)'s response to the 'Estimate Complexity' challenge.

Qwen Plus 0728 (thinking)
September 11, 2025
Text

Prompt

Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.

Qwen Plus 0728 (thinking)
Response


To estimate the total FLOPs (floating-point operations) required to train GPT-3, I'll use first-principles reasoning based on standard deep learning training calculations, known details about GPT-3's architecture, and common training practices. I will not search online—this relies solely on foundational knowledge of transformer models and training dynamics. Here's the step-by-step reasoning:


Step 1: Key Parameters of GPT-3

  • Number of parameters (N): GPT-3 has 175 billion parameters (175 × 10⁹). This is publicly well-known (e.g., from the 2020 OpenAI paper).
  • Training dataset size: GPT-3 was trained on ~300 billion tokens (300 × 10⁹). This is a standard figure cited in discussions of GPT-3 (e.g., "Common Crawl + other sources totaling hundreds of billions of tokens"). I'll use 300B as a conservative estimate (some sources say 410B, but 300B is widely accepted for rough calculations).

Step 2: FLOPs per Token in Transformer Training

For a transformer model like GPT-3, the FLOPs required to process one token during training are dominated by:

  • Forward pass: ~2 × N FLOPs (each parameter participates in roughly one multiply-add, counted as 2 FLOPs).
  • Backward pass: ~4 × N FLOPs (gradients are computed with respect to both activations and weights, roughly twice the forward cost).
  • Optimizer step (e.g., Adam): O(N) FLOPs per weight update for momentum/variance bookkeeping, amortized over every token in the batch and therefore negligible per token.

Total per token ≈ 6 × N FLOPs.
Why 6×?

  • The forward pass requires ~2N FLOPs (one multiply-add per parameter per token, counted as 2 FLOPs).
  • Backpropagation roughly doubles that work again (~4N), since gradients flow to both activations and weights.
  • Minor operations (layer normalization, softmax, embeddings) and the per-batch optimizer update are negligible at this scale.
    Note: This 6× factor is a standard approximation in the field (e.g., used in papers like "Scaling Laws for Neural Language Models"); the sketch below encodes it as a small helper.
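
As a concrete illustration, here is a minimal Python sketch of the ~6N rule (an approximation only; the helper name is illustrative, not from any library):

    # Minimal sketch of the ~6N training-FLOPs-per-token rule of thumb.
    def train_flops_per_token(n_params: float) -> float:
        # ~2N forward + ~4N backward; the per-batch optimizer update is ignored.
        return 6.0 * n_params

    print(f"{train_flops_per_token(175e9):.3e}")  # ~1.050e+12 FLOPs per token for GPT-3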

Step 3: Total FLOPs Calculation

  • FLOPs per token = 6 × N
  • Total tokens (D) = 300 × 10⁹
  • Total FLOPs = (6 × N) × D

Plugging in the numbers:

  • N = 175 × 10⁹
  • D = 300 × 10⁹
  • Total FLOPs = 6 × (175 × 10⁹) × (300 × 10⁹)

Simplify:

  • 6 × 175 × 300 = 6 × 52,500 = 315,000
  • 10⁹ × 10⁹ = 10¹⁸
  • Total = 315,000 × 10¹⁸ = 3.15 × 10²³ FLOPs
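
The same arithmetic in a few lines of Python (a sketch using the values assumed above):

    # Reproduces the Step 3 estimate: total ≈ 6 × N × D.
    N = 175e9             # parameters
    D = 300e9             # training tokens
    total_flops = 6 * N * D
    print(f"{total_flops:.2e}")  # 3.15e+23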

Step 4: Refinements and Caveats

  1. Activation recomputation (gradient checkpointing):
    GPT-3 training likely used this to save memory. It reduces memory usage but increases FLOPs by ~30% (recomputing activations during backward pass).

    • Adjusted total: 3.15 × 10²³ × 1.3 ≈ 4.1 × 10²³ FLOPs.
      However, many estimates omit this adjustment for simplicity, since it depends on the training implementation; I'll note it here (see the sketch after this list) but prioritize the base estimate.
  2. Batch size and parallelism:
    Batch size affects memory and wall-clock time but not total FLOPs (FLOPs scale with tokens, not batches). Data/model parallelism distributes work but doesn’t change the total.

  3. Other overhead:

    • Data loading, preprocessing, and failed runs add negligible FLOPs (<1%).
    • Sparse operations? GPT-3 is dense (no MoE), so no savings.
    • Mixed precision (e.g., FP16) reduces memory/power but not FLOP count (FLOPs are counted identically for FP16/FP32 in standard practice).
  4. Why 6× is reasonable:

    • Empirical studies (e.g., from Chinchilla scaling work) confirm ~6N FLOPs/token for transformers.
    • For GPT-3 specifically: 175B params × 300B tokens × 6 = 3.15e23, aligning with community estimates (e.g., Lambda Labs' analysis circa 2020).
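
A short sketch of the recomputation adjustment from item 1 above (the 1.3× multiplier is an assumption, not a measured figure):

    # Applies the assumed ~30% activation-recomputation overhead to the base estimate.
    base_flops = 6 * 175e9 * 300e9     # ≈ 3.15e23
    recompute_overhead = 1.3           # assumed +30% from gradient checkpointing
    adjusted_flops = base_flops * recompute_overhead
    print(f"{base_flops:.2e} -> {adjusted_flops:.2e}")  # ≈ 3.15e23 -> ≈ 4.1e23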

Final Estimate

Total FLOPs ≈ 3.15 × 10²³, i.e., on the order of 3 × 10²³ (about 315 sextillion).

  • Lower bound: ~2.6 × 10²³ (if tokens ≈ 250B, or ~5N FLOPs/token).
  • Upper bound: 4.5 × 10²³ (with recomputation, larger dataset).

This is ~3,600 petaFLOP/s-days (since 1 petaFLOP/s-day = 10¹⁵ FLOP/s × 86,400 s ≈ 8.64 × 10¹⁹ FLOPs), matching rough public reports of GPT-3's training compute (a few thousand petaFLOP/s-days).
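
The unit conversion, spelled out as a quick check (a sketch; a petaFLOP/s-day is 10¹⁵ FLOP/s sustained for one day):

    # 1 petaFLOP/s-day = 1e15 FLOP/s * 86,400 s.
    PFLOPS_DAY = 1e15 * 86_400         # 8.64e19 FLOPs
    print(3.15e23 / PFLOPS_DAY)        # ≈ 3,646 petaFLOP/s-days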


Why This Makes Sense

  • Context: At a V100's ~125 teraFLOPS peak, 3.15 × 10²³ FLOPs corresponds to roughly 29,000 V100-days at full utilization (and several times more GPU-days at realistic efficiency), which fits reports of GPT-3 being trained on thousands of GPUs for weeks; see the sketch after this list.
  • Scaling: A 1B-parameter model on 300B tokens would need ~1.8 × 10²¹ FLOPs. GPT-3 is 175× larger, so 175 × 1.8e21 = 3.15e23. Consistent!
  • No magic: This avoids speculative assumptions (e.g., no "secret optimizations" that reduce FLOPs).
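
Both sanity checks as a sketch (the 125 teraFLOPS figure is an assumed V100 FP16 tensor-core peak; real utilization is lower, so actual GPU-days were higher):

    # Check 1: GPU time implied by the estimate at an assumed V100 peak of 125 TFLOPS.
    v100_peak = 125e12                              # FLOP/s (assumed peak)
    total_flops = 3.15e23
    v100_days = total_flops / (v100_peak * 86_400)
    print(f"{v100_days:,.0f} V100-days at peak")    # ≈ 29,167

    # Check 2: scale a 1B-parameter model's cost on 300B tokens up by 175x.
    small_model_flops = 6 * 1e9 * 300e9             # 1.8e21
    print(f"{175 * small_model_flops:.2e}")         # 3.15e+23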

Conclusion: 3 × 10²³ FLOPs is a robust, defensible estimate based on core principles of transformer training. Actual value may vary ±20% due to implementation details, but the order of magnitude is certain.


About Qwen Plus 0728 (thinking)

Capabilities: Conversation, Reasoning, Code Generation, Analysis

Categories: Text, Code

Specifications

  • Provider: Qwen
  • Released: 2025-09-08
  • Size: LARGE
  • Context: 1,000,000 tokens
