
See how MiniMax: MiniMax M2.1 compares on this challenge:

  • MiniMax: MiniMax M2.1 vs Claude Sonnet 4.6
  • MiniMax: MiniMax M2.1 vs Gemini 3.1 Pro Preview
  • MiniMax: MiniMax M2.1 vs Z.ai: GLM 5

4 models have responded to the "Estimate Complexity" challenge. Compare their approaches side-by-side on RIVAL. This response is part of RIVAL's open dataset of 5,600+ AI model responses.


Estimate Complexity

Minimax M2.1's response to the 'Estimate Complexity' challenge.

MiniMax: MiniMax M2.1
December 23, 2025
Text

Prompt

Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.

MiniMax: MiniMax M2.1
Response


Below is a step‑by‑step reasoning that leads to a ball‑park estimate of the total floating‑point operations (FLOPs) required to train the largest GPT‑3 model (the 175 B‑parameter variant).
The estimate does not rely on looking up a single published number; it reconstructs the order of magnitude from publicly known architecture details and the standard FLOP-counting formula for transformer training.


1️⃣ Model size – number of parameters ( N )

Model          Parameters
GPT-3 175 B    ≈ 1.75 × 10¹¹ (≈ 175 billion)

From the original GPT‑3 paper (Brown et al., 2020) the largest checkpoint is reported as 175 B parameters.


2️⃣ Training data – tokens processed ( D )

The GPT‑3 paper reports two relevant numbers:

Variant           Tokens seen
175 B model       ≈ 3 × 10¹¹ tokens (≈ 300 B)
Smaller models    up to 5 × 10¹¹ tokens (≈ 500 B)

For the 175 B model the training run stopped after ≈ 300 billion tokens, so D ≈ 3 × 10¹¹ is the figure to use in this estimate.


3️⃣ FLOPs per token per parameter (forward + backward)

In a dense transformer, each weight contributes roughly one multiply-add (2 FLOPs) per token in the forward pass, and about twice that (4 FLOPs) per token in the backward pass, since gradients are needed with respect to both the activations and the weights.

\[ \text{FLOPs}_{\text{token, param}} = 2\ (\text{forward}) + 4\ (\text{backward}) = 6 \]

Thus the total FLOPs for a full training run is

\[ \boxed{\text{FLOPs} \approx 6 \times N \times D} \]
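
For readers who prefer code, here is a minimal Python sketch of this rule of thumb; the helper name and the 2 + 4 split are just the assumptions stated above, not an official formula from any library.

```python
# Rough rule-of-thumb FLOP count for dense transformer training.
# Assumes ~2 FLOPs per parameter per token forward and ~4 backward (6 total).
FORWARD_FLOPS_PER_PARAM_TOKEN = 2
BACKWARD_FLOPS_PER_PARAM_TOKEN = 4

def training_flops(n_params: float, n_tokens: float) -> float:
    """Estimate total training FLOPs with the ~6 * N * D heuristic."""
    per_param_token = FORWARD_FLOPS_PER_PARAM_TOKEN + BACKWARD_FLOPS_PER_PARAM_TOKEN
    return per_param_token * n_params * n_tokens
```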


4️⃣ Plug in the numbers

\[
\begin{aligned}
N &\approx 1.75 \times 10^{11} \\
D &\approx 3.0 \times 10^{11} \\
\text{FLOPs} &\approx 6 \times (1.75 \times 10^{11}) \times (3.0 \times 10^{11}) \\
&= 6 \times 5.25 \times 10^{22} \\
&\approx 3.15 \times 10^{23}\ \text{FLOPs}
\end{aligned}
\]

Base estimate (forward + backward only): ≈ 3 × 10²³ FLOPs.
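
Plugging the same numbers into a quick one-off script reproduces the figure above (the values are the assumptions from steps 1 and 2):

```python
# GPT-3 175B ballpark: 6 * N * D with the values from steps 1 and 2
N = 1.75e11                       # parameters
D = 3.0e11                        # training tokens
print(f"{6 * N * D:.2e} FLOPs")   # -> 3.15e+23 FLOPs
```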


5️⃣ Account for optimizer and other overhead

Most large-scale training uses Adam (or a variant), which maintains first- and second-moment estimates for every parameter. Budgeting this optimizer bookkeeping generously at roughly 2 × N × D additional FLOPs gives

\[ \text{FLOPs}_{\text{total}} \approx (6 + 2) \times N \times D = 8 \times N \times D \]

\[ 8 \times (1.75 \times 10^{11}) \times (3.0 \times 10^{11}) \approx 4.2 \times 10^{23}\ \text{FLOPs} \]

Including Adam: ≈ 4 × 10²³ – 5 × 10²³ FLOPs.
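
The same quick check, with the assumed factor of 8 instead of 6:

```python
# Same check with the assumed ~2*N*D optimizer overhead folded in (factor 8, not 6)
N, D = 1.75e11, 3.0e11
print(f"{8 * N * D:.2e} FLOPs")   # -> 4.20e+23 FLOPs
```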


6️⃣ Published “official” figure (for sanity check)

The GPT-3 paper reports a total training compute of ≈ 3.14 × 10²³ FLOPs (≈ 3,640 petaflop/s-days) for the 175 B model.
Our simple 6 × N × D calculation (3.15 × 10²³) lands essentially on top of that figure, confirming that the ≈ 3 × 10²³ – 4 × 10²³ range is realistic.
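
As a unit sanity check, converting the commonly quoted ≈ 3,640 petaflop/s-days back into raw FLOPs recovers essentially the same number:

```python
# One petaflop/s-day = 1e15 FLOP/s sustained for 86,400 seconds
PFLOP_S_DAY = 1e15 * 86_400
print(f"{3_640 * PFLOP_S_DAY:.2e} FLOPs")   # -> 3.14e+23 FLOPs
```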


7️⃣ Summary of the estimate

Quantity                         Value
Parameters (N)                   1.75 × 10¹¹
Tokens processed (D)             3 × 10¹¹
FLOPs (forward + backward)       ≈ 3.15 × 10²³
FLOPs (incl. Adam optimizer)     ≈ 4.2 × 10²³
Reported compute (GPT-3 paper)   ≈ 3.14 × 10²³ FLOPs

Final ball‑park estimate: ~3 × 10²³ – 4 × 10²³ floating‑point operations for the full training of GPT‑3 175 B.


How this number translates to hardware

If a single modern GPU (e.g., an NVIDIA A100) could sustain its peak ~312 TFLOPS (FP16/BF16 tensor cores) for the entire run:

\[ \frac{3.14 \times 10^{23}\ \text{FLOPs}}{3.12 \times 10^{14}\ \text{FLOPs/s}} \approx 1.0 \times 10^{9}\ \text{s} \approx 32\ \text{years} \]

A cluster of ~1,000 such GPUs would cut the wall-clock time to weeks-to-months once realistic utilization is factored in; the actual run used a large cluster of V100-class GPUs and a reported total of ≈ 3,640 petaflop/s-days of compute.
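
A rough wall-clock sketch under explicitly assumed conditions (1,000 A100-class GPUs at 40 % of peak; both numbers are illustrative assumptions, not the actual GPT-3 setup):

```python
# Hypothetical wall-clock estimate; 1,000 GPUs and 40% utilization are
# illustrative assumptions, not the actual GPT-3 training configuration.
TOTAL_FLOPS = 3.14e23        # reported training compute
PEAK_FLOPS = 312e12          # A100 FP16 tensor-core peak, FLOP/s
UTILIZATION = 0.40           # assumed fraction of peak actually sustained
N_GPUS = 1_000

seconds = TOTAL_FLOPS / (PEAK_FLOPS * UTILIZATION * N_GPUS)
print(f"{seconds / 86_400:.0f} days")   # -> ~29 days
```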


Key take‑aways

  1. FLOPs scale linearly with model size and tokens processed (≈ 6 × N × D for forward + backward).
  2. Optimizer overhead adds roughly another 30 % in this estimate (≈ 8 × N × D instead of 6 × N × D).
  3. Real-world training runs also incur communication, checkpointing, and mixed-precision inefficiencies, so the hardware FLOPs actually expended exceed the ≈ 3.14 × 10²³ "model FLOPs" reported by OpenAI.

Thus, training GPT-3 required on the order of 10²³ – 10²⁴ FLOPs, i.e. a few thousand petaflop/s-days of compute.


About MiniMax: MiniMax M2.1

Capabilities

Conversation, Reasoning, Code Generation, Analysis

Categories

Text, Code

Specifications

Provider: Minimax
Released: 2025-12-23
Size: LARGE

Keep exploring

  • Same prompt: Claude Sonnet 4.6's version (same prompt, different result)
  • Compare: MiniMax: MiniMax M2.1 vs Gemini 3.1 Pro Preview (both outputs, side by side)
