Estimate Complexity
MiniMax M2.1's response to the 'Estimate Complexity' challenge.
Prompt
Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.
Response
Below is a step‑by‑step reasoning that leads to a ball‑park estimate of the total floating‑point operations (FLOPs) required to train the largest GPT‑3 model (the 175 B‑parameter variant).
The estimate does not rely on looking up a single published number; it reconstructs the order‑of‑magnitude using publicly‑known architecture details and the standard FLOP‑counting formula for transformer training.
1️⃣ Model size – number of parameters ( N )
| Model | Parameters |
|---|---|
| GPT‑3 175 B | ≈ 1.75 × 10¹¹ (≈ 175 billion) |
From the original GPT‑3 paper (Brown et al., 2020) the largest checkpoint is reported as 175 B parameters.
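As a quick cross-check, the 175 B figure can be roughly reconstructed from the architecture reported in the paper (96 layers, d_model = 12,288, a ~50k BPE vocabulary) using the standard ≈ 12·L·d² approximation for a decoder-only transformer. The sketch below is illustrative only; it ignores biases, layer norms, and positional embeddings.

```python
# Rough parameter count for GPT-3 175B from its published architecture.
# The 12 * L * d^2 rule is an approximation: 4*d^2 for the attention
# projections (Q, K, V, output) plus 8*d^2 for the two MLP matrices.
n_layers = 96
d_model = 12_288
vocab_size = 50_257

per_layer = 12 * d_model ** 2          # weights in one transformer block
embeddings = vocab_size * d_model      # token embedding matrix

total = n_layers * per_layer + embeddings
print(f"~{total / 1e9:.0f} B parameters")  # -> ~175 B
```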
2️⃣ Training data – tokens processed ( D )
The GPT‑3 paper reports two relevant numbers:
| Quantity | Tokens |
|---|---|
| Seen by the 175 B model during training | ≈ 3 × 10¹¹ (≈ 300 B) |
| Full training dataset (weighted mix of sources) | ≈ 5 × 10¹¹ (≈ 500 B) |
Training of the 175 B model stopped after ≈ 300 billion tokens; because the data sources are sampled with different weights, some (e.g., Wikipedia) are seen more than once while the bulk of Common Crawl is seen less than once.
3️⃣ FLOPs per token per parameter (forward + backward)
For each token, the dense matrix multiplications perform roughly one multiply-add per weight in the forward pass (2 FLOPs) and about twice that in the backward pass (4 FLOPs, since gradients are needed with respect to both the activations and the weights).
\[ \text{FLOPs}_{\text{token,param}} = 2\;(\text{forward}) + 4\;(\text{backward}) = 6 \]
Thus the total FLOPs for a full training run is
\[ \boxed{\text{FLOPs} \approx 6 \times N \times D} \]
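A minimal sketch of this rule (assuming the usual convention of 2 FLOPs per multiply-add, and ignoring attention-score and optimizer costs, which are treated as overhead below):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs for a dense transformer.

    Per parameter and per token:
      forward  ~ 1 multiply-add  = 2 FLOPs
      backward ~ 2 multiply-adds = 4 FLOPs
        (gradients w.r.t. activations and w.r.t. weights)
    """
    return (2 + 4) * n_params * n_tokens
```

The next step simply evaluates this expression with the GPT-3 numbers.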
4️⃣ Plug in the numbers
\[
\begin{aligned}
N &\approx 1.75 \times 10^{11} \\
D &\approx 3.0 \times 10^{11} \\
\text{FLOPs} &\approx 6 \times (1.75 \times 10^{11}) \times (3.0 \times 10^{11}) \\
&= 6 \times 5.25 \times 10^{22} \\
&\approx 3.15 \times 10^{23}\ \text{FLOPs}
\end{aligned}
\]
Base estimate (forward + backward only): ≈ 3 × 10²³ FLOPs.
5️⃣ Account for optimizer and other overhead
Most large-scale training uses Adam (or a variant). The Adam update itself runs once per optimizer step rather than once per token, so its FLOP cost is negligible next to 6 × N × D. The more relevant omissions from the 6 × N × D rule are the attention-score computations (which grow with sequence length), softmax/layer-norm work, and general inefficiency; a generous ~30 % allowance for all of this overhead gives
\[ \text{FLOPs}_{\text{total}} \approx \frac{8}{6} \times 6 \times N \times D = 8 \times N \times D \]
\[ 8 \times (1.75 \times 10^{11}) \times (3.0 \times 10^{11}) \approx 4.2 \times 10^{23}\ \text{FLOPs} \]
Including overhead: ≈ 4 × 10²³ FLOPs (an upper-end figure).
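A sketch of the same adjustment in code; the 8/6 overhead factor is a rough assumption, not a measured quantity:

```python
n_params = 1.75e11         # N
n_tokens = 3.0e11          # D

base = 6 * n_params * n_tokens   # forward + backward, dense matmuls only
overhead = 8 / 6                 # ~33% rough allowance: attention scores,
                                 # optimizer step, softmax/layer-norm, etc.
total = base * overhead

print(f"base  ~ {base:.2e} FLOPs")   # ~3.15e+23
print(f"total ~ {total:.2e} FLOPs")  # ~4.20e+23
```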
6️⃣ Published “official” figure (for sanity check)
The GPT-3 paper itself reports ≈ 3.14 × 10²³ FLOPs of training compute for the 175 B run (≈ 3,640 petaflop/s-days).
Our simple 6 × N × D calculation (3.15 × 10²³) essentially reproduces that number, confirming that the ≈ 3 × 10²³ – 4 × 10²³ range is realistic.
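Converting the paper's petaflop/s-day figure back to raw FLOPs is a one-line arithmetic check:

```python
# 1 petaflop/s-day = 1e15 FLOP/s sustained for one day (86,400 s).
pf_s_day = 1e15 * 86_400
reported = 3_640 * pf_s_day       # GPT-3 175B reported training compute
print(f"{reported:.2e} FLOPs")    # -> 3.14e+23
```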
7️⃣ Summary of the estimate
| Quantity | Value |
|---|---|
| Parameters (N) | 1.75 × 10¹¹ |
| Tokens processed (D) | 3 × 10¹¹ |
| FLOPs (forward + backward) | ≈ 3.15 × 10²³ |
| FLOPs (incl. ~30 % overhead) | ≈ 4.2 × 10²³ |
| Reported training compute | ≈ 3.14 × 10²³ FLOPs (≈ 3,640 petaflop/s-days) |
Final ball‑park estimate: ~3 × 10²³ – 4 × 10²³ floating‑point operations for the full training of GPT‑3 175 B.
How this number translates to hardware
If a single modern GPU (e.g., NVIDIA A100) has a peak throughput of ~312 TFLOPS (3.12 × 10¹⁴ FLOP/s) for dense FP16/BF16 matrix math, then even at full utilization one GPU would need
\[ \frac{3.14 \times 10^{23}\ \text{FLOPs}}{3.12 \times 10^{14}\ \text{FLOP/s}} \approx 1.0 \times 10^{9}\ \text{s} \approx 32\ \text{years} \]
A cluster of a few thousand such GPUs, running at realistic utilization well below peak, brings the wall-clock time down to the order of weeks to months, consistent with the reported ≈ 3,640 petaflop/s-days of training compute (originally delivered on V100-class hardware).
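A back-of-envelope version of this wall-clock argument; the cluster size and utilization below are illustrative assumptions, not the configuration OpenAI actually used:

```python
total_flops = 3.14e23       # reported training compute
peak_per_gpu = 312e12       # A100 dense FP16/BF16 peak, FLOP/s
utilization = 0.40          # assumed fraction of peak actually sustained
num_gpus = 1_024            # assumed cluster size (illustrative)

seconds = total_flops / (num_gpus * peak_per_gpu * utilization)
print(f"~{seconds / 86_400:.0f} days")   # ~28 days under these assumptions
```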
Key take‑aways
- FLOPs scale linearly with model size and tokens processed (≈ 6 × N × D for forward + backward).
- Overhead beyond the plain forward/backward count (attention scores, optimizer step, auxiliary ops) adds very roughly 30 % (≈ 8 × N × D as an upper bound).
- Real-world training runs also incur communication, checkpointing, and utilization losses, so the hardware capacity consumed exceeds the ≈ 3.14 × 10²³ FLOPs of useful arithmetic reported by OpenAI.
Thus, training GPT-3 required on the order of 3 × 10²³ FLOPs, i.e. a few thousand petaflop/s-days of compute; a consolidated code sketch follows.
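Pulling the steps together, the whole ball-park estimate fits in a few lines (same assumptions as above: 6 FLOPs per parameter per token plus a rough ~30 % overhead factor):

```python
def estimate_training_flops(n_params: float, n_tokens: float,
                            overhead: float = 8 / 6) -> tuple[float, float]:
    """Return (forward+backward FLOPs, FLOPs including rough overhead)."""
    base = 6 * n_params * n_tokens
    return base, base * overhead

base, total = estimate_training_flops(1.75e11, 3.0e11)
print(f"forward+backward ~ {base:.2e} FLOPs")               # ~3.15e+23
print(f"with overhead    ~ {total:.2e} FLOPs")              # ~4.20e+23
print(f"~ {total / (1e15 * 86_400):,.0f} petaflop/s-days")  # ~4,861
```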
About MiniMax: MiniMax M2.1
Specifications
- Provider: MiniMax
- Released: 2025-12-23
- Size: LARGE