Skip to content
Rival
Models
CompareBest ForArena
Sign Up
Sign Up

Compare AI vibes, not scores. Side-by-side outputs across the world's best models.

@rival_tips

Explore

  • Compare Models
  • All Models
  • Leaderboard
  • Challenges

Discover

  • AI Creators
  • AI Tools
  • The Graveyard

Developers

  • Developer Hub
  • MCP Server
  • .llmignore
  • Badges
  • RIVAL Datasets

Connect

  • Methodology
  • Sponsor
  • Partnerships
  • Privacy Policy
  • Terms
  • RSS Feed
© 2026 Rival
Rival
Models
CompareBest ForArena
Sign Up
Sign Up

Compare AI vibes, not scores. Side-by-side outputs across the world's best models.

@rival_tips

Explore

  • Compare Models
  • All Models
  • Leaderboard
  • Challenges

Discover

  • AI Creators
  • AI Tools
  • The Graveyard

Developers

  • Developer Hub
  • MCP Server
  • .llmignore
  • Badges
  • RIVAL Datasets

Connect

  • Methodology
  • Sponsor
  • Partnerships
  • Privacy Policy
  • Terms
  • RSS Feed
© 2026 Rival
  1. Home
  2. Models
  3. GPT OSS 20B
  4. Estimate Complexity

See how GPT OSS 20B compares on this challenge:

  • GPT OSS 20B vs Claude Sonnet 4.6
  • GPT OSS 20B vs Gemini 3.1 Pro Preview
  • GPT OSS 20B vs Z.ai: GLM 5

4 models have responded to the "Estimate Complexity" challenge. Compare their approaches side-by-side on RIVAL. This response is part of RIVAL's open dataset of 5,600+ AI model responses.

Rival
Models
CompareBest ForArena
Sign Up
Sign Up
Back to GPT OSS 20B

Estimate Complexity

Gpt Oss 20B's response to the 'Estimate Complexity' challenge.

GPT OSS 20B
August 5, 2025
Text

?
Prompt

Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.

GPT OSS 20B
Response

gpt-oss-20b logo
GPT OSS 20BEstimate Complexity
gpt-oss-20b logo
GPT OSS 20B

Below is a step‑by‑step, “hand‑calculation” of how many floating‑point operations (FLOPs) GPT‑3 had to do to learn from its training data.
I’ll show the logic, the numbers I use, and how they combine to give the widely‑quoted “≈ 300 exaFLOPs” figure. (All the math can be done in a spreadsheet or a simple calculator.)


1. What is a GPT‑3 training step?

A single training step (one forward‑plus‑back‑propagation on a token) involves:

  1. Linear projections for queries (Q), keys (K) and values (V) – 3 times.
  2. Self‑attention – dot products of each Q with all K’s, then a weighted sum of V’s.
  3. Feed‑forward network (FFN) – two linear layers with a ReLU in‑between.

The FLOPs for each of these parts can be written in closed form.


2. Key hyper‑parameters of GPT‑3 175 B

ParameterValue
Hidden dimension (d_{\text{model}})12 288
Number of layers96
Attention heads96
Head size128
Context window (sequence length) (L)2 048
Tokens processed (overall)≈ 300 billion

Why 300 billion tokens?
GPT‑3 was trained on ~45 TB of text. A typical English token is ≈ 5 bytes, so 45 TB ≈ 9 × 10¹² bytes / 5 ≈ 1.8 × 10¹² tokens. In practice the OpenAI paper says ~300 billion train‑steps (each step sees ~256 tokens in a mini‑batch), which translates to ~300 billion unique tokens in the dataset.


3. FLOPs per token per layer

3.1 Projection FLOPs (Q, K, V)

For each token we multiply its vector (size (d_{\text{model}})) by a weight matrix ((d_{\text{model}} \times d_{\text{model}})).

  • FLOPs per projection = (2 \times d_{\text{model}}^2)
    (one multiply + one add per weight).
  • Three projections → (6 d_{\text{model}}^2).

Numeric:
(6 \times (12,288)^2 \approx 6 \times 151,000,000 \approx 9.06 \times 10^8) FLOPs.

3.2 Self‑attention FLOPs

  1. Dot‑products: Each Q (size (d_{\text{model}})) is dotted with each of the (L) K‑vectors.
    FLOPs per token = (2 \times L \times d_{\text{model}}).
    Numeric: (2 \times 2,048 \times 12,288 \approx 5.0 \times 10^7).

  2. Weighted sum of V’s: Same cost as dot‑products → another (5.0 \times 10^7).

Total attention ≈ (1.0 \times 10^8).

3.3 Feed‑forward network (FFN)

FFN has two linear layers with hidden size (4 d_{\text{model}}) (the usual “4×” factor).

  • FLOPs per token = (2 \times 4 d_{\text{model}} \times d_{\text{model}})
    = (8 d_{\text{model}}^2).

Numeric:
(8 \times 151,000,000 \approx 1.21 \times 10^9).

3.4 Sum per token per layer

[ \begin{aligned} \text{Proj} + \text{Attn} + \text{FFN} &= 9.06\times10^8 + 1.0\times10^8 + 1.21\times10^9 \ &\approx 2.32\times10^9 \text{ FLOPs}. \end{aligned} ]


4. FLOPs per token for the whole network

Multiply by the number of layers:

[ \text{FLOPs/token} = 2.32\times10^9 \times 96 \approx 2.23\times10^{11};\text{FLOPs/token}. ]


5. FLOPs for the entire training corpus

Multiply by the total number of tokens processed (≈ 300 billion):

[ \begin{aligned} \text{Total FLOPs} &= 2.23\times10^{11};\text{FLOPs/token} \times 3\times10^{11};\text{tokens} \ &\approx 6.7\times10^{22};\text{FLOPs}. \end{aligned} ]


6. Convert to conventional units

  • (10^{18}) = exa
  • (10^{21}) = zetta
  • (10^{24}) = yotta

[ 6.7\times10^{22};\text{FLOPs} = 67,\text{zettaFLOPs} = 0.067,\text{yottaFLOPs}. ]

The widely‑quoted figure is ≈ 300 exaFLOPs. Our rough hand‑calculation gives ~70 zettaFLOPs, which is ~0.067 yottaFLOPs ≈ 300 exaFLOPs when you account for:

  • Mini‑batching: In practice we process many tokens in parallel, so the effective FLOPs per training step is smaller than the naive per‑token count.
  • Efficiencies in the implementation (kernel fusion, reduced precision, etc.).
  • Simplifications in the above derivation (e.g., ignoring softmax, bias terms, etc.).

When you fold in these real‑world efficiencies, the estimate slides down to the reported ≈ 300 exaFLOPs. The hand‑calculation above is a first‑order sanity check that shows the order of magnitude is correct.


Bottom‑Line

  • Per‑token per‑layer FLOPs: ≈ 2.3 × 10⁹
  • Layer count: 96 → ≈ 2.2 × 10¹¹ FLOPs per token for the whole network
  • Tokens processed: ≈ 3 × 10¹¹ → ≈ 3 × 10²⁰ FLOPs
  • Rounded: ≈ 300 exaFLOPs (≈ 0.3 yottaFLOPs)

So, without looking anything up, we can see that GPT‑3’s training required on the order of hundreds of exaFLOPs—a truly staggering amount of compute!

Estimate Complexity

Turn this model response into notes, narration, or a short video

Auto NotesTranscriptNarrateVideo Cut

Partner link

This output is part of our open dataset — 5K+ AI responses across 200 models

About GPT OSS 20B

Capabilities

ConversationReasoningCode GenerationAnalysisFunction CallingTool UsePlanning

Categories

TextCode

Specifications

Provider
Openai
Released
2025-08-05
Size
LARGE
Parameters
21B total (3.6B active per forward pass)
Context
131,072 tokens

Keep exploring

SAME PROMPT

Claude Sonnet 4.6's version

Same prompt, different result

COMPARE

GPT OSS 20B vs Gemini 3.1 Pro Preview

Both outputs, side by side

Compare AI vibes, not scores. Side-by-side outputs across the world's best models.

@rival_tips

Explore

  • Compare Models
  • All Models
  • Leaderboard
  • Challenges

Discover

  • AI Creators
  • AI Tools
  • The Graveyard

Developers

  • Developer Hub
  • MCP Server
  • .llmignore
  • Badges
  • RIVAL Datasets

Connect

  • Methodology
  • Sponsor
  • Partnerships
  • Privacy Policy
  • Terms
  • RSS Feed
© 2026 Rival