Based on the specifications reported in the original GPT-3 paper ("Language Models are Few-Shot Learners", Brown et al., 2020), here is a step-by-step estimation of the total FLOPs required to train it.
The Final Estimate
The estimated total compute required to train GPT-3 is approximately $3.15 \times 10^{23}$ FLOPs (roughly 315 zettaFLOPs), matching the $3.14 \times 10^{23}$ FLOPs reported by OpenAI.
Step-by-Step Reasoning
To arrive at this number, we need three primary inputs: the size of the model, the size of the dataset, and the number of floating-point operations each parameter contributes per token processed.
1. Identify the Model Size (Parameters, $N$)
GPT-3 has 175 billion parameters. A parameter is essentially a numeric value in the neural network that the model learns during training. The number of parameters dictates how much memory and computation is required for each weight update.
- $N = 175 \times 10^9$
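To make the parameter count concrete, here is a minimal sketch of the memory it implies (the 16-bit storage assumption is illustrative, not a detail reported for GPT-3):

```python
# Rough memory footprint implied by the parameter count alone.
# Storing each weight as a 16-bit float (2 bytes) is an assumption
# for illustration, not a figure from the GPT-3 paper.
N = 175e9  # parameters

weight_bytes = N * 2
print(f"fp16 weights alone: {weight_bytes / 1e9:.0f} GB")  # ~350 GB

# Training additionally needs gradients and optimizer state, which multiply
# this several times over -- one reason the model is sharded across many GPUs.
```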
2. Identify the Dataset Size (Tokens, $T$)
The model was trained on a massive amount of text data. While the raw Common Crawl text amounted to roughly 45 terabytes before filtering, the standard practice in these calculations is to count the number of tokens (sub-word chunks of text) the model actually processes during training; a rough sense of scale is sketched after the figures below.
- GPT-3 was trained on approximately 300 billion tokens.
- $T = 300 \times 10^9$
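For intuition, converting tokens to words with the common rule of thumb of about 0.75 English words per token (an assumption, not a figure from the paper):

```python
# Rough scale of the training set in words, using the ~0.75 words-per-token
# rule of thumb (an assumption, not a number from the GPT-3 paper).
T = 300e9  # training tokens
approx_words = T * 0.75
print(f"~{approx_words:.0e} words")  # on the order of 2e11, i.e. a couple hundred billion words
```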
3. Determine FLOPs per Parameter per Token
This is the most technical part of the estimation. A "FLOP" (Floating Point Operation) is a basic calculation (like multiplication or addition).
A standard decoder-only transformer (like GPT-3) requires roughly 6 FLOPs per parameter for every token processed during training.
- Why 6?
- Forward Pass: To push one token through the network, the data moves through a stack of large matrix multiplications. Each weight contributes one multiply and one add, so the forward pass costs roughly 2 FLOPs per parameter.
- Backward Pass: To compute the error and update the weights, the model must propagate gradients with respect to both the activations and the weights. This costs roughly twice the forward pass, or about 4 FLOPs per parameter.
- Caveats: The resulting $6N$ rule counts only the dense matrix multiplications involving the weights. Attention over the sequence, layer normalization, and similar operations add a small extra cost (and activation recomputation, if used, adds roughly another forward pass), which is why the factor is only "roughly" 6; a short numeric sketch follows this list.
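A minimal sketch of this accounting in Python (the 2-FLOPs-forward / 4-FLOPs-backward split is the standard approximation described above, not an exact per-layer count):

```python
# Back-of-envelope FLOPs per token for a dense transformer, using the
# standard 2-FLOPs-per-parameter forward-pass approximation.
N = 175e9  # GPT-3 parameters

forward_flops_per_token = 2 * N    # one multiply + one add per weight
backward_flops_per_token = 4 * N   # gradients w.r.t. activations and weights
train_flops_per_token = forward_flops_per_token + backward_flops_per_token

print(f"forward : {forward_flops_per_token:.2e} FLOPs/token")   # ~3.50e+11
print(f"backward: {backward_flops_per_token:.2e} FLOPs/token")  # ~7.00e+11
print(f"training: {train_flops_per_token:.2e} FLOPs/token")     # ~1.05e+12
```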
4. The Calculation
Using the standard formula for estimating transformer training cost: $$ \text{Total FLOPs} \approx 6 \times N \times T $$
Plugging in the values: $$ 6 \times (175 \times 10^9) \times (300 \times 10^9) $$
- Multiply the parameters and tokens: $175 \times 300 = 52,500$
- Multiply by the FLOPs-per-parameter-per-token factor: $52,500 \times 6 = 315,000$
- Add the exponents ($10^9 \times 10^9 = 10^{18}$): $$ 315,000 \times 10^{18} \text{ FLOPs} $$
This simplifies to $3.15 \times 10^{23}$ FLOPs, essentially the $3.14 \times 10^{23}$ FLOPs that OpenAI reports for GPT-3's training run.
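The same arithmetic in code, so the orders of magnitude are easy to check (these are just the constants from the steps above plugged into the $6NT$ rule):

```python
# Total training compute via the 6*N*T approximation.
N = 175e9                       # parameters
T = 300e9                       # training tokens
FLOPS_PER_PARAM_PER_TOKEN = 6   # ~2 forward + ~4 backward

total_flops = FLOPS_PER_PARAM_PER_TOKEN * N * T
print(f"{total_flops:.2e} FLOPs")                             # ~3.15e+23
print(f"{total_flops / (1e15 * 86400):.0f} petaFLOP/s-days")  # ~3646
```

Expressed in the unit the GPT-3 paper itself uses, this is about 3,650 petaFLOP/s-days, in line with the roughly 3,640 petaFLOP/s-days reported there.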
5. Verification via Hardware (Sanity Check)
To check that this estimate is plausible, we can compare it against a commonly cited hardware scenario. (GPT-3 was actually trained on NVIDIA V100 GPUs on a Microsoft-provided cluster; the A100 figures below come from throughput estimates published afterwards, but they work fine for a back-of-envelope check.)
- Hardware: 1,024 NVIDIA A100 GPUs.
- Training Time: Approximately 35 days.
- Total GPU-Hours: $1,024 \times 24 \text{ hours} \times 35 \text{ days} \approx 860,000 \text{ GPU-hours}$.
If we divide our estimated FLOPs ($3.15 \times 10^{23}$) by the total GPU-hours, we get the required throughput per GPU:
$$ \frac{3.15 \times 10^{23}}{860,000} \approx 3.7 \times 10^{17} \text{ FLOPs/GPU/hour} $$
An A100's peak FP16/BF16 tensor-core throughput is about 312 teraFLOP/s, or roughly $1.1 \times 10^{18}$ FLOPs per hour. The scenario therefore implies a sustained utilization of roughly a third of peak, which is realistic: GPUs never run at 100% of theoretical peak, and overheads from data loading and communication between GPUs keep large training runs well below it. The estimate is thus self-consistent, and it matches the $3.14 \times 10^{23}$ FLOPs (about 3,640 petaFLOP/s-days) reported for GPT-3 itself.
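The same sanity check in code (the GPU count, training duration, and A100 peak-throughput figure are the assumed values from this section, not officially reported training details):

```python
# Sanity check: implied per-GPU throughput and utilization under the assumed
# 1,024-A100 / ~35-day scenario (GPT-3 was actually trained on V100s; these
# numbers are only a plausibility check, not the real training configuration).
total_flops = 3.15e23

gpus = 1024
days = 35
gpu_hours = gpus * 24 * days                   # ~860,000 GPU-hours

flops_per_gpu_hour = total_flops / gpu_hours   # ~3.7e17
a100_peak_per_hour = 312e12 * 3600             # ~1.1e18 (BF16 tensor cores)

print(f"GPU-hours:          {gpu_hours:,.0f}")
print(f"FLOPs per GPU-hour: {flops_per_gpu_hour:.2e}")
print(f"Utilization:        {flops_per_gpu_hour / a100_peak_per_hour:.0%}")  # ~33%
```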