Skip to content
Rival
Models
CompareBest ForArena
Sign Up
Sign Up

Compare AI vibes, not scores. Side-by-side outputs across the world's best models.

@rival_tips

Explore

  • Compare Models
  • All Models
  • Leaderboard
  • Challenges

Discover

  • AI Creators
  • AI Tools
  • The Graveyard

Developers

  • Developer Hub
  • MCP Server
  • .llmignore
  • Badges
  • RIVAL Datasets

Connect

  • Methodology
  • Sponsor
  • Partnerships
  • Privacy Policy
  • Terms
  • RSS Feed
© 2026 Rival
Rival
Models
CompareBest ForArena
Sign Up
Sign Up

Compare AI vibes, not scores. Side-by-side outputs across the world's best models.

@rival_tips

Explore

  • Compare Models
  • All Models
  • Leaderboard
  • Challenges

Discover

  • AI Creators
  • AI Tools
  • The Graveyard

Developers

  • Developer Hub
  • MCP Server
  • .llmignore
  • Badges
  • RIVAL Datasets

Connect

  • Methodology
  • Sponsor
  • Partnerships
  • Privacy Policy
  • Terms
  • RSS Feed
© 2026 Rival
  1. Home
  2. Models
  3. Mistral Medium 3.1
  4. Estimate Complexity

See how Mistral Medium 3.1 compares on this challenge:

  • Mistral Medium 3.1 vs Claude Sonnet 4.6
  • Mistral Medium 3.1 vs Gemini 3.1 Pro Preview
  • Mistral Medium 3.1 vs Z.ai: GLM 5

4 models have responded to the "Estimate Complexity" challenge. Compare their approaches side-by-side on RIVAL. This response is part of RIVAL's open dataset of 5,600+ AI model responses.

Rival
Models
CompareBest ForArena
Sign Up
Sign Up
Back to Mistral Medium 3.1

Estimate Complexity

Mistral Medium 3.1's response to the 'Estimate Complexity' challenge.

Mistral Medium 3.1
August 13, 2025
Text

?
Prompt

Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.

Mistral Medium 3.1
Response

mistral-medium-3.1 logo
Mistral Medium 3.1Estimate Complexity
mistral-medium-3.1 logo
Mistral Medium 3.1

Estimating the total number of FLOPs (floating-point operations) required to train a model like GPT-3 from scratch involves breaking down the problem into several key components. Here's a step-by-step reasoning process:


1. Model Parameters

GPT-3 is known to have 175 billion parameters. This is our starting point.


2. FLOPs per Forward/Backward Pass

For a transformer-based model like GPT-3, the dominant computational cost comes from the matrix multiplications in the attention layers and the fully connected (MLP) layers. The number of FLOPs per forward pass can be approximated as follows:

a) Attention Layers

For a transformer with:

  • Sequence length (S): GPT-3 uses a context window of up to 2048 tokens.
  • Embedding dimension (d): Typically, ( d = 12288 ) for GPT-3 (since ( 175B ) parameters are divided across layers, and ( d ) is derived from the hidden size).
  • Number of heads (h): GPT-3 uses 96 heads, so ( d/h = 128 ) (head dimension).

The self-attention operation involves:

  1. Query, Key, Value projections: ( 3 \times S \times d \times d ) FLOPs (since each token's embedding is projected to Q, K, V).
  2. Attention scores: ( S \times S \times d ) FLOPs (softmax over ( QK^T )).
  3. Attention output: ( S \times S \times d ) FLOPs (weighted sum of V).

Total for attention per layer: ( \approx 2S^2d + 4Sd^2 ).

For ( S = 2048 ) and ( d = 12288 ):

  • ( 2S^2d = 2 \times 2048^2 \times 12288 \approx 1.02 \times 10^{11} ) FLOPs.
  • ( 4Sd^2 = 4 \times 2048 \times 12288^2 \approx 1.22 \times 10^{12} ) FLOPs.
  • Total per layer: ( \approx 1.32 \times 10^{12} ) FLOPs.

GPT-3 has 96 layers, so total for attention: ( 96 \times 1.32 \times 10^{12} \approx 1.27 \times 10^{14} ) FLOPs.

b) MLP Layers

Each MLP layer has:

  • Two matrix multiplications: ( d \times 4d ) (expand) and ( 4d \times d ) (contract).
  • FLOPs per MLP layer: ( 2 \times S \times d \times 4d = 8Sd^2 ).

For ( S = 2048 ) and ( d = 12288 ):

  • ( 8Sd^2 = 8 \times 2048 \times 12288^2 \approx 2.44 \times 10^{12} ) FLOPs per layer.
  • Total for 96 layers: ( 96 \times 2.44 \times 10^{12} \approx 2.34 \times 10^{14} ) FLOPs.

c) Other Operations

  • Layer normalization, residual connections, and softmax contribute relatively fewer FLOPs (e.g., ( \approx 10% ) of the total).
  • Total for other ops: ( \approx 0.1 \times (1.27 + 2.34) \times 10^{14} \approx 3.61 \times 10^{13} ) FLOPs.

d) Total FLOPs per Forward Pass

Summing up:

  • Attention: ( 1.27 \times 10^{14} ).
  • MLP: ( 2.34 \times 10^{14} ).
  • Other: ( 0.36 \times 10^{14} ).
  • Total forward pass: ( \approx 3.97 \times 10^{14} ) FLOPs.

e) Backward Pass

The backward pass is typically 2-3x the cost of the forward pass due to gradient computations. Assuming 2x:

  • Total backward pass: ( \approx 7.94 \times 10^{14} ) FLOPs.

f) Total FLOPs per Training Step

  • Forward + backward: ( 3.97 \times 10^{14} + 7.94 \times 10^{14} \approx 1.19 \times 10^{15} ) FLOPs.

3. Training Tokens

GPT-3 was trained on ~300 billion tokens. Assuming:

  • Batch size: 3.2 million tokens (as reported for GPT-3).
  • Number of steps: ( 300 \text{ billion} / 3.2 \text{ million} \approx 93,750 ) steps.

4. Total Training FLOPs

Multiply FLOPs per step by the number of steps:

  • ( 1.19 \times 10^{15} \text{ FLOPs/step} \times 93,750 \text{ steps} \approx 1.12 \times 10^{20} ) FLOPs.

5. Adjustments

  • Mixed Precision: GPT-3 likely used mixed-precision (FP16/FP32), which can reduce FLOPs by ~2x (since some ops are in FP16). However, the theoretical FLOP count remains the same (hardware just executes faster).
  • Activation Recomputation: Some frameworks recompute activations during the backward pass to save memory, which can increase FLOPs by ~20-30%. Adding 25%:
    • ( 1.12 \times 10^{20} \times 1.25 \approx 1.40 \times 10^{20} ) FLOPs.
  • Optimizer Overhead: Adam optimizer adds ~10-20% FLOPs. Adding 15%:
    • ( 1.40 \times 10^{20} \times 1.15 \approx 1.61 \times 10^{20} ) FLOPs.

6. Final Estimate

After accounting for all factors, the total FLOPs for training GPT-3 is approximately: ~1.6 × 10²⁰ FLOPs (160 exaFLOPs).

This aligns with rough estimates from other sources (e.g., the original GPT-3 paper implies a similar order of magnitude).


Key Assumptions:

  1. Sequence length ( S = 2048 ).
  2. Hidden dimension ( d = 12288 ) (derived from 175B parameters and 96 layers).
  3. Batch size = 3.2M tokens.
  4. 300B total training tokens.
  5. Mixed-precision and activation recomputation adjustments.
Estimate Complexity

Turn this model response into notes, narration, or a short video

Auto NotesTranscriptNarrateVideo Cut

Partner link

This output is part of our open dataset — 5K+ AI responses across 200 models

About Mistral Medium 3.1

Capabilities

ConversationReasoningCode GenerationAnalysis

Categories

TextCodeMultimodal

Specifications

Provider
Mistral
Released
2025-08-13
Size
LARGE
Context
131,072 tokens

Keep exploring

SAME PROMPT

Claude Sonnet 4.6's version

Same prompt, different result

COMPARE

Mistral Medium 3.1 vs Gemini 3.1 Pro Preview

Both outputs, side by side

Compare AI vibes, not scores. Side-by-side outputs across the world's best models.

@rival_tips

Explore

  • Compare Models
  • All Models
  • Leaderboard
  • Challenges

Discover

  • AI Creators
  • AI Tools
  • The Graveyard

Developers

  • Developer Hub
  • MCP Server
  • .llmignore
  • Badges
  • RIVAL Datasets

Connect

  • Methodology
  • Sponsor
  • Partnerships
  • Privacy Policy
  • Terms
  • RSS Feed
© 2026 Rival