See how Mistral Small Creative compares on this challenge:

  • Mistral Small Creative vs Claude Sonnet 4.6
  • Mistral Small Creative vs Gemini 3.1 Pro Preview
  • Mistral Small Creative vs Z.ai: GLM 5

4 models have responded to the "Estimate Complexity" challenge. Compare their approaches side-by-side on RIVAL. This response is part of RIVAL's open dataset of 5,600+ AI model responses.


Estimate Complexity

Mistral Small Creative's response to the 'Estimate Complexity' challenge.

Mistral Small Creative
December 16, 2025
Text

Prompt

Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.

Mistral Small Creative
Response

Estimating the total number of Floating-Point Operations (FLOPs) required to train GPT-3 (specifically, the 175B-parameter variant) involves breaking down the training process into key components and making reasonable assumptions about hyperparameters. Below is a step-by-step breakdown of the reasoning:


1. Key Parameters of GPT-3 (175B)

GPT-3 is a decoder-only transformer with the following approximate specifications (based on public descriptions and reverse-engineering):

  • Model size (parameters): ~175 billion
  • Architecture: 96 layers, 96 attention heads per layer, hidden dimension = 12,288 (so dimension per head = 12,288 / 96 = 128)
  • Vocabulary size: ~50,000 tokens
  • Positional embedding dimension: 12,288 (same as hidden dimension)
  • Activation function: GELU (a few FLOPs per element; negligible next to the matrix multiplies)
  • Normalization: LayerNorm (negligible FLOPs compared to other operations)
  • Attention mechanism: Scaled dot-product attention with learned absolute positional embeddings.
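
As a quick sanity check on the 175B figure, the parameter count can be roughly reproduced from these dimensions. The sketch below is illustrative only; the vocabulary size of 50,257 (the GPT-2/GPT-3 BPE size) and the choice to ignore biases and LayerNorm parameters are simplifying assumptions.

```python
# Rough GPT-3 175B parameter count from the architecture listed above.
# Simplifications: biases and LayerNorm parameters ignored, input/output
# embeddings assumed tied. Vocabulary of 50,257 is the GPT-2/GPT-3 BPE size.
d_model = 12288
n_layers = 96
vocab = 50257
n_ctx = 2048  # learned positional embeddings

embeddings = vocab * d_model + n_ctx * d_model      # token + position tables
attn_per_layer = 4 * d_model * d_model              # Q, K, V, output projections
mlp_per_layer = 2 * d_model * (4 * d_model)         # two linear layers, 4x expansion

total = embeddings + n_layers * (attn_per_layer + mlp_per_layer)
print(f"~{total / 1e9:.0f}B parameters")            # ~175B
```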

2. FLOPs Breakdown per Forward/Backward Pass

Training involves:

  1. Forward pass (compute loss).
  2. Backward pass (compute gradients).
  3. Optimizer step (update weights, negligible FLOPs).

We focus on the forward and backward passes; for transformers the backward pass costs roughly twice the forward pass, so forward + backward is about 3× the forward cost.

A. Self-Attention FLOPs

For a sequence of length \( L \), the self-attention block contributes the following per-layer FLOPs (forward pass, counting each multiply-add as 2 FLOPs):

  • Query/Key/Value projections: \( 3 \times 2 L d_{\text{model}}^2 \)
  • Attention scores \( QK^\top \) and the weighted sum over values: \( 2 \times 2 L^2 d_{\text{model}} \)
  • Softmax: \( O(L^2 n_{\text{heads}}) \), negligible
  • Output projection: \( 2 L d_{\text{model}}^2 \)

For GPT-3 (175B):

  • \( d_{\text{model}} = 12{,}288 \), \( n_{\text{heads}} = 96 \), \( d_{\text{head}} = 128 \), and context length \( L = 2048 \)
  • Per-layer attention FLOPs (forward): \[ 8 L d_{\text{model}}^2 + 4 L^2 d_{\text{model}} \approx 8 \times 2048 \times 1.5 \times 10^8 + 4 \times 2048^2 \times 12{,}288 \approx 2.5 \times 10^{12} + 2.1 \times 10^{11} \approx 2.7 \times 10^{12} \text{ FLOPs per layer (forward)} \]
  • Backward pass: ~2× the forward FLOPs (gradients with respect to both activations and weights).
  • Total per layer (forward + backward): \( \approx 3 \times 2.7 \times 10^{12} \approx 8 \times 10^{12} \) FLOPs per 2048-token sequence.
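
A minimal Python sketch of the attention arithmetic above, under the same assumptions (L = 2048, 2 FLOPs per multiply-add, backward ≈ 2× forward); the variable names are illustrative only.

```python
# Per-layer self-attention FLOPs, forward pass, sequence length L = 2048.
# Convention: 2 FLOPs per multiply-add; softmax and scaling treated as negligible.
d_model, L = 12288, 2048

qkv_proj   = 3 * 2 * L * d_model * d_model   # Q, K, V projections
scores     = 2 * L * L * d_model             # Q @ K^T, summed over all heads
weighted_v = 2 * L * L * d_model             # attention weights @ V
out_proj   = 2 * L * d_model * d_model       # output projection

forward = qkv_proj + scores + weighted_v + out_proj
fwd_bwd = 3 * forward                        # backward pass ~2x the forward pass

print(f"forward ~ {forward:.1e} FLOPs")      # ~2.7e12
print(f"fwd+bwd ~ {fwd_bwd:.1e} FLOPs")      # ~8.0e12
```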

B. MLP FLOPs

The feed-forward network (MLP) in each layer has:

  • Two linear layers: \( d_{\text{model}} \times 4 d_{\text{model}} \) and \( 4 d_{\text{model}} \times d_{\text{model}} \).
  • GELU activation: a few FLOPs per element (negligible next to the matrix multiplies).

For GPT-3:

  • Forward FLOPs per layer: \[ 2 \times 2 L d_{\text{model}} \times 4 d_{\text{model}} = 16 L d_{\text{model}}^2 \approx 16 \times 2048 \times 1.5 \times 10^8 \approx 4.9 \times 10^{12} \text{ FLOPs (forward)} \]
  • Backward pass: ~2× forward FLOPs.
  • Total per layer (forward + backward): \( \approx 3 \times 4.9 \times 10^{12} \approx 1.5 \times 10^{13} \) FLOPs per 2048-token sequence.
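
The same style of sketch for the MLP block, again assuming 2 FLOPs per multiply-add and a 4× hidden expansion:

```python
# Per-layer MLP FLOPs, forward pass, sequence length L = 2048.
d_model, L = 12288, 2048
d_ff = 4 * d_model                           # 4x hidden expansion

up_proj   = 2 * L * d_model * d_ff           # first linear layer
gelu      = 8 * L * d_ff                     # activation; a few FLOPs per element
down_proj = 2 * L * d_ff * d_model           # second linear layer

forward = up_proj + gelu + down_proj
fwd_bwd = 3 * forward                        # backward pass ~2x the forward pass

print(f"forward ~ {forward:.1e} FLOPs")      # ~4.9e12
print(f"fwd+bwd ~ {fwd_bwd:.1e} FLOPs")      # ~1.5e13
```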

C. LayerNorm and Other Overheads

  • LayerNorm: negligible (a few \( d_{\text{model}} \)-sized operations per token per layer).
  • Residual connections: negligible.
  • Total per layer (attention + MLP, forward + backward): \( \approx 8 \times 10^{12} + 1.5 \times 10^{13} \approx 2.3 \times 10^{13} \) FLOPs per 2048-token sequence.

D. Total FLOPs per Token

GPT-3 has 96 layers, and the per-layer figures above are for a full 2048-token sequence, so per token: \[ \text{FLOPs per token} \approx \frac{96 \times 2.3 \times 10^{13}}{2048} \approx 1.1 \times 10^{12} \text{ FLOPs/token} \] This agrees with the standard rule of thumb of roughly 6 FLOPs per parameter per token: \( 6 \times 175 \times 10^9 \approx 1.05 \times 10^{12} \).
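
The same per-token arithmetic as a short sketch, including the comparison against the ~6 FLOPs-per-parameter-per-token rule of thumb (the rule is the usual scaling-law approximation, not something derived here):

```python
# FLOPs per trained token, from the per-layer totals above.
n_layers, L = 96, 2048
per_layer_fwd_bwd = 8.0e12 + 1.5e13           # attention + MLP, per 2048-token sequence

per_token = n_layers * per_layer_fwd_bwd / L
print(f"per token      ~ {per_token:.2e} FLOPs")      # ~1.1e12

# Cross-check against the common ~6 FLOPs per parameter per token rule of thumb.
n_params = 175e9
print(f"6 * N estimate ~ {6 * n_params:.2e} FLOPs")   # ~1.05e12
```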


3. Total Training Tokens

GPT-3 was trained on a diverse dataset (Common Crawl, WebText2, Books1/2, etc.). Estimates suggest:

  • Total tokens: ~300 billion (the commonly cited figure for GPT-3; 300–500B is a reasonable range if unsure).
  • Batch size: on the order of 2–3M tokens per batch (across all GPUs).
  • Training steps: \( \frac{300 \times 10^9}{3 \times 10^6} \approx 100{,}000 \) steps.

4. Total Training FLOPs

Multiply FLOPs per token by total tokens: \[ \text{Total FLOPs} \approx 1.1 \times 10^{12} \times 300 \times 10^9 \approx 3.3 \times 10^{23} \text{ FLOPs} \]

As a cross-check, the standard approximation \( C \approx 6ND \) (about 6 FLOPs per parameter per token) gives \( 6 \times 175 \times 10^9 \times 300 \times 10^9 \approx 3.1 \times 10^{23} \) FLOPs. The exact number depends on assumptions about sequence length, dataset size, and what overhead is counted, but \[ \boxed{\approx 3 \times 10^{23} \text{ FLOPs}} \] is a reasonable estimate.
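
The final multiplication and the \( 6ND \) cross-check as a sketch; the 300B-token figure is the assumption discussed in section 3:

```python
# Total training FLOPs = FLOPs per token * total training tokens.
per_token = 1.1e12            # from the layer-by-layer accounting above
tokens = 300e9                # assumed training set size (~300B tokens)

print(f"layer-by-layer ~ {per_token * tokens:.2e} FLOPs")        # ~3.3e23

# Cross-check with the standard approximation C ~ 6 * N * D.
n_params = 175e9
print(f"6 * N * D      ~ {6 * n_params * tokens:.2e} FLOPs")     # ~3.15e23
```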


5. Comparison to Other Estimates

  • GPT-3 (175B) is usually cited at roughly 3.1 × 10²³ training FLOPs (e.g., in the GPT-3 paper, Brown et al., 2020).
  • PaLM (540B, ~780B tokens) used roughly 2.5 × 10²⁴ FLOPs, about 8× GPT-3, consistent with its larger size and dataset.
  • Chinchilla (70B, ~1.4T tokens) used roughly 5.7 × 10²³ FLOPs; it is smaller than GPT-3 but was trained on far more tokens, so its compute budget is actually somewhat higher.

6. Key Assumptions and Caveats

  1. Sequence length (L): assumed 2048, GPT-3's context window.
  2. Batch size: assumed a few million tokens per batch; this affects the step count but not the total FLOPs.
  3. Dataset size: ~300B tokens; if the true figure were closer to 400–500B, the total scales proportionally to ~4–5 × 10²³ (see the sketch after this list).
  4. Mixed precision: training almost certainly used FP16 mixed precision, which raises hardware throughput but does not change the FLOP count itself.
  5. Overhead: data loading, gradient synchronization, activation recomputation, etc. are ignored (typically a modest fraction of the total).
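
To make caveat 3 concrete, the sketch below sweeps the assumed dataset size over the 300–500B range using the \( 6ND \) approximation, which is enough to see how the headline number moves:

```python
# How the headline estimate moves with the assumed number of training tokens,
# using the C ~ 6 * N * D approximation.
n_params = 175e9

for tokens in (300e9, 400e9, 500e9):
    print(f"{tokens / 1e9:.0f}B tokens -> ~{6 * n_params * tokens:.2e} FLOPs")
# 300B -> ~3.15e23, 400B -> ~4.20e23, 500B -> ~5.25e23
```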

7. Final Answer

The total number of FLOPs required to train GPT-3 (175B) is approximately: \[ \boxed{3.1 \times 10^{23} \text{ FLOPs}} \]

This output is part of our open dataset — 5K+ AI responses across 200 models

About Mistral Small Creative

Capabilities

  • Conversation
  • Analysis

Categories

  • Text

Specifications

  • Provider: Mistral
  • Released: 2025-12-16
  • Size: Small
  • Context: 32,768 tokens

Keep exploring

  • Same prompt: Claude Sonnet 4.6's version (same prompt, different result)
  • Compare: Mistral Small Creative vs Gemini 3.1 Pro Preview (both outputs, side by side)

© 2026 Rival