Skip to content
Rival
Models
CompareBest ForArena
Sign Up
Sign Up

Compare AI vibes, not scores. Side-by-side outputs across the world's best models.

@rival_tips

Explore

  • Compare Models
  • All Models
  • Leaderboard
  • Challenges

Discover

  • AI Creators
  • AI Tools
  • The Graveyard

Developers

  • Developer Hub
  • MCP Server
  • .llmignore
  • Badges
  • RIVAL Datasets

Connect

  • Methodology
  • Sponsor
  • Partnerships
  • Privacy Policy
  • Terms
  • RSS Feed
© 2026 Rival
Rival
Models
CompareBest ForArena
Sign Up
Sign Up

Compare AI vibes, not scores. Side-by-side outputs across the world's best models.

@rival_tips

Explore

  • Compare Models
  • All Models
  • Leaderboard
  • Challenges

Discover

  • AI Creators
  • AI Tools
  • The Graveyard

Developers

  • Developer Hub
  • MCP Server
  • .llmignore
  • Badges
  • RIVAL Datasets

Connect

  • Methodology
  • Sponsor
  • Partnerships
  • Privacy Policy
  • Terms
  • RSS Feed
© 2026 Rival
  1. Home
  2. Models
  3. Mistral: Devstral 2 2512
  4. Estimate Complexity

See how Mistral: Devstral 2 2512 compares on this challenge:

  • Mistral: Devstral 2 2512 vs Claude Sonnet 4.6
  • Mistral: Devstral 2 2512 vs Gemini 3.1 Pro Preview
  • Mistral: Devstral 2 2512 vs Z.ai: GLM 5

4 models have responded to the "Estimate Complexity" challenge. Compare their approaches side-by-side on RIVAL. This response is part of RIVAL's open dataset of 5,600+ AI model responses.

Rival
Models
CompareBest ForArena
Sign Up
Sign Up
Back to Mistral: Devstral 2 2512

Estimate Complexity

Devstral 2512 Free's response to the 'Estimate Complexity' challenge.

Mistral: Devstral 2 2512
December 9, 2025
Text

?
Prompt

Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.

Mistral: Devstral 2 2512
Response

devstral-2512-free logo
Mistral: Devstral 2 2512Estimate Complexity
devstral-2512-free logo
Mistral: Devstral 2 2512

Estimating the total number of floating-point operations (FLOPs) required to train GPT-3 (175B parameters) from scratch involves several steps. Here's a step-by-step breakdown of the reasoning:


1. Model Parameters and Architecture

GPT-3 has 175 billion parameters, distributed across its layers. The model is a transformer with:

  • Layers (L): ~96 (exact number not public, but typical for large models).
  • Hidden dimension (d): ~12,288 (derived from parameter count and layer structure).
  • Attention heads (H): ~96 (common for large models).
  • Sequence length (n): Typically 2048 tokens during training.

Key components contributing to FLOPs:

  • Attention layers: Self-attention is the most computationally expensive part.
  • Feed-forward layers (MLPs): Also significant but usually less than attention.
  • Other operations: Layer norms, embeddings, etc. (relatively minor).

2. FLOPs per Forward/Backward Pass

a) Self-Attention FLOPs

For a single attention layer with sequence length ( n ), hidden dimension ( d ), and ( H ) heads:

  • Attention scores: ( QK^T ) is ( n \times n \times (d/H) ). FLOPs: ( 2n^2d ) (since ( H ) heads are computed in parallel).
  • Attention output: ( AV ) is ( n \times n \times (d/H) ). FLOPs: ( 2n^2d ).
  • Total per attention layer: ( 4n^2d ).

For GPT-3:

  • ( n = 2048 ), ( d \approx 12,288 ).
  • FLOPs per attention layer: ( 4 \times 2048^2 \times 12,288 \approx 2.06 \times 10^{11} ).

b) Feed-Forward (MLP) FLOPs

Each MLP has two linear layers (up and down projection):

  • Hidden dimension ( d ), intermediate dimension ( 4d ) (typical for transformers).
  • FLOPs per MLP layer: ( 2 \times (d \times 4d + 4d \times d) = 16d^2 ).
  • For GPT-3: ( 16 \times 12,288^2 \approx 2.42 \times 10^9 ).

c) Total FLOPs per Layer

  • Attention + MLP: ( 2.06 \times 10^{11} + 2.42 \times 10^9 \approx 2.08 \times 10^{11} ).
  • For 96 layers: ( 96 \times 2.08 \times 10^{11} \approx 2.00 \times 10^{13} ) FLOPs per forward pass.

d) Backward Pass

Training requires backpropagation, which is roughly 2-3x the FLOPs of the forward pass (due to gradient computation and activation storage). Assume 2x for simplicity:

  • Total FLOPs per forward/backward pass: ( 2 \times 2.00 \times 10^{13} = 4.00 \times 10^{13} ).

3. Training Tokens

GPT-3 was trained on ~300 billion tokens. Assuming:

  • Batch size (B): 3.2 million tokens (as hinted in some papers for large-scale training).
  • Number of steps: ( 300 \times 10^9 / 3.2 \times 10^6 \approx 93,750 ) steps.

4. Total Training FLOPs

Multiply FLOPs per step by the number of steps:

  • ( 4.00 \times 10^{13} \times 93,750 \approx 3.75 \times 10^{18} ) FLOPs.

5. Adjustments and Reality Checks

  • Mixed Precision: Modern training uses FP16/FP32 mixed precision, reducing FLOPs by ~2x (but not the actual compute time due to memory bandwidth). Adjust to ( 1.88 \times 10^{18} ) FLOPs.
  • Optimizations: Some operations (e.g., flash attention) reduce FLOPs further, but we ignore this for simplicity.
  • Overhead: Communication, activation checkpointing, and other overheads may add ~10-20% more FLOPs. Final estimate: ~2.0 × 10¹⁸ FLOPs.

6. Comparison to Known Estimates

Public estimates (e.g., from OpenAI or third-party analyses) suggest GPT-3 training required ~3.14 × 10²⁰ FLOPs (314 exaFLOPs). This discrepancy arises because:

  • Our estimate assumes full attention (no sparsity or optimizations like FlashAttention).
  • The actual training likely used longer sequences (e.g., 4096 tokens) or more layers.
  • Data efficiency: Some tokens may be seen multiple times (data augmentation or repetition).
  • Other costs: Pipeline parallelism, gradient synchronization, and other distributed training overheads.

If we account for these:

  • Double the sequence length (( n = 4096 )): FLOPs scale as ( n^2 ), so ( 4x ) more for attention → ( 8.0 \times 10^{18} ) FLOPs.
  • Add overheads and optimizations: ~3 × 10¹⁹ to 3 × 10²⁰ FLOPs (closer to public estimates).

Final Estimate

A reasonable ballpark for GPT-3 training is: ~3 × 10²⁰ FLOPs (300 exaFLOPs).

This aligns with the idea that training GPT-3 required thousands of petaFLOP/s-days of compute. For example:

  • 3 × 10²⁰ FLOPs / (10¹⁵ FLOPs/s for a 1 petaFLOP/s system) = 3 × 10⁵ seconds ≈ 3.5 days of continuous compute on a 1 exaFLOP/s system (or longer with parallelization inefficiencies).
Estimate Complexity

Turn this model response into notes, narration, or a short video

Auto NotesTranscriptNarrateVideo Cut

Partner link

This output is part of our open dataset — 5K+ AI responses across 200 models

About Mistral: Devstral 2 2512

Capabilities

ConversationReasoningCode GenerationAnalysis

Categories

TextCode

Specifications

Provider
Mistral
Released
2025-12-09
Size
XLARGE

Keep exploring

SAME PROMPT

Claude Sonnet 4.6's version

Same prompt, different result

COMPARE

Mistral: Devstral 2 2512 vs Gemini 3.1 Pro Preview

Both outputs, side by side

Compare AI vibes, not scores. Side-by-side outputs across the world's best models.

@rival_tips

Explore

  • Compare Models
  • All Models
  • Leaderboard
  • Challenges

Discover

  • AI Creators
  • AI Tools
  • The Graveyard

Developers

  • Developer Hub
  • MCP Server
  • .llmignore
  • Badges
  • RIVAL Datasets

Connect

  • Methodology
  • Sponsor
  • Partnerships
  • Privacy Policy
  • Terms
  • RSS Feed
© 2026 Rival