
See how Polaris Alpha compares on this challenge:

  • Polaris Alpha vs Claude Sonnet 4.6
  • Polaris Alpha vs Gemini 3.1 Pro Preview
  • Polaris Alpha vs Z.ai: GLM 5

4 models have responded to the "Estimate Complexity" challenge. Compare their approaches side-by-side on RIVAL. This response is part of RIVAL's open dataset of 5,600+ AI model responses.


Estimate Complexity

Polaris Alpha's response to the 'Estimate Complexity' challenge.

Polaris Alpha
November 7, 2025
Text

Prompt

Estimate the total number of FLOPs required to train GPT-3 without searching online. Explain your reasoning step by step.

Polaris Alpha
Response


Great question—and nice constraint.

I’ll estimate the total training FLOPs for GPT-3 from first principles, using only generally known architectural/training patterns (no external lookup). I’ll walk through:

  • Model assumptions
  • FLOPs per token
  • FLOPs per step
  • Total training tokens
  • Multiply it out
  • Sanity checks

We’ll target the 175B-parameter GPT-3-like model.

  1. Model assumptions

We need reasonable approximations:

  • Parameters: ~175B
  • Architecture: decoder-only Transformer
    • Layers: ~96
    • Hidden size: ~12,288
    • Attention heads: e.g., 96 (128-dim per head)
  • Context length: 2048 tokens
  • Vocabulary size: ~50k
  • Data: on the order of 300 billion tokens (common rough figure)
  • Training objective: standard autoregressive LM
  • Precision: assume training in something like mixed precision (but we’ll count math ops independently of dtype).

We’ll estimate multiply-adds as 2 FLOPs (1 mult + 1 add). Some conventions count a fused multiply-add as 1 FLOP; I’ll be explicit where it matters.
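To keep the arithmetic below traceable, here is a minimal sketch of these assumptions as Python constants. The values simply mirror the list above; the names are illustrative, and the 2-FLOPs-per-multiply-add convention is the stated choice, not a universal standard.

```python
# Assumed GPT-3-scale hyperparameters (approximate; taken from the list above).
N_PARAMS = 175e9        # total parameters
N_LAYERS = 96           # decoder layers
D_MODEL = 12_288        # hidden size
N_HEADS = 96            # attention heads, 128-dim each
SEQ_LEN = 2_048         # context length L
VOCAB = 50_000          # vocabulary size (rounded)
N_TOKENS = 300e9        # training tokens (rough figure)
FLOPS_PER_MAC = 2       # count one multiply-add as 2 FLOPs
```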

  2. FLOPs per token (Transformer forward)

We’ll approximate only the dominant costs:

Per Transformer layer and per token, the main pieces:

  1. QKV projections:
  • Input dimension: d_model
  • Three linear maps: Q, K, V
  • Cost: 3 * d_model^2 multiplications (ignoring bias)
  • FLOPs (with mult+add): ≈ 2 * 3 * d_model^2 = 6 d_model^2
  2. Attention scores and weighting (for sequence length L):
  • Compute QK^T:
    • Per head: an (L × d_k) by (d_k × L) matrix product → ~L^2 * d_k multiplications
    • Summed over all heads: L^2 * d_model
    • FLOPs ≈ 2 * L^2 * d_model
  • Apply softmax (cheaper; we’ll ignore or fold into constants).
  • Multiply by V:
    • Same order: L^2 * d_model
    • FLOPs ≈ 2 * L^2 * d_model

Total attention mixing ≈ 4 * L^2 * d_model FLOPs per layer (forward).

  3. Output projection from concatenated heads:
  • One linear: d_model x d_model
  • FLOPs ≈ 2 * d_model^2
  4. MLP (feed-forward): typical config d_ff ≈ 4 * d_model
  • First linear: d_model → d_ff
    • FLOPs ≈ 2 * d_model * d_ff
  • Activation (e.g., GeLU): relatively small; we can approximate or fold in.
  • Second linear: d_ff → d_model
    • FLOPs ≈ 2 * d_ff * d_model

Total MLP FLOPs ≈ 4 * d_model * d_ff ≈ 4 * d_model * (4 d_model) = 16 d_model^2

Now sum dominant per-layer forward FLOPs (per token vs. sequence-level is easy to mix up, so let’s be precise):

Careful: QK^T and AV are sequence-wide operations; their cost scales with L^2 rather than linearly in the number of tokens. So let’s compute everything per sequence of length L and scale by the number of sequences later.

Per layer, per sequence (forward):

  • QKV projections: 6 d_model^2 * L
  • Output projection: 2 d_model^2 * L
  • MLP: 16 d_model^2 * L
  • Attention mixes:
    • QK^T: 2 L^2 d_model
    • AV: 2 L^2 d_model
    • Total attention mixing: 4 L^2 d_model

Total per layer per sequence (forward):

F_layer_fwd ≈ (6 + 2 + 16) d_model^2 L + 4 L^2 d_model = 24 d_model^2 L + 4 L^2 d_model
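As a sketch, that formula translates directly into a small helper; the function name `layer_forward_flops` is illustrative rather than from any library, and it counts only the dominant terms derived above.

```python
def layer_forward_flops(d_model: int, seq_len: int) -> float:
    """Approximate forward FLOPs for one Transformer layer over one sequence.

    Dense terms (per token): QKV 6*d^2 + output projection 2*d^2 + MLP 16*d^2.
    Attention mixing (per sequence): QK^T and AV together cost 4*L^2*d.
    """
    dense = 24 * d_model**2 * seq_len      # (6 + 2 + 16) * d_model^2 * L
    attn_mix = 4 * seq_len**2 * d_model    # 2*L^2*d (QK^T) + 2*L^2*d (AV)
    return float(dense + attn_mix)
```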

Now plug GPT-3-like numbers:

  • d_model = 12,288
  • L = 2048
  • Layers N = 96

First term: 24 d_model^2 L

  • d_model^2 = 12,288^2 ≈ 1.51 × 10^8
  • 24 * d_model^2 ≈ 24 * 1.51e8 ≈ 3.62e9
  • Multiply by L = 2048:
    • 3.62e9 * 2.048e3 ≈ 7.41e12

So ≈ 7.4 × 10^12 FLOPs from the dense parts per layer per sequence.

Second term: 4 L^2 d_model

  • L^2 = 2048^2 = 4,194,304 ≈ 4.19e6
  • 4 * L^2 = 1.68e7
  • Multiply by d_model (1.2288e4):
    • 1.68e7 * 1.2288e4 ≈ 2.06e11

So attention mixing ≈ 2.1 × 10^11 FLOPs per layer per sequence.

Compare:

  • Dense (24 d_model^2 L): ~7.4e12
  • Attention mixing (4 L^2 d_model): ~2.1e11

Dense terms dominate for this configuration (large d_model, moderate L).

Total per layer per sequence (forward): ≈ 7.4e12 + 0.21e12 ≈ 7.6e12 FLOPs

Now with N = 96 layers:

  • F_fwd_per_seq ≈ 96 * 7.6e12 ≈ 7.3e14 FLOPs

Add output softmax / embedding:

  • Final projection: d_model × vocab ≈ 12,288 * 50,000 ≈ 6.14e8 weights
  • FLOPs per token for logits: ≈ 2 * 6.14e8
  • Per sequence (L tokens): ≈ 2 * 6.14e8 * 2048 ≈ 2.5e12 FLOPs

This is small compared to 7.3e14, so we can ignore it at this scale.

So forward pass per 2048-token sequence: ≈ 7 × 10^14 FLOPs (rounding slightly).
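A quick numeric check of that forward-pass arithmetic, reusing the helper sketched earlier (the logits term uses the rounded 50k vocabulary; comments show approximate results):

```python
per_layer = layer_forward_flops(12_288, 2_048)   # ≈ 7.6e12 FLOPs per layer per sequence
all_layers = 96 * per_layer                      # ≈ 7.3e14 across 96 layers
logits = 2 * 12_288 * 50_000 * 2_048             # ≈ 2.5e12, negligible at this scale
fwd_per_seq = all_layers + logits                # ≈ 7.3e14 FLOPs per 2048-token sequence
```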

  3. Include backward pass

Training requires:

  • Forward pass
  • Backprop through all layers

Backprop cost is roughly:

  • ≈ 2× the forward cost (gradients are computed with respect to both activations and weights), so a full forward + backward step is ≈ 3× the forward pass.

I’ll use 3× the forward cost for “forward + backward”.

Per sequence (train step on one sequence): F_train_per_seq ≈ 3 * 7 × 10^14 ≈ 2.1 × 10^15 FLOPs
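Continuing the sketch (it keeps the unrounded 7.3 × 10^14 forward cost, so it lands a touch above the rounded 2.1 × 10^15 in the prose):

```python
train_per_seq = 3 * fwd_per_seq   # forward + backward ≈ 3x forward, ≈ 2.2e15 FLOPs per sequence
```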

  4. Relate sequences to total tokens

Assume total training tokens T_total ≈ 300 billion = 3 × 10^11 tokens.

Each sequence has L = 2048 tokens:

  • Number of sequences: S = T_total / L ≈ (3 × 10^11) / 2048

2048 ≈ 2.048 × 10^3

S ≈ (3 × 10^11) / (2.048 × 10^3) ≈ (3 / 2.048) × 10^8 ≈ 1.46 × 10^8 sequences

  5. Total training FLOPs

Total FLOPs:

F_total ≈ F_train_per_seq * S ≈ (2.1 × 10^15) * (1.46 × 10^8)

First multiply mantissas: 2.1 * 1.46 ≈ 3.066

Add exponents: 10^15 * 10^8 = 10^23

So: F_total ≈ 3.1 × 10^23 FLOPs
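And the final multiplication in the same sketch; because the code keeps unrounded intermediate values it comes out slightly above the hand-rounded figure, but both land at roughly 3 × 10^23:

```python
n_sequences = 300e9 / 2_048                 # ≈ 1.46e8 sequences of 2048 tokens each
total_flops = train_per_seq * n_sequences   # ≈ 3.2e23 FLOPs total
print(f"total training FLOPs ≈ {total_flops:.1e}")
```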

  6. Sanity checks and refinement

We got ≈ 3 × 10^23 FLOPs.

Let’s cross-check using simpler rules-of-thumb from the reasoning itself (still not “looking up”):

  • A common heuristic: training FLOPs ≈ 6 * N_params * N_tokens. Explanation: the forward pass costs about 2 FLOPs per parameter per token (one multiply-add per weight), and forward + backward is about 3× the forward pass, which gives the factor of 6.

Using:

  • N_params ≈ 1.75 × 10^11
  • N_tokens ≈ 3 × 10^11

Then: F_total ≈ 6 * (1.75 × 10^11) * (3 × 10^11) ≈ 6 * 5.25 × 10^22 ≈ 3.15 × 10^23 FLOPs

This aligns extremely well with our more detailed layer-by-layer estimate (3.1 × 10^23).
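The cross-check is a one-liner under the same assumptions:

```python
heuristic_flops = 6 * 175e9 * 300e9   # 6 * N_params * N_tokens ≈ 3.15e23 FLOPs
```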

So both independent approaches converge on:

Answer:

A reasonable estimate for the total compute required to train GPT-3 (175B) is on the order of:

  • ~3 × 10^23 floating-point operations.

If you’d like, I can break down what that implies in terms of GPU-days for specific hardware.


About Polaris Alpha

Capabilities

Conversation · Reasoning · Code Generation · Analysis · Tool Use

Categories

Text · Code · Multimodal

Specifications

Provider: Openrouter
Released: 2025-11-06
Size: LARGE
Context: 256,000 tokens

