Loading share button...

Aurora AlphavsOpenAI o3

45 challenges

Why Aurora Alpha?

10 months newerFeb 2026

Why OpenAI o3?

No standout advantages
3 / 45

Estimate Complexity

reasoning·hard

Tests an AI's ability to make educated estimates based on technical knowledge

Aurora Alpha

Aurora Alpha

Step‑by‑step back‑of‑the‑envelope estimate

Below is a simple calculation that uses only publicly‑known high‑level facts about GPT‑3 (size of the model, length of the training data, and the basic arithmetic cost of a Transformer layer).
All numbers are rounded to make the math easy to follow; the goal is to get an order‑of‑magnitude figure rather than an exact count.


1. What we know about GPT‑3

PropertyApproximate value
Number of parameters (the “175 B” model)1.75 × 10¹¹
Hidden dimension d (the size of each token representation)≈ 12 288
Number of Transformer layers L96
Feed‑forward inner dimension dff* (usually 4 × d)≈ 49 152
Size of the training corpus (tokens)≈ 3 × 10¹¹ tokens (≈ 300 B)
Training uses Adam‑style optimizer, mixed‑precision, etc. – we ignore those details for the FLOP count.

These numbers are taken from the model card and the paper; they are the only “hard” data we need.


2. FLOPs per token for a single Transformer layer

A standard Transformer layer consists of two main sub‑blocks:

Sub‑blockMain operations (per token)Approx. FLOPs
Self‑attention (Q, K, V projections + attention scores + weighted sum)3 × d × d (proj) + d × d (QK) + d × d (AV)≈ 5 × d²
Feed‑forward (two linear layers, activation in between)d × d _ff + d × d × 4 (first linear) + 4 × d × d (second linear)≈ 8 × d²

Summing them gives roughly

[ \text{FLOPs per token per layer} ;\approx; (5+8),d^{2}=13,d^{2}. ]

Why the factor 13?

  • The three projection matrices (Q, K, V) each cost d × d → 3 d².
  • Computing the attention scores (QKᵀ) costs d × d.
  • Multiplying the scores by V costs another d × d.
  • The two feed‑forward linear layers each cost 4 d² (because the inner dimension is 4 × d), giving 8 d² total.

3. Plug in the hidden dimension

[ d = 12,288 \quad\Rightarrow\quad d^{2} \approx 1.51\times10^{8}. ]

[ 13,d^{2} \approx 13 \times 1.51\times10^{8} \approx 1.96\times10^{9};\text{FLOPs per token per layer}. ]


4. Include all layers

[ \text{FLOPs per token (forward pass)} = L \times 13,d^{2} \approx 96 \times 1.96\times10^{9} \approx 1.88\times10^{11}. ]

So a single forward pass of one token through the full 96‑layer model costs ≈ 2 × 10¹¹ FLOPs.


5. Account for the whole training corpus

[ \text{Tokens processed} = 3\times10^{11}. ]

[ \text{Forward FLOPs total} = 3\times10^{11} \times 1.88\times10^{11} \approx 5.6\times10^{22}. ]


6. Add the backward pass

During training we must compute gradients. A common rule of thumb is that the backward pass costs about 2 × the forward pass, so the total compute (forward + backward) is roughly 3 × the forward cost.

[ \text{Total training FLOPs} \approx 3 \times 5.6\times10^{22} \approx 1.7\times10^{23}. ]


7. Result (order‑of‑magnitude)

[ \boxed{\text{≈ 10^{23}\ \text{FLOPs (a few × 10^{23})}} ]

In words: training the 175‑billion‑parameter GPT‑3 model required on the order of a hundred sextillion floating‑point operations.

This figure is consistent with the “exascale” compute budgets that have been reported for the largest language‑model training runs.


8. Sources of uncertainty

Source of errorEffect on the estimate
Exact token count (the corpus size is reported as 300 B but the actual number of training steps may be slightly higher)± 10 %
The factor 13 for FLOPs per layer is a simplification (real implementations have extra ops for layer‑norm, bias terms, dropout, etc.)+ ~ 20 %
Mixed‑precision training reduces FLOPs for some operations, but the count above assumes full‑precision arithmetic– ~ 10 %
Over‑head from optimizer state updates, gradient accumulation, checkpointing, etc.+ ~ 10 %

Even allowing for these variations, the estimate stays within 10²³ – 10²⁴ FLOPs, i.e. a few hundred exa‑FLOPs.


Bottom line: Using only the model’s size, the number of layers, and the token count, a straightforward arithmetic model yields a total training compute on the order of 10²³ floating‑point operations for GPT‑3. This is the magnitude of compute that modern “large‑scale” language models demand.

OpenAI o3

OpenAI o3

Below is a simple "back-of-the-envelope" estimate that uses only a few public facts about GPT-3 and very generic rules of thumb for counting operations.

Step 1. Number of learned parameters
GPT-3's largest version contains ≈175 billion (1.75 × 10¹¹) trainable parameters.

Step 2. How much data was seen during training
OpenAI reports that the model was trained on about 300 billion tokens (3 × 10¹¹).

Step 3. How many floating-point operations does one parameter cost per token?
A practical rule of thumb for Transformer language models is
 Training FLOPs ≈ 6 × (# parameters) × (# training tokens)

Where does the factor 6 come from?

• Forward pass: every parameter is involved in one multiply-accumulate → 2 FLOPs
• Back-prop gradient computation: roughly another 2 FLOPs
• Parameter-update (optimizer) and miscellaneous overhead: ≈2 FLOPs
Total ≈6 FLOPs per parameter per token.

Step 4. Plug in the numbers
Total FLOPs = 6 × (1.75 × 10¹¹) × (3 × 10¹¹)
      = 6 × 5.25 × 10²²
      ≈ 3.15 × 10²³ FLOPs.

Step 5. Express the answer in more familiar units
3.15 × 10²³ ≈ 315 zettaFLOPs (1 ZFLOP = 10²¹ FLOPs).

Step 6. Sanity check with hardware-time intuition
An NVIDIA A100 delivers ≈ 312 teraFLOP/s (0.312 PFLOP/s) in mixed precision.
Wall-clock time on a single A100:
 3.15 × 10²³ / 3.12 × 10¹⁴ ≈ 1.0 × 10⁹ s ≈ 32 years.
Running on 1,000 A100s in parallel would take ≈12 days, which is the right ballpark for a large industrial training run.

Final estimate
Training GPT-3 (175 B parameters, 300 B tokens) requires on the order of 3 × 10²³ floating-point operations, i.e. roughly 300 zettaFLOPs.

Vote
2 / 45

AI Board Game Logic

reasoning·medium

Tests an AI's ability to understand game rules and strategy

Aurora Alpha

Aurora Alpha

OpenAI o3

OpenAI o3

Vote
3 / 45

Logic Puzzle

reasoning·medium

Tests an AI's ability to solve a simple but potentially confusing logic puzzle

Aurora Alpha

Aurora Alpha

Sally has one sister.

Reasoning

  • Sally herself is a girl.
  • She has 3 brothers.
  • Each brother is said to have 2 sisters.
    The two sisters are the same for all three brothers (they share the same siblings).

Thus the total number of sisters in the family is 2. One of those sisters is Sally, leaving one other sister for her. So Sally has one sister.

OpenAI o3

OpenAI o3

Let
• S = number of sisters in the family (girls)
• B = number of brothers in the family (boys)

Given

  1. Sally has 3 brothers ⇒ B = 3.
  2. Each brother has 2 sisters ⇒ every boy sees exactly 2 girls in the sibling group.
    Therefore S = 2.

Since Sally herself is one of those 2 sisters, the number of sisters Sally has (other than herself) is

2 – 1 = 1.

Sally has 1 sister.

Vote
Loading