
Aurora Alpha vs Gemini Pro 1.0

26 challenges

Why Aurora Alpha?

  • 3.9× more context (128K)
  • 26 months newer (Feb 2026)

Why Gemini Pro 1.0?

No standout advantages
1 / 26

Estimate Complexity

reasoning·hard

Tests an AI's ability to make educated estimates based on technical knowledge

Aurora Alpha

Step‑by‑step back‑of‑the‑envelope estimate

Below is a simple calculation that uses only publicly‑known high‑level facts about GPT‑3 (size of the model, length of the training data, and the basic arithmetic cost of a Transformer layer).
All numbers are rounded to make the math easy to follow; the goal is to get an order‑of‑magnitude figure rather than an exact count.


1. What we know about GPT‑3

| Property | Approximate value |
| --- | --- |
| Number of parameters (the “175 B” model) | 1.75 × 10¹¹ |
| Hidden dimension d (the size of each token representation) | ≈ 12 288 |
| Number of Transformer layers L | 96 |
| Feed‑forward inner dimension d_ff (usually 4 × d) | ≈ 49 152 |
| Size of the training corpus | ≈ 3 × 10¹¹ tokens (≈ 300 B) |

Training uses an Adam‑style optimizer, mixed precision, etc.; we ignore those details for the FLOP count.

These numbers are taken from the model card and the paper; they are the only “hard” data we need.


2. FLOPs per token for a single Transformer layer

A standard Transformer layer consists of two main sub‑blocks:

| Sub‑block | Main operations (per token) | Approx. FLOPs |
| --- | --- | --- |
| Self‑attention (Q, K, V projections + attention scores + weighted sum) | 3 × d × d (projections) + d × d (QKᵀ) + d × d (AV) | ≈ 5 × d² |
| Feed‑forward (two linear layers, activation in between) | d × d_ff (first linear) + d_ff × d (second linear), with d_ff = 4 × d | ≈ 8 × d² |

Summing them gives roughly

\[ \text{FLOPs per token per layer} \;\approx\; (5+8)\,d^{2} = 13\,d^{2}. \]

Why the factor 13?

  • The three projection matrices (Q, K, V) each cost d × d → 3 d².
  • Computing the attention scores (QKᵀ) costs d × d.
  • Multiplying the scores by V costs another d × d.
  • The two feed‑forward linear layers each cost 4 d² (because the inner dimension is 4 × d), giving 8 d² total.

3. Plug in the hidden dimension

\[ d = 12\,288 \quad\Rightarrow\quad d^{2} \approx 1.51\times10^{8}. \]

\[ 13\,d^{2} \approx 13 \times 1.51\times10^{8} \approx 1.96\times10^{9}\ \text{FLOPs per token per layer}. \]


4. Include all layers

\[ \text{FLOPs per token (forward pass)} = L \times 13\,d^{2} \approx 96 \times 1.96\times10^{9} \approx 1.88\times10^{11}. \]

So a single forward pass of one token through the full 96‑layer model costs ≈ 2 × 10¹¹ FLOPs.
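
As a quick sanity check, here is a minimal Python sketch of the same arithmetic, using only the simplified 13 d² per‑layer count and the values d = 12 288, L = 96 quoted above:

```python
# Back-of-the-envelope forward-pass cost per token (simplified 13*d^2 per layer).
d = 12_288            # hidden dimension
n_layers = 96         # Transformer layers

flops_per_token_per_layer = 13 * d**2                 # 5*d^2 (attention) + 8*d^2 (feed-forward)
flops_per_token_forward = n_layers * flops_per_token_per_layer

print(f"per layer:  {flops_per_token_per_layer:.2e} FLOPs/token")  # ≈ 1.96e+09
print(f"full model: {flops_per_token_forward:.2e} FLOPs/token")    # ≈ 1.88e+11
```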


5. Account for the whole training corpus

\[ \text{Tokens processed} = 3\times10^{11}. \]

\[ \text{Forward FLOPs total} = 3\times10^{11} \times 1.88\times10^{11} \approx 5.6\times10^{22}. \]


6. Add the backward pass

During training we must compute gradients. A common rule of thumb is that the backward pass costs about 2 × the forward pass, so the total compute (forward + backward) is roughly 3 × the forward cost.

\[ \text{Total training FLOPs} \approx 3 \times 5.6\times10^{22} \approx 1.7\times10^{23}. \]
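
Extending the sketch with the corpus size and the forward‑plus‑backward factor from steps 5–6 reproduces the total (a rough check of the arithmetic, not a precise accounting):

```python
# Total training compute: per-token forward cost, times tokens, times (forward + backward) factor.
d, n_layers = 12_288, 96
tokens = 3e11                                    # ≈ 300 B training tokens
flops_per_token_forward = n_layers * 13 * d**2   # ≈ 1.88e11, as computed above

forward_total = tokens * flops_per_token_forward   # ≈ 5.6e22 FLOPs
training_total = 3 * forward_total                 # backward pass ≈ 2× forward

print(f"forward only:  {forward_total:.1e} FLOPs")   # ≈ 5.7e+22
print(f"with backward: {training_total:.1e} FLOPs")  # ≈ 1.7e+23
```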


7. Result (order‑of‑magnitude)

\[ \boxed{\text{Total training compute} \;\approx\; \text{a few} \times 10^{23}\ \text{FLOPs}} \]

In words: training the 175‑billion‑parameter GPT‑3 model required on the order of a hundred sextillion floating‑point operations.

This figure is consistent with the “exascale” compute budgets that have been reported for the largest language‑model training runs.


8. Sources of uncertainty

| Source of error | Effect on the estimate |
| --- | --- |
| Exact token count (the corpus size is reported as 300 B, but the actual number of tokens processed may be slightly higher) | ± 10 % |
| The factor 13 for FLOPs per layer is a simplification (real implementations have extra ops for layer norm, bias terms, dropout, etc.) | + ~20 % |
| Mixed‑precision training reduces the cost of some operations, while the count above assumes full‑precision arithmetic | − ~10 % |
| Overhead from optimizer state updates, gradient accumulation, checkpointing, etc. | + ~10 % |

Even allowing for these variations, the estimate stays within 10²³–10²⁴ FLOPs, i.e. roughly a day to a couple of weeks of compute on an exascale (10¹⁸ FLOP/s) machine.
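
Applying the table's adjustments to the 1.7 × 10²³ baseline gives a crude range (the percentages are the ones quoted above):

```python
# Crude error bounds from the uncertainty table above.
baseline = 1.7e23
low = baseline * 0.9 * 0.9               # -10% token count, -10% mixed precision
high = baseline * 1.1 * 1.2 * 1.1        # +10% token count, +20% extra ops, +10% overhead
print(f"{low:.1e} - {high:.1e} FLOPs")   # ≈ 1.4e+23 - 2.5e+23
```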


Bottom line: Using only the model’s size, the number of layers, and the token count, a straightforward arithmetic model yields a total training compute on the order of 10²³ floating‑point operations for GPT‑3. This is the magnitude of compute that modern “large‑scale” language models demand.

Gemini Pro 1.0

Estimating GPT-3 Training FLOPs: A Step-by-Step Approach

Without access to online resources, we can estimate the FLOPs (Floating-point Operations) needed for GPT-3 training by considering the following factors:

1. Model Size:

  • GPT-3 has 175 billion parameters.
  • Each parameter update involves two multiplications and one addition for each floating-point number.
  • Therefore, each parameter update requires 3 FLOPs.
  • Total FLOPs per update: 175 billion parameters * 3 FLOPs/parameter = 525 billion FLOPs.

2. Training Data Size:

  • GPT-3 was trained on a massive dataset of text and code.
  • Let's assume the dataset size is 500 billion words.
  • Each word is represented by an embedding vector of size 1,024.
  • Therefore, the total number of floating-point numbers in the dataset is 500 billion words * 1,024 = 512 trillion.

3. Number of Training Steps:

  • The exact number of training steps is unknown, but it likely involves millions of iterations over the entire dataset.
  • Let's assume there are 10 million training steps.

4. Optimization Algorithm:

  • GPT-3 likely uses Adam optimizer, which involves additional computations for updating parameters.
  • The exact FLOPs associated with the optimizer are difficult to estimate without knowing the specific hyperparameters used.
  • For simplicity, we can assume an additional 10 FLOPs per parameter update.

5. Hardware Efficiency:

  • Modern hardware like GPUs and TPUs can perform several FLOPs per clock cycle.
  • Let's assume an average efficiency of 5 FLOPs per clock cycle.

Putting it all together:

  • Total FLOPs per update: 525 billion (parameter updates) + 10 billion (optimizer) = 535 billion FLOPs
  • Total FLOPs for training: 535 billion FLOPs/update * 10 million updates = 5.35 * 10^18 FLOPs
  • Adjusted for hardware efficiency: 5.35 * 10^18 FLOPs / 5 FLOPs/cycle = 1.07 * 10^18 clock cycles

Therefore, the estimated total FLOPs required to train GPT-3 without searching online is approximately 1.07 * 10^18 clock cycles.
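
For reference, the arithmetic in this response can be reproduced in a few lines of Python. Every input here (3 FLOPs per parameter, a 10‑billion‑FLOP optimizer term, 10 million steps) is an assumption made by the response itself, not an established figure, and the final division by "5 FLOPs per clock cycle" is omitted because it converts FLOPs into clock cycles rather than changing the FLOP count:

```python
# Reproduces the response's own arithmetic with its own assumed inputs.
params = 175e9                  # model parameters
flops_per_param_update = 3      # assumed: 2 multiplications + 1 addition per parameter
optimizer_flops = 10e9          # assumed flat optimizer cost per update
steps = 10e6                    # assumed number of training steps

flops_per_update = params * flops_per_param_update + optimizer_flops   # 5.35e11
total_flops = flops_per_update * steps                                 # 5.35e18

print(f"total: {total_flops:.2e} FLOPs")   # 5.35e+18
```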

Important Caveats:

  • This is a rough estimate and may not be accurate due to several unknown factors, including the exact training data size, number of training steps, optimizer settings, and hardware efficiency.
  • The actual FLOPs could be significantly higher or lower depending on the specific training details.

Additionally, it's important to remember that this is just an estimation of the computational cost. Training a model like GPT-3 also requires significant energy consumption and environmental impact.

2 / 26

AI Board Game Logic

reasoning·medium

Tests an AI's ability to understand game rules and strategy

Aurora Alpha

Gemini Pro 1.0

3 / 26

Logic Puzzle

reasoning·medium

Tests an AI's ability to solve a simple but potentially confusing logic puzzle

Aurora Alpha

Sally has one sister.

Reasoning

  • Sally herself is a girl.
  • She has 3 brothers.
  • Each brother is said to have 2 sisters.
    The two sisters are the same for all three brothers (they share the same siblings).

Thus the total number of sisters in the family is 2. One of those sisters is Sally, leaving one other sister for her. So Sally has one sister.
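
A tiny brute‑force check in Python confirms the count (purely illustrative; it just enumerates how many sisters besides Sally would give each brother exactly 2 sisters):

```python
# Each of Sally's 3 brothers has as sisters exactly the girls in the family.
for other_sisters in range(5):
    girls = 1 + other_sisters              # Sally plus any other sisters
    if girls == 2:                         # each brother has 2 sisters
        print(f"Sally has {other_sisters} sister(s)")   # -> Sally has 1 sister(s)
```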

Gemini Pro 1.0

Sally has 1 sister.

Explanation:

  • Sally is a girl.
  • She has 3 brothers.
  • All the siblings share the same sisters. Since Sally is one of the sisters, and each brother has 2 sisters in total, there must be one other girl besides Sally in the family.
  • Therefore, Sally has 1 sister.