Z.AI: GLM 4.5 vs Grok 3

27 challenges

Why Z.AI: GLM 4.5?

5 months newer (Jul 2025)

Why Grok 3?

Leads 1 of 2 benchmarks
Stronger on LiveCodeBench: 79.4% vs 63.3%

Estimate Complexity

reasoning·hard

Tests an AI's ability to make educated estimates based on technical knowledge

Z.AI: GLM 4.5

To estimate the total number of FLOPs (floating-point operations) required to train GPT-3 without searching online, we rely on known architectural details and standard computational models for transformer training. Here's the step-by-step reasoning:

Step 1: Key Parameters of GPT-3

  • Number of parameters (P): GPT-3 has 175 billion parameters (175 × 10⁹).
  • Training dataset size (D): The model was trained on approximately 300 billion tokens (300 × 10⁹ tokens). This is a standard figure associated with GPT-3.

Step 2: FLOPs per Token Calculation

Training a transformer model involves two main operations per token:

  1. Forward pass: Computes the model's output.
  2. Backward pass: Computes gradients for backpropagation.

For transformer architectures, the combined FLOPs for forward and backward passes per token is empirically estimated as 6 × P. This is derived from:

  • Forward pass: Requires ~2 × P FLOPs per token (matrix multiplications dominate, with each parameter used in multiply-add operations).
  • Backward pass: Typically requires ~2× the FLOPs of the forward pass due to gradient calculations (~4 × P FLOPs per token).
  • Total per token: 2P (forward) + 4P (backward) = 6P FLOPs per token.

Step 3: Total FLOPs for the Entire Dataset

The total FLOPs scales linearly with the dataset size:

  • Total FLOPs = FLOPs per token × Number of tokens.
  • Substituting the values:
    Total FLOPs = (6 × P) × D = 6 × (175 × 10⁹) × (300 × 10⁹).

Step 4: Calculation

  • Compute the product:
    6 × 175 × 300 × 10⁹ × 10⁹ = 6 × 52,500 × 10¹⁸.
  • Simplify:
    6 × 52,500 = 315,000.
    Total FLOPs = 315,000 × 10¹⁸ = 3.15 × 10²³ FLOPs.
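As a quick sanity check, the headline multiplication can be reproduced in a couple of lines of Python (P and D as given above):

```python
# 6 * P * D heuristic for total training FLOPs
P = 175e9   # GPT-3 parameters
D = 300e9   # training tokens
total_flops = 6 * P * D
print(f"{total_flops:.2e}")   # 3.15e+23
```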

Step 5: Validation with Known Compute

  • GPT-3 training reportedly used ~3,640 petaflop-days of compute.
  • Convert petaflop-days to FLOPs:
    1 petaflop-day = 10¹⁵ FLOPs × 86,400 seconds/day = 8.64 × 10¹⁹ FLOPs.
    3,640 petaflop-days = 3,640 × 8.64 × 10¹⁹ ≈ 3.14 × 10²³ FLOPs.
  • This closely matches our estimate of 3.15 × 10²³ FLOPs, confirming the calculation.
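The petaflop-day conversion is easy to verify numerically:

```python
# Convert the reported ~3,640 petaflop-days to raw FLOPs
PFLOP_DAY = 1e15 * 86_400          # FLOPs in one petaflop/s-day
reported_flops = 3_640 * PFLOP_DAY
print(f"{reported_flops:.2e}")     # 3.14e+23
```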

Step 6: Batch Size and Steps Consideration

  • Batch size: GPT-3 used a batch size of about 3.2 million tokens per step (roughly 1,600 sequences × 2048 tokens/sequence).
  • Number of training steps:
    Steps = Total tokens / Batch size = 300 × 10⁹ / 3.2 × 10⁶ = 93,750 steps.
  • FLOPs per step:
    Per step FLOPs = 6 × P × Batch size = 6 × 175 × 10⁹ × 3.2 × 10⁶ = 3.36 × 10¹⁸ FLOPs/step.
  • Total FLOPs:
    93,750 steps × 3.36 × 10¹⁸ FLOPs/step ≈ 3.15 × 10²³ FLOPs.
    This consistency reinforces the estimate.
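The step-based bookkeeping above can be replayed directly:

```python
# Consistency check: steps x FLOPs-per-step should match 6 * P * D
P = 175e9
batch_tokens = 3.2e6
total_tokens = 300e9

steps = total_tokens / batch_tokens        # 93,750 steps
flops_per_step = 6 * P * batch_tokens      # 3.36e18 FLOPs/step
print(f"{steps:.0f} steps, {steps * flops_per_step:.2e} FLOPs total")
```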

Step 7: Negligible Factors

  • Parameter updates: Optimizer steps (e.g., Adam) add O(P) FLOPs per step. For 93,750 steps:
    Update FLOPs ≈ 93,750 × 175 × 10⁹ ≈ 1.64 × 10¹⁶ FLOPs (negligible vs. 3.15 × 10²³).
  • Embedding/softmax layers: Included in the 6P per-token estimate.
  • Attention mechanisms: Captured by the 6P heuristic (validated for transformers).
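To see just how negligible the optimizer term is, compare it against the total, using the step count from Step 6:

```python
# Optimizer update cost vs. total training cost
P = 175e9
steps = 93_750
update_flops = steps * P        # O(P) per step, per the text
total_flops = 3.15e23
print(f"{update_flops:.2e} FLOPs, {update_flops / total_flops:.0e} of the total")
```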

Final Estimate

The total FLOPs required to train GPT-3 is 3.15 × 10²³ FLOPs. This is derived from:

  • Standard transformer FLOPs per token (6P).
  • GPT-3 parameters (175B) and dataset size (300B tokens).
  • Cross-validation with reported compute usage (3,640 petaflop-days).

Grok 3

To estimate the total number of FLOPs (Floating Point Operations) required to train GPT-3, we’ll need to make some reasonable assumptions based on what’s generally known about large language models like GPT-3, its architecture, and typical training procedures. GPT-3, developed by OpenAI, is a massive model with 175 billion parameters, and while exact training details aren’t fully public, we can piece together an estimate using standard practices in machine learning.

Step 1: Understand the Key Components

The number of FLOPs to train a model depends on:

  • Number of parameters (P): GPT-3 has 175 billion parameters.
  • Training dataset size (D): The number of tokens the model is trained on.
  • Number of training steps (S): How many times the model processes the data (related to epochs and batch size).
  • FLOPs per parameter update: How many operations are performed per parameter per token.

Step 2: Estimate FLOPs per Forward and Backward Pass

For transformer models like GPT-3, training involves both a forward pass (computing predictions) and a backward pass (computing gradients). A common heuristic in the field is that:

  • A single forward pass through a dense neural network layer requires about 2 FLOPs per parameter per token (multiply and add for each weight).
  • The backward pass typically takes about twice as many FLOPs as the forward pass due to gradient computations, so roughly 4 FLOPs per parameter per token.
  • Total: ~6 FLOPs per parameter per token for one full training step (forward + backward).

For GPT-3 with 175 billion parameters, processing one token requires: 6 × 175 × 10⁹ = 1.05 × 10¹² FLOPs per token.

Step 3: Estimate the Training Dataset Size

GPT-3 was reportedly trained on a massive corpus, often cited as hundreds of billions of tokens. Let’s assume 300 billion tokens, a plausible figure based on discussions about large language model datasets (e.g., Common Crawl, books, and other sources). This is the total number of tokens processed during training, factoring in how data is sampled over epochs.

Step 4: Calculate Total FLOPs for One Pass

If the model processes 300 billion tokens once, the total FLOPs would be: 1.05 × 10¹² FLOPs/token × 300 × 10⁹ tokens = 3.15 × 10²³ FLOPs. This assumes one full pass through the dataset, but training typically involves multiple passes or iterations over the data, adjusted by batch size and optimization.

Step 5: Adjust for Training Dynamics

Training isn’t a single pass—it involves multiple updates to converge. The number of training steps depends on:

  • Batch size (B): GPT-3 likely used a large batch size (e.g., 3.2 million tokens, as seen in some large-scale models).
  • Total tokens processed: For optimization, models process tokens multiple times. A rule of thumb is that total tokens processed ≈ 2–3× the dataset size, accounting for multiple epochs or sampling.

Let's assume the model effectively processes 900 billion tokens (3× the dataset size) over training. Then: 1.05 × 10¹² FLOPs/token × 900 × 10⁹ tokens = 9.45 × 10²³ FLOPs.
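The effect of this multiplier is easy to compute (note that the 3× effective-epoch factor is this answer's assumption, not a published GPT-3 figure):

```python
# 6 FLOPs/param/token with an assumed 3x effective pass over the data
P = 175e9
effective_tokens = 3 * 300e9       # assumption made in the text
print(f"{6 * P * effective_tokens:.2e}")   # 9.45e+23
```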

Step 6: Account for Model Architecture Specifics

GPT-3 is a sparse transformer in some variants (e.g., earlier GPT models used sparsity), but the 175B version is dense. The 6 FLOPs/parameter/token heuristic holds for dense models. We’re not factoring in optimizations like mixed-precision training (FP16), which reduces compute slightly but doesn’t change the order of magnitude significantly.

Step 7: Final Estimate

Based on this reasoning, the total FLOPs to train GPT-3 is approximately 9.45 × 10²³ FLOPs, or roughly 10²⁴ FLOPs when rounded for simplicity.

Reasoning Check

  • Comparison to known models: A smaller model like BERT (110M parameters, ~10B training tokens) requires on the order of 10¹⁹ FLOPs. GPT-3 here is ~1,600× larger in parameters and processes ~90× more tokens, roughly a 10⁵× increase overall, which lands in the ~10²³–10²⁴ FLOPs range.
  • Hardware context: Training GPT-3 took months on thousands of GPUs. A V100 performs ~10¹⁴ FLOPs/second in mixed precision; 10,000 GPUs over ~10⁷ seconds (about 3–4 months) gives ~10²⁵ peak FLOPs, so 10²⁴ FLOPs is plausible at realistic utilization.
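The hardware figure can be sanity-checked in the same spirit (throughput and fleet size are the rough order-of-magnitude values from the bullet above):

```python
# Peak fleet compute over the training window
v100_flops_per_s = 1e14    # ~order of magnitude, mixed precision
num_gpus = 10_000
seconds = 1e7              # roughly 3-4 months
peak_flops = v100_flops_per_s * num_gpus * seconds
print(f"{peak_flops:.0e}")  # 1e+25 peak; 1e24 fits with ~10% utilization
```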

Conclusion

The total FLOPs required to train GPT-3 is estimated at approximately 10²⁴ FLOPs. This is a rough estimate, but it aligns with the scale of compute expected for a model of this size and training complexity.


AI Board Game Logic

reasoning·medium

Tests an AI's ability to understand game rules and strategy

Z.AI: GLM 4.5

Grok 3


Math Misconception Test

reasoning·medium

Tests an AI's understanding of number representation

Z.AI: GLM 4.5

No, 9.11 is not greater than 9.9. To compare the two numbers, consider their decimal values with the same number of decimal places for clarity:

  • 9.11 already has two decimal places.
  • 9.9 is equivalent to 9.90 (since adding a trailing zero does not change its value).

Now, compare digit by digit from left to right:

  • Units place: Both are 9 (equal).
  • Tenths place: 1 (from 9.11) vs. 9 (from 9.90). Since 9 is greater than 1, 9.90 is larger.

Therefore, 9.9 (or 9.90) is greater than 9.11. This can also be confirmed by subtraction:

  • 9.9 - 9.11 = 0.79 (positive result, so 9.9 is larger).
  • 9.11 - 9.9 = -0.79 (negative result, so 9.11 is smaller).

This comparison is based on standard decimal notation and numerical value, not on contextual interpretations like dates.
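The comparison can be made airtight with Python's decimal module, which compares by exact numeric value:

```python
from decimal import Decimal

# Exact decimal comparison, no floating-point representation issues
a, b = Decimal("9.11"), Decimal("9.9")
print(a > b)    # False
print(b - a)    # 0.79
```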

Grok 3

No, 9.11 is not greater than 9.9. In decimal comparison, 9.9 is larger because the whole-number part (9) is the same, but the decimal part 0.9 is greater than 0.11. Numerically, 9.9 > 9.11.
