Z.AI: GLM 4.5 vs GPT-4.1 Nano

36 challenges

Why Z.AI: GLM 4.5?

3 months newer (Jul 2025)

Why GPT-4.1 Nano?

5.6x cheaper overall ($0.10/M in · $0.40/M out)
8.2x more context (1.0M tokens)

Estimate Complexity

reasoning·hard

Tests an AI's ability to make educated estimates based on technical knowledge

Z.AI: GLM 4.5

To estimate the total number of FLOPs (floating-point operations) required to train GPT-3 without searching online, we rely on known architectural details and standard computational models for transformer training. Here's the step-by-step reasoning:

Step 1: Key Parameters of GPT-3

  • Number of parameters (P): GPT-3 has 175 billion parameters (175 × 10⁹).
  • Training dataset size (D): The model was trained on approximately 300 billion tokens (300 × 10⁹ tokens). This is a standard figure associated with GPT-3.

Step 2: FLOPs per Token Calculation

Training a transformer model involves two main operations per token:

  1. Forward pass: Computes the model's output.
  2. Backward pass: Computes gradients for backpropagation.

For transformer architectures, the combined FLOPs for forward and backward passes per token is empirically estimated as 6 × P. This is derived from:

  • Forward pass: Requires ~2 × P FLOPs per token (matrix multiplications dominate, with each parameter used in multiply-add operations).
  • Backward pass: Typically requires ~2× the FLOPs of the forward pass due to gradient calculations (~4 × P FLOPs per token).
  • Total per token: 2P (forward) + 4P (backward) = 6P FLOPs per token.
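
As a quick illustration of this heuristic, here is a minimal Python sketch (the 2P/4P split mirrors the breakdown above; it is an approximation, not a measured figure):

```python
# Per-token training FLOPs under the common 6P heuristic.
P = 175e9                       # GPT-3 parameter count (175 billion)

forward_flops = 2 * P           # ~2 FLOPs per parameter (multiply + accumulate)
backward_flops = 4 * P          # backward pass costs roughly 2x the forward pass
flops_per_token = forward_flops + backward_flops

print(f"FLOPs per token ≈ {flops_per_token:.2e}")   # ~1.05e+12
```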

Step 3: Total FLOPs for the Entire Dataset

The total FLOPs scales linearly with the dataset size:

  • Total FLOPs = FLOPs per token × Number of tokens.
  • Substituting the values:
    Total FLOPs = (6 × P) × D = 6 × (175 × 10⁹) × (300 × 10⁹).

Step 4: Calculation

  • Compute the product:
    6 × 175 × 300 × 10⁹ × 10⁹ = 6 × 52,500 × 10¹⁸.
  • Simplify:
    6 × 52,500 = 315,000.
    Total FLOPs = 315,000 × 10¹⁸ = 3.15 × 10²³ FLOPs.
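
The same arithmetic can be checked with a few lines of Python (a sketch using the parameter and token counts assumed above):

```python
P = 175e9            # parameters
D = 300e9            # training tokens

total_flops = 6 * P * D
print(f"Total training FLOPs ≈ {total_flops:.3e}")   # ≈ 3.150e+23
```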

Step 5: Validation with Known Compute

  • GPT-3 training reportedly used ~3,640 petaflop-days of compute.
  • Convert petaflop-days to FLOPs:
    1 petaflop-day = 10¹⁵ FLOP/s × 86,400 seconds/day = 8.64 × 10¹⁹ FLOPs.
    3,640 petaflop-days = 3,640 × 8.64 × 10¹⁹ ≈ 3.14 × 10²³ FLOPs.
  • This closely matches our estimate of 3.15 × 10²³ FLOPs, confirming the calculation.
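
A short Python check of this unit conversion (3,640 petaflop-days is the reported figure cited above):

```python
# Cross-check the 6 * P * D estimate against the reported compute budget.
PFLOP_DAY = 1e15 * 86_400            # FLOPs in one petaflop-day: 8.64e19
reported = 3_640 * PFLOP_DAY         # ≈ 3.145e+23 FLOPs
estimated = 6 * 175e9 * 300e9        # ≈ 3.150e+23 FLOPs

print(f"reported  ≈ {reported:.3e} FLOPs")
print(f"estimated ≈ {estimated:.3e} FLOPs")
print(f"ratio     ≈ {estimated / reported:.3f}")   # close to 1.0
```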

Step 6: Batch Size and Steps Consideration

  • Batch size: GPT-3 used a batch size of about 3.2 million tokens per step (roughly 1,560 sequences of 2,048 tokens each).
  • Number of training steps:
    Steps = Total tokens / Batch size = 300 × 10⁹ / 3.2 × 10⁶ = 93,750 steps.
  • FLOPs per step:
    Per step FLOPs = 6 × P × Batch size = 6 × 175 × 10⁹ × 3.2 × 10⁶ = 3.36 × 10¹⁸ FLOPs/step.
  • Total FLOPs:
    93,750 steps × 3.36 × 10¹⁸ FLOPs/step ≈ 3.15 × 10²³ FLOPs.
    This consistency reinforces the estimate.
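
The step-level bookkeeping can be reproduced in Python as well (batch size and token count as assumed above):

```python
P = 175e9
total_tokens = 300e9
batch_tokens = 3.2e6                     # tokens processed per optimizer step

steps = total_tokens / batch_tokens      # 93,750 steps
flops_per_step = 6 * P * batch_tokens    # ≈ 3.36e+18 FLOPs per step

print(f"steps          = {steps:,.0f}")
print(f"FLOPs per step ≈ {flops_per_step:.2e}")
print(f"total FLOPs    ≈ {steps * flops_per_step:.2e}")   # ≈ 3.15e+23
```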

Step 7: Negligible Factors

  • Parameter updates: Optimizer steps (e.g., Adam) add O(P) FLOPs per step. For 93,750 steps:
    Update FLOPs ≈ 93,750 × 175 × 10⁹ ≈ 1.64 × 10¹⁶ FLOPs (negligible vs. 3.15 × 10²³).
  • Embedding/softmax layers: Included in the 6P per-token estimate.
  • Attention mechanisms: Captured by the 6P heuristic (validated for transformers).
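
A rough check that the optimizer update is indeed negligible (this treats the update as on the order of one FLOP per parameter per step, matching the O(P) estimate above; the exact constant for Adam is a few times larger but does not change the conclusion):

```python
P = 175e9
steps = 93_750

optimizer_flops = steps * P              # O(P) work per optimizer step
training_flops = 6 * P * 300e9           # ≈ 3.15e+23

print(f"optimizer FLOPs ≈ {optimizer_flops:.2e}")                                 # ≈ 1.64e+16
print(f"fraction of training compute ≈ {optimizer_flops / training_flops:.1e}")   # ~5e-8
```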

Final Estimate

The total FLOPs required to train GPT-3 is 3.15 × 10²³ FLOPs. This is derived from:

  • Standard transformer FLOPs per token (6P).
  • GPT-3 parameters (175B) and dataset size (300B tokens).
  • Cross-validation with reported compute usage (3,640 petaflop-days).

GPT-4.1 Nano

Let's estimate the total number of floating-point operations (FLOPs) required to train GPT-3, based on what is known about its size, training data, and typical training procedures. Here's a step-by-step reasoning process:

1. Understand the model size and parameters:

  • GPT-3 has approximately 175 billion parameters.
  • Parameters are stored in FP32 or FP16 precision; often training uses mixed precision to reduce FLOPs slightly.

2. Determine the training dataset size:

  • GPT-3 was trained on hundreds of billions of tokens. Reports suggest around 300 billion tokens.
  • For simplicity, assume 300 billion tokens.

3. Estimate the number of training steps:

  • To process 300 billion tokens, depending on batch size:

    Suppose:

    • Batch size = 1 million tokens (a plausible large batch size for such training, approximating high-performance hardware)

    Number of steps = Total tokens / Batch size:

    Steps = (300 × 10⁹) / 10⁶ = 300,000

    But in practice, the batch size per GPU (or per node) is smaller—say 1,000 tokens per batch, with multiple GPUs. The total number of steps is roughly on the order of 300,000 to 500,000.

    To be conservative, take approximately 300,000 steps for total training.
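
A one-line Python check of the step count implied by the assumed 1-million-token batch:

```python
total_tokens = 300e9
batch_tokens = 1e6                 # assumed global batch size in tokens

print(f"training steps ≈ {total_tokens / batch_tokens:,.0f}")   # 300,000
```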

4. FLOPs per forward and backward pass:

  • For each token, the transformer computes several matrix multiplications.
  • Typically, the dominant cost per token in a transformer model scales with the model size.

A common approximation for FLOPs per token for training a transformer is:

FLOPs per token ≈ 2 × (number of parameters) × (sequence length)

But this depends on the specifics of the architecture. Historical estimates suggest:

  • For large transformer models, about 6 FLOPs per parameter per token are required for training (this accounts for both forward and backward passes). This is a rough estimate from various literature.

Thus,

FLOPs per token ≈ 6 × parameters

Plugging in GPT-3's parameters:

6 × 175 × 10⁹ = 1.05 × 10¹² FLOPs per token

Note: This seems high; more refined estimates put the overall cost per token at roughly a few trillion FLOPs, depending on the implementation.

Alternatively, a typical approximation used in literature (e.g., OpenAI's blog) is about 3 to 5 FLOPs per parameter per token.

Let's take 3 FLOPs per parameter per token for a more conservative estimate:

3 × 175 × 10⁹ = 525 × 10⁹ = 5.25 × 10¹¹ FLOPs per token

Similarly, for total FLOPs:

Total FLOPs ≈ (number of tokens) × (FLOPs per token)

Using 300 billion tokens and 5.25 × 10¹¹ FLOPs per token:

Total FLOPs ≈ 300 × 10⁹ × 5.25 × 10¹¹ = (300 × 5.25) × 10²⁰ = 1,575 × 10²⁰ = 1.575 × 10²³ FLOPs
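
The same totals can be reproduced in Python (using the more conservative 3 FLOPs per parameter per token assumed above):

```python
P = 175e9                          # parameters
tokens = 300e9                     # training tokens
flops_per_param_per_token = 3      # conservative assumption from above

flops_per_token = flops_per_param_per_token * P    # 5.25e+11
total_flops = tokens * flops_per_token             # ≈ 1.575e+23

print(f"FLOPs per token ≈ {flops_per_token:.3e}")
print(f"total FLOPs     ≈ {total_flops:.3e}")
```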

5. Adjust for training efficiency and hardware:

  • Not every FLOP is used efficiently.
  • Typically, the total number of FLOPs gives an upper bound.
  • FLOPs estimates usually cover the entire training process, including multiple epochs; in this case, because the data volume is so large, training amounts to roughly a single pass (or a few passes) over the data.

6. Final estimate:

Given the approximations, a rough but reasonable total FLOPs estimate for training GPT-3:

Total FLOPs ≈ 10²³ FLOPs

Summary:

  • Parameters: 175 billion
  • Tokens processed: ~300 billion
  • FLOPs per token: about 5 × 10¹¹
  • Total FLOPs: approximately 10²³ FLOPs

Answer: The total number of FLOPs required to train GPT-3 is roughly on the order of 10²³ FLOPs. This estimate considers GPT-3's 175 billion parameters, training on approximately 300 billion tokens, and accounts for the computational cost per token during training.


AI Board Game Logic

reasoning·medium

Tests an AI's ability to understand game rules and strategy

Z.AI: GLM 4.5

GPT-4.1 Nano


Logic Puzzle

reasoning·medium

Tests an AI's ability to solve a simple but potentially confusing logic puzzle

Z.AI: GLM 4.5

Sally has 3 brothers, meaning there are 3 boys in the family. Each brother has 2 sisters, indicating that there are exactly 2 girls in the family. Since Sally is one of the girls, there must be one additional sister. Therefore, Sally has one sister.

This can be verified by considering the family composition:

  • Sisters: Sally and one other girl.
  • Brothers: 3 boys.

From Sally's perspective:

  • She has 3 brothers.
  • She has 1 sister (the other girl).

From each brother's perspective:

  • They have 2 sisters (Sally and the other girl).

This satisfies both conditions: Sally has 3 brothers, and each brother has 2 sisters.

If there were only one sister (Sally), then each brother would have only one sister, which contradicts the information that each brother has 2 sisters. If there were three sisters, then each brother would have three sisters, which also contradicts the information that each brother has 2 sisters. Thus, there must be exactly two sisters, meaning Sally has one sister.
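
The case analysis can also be confirmed with a tiny brute-force search in Python (a sketch; the upper bound of 10 girls is arbitrary):

```python
# Find the number of girls consistent with: Sally has 3 brothers,
# and each of those brothers has exactly 2 sisters.
brothers = 3
for girls in range(1, 11):                    # Sally is one of the girls, so girls >= 1
    sisters_per_brother = girls               # every girl is a sister to each brother
    if sisters_per_brother == 2:              # the condition given in the puzzle
        print(f"{brothers} boys, {girls} girls -> Sally has {girls - 1} sister")
```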

Answer: 1

GPT-4.1 Nano

Sally has 2 sisters.

Here's the reasoning:

  • Sally has 3 brothers.
  • Each brother has 2 sisters.
  • Since Sally is one of those sisters, the other sister is the other girl the brothers share.

Therefore, Sally has 2 sisters in total.
