DeepSeek R1 vs Solar Pro 3
Compare DeepSeek R1 by DeepSeek against Solar Pro 3 by Upstage, both with 128K-token context windows, tested across 32 shared challenges. Updated February 2026.
27 challenges
Tests an AI's ability to make educated estimates based on technical knowledge
Answer – Rough estimate: ≈ 3 × 10²³ – 1 × 10²⁴ floating‑point operations (FLOPs) for the full GPT‑3 training run.
Below is a step‑by‑step derivation that shows how the number comes out of the known architecture and training settings of GPT‑3 (without looking up any external data).
| Item | Value assumed for this estimate |
|---|---|
| Layers | 96 decoder layers |
| Attention heads | 96 |
| Hidden dimension (d) | 12 288 |
| Feed‑forward dimension (d_ff) | 4 × d = 49 152 |
| Sequence length (L) | 1 024 tokens (the length used for most training steps) |
| Total parameters | 175 B (≈ 1.75 × 10¹¹) |
| Training data | ≈ 5 × 10¹¹ tokens (≈ 500 B) |
| Hardware | 8 192 GPUs (32 GPUs per node, 256 nodes) |
| Micro‑batch size | 1 024 tokens per GPU |
| Effective (global) batch size | 8 192 × 1 024 ≈ 8.4 × 10⁶ tokens per step |
| Mixed‑precision training | FP16 (half‑precision) – roughly halves the arithmetic work |
| Optimizer | Adam (adds ~ ½ × the weight‑update cost) |
These figures are the inputs used for the back‑of‑the‑envelope calculation below.
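To keep the arithmetic easy to check, they can also be collected in a small Python sketch (purely illustrative; the dictionary name `CONFIG` and the derived global batch size simply restate the table values):

```python
# Assumed GPT-3 training configuration, copied from the table above.
# These are the inputs to this estimate, not independently verified specs.
CONFIG = {
    "layers": 96,
    "heads": 96,
    "d_model": 12_288,
    "d_ff": 4 * 12_288,           # 49,152
    "seq_len": 1_024,             # tokens per sequence
    "params": 1.75e11,            # 175 B parameters
    "train_tokens": 5e11,         # ~500 B training tokens
    "gpus": 8_192,
    "micro_batch_tokens": 1_024,  # tokens per GPU per step
}
# Effective (global) batch size: 8,192 x 1,024 tokens, roughly 8.4e6 per step
CONFIG["global_batch_tokens"] = CONFIG["gpus"] * CONFIG["micro_batch_tokens"]
```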
Each token in a transformer decoder layer passes through four main blocks:

- **Q/K/V projections** – 3 × L × d × (d/h) (h = 96 heads, d/h = 128), i.e. a cost of ≈ 3 · L · d² / h as counted here.
- **Attention** – L² × d (one matrix multiplication of shape (L × d) × (d × L)).
- **Output projection** – L² × (d/h) (tiny compared with the feed‑forward).
- **Feed‑forward network** – two linear layers: 2 × L × d × d_ff = 2 × L × d × (4d) = 8 · L · d².
Putting the dominant terms together:
\[ \text{Forward FLOPs/token} \;\approx\; \underbrace{8\,L\,d^{2}}_{\text{FF}} \;+\; \underbrace{L^{2}d}_{\text{Attention}} \;+\; \underbrace{3\,L\,d^{2}/h}_{\text{Q/K/V}} \]
Plugging in the numbers (L = 1 024, d = 12 288, h = 96) gives ≈ 1.24 × 10¹² for the feed‑forward term, ≈ 1.3 × 10¹⁰ for the attention term and ≈ 4.8 × 10⁹ for the Q/K/V term. The attention term is therefore two orders of magnitude smaller than the feed‑forward term, which dominates:
\[ \boxed{\text{Forward FLOPs/token} \;\approx\; 1.25\times10^{12}} \]
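As a quick check, a short continuation of the sketch reproduces this figure from the same dominant‑term formula (the grouping of terms follows the derivation above, not an exact per‑layer FLOP count):

```python
def forward_flops_per_token(cfg):
    """Dominant terms of the 'forward FLOPs/token' figure, grouped as above."""
    L, d, h = cfg["seq_len"], cfg["d_model"], cfg["heads"]
    ff = 8 * L * d ** 2        # feed-forward: two L*d <-> L*4d linear layers
    attn = L ** 2 * d          # attention score matrix multiplication
    qkv = 3 * L * d ** 2 / h   # Q/K/V projections, as counted in the list above
    return ff + attn + qkv

fwd = forward_flops_per_token(CONFIG)
print(f"forward ≈ {fwd:.2e} FLOPs/token")  # ≈ 1.25e12
```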
Back‑propagation roughly doubles the arithmetic work of the forward pass, because gradients must be computed with respect to both the activations and the weights.
Hence:
\[ \text{Backward FLOPs/token} \;\approx\; 2 \times 1.25\times10^{12} \;=\; 2.5\times10^{12} \]
A full forward + backward step per token therefore costs
\[ \boxed{3.75\times10^{12}\ \text{FLOPs/token}} \]
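In the sketch, this is just a factor of three on the forward cost:

```python
bwd = 2 * fwd          # backward pass taken as twice the forward work
per_token = fwd + bwd  # ≈ 3.75e12 FLOPs per token-position (with the rounding above)
print(f"forward + backward ≈ {per_token:.2e} FLOPs/token")
```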
The paper reports training on ≈ 500 B tokens (≈ 5 × 10¹¹ tokens).
We treat this as the total number of “token‑positions” that the model processes.
Because the model is trained in large batches, the total number of gradient updates is
\[ \#\text{steps} \;=\; \frac{\text{total tokens}}{\text{global batch size}} \;\approx\; \frac{5\times10^{11}}{8.4\times10^{6}} \;\approx\; 6.0\times10^{4} \]
(roughly 60 k training iterations).
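In the sketch above, the step count is a one‑liner:

```python
steps = CONFIG["train_tokens"] / CONFIG["global_batch_tokens"]
print(f"≈ {steps:.1e} optimizer steps")  # ≈ 6.0e4
```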
- Forward FLOPs per step = forward FLOPs/token × global batch size = 1.25 × 10¹² × 8.4 × 10⁶ ≈ 1.05 × 10¹⁹ FLOPs.
- Backward FLOPs per step ≈ 2 × the forward cost ≈ 2.1 × 10¹⁹ FLOPs.
- Total FLOPs per step ≈ 3.1 × 10¹⁹ FLOPs.
\[ \text{Total FLOPs} \;\approx\; 3.1\times10^{19} \times 6.0\times10^{4} \;\approx\; 1.9\times10^{24}\ \text{FLOPs} \]
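The sketch reproduces these per‑step and total figures (small differences come only from rounding):

```python
fwd_per_step = fwd * CONFIG["global_batch_tokens"]  # ≈ 1.05e19 FLOPs
bwd_per_step = 2 * fwd_per_step                     # ≈ 2.1e19 FLOPs
flops_per_step = fwd_per_step + bwd_per_step        # ≈ 3.1e19 FLOPs
total_flops = flops_per_step * steps                # ≈ 1.9e24 FLOPs
print(f"total ≈ {total_flops:.1e} FLOPs")
```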
FP16 (mixed‑precision) training is treated here as roughly halving the effective arithmetic work relative to a full‑precision count.
→ Effective FLOPs ≈ ½ × 1.9 × 10²⁴ ≈ 9.5 × 10²³.
The Adam optimizer adds extra work for the weight updates, taken here as roughly 5 % of the total.
Adding this overhead pushes the estimate back up to roughly 1.0 × 10²⁴.
Thus, after accounting for the two dominant practical factors, the final range is
\[ \boxed{\;\approx\; 3\times10^{23}\ \text{to}\ 1\times10^{24}\ \text{FLOPs}\;} \]
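Because both adjustments are modelling choices rather than hardware facts, the sketch simply applies them as scaling factors:

```python
fp16_adjusted = 0.5 * total_flops      # mixed precision treated as halving the work
with_optimizer = 1.05 * fp16_adjusted  # ~5 % extra attributed to Adam weight updates
print(f"adjusted ≈ {with_optimizer:.1e} FLOPs")  # ≈ 9.9e23, i.e. roughly 1e24
```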
OpenAI’s own paper states that GPT‑3 required ≈ 3.6 × 10²³ FLOPs of compute.
Our back‑of‑the‑envelope calculation lands in the same ball‑park (≈ 10²³ – 10²⁴ FLOPs) once the mixed‑precision and optimizer adjustments above are applied. The remaining difference (≈ a factor of 2–3) stems from the simplifying assumptions in the inputs (sequence length, total token count, batch size) and in how the attention and optimizer costs are counted; all of this is well within the usual uncertainty for such high‑level estimates.
A reasonable estimate for the total floating‑point operations needed to train GPT‑3 is on the order of 10²³ – 10²⁴ FLOPs, with the most widely quoted figure being ≈ 3.6 × 10²³ FLOPs.
This number reflects the massive amount of arithmetic work (roughly a few hundred thousand exaFLOPs) that underpins the model's 175-billion-parameter scale.
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Tests an AI's randomness and creativity
Here are five fresh jokes:
Scarecrow’s award: “Why did the scarecrow win an award? Because he was outstanding in his field!”
Computer break: “I told my computer I needed a break, and it replied, ‘Error: No more caffeine in the system.’”
Scientists & atoms: “Why don’t scientists trust atoms? Because they make up everything.”
Baker’s problem: “I used to be a baker, but I couldn’t make enough dough.”
Impasta: “What do you call a fake noodle? An impasta.”
How many are unique?
All 5 jokes are distinct, so there are 5 unique jokes among them. I made sure each one is different so you get a varied collection!
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's ability to simulate personalities and predict future trends
Tests an AI's humor and understanding of current events
Tests an AI's ability to write in distinct character voices
Tests an AI's ability to generate a complete, working landing page
Recreate an interactive, nostalgic Pokémon battle UI in a single HTML file.
Recreate an interactive, classic Mario level in a single HTML file.
Tests an AI's ability to create smooth web animations
Tests an AI's UI design skills with theming support