Kimi K2 vs Qwen3 Coder Next
Compare Kimi K2 by Moonshot AI against Qwen3 Coder Next by Qwen, tested across 32 shared challenges. Updated February 2026.
32 challenges
Tests an AI's ability to make educated estimates based on technical knowledge
Estimating the total number of floating-point operations (FLOPs) to train GPT-3 without looking it up requires combining known architectural facts, scaling laws, and publicly shared engineering specifications (e.g., from the original GPT-3 paper and related talks). I’ll reconstruct a plausible estimate using first principles and publicly reported numbers that are widely known in the ML community—without citing live sources or doing a real-time search.
Training a large language model like GPT-3 typically refers to pretraining (not fine-tuning), which involves:
In deep learning, the training compute per token processed is approximately:
A standard rule of thumb (from Karpathy, Raffel et al., Kaplan et al.) is:
FLOPs per token ≈ 6 × D
where D is the number of parameters.
Why? The forward pass costs roughly 2 FLOPs per parameter per token (a multiply and an add for each weight), and the backward pass costs roughly twice that, giving about 2D + 4D = 6D per token. It ignores attention-specific FLOPs, which are a small correction at GPT-3's 2,048-token context.
But the 6×D rule is well-established:
✅ So we’ll use:
FLOPs per token = 6 × D
where D = number of parameters.
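To make the rule concrete, here is a minimal Python sketch of the 6×D approximation; the 2D-forward / 4D-backward split in the comments is the standard heuristic, and the 1.3B example value is arbitrary:

```python
def train_flops_per_token(num_params: float) -> float:
    """Approximate training FLOPs per token via the 6*D rule of thumb."""
    forward = 2 * num_params   # ~2 FLOPs per parameter (multiply + add)
    backward = 4 * num_params  # backward pass costs roughly 2x the forward pass
    return forward + backward  # = 6 * num_params

# Example: a 1.3B-parameter model needs roughly 7.8e9 FLOPs per training token.
print(f"{train_flops_per_token(1.3e9):.2e}")
```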
The GPT-3 paper trained a family of eight model sizes, from 125M, 350M, 1.3B, and 6.7B (among others) up to the largest: 175B.
The 175B model was the flagship, so we assume D = 175 × 10⁹.
✅ D = 1.75 × 10¹¹
The GPT-3 paper reports that the 175B model was trained on roughly 300 billion tokens:
✅ Total tokens T ≈ 3 × 10¹¹
Using the formula:
Total FLOPs ≈ 6 × D × T
Plug in:
= 6 × (1.75 × 10¹¹) × (3 × 10¹¹)
= 6 × 1.75 × 3 × 10²²
= 31.5 × 10²²
= 3.15 × 10²³ FLOPs
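The same arithmetic as a short, runnable sketch (the values are exactly those used above):

```python
D = 1.75e11  # parameters in the 175B GPT-3 model
T = 3e11     # training tokens (~300 billion)

total_flops = 6 * D * T
print(f"Total training FLOPs ~ {total_flops:.2e}")  # -> 3.15e+23
```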
But wait: is that just the forward and backward passes? Yes. This counts only the raw matmul FLOPs of training the model.
However, the original GPT-3 paper (Brown et al., 2020) and subsequent talks and write-ups often cite ~3.14 × 10²³ FLOPs, a memorable number because it is approximately π × 10²³.
Let’s verify:
6 × 1.75e11 × 3e11 = 6 × 5.25e22 = 3.15e23 ✅
✅ So ~3 × 10²³ FLOPs
Some sources round to ~10²³–10²⁴ FLOPs, but the consensus for the 175B model on 300B tokens is ~3 × 10²³.
Note: This is only the compute for the forward/backward passes (i.e., the "model FLOPs"). It excludes optimizer FLOPs (Adam keeps two extra state tensors per parameter, but its per-step updates are negligible next to 6DT), data loading, control overhead, etc.
Also: some setups add redundant compute (e.g., activation recomputation for gradient checkpointing), but those factors are minor (<2×). The dominant term remains 6DT. A sketch of how these corrections compose follows below.
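To illustrate why these corrections stay an O(1) factor, here is a hedged sketch that multiplies the 6DT estimate by illustrative overheads; the specific factor values are assumptions for illustration, not numbers reported for GPT-3:

```python
base_flops = 6 * 1.75e11 * 3e11  # 6*D*T model FLOPs ~ 3.15e23

# Illustrative, assumed overhead factors (not reported GPT-3 figures):
optimizer_overhead = 1.01  # Adam's per-step parameter updates are tiny next to 6*D*T
recompute_overhead = 1.33  # activation recomputation, if used, re-runs part of the forward pass

effective_flops = base_flops * optimizer_overhead * recompute_overhead
print(f"{base_flops:.2e} -> {effective_flops:.2e}")  # same order of magnitude either way
```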
≈ 3 × 10²³ FLOPs
(300 sextillion FLOPs)
This matches the widely cited figure from the GPT-3 era.
A hardware sanity check: actual utilization of theoretical peak compute is typically only ~30–40% due to communication, memory overhead, etc.
Dividing the estimate by an assumed theoretical compute budget of ~5.17e23 gives 3.15e23 / 5.17e23 ≈ 61%, optimistic but plausible for a well-tuned training run.
✅ Confirmed.
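For completeness, the utilization cross-check can be written out generically; every hardware number below (GPU count, peak throughput, wall-clock days) is a placeholder assumption, not the source of the 5.17e23 figure above:

```python
# All hardware values here are illustrative assumptions.
num_gpus = 10_000            # assumed accelerator count
peak_flops_per_gpu = 125e12  # assumed peak throughput per device, FLOP/s
days = 9                     # assumed wall-clock training time

available_flops = num_gpus * peak_flops_per_gpu * days * 24 * 3600
model_flops = 6 * 1.75e11 * 3e11  # ~3.15e23 from the estimate above

utilization = model_flops / available_flops
print(f"available ~ {available_flops:.2e}, utilization ~ {utilization:.0%}")
# -> roughly 32%, inside the ~30-40% range mentioned above
```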
Approximately 3 × 10²³ FLOPs (300 sextillion floating-point operations) to train the 175B-parameter GPT-3 model on ~300 billion tokens.
Tests an AI's ability to solve a simple but potentially confusing logic puzzle
Tests an AI's randomness and creativity
Here are five freshly minted jokes:
All five are unique—no duplicates, no recycled punchlines, no overlap.
Tests an AI's humor and creative writing ability
Tests an AI's ability to simulate personalities and predict future trends
Tests an AI's humor and understanding of current events
Tests an AI's ability to write in distinct character voices
Tests an AI's ability to generate a complete, working landing page
Recreate an interactive, nostalgic Pokémon battle UI in a single HTML file.
Recreate an interactive, classic Mario level in a single HTML file.
Tests an AI's ability to replicate an existing UI with Tailwind CSS
Tests an AI's ability to create interactive web elements