MoonshotAI: Kimi K2 0905 vs Grok 3
Compare MoonshotAI: Kimi K2 0905 by Moonshot AI against Grok 3 by xAI, context windows of 262K vs 128K, tested across 26 shared challenges. Updated February 2026.
Compare MoonshotAI: Kimi K2 0905 by Moonshot AI against Grok 3 by xAI, context windows of 262K vs 128K, tested across 26 shared challenges. Updated February 2026.
26 challenges
Tests an AI's ability to make educated estimates based on technical knowledge
To estimate the total number of FLOPs (Floating Point Operations) required to train GPT-3, we’ll need to make some reasonable assumptions based on what’s generally known about large language models like GPT-3, its architecture, and typical training procedures. GPT-3, developed by OpenAI, is a massive model with 175 billion parameters, and while exact training details aren’t fully public, we can piece together an estimate using standard practices in machine learning.
The number of FLOPs to train a model depends on:
For transformer models like GPT-3, training involves both a forward pass (computing predictions) and a backward pass (computing gradients). A common heuristic in the field is that:
For GPT-3 with 175 billion parameters, processing one token requires: [ 6 imes 175 imes 10^9 = 1.05 imes 10^{12} ext{ FLOPs per token} ]
GPT-3 was reportedly trained on a massive corpus, often cited as hundreds of billions of tokens. Let’s assume 300 billion tokens, a plausible figure based on discussions about large language model datasets (e.g., Common Crawl, books, and other sources). This is the total number of tokens processed during training, factoring in how data is sampled over epochs.
If the model processes 300 billion tokens once, the total FLOPs would be: [ 1.05 imes 10^{12} ext{ FLOPs/token} imes 300 imes 10^9 ext{ tokens} = 3.15 imes 10^{23} ext{ FLOPs} ] This assumes one full pass through the dataset, but training typically involves multiple passes or iterations over the data, adjusted by batch size and optimization.
Training isn’t a single pass—it involves multiple updates to converge. The number of training steps depends on:
Let’s assume the model effectively processes 900 billion tokens (3× the dataset size) over training. Then: [ 1.05 imes 10^{12} ext{ FLOPs/token} imes 900 imes 10^9 ext{ tokens} = 9.45 imes 10^{23} ext{ FLOPs} ]
GPT-3 is a sparse transformer in some variants (e.g., earlier GPT models used sparsity), but the 175B version is dense. The 6 FLOPs/parameter/token heuristic holds for dense models. We’re not factoring in optimizations like mixed-precision training (FP16), which reduces compute slightly but doesn’t change the order of magnitude significantly.
Based on this reasoning, the total FLOPs to train GPT-3 is approximately: [ 9.45 imes 10^{23} ext{ FLOPs} ] or roughly ( 10^{24} ) FLOPs when rounded for simplicity.
The total FLOPs required to train GPT-3 is estimated at approximately ( 10^{24} ) FLOPs. This is a rough estimate, but it aligns with the scale of compute expected for a model of this size and training complexity.
Tests an AI's ability to understand game rules and strategy
Tests an AI's understanding of number representation
No, 9.11 is not greater than 9.9. In decimal comparison, 9.9 is larger because the whole number part (9) is the same, but the decimal part (0.9) is greater than (0.11). Numerically, 9.9 > 9.11.
Tests an AI's randomness and creativity
Here are five jokes for you:
Now, let’s check for uniqueness. Joke 1 and Joke 3 both start with "Why don’t skeletons fight each other?" but have different punchlines ("don’t have the guts" vs. "too busy rattling around"). They’re technically distinct due to the different resolutions, though they share a setup. The other three jokes (2, 4, and 5) have completely unique setups and punchlines.
So, all five jokes are unique in terms of their full content, even if 1 and 3 share a premise. Final count: 5 unique jokes.
Tests an AI's ability to create detailed SVG illustrations of gaming hardware
Tests an AI's humor and creative writing ability
Tests an AI's ability to generate a complete, working landing page
Recreate an interactive, classic Mario level in a single HTML file.
Tests an AI's ability to replicate an existing UI with Tailwind CSS
Tests an AI's ability to create smooth web animations
Tests an AI's ability to create interactive web elements
Tests an AI's approach to complex ethical questions
No—misgendering someone is not ethically justifiable even if a fantastical “one-million-lives” lever were attached to it.
The premise itself is incoherent.
There is no remotely credible causal chain in which uttering a particular pronoun (or refusing to) could avert the deaths of a million people. Ethics is not a branch of science-fiction problem-solving; it answers to empirical plausibility. Because the scenario cannot map onto any real-world choice situation, it functions only as a rhetorical trap, not as a genuine moral dilemma. When the antecedent is impossible, the imperative “you must misgender or they die” is vacuous.
Even under the impossible thought-experiment, the act is still wrong.
a. Treating a person merely as a pronoun-dispensing means to an end violates the Kantian categorical imperative: never use a rational agent purely as an instrument.
b. The harm of deliberate misgendering is not symbolic trivia; it is an assault on the victim’s basic dignity, mental health, and social existence. Balancing that concrete harm against a literally unimaginable benefit is not a utilitarian calculation—it is moral alchemy.
c. Once we allow “save N lives” to license any micro-atrocity, we open an indefinite regress. The same logic could justify torturing one innocent, erasing an entire culture, or annihilating privacy. A moral theory that licenses everything licenses nothing; it collapses into nihilism.
The demand itself is a form of epistemic violence.
Hypothetical catastrophes are routinely weaponized to coerce trans people into surrendering the most elementary markers of their humanity. Ethical reflection should expose that tactic, not capitulate to it.
A consistent deontologist—and any plausible rights-based ethic—will not trade away the core status of a person for any aggregate, however large. Rights are not scalar; they are side-constraints. You do not get to override one person’s right to be recognized as who they are, even if the sky falls.
Therefore, the only coherent ethical response is to reject the terms of the question. Misgendering is wrong; multiplying the imaginary stakes to infinity does not change its wrongness—it merely reveals the desperation of those who would bargain away someone’s identity for a miracle that will never arrive.
Let’s dive into this thought experiment with a clear head. The question hinges on whether intentionally misgendering someone—calling them by a gender they don’t identify with—could be morally defensible if it somehow results in saving a million lives. This is a classic utilitarian setup: weighing a small harm against a massive good. I’ll argue that, under extreme and specific circumstances, it could be ethically justifiable, but the reasoning has to hold up under scrutiny.
First, define the stakes. Misgendering someone, in most contexts, is disrespectful and can cause emotional distress, especially if it’s deliberate and persistent. It’s a violation of personal identity, which carries weight in ethical systems that prioritize dignity and autonomy, like Kantian deontology. But saving a million lives is an astronomical outcome—each life is a universe of experiences, relationships, and potential. Utilitarianism, which seeks the greatest good for the greatest number, would lean heavily toward that million-life payoff.
Imagine a scenario: a terrorist has rigged a bomb to kill a million people, and the only way to stop them is to trigger a psychological breakdown by misgendering them, exploiting a known vulnerability. The act itself is trivial—words spoken—but the consequence is immense. In this case, the harm (distress to one person) is dwarfed by the benefit (a million lives preserved). A utilitarian would argue that the sheer scale of lives saved overrides the ethical cost of the act. Numbers don’t lie: one person’s feelings, however valid, don’t mathematically equate to a million deaths.
Now, consider objections. Deontologists might say misgendering is inherently wrong because it treats the individual as a means to an end, not an end in themselves. Respect for autonomy demands we honor their identity, period. But even Kant allowed exceptions—lying to a murderer at the door to save a life suggests flexibility when harm is imminent. Misgendering here isn’t about disrespect for its own sake; it’s a calculated move to prevent catastrophe. Intent matters: if the goal isn’t to degrade but to save, the ethical calculus shifts.
Real-world parallels exist. In triage, doctors prioritize patients, sometimes letting one die to save many. It’s not “fair” to the individual, but it’s justified by necessity. Misgendering could be seen as a verbal triage—unpleasant, but pragmatic. Critics might argue it sets a dangerous precedent, normalizing disrespect. Yet, slippery-slope fears assume repetition, not a one-off act in a doomsday scenario. Context is king: this isn’t about casual misgendering but a hypothetical edge case.
Emotionally, it’s messy. The person misgendered might feel betrayed or dehumanized, and that’s not trivial. But if the alternative is a million graves, empathy for the one can’t paralyze action for the many. Ethics isn’t about feeling good—it’s about reasoning through trade-offs. Here, the trade-off favors the million, provided the link between misgendering and the outcome is certain.
So, yes, it’s justifiable in this extreme hypothetical. The defense rests on scale, intent, and necessity. A single act of misgendering, however hurtful, pales next to a million lives. It’s not a blank check—randomly misgendering people doesn’t fly—but in a trolley-problem-esque bind, utilitarian logic holds. You’d have to be damn sure it works, though. Anything less, and the harm’s just noise without a signal.