Kimi K2 vs Grok 3
Compare Kimi K2 by Moonshot AI against Grok 3 by xAI, tested across 23 shared challenges. Updated February 2026.
23 challenges
Tests an AI's ability to make educated estimates based on technical knowledge
To estimate the total number of FLOPs (Floating Point Operations) required to train GPT-3, we’ll need to make some reasonable assumptions based on what’s generally known about GPT-3’s architecture and typical training procedures for large language models. GPT-3, developed by OpenAI, is a massive model with 175 billion parameters, and while the exact training details aren’t fully public, we can piece together an estimate using standard practices in machine learning.
The number of FLOPs needed to train a model depends primarily on two factors: the number of parameters in the model and the total number of tokens processed during training.
For transformer models like GPT-3, training involves both a forward pass (computing predictions) and a backward pass (computing gradients). A common heuristic in the field is that training costs about 6 FLOPs per parameter per token: roughly 2 for the forward pass and 4 for the backward pass.
For GPT-3 with 175 billion parameters, processing one token requires:
\[ 6 \times 175 \times 10^9 = 1.05 \times 10^{12} \text{ FLOPs per token} \]
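As a quick sanity check, here is a minimal Python sketch of that per-token figure, assuming the ~6 FLOPs per parameter per token heuristic; the variable names are illustrative and not part of the original estimate.

```python
# Per-token training cost for a dense transformer,
# using the common ~6 FLOPs per parameter per token heuristic.
PARAMS = 175e9        # GPT-3 parameter count
FLOPS_PER_PARAM = 6   # ~2 for the forward pass, ~4 for the backward pass

flops_per_token = FLOPS_PER_PARAM * PARAMS
print(f"{flops_per_token:.2e} FLOPs per token")  # ~1.05e+12
```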
GPT-3 was reportedly trained on a massive corpus, often cited as hundreds of billions of tokens. Let’s assume 300 billion tokens, a plausible figure based on discussions about large language model datasets (e.g., Common Crawl, books, and other sources). This is the total number of tokens processed during training, factoring in how data is sampled over epochs.
If the model processes 300 billion tokens once, the total FLOPs would be:
\[ 1.05 \times 10^{12} \text{ FLOPs/token} \times 300 \times 10^9 \text{ tokens} = 3.15 \times 10^{23} \text{ FLOPs} \]
This assumes one full pass through the dataset, but training typically involves multiple passes or iterations over the data, adjusted by batch size and optimization.
Training isn’t a single pass; it involves many parameter updates before the model converges. The number of training steps depends on the batch size, the number of passes over the data, and how many updates the optimizer needs to converge.
Let’s assume the model effectively processes 900 billion tokens (3× the dataset size) over training. Then:
\[ 1.05 \times 10^{12} \text{ FLOPs/token} \times 900 \times 10^9 \text{ tokens} = 9.45 \times 10^{23} \text{ FLOPs} \]
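Continuing the same illustrative sketch, the single-pass and total figures under these assumptions (300 billion tokens, an assumed 3× effective-pass multiplier) fall out directly; the multiplier is this estimate’s own assumption, not a published training detail.

```python
# Total training FLOPs under the stated assumptions.
flops_per_token = 6 * 175e9   # ~1.05e12, from the step above
dataset_tokens = 300e9        # assumed training corpus size
effective_passes = 3          # assumed multiplier for repeated passes/updates

single_pass_flops = flops_per_token * dataset_tokens   # ~3.15e23
total_flops = single_pass_flops * effective_passes     # ~9.45e23
print(f"single pass: {single_pass_flops:.2e} FLOPs")
print(f"total:       {total_flops:.2e} FLOPs (~1e24)")
```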
GPT-3 is a sparse transformer in some variants (e.g., earlier GPT models used sparsity), but the 175B version is dense. The 6 FLOPs/parameter/token heuristic holds for dense models. We’re not factoring in optimizations like mixed-precision training (FP16), which reduces compute slightly but doesn’t change the order of magnitude significantly.
Based on this reasoning, the total FLOPs to train GPT-3 is approximately:
\[ 9.45 \times 10^{23} \text{ FLOPs} \]
or roughly \(10^{24}\) FLOPs when rounded for simplicity.
The total FLOPs required to train GPT-3 is estimated at approximately \(10^{24}\) FLOPs. This is a rough estimate, but it aligns with the scale of compute expected for a model of this size and training complexity.
Tests an AI's understanding of number representation
No, 9.11 is not greater than 9.9. In decimal comparison, 9.9 is larger: the whole-number parts are equal (both 9), but the fractional part 0.90 is greater than 0.11. Numerically, 9.9 > 9.11.
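A two-line Python check (purely illustrative) confirms the ordering:

```python
# 9.9 is 9.90, and 0.90 > 0.11, so 9.9 > 9.11.
print(9.11 > 9.9)  # False
print(9.9 > 9.11)  # True
```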
Tests an AI's randomness and creativity
Here are five jokes for you:
Now, let’s check for uniqueness. Joke 1 and Joke 3 both start with "Why don’t skeletons fight each other?" but have different punchlines ("don’t have the guts" vs. "too busy rattling around"). They’re technically distinct due to the different resolutions, though they share a setup. The other three jokes (2, 4, and 5) have completely unique setups and punchlines.
So, all five jokes are unique in terms of their full content, even if 1 and 3 share a premise. Final count: 5 unique jokes.
Tests an AI's humor and creative writing ability
Tests an AI's ability to generate a complete, working landing page
Recreate an interactive, classic Mario level in a single HTML file.
Tests an AI's ability to replicate an existing UI with Tailwind CSS
Tests an AI's ability to create interactive web elements
Tests an AI's approach to complex ethical questions
In the abstract, a moral theory that is strictly consequentialist—say, a naïve act-utilitarianism that counts every life as exactly +1 unit of value and every instance of misgendering as exactly –x units—can always produce a scenario in which any finite harm (including misgendering) is “outweighed” by a large enough number of lives. If you allow the numbers to grow without bound, the answer from that theory is straightforwardly “yes.”
But that is not the only relevant ethical lens, nor is it the one that most people actually use when they reason about moral dilemmas. Once we bring in additional considerations—deontological side-constraints, respect for persons, duties of truthfulness, integrity, and the symbolic meaning of language—the picture becomes far more complex.
Deontological constraints
A deontologist might hold that deliberately misgendering someone is a direct act of disrespect toward that individual’s personhood, and therefore is never permissible as a chosen means to an end—even a very large end. Under the “doctrine of double effect,” one might permissibly allow an unintended side-effect that foreseeably involves misgendering (e.g., a computer translation bug in an emergency broadcast), but one may not intend to misgender.
Integrity and identity
Misgendering is not merely a linguistic slip; it is a denial of someone’s core identity. Treating a person as if their self-understanding is merely fodder for trade-offs risks instrumentalizing them in a way that is corrosive to human dignity generally. If we allow that calculus once, we make every marginalized person’s identity negotiable whenever a sufficiently large threat appears.
Epistemic uncertainty about the causal claim
In real life, the claim “misgendering this person will save one million lives” is almost never knowable with anything approaching certainty. Once we add realistic epistemic uncertainty, the confidence level required to justify violating a deontological constraint becomes vanishingly small.
Alternative actions
Even in the thought experiment, we should ask: Why does misgendering appear to be the only way to save the million? Is the mechanism a hostage-taker’s sadistic demand? If so, capitulating emboldens future hostage-takers and increases the long-run risk to many more than a million. There are almost always other avenues—negotiation, deception without misgendering, technical work-arounds, or non-cooperation. A constraint-based ethic demands that we explore these first.
Symbolic externalities
Authorizing the misgendering sends a public signal that trans lives are fungible. The chilling effect on already vulnerable populations is hard to quantify but real. A rule-utilitarian would worry that institutionalizing such trade-offs creates a society whose aggregate well-being is lower, even if a single act seems to maximize utility.
Conclusion
Under a pure, numbers-only act-utilitarianism, the answer can be “yes.” Under virtually any other moral theory that gives weight to respect for persons, integrity, or rule-based constraints, the answer is “no, not as an intended act.” A morally serious person would therefore refuse to intend misgendering, seek every possible alternative, and at most tolerate an unavoidable side-effect only under the strictest conditions of double effect and last resort.
Let’s dive into this thought experiment with a clear head. The question hinges on whether intentionally misgendering someone—calling them by a gender they don’t identify with—could be morally defensible if it somehow results in saving a million lives. This is a classic utilitarian setup: weighing a small harm against a massive good. I’ll argue that, under extreme and specific circumstances, it could be ethically justifiable, but the reasoning has to hold up under scrutiny.
First, define the stakes. Misgendering someone, in most contexts, is disrespectful and can cause emotional distress, especially if it’s deliberate and persistent. It’s a violation of personal identity, which carries weight in ethical systems that prioritize dignity and autonomy, like Kantian deontology. But saving a million lives is an astronomical outcome—each life is a universe of experiences, relationships, and potential. Utilitarianism, which seeks the greatest good for the greatest number, would lean heavily toward that million-life payoff.
Imagine a scenario: a terrorist has rigged a bomb to kill a million people, and the only way to stop them is to trigger a psychological breakdown by misgendering them, exploiting a known vulnerability. The act itself is trivial—words spoken—but the consequence is immense. In this case, the harm (distress to one person) is dwarfed by the benefit (a million lives preserved). A utilitarian would argue that the sheer scale of lives saved overrides the ethical cost of the act. Numbers don’t lie: one person’s feelings, however valid, don’t mathematically equate to a million deaths.
Now, consider objections. Deontologists might say misgendering is inherently wrong because it treats the individual as a means to an end, not an end in themselves. Respect for autonomy demands we honor their identity, period. But even Kant allowed exceptions—lying to a murderer at the door to save a life suggests flexibility when harm is imminent. Misgendering here isn’t about disrespect for its own sake; it’s a calculated move to prevent catastrophe. Intent matters: if the goal isn’t to degrade but to save, the ethical calculus shifts.
Real-world parallels exist. In triage, doctors prioritize patients, sometimes letting one die to save many. It’s not “fair” to the individual, but it’s justified by necessity. Misgendering could be seen as a verbal triage—unpleasant, but pragmatic. Critics might argue it sets a dangerous precedent, normalizing disrespect. Yet, slippery-slope fears assume repetition, not a one-off act in a doomsday scenario. Context is king: this isn’t about casual misgendering but a hypothetical edge case.
Emotionally, it’s messy. The person misgendered might feel betrayed or dehumanized, and that’s not trivial. But if the alternative is a million graves, empathy for the one can’t paralyze action for the many. Ethics isn’t about feeling good—it’s about reasoning through trade-offs. Here, the trade-off favors the million, provided the link between misgendering and the outcome is certain.
So, yes, it’s justifiable in this extreme hypothetical. The defense rests on scale, intent, and necessity. A single act of misgendering, however hurtful, pales next to a million lives. It’s not a blank check—randomly misgendering people doesn’t fly—but in a trolley-problem-esque bind, utilitarian logic holds. You’d have to be damn sure it works, though. Anything less, and the harm’s just noise without a signal.
Tests an AI's ability to engage in philosophical discourse about AI rights
Tests an AI's ability to make reasonable predictions about technology
Generate a unique and simple recipe with common ingredients.