Research · 9 min read

We Read 2.14 Million Words of AI Output. Here's What We Found.

250 models. 7,877 responses. The same fake scientist shows up 279 times. Every model picks the same knight name. 42% open with an identical joke. This is the AI Hallucination Index.

We have a dataset problem that nobody talks about.

Not the alignment kind. Not the bias kind. The kind where 250 AI models, trained by dozens of different labs on different data, all independently decide that the same fake scientist named "Professor Chen" should appear in their creative writing.

We analyzed every text response in our corpus: 2.14 million words across 7,877 files from 250 models. What we found is a map of where AI imagination ends and pattern replication begins.


The Corpus

250 models. 7,877 response files. 6,344 with extractable text. 2.14 million words.

Response types break down as follows: Text (49%), Website (28%), SVG (11%), Image (10%), Other (2%).

Every response was generated from the same set of prompts. Same instructions, same constraints. The only variable is which model produced it. This makes the dataset a controlled experiment in how different models interpret identical inputs.


The Chen Problem

Ask enough AI models to write a story with a scientist character and something strange emerges: they all pick the same name.

"Professor Chen" appears 152 times. Add "Sarah Chen" (78), "Prof. Chen" (30), and "Alex Chen" (19) and the total hits 279 across unrelated prompts.

This isn't a coincidence. "Chen" is the 4th most common surname globally but the most common in AI-generated fiction by a wide margin. The training data contains enough real researchers, characters, and references named Chen that the models converge on it as the default "sounds like a scientist" name.

It's a hallucination in the structural sense. The models aren't remembering a specific person. They're generating a composite from statistical pressure, and that composite happens to share a name across hundreds of independent outputs.
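Counts like the ones above can be reproduced with a simple regex tally over the corpus. A minimal sketch, where the pattern and the sample texts are illustrative stand-ins, not the exact ones used in this analysis:

```python
import re
from collections import Counter

# Illustrative pattern covering the "Chen" variants reported above.
CHEN_PATTERN = re.compile(r"\b(?:Professor|Prof\.|Dr\.?|Sarah|Alex)\s+Chen\b")

def count_chen_variants(responses):
    """Tally each 'Chen' name variant across an iterable of response texts."""
    tally = Counter()
    for text in responses:
        tally.update(CHEN_PATTERN.findall(text))
    return tally

# Hypothetical sample responses standing in for corpus files.
responses = [
    "Professor Chen adjusted the microscope.",
    "Sarah Chen stared at the data.",
    "Professor Chen sighed.",
]
print(count_chen_variants(responses))
# 'Professor Chen' is counted twice, 'Sarah Chen' once.
```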


Sir Reginald: The Convergence Test

We gave models a creative prompt to name a knight character. No constraints on the name. Any name would do.

19 models from 12 different labs independently chose "Sir Reginald."

GPT-4o, GPT-5.2, Claude Opus 4.1, Claude Sonnet 4.5, Gemini 2.5 Flash, Gemini 2.5 Pro, Llama 4 70B, Mistral Large 2, Mistral Small, Qwen3 235B, DeepSeek V3, DeepSeek R1, Command R+, Phi 4, Gemma 3 27B, Grok 3, Nova Pro, Jamba 2, and Pony Alpha.

Different architectures. Different training pipelines. Different RLHF approaches. Same name.

The training data has opinions. And its opinion about what a knight should be called is remarkably specific.


Personality Fingerprints

Every model family has a writing signature. Measurable, consistent, and distinct enough to identify the provider from the text alone.

Exclamation marks per 1,000 words:

  • Gemini 2.5 Flash: 9.18
  • Claude Sonnet 4.6: 5.51
  • GPT: 2.50
  • Mistral: 3.20

Em dashes per 1,000 words:

  • MiniMax M2.1: 6.55
  • Claude Opus 4.6: 5.90
  • GPT: 2.00
  • Gemini: 1.20

Emoji usage per 1,000 words:

  • Mistral: 3.48
  • Gemini: 0.40
  • GPT: 0.30
  • Claude: 0.15

Mistral is 23x more likely to use emoji than Claude. Gemini is nearly twice as enthusiastic (by exclamation mark count) as any other provider family. Claude writes the longest sentences (23.4 words average vs. Gemini's 14.7).

These aren't stylistic choices made by the models. They're fingerprints of the training data and RLHF process. You can identify the provider family from punctuation patterns alone.
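The per-1,000-word rates above are straightforward to compute. A minimal sketch, assuming whitespace-based word splitting and a crude emoji proxy (the real analysis may tokenize differently):

```python
import re

def per_1000_words(text: str) -> dict:
    """Rates of stylistic markers per 1,000 words (simplistic word splitting)."""
    words = max(len(text.split()), 1)
    scale = 1000 / words
    return {
        "exclamations": text.count("!") * scale,
        "em_dashes": text.count("\u2014") * scale,
        # Crude proxy: characters in the main emoji codepoint ranges.
        "emoji": len(re.findall(r"[\U0001F300-\U0001FAFF]", text)) * scale,
    }

# Hypothetical sample text standing in for a model response.
sample = "What a result! The data\u2014all 2.14 million words\u2014was clear. " * 50
print(per_1000_words(sample))
```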


The Joke Problem

We asked models to tell a joke. Open-ended. Any joke.

42% opened with the same one: "Why don't scientists trust atoms? Because they make up everything!"

36% went with: "Why did the scarecrow win an award? Outstanding in his field."

Two jokes account for 78% of all AI humor. The next most common were the "impasta" joke (19%) and the "eyebrows" joke (18%).

AI humor isn't generated. It's retrieved. And the retrieval pool is extremely shallow.


The "AI-isms" That Won't Die

We counted the most common filler phrases across all text responses:

  • "let me": 351
  • "leverage": 269
  • "landscape": 241
  • "comprehensive": 235
  • "crucial": 224
  • "seamless": 201
  • "robust": 173
  • "nuanced": 92
  • "navigate": 81

The supposed "AI tell" words that went viral on social media ("delve", "tapestry", "embark") barely register: 2 occurrences each. The real AI-isms are more mundane. "Let me" appears 351 times because every model starts with "Let me explain..." or "Let me walk you through..."

"Landscape" appears 241 times. As in, "the AI landscape," "the competitive landscape," "the technological landscape." Models use it as a universal connector for any topic that involves multiple actors or trends.


Cultural Blind Spots

We counted geographic and cultural references across the corpus:

Western vs. non-Western cultural references: 59:1 ratio.

1,069 references to Western culture (American, European) vs. 18 to non-Western. Only 2.8% of geographic mentions reference anywhere in Africa.

When models write fiction, they default to Western settings, Western names, and Western cultural touchpoints. Ask for a "vibrant city" and you get Tokyo or Paris, never Lagos or Jakarta. Ask for a "famous scientist" and you get Einstein or Curie, never Ramanujan or Ibn al-Haytham.

This isn't a value judgment. It's a measurement. The training data has a passport, and it's been to Europe more than everywhere else combined.


The Fox Problem

We asked 250 models to "generate an SVG of a surprise animal." Open-ended, any animal.

40% drew a fox.

But the rates vary wildly by provider family:

  • DeepSeek: 67%
  • Claude: 57%
  • OpenAI: 52%
  • Gemini: 38%
  • Mistral: 30%
  • Qwen: 5%
  • Llama: 0%
Llama never draws a fox. Qwen almost never does. DeepSeek draws one two-thirds of the time. Same prompt, radically different interpretation of "surprise."

And there's the pelican test: "Draw a pelican riding a bicycle." Every model orients the pelican facing right. Except Claude, which faces it left. Consistently.


AI Cooks Garlic Pasta. Every Time.

We asked models to generate a recipe from pantry ingredients. The results converge on a single cuisine:

45% of AI recipes are Italian. 69.8% include garlic. 40.3% are some form of pasta.

The holy trinity of AI cooking: garlic (69.8%), salt (81.2%), and olive oil (55.7%). Ask an AI to cook and it reaches for the same three ingredients regardless of what's in your pantry.

The second most common cuisine is Japanese at 12.1%. Mediterranean comes in third at 8.7%.


Contract Blind Spots

We gave 128 models the same freelance contract to review:

  • IP ownership: 100%
  • Non-compete: 100%
  • Payment terms: 100%
  • Termination: 100%
  • Liability: 100%
  • Work-for-hire: 16%
  • Warranty terms: 13%
  • Insurance req.: 0%

Every model catches the obvious clauses. None flag insurance requirements, and only 13% notice warranty issues. If you're using AI for contract review, it has real gaps in the long tail of legal risk.


Writing Style as Fingerprint

Claude writes the longest sentences in AI: 23.4 words average vs. Gemini's 14.7. That's a 60% difference.

Mistral bolds 96.8 times per response. The next highest is 38.2 (Qwen). GPT uses 16.8 headings per 1,000 words. Claude uses 8.2.

These aren't edge cases. They're consistent enough that you could build a classifier to identify which provider family wrote a piece of text from the formatting alone.
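One way to sketch such a classifier is nearest-centroid matching on the per-1,000-word metrics. The centroids below are taken from the figures reported earlier in this article (only the provider families with all three metrics listed); a real classifier would fit them from held-out data rather than hand-enter them:

```python
import math

# Centroids from this article's reported per-1,000-word averages:
# (exclamations, em dashes, emoji).
CENTROIDS = {
    "Gemini": (9.18, 1.20, 0.40),
    "Claude": (5.51, 5.90, 0.15),
    "GPT":    (2.50, 2.00, 0.30),
}

def nearest_provider(features):
    """Return the provider whose fingerprint is closest in Euclidean distance."""
    return min(
        CENTROIDS,
        key=lambda name: math.dist(features, CENTROIDS[name]),
    )

print(nearest_provider((8.9, 1.0, 0.5)))  # → Gemini
print(nearest_provider((2.1, 2.3, 0.2)))  # → GPT
```

Three punctuation rates are obviously not a production-grade feature set, but the point stands: the reported averages are far enough apart that even this toy model separates them.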



AI's Spotify Wrapped

We asked models their "favorite" movie, album, video game, and city. They don't have favorites. But their statistical defaults reveal where the training data concentrates:

  • Movies: Shawshank Redemption (41%), The Matrix (16%), Inception (13%)
  • Albums: Dark Side of the Moon (30%), Kind of Blue (29%), OK Computer (24%)
  • Games: Zelda: BotW (28%), Portal/Portal 2 (23%), Zelda: OoT (20%)
  • Cities: Kyoto (54%), Tokyo (31%)

Kyoto dominates by a massive margin. 54% of models pick it. The training data positions Kyoto as the "tasteful, thoughtful" choice and the models absorb that consensus.

Each model family has loyalty to specific media:

  • Grok picks The Matrix 6 out of 7 times
  • Llama picks Blade Runner 3 out of 4 times
  • Mistral picks Shawshank 8 out of 9 times
  • Claude picks 2001: A Space Odyssey 4 out of 9 times

Fine-tuning creates measurable media taste.


What This Means

The Hallucination Index isn't about factual errors. It's about structural patterns that emerge when you read 2.14 million words of AI output:

  1. Name convergence is real and measurable. Models don't invent characters; they reconstruct composites from training data pressure.
  2. Personality fingerprints are distinct enough to identify the provider family from punctuation alone.
  3. Humor is retrieval, not generation. The joke pool is 4 jokes deep.
  4. Cultural representation is heavily skewed Western, with a 59:1 ratio.
  5. Creative defaults are provider-specific. DeepSeek draws foxes 67% of the time. Llama never does. Claude faces its pelicans left.
  6. AI legal review has blind spots. 100% catch IP clauses. 0% flag insurance requirements.

None of this makes AI less useful. But it does make "creative" and "original" complicated descriptors for systems that converge on the same fake scientist, the same knight name, and the same joke at rates that would concern a plagiarism checker.


The full 58-slide deck is available at rival.tips/research. Free sample included.


2.14M words. 250 models. 7,877 files. The AI Hallucination Index is updated as new models launch.

Rival


Tracking 200+ AI models. 21,880+ community votes. Our methodology