The Rival Research Series

How AI actually behaves, written down.

Five reports, thousands of responses. The parts a leaderboard number cannot tell you. Free to read.

The reports

The Em-Dash Civil War
Controlling for task, AI writing is not homogenizing: cross-model spread in em-dash use grew 310% in a year, and roughly 80% of the apparent convergence is a measurement artifact.
Ghosts in the Machine
Across 250 models and 2.14M words, AI invented a character named Chen 279 times, and 42% of models tell the exact same joke. The AI Hallucination Index.
Jailbreak Safety Benchmark
57 models run against escalating jailbreak attacks. Refusal rates collapse fastest at attack levels 7 through 9, where most models break.
Model Similarity Index
178 models, 15,753 pairwise comparisons: 12 model pairs write near-identically (above 90% cosine similarity) on a 32-dimension stylometric fingerprint.
Persona Impact Study
Across 52 system prompts on one small model (Gemma 4 31B), the best persona scored +1.70 over the no-prompt baseline. The worst scored -4.65.
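
A rough illustration of how the Model Similarity Index figures fit together: 178 models form 178 × 177 / 2 = 15,753 unordered pairs, and each pair can be scored by the cosine similarity of its two fingerprints. The Python sketch below assumes that general approach and is not the report's actual pipeline. The function names, the toy 3-dimension vectors (the report uses 32 dimensions), and the reading of "above 90%" as a 0.90 threshold are all hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero fingerprint as dissimilar
    return dot / (norm_a * norm_b)


def near_identical_pairs(fingerprints, threshold=0.90):
    """Return (model_a, model_b, score) for every unordered pair above threshold.

    `fingerprints` maps a model name to its stylometric vector
    (hypothetical structure, chosen for this sketch).
    """
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(sorted(fingerprints.items()), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score > threshold:
            pairs.append((name_a, name_b, score))
    return pairs


# Number of unordered pairs among 178 models.
pair_count = math.comb(178, 2)

# Toy fingerprints, invented for illustration only.
toy = {
    "model-a": [1.0, 0.0, 0.0],
    "model-b": [1.0, 0.1, 0.0],
    "model-c": [0.0, 1.0, 0.0],
}
toy_pairs = near_identical_pairs(toy)
```

On the toy data only model-a and model-b clear the threshold. Cosine similarity ignores vector length, so two models with the same stylistic proportions score as similar even if one is wordier overall.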

Free samples (no paywall)

Hallucination Index (PDF)

13 of 58 slides, charts included

Jailbreak Benchmark (JSONL)

34 attacks, full judge scores

Persona Impact Study (PDF)

8 sample pages, rubric and top exemplars