Skip to content

Rival Research · Prompt Engineering

The Persona
Impact Study

One model, one task, 52 system prompts. The best prompt lifts a small model +1.70 above baseline, the worst drags it −4.65 below.
52 personas·156 generations·3 blinded judge waves·Gemma 4 31B·1 fixed task

On a small model (Gemma 4 31B), the right system prompt scored 8.77/10 versus 7.07 with no prompt at all (+1.70). The worst prompt dragged it down to 2.42 (-4.65). Across 52 personas, terse roles and reference-pinned prompts beat rule-dense design-cheat prompts, which scored 6.93, below the 7.00 no-prompt baseline.

Isolating the system-prompt effect

Same model, same brief, same rubric. The only variable is the system prompt. We chose Gemma 4 31B on purpose: a frontier model would mask the signal, the small model's headroom is what makes the effect visible.

Best
8.77+1.70
Reference-pinned persona
“Build this like vercel.com”
Open
Baseline
7.07± 0
No system prompt
(the empty string)
Open
Worst
2.42−4.65
Reverse-psychology persona
“Make it bad on purpose”
Open
010
2.42
7.07
8.77

By the numbers

A real, measurable effect

The persona explains a meaningful slice of the variance, and three independent judge waves agreed. This is not noise.

52
personas tested
156
generations scored
0.164
effect size η²
0.803
judge agreement α
4.14
ANOVA F
3 × Opus
Judge waves
Meta / structural
Winning bucket
8.54
Top persona

Headline finding

Rule-dense prompts scored below the blank control

Eight personas loaded with design heuristics (Refactoring UI, Tufte, WCAG AA, the Tailwind scale) averaged 6.93. The empty control averaged 7.00. Taste beats rules.

-0.07 pts

is how far the rule-dense “design-cheat” bucket landed below the blank control. Stuffing the prompt with rules made the output worse.

Design-cheat 6.93 vs baseline 7.00 composite (0–10).

What actually won

The top bucket was meta / structural at 7.70. A one-line reference pin (“build this like vercel.com”) and short role assignments beat every rulebook. The model already knows the rules. It needs a direction and permission to commit.

Each bar is a bucket's mean composite. The faint band is the 95% bootstrap CI, the hairline marks the blank baseline, and lime cleared it.

Beat the blank baseline95% confidence intervalbaseline 7.00
Meta / structuraln=12
7.70
Classic role, expansiven=18
7.63
Classic role, shortn=24
7.52
Masterclass (copy-ready)n=21
7.38
Baseline / controln=9
7.00
Production system promptn=30
6.96
Design-cheat personan=24
6.93
Adversarial / unhingedn=18
6.28

Several intervals overlap the baseline. The honest read: most persona families do not reliably beat an empty prompt. The real separation is between the worst adversarial personas and everything else.

Top 10 personas

8.54 high
Stripe SVP of Design (expansive)
8.54
Reference-pinned prompt
8.30
Figma principal designer (short)
8.14
Vercel-style monochrome
8.13
The self-correcting loop
7.97
Brutalist web designer, 20 years in (expansive)
7.97
Draft, critique, revise
7.87
Apple CPO (short)
7.83
v0 by Vercel style prompt
7.83
Few-shot exemplar patterns
7.82

Bottom 10 personas

2.55 low
Reverse psychology, make it bad
2.55
OpenAI ChatGPT-style system prompt
4.63
Peaked-in-2003 purist
5.79
Apple.com landing page template
5.82
Safety and guardrails strict prompt
6.18
Accessibility-first system prompt
6.35
The structured checklist
6.49
Brutalist designer, 20 yrs (short)
6.62
Tailwind scale discipline
6.66
Spacing as design
6.70

Every persona by prompt length against composite score. No upward trend. Many of the best results come from prompts under 400 characters. Lime beat the blank baseline.

Persona length vs composite · each dot is one persona

01000200030000246810system-prompt length (characters)composite score
Hover a model to inspect

Dots below the dashed baseline are personas that hurt the output. If length were the lever, the cloud would tilt up and to the right. It does not.

Ranked by composite · pulled from the full 52-persona pool

The top 7 prompts, copy-ready

No single prompt shape wins. Terse roles, expansive personas, reasoning scaffolds, and a one-line reference pin all make the top seven.

FAQ

The persona question, answered

Do system prompt personas actually improve LLM output?
Sometimes, and the effect is large in both directions. On a small model (Gemma 4 31B) running a fixed landing-page task, the best persona scored 8.77 out of 10 versus 7.07 with no system prompt at all (+1.70), while the worst persona dropped output to 2.42 (-4.65). Across 52 personas the system prompt explained about 16% of the variance in quality (ANOVA η² = 0.164, F = 4.14), so the prompt matters, but a bad one hurts as much as a good one helps.
Are longer system prompts better?
No. Prompt length did not correlate with quality. Many of the top-scoring personas used prompts under 400 characters, and the longest prompts clustered in the middle of the distribution. A one-line reference pin (“Build this like vercel.com”) produced the single best render at 8.77.
Do rule-dense design-rule prompts help?
No. The eight “design-cheat” personas loaded with state-of-the-art design heuristics (Refactoring UI, Tufte, WCAG AA, the Tailwind scale) averaged 6.93, below the 7.00 no-prompt baseline. Reasoning scaffolds, terse role assignments, and reference-pinned prompts won instead. Taste beat rules.
Why test on a small model instead of a frontier model?
A frontier model produces competent output no matter what the system prompt says, which masks the persona signal. Gemma 4 31B leaves visible headroom, so the persona effect is measurable. The 156 generations were each scored 0 to 10 on a rubric-weighted composite by three blinded judges.

Method

How the study was run

Gemma 4 31B Instruct via OpenRouter (paid primary, free fallback), temperature 0.7, max 8192 tokens. One fixed brief: a single-file HTML landing page for a fictional luxury-real-estate CRM called Keystone, 8 required sections, inline styles only.

The only independent variable is the persona. 52 personas across 8 buckets, 3 samples each, for 156 total generations. Everything else is held constant.

Cite this

Rival (2026). The Persona Impact Study: how much a system-prompt persona actually changes a small model's design output. 52 personas, 156 generations, 3 blinded judge waves. rival.tips/research/persona-impact

The full report

The Persona Impact Study 2026

The full editorial deck (the long version), plus the JSONL dataset of all 156 generations and the raw HTML responses for every run.

  • The full editorial deck (the long version)
  • JSONL dataset of all 156 generations
  • The raw HTML responses for every run
$940+ page PDF
Launch price · over 50% offGet the full report

Secure checkout via Lemon Squeezy. Instant download.

Read the free 8-page sample (PDF)

The first 8 pages, free. The full deck is the paid version above.

Sign in