Claude Sonnet 4.6's response to the 'Explain Like I'm a Specific Expert' challenge.
Explain how a large language model (like GPT or Claude) learns and generates text to each of these three audiences:

1. **An experienced software engineer** who has built distributed systems and APIs but has never touched machine learning. They're skeptical that "predicting the next word" can produce intelligent behavior.
2. **A PhD physicist** who thinks AI is overhyped and wants to understand what's actually novel vs. what's just well-marketed linear algebra. They expect mathematical precision.
3. **A venture capitalist** evaluating an AI startup. They need to understand the technology well enough to assess defensibility, moats, and whether the founders' claims are credible.

Each explanation should be 2-3 paragraphs, tailored to their specific background, vocabulary, concerns, and what they'd find most compelling or interesting. Don't just change the difficulty level — change the framing, examples, and emphasis for each audience.
You're right to be skeptical of "predicting the next word" as a description — that framing makes it sound like autocomplete on steroids, which undersells what's actually happening architecturally. Think of training as building an extraordinarily compressed lookup system, except instead of storing key-value pairs explicitly, the model learns a high-dimensional function that generalizes across inputs it's never seen. During training, the network processes hundreds of billions of tokens and repeatedly adjusts ~100 billion floating-point parameters to minimize prediction error. The result isn't a database of sentences — it's something closer to a compiled representation of statistical relationships across essentially all human-written text. The "next word prediction" task turns out to be a surprisingly powerful forcing function: to predict well, the model has to implicitly learn grammar, facts, reasoning patterns, and even social context, because all of those things influence what word comes next.
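As a concrete anchor, here is a minimal sketch of that training step in PyTorch. The model, sizes, and data are toy stand-ins (a real run swaps in a deep Transformer and trillions of real tokens), but the objective and the update have the same shape:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for "the model": an embedding table plus one linear layer over the
# vocabulary. A real LLM replaces this with a deep Transformer; the objective is identical.
vocab_size, d_model = 50_000, 512
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

tokens = torch.randint(0, vocab_size, (8, 128))   # a batch of token-ID sequences
logits = model(tokens[:, :-1])                    # predict token t+1 from tokens up to t
loss = F.cross_entropy(logits.reshape(-1, vocab_size),   # "prediction error" on the next token
                       tokens[:, 1:].reshape(-1))
loss.backward()                                   # gradient of the error w.r.t. every parameter
optimizer.step()                                  # nudge all parameters to reduce it
```

That is essentially the entire outer loop of pretraining: measure next-token prediction error, backpropagate, nudge every parameter, repeat at enormous scale.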
The architecture doing this work is the Transformer, and the key mechanism is called attention. Forget the neural-network mysticism for a moment — attention is essentially a learned, dynamic routing system. For every token in a sequence, the model computes a weighted sum over the representations of the tokens it's allowed to see (everything earlier in the sequence, for a GPT-style decoder), where the weights are determined by learned query/key dot products (think of it like a soft, differentiable key-value lookup). This lets the model route information selectively across arbitrary distances in the sequence, which is why it handles long-range dependencies far better than older recurrent architectures that had to squeeze information through a fixed-size state, one step at a time. You stack 96 of these attention layers with feed-forward sublayers, run it at scale, and emergent capabilities appear that weren't explicitly trained — which is the part that should genuinely surprise you.
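A single attention head fits in a few lines. The sketch below (plain PyTorch, toy dimensions, causal masking as in a GPT-style decoder) is the "soft lookup" described above:

```python
import torch
import torch.nn.functional as F

def attention(x, W_q, W_k, W_v):
    """One attention head over a (seq_len, d_model) input: a content-dependent,
    causally masked weighted sum of value vectors."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / Q.shape[-1] ** 0.5                    # query/key dot products
    causal = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    weights = F.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
    return weights @ V                                       # route information by weight

seq_len, d_model, d_head = 16, 64, 32
x = torch.randn(seq_len, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_head) for _ in range(3))
out = attention(x, W_q, W_k, W_v)                            # shape: (seq_len, d_head)
```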
The "it's just predicting the next word, how can it reason?" skepticism is worth taking seriously, but consider this: the model generates text autoregressively, meaning each token it produces becomes part of the context for the next prediction. When a model "thinks through" a problem in writing, it's literally using its own output as working memory. The chain-of-thought behavior you see isn't a trick layered on top — it's the same mechanism, just given space to run. Whether that constitutes "real" reasoning is a legitimate philosophical debate, but behaviorally, the system solves problems that require multi-step logic, catches its own errors when prompted to check, and generalizes to problem formats it hasn't seen. As an engineer, the honest framing is: we built a system with surprising emergent properties from a simple training objective, and we don't fully understand why it works as well as it does. That should be interesting to you, not reassuring.
The architecture is a deep neural network — specifically a Transformer — trained via stochastic gradient descent to minimize cross-entropy loss on next-token prediction over a corpus of roughly 10¹² tokens. The parameters are real-valued weights in matrix multiplications, layer normalizations, and learned embedding tables. Nothing in the forward pass is nonlinear in any exotic sense: you have matrix multiplies, softmax (a smooth approximation to argmax), layer norm, and GELU activations. So your instinct that this is "well-marketed linear algebra" is not wrong as a description of the components. Where it gets genuinely interesting — and where I'd push back on the "just" — is in the emergent behavior as a function of scale, which has no satisfying theoretical explanation yet.
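For concreteness, every nonlinearity just named fits comfortably in a few lines of NumPy; the GELU shown is the standard tanh approximation:

```python
import numpy as np

def softmax(z):                            # smooth, differentiable stand-in for argmax
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):  # normalize each vector, then rescale and shift
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def gelu(x):                               # smooth elementwise gate (tanh approximation)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))
```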
The attention mechanism is worth examining precisely. For a sequence of tokens represented as vectors $x_i \in \mathbb{R}^d$, each attention head computes $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, where $Q = XW_Q$, $K = XW_K$, $V = XW_V$ are learned linear projections. This is a content-dependent weighted average — a soft, differentiable associative memory. Stacking 96 such layers with residual connections creates a system where information can be routed, composed, and transformed in ways that are analytically tractable in principle but practically opaque at scale. The residual stream interpretation — where each layer reads from and writes to a shared residual vector — is a useful frame from mechanistic interpretability work, and it maps loosely onto the idea of iterative refinement of a representation. What's novel isn't any single operation; it's that the composition at scale produces a function approximator with surprisingly structured internal representations: linear probes recover syntactic, semantic, and factual features from intermediate activations, suggesting the model learns something like a disentangled world model, not just surface statistics.
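A pre-LayerNorm block in PyTorch makes the residual-stream picture concrete. Dimensions here are illustrative (roughly GPT-2-small scale); the full model stacks many copies of this plus embeddings and an output projection:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-LayerNorm Transformer layer: each sublayer reads a normalized view of
    the residual stream and adds ("writes") its output back into it."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):                         # x: (batch, seq, d_model) residual stream
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                          # attention writes into the stream
        x = x + self.mlp(self.ln2(x))             # the MLP writes into the stream
        return x

stream = torch.randn(1, 16, 768)
for block in [Block() for _ in range(4)]:         # a production model stacks ~96 of these
    stream = block(stream)
```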
The genuinely open physics-adjacent question — and the one I suspect would interest you most — is the scaling law phenomenon. Hoffmann et al. (Chinchilla, 2022) and Kaplan et al. (2020) both find that loss decreases as a clean power law in compute, parameters, and data, with exponents that are empirically stable across many orders of magnitude. $L \sim N^{-\alpha}$ with $\alpha \approx 0.07$ for parameters is not something we can derive from first principles. It has the flavor of a critical phenomenon or a universality class result — the specific architecture seems not to matter much, only the scale. Why a next-token prediction objective over text produces power-law scaling, and why crossing certain loss thresholds produces discontinuous capability jumps ("emergent abilities"), is not understood. You can reasonably argue the hype is overblown, but the scaling regularity is a real empirical finding that lacks a theoretical home, and that's a legitimate scientific puzzle, not marketing.
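For reference, Hoffmann et al. fit the parametric form

$$
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
$$

where $N$ is parameter count, $D$ is training tokens, $E$ is an irreducible loss floor, and $A$, $B$, $\alpha$, $\beta$ are fitted constants (their $\alpha$ and $\beta$ come out around 0.3; the smaller $\alpha \approx 0.07$ quoted above is Kaplan et al.'s fit under a different parameterization and fitting setup rather than a contradiction). None of these constants are derived; they are measured, which is exactly the puzzle.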
The core technology is a large Transformer model trained on internet-scale text to predict the next token — but that description obscures what matters for your evaluation. What these companies have actually built is a capability platform: a general-purpose function approximator that, once trained, can be fine-tuned or prompted to perform thousands of distinct tasks without task-specific engineering. The training run is the capital-intensive, defensible step — GPT-4 class models cost $50–100M+ to train, require rare infrastructure relationships (NVIDIA allocation, cloud commitments), and took years of compounding research to get right. The resulting model weights are a durable asset in a way that most software isn't. So when evaluating a founder's claims, the first question isn't "does the demo work" — it's "where in the stack are they actually building, and does that position have structural advantages?"
The moat question is genuinely complex and often misunderstood in both directions. Foundation model training is highly concentrated — OpenAI, Anthropic, Google, Meta, Mistral, a handful of others — and the barriers there are real: compute capital, data quality and scale, and a small global talent pool of people who can actually run these training runs. But most AI startups aren't training foundation models; they're building on top of them via APIs, fine-tuning, or retrieval-augmented systems. That layer has much weaker defensibility on the technology side — the underlying models are commoditizing faster than most pitches assume, and a workflow built on GPT-4 today can be replicated on an open-source model tomorrow. The durable moats at the application layer are the ones that don't depend on model superiority: proprietary data flywheels (where using the product generates training signal competitors can't replicate), deep workflow integration that creates switching costs, and network effects where the product improves as more users or enterprises join. If a founder's defensibility argument is "our prompts are better" or "we fine-tuned on domain data," that's a weak answer. If it's "we have exclusive data relationships and our model improves with every customer transaction," that's worth taking seriously.
The credibility signals to look for in founders are specific and somewhat counterintuitive. Be skeptical of teams that can demo impressively but can't explain failure modes — a founder who tells you the model is "95% accurate" without unpacking what the 5% looks like and how they handle it is either naive or selling. Good technical founders in this space will talk about evaluation frameworks, about how they detect and manage hallucinations in production, and about the gap between benchmark performance and real-world reliability. They'll have opinions about when not to use a model. The technology is real and the applications are genuinely valuable, but the field is littered with products that work in demos and fail in deployment because the founders didn't understand that LLMs are probabilistic systems, not deterministic APIs — and building reliable products on top of them requires significant engineering that isn't visible in a pitch deck.
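One way to make that last point tangible: at any nonzero sampling temperature, generation is genuinely stochastic, so identical inputs do not guarantee identical outputs. A toy illustration with made-up tokens and probabilities:

```python
import numpy as np

rng = np.random.default_rng()

# A made-up next-token distribution for one fixed prompt.
tokens = ["approve", "deny", "escalate"]
probs = [0.70, 0.25, 0.05]

# Five calls with the identical "input" can produce five different "outputs".
for _ in range(5):
    print(rng.choice(tokens, p=probs))
```

Everything that turns this into a dependable product (validation, retries, constrained output, fallbacks, evaluation suites) is exactly the engineering that does not show up in a demo.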