Claude Opus 4.6's response to the 'Explain Like I'm a Specific Expert' challenge.
Explain how a large language model (like GPT or Claude) learns and generates text to each of these three audiences:

1. **An experienced software engineer** who has built distributed systems and APIs but has never touched machine learning. They're skeptical that "predicting the next word" can produce intelligent behavior.
2. **A PhD physicist** who thinks AI is overhyped and wants to understand what's actually novel vs. what's just well-marketed linear algebra. They expect mathematical precision.
3. **A venture capitalist** evaluating an AI startup. They need to understand the technology well enough to assess defensibility, moats, and whether the founders' claims are credible.

Each explanation should be 2-3 paragraphs, tailored to their specific background, vocabulary, concerns, and what they'd find most compelling or interesting. Don't just change the difficulty level — change the framing, examples, and emphasis for each audience.
Think of training an LLM as building the world's most aggressively optimized compression algorithm, except instead of compressing files, it's compressing the statistical structure of human language. The architecture is a transformer — essentially a deep stack of matrix multiplications with a clever routing mechanism called attention that lets every token in a sequence "look at" every other token to build contextual representations. During training, you feed it enormous corpora of text with a deceptively simple objective: given tokens 1 through n, predict token n+1. The model gets a loss signal (cross-entropy between its predicted probability distribution and the actual next token), and you backpropagate gradients through billions of parameters. There's no knowledge graph, no hand-coded rules, no retrieval from a database at inference time. Every "fact" the model appears to know is encoded implicitly in weight matrices — think of it as a learned, lossy, read-only key-value store where the keys are contextual patterns and the values are probability distributions over continuations.
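For concreteness, here is a minimal sketch of that training objective in PyTorch, assuming a hypothetical `model` that maps token IDs to next-token logits; the names and shapes are illustrative, not any particular codebase's API:

```python
import torch
import torch.nn.functional as F

def training_step(model, tokens, optimizer):
    # tokens: (batch, seq_len) integer IDs drawn from the training corpus.
    inputs, targets = tokens[:, :-1], tokens[:, 1:]    # predict token t+1 from tokens 1..t
    logits = model(inputs)                             # (batch, seq_len - 1, vocab_size)
    loss = F.cross_entropy(                            # cross-entropy between the predicted
        logits.reshape(-1, logits.size(-1)),           # distribution and the actual next
        targets.reshape(-1),                           # token, averaged over positions
    )
    optimizer.zero_grad()
    loss.backward()                                    # backpropagate through every parameter
    optimizer.step()                                   # nudge billions of weights slightly
    return loss.item()
```

The entire training pipeline is this loop, repeated over trillions of tokens; everything the model "knows" is whatever weight configuration this procedure converges to.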
Here's where your skepticism about "just next-word prediction" deserves a serious answer, because it's the right instinct — and the resolution is genuinely surprising. The key insight is that predicting the next token well on sufficiently diverse data requires the model to build internal representations that go far beyond surface-level pattern matching. If the training data contains thousands of Python functions, accurately predicting the next token means implicitly learning variable scoping, type consistency, and control flow. If it contains mathematical proofs, it means learning logical entailment. The objective is simple, but the competencies required to minimize loss on that objective across all of human text are not simple at all. It's similar to how you might dismiss "find the shortest path" as a trivial problem statement, but the infrastructure required to solve it at Google Maps scale — spatial indexing, real-time traffic modeling, hierarchical graph decomposition — is anything but trivial. The simplicity of the loss function is not a ceiling on emergent capability; it's a floor that forces the model to develop increasingly sophisticated internal circuitry as you scale parameters and data.
At inference time, generation is autoregressive: the model produces a probability distribution over the entire vocabulary for the next token, you sample from that distribution (with temperature, top-k, or nucleus sampling to control randomness), append the chosen token to the context, and repeat. The context window is fixed-length (think of it as a sliding buffer), so the model has no persistent memory across calls — every API request is stateless, just like a pure function. What makes this feel "intelligent" is that the attention mechanism lets the model, at each step, dynamically route information across the entire context, effectively performing a learned, soft, parallelized lookup over everything it's been told in this conversation. It's not reasoning the way you or I do. But it's also not a Markov chain or a glorified autocomplete. It's something genuinely new: a differentiable program that learned, through brute-force optimization, to simulate a remarkable range of cognitive tasks because that's what minimizing prediction error on the internet required.
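The generation loop described above, sketched in PyTorch with temperature and top-k sampling; `model` is again a hypothetical stand-in for a trained transformer that returns next-token logits:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, context, max_new_tokens=50, temperature=0.8, top_k=40):
    # context: (1, n) tensor of token IDs -- the fixed-length buffer the model conditions on.
    for _ in range(max_new_tokens):
        logits = model(context)[:, -1, :]             # distribution over the vocab for the next token
        logits = logits / temperature                 # <1 sharpens, >1 flattens the distribution
        if top_k is not None:
            kth = torch.topk(logits, top_k).values[:, -1, None]
            logits[logits < kth] = float("-inf")      # keep only the k most likely tokens
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        context = torch.cat([context, next_token], dim=1)  # append and repeat
    return context
```

Note that nothing persists between calls: the function takes a context, produces tokens, and forgets everything, exactly like the stateless API request described above.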
Let's dispense with the hype and talk about what's actually going on mathematically. A transformer-based LLM is a parameterized function $f_\theta: \mathbb{R}^{n \times d} \rightarrow \Delta^{|V|}$ that maps a sequence of $n$ token embeddings in $\mathbb{R}^d$ to a probability distribution over a vocabulary $V$ for the next token. The core architectural innovation is the self-attention mechanism: given input matrix $X \in \mathbb{R}^{n \times d}$, you compute queries $Q = XW_Q$, keys $K = XW_K$, values $V = XW_V$ (all learned linear projections), form the attention weights $A = \text{softmax}(QK^\top / \sqrt{d_k})$, and take $AV$ as the output. This is, as you'd suspect, entirely linear algebra — matrix multiplications composed with pointwise nonlinearities (softmax, GeLU/ReLU in the feed-forward layers, layer normalization). There is no exotic mathematics here. The architecture is a composition of affine maps and elementwise nonlinearities, repeated ~100 times for large models, with residual connections maintaining clean gradient flow. Training minimizes the cross-entropy loss $\mathcal{L}(\theta) = -\sum_t \log p_\theta(x_t \mid x_{<t})$ via stochastic gradient descent (specifically Adam), computed over trillions of tokens. You're right that the mathematical primitives are pedestrian. The honest question is whether the emergent behavior of this system at scale constitutes something scientifically interesting, and I'd argue it does — in the same way that Ising models use trivial local interactions but exhibit phase transitions that are genuinely nontrivial.
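The same attention computation as a NumPy sketch (single head, no causal mask, illustrative shapes chosen for the example):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # X: (n, d) token representations; W_Q, W_K, W_V: (d, d_k) learned projections.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (n, n) attention weights
    return A @ V                                   # each row: weighted mix of value vectors

# Toy example with random weights, just to show the shapes.
n, d, d_k = 8, 16, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)             # (n, d_k) contextualized representations
```

Everything else in the architecture (feed-forward layers, layer norm, residual connections) is even simpler; the full model is this block composed roughly a hundred times.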
What's actually novel, and what I think a physicist should find substantively interesting, is the empirical scaling behavior and what it implies about the loss landscape and representation geometry. Kaplan et al. (2020) and Hoffmann et al. (2022) demonstrated that test loss follows remarkably clean power laws in parameters $N$, dataset size $D$, and compute $C$: roughly $L(N) \sim N^{-\alpha}$ with $\alpha \approx 0.076$ for language models, persisting over many orders of magnitude with no apparent plateaus. This is reminiscent of scaling laws in statistical physics, and it's not yet well understood from first principles — we don't have a satisfying theoretical account of why these exponents take the values they do, or why the power-law regime extends as far as it does. Additionally, mechanistic interpretability research has revealed that transformers develop structured internal representations: individual attention heads that implement interpretable algorithms (induction heads that perform approximate copying, heads that track syntactic dependencies), and that these circuits emerge through what appear to be phase transitions during training. The model isn't just memorizing n-gram statistics; it's learning something closer to a compressed, distributed program. The theoretical gap — between the simplicity of the architecture and the complexity of the learned representations — is real and open, analogous to understanding why deep networks generalize at all given massive overparameterization (the classical bias-variance framework completely fails here).
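To make the exponent concrete, a quick back-of-the-envelope calculation using the $\alpha \approx 0.076$ value quoted above (illustrative only; the measured exponent depends on the experimental setup):

```python
# With L(N) ∝ N^(-alpha) and alpha ≈ 0.076, a 10x increase in parameter count
# multiplies the loss by 10**(-0.076) ≈ 0.839 -- roughly a 16% reduction per
# order of magnitude, small per step but compounding over many decades of scale.
alpha = 0.076
for factor in (10, 100, 1000):
    print(f"{factor:>5}x parameters -> loss x {factor ** -alpha:.3f}")
```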
Where I'd push back on the "just well-marketed linear algebra" framing: you could equally say that quantum field theory is "just well-marketed calculus," and you'd be technically correct but missing the point. The scientific content is in the structure of the composition, the scaling behavior, and the emergent phenomena. The genuinely unsolved problems here — why overparameterized models generalize, what determines the scaling exponents, how to characterize the geometry of the loss landscape, whether there are sharp capability thresholds and what drives them — are legitimate open questions that resist easy answers. The field is immature and overhyped, yes. But the empirical phenomena are reproducible, the scaling laws are quantitatively precise, and the gap between our theoretical understanding and observed behavior is large. That gap is where the actual science is.
Here's the core mechanic: an LLM is trained by showing it enormous amounts of text — essentially the entire public internet, plus books, code, and academic papers — and optimizing it on one task: given everything that came before, predict the next word. The model is a neural network with billions of tunable parameters (think of them as knobs), and training is the process of adjusting those knobs so that the model's predictions get progressively better. What makes this remarkable is that "getting better at predicting the next word across all domains" forces the model to develop general-purpose capabilities — writing code, summarizing legal documents, answering medical questions — as a side effect. It's not programmed to do any of those things specifically. Generation works by running this prediction in a loop: predict the next word, append it, predict the next one, and so on. The quality of the output is a function of three things you should always ask about: the size and quality of the training data, the number of parameters in the model, and the amount of compute used to train it.
Now, here's what matters for your investment thesis: defensibility in foundation models is almost entirely a function of compute budget, data access, and organizational ability to execute on training runs that cost $50M–$500M+. The transformer architecture itself is public and well-understood — it's published research from Google in 2017. There is no secret algorithm. Any startup claiming a proprietary architectural breakthrough as their moat should be treated with extreme skepticism; the empirical evidence overwhelmingly shows that scaling known architectures with more data and compute outperforms clever architectural tweaks. The real moats are: (1) access to proprietary, high-quality training data (especially domain-specific data that isn't on the public internet — think medical records, financial data, proprietary codebases), (2) post-training techniques like reinforcement learning from human feedback (RLHF) that align the model to be useful and safe (this requires specialized teams and expensive human annotation pipelines), and (3) the flywheel of deployment — user interactions generating feedback data that improves the model, which attracts more users. OpenAI's moat isn't the transformer; it's the billions in compute partnerships with Microsoft, the RLHF pipeline, and the distribution via ChatGPT's 200M+ users generating preference data.
When evaluating a startup's claims, here's your diligence framework. If they say they've built a "better model," ask: better on what benchmarks, at what parameter count, trained on what data, with how much compute? If they can't answer precisely, that's a red flag. If they're building on top of foundation models (via fine-tuning or retrieval-augmented generation), the real question is whether their data and workflow integration create switching costs — the model layer is increasingly commoditized, so value accrues to whoever owns the proprietary data pipeline and the customer relationship. Be wary of startups whose entire value proposition is a thin wrapper around an API from OpenAI or Anthropic, because that's a feature, not a company — the platform provider can replicate it trivially. The most defensible AI companies are those that are accumulating proprietary data through their product, building domain-specific evaluation frameworks that let them actually measure whether their system works (this is harder than it sounds and most teams skip it), and creating workflow integration deep enough that ripping them out would be painful. The technology is real and transformative, but the moat is rarely in the model itself — it's in everything around it.