My objection(s) to the "LLMs are just next-token predictors" take
An outline of why I believe this is the wrong stance to take in light of the current evidence
1/18/2025 · 5 min read
If you’ve been talking to people about Large Language Models (LLMs), you’ve probably heard (or said) the dismissive quip that they’re “just next-token predictors.” On the surface, this statement might seem accurate: after all, LLMs really are trained to predict the next token in a sequence (more precisely, to predict it given the entire preceding context). But the key word is “just.” In my view, this sparks a more interesting question: does calling them “just” next-token predictors capture the depth and breadth of what these systems are doing? I’d argue it absolutely does not.
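To make that objective concrete, here is a minimal sketch of what “predict the next token given the entire preceding context” looks like as a training loss. It uses PyTorch with a tiny recurrent stand-in for a real transformer; every size and name is an illustrative assumption, not any production model’s code.

```python
# Minimal sketch of the next-token prediction objective (illustrative only).
# A real LLM swaps the tiny GRU below for a deep transformer, but the loss
# being minimized is the same: cross-entropy on the next token at every position.
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32

embed = nn.Embedding(vocab_size, d_model)                    # token ids -> vectors
context_model = nn.GRU(d_model, d_model, batch_first=True)   # summarizes the preceding context
head = nn.Linear(d_model, vocab_size)                        # context state -> scores over the vocabulary

tokens = torch.randint(0, vocab_size, (1, 16))   # one toy sequence of 16 token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from tokens up to t

states, _ = context_model(embed(inputs))         # one context-dependent state per position
logits = head(states)                            # next-token scores at every position

# The entire training signal: how well did we predict each next token?
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```

Everything else these models end up doing is whatever internal machinery turns out to be useful for driving that single number down across a vast and varied corpus.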
Below are three reasons why I believe the stance that “LLMs are just next-token predictors” is misleading, and why it ultimately ignores the growing evidence that next-token prediction can give rise to something far more powerful and complex than the dismissive phrasing suggests.
Argument #1: Next-token prediction is a training objective, not a substrate
Perhaps the most common misunderstanding of “LLMs as next-token predictors” is seeing the objective of next-token prediction as the essence of these models, rather than the process that shapes them. In other words, stating that “all these models do is predict the next token” doesn’t tell us how the neural networks internally represent the world, how they encode linguistic structures, or how they end up performing tasks that look a lot like reasoning and abstraction. This is increasingly the case with newer generations of LLMs (which still rely on “just predicting the next token” but achieve it by different means).
A parallel to evolutionary biology
A useful analogy here is how biological evolution can be understood as a process that optimizes for inclusive genetic fitness—the total genetic contribution of an organism to future generations, encompassing not only its own direct reproduction but also the reproductive success of its relatives. While that might be the objective, it doesn’t mean the resulting organisms are “just” fitness maximizers in a simplistic sense. Instead, we see the emergence of intelligence, culture, cooperation, curiosity, and a range of complex behaviors that sometimes seem only loosely tied to gene propagation. Humans, for instance, find themselves obsessively drawn to ideas, art, and moral values, often at the expense of simple evolutionary imperatives.
The key insight is that the optimization process (whether it’s next-token prediction or maximizing genetic fitness) can give rise to rich, emergent subgoals and internal structures that end up semi-decoupled from the original objective. Claiming that LLMs are “just” next-token predictors glosses over the reality that the training process—by virtue of trying to predict language—can produce advanced representations and reasoning faculties. It’s like saying humans are “just” gene transmitters, ignoring the whole tradition of human thought and society that arose from evolutionary processes.
Argument #2: Brains are (likely) “just” next-event predicting machines
When people dismiss LLMs as “just next-token predictors,” the implication is often that this approach is inherently shallow or limited. But here’s an intriguing parallel: in neuroscience and cognitive science, a growing body of research suggests that the human brain itself might be little more than a sophisticated prediction engine. Predictive-processing frameworks such as Predictive Coding and Active Inference have gained traction across diverse areas of research, offering explanations for phenomena ranging from perception to consciousness to psychiatric disorders.
This framework posits that the brain is constantly predicting the sensory data it’s about to receive and then updating its internal model based on the mismatch, known as prediction error. This process is not an occasional activity but a relentless cycle, with the brain striving to anticipate the “next event” across multiple domains: the next photon hitting the retina, the next sound wave vibrating the eardrum, or even the next consequence of a motor command.
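For intuition, here is a toy sketch of that predict, compare, update cycle. It is a deliberately stripped-down caricature of prediction-error minimization, not a model of any actual neural circuit, and all the numbers are made up for illustration.

```python
# Toy predict -> compare -> update loop: an agent tracking a noisy quantity
# purely by minimizing its own prediction error.
import random

true_signal = 5.0     # the "world": some sensory quantity being tracked
belief = 0.0          # the internal model's current estimate
learning_rate = 0.1   # how strongly prediction errors update the belief

for step in range(50):
    observation = true_signal + random.gauss(0, 0.5)  # noisy sensory sample
    prediction = belief                               # predict the incoming input
    prediction_error = observation - prediction       # the mismatch drives all learning
    belief += learning_rate * prediction_error        # update the internal model

print(f"final belief ~ {belief:.2f} (true value {true_signal})")
```

The point of the toy is only that a system which does nothing but minimize prediction error still ends up with an increasingly accurate internal model of the thing it is predicting.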
In this sense, human cognition operates on principles that look suspiciously like next-event prediction. The key difference is that the “tokens” we process aren’t just linguistic—they include raw sensory data, structured language, and abstract expectations. Thus, if much of human-level intelligence (or at least a significant slice of it) arises from a process of continuous prediction and error minimization, then it shouldn’t be surprising that a neural network trained to predict linguistic tokens could develop similar emergent capabilities. It’s entirely plausible that next-token prediction, scaled up and refined, is enough to give rise to systems that exhibit understanding, reasoning, and even creativity.
Yes, current LLMs lack the architecture of a biological brain—they don’t have sensory organs, motor systems, or evolutionary baggage. But the underlying mechanism they employ is strikingly similar: a feedback loop that uses prediction to refine its internal model of the world. In both cases, what begins as a narrow task (predicting the next sensory event or the next token) can yield surprising complexity. To dismiss this process as “just” next-token prediction is to ignore the profound power of prediction as a foundational principle of intelligence. Which brings us to the third argument.
Argument #3: Why can't next-token prediction produce intelligent systems?
Even if the above parallels and theories didn’t exist, I still see no compelling reason to believe that a process of next-token prediction is somehow incompatible with the emergence of useful world models or intelligent behavior. The surprising leaps LLMs have made, from excelling in language-based tasks to generating coherent, context-rich outputs, are evidence that focusing on “just the next token” can yield more than meets the eye.
Emergence and complexity
A big part of the conversation around LLMs is about emergence: the phenomenon where a system composed of many interacting parts exhibits properties or capabilities that its parts don’t possess individually (in a different post I’ll discuss whether this is an epistemological or an ontological fact!). Once the model is big enough and trained on sufficiently diverse data, it acquires structures in its latent space that allow it to perform tasks well beyond the bare-bones requirement of next-token prediction. Yes, the immediate objective is still “predict the next token,” but the learned representations that support this task (semantic understanding, contextual reasoning, rudimentary planning) are precisely what we’d associate with advanced intelligence in a language domain.
Moreover, the scale at which these models operate means they can develop internal circuits that effectively carry out reasoning steps in order to predict the next token accurately, and there is evidence from Mechanistic Interpretability that they do, at least for specific, well-characterized circuits (e.g. induction heads). There’s no a priori law of the universe dictating that next-token prediction, done at scale with enough complexity and data, couldn’t end up looking a lot like reasoning or “thinking.” In fact, the growing repertoire of tasks LLMs can handle suggests that’s exactly what’s happening.
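To make the induction-head example concrete, here is the rule such a circuit implements, written out as plain Python. This is a behavioral caricature only (the function below is my own illustration, not code from any interpretability paper); real induction heads realize the pattern through attention weights inside the transformer, not explicit lookups.

```python
# What an induction head computes, stripped of the attention machinery:
# "if the current token appeared earlier in the context, predict whatever
# followed it then." A behavioral caricature for illustration only.

def induction_prediction(tokens):
    """Return the induction-style guess for the token that comes next."""
    current = tokens[-1]
    # Scan backwards for an earlier occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy what followed it last time
    return None                   # no earlier match: no induction-style guess

# "The cat sat. The cat ..." -> an induction head pushes probability toward "sat".
print(induction_prediction(["the", "cat", "sat", ".", "the", "cat"]))  # prints: sat
```

That a network trained only on next-token prediction learns this kind of in-context copying on its own is precisely the sort of internal structure the “just” framing glosses over.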
Concluding thoughts
Calling LLMs “just next-token predictors” is a bit like calling humans “just gene replication machines.” Both statements are superficially correct but profoundly miss the point. The objective that drives a system’s formation—whether it’s evolutionary fitness or next-token prediction—does not necessarily limit or fully describe the internal mechanics and emergent complexities that arise in pursuit of that objective.
LLMs, much like our own brains, can generate rich, nuanced internal representations that go far beyond the simplistic idea of “predicting the next token.” In the same way evolutionary processes gave rise to creativity, art, and moral philosophies—none of which are strictly required for replication—it’s plausible (and increasingly evident) that large-scale language models are developing complex “subroutines” or world models that power advanced, even surprising, capabilities.
So the next time someone tries to dismiss LLMs as “just next-token predictors,” it might be worth reminding them: that might be the goal, but it’s not the whole story. And that story is still unfolding, with each new token we—and these models—generate.