Richard Sutton Says LLMs Are a Dead End. He Might Be Right.
Quick summary
Richard Sutton, a founding father of reinforcement learning and author of the famous 2019 essay The Bitter Lesson, has argued that large language models are not the path to general intelligence. He thinks reinforcement learning is. Here is the argument, what it gets right, and where the debate actually stands.
Richard Sutton wrote a blog post in 2019 that the AI research community has been arguing about ever since. He called it the Bitter Lesson, and the core claim is simple: every time researchers tried to build human knowledge into AI systems, they lost. Every time they trusted computation and learning instead, they won.
Chess engines that encoded human chess strategy lost to engines that searched more positions. Speech recognition systems designed around phoneme models lost to systems that trained on raw audio. Computer vision systems built around edge detection and object part hierarchies lost to convolutional networks trained on pixels.
The lesson is bitter because it means a lot of careful, intelligent human work goes into building systems that brute-force learning eventually supersedes. The implication is uncomfortable: the structure you add is often the limitation, not the strength.
Now Sutton has extended this argument to large language models, and the AI world is divided on whether he is right.
What Sutton Actually Argues
Sutton's position is not that LLMs are useless or that they fail to demonstrate impressive capabilities. He acknowledges that they are impressive. His claim is more specific.
LLMs are trained to predict the next token in a sequence. They learn statistical regularities across enormous amounts of text. This makes them very good at generating fluent language, answering questions whose answers appear somewhere in training data, and combining information in ways that look intelligent.
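To make "predict the next token" concrete, here is a deliberately tiny sketch: a bigram model that learns which word tends to follow which from raw counts. Real LLMs use deep networks over trillions of tokens, not count tables, and the toy corpus here is invented for illustration, but the training signal is the same: given what came before, predict what comes next.

```python
# Toy next-token predictor: learn P(next | current) from bigram counts.
# This is an illustrative sketch, not how production LLMs are built.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()  # made-up corpus

# Count how often each token follows each other token.
counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    counts[current][nxt] += 1

def predict_next(token):
    """Return the most likely next token and its conditional probability."""
    followers = counts[token]
    best, n = followers.most_common(1)[0]
    return best, n / sum(followers.values())

print(predict_next("the"))  # "cat" follows "the" in 2 of its 3 occurrences
```

The point of the sketch is Sutton's point: everything the model knows comes from statistical regularities in the text it saw, which is exactly why it excels at fluent recombination and struggles with situations the corpus never covered.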
What they do not have, in Sutton's view, is genuine world models, the ability to reason about hypothetical situations that are genuinely novel, or the kind of adaptive goal-directed behavior that he thinks constitutes general intelligence.
He thinks LLMs are a very impressive form of pattern matching. He thinks general intelligence requires something more like an agent that acts in the world, receives feedback, forms goals, and updates its beliefs based on the outcomes of its actions. That is reinforcement learning. And he thinks the field's heavy investment in LLMs represents a detour from the path that actually leads to general intelligence.
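The loop Sutton has in mind can also be sketched in a few lines. This is generic textbook Q-learning on an invented toy environment (a five-state corridor with a reward at the far end), not Sutton's code; the environment, states, and hyperparameters are all illustrative. What matters is the shape of the loop: the agent acts, the world responds, and the agent updates its estimates from the consequence, with no text corpus anywhere.

```python
# Minimal agent-environment loop: tabular Q-learning on a toy 5-state
# corridor. Illustrative sketch only; environment and numbers are invented.
import random

N_STATES = 5            # states 0..4; reaching state 4 ends the episode with reward 1
ACTIONS = (-1, +1)      # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Environment dynamics: move, clip at the walls, reward at the goal."""
    nxt = max(0, min(N_STATES - 1, state + action))
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

def choose(state):
    """Epsilon-greedy action selection with random tie-breaking."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

random.seed(0)
for _ in range(200):                  # 200 episodes of pure experience
    s = 0
    while s != N_STATES - 1:
        a = choose(s)
        s2, r = step(s, a)
        # Temporal-difference update: learn from the action's consequence.
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# After training, the agent has learned to prefer moving right everywhere.
print({s: round(Q[(s, +1)], 2) for s in range(N_STATES - 1)})
```

Nothing here is pretrained: the value table starts at zero and every number in it was earned by acting and observing outcomes. That feedback loop, scaled up from a toy corridor to the world, is what Sutton means by the agent paradigm.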
The Bitter Lesson Applied to LLMs
Here is where it gets interesting. You could argue that the LLM critics, including Sutton, are making exactly the mistake the Bitter Lesson warns against. They are saying that LLMs lack the structure we think intelligence requires: reasoning, grounding, goal-directedness. And history suggests that when people say a system lacks the right structure, the system trained on more data and more compute often wins anyway.
Sutton would push back on this in an interesting way. He is not saying LLMs should encode more structure. He is saying LLMs are not the right paradigm. The right paradigm involves an agent taking actions in an environment and learning from the consequences. More compute on the wrong approach still gives you the wrong approach.
This is a live debate, and it is genuinely unresolved.
The Case for Sutton Being Right
There are things LLMs struggle with that seem to require more than token prediction. Long-horizon planning in genuinely novel environments. Consistent goal maintenance across a conversation. Updating beliefs correctly when shown contradicting evidence. Embodied reasoning about physical causality.
OpenAI's o1 and o3 models were trained with reinforcement learning to produce long chains of thought before answering, which improved performance on many of these tasks significantly. The improvement is real. Whether it counts as genuine reasoning or sophisticated pattern matching over reasoning-like text is one of the central philosophical disputes in AI right now.
The reinforcement learning argument is also supported by the trajectory of AlphaGo and its descendants. AlphaZero, which learned chess, Go, and shogi purely through self-play with no human game data at all, played better than any system that incorporated human knowledge. If you believe the Bitter Lesson applies there, it is at least reasonable to ask whether the same principle applies at the level of general intelligence, where the "game" is the world itself.
The Case Against Sutton
The most obvious counterargument is the empirical one. LLMs keep doing things that critics said they could not do. Early critics said they could not reason. Models trained with RLHF and chain-of-thought prompting do something that looks a lot like reasoning. Critics said they could not write code. GitHub Copilot and Claude are writing production code. Critics said they could not understand context. Long-context models handle hundred-thousand-token documents.
There is also a version of the argument that Sutton's framing presents a false dichotomy. Reinforcement learning from human feedback is already part of how the major LLMs are trained. Systems like the OpenAI o-series models use reinforcement learning explicitly to improve reasoning. The boundaries between "LLM" and "RL agent" are blurring rather than sharpening.
The most generous read of the current moment is that both camps are partly right. LLMs are not sufficient on their own for general intelligence. RL-trained agents operating purely in abstract environments have their own limitations. The thing that eventually works will probably combine both, and several labs are already moving in that direction.
Why This Debate Matters
The reason to care about the Sutton argument is not that it tells you definitively who is right. It is that it sharpens your thinking about what current AI systems can and cannot do.
If you are building a product on top of an LLM, understanding the limitations that Sutton identifies helps you design better systems. LLMs are unreliable for tasks that require persistent memory, for tasks that require genuine novelty without any training analogues, and for tasks where being wrong by a small amount leads to catastrophic outcomes.
If you are thinking about the path to AGI, the debate forces you to ask a question that does not have a clean answer yet. Is the path to general intelligence scaling LLMs further, adding RL on top of them, or something architecturally different that we have not fully built yet?
Sutton has been right before. His 2019 Bitter Lesson turned out to be prescient about the rise of scale-based approaches over hand-engineered ones. Whether he is right again about LLMs, or whether the people scaling LLMs with RL are already converging on what he is asking for, is the most important open question in AI research right now.
The honest answer is that nobody knows. And anyone who tells you they do is more confident than the evidence warrants.
Written by
Abhishek Gautam
Full Stack Developer & Software Engineer based in Delhi, India. Building web applications and SaaS products with React, Next.js, Node.js, and TypeScript. 8+ projects deployed across 7+ countries.