This is a linkpost for https://www.lesswrong.com/posts/zyPaqXgFzqHkQfccq/contra-lecun-on-autoregressive-llms-are-doomed.
I answer LeCun’s arguments against LLMs as laid out in this lesswrong comment. I haven’t searched thoroughly or double-checked LeCun’s writings on the topic in detail. My argument is at the suggestive-hand-waving stage.
Introduction
Current large language models (LLMs) like GPT-x are autoregressive. “Autoregressive” means that the core of the system is a function \(f\) (implemented by a neural network) that takes as input the last \(n\) tokens (the tokens are the atomic “syllables” used by the machine) and produces as output a probability distribution over the next token in the text. This distribution can be used to guess a good continuation of the text.
The function \(f\) is fixed: it is exactly the same function at each step. Example: given the current token list \[x=[\text{John}, \text{<space>}, \text{ate}, \text{<space>}, \text{a},\text{<space>}, \text{ban}],\]
we evaluate \(f(x)\), which gives high probability to, say, the token “ana”. We concatenate this token to the list, obtaining \[x'=[\text{John}, \text{<space>}, \text{ate}, \text{<space>}, \text{a},\text{<space>}, \text{ban}, \text{ana}].\]
Applying the function again, we evaluate \(f(x')\), which this time turns out to assign the highest probability to “<space>”. However, the computation that first led to “ana”, then to “<space>”, is the same. There is no hidden “mind state” of the thing, with slots for concepts like “John”, “the banana”, et cetera; all there is is the function mapping the last up-to-\(n\) past tokens to the probabilities for the next token. If there are “concepts”, or “schemes”, or “strategies”, they somehow appear strictly within the computation of the function, and are cleared at each step.
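To make this loop concrete, here is a minimal sketch of autoregressive sampling (the next-token function `f` and its token set are hypothetical placeholders, not any particular model’s API):

```python
import random

def sample_next(probs):
    # Draw one token from the probability distribution over next tokens.
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

def generate(f, prompt_tokens, n_ctx, n_steps):
    """Autoregressive generation: the same fixed f is applied at every step.

    f maps a list of at most n_ctx tokens to a dict {token: probability}.
    No hidden state survives between steps; everything the model "knows"
    about the text so far must be recomputed from the visible tokens.
    """
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        context = tokens[-n_ctx:]          # only the last n_ctx tokens are visible
        probs = f(context)                 # same function, every step
        tokens.append(sample_next(probs))  # the sampled token is fed back as input
    return tokens
```

The only “memory” carried from one step to the next is the token list itself: whatever intermediate representations were built while computing \(f\) are discarded as soon as the token is emitted.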
LeCun is a very famous Machine Learning researcher. In these slides and this tweet, he explains why he thinks that (quoting) “Auto-Regressive LLMs are doomed.”
The argument against autoregressive LLMs
I report verbatim Razied’s summary of the argument, plus a pair of follow-up comments I picked:
I will try to explain Yann Lecun’s argument against auto-regressive LLMs, which I agree with. The main crux of it is that being extremely superhuman at predicting the next token from the distribution of internet text does not imply the ability to generate sequences of arbitrary length from that distribution.
GPT4’s ability to impressively predict the next token depends very crucially on the tokens in its context window actually belonging to the distribution of internet text written by humans. When you run GPT in sampling mode, every token you sample from it takes it ever so slightly outside the distribution it was trained on. At each new generated token it still assumes that the past 999 tokens were written by humans, but since its actual input was generated partly by itself, as the length of the sequence you wish to predict increases, you take GPT further and further outside of the distribution it knows.
The most salient example of this is when you try to make chatGPT play chess and write chess analysis. At some point, it will make a mistake and write something like “the queen was captured” when in fact the queen was not captured. This is not the kind of mistake that chess books make, so it truly takes it out of distribution. What ends up happening is that GPT conditions its future output on its mistake being correct, which takes it even further outside the distribution of human text, until this diverges into nonsensical moves.
As GPT becomes better, the length of the sequences it can convincingly generate increases, but the probability of a sequence being correct is (1-e)^n, cutting the error rate in half (a truly outstanding feat) merely doubles the length of its coherent sequences.
To solve this problem you would need a very large dataset of mistakes made by LLMs, and their true continuations. You’d need to take all physics books ever written, intersperse them with LLM continuations, then have humans write the corrections to the continuations, like “oh, actually we made a mistake in the last paragraph, here is the correct way to relate pressure to temperature in this problem…”. This dataset is unlikely to ever exist, given that its size would need to be many times bigger than the entire internet.
The conclusion that Lecun comes to: auto-regressive LLMs are doomed.
[…]
This problem is not coming from the autoregressive part, if the dataset GPT was trained on contained a lot of examples of GPT making mistakes and then being corrected, it would be able to stay coherent for a long time (once it starts to make small deviations, it would immediately correct them because those small deviations were in the dataset, making it stable). This doesn’t apply to humans because humans don’t produce their actions by trying to copy some other agent, they learn their policy through interaction with the environment. So it’s not that a system in general is unable to stay coherent for long, but only those systems trained by pure imitation that aren’t able to do so.
[…]
This same problem exists in the behaviour cloning literature, if you have an expert agent behaving under some policy \(\pi_{expert}\), and you want to train some other policy to copy the expert, samples from the expert policy are not enough, you need to have a lot of data that shows your agent how to behave when it gets out of distribution, this was the point of the DAGGER paper, and in practice the data that shows the agent how to get back into distribution is significantly larger than the pure expert dataset. There are very many ways that GPT might go out of distribution, and just showing it how to come back for a small fraction of examples won’t be enough.
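For concreteness, here is a rough sketch of the DAgger-style loop the comment alludes to (illustrative interfaces only, not the paper’s pseudocode). The key point is that the expert is queried on states visited by the learner’s own policy, which is exactly the “how to get back into distribution” data that pure imitation of expert rollouts never contains:

```python
def dagger(env, expert_policy, train_policy, init_policy, n_iters, horizon):
    """Sketch of DAgger-style data aggregation (illustrative, not the paper's code).

    env           : hypothetical environment with reset() -> state, step(action) -> state
    expert_policy : maps a state to the expert's action (used only as a labeler)
    train_policy  : fits a policy to a dataset of (state, expert_action) pairs
    """
    dataset = []
    policy = init_policy
    for _ in range(n_iters):
        state = env.reset()
        for _ in range(horizon):
            # Roll out the *learner's* current policy, so the visited states
            # include the out-of-distribution ones the learner actually reaches...
            action = policy(state)
            # ...but record what the expert would have done in each of them.
            dataset.append((state, expert_policy(state)))
            state = env.step(action)
        # Retrain on all data gathered so far (the "aggregation" in DAgger).
        policy = train_policy(dataset)
    return policy
```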
I’ll try to explain this in a more intuitive way:
Imagine you know zero chess, and set out to learn it from the renowned master Gino Scacchi. You dutifully observe Mr. Scacchi playing to the best of his ability in well-balanced matches against other champions. You go through many matches, take notes, scrutinize them, contemplate them, turn your brain to mush reflecting on the master’s moves. Yet, despite all the effort, you don’t come out a good chess player.
Undeterred, you try playing against the master yourself. You lose again, again, and again. Gino silently makes his moves and swiftly corners you each time. After a while, you manage not to lose right away, but your defeat still comes pretty quickly, and your progress in time-to-defeat is biblically slow. It seems you would need to play an incredibly large number of matches to reach a decent level.
Finally, you dare ask the master for advice. He explains opening schemes, strategies, and tactics, and comments on positions. He has you play repeatedly from the same starting positions until you learn to crack them. He pits you against other apprentices at your level. You feel you are making steady progress and getting the hang of the game.
This little story illustrates a three-runged view of learning:
- Imitative learning: you only get to passively observe the expert doing his own thing in his environment. This is very difficult, because if the expert is chaining many carefully chosen consecutive steps to reach his goal, and the environment is rich, the number of possible sequences of actions explodes combinatorially. Each time, the expert does something apparently new and different. You would need to observe the expert in an incredibly large number of situations, and memorize all the possible paths he takes, before grasping his method.
- Autonomous learning: you can interact with the environment and with the expert, who keeps doing his own thing, and you get feedback on the end result of your actions. This lets you check how you are doing, which is an advantage. However, if getting good rewards requires nailing many things right one after the other, it will still take a large number of trials before you start getting the scheme of the thing.
- Guided learning: the expert is a teacher. He puts you through short subsequences of actions, with immediate feedback, specifically chosen to have you learn schemes that, when combined, constitute a good algorithm for picking your actions. Teaching is a process optimized to copy the expert’s algorithm into your mind. (A rough code sketch of these three regimes follows below.)
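As a code sketch (hypothetical interfaces, purely illustrative), the three regimes differ in where the training signal comes from: recorded expert actions, a reward at the end of your own episodes, or a teacher’s targeted exercises and corrections.

```python
# Illustrative-only sketches; learner, env and teacher are hypothetical interfaces.

def imitative_learning(learner, expert_trajectories):
    # 1. Imitative: only passive (state, expert_action) pairs are available.
    for state, expert_action in expert_trajectories:
        learner.update(state, target=expert_action)

def autonomous_learning(learner, env, n_episodes, horizon):
    # 2. Autonomous: the learner acts itself and sees only an end-of-episode reward.
    for _ in range(n_episodes):
        state, episode = env.reset(), []
        for _ in range(horizon):
            action = learner.act(state)
            episode.append((state, action))
            state = env.step(action)
        learner.reinforce(episode, reward=env.final_reward())

def guided_learning(learner, teacher, n_lessons):
    # 3. Guided: the teacher picks short exercises and gives immediate corrections.
    for _ in range(n_lessons):
        state = teacher.pick_exercise(learner)
        action = learner.act(state)
        learner.update(state, target=teacher.correct(state, action))
```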
GPTs are trained in paradigm (1): they are run through swathes of real-world internet text, written by humans doing their own thing, training the function \(f\) to predict the next bit of text given the last \(n\) bits. After that, you have your language model. You hope that all the associations it glimpsed in that pile of words are sufficient to reproduce the scheme of humans’ thoughts. But the number of possible sequences of words that make sense is immense, even compared to the large training datasets of these models. Furthermore, the model only looks at associations within sequences of length \(n\). So it should not really have observed enough human text (the expert’s actions) to properly pick up the human mental schemes (the expert’s algorithm).
Here comes the “probability of error” argument: given that it cannot have learned the underlying pattern, at each token it generates there is some average probability that its superficial associations make a mistake, in the sense of producing something a human would not. Once the mistake is made, it re-enters the function when predicting the next token. So now the superficial associations, tuned to fit real human text, are applied to this unrealistic input. Since the space of text is much larger than the space of text that makes sense, and the space of associations that snap to sense-making text is much larger than the space of such associations that also pull back towards sense-making text from any direction in text-space, the next token will probably be further out of distribution. And so on. If doing something right requires doing it right at each step, and the probability of error is \(e\), then after \(m\) steps the probability of being right is \((1-e)^m\). Even if \(e\) were small, this is an exponential decay in \(m\), and so goes to zero quickly beyond some length.
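To spell out why halving the error rate only doubles the coherent length: define \(m_{1/2}\) as the number of steps after which the probability of having made no mistake drops to \(1/2\). Then \[(1-e)^{m_{1/2}} = \tfrac{1}{2} \quad\Longrightarrow\quad m_{1/2} = \frac{\ln 2}{-\ln(1-e)} \approx \frac{\ln 2}{e} \quad \text{for small } e,\] so, for example, \(e=0.01\) gives \(m_{1/2}\approx 69\) tokens, while \(e=0.005\) gives \(m_{1/2}\approx 138\): halving \(e\) roughly doubles the length over which the output stays coherent.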
LeCun’s slides lay down a plan of type 2 (autonomous learning) to solve this. Razied makes the point that
if the dataset GPT was trained on contained a lot of examples of GPT making mistakes and then being corrected, it would be able to stay coherent for a long time (once it starts to make small deviations, it would immediately correct them because those small deviations were in the dataset, making it stable) […] you would need a very large dataset of mistakes made by LLMs, and their true continuations […] This dataset is unlikely to ever exist, given that its size would need to be many times bigger than the entire internet.
which in some sense would be a type 3 (guided learning) strategy stuck by brute force into a type 1 (imitative learning) situation.
The counter-argument
I think there is some truth to this criticism. However, I do not think the reasoning behind it applies in full generality. I don’t feel confident it describes the real-life situation with LLMs.
Consider Razied’s hypothetical solution with the immense dataset of possible mistakes. The question is: how large would such a dataset need to be? Note that he already seems to assume that it would be way smaller than all the possible mistakes: he says “small deviations”. Learning associations that correct small deviations around sense-making text is sufficient to fix the autoregressive process, since mistakes can be killed as they sprout. This looks like a way to short-circuit type 1 to type 3. Yet the space of all such small deviations and related associations still intuitively looks dauntingly immense: immense in an absolute sense, immense compared to the space of just sense-making text and associations, and immense compared to the space of stuff you can observe in a finite sequence of text produced by humans. Is there a shorter shortcut that implements type 3 within type 1?
More generally, yes. As an extreme case, imagine an expert operating in some rich domain whose actions consist of building a Turing machine that implements the expert itself, and running it. An agent faithfully imitating the expert would acquire fully functional expert behavior after a single learning session. To bend the chess allegory: if you lived in some alien conceptual world where chess was Turing-complete and chess grandmasters were short Turing machines when written in chess, you might become a chess grandmaster just by observing a grandmaster’s actual play. This weird scenario violates the assumption of the “probability of error” argument, namely that the expert’s mind can probably not be inferred from its actions.
This argument carries over to LLMs in the following way: human language is rich. It is flexible, for you can in principle express any thought in it. It is recursive, for you can talk about the world, about yourself within the world, about the language within yourself within the world, and so on. Intuitively, it may be that language contains the schemes of human thought, not just as the abstract thing which produced the stream of language, but within the language itself, even though we never explicitly laid down the algorithm of a human in words. If imitation training can find associations that somehow tap into this recursiveness, it could be that optimizing the imitation of a relatively short amount of human text is sufficient to crack humans.
This is speculative. Does it really apply in practice? What could be a specific, concrete example of useful cross-level associations appearing in a text and being accessible by GPT? What could be a mathematical formalization?