Each of these presents the clearest, earliest, or most-cited statement of a specific idea in AI. Where those three criteria conflict, I have chosen whichever quote I liked the most. They are mostly big-picture concerns, and should hopefully seem at least interesting even if not correct in a timeless way.
The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform. It can follow analysis; but it has no power of anticipating any analytical relations or truths. Its province is to assist us in making available what we are already acquainted with.
Ada Lovelace (1843), Sketch of the Analytical Engine, Note G
It was suggested tentatively that the question, "Can machines think?" should be replaced by "Are there imaginable digital computers which would do well in the imitation game?"
[...]
The original question, "Can machines think?" I believe to be too meaningless to deserve discussion. Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted. I believe further that no useful purpose is served by concealing these beliefs. The popular view that scientists proceed inexorably from well-established fact to well-established fact, never being influenced by any unproved conjecture, is quite mistaken. Provided it is made clear which are proved facts and which are conjectures, no harm can result. Conjectures are of great importance since they suggest useful lines of research.
Alan Turing (1950), Computing Machinery and Intelligence
The chess machine is an ideal one to start with, since: (1) the problem is sharply defined both in allowed operations (the moves) and in the ultimate goal (checkmate); (2) it is neither so simple as to be trivial nor too difficult for satisfactory solution; (3) chess is generally considered to require "thinking" for skilful play; a solution of this problem will force us either to admit the possibility of a mechanized thinking or to further restrict our concept of "thinking"; (4) the discrete structure of chess fits well into the digital nature of modern computers.
Claude Shannon (1949), Programming a Computer for Playing Chess
We propose that a 2-month, 10-man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves.
John McCarthy et al. (1955), Dartmouth Workshop Proposal
Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort.
Arthur Samuel (1959), Some Studies in Machine Learning Using the Game of Checkers
If we use, to achieve our purposes, a mechanical agency with whose operation we cannot efficiently interfere once we have started it, because the action is so fast and irrevocable that we have not the data to intervene before the action is complete, then we had better be quite sure that the purpose put into the machine is the purpose which we really desire and not merely a colorful imitation of it.
Norbert Wiener (1960), Some Moral and Technical Consequences of Automation
Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an "intelligence explosion," and the intelligence of man would be left far behind (see for example refs. [22], [34], [44]). Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control. It is curious that this point is made so seldom outside of science fiction. It is sometimes worthwhile to take science fiction seriously.
I.J. Good (1965), Speculations Concerning the First Ultraintelligent Machine
Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
Charles Goodhart (1975), Problems of Monetary Management: The U.K. Experience ("Goodhart's Law")
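A toy illustration of the effect, with entirely made-up numbers rather than anything from Goodhart: a proxy that tracks a true quantity across a whole population tracks it much more weakly once you select hard on the proxy itself.

```python
# Toy sketch (not from Goodhart): a proxy metric that correlates with a true
# quantity under normal conditions correlates far more weakly once we select
# hard on the proxy, i.e. once it is used for control.
import random
import statistics

random.seed(0)

true_quality = [random.gauss(0, 1) for _ in range(100_000)]
proxy = [q + random.gauss(0, 1) for q in true_quality]  # noisy but informative

print("correlation before selection:",
      round(statistics.correlation(proxy, true_quality), 2))

# "Control pressure": keep only the top 1% ranked by proxy score.
top = sorted(zip(proxy, true_quality), reverse=True)[:1000]
top_proxy, top_true = zip(*top)
print("correlation among the selected:",
      round(statistics.correlation(top_proxy, top_true), 2))
```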
Encoded in the large, highly evolved sensory and motor portions of the human brain is a billion years of experience about the nature of the world and how to survive in it. The deliberate process we call reasoning is, I believe, the thinnest veneer of human thought, effective only because it is supported by this much older and much more powerful, though usually unconscious, sensorimotor knowledge. We are all prodigious olympians in perceptual and motor areas, so good that we make the difficult look easy. Abstract thought, though, is a new trick, perhaps less than 100 thousand years old. We have not yet mastered it. It is not all that intrinsically difficult; it just seems so when we do it.
Hans Moravec (1988), Mind Children: The Future of Robot and Human Intelligence
Commonly known as "Moravec's Paradox", often paraphrased as "the hard things are easy and the easy things are hard".
I think it's fair to call this event a singularity ("the Singularity" for the purposes of this paper). It is a point where our models must be discarded and a new reality rules. As we move closer and closer to this point, it will loom vaster and vaster over human affairs till the notion becomes a commonplace. Yet when it finally happens it may still be a great surprise and a greater unknown. In the 1950s there were very few who saw it: Stan Ulam [27] paraphrased John von Neumann as saying:

One conversation centered on the ever accelerating progress of technology and changes in the mode of human life, which gives the appearance of approaching some essential singularity in the history of the race beyond which human affairs, as we know them, could not continue.
Vernor Vinge (1993), The Coming Technological Singularity
This compression contest is motivated by the fact that being able to compress well is closely related to acting intelligently. In order to compress data, one has to find regularities in them, which is intrinsically difficult (many researchers live from analyzing data and finding compact models). So compressors beating the current "dumb" compressors need to be smart(er). Since the prize wants to stimulate developing "universally" smart compressors, we need a "universal" corpus of data. Arguably the online lexicon Wikipedia is a good snapshot of the Human World Knowledge. So the ultimate compressor of it should "understand" all human knowledge, i.e. be really smart. enwik8 is a hopefully representative 100MB extract from Wikipedia.
Marcus Hutter (2006), The Hutter Prize
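For a rough sense of the "dumb" baselines the prize aims to beat, here is a minimal sketch using Python's standard-library compressors; it assumes a local copy of enwik8, and the 10 MB slice is only there to keep the run short.

```python
# Rough sketch, not the prize baseline: how well do general-purpose
# compressors do on a slice of enwik8? Assumes enwik8 has been downloaded
# into the working directory.
import bz2
import lzma
import zlib

with open("enwik8", "rb") as f:
    data = f.read(10_000_000)  # first 10 MB, to keep the demo quick

for name, compress in [("zlib", zlib.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)]:
    size = len(compress(data))
    ratio = size / len(data)
    print(f"{name}: {size:,} bytes ({ratio:.3f} of original, {8 * ratio:.2f} bits/byte)")
```

Doing meaningfully better than these generic codecs requires modeling the structure of the text itself, which is the point of the quote.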
The Orthogonality Thesis
Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.

The Instrumental Convergence Thesis
Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents.
Nick Bostrom (2012), The Superintelligent Will
Both commonly referred to by name, with or without the term 'thesis'.
One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.
Rich Sutton (2019), The Bitter Lesson
The strong scaling hypothesis is that, once we find a scalable architecture like self-attention or convolutions, which like the brain can be applied fairly uniformly (eg. “The Brain as a Universal Learning Machine” or Hawkins), we can simply train ever larger NNs and ever more sophisticated behavior will emerge naturally as the easiest way to optimize for all the tasks & data. More powerful NNs are ‘just’ scaled-up weak NNs, in much the same way that human brains look much like scaled-up primate brains.
Gwern Branwen (2020), The Scaling Hypothesis
Here we will explore emergence with respect to model scale, as measured by training compute and number of model parameters. Specifically, we define emergent abilities of large language models as abilities that are not present in smaller-scale models but are present in large-scale models; thus they cannot be predicted by simply extrapolating the performance improvements on smaller-scale models (§2).
Wei et al. (2022), Emergent Abilities of Large Language Models
What this manifests as is – trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point. Sufficiently large diffusion conv-unets produce the same images as ViT generators. AR sampling produces the same images as diffusion.
This is a surprising observation! It implies that model behavior is not determined by architecture, hyperparameters, or optimizer choices. It's determined by your dataset, nothing else. Everything else is a means to an end in efficiently delivering compute to approximating that dataset.
Then, when you refer to "Lambda", "ChatGPT", "Bard", or "Claude", it's not the model weights that you are referring to. It's the dataset.
James Betker (2023), The “it” in AI models is the dataset.
Ilya: I challenge the claim that next token prediction cannot surpass human performance. It looks like on the surface it cannot—it looks on the surface if you just learn to imitate, to predict what people do, it means that you can only copy people. But here is a counter-argument for why that might not be quite so:
If your neural net is smart enough, you just ask it like, "What would a person with great insight and wisdom and capability do?" Maybe such a person doesn't exist, but there's a pretty good chance that the neural net will be able to extrapolate how such a person should behave.
Do you see what I mean?
Dwarkesh: Yes, although where would we get the sort of insight about what that person would do, if not from the data of regular people?
Ilya: Because if you think about it, what does it mean to predict the next token well enough? What does it mean actually? It's actually a much deeper question than it seems.
Predicting the next token well means that you understand the underlying reality that led to the creation of that token. It's not statistics—like, it is statistics, but what is statistics?
In order to understand those statistics, to compress them, you need to understand what is it about the world that creates those statistics. And so then you say, "Okay, well I have all those people. What is it about people that creates their behaviors?"
Well, they have thoughts and they have feelings and they have ideas, and they do things in certain ways. All of those could be deduced from next token prediction.
And I'd argue that this should make it possible—not indefinitely, but to a pretty decent degree—to say, "Well, can you guess what you would do if you took a person with this characteristic and that characteristic?"
Like, such a person doesn't exist, but because you're so good at predicting the next token, you should still be able to guess what that person would do—this hypothetical, imaginary person with far greater mental ability than the rest of us.
Ilya Sutskever (2023), The Dwarkesh Podcast - Why next-token prediction is enough for AGI