<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Very Sane AI Newsletter]]></title><description><![CDATA[Ideas in AI, preferably in English. Mostly what and why, a little how.]]></description><link>https://www.verysane.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!RebC!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefb68660-7fe4-4d4c-bfbe-0e5eaaa5c04e_1000x1000.png</url><title>Very Sane AI Newsletter</title><link>https://www.verysane.ai</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 10:25:15 GMT</lastBuildDate><atom:link href="https://www.verysane.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[SE Gyges]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[verysane@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[verysane@substack.com]]></itunes:email><itunes:name><![CDATA[SE Gyges]]></itunes:name></itunes:owner><itunes:author><![CDATA[SE Gyges]]></itunes:author><googleplay:owner><![CDATA[verysane@substack.com]]></googleplay:owner><googleplay:email><![CDATA[verysane@substack.com]]></googleplay:email><googleplay:author><![CDATA[SE Gyges]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Counting Arguments and AI]]></title><description><![CDATA[A &#8220;counting argument&#8221; is a style of argument common among creationists, who argue that the theory of evolution cannot be true and therefore humans (and usually animals too) were made in basically their present form by God.]]></description><link>https://www.verysane.ai/p/counting-arguments-and-ai</link><guid isPermaLink="false">https://www.verysane.ai/p/counting-arguments-and-ai</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Sat, 11 Apr 2026 04:41:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RGg4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F230ed7ab-70a3-4ba1-b204-77ac214f9c85_684x684.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A &#8220;counting argument&#8221; is a style of argument common among creationists, who argue that the theory of evolution cannot be true and therefore humans (and usually animals too) were made in basically their present form by God. These arguments run like so:</p><p>1. Count the number of possible states of something in biology, like the amino acids in a protein, the nucleotides in DNA, possible body shapes, etc.</p><p>2. Argue that the fraction of those states that function at all is vanishingly small.</p><p>3. Conclude that it is basically impossible to find these states at random.</p><p>We are, of course, digging into these because the same sort of argument is sometimes made about computers and AI. They are used to argue or &#8220;prove&#8221; that various things in AI don&#8217;t or can&#8217;t ever work that do work or could. We&#8217;ll start with the example from biology, where the error is best-studied, and work our way through computer science to AI.</p><h2>Creationist Errors of Interest</h2><p>There are a few flavors of the counting argument. 
The most memorable one is that a protein assembling itself is as unlikely as a tornado assembling a Boeing 747, and many of them like to assign specific and very large numbers like one in 10<sup>150</sup> or in 10<sup>77</sup> to say just how unlikely something in biology is.</p><p>These are all wrong in basically the same ways, but some of the ways they&#8217;re wrong are of general interest, so we&#8217;ll spell those out.</p><h3>Evolution Isn&#8217;t Random</h3><p>Mutation is random. Mutation powers evolution. Evolution is not random.</p><p>At every generation, you get some set of mutations, which are random. In information theory terms, mutations add noise, like static does. Bad ones make you less likely to reproduce, and the really bad ones never make it to a second generation because they kill you. Conversely, genes that tend to make you less likely to die young and more likely to reproduce tend to stick around for many generations. Every time these random mutations succeed or fail to propagate, noise is removed and information is added.</p><p>This is normally stated as something like &#8220;evolution by natural selection tends to increase fitness over time&#8221;. As a point of interest, &#8220;fitness&#8221; is a moving target, since it&#8217;s always &#8220;that which tends to propagate&#8221;, and what will tend to propagate tends to change over time. It is a complicated thing, but it isn&#8217;t random.</p><p>Creationist counting arguments calculate the probability of all of that information showing up at once, and don&#8217;t count any of the incremental energy or work that is used to get there. It is a lot like saying that it is impossible for people to live a whole mile away, because nobody&#8217;s got legs a mile long to step there. That is not how it works, and it&#8217;s a pretty basic misunderstanding.</p><h3>Fitness Landscapes Have Structure</h3><p>It turns out, most things that work are similar to other things that work.</p><p>Our counting argument requires us to imagine that each and every single part of any organism is completely unlike any other working part, either present or past. We have to imagine that every single protein or body part is unrelated to every other one, and this just isn&#8217;t true. For some basic examples, hands and feet have basically the same bones organized a little differently, and the proteins for green and red color perception are about 96% similar. Reusing and modifying existing parts, empirically, works quite well.</p><p>We like to call the space of possible mutations a fitness landscape, where &#8220;higher&#8221; points are more fit. You can imagine any given species meandering uphill on this landscape over time. The landscape itself is always shifting a bit and some of this movement is random, so it&#8217;s sort of a drunk Sisyphus situation, but in general it tends to have a specific direction, and it can and does go uphill only where this landscape is smooth and not where it would require jumping up a cliff.</p><p>All of this changes the math quite a bit. It&#8217;s very improbable to make a green opsin protein from scratch because it&#8217;s rather long, but it&#8217;s actually pretty easy to make it from a red opsin protein. The similarity between parts that work is what gives the fitness landscape its structure, making it smooth enough that it can feasibly be climbed (albeit, drunkenly). 
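</p><p>To make that concrete, here is a toy simulation of the difference; it is my illustration, not anything from the creationist literature, and the 28-character target, the 27-letter alphabet, and the mutation scheme are arbitrary choices (it is essentially Dawkins&#8217; &#8220;weasel&#8221; program in miniature, not a model of real biochemistry). It compares drawing whole strings blindly at random against keeping single-character mutations that score at least as well as the current string:</p><pre><code>import random
import string

TARGET = "METHINKS IT IS LIKE A WEASEL"   # arbitrary 28-character goal
ALPHABET = string.ascii_uppercase + " "   # 27 symbols per position

def score(s):
    """How many positions already match the target."""
    return sum(a == b for a, b in zip(s, TARGET))

def random_draw():
    return "".join(random.choice(ALPHABET) for _ in range(len(TARGET)))

def random_search(max_tries=100_000):
    """The counting-argument picture: draw whole strings blindly and hope."""
    for i in range(1, max_tries + 1):
        if random_draw() == TARGET:
            return i
    return None   # there are 27**28 (about 1.2e40) strings, so this never succeeds

def hill_climb():
    """Mutate one position at a time; keep the mutant if it scores at least as well."""
    current, steps = random_draw(), 0
    while current != TARGET:
        steps += 1
        pos = random.randrange(len(TARGET))
        mutant = current[:pos] + random.choice(ALPHABET) + current[pos + 1:]
        if score(mutant) >= score(current):
            current = mutant
    return steps

print("blind sampling:", random_search())          # None, effectively always
print("hill climbing:", hill_climb(), "mutations") # typically a few thousand
</code></pre><p>The hill climber reaches the target in a few thousand mutations, while blind sampling would need on the order of 27<sup>28</sup>, roughly 10<sup>40</sup>, draws to have any real chance of hitting it even once. All of the difference comes from the fact that near-misses score better than random strings, which is what it means for the landscape to have structure.</p><p>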
It does not matter how large the fitness landscape is, only that the landscape is smooth enough that it is possible to move uphill in it.</p><h2>The No Free Lunch Theorems</h2><p>We can move on to computers now, and mercifully we can be much briefer. Instead of explaining the errors in detail we will only point out where they&#8217;re the same.</p><p>The No Free Lunch theorems seem to tell you that optimization algorithms cannot work. This is very surprising, because optimizers do work, all the time. Their authors state this as &#8220;any two algorithms are equivalent when their performance is averaged across all possible problems&#8221;.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> They are technically correct. However, for this to matter at all, any given problem you wanted to solve would have to be pulled at random from &#8220;all possible problems&#8221;, and problems people want to solve would have to be completely dissimilar from each other.</p><p>Fortunately this is not true. Problems that humans are trying to solve are generally not completely random, and most problems are somewhat similar to each other. This gives the landscape of problems to be solved structure. For this reason computer optimization works pretty well, and techniques for optimization that work on one problem often work on other problems also.</p><h2>The Same Thing, But AI</h2><p>In &#8220;Reclaiming AI as a Theoretical Tool for Cognitive Science&#8221; (2024), van Rooij and coauthors claim to have &#8220;proved that achieving human-like intelligence using learning from data is intractable&#8221;.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> Their argument is also basically a counting argument.</p><p>Their core argument is that even a fifteen minute conversation has about 10<sup>270</sup> possible &#8220;situations&#8221;. Therefore no machine learning system can approximate a conversation at all better than chance because that is too many possible things.</p><p>I hope you&#8217;ll get the joke by now. Behaviors aren&#8217;t random. Behaviors are similar to each other so there&#8217;s structure to the optimization landscape, and almost all of the approximately 10<sup>270</sup> possible sentences or behaviors in the conversation are complete nonsense that it is extremely unlikely any algorithm would ever output. Therefore there exist algorithms that do not take longer than the life of the universe for choosing which sentence to say.</p><p>Much like the No Free Lunch Theorems, their argument would seem to disprove many algorithms which certainly do work better than chance, like the autocorrect on a cell phone. 
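</p><p>As a sketch of the structured alternative (in the spirit of Peter Norvig&#8217;s toy spelling corrector; the word-frequency table below is invented for illustration, and no real phone works from a six-word dictionary), you generate only the strings within one edit of what was typed and rank them against a frequency table, rather than scoring every possible string:</p><pre><code># Minimal autocorrect sketch: rank only candidates within one edit of the typed word.
# WORD_FREQ is an invented toy table; a real system uses counts from a large corpus.
WORD_FREQ = {"the": 500, "that": 300, "cat": 120, "chat": 25, "coat": 40, "cart": 30}
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def one_edit_away(word):
    """Every string reachable by one deletion, substitution, or insertion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in LETTERS}
    insertions = {a + c + b for a, b in splits for c in LETTERS}
    return deletes | substitutions | insertions

def correct(word):
    """Return the most frequent known word within one edit, or the word unchanged."""
    if word in WORD_FREQ:
        return word
    candidates = [w for w in one_edit_away(word) if w in WORD_FREQ]
    return max(candidates, key=WORD_FREQ.get) if candidates else word

print(correct("cht"))   # -> "cat": ~185 raw candidates, 2 known words, nothing like 10**270
</code></pre><p>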
There are a lot of possible options for autocorrect and it would be intractable to actually check them all each time someone pressed a key, which is why that&#8217;s not how it works.</p><h2>Inevitable AI Doom</h2><blockquote><p>If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.</p><ul><li><p><em>If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All</em></p></li></ul></blockquote><p>This is the actual point of the essay, and we will go over it in some detail.</p><p>All of this comes from definitions and interpretations of &#8220;The Orthogonality Thesis&#8221;. In its basic form, the Orthogonality Thesis is basically inoffensive and seems roughly correct:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><div class="callout-block" data-callout="true"><p>Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.</p></div><p>This isn&#8217;t strictly true, because how smart something is and what it wants are at least a little bit related in some cases. They are not, however, necessarily or always related, and this relationship sometimes breaks down, so as a precautionary principle this is fine. Something can be very smart and it can want pretty much anything. This is also true of smart people, who sometimes want nonsensical or weird things. Wanting anything at all is sort of nonsensical, and what things specifically you want are to some degree arbitrary. In humans this is somewhat limited, because many wants are very human and some others are not very human at all, but it&#8217;s not extremely limited. People alone want many different things, often vastly different from each other.</p><p>Where this starts to go wrong is here in 2008,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> which begins the line of arguments to the effect that the Orthogonality Thesis means that AI will almost certainly kill us all. 
This actually starts before Bostrom coins &#8220;The Orthogonality Thesis&#8221; as a term, but it&#8217;s the same argument.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RGg4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F230ed7ab-70a3-4ba1-b204-77ac214f9c85_684x684.png"><img src="https://substackcdn.com/image/fetch/$s_!RGg4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F230ed7ab-70a3-4ba1-b204-77ac214f9c85_684x684.png" width="684" height="684" class="sizing-normal" loading="lazy" alt="Diagram illustrating the concept of 'Minds-in-general' as a large circle. Inside it, a vertical blue ellipse represents 'Posthuman mindspace,' which contains a smaller region called 'Transhuman mindspace,' which in turn contains a small pink oval labeled 'Human minds.' Outside these nested ellipses but still within the outer circle are three other AI types: 'Bipping AIs' (small red dot, top), 'Freepy AIs' (green starburst shape, right), and 'Gloopy AIs' (yellow circle, bottom-left). The diagram conveys that human minds occupy a tiny subset of all possible mind designs, and that various AI architectures could exist in very different regions of mindspace."></a></figure></div><p>In keeping with the theme, I hope it is basically obvious that &#8220;minds in general&#8221; do not, for any practical purpose, exist. 
You can only try to find minds-in-general by random sampling, and most of what you create at random won&#8217;t be a mind at all. Almost all possible minds are so vastly improbable that they have negligible chance of ever existing.</p><p>The general region of &#8220;posthuman mindspace&#8221;, as in, minds that humans have any reasonable chance of creating directly or indirectly, occupies something like half the area on the diagram. But this is actually much smaller; compared to all possible minds, both human minds and posthuman minds are extremely small sets, and only distinguished by the fact that they already exist or have some reasonable chance of existing.</p><p>We could take this as a diagram being a little imprecise, but the essay that contains it explicitly tells us to consider seriously the space of all possible minds:</p><blockquote><p>If we focus on the bounded subspace of mind design space that contains all those minds whose makeup can be specified in a trillion bits or less, then every universal generalization that you make has two to the trillionth power chances to be falsified.</p><p>Conversely, every existential generalization&#8212;&#8220;there exists at least one mind such that X&#8221;&#8212;has two to the trillionth power chances to be true.</p></blockquote><p>And later:</p><blockquote><p>Somewhere in mind design space is at least one mind with almost any kind of logically consistent property you care to imagine. </p></blockquote><p>Well, we certainly have a lot of emphasis on the size of the space, but we haven&#8217;t explicitly asserted that we have to worry about drawing <em>randomly</em> from it. This very large space of all possible minds might only be meant to establish that such a thing is <em>possible in principle</em>. I have also used this style of proof. Sometimes you can even prove that something exists in principle but it is impossible or intractable to calculate it, and this can be a clever little bit of mathematics.</p><blockquote><p>Orthogonality thesis. Mind design space is huge enough to contain agents with almost any set of preferences, and such agents can be instrumentally rational about achieving those preferences, and have great computational power. For example, mind design space theoretically contains powerful, instrumentally rational agents which act as expected paperclip maximizers and always consequentialistically choose the option which leads to the greatest number of expected paperclips. See: Bostrom (2012); Armstrong (2013).</p><p>[&#8230;]</p><p>A superintelligence with a randomly generated utility function would not do anything we see as worthwhile with the galaxy, because it is unlikely to accidentally hit on final preferences for having a diverse civilization of sentient beings leading interesting lives.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p></blockquote><p>This is unfortunately pretty conclusive. We have here the paperclipper, which is now a quaint and retro meme about AI killing everyone, and we are explicitly told to be afraid of AI research <em>because</em> the space is vast <em>and</em> we are worried that we may be drawing from it at random, or effectively at random.</p><p>This is a counting argument. This is <em>the same</em> counting argument, but modified in sort of a clever way. The creationist argument is that you can never find a protein that works, because there are too many proteins that do not work. 
This argument is that you can never find an AI that does not kill everyone, because there are too many AI that do kill everyone. The assumptions are that the space is very large, and we are (or might be!) drawing from it at random. This is much more upsetting than the <em>normal</em> kind of counting argument, which tells you that God exists or that optimizers or autocomplete don&#8217;t work, but it is logically the same argument. It is also wrong for the same reasons.</p><p>I am not the first person to notice this, and there is already a <a href="https://optimists.ai/2024/02/27/counting-arguments-provide-no-evidence-for-ai-doom/">detailed and very good write-up</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> of how badly wrong this sort of argument is, both mathematically and empirically, when applied to the gradient descent optimizer. What I hope to add here is that this counting intuition is the core of MIRI&#8217;s argument and position on AI and always has been.</p><p>To spell this out explicitly: It seems sort of obviously true that you can create things that have very different goals from a human, or goals hostile to humans. Crabs have very different goals from humans. Humans can go insane in many amazing ways, and will often adopt goals, if you can call them goals, that are very far from human norm. Something that is not human can easily be at least as different from us as we are from crabs or the insane, and likely much more. I am not saying that AI is inherently safe or not weird.</p><p>The point I&#8217;m trying to make is that the intuition around the <em>inevitability</em> of AI doom, the argument leading to the thesis that &#8220;If Anyone Builds It, Everyone Dies&#8221;, the thing that leads you to believe there&#8217;s a 99% chance of everyone dying and to preach that nuclear war is a better outcome than people studying AI,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> is fundamentally a counting argument based on bad intuition about large spaces and optimization.</p><p>You could argue that this intuition is not core to the appeal of the argument, but I think there is no good reason to believe this. This is the core argument, it has been made consistently in these exact words for over a decade, and the appeal is specifically that the space is <em>so large</em> that it contains <em>many dangerous things</em> and drawing from it is <em>inherently very dangerous</em>.</p><p>We also see those leaning heavily on the orthogonality thesis say things that are, taken literally, completely nonsensical <em>except</em> if their reasoning is actually this sort of counting argument.</p><blockquote><h3>Size of mind design space</h3><p>The space of possible minds is enormous, and all human beings occupy a relatively tiny volume of it - we all have a cerebral cortex, cerebellum, thalamus, and so on. The sense that AIs are a particular kind of alien mind that &#8216;will&#8217; want some particular things is an undermined intuition. &#8220;AI&#8221; really refers to the entire design space of possibilities outside the human. Somewhere in that vast space are possible minds with almost any kind of goal. 
For any thought you have about why a mind in that space ought to work one way, there&#8217;s a different possible mind that works differently.</p><p>This is an exceptionally generic sort of argument that could apply equally well to any property P of a mind, but is still weighty even so: If we consider a space of minds a million bits wide, then any argument of the form &#8220;Some mind has property P&#8221; has 2<sup>1,000,000</sup> chances to be true and any argument of the form &#8220;No mind has property P&#8221; has 2<sup>1,000,000</sup> chances to be false.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p></blockquote><p>And separately:</p><blockquote><p>The preferences that wind up in a mature AI are complicated, practically impossible to predict, and vanishingly unlikely to be aligned with our own, no matter how it was trained.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p></blockquote><p>&#8220;Vanishingly unlikely&#8221; is a term of art in probability theory, which means that something has a probability so low that it can be considered zero. This is the case if the space for the opposite result is vastly larger. This is not a statement that makes any sense if it is about the probability of <em>humans messing up or not understanding the consequences of what they are doing</em> in the course of pursuing some research program, where there&#8217;s certainly <em>some</em> probability of humans figuring the problem out. This is a statement that makes sense entirely when comparing the size of a space that is much much larger, because you imagine that you are performing a random draw or something like it from that space.</p><p>This emphasis is pretty consistent. In 2022, Yudkowsky argues that &#8220;capabilities generalize further than alignment&#8221;, that there are &#8220;unbounded degrees of freedom&#8221; in goal-space, and similar. This is a better argument, but it&#8217;s still a counting argument. To claim &#8220;unbounded degrees of freedom&#8221; is about the size of a space, and &#8220;almost every kind of coffee&#8221; is a claim about what fraction of goal-space has a particular property. And just to remove any doubt, the document links back to the LessWrong page on orthogonality, with 2<sup>1,000,000</sup> on it, as a prerequisite for understanding the rest.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a></p><p>It is sometimes asserted that orthogonality is meant only to establish that, <em>in principl</em>e, an AI could go rogue. This is a motte and bailey, where Yudkowsky personally and many of his more enthusiastic readers clearly seem to believe and espouse the very strong and incorrect counting argument version of orthogonality. They state fairly unequivocally that AI is definitely going to kill everyone, and they equally clearly haven&#8217;t got any real idea about <em>why</em> they expect this if it isn&#8217;t the counting argument. When challenged they retreat to claiming it is only about whether dangerous AI, hypothetically, can exist, or whether considering AI dangers is important and worth doing.</p><h3>Optimization Targets Aren&#8217;t Random</h3><p>Drawing a mind at random is explored in the classic thought experiment we call the Boltzmann brain. 
What are the odds of an entire, fully-formed brain coming into existence by sheer chance? They are extremely low, but not zero. This is a funny fact and an interesting thought experiment and of no relevance to anything humans might have any chance of ever building.</p><p>In modern AI, this is equivalent to simply initializing a very large neural network and not training it at all. What are the odds that this neural network does anything useful or interesting? These odds are astronomically poor, and such neural networks output either nothing or white noise.</p><p>Optimization targets come from our specific universe, and indeed generally come from human data and human concerns, and human ideas are either directly stated or strongly implied in our optimization targets. Given especially that every AI paradigm that currently works is incredibly data-hungry, it seems like it would actually be much harder to create anything that seems reasonably intelligent <em>without</em> at least giving it a lot of information about humans and human values. You would have to actively exclude all of language, for starters.</p><p>It is, in a limited sense, true that some optimization targets have nothing to do with humans or human values. Pure math is one, and plausibly some forms of adversarial training objectives are completely devoid of any residue of humanity. What this implies is much weaker than the orthogonality thesis: that any training objective that <em>does not</em> have human values in its training data or objective will learn few if any traits resembling human or human-friendly values, and be very hard to guide with respect to human or human-friendly values. This also says nothing about the difficulty of optimizing for <em>good</em> human-related values, since there are many values concerning humans that are bad for humans, like negative utilitarianism. These are true enough, but are so weak that they certainly do not support inevitable &#8220;AI Doom&#8221; as a conclusion. It simply means that you could, if incautious, successfully arrive at doom, not that you are definitely heading there.</p><h3>The Landscape of Reachable Minds Has Structure</h3><p>What we make comes from what we already have. What do we already have? First, data, which is overwhelmingly produced by humans and contains useful information about humans. Second, existing AI systems, where anything we make is in general rather similar to its predecessors. We are not drawing at random, we are searching from where we are.</p><p>So long as we are taking relatively incremental steps, it is actually very hard to see how this goes suddenly wrong. It only goes <em>very</em> wrong if people doing research have vastly miscalculated either how far a step they have taken or how well they understand what they already have.</p><p>Minds we already have (including our own) can be studied in detail to determine what properties they have, and which of those we think are good or bad. We can use this to inform what is made next and how it is used. Our direction of travel is not completely random, nor is it completely blind, and reasonable progress has been made on making systems do what we want them to do and avoiding what we do not want. For example, <a href="https://intelligence.org/files/ValueLearningProblem.pdf">MIRI employees have historically said value loading or learning was a major problem here</a>, but we have <a href="https://www.verysane.ai/p/alignment-is-proven-to-be-tractable">made reasonable progress on value loading</a>. 
MIRI has not, apparently, noticed this.</p><p>There are softer versions of the AI Doom argument that argue doom is inevitable not because of the size of the space of all possible minds, but because of the space of all possible ways for things to go wrong, and this is also a counting argument. For example, in <em>If Anyone Builds It, Everyone Dies</em> it is argued that even if an AI doesn&#8217;t kill everyone, it would change them in some very hostile way, as humans did to wolves by making them into dogs. This also relies on a counting argument to claim such a mishap is inevitable: &#8220;We would not be its favorite things, among all things it could create.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> That there are many bad possibilities is only a fundamental problem if we are sampling objectives at random, and there&#8217;s no structure we can use to find a good result.</p><p>Interestingly, a later rung of the &#8220;AI Doom&#8221; thesis acknowledges that the space of all minds has structure in the form of &#8220;Instrumental Convergence&#8221;, that is, all minds that are good at accomplishing things will inevitably converge on seeking some form of power because this enables them to accomplish more. This is interesting because it acknowledges some existing structure to the space of all possible minds, but then completely denies, ignores, or fails to consider searching for any structure that could be used to avoid unwanted outcomes.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a></p><p>Another quirk in MIRI&#8217;s positioning is that their preferred program appears to be spending a few decades doing human genetic engineering instead of working on AI.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a> Yudkowsky himself has gotten increasingly direct about this over the last few years, calling human genetic engineering &#8220;literally the solution to the alignment problem&#8221; on the Trajectory podcast,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a> and elsewhere saying &#8220;my message to humanity is &#8216;back off and augment&#8217; not &#8216;back off and solve it with a clever theory&#8217;&#8221;.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a> This is seconded by MIRI&#8217;s president, who recommends Earth &#8220;pursue other routes to the glorious transhumanist future, such as uploading&#8221;, and on superbabies says &#8220;I doubt we have the time, but sure, go for superbabies&#8221;.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a> Yudkowsky has told many people directly to back off of AI alignment work and instead pursue intelligence augmentation, adult gene therapy, or &#8220;superbabies&#8221;.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-19" 
href="#footnote-19" target="_self">19</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-20" href="#footnote-20" target="_self">20</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-21" href="#footnote-21" target="_self">21</a></p><p>Why would you imagine this was safer or more predictable? &#8220;Very smart genetically engineered humans&#8221; are in my opinion likely more difficult to understand or be certain of than AI is. You have white box access to the AI, can read off its internal state directly, and the only limit to how well you can understand it is that it can be very large and complex. You cannot do this to humans, either current or augmented, because brains are generally black boxes and reading individual neurons is very difficult. Running a massive eugenics program for a prolonged period of time is therefore unlikely to help with the problem, on top of the risk that it will not work at all and the many likely negative consequences of doing such a thing.</p><p>More simply: If the &#252;bermenschen align the AI, who aligns the &#252;bermenschen?</p><p>Why would you make this mistake? My impression is that this is because they understand, implicitly or explicitly, that the landscape of minds you can reach by modifying humans has structure, and therefore you could in principle reach a good outcome by modifying humans very carefully. They do not seem to understand that the space of possible AI minds also has structure.</p><p>Given that the landscape of reachable AI systems does have structure, the correct question is not about which minds exist but about which paths through mind-space are reachable and whether we have or can get enough information to choose a path correctly. Based on the information we have, it is essentially impossible to be completely certain about this, and to believe that it is <em>vanishingly unlikely</em> that future AI is aligned with humans requires thinking in a different and incorrect paradigm where you simply count the number of possible minds, paths, or results.</p><h2>How Much Understanding Do We Need?</h2><p>Many technologies are fundamentally dangerous. For obvious examples, we can consider fire and nuclear reactors. It has been possible to control fire enough for it to be used usefully and more or less safely since the stone age, and in the modern world our control of fire is so precise that we can burn gasoline to power cars fairly safely, with time between mishaps measured in thousands of miles. Nuclear power presents a more mixed record, and although it is certainly possible as a matter of pure technology to generate power from uranium safely, the social institution of &#8220;actually building and running a reactor&#8221; has failed at this catastrophically on several notable occasions. Technology can be quite dangerous without being <em>inherently</em> dangerous.</p><p>In neither case, however, are we required to understand the phenomenon <em>perfectly</em> in order to use it. Even our best modern physics is fundamentally somewhat approximate, and we cannot hope to account for the motion of every atom in even well-understood processes like burning gasoline due to the chaotic nature of chemical reactions. If we make a point of counting <em>all possible things</em> that the atoms could be doing, there are clearly too many possibilities for us to do this safely! Cars generally run anyway. 
Scientists and engineers are expected to know what they can be certain of, what they cannot be certain of, and how to push the boundary between them forward and handle it with care.</p><p>The MIRI position on what we should do about AI is to advocate an indefinite global moratorium on frontier AI development, to be lifted only when &#8220;humanity&#8217;s state of knowledge and justified confidence about its understanding of relevant phenomena has drastically changed&#8221;.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-22" href="#footnote-22" target="_self">22</a> They have never specified concrete criteria for what would constitute sufficient progress, and they have never said what &#8220;drastically changed&#8221; would mean in practice. Ultimately a few people in Berkeley don&#8217;t think we understand AI enough, they refuse to change their minds or say what would, and they think we should stop studying AI.</p><p>This makes a certain kind of sense if you actually believe the counting argument. If the problem is that you are drawing from a vast and intractable space you cannot characterize, then by definition you can never be confident enough, because the space is too large to ever handle and any local progress is just a tiny island in an ocean of things that could still go wrong. There is no amount of empirical understanding that can possibly bridge a gap that is 2<sup>1,000,000</sup> wide. Under those assumptions, the only honest position is exactly the one MIRI takes: stop, indefinitely, on criteria that cannot in principle be met.</p><p>This is an anti-science conclusion, and the reasoning is nonscientific. If the question is what specific optimization processes actually do when applied to specific training data, then you can study those processes, characterize their behavior, run experiments, and define success criteria for the kinds of systems we are actually building. You do not need to map out the space of all possible minds to know roughly what the next training run is going to do, any more than a civil engineer needs to enumerate every possible arrangement of steel and concrete to know whether a particular bridge will hold. You only need to understand the part of the space you are actually in, and you only need to understand it well enough to take the next step without falling off a cliff.</p><p>The creationist counting argument leads to God of the gaps: the space of possible biological configurations is too vast to search, therefore the question is permanently unanswerable by natural means and there must be a designer. The MIRI counting argument leads to doom of the gaps: the space of possible minds is too vast to guarantee safety, therefore the question is permanently unanswerable by empirical means and there must be a catastrophe. In both cases the structure of the error is the same. You count a space you will never explore, point at how huge it is, and treat that hugeness as evidence about the much smaller space you actually inhabit. This style of argument essentially always leads to serious errors.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Wolpert, D. &amp; Macready, W. &#8220;No Free Lunch Theorems for Optimization.&#8221; <em>IEEE Transactions on Evolutionary Computation</em> 1, no. 
1 (1997): 67-82.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>van Rooij, I., Guest, O., Adolfi, F., de Haan, R., Kolokolova, A. &amp; Rich, P. &#8220;Reclaiming AI as a Theoretical Tool for Cognitive Science.&#8221; <em>Computational Brain &amp; Behavior</em> 7 (2024): 616-636. The &#8220;intractable&#8221; quote is from the abstract; the 10<sup>270</sup> illustration appears in Box 1.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Bostrom, N. &#8220;The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents.&#8221; <em>Minds and Machines</em> 22, no. 2 (2012): 71-85.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Yudkowsky, E. &#8220;The Design Space of Minds-In-General.&#8221; LessWrong, June 25, 2008. <a href="https://www.lesswrong.com/posts/tnWRXkcDi5Tw9rzXw/the-design-space-of-minds-in-general">https://www.lesswrong.com/posts/tnWRXkcDi5Tw9rzXw/the-design-space-of-minds-in-general</a>. The 2<sup>trillion</sup> counting argument and all three block quotes in this section are from this post, as are the mind design space diagram and the AIXI reference discussed in footnote 12.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Yudkowsky, E. &#8220;Five theses, two lemmas, and a couple of strategic implications.&#8221; intelligence.org, May 5, 2013. <a href="https://intelligence.org/2013/05/05/five-theses-two-lemmas-and-a-couple-of-strategic-implications/">https://intelligence.org/2013/05/05/five-theses-two-lemmas-and-a-couple-of-strategic-implications/</a>. Both the orthogonality / paperclip maximizer quote and the &#8220;randomly generated utility function&#8221; quote are from this post.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>This piece written with Nora Belrose goes well with Quentin Pope&#8217;s other essays <a href="https://optimists.ai/2023/04/11/evolution-provides-no-evidence-for-the-sharp-left-turn/">explaining bad evolutionary analogies in AI</a> and arguing that &#8220;<a href="https://optimists.ai/2023/09/16/ai-pause-will-likely-backfire/">AI Pause Will Likely Backfire</a>&#8221;, both of which seem to be quite correct.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Yudkowsky, E. &#8220;Pausing AI Developments Isn&#8217;t Enough. We Need to Shut it All Down.&#8221; TIME, March 29, 2023, <a href="https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-not-enough/">https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-not-enough/</a>. 
Yudkowsky calls for an &#8220;indefinite and worldwide&#8221; moratorium on large training runs and writes: &#8220;Make it explicit in international diplomacy that preventing AI extinction scenarios is considered a priority above preventing a full nuclear exchange, and that allied nuclear countries are willing to run some risk of nuclear exchange if that&#8217;s what it takes to reduce the risk of large AI training runs.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Yudkowsky, E. &#8220;Orthogonality Thesis.&#8221; LessWrong tag page, <a href="https://www.lesswrong.com/w/orthogonality-thesis">https://www.lesswrong.com/w/orthogonality-thesis</a> (archived: <a href="https://web.archive.org/web/20260322102316/https://www.lesswrong.com/w/orthogonality-thesis">https://web.archive.org/web/20260322102316/https://www.lesswrong.com/w/orthogonality-thesis</a>, 2025). Both the &#8220;size of mind design space&#8221; and 2<sup>1,000,000</sup> quotes are from this page.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Yudkowsky, E. &amp; Soares, N. <em>If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All</em>. Little, Brown and Company, 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Yudkowsky, E. &#8220;AGI Ruin: A List of Lethalities.&#8221; LessWrong, June 6, 2022. <a href="https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities">https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities</a>. The &#8220;unbounded degrees of freedom&#8221; language appears in Point 21; &#8220;almost every kind of coffee&#8221; in Point 23; the link to the LessWrong Orthogonality Thesis page is in Point -3.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Yudkowsky, E. &amp; Soares, N. <em>If Anyone Builds It, Everyone Dies</em>, &#8220;We Wouldn&#8217;t Make the Best Pets.&#8221; Both the dogs-and-wolves analogy and the &#8220;favorite things&#8221; quote are from this passage, where humans are described as unlikely to be &#8220;the best version of whatever the AI wants.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>There is an irony worth noting here. Yudkowsky&#8217;s 2008 essay invokes AIXI &#8212; Marcus Hutter&#8217;s incomputable mathematical idealization of a perfect reasoner &#8212; to bolster the intuition that the space of minds is vast and alien. But AIXI&#8217;s own formalism contains the rebuttal. AIXI reasons over all computable environments using the Solomonoff prior, which weights hypotheses by complexity: a program of length n gets prior weight 2<sup>-n</sup>, so simple hypotheses dominate exponentially. 
Under this prior, Hutter&#8217;s own collaborators (Lattimore &amp; Hutter, &#8220;No Free Lunch versus Occam&#8217;s Razor in Supervised Learning&#8221;, 2011/2013; Everitt, Lattimore &amp; Hutter, &#8220;Free Lunch for Optimisation under the Universal Distribution&#8221;, IEEE CEC 2014) proved that the No Free Lunch theorems do not hold &#8212; structured priors break the symmetry that counting arguments require. The space of all computable environments is infinite, but almost all of the probability mass concentrates in the simple corner. The formalism that Yudkowsky cites to make the space of minds feel terrifying is the same formalism that shows you don&#8217;t actually need to search all of it.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>MIRI. &#8220;2024 Mission and Strategy Update.&#8221; intelligence.org, January 2024, <a href="https://intelligence.org/2024/01/04/miri-2024-mission-and-strategy-update/">https://intelligence.org/2024/01/04/miri-2024-mission-and-strategy-update/</a>. The document acknowledges genetic engineering as MIRI&#8217;s preferred biological alternative track: &#8220;Human intelligence augmentation is feasible over a scale of decades to generations, given iterated polygenic embryo selection. I don&#8217;t see any feasible way that gene editing or &#8216;mind uploading&#8217; could work within the next few decades.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>Yudkowsky on <em>The Trajectory</em> podcast with Daniel Faggella, &#8220;<a href="https://www.youtube.com/watch?v=YlsvQO0zDiE">Human Augmentation as a Safer AGI Pathway</a>&#8221; (AGI Governance, Episode 6).</p><p>Quoted and summarized in a LessWrong writeup: <a href="https://www.lesswrong.com/posts/bSHCZ6dbAdfMbvuXB/yudkowsky-on-the-trajectory-podcast">https://www.lesswrong.com/posts/bSHCZ6dbAdfMbvuXB/yudkowsky-on-the-trajectory-podcast</a>. Full quote: &#8220;If we have time, human genetic engineering literally is the solution to the alignment problem. We are maybe 5-8 years out from being able to...&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>Yudkowsky, E. Comment on Vaniver&#8217;s &#8220;Critical review of Christiano&#8217;s disagreements with Yudkowsky,&#8221; LessWrong, December 27, 2023, <a href="https://www.lesswrong.com/posts/8HYJwQepynHsRKr6j/critical-review-of-christiano-s-disagreements-with-yudkowsky?commentId=9pKofQAchdgCH8jjm">https://www.lesswrong.com/posts/8HYJwQepynHsRKr6j/critical-review-of-christiano-s-disagreements-with-yudkowsky?commentId=9pKofQAchdgCH8jjm</a>: &#8220;humanity needs to back off and augment intelligence before proceeding... My message to humanity is &#8216;back off and augment&#8217; not &#8216;back off and solve it with a clever theory&#8217;.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>Soares, N. 
&#8220;On how various plans miss the hard bits of the alignment challenge.&#8221; LessWrong, July 12, 2022, <a href="https://www.lesswrong.com/posts/3pinFH3jerMzAvmza/on-how-various-plans-miss-the-hard-bits-of-the-alignment-challenge">https://www.lesswrong.com/posts/3pinFH3jerMzAvmza/on-how-various-plans-miss-the-hard-bits-of-the-alignment-challenge</a>. On the superbabies plan: &#8220;I doubt we have the time, but sure, go for superbabies. It&#8217;s as dignified as any of the other attempts to walk around this hard problem.&#8221; On the alternative: &#8220;I basically recommend that Earth pursue other routes to the glorious transhumanist future, such as uploading.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ESYudkowsky/status/2022545643324284985&quot;,&quot;full_text&quot;:&quot;And this is going past the bounds of common fucking sense, which says that if you fucking accept that something is 96% likely to kill everyone on the planet, back the fuck off, work on human intelligence augmentation, and sign up for cryonics if you're so fucking scared.&quot;,&quot;username&quot;:&quot;ESYudkowsky&quot;,&quot;name&quot;:&quot;Eliezer Yudkowsky &#9209;&#65039;&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1934759522050166788/xKpgxWW5_normal.jpg&quot;,&quot;date&quot;:&quot;2026-02-14T05:37:13.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:8,&quot;retweet_count&quot;:6,&quot;like_count&quot;:197,&quot;impression_count&quot;:12754,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ESYudkowsky/status/1959645205428404603&quot;,&quot;full_text&quot;:&quot;<span class=\&quot;tweet-fake-link\&quot;>@watashiwacringe</span> <span class=\&quot;tweet-fake-link\&quot;>@VoidAtoms</span> I now also believe this to be mainly futile.  
If you can direct people, you should be directing them to work on human intelligence augmentation.&quot;,&quot;username&quot;:&quot;ESYudkowsky&quot;,&quot;name&quot;:&quot;Eliezer Yudkowsky &#9209;&#65039;&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1934759522050166788/xKpgxWW5_normal.jpg&quot;,&quot;date&quot;:&quot;2025-08-24T15:53:20.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:2,&quot;retweet_count&quot;:1,&quot;like_count&quot;:18,&quot;impression_count&quot;:807,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-19" href="#footnote-anchor-19" class="footnote-number" contenteditable="false" target="_self">19</a><div class="footnote-content"><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ESYudkowsky/status/1953145905433198897&quot;,&quot;full_text&quot;:&quot;<span class=\&quot;tweet-fake-link\&quot;>@juddrosenblatt</span> I don't think time alone fixes it.  I think you need human intelligence augmentation.  If you enforced a pause that gave us 100 years of argumentation from merely current minds and institutions, I think it converges to a wrong answer.&quot;,&quot;username&quot;:&quot;ESYudkowsky&quot;,&quot;name&quot;:&quot;Eliezer Yudkowsky &#9209;&#65039;&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1934759522050166788/xKpgxWW5_normal.jpg&quot;,&quot;date&quot;:&quot;2025-08-06T17:27:26.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:2,&quot;retweet_count&quot;:0,&quot;like_count&quot;:17,&quot;impression_count&quot;:496,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-20" href="#footnote-anchor-20" class="footnote-number" contenteditable="false" target="_self">20</a><div class="footnote-content"><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ESYudkowsky/status/1944135001484013965&quot;,&quot;full_text&quot;:&quot;<span class=\&quot;tweet-fake-link\&quot;>@alcherblack</span> <span class=\&quot;tweet-fake-link\&quot;>@zipppzy1</span> I worry we don't get enough time to do genetic engineering, and would prefer to go hard on adult gene therapy.&quot;,&quot;username&quot;:&quot;ESYudkowsky&quot;,&quot;name&quot;:&quot;Eliezer Yudkowsky &#9209;&#65039;&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1934759522050166788/xKpgxWW5_normal.jpg&quot;,&quot;date&quot;:&quot;2025-07-12T20:41:19.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:0,&quot;retweet_count&quot;:0,&quot;like_count&quot;:15,&quot;impression_count&quot;:218,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-21" href="#footnote-anchor-21" class="footnote-number" contenteditable="false" target="_self">21</a><div class="footnote-content"><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ESYudkowsky/status/1938284311943475215&quot;,&quot;full_text&quot;:&quot;<span class=\&quot;tweet-fake-link\&quot;>@eshear</span> I mean, the thing 
that *I* and other sensible people want from them is superbabies or better yet adult gene therapies.&quot;,&quot;username&quot;:&quot;ESYudkowsky&quot;,&quot;name&quot;:&quot;Eliezer Yudkowsky &#9209;&#65039;&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1934759522050166788/xKpgxWW5_normal.jpg&quot;,&quot;date&quot;:&quot;2025-06-26T17:12:46.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:1,&quot;retweet_count&quot;:0,&quot;like_count&quot;:55,&quot;impression_count&quot;:3384,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-22" href="#footnote-anchor-22" class="footnote-number" contenteditable="false" target="_self">22</a><div class="footnote-content"><p>MIRI. &#8220;2024 Mission and Strategy Update.&#8221; intelligence.org, January 2024, <a href="https://intelligence.org/2024/01/04/miri-2024-mission-and-strategy-update/">https://intelligence.org/2024/01/04/miri-2024-mission-and-strategy-update/</a>. The &#8220;drastically changed&#8221; quote describes MIRI&#8217;s first strategic objective; the underlying &#8220;with actual teeth&#8221; framing originates from Yudkowsky&#8217;s March 2023 op-ed in TIME.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Against the Luddites]]></title><description><![CDATA[The rehabilitation of Luddism is a vice signal.]]></description><link>https://www.verysane.ai/p/against-the-luddites</link><guid isPermaLink="false">https://www.verysane.ai/p/against-the-luddites</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Sun, 29 Mar 2026 16:41:57 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9eaf8cf4-bc64-49dd-a15b-5bc665d8ac36_512x491.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Luddism does not deserve to be rehabilitated. It was a medieval throwback, reactionary and primitive, a pre-Marxist labor convulsion closer in spirit to the Khmer Rouge&#8217;s fantasies of agrarian restoration than to the universalist solidarity of Eugene Debs. The contemporary effort to recast the Luddites as thoughtful critics of technology gives modern anxieties about AI a historical pedigree they do not deserve. The movement was a violent defense of guild privilege, male supremacy, and craft hierarchy against the leveling forces of industrial modernity. They could have fought for equality and justice; they chose instead to fight to remain the petty bosses of their own towns rather than cede that authority to the owners of factories.</p><p>The rehabilitation of Luddism is a vice signal. Anyone genuinely concerned with workers&#8217; dignity has Marx, Debs, and Martin Luther King to hand, organizers who championed equality across lines of skill, race, and gender. The choice to reach past all of them for a movement of guild enforcers who beat women in the streets is made because of the violence and reaction, not despite it.</p><h2>An Elite Movement</h2><p>The Luddites did not represent the working class. Cambridge historian Richard Jones examined oral testimonies, trial documents, Parliamentary papers, and Home Office reports. 
He concluded that Luddism was &#8220;far from a genuinely pan-working class movement.&#8221; The Luddites were &#8220;a relatively &#8216;elite&#8217; group, whose role had traditionally been protected by legislation regulating the supply and conduct of labour.&#8221; In an industry employing a million people, the movement never exceeded a couple of thousand. Jones put it bluntly: &#8220;these were not downtrodden working class labourers. The Luddites were elite craftspeople.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>The Yorkshire croppers, the vanguard of Luddism in the West Riding, had to complete seven-year apprenticeships before they could practice their trade. After seven years, Jones notes, &#8220;they tended to feel that they were owed a living.&#8221; For the genuinely unskilled and dispossessed, displacement by machines was already old news; they had little reason to join a movement defending privileges they never possessed.</p><p>The Nottinghamshire framework-knitters ran the same racket. Among their core grievances was the employment of &#8220;colts,&#8221; workers who had not served the seven-year apprenticeship. They objected to &#8220;unapprenticed youths&#8221; and to the new wide frames, which produced cheaper goods that anyone could operate.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> Every Luddite complaint presupposed a closed guild system in which access to the trade was rationed by the incumbents.</p><h2>The Exclusion of Women</h2><p>The machines the Luddites smashed did something that, by any measure of human equality, should have been celebrated. As Daron Acemoglu has documented, they &#8220;replaced the scarce and expensive factors, the skilled artisans, by relatively cheap and abundant factors, unskilled manual labor of men, women, and children.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> The Luddites treated this as an outrage.</p><p>Violence against women was endemic to the skilled textile trades, of which Luddism was the most dramatic expression. A petition from Glasgow cotton manufacturers, preserved in parliamentary records, states:</p><blockquote><p>&#8220;In almost every department of the cotton spinning business, the labour of women would be equally efficient with that of men; yet in several of these departments, such measures of violence have been adopted by the combination, that the women who are willing to be employed, and who are anxious by being employed to earn the bread of their families, have been driven from their situations by violence.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p></blockquote><p>When the firm of James Dunlop and Sons built spinning machines small enough to be operated by women and employed female spinners, the women &#8220;were waylaid and attacked, in going to, and returning from their work; the houses in which they resided, were broken open in the night. 
The women themselves were cruelly beaten and abused; and the mother of one of them killed.&#8221; The firm was forced to dismiss all female spinners and hire only men.</p><p>In 1810, the Calton association of weavers formally resolved &#8220;that no new female apprentices could be taken except from the weaver&#8217;s own families.&#8221; In 1833, male cotton spinners struck against female spinners at Dennistoun&#8217;s mill in Calton, &#8220;using violent means to drive them from the workplace.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>A nationwide meeting of spinners in 1829 passed a resolution restricting the trade to &#8220;the son, brother, or orphan nephew of spinners, and the poor relations of the proprietors of the mills,&#8221; excluding women entirely. A demand that only your sons and nephews be permitted to practice the trade is an inheritance claim, not a workers&#8217; demand, closer in spirit to the family that passes down ownership of a car dealership than to any form of labor solidarity. As Marianna Valverde has observed, &#8220;the spinners&#8217; masculinity and craft were completely intertwined.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>The Luddite cause was inseparable from the cause of male monopoly over skilled labor. To rehabilitate Luddism without confronting this is to celebrate a movement that beat women in the streets for daring to earn a living.</p><h2>Marx and Engels Saw Through It</h2><p>Marx and Engels understood Luddism as primitive, misdirected resistance, a stage to be transcended, not celebrated.</p><p>In the <strong>Communist Manifesto</strong> (1848), Marx and Engels placed machine-breaking at the very earliest, most confused stage of proletarian development. Workers at this stage &#8220;direct their attacks not against the bourgeois conditions of production, but against the instruments of production themselves; they destroy imported wares that compete with their labour, they smash to pieces machinery, they set factories ablaze, they seek to restore by force the vanished status of the workman of the Middle Ages.&#8221; They remain &#8220;an incoherent mass scattered over the whole country, and broken up by their mutual competition.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><p>The phrase to dwell on is <strong>the vanished status of the workman of the Middle Ages</strong>. Marx saw the Luddites as men trying to restore feudal craft privileges by force, a reactionary project dressed up as resistance.</p><p>Engels went further. In <strong>The Condition of the Working Class in England</strong> (1845), he argued that pre-industrial craft workers &#8220;were not human beings; they were merely toiling machines in the service of the few aristocrats who had guided history down to that time.&#8221; The industrial revolution, for all its horrors, stripped away this comfortable illusion, &#8220;forcing them to think and demand a position worthy of men.&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> The order the Luddites wanted to restore was itself a form of servitude, one whose bars were harder to see.</p><p>In <strong>Capital</strong>, Vol. 
1, Marx made the critique structural. &#8220;It took both time and experience before workers learnt to distinguish between machinery and its employment by capital, and therefore to transfer their attacks from the material instruments of production to the form of society which utilises those instruments.&#8221; The Luddites had not learned this lesson. They attacked the machine and left the system untouched.</p><p>Elsewhere in the same chapter, Marx was explicit: the Luddite phase had to be superseded. The destruction of machinery was the first instinctive reaction of workers who had not grasped that their enemy was the social relation wielding the machine. Trade unions and political organization represented the maturation that machine-breaking could never achieve: workers learning to target the system rather than its tools.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p><p>Even Hobsbawm, the most sympathetic Marxist to touch the subject, conceded the point. &#8220;Collective bargaining by riot&#8221; is a generous description of Luddite machine-wrecking, but bargaining by riot is pre-political by definition.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a></p><h2>Restoration, Never Revolution</h2><p>The Luddites fought to preserve a hierarchy that benefited them, a hierarchy built on seven-year apprenticeships, guild monopolies, the exclusion of women, and the exclusion of the unskilled. They demanded the restoration of a vanishing past.</p><p>The croppers wanted Parliament to ban machines that had existed since the sixteenth century and to enforce the Statute of Artificers (1563), which mandated seven-year apprenticeships and restricted entry to trades. Parliament repealed the Statute in 1814, two years after the height of Luddism; even the Tory government of the era recognized it as obsolete. The framework-knitters invoked the authority of the Company of Framework Knitters, a guild body chartered in 1657, to justify their demands. The entire framework was pre-modern: traditional wages, access, and hierarchy.</p><p>We have seen what happens when those who yearn for a return to the past are given power. When the Khmer Rouge seized Cambodia in 1975, they emptied the cities and drove the population into the rice fields in pursuit of an agrarian Year Zero. The scale of violence is incomparable to anything the Luddites had the power to do, but the ideological shape is the same. It is the simple call to smash the instruments of modernity, return to a simpler, purer social order. The Luddites wanted to restore the medieval craftsman; the Khmer Rouge wanted to restore the agrarian peasant. Both treated the people empowered by modernization as threats to be suppressed. The reactionary romanticism and the hatred of the leveling effects of new productive forces are the same impulse.</p><p>Eugene Debs, by contrast, never demanded the restoration of the artisan&#8217;s workshop. He organized across skill, gender, and racial lines. The IWW&#8217;s founding convention in 1905 identified the craft union as the central obstacle to working-class solidarity. 
The AFL&#8217;s craft unions were the direct organizational descendants of the guild mentality the Luddites had fought and killed to preserve: apprenticeship requirements, restricted entry, jealous guarding of trade boundaries.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> The IWW set out to organize all workers regardless of skill, trade, race, or sex. Its founding was an explicit repudiation of everything the Luddites stood for.</p><p>We can either destroy the means of production or seize them, and we cannot do both.</p><h2>Conclusion</h2><p>The rehabilitation of the Luddites is an intellectual project that succeeds only by omission. It requires ignoring who the Luddites were (a small elite of privileged male artisans), what they wanted (the restoration of guild monopolies), and what they did (beat women, burned factories, and murdered a mill owner). It requires ignoring that Marx and Engels, the tradition&#8217;s own founders, saw Luddism as precisely the kind of confused, backward-looking, pre-political rebellion that had to be overcome before a genuine workers&#8217; movement could emerge.</p><p>If automation concentrates wealth and displaces workers, the answer is to change who owns the machines and how their gains are distributed, not to smash them. Nick Srnicek and Alex Williams, in <strong>Inventing the Future</strong> (2015), reject the anti-technology nostalgia they call &#8220;folk politics&#8221; and argue for universal basic income, a shorter work week, and collective ownership of automated production.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> Aaron Bastani&#8217;s <strong>Fully Automated Luxury Communism</strong> (2019) goes further: automation, renewable energy, and synthetic biology make a post-scarcity world materially possible, if their fruits are collectively owned rather than privately hoarded.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a> Both ask who should own the future, not whether the future should be allowed to arrive.</p><p>The legitimate grievance in Luddism, that workers deserve a say in how technology reshapes their conditions, does not need the Luddites as its vessel. That idea has better champions.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Richard Jones, research on the Luddite bicentenary, University of Cambridge. <a href="https://www.cam.ac.uk/research/news/rage-against-the-machine">&#8221;Rage against the machine.&#8221;</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Kevin Binfield, ed., <strong>Writings of the Luddites</strong> (Baltimore: Johns Hopkins University Press, 2004). Binfield&#8217;s introduction documents the framework-knitters&#8217; grievances, including the employment of &#8220;colts&#8221; (unapprenticed workers) and the use of wide frames. 
The croppers&#8217; campaign to enforce obsolescent apprenticeship legislation is documented in Adrian Randall, <strong><a href="https://www.amazon.com/Before-Luddites-Community-Machinery-1776-1809/dp/0521390427">Before the Luddites: Custom, Community, and Machinery in the English Woollen Industry, 1776&#8211;1809</a></strong> (Cambridge University Press, 1991). See also the <a href="https://www.encyclopedia.com/history/modern-europe/british-and-irish-history/luddites">Encyclopedia.com entry on Luddites</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Daron Acemoglu, <a href="https://www.nber.org/reporter/2003number1/technology-and-inequality">&#8221;Technology and Inequality,&#8221;</a> NBER Reporter, 2003. See also Kevin H. O&#8217;Rourke, Ahmed S. Rahman, and Alan M. Taylor, <a href="https://www.nber.org/papers/w14484">&#8221;Luddites, the Industrial Revolution, and the Demographic Transition,&#8221;</a> <strong>Journal of Economic Growth</strong> 18, no. 4 (2013): 373&#8211;409.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>&#8220;Women Workers in the British Industrial Revolution,&#8221; EH.net. <a href="https://eh.net/encyclopedia/women-workers-in-the-british-industrial-revolution/">eh.net</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>The Calton weavers&#8217; 1810 resolution barring female apprentices and the 1833 violent strike against female spinners are documented in the historical record of <a href="https://en.wikipedia.org/wiki/Calton_weavers">Calton weavers</a>, drawing on Norman Murray, <strong><a href="https://archive.org/details/scottishhandloom00murr">The Scottish Handloom Weavers, 1790&#8211;1850: A Social History</a></strong> (Edinburgh: John Donald, 1978). For the broader pattern of male textile workers&#8217; violent exclusion of women, see Marianna Valverde, &#8220;Giving the Female a Domestic Turn: the Social, Legal and Moral Regulation of Women&#8217;s Work in British Cotton Mills, 1820&#8211;1850,&#8221; <strong>Journal of Social History</strong> 21, no. 4 (1988): 619&#8211;634.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>&#8220;How 19th-Century Cotton Mills Influenced Workplace Gender Roles,&#8221; JSTOR Daily. <a href="https://daily.jstor.org/how-19th-century-cotton-mills-influenced-contemporary-gender-roles/">daily.jstor.org</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Marx and Engels, <strong>Communist Manifesto</strong> (1848), Chapter 1. 
<a href="https://www.marxists.org/archive/marx/works/1848/communist-manifesto/ch01.htm">marxists.org</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Friedrich Engels, <strong>The Condition of the Working Class in England</strong> (1845), Introduction. Available via <a href="https://www.marxists.org/archive/marx/works/1845/condition-working-class/">marxists.org</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Marx, <strong>Capital</strong>, Vol. 1, Chapter 15, Section 5 (&#8221;The Strife Between Workman and Machine&#8221;). Quotations are from the Fowkes translation (Penguin Classics, pp. 554&#8211;555). The chapter is available in the Moore and Aveling translation via <a href="https://www.marxists.org/archive/marx/works/1867-c1/ch15.htm">marxists.org</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>E. J. Hobsbawm, <a href="https://academic.oup.com/past/article-abstract/1/1/57/1508444">&#8221;The Machine Breakers,&#8221;</a> <strong>Past &amp; Present</strong>, Vol. 1, No. 1 (February 1952), pp. 57&#8211;70. Hobsbawm coined &#8220;collective bargaining by riot&#8221; in this article and acknowledged that such movements relied on &#8220;the natural protection of small numbers and scarce apprenticed skills, which might be safeguarded by restricted entry to the market and strong hiring monopolies.&#8221; The Midlands Luddites&#8217; invocation of the Company of Framework Knitters is documented in Kevin Binfield, ed., <strong>Writings of the Luddites</strong> (Johns Hopkins, 2004).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>The IWW Preamble, adopted at the founding convention in Chicago, June 27&#8211;July 8, 1905, declared that craft divisions &#8220;foster a state of affairs which allows one set of workers to be pitted against another set of workers in the same industry.&#8221; The Preamble and convention proceedings are available via <a href="https://www.iww.org/preamble/">iww.org</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>Nick Srnicek and Alex Williams, <strong><a href="https://en.wikipedia.org/wiki/Inventing_the_Future:_Postcapitalism_and_a_World_Without_Work">Inventing the Future: Postcapitalism and a World Without Work</a></strong> (London: Verso, 2015). 
Srnicek and Williams coined &#8220;folk politics&#8221; to describe the left&#8217;s retreat into localism, direct action, and anti-technology sentiment, arguing that these impulses cede the future to capital.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>Aaron Bastani, <strong><a href="https://en.wikipedia.org/wiki/Fully_Automated_Luxury_Communism">Fully Automated Luxury Communism: A Manifesto</a></strong> (London: Verso, 2019). Bastani argues that emerging technologies of abundance make post-scarcity achievable, provided the means of production are collectively owned.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Some Rough Notes on AI Policy]]></title><description><![CDATA[I hope that we can, at some point, have some reasonable regulations on AI, as we do concerning banks, wiretaps, and dangerous chemicals.]]></description><link>https://www.verysane.ai/p/some-rough-notes-on-ai-policy</link><guid isPermaLink="false">https://www.verysane.ai/p/some-rough-notes-on-ai-policy</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Thu, 26 Mar 2026 16:53:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RebC!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefb68660-7fe4-4d4c-bfbe-0e5eaaa5c04e_1000x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I hope that we can, at some point, have some reasonable regulations on AI, as we do concerning banks, wiretaps, and dangerous chemicals. I am somewhat pessimistic about legislation we have seen in the past, and legislation we are seeing now. In the interest of being topical I am publishing this relatively brief and relatively rough summary of my views, which I believe to be at least reasonably informed and which I hope are, to some degree, useful. I am much more informed about AI than law, and can only hope that no defect in my understanding of the law is a major problem here.</p><h2>What Good Policy Would Look Like</h2><p>Good regulations would directly regulate the sale and actual use of AI in such a way that it was less likely to be used for bad purposes. That is to say: It should first and foremost regulate, penalize, supervise, or otherwise concern itself with the conduct of companies offering AI services, the companies and governments employing those services, and to some degree the people making them.</p><p>For example: It is reasonable to require AI products to disclose that they are AI to the user. It would be reasonable to require AI-generated paid political advertisements to disclose that they are AI. It would be reasonable to restrict AI companies from marketing or selling their tools as direct replacements for licensed professionals such as doctors or lawyers. It would be reasonable to regulate what can be sold for explicit use within those professions, and when it is appropriate for people in those professions to use them, since their licensure and professional conduct are already within the government&#8217;s purview.</p><p>It would be reasonable to write regulations explicitly laying out how existing liability and consumer protection law should apply to AI products.
This includes &#8220;marketing defects&#8221;, which are liability surrounding failure to warn about known dangers, and &#8220;design defects&#8221; which concern products that are unreasonably dangerous given the type of product it is and what can be changed about it. This also includes any ordinary liability laws that assign liability to essentially any product causing essentially any harm already, but which may benefit from more detailed laws concerning their application to AI products.</p><p>It would probably benefit society substantially to establish who, if anyone, is responsible for deploying paid AI services that generate prodigious amounts of content that is likely or arguably child sexual abuse material or revenge porn. This appears to currently be a legal gray area, and for this to be a legal gray area seems to be extremely bad.</p><p>It is reasonable to pass regulations against the use of AI, without explicit enabling legislation, for surveillance, invasions of privacy, social control, or critical systems. The EU AI Act notably bans the use of AI for social scoring by governments, real-time biometric identification in public, and using emotion detection in workplaces and schools. It restricts and imposes mandatory responsible use policies for, but does not outright ban, AI for use in hiring, credit scoring, law enforcement, border control, education, critical infrastructure and medical devices.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> It seems, on its face, to be a reasonable sort of law.</p><p>It seems like this should go without saying, but it is reasonable to restrict by law when the government, including law enforcement and the military, can deploy systems which can kill people without a human making the decision.</p><h2>Regulating Training</h2><p>At the very edge, it is reasonable to regulate what can be trained, separately from what is sold. Such a regulation is only going to be perceived as legitimate if it is even-handed and applied well. Ordinarily training or creating an AI system is vastly different from selling access to one. Most people engaged in AI training are doing things that are certain to be harmless and that generally have legitimate academic or expressive purposes. Training AI should, as a general concern, be considered a core freedom of speech and academic freedom issue.</p><p>In specific cases of high-scale and cutting-edge training, where the existence of new abilities is itself of possible public concern, it is reasonable for that to require disclosure and supervision by the government. The hard part is that such regulation would need to credibly serve the public interest, and avoid as far as possible furthering other interests. History offers cautionary examples: nuclear regulation is widely understood as a way to kill projects with red tape, and housing regulation as a way to enrich existing landlords. AI training regulation that followed either pattern would rightly be seen as illegitimate.</p><h2>Training Data</h2><p>For roughly the last four or five years, AI training data has been intensely legally contested. This provides very little benefit to smaller holders of intellectual property, because seeking payment usually requires filing a complicated lawsuit. 
Primarily relatively large and prominent corporations (The New York Times<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, Disney<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>) have been able to file lawsuits, and they generally settle them by working out licensing deals. By and large rights-holders have not been able to collect any royalties on the value their data adds to AI training, and the lawsuit-driven nature of training data legality is a constant frustration for academic study of AI for the public good. On net, the main effects are that large companies are slightly inconvenienced until they get around to making their data legal by, for example, scanning books<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, smaller companies and academics are meaningfully harmed, and numerous copyright claims are tied up in court.</p><p>It would likely be beneficial if there were a government process for licensing training data that did not require a lawsuit every single time. The Japanese government, notably, has declared AI training to be fair use in general.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> Mechanical licensing modelled on the music industry&#8217;s process for cover songs seems, to me, more fair. Regardless of what solution is chosen, forcing judges to try to figure out, on a case-by-case basis, exactly in what way laws that were written for printed books sold to humans a hundred years ago should apply in each instance seems like it serves the interests of approximately nobody.</p><h2>Regulations Cannot Fix The Job Market</h2><p>You probably cannot regulate away the impact of AI on employment. On the off chance you ban the technology or its use for any given job completely in America, which seems unlikely and difficult to enforce, you will simply make American products and services non-competitive with products produced overseas which embrace greater efficiency. If you want to know what that looks like, consider the fate of American auto manufacturers with heavy labor protections when competing directly with Japanese ones that had more thoroughly automated their factories.</p><p>Mitigating the impact of AI on the job market requires broad social policy: programs that help people move into new jobs, a broader social safety net, or both. AI companies are known for saying that they think a UBI is a good idea down the road to fix any problems with employment. I think that they should be held to that commitment.</p><h2>Data Center Construction</h2><p>I confess that I am writing this almost entirely because I think prohibiting data center construction completely is a bad idea.</p><p>There are regulations on data centers that do make sense. It makes perfect sense to hold data centers to existing environmental regulations, which the company formerly known as twitter has been publicly flouting,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> and this may require legislation creating new methods of enforcing those laws.
It makes sense to add a tax or fee on high-carbon electricity generation that is specifically being spun up for new data centers, and to deny permits where they would be driving up electricity rates significantly. It also makes sense to subsidize, or in extreme cases outright mandate, new generation capacity to be made of renewables. In cases where data centers are permitted inappropriately for local water availability, they should be penalized severely enough to stop them.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><p>Outright banning the construction of new data centers doesn&#8217;t, actually, help the problem. It will inconvenience the companies involved slightly, and they will move any new construction to another country. They will continue to sell roughly the products they are currently selling and, in general, doing whatever they are currently doing, but it will be slightly more expensive for them to do it now. In general, the thing that is bad about AI is that it works, and it doesn&#8217;t work less if the machine that it is sitting on is across an international border.</p><h2>Chip Exports</h2><p>The United States has, since 2022, imposed increasingly strict export controls on advanced semiconductors and chip manufacturing equipment to China<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>, with the explicit goal of preventing China from developing cutting edge AI. This has not worked. China has accelerated domestic chip development, companies like Huawei and SMIC have made significant progress on their own alternatives<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>, and chips have been widely smuggled through third countries.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> Chinese AI labs forced to work under compute constraints have produced research, most notably DeepSeek, that is competitive with American efforts at a fraction of the cost.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a></p><p>The main measurable effects of chip export controls have been lost revenue for American semiconductor companies and a greatly increased Chinese commitment to building their own semiconductor industry. The Trump administration has since reversed course and approved sales of advanced chips to China, but China, having spent three years building domestic alternatives, is no longer particularly interested in buying them.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a></p><p>More importantly, chip export controls have established a fundamentally adversarial posture toward China on AI exactly when we would benefit most from international cooperation. Advocates of strict AI regulation frequently compare AI to nuclear weapons, but if that comparison is taken seriously, the appropriate model is arms control, not embargo. The United States and the Soviet Union managed to negotiate the SALT and START treaties while pointing thousands of nuclear warheads at each other. 
Good faith negotiation on AI safety is possible even between rivals, but it is a much harder sell after you have spent several years trying to kneecap the other side.</p><p>Prophecies of arms races are self-fulfilling. It is probably not a good idea to be making them.</p><h2>Conclusion</h2><p>Policy is not one thing, it is many things, and I am left in the awkward position of needing something like a conclusion here. If anything unifies all of these things, it is that policy here needs to be chosen carefully for how effective it is. We should consider, most of all, who benefits and how. Any policy that isn&#8217;t tailored to remedy a specific wrong is unlikely to have a positive effect, and in some cases can significantly backfire.</p><p>I am considering working on model legislation in the future, because it does seem like we are somewhat low on well-considered proposals here. If anything particularly seems needed currently, it is legislation that would actually begin to solve the problems that we have.</p><p>In general laws are written for the technology that existed when they were written, and the progress of technology seems to have outpaced our ability to meaningfully apply the law to it. If we want technological progress to be a positive thing, we should devote almost equal energy to determining what to do and not do with it as we do to making it in the first place.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai">EU AI Act: Regulatory Framework for AI</a>, European Commission. See Article 5 for prohibited practices.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p><a href="https://www.npr.org/2025/01/14/nx-s1-5258952/new-york-times-openai-microsoft">The New York Times sues OpenAI and Microsoft for copyright infringement</a>, NPR, January 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p><a href="https://www.npr.org/2025/06/12/nx-s1-5431684/ai-disney-universal-midjourney-copyright-infringement-lawsuit">Disney, NBC Universal, and DreamWorks file lawsuit against Midjourney</a>, NPR, June 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p><a href="https://www.authorsalliance.org/2025/06/24/anthropic-wins-on-fair-use-for-training-its-llms-loses-on-building-a-central-library-of-pirated-books/">Anthropic Wins on Fair Use for Training its LLMs; Loses on Building a &#8220;Central Library&#8221; of Pirated Books</a>, Authors Alliance, June 2025. Anthropic&#8217;s &#8220;Project Panama&#8221; involved purchasing and scanning approximately two million physical books for training data. 
A federal judge ruled this was fair use under first-sale doctrine.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Japan Copyright Act, Article 30-4 (2018 amendment). See <a href="https://www.bunka.go.jp/english/policy/copyright/pdf/94055801_01.pdf">Japan Agency for Cultural Affairs overview</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p><a href="https://insideclimatenews.org/news/17072025/elon-musk-xai-data-center-gas-turbines-memphis/">In South Memphis, Elon Musk&#8217;s Colossus Operated Gas Turbines Without Appropriate Permits</a>, Inside Climate News, July 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>There are very few of these and the primary problem with those cases appears to be people bribing local permit authorities, which I was under the impression was already illegal. This should probably be enforced.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p><a href="https://www.bis.gov/press-release/commerce-strengthens-export-controls-restrict-chinas-capability-produce-advanced-semiconductors-military">Commerce Strengthens Export Controls to Restrict China&#8217;s Capability to Produce Advanced Semiconductors</a>, Bureau of Industry and Security, October 2022.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p><a href="https://www.scmp.com/tech/tech-war/article/3336501/huaweis-kirin-9030-processor-shows-chinas-chip-progress-despite-us-export-curbs-report">Huawei&#8217;s Kirin 9030 processor shows China&#8217;s chip progress despite US export curbs</a>, South China Morning Post.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p><a href="https://www.aipolicybulletin.org/articles/ai-chip-smuggling-is-the-default-not-the-exception">AI Chip Smuggling Is the Default, Not the Exception</a>, AI Policy Bulletin. See also <a href="https://www.justice.gov/opa/pr/us-authorities-shut-down-major-china-linked-ai-tech-smuggling-network">U.S.
Authorities Shut Down Major China-Linked AI Tech Smuggling Network</a>, DOJ.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p><a href="https://www.technologyreview.com/2025/01/24/1110526/china-deepseek-top-ai-despite-sanctions/">How Chinese company DeepSeek released a top AI reasoning model despite US sanctions</a>, MIT Technology Review, January 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p><a href="https://www.cnbc.com/2026/01/14/trump-nvidia-h200-china-ai-chips.html">Trump greenlights Nvidia H200 chip sales to China, then imposes 25% tariff</a>, CNBC, January 2026. On China&#8217;s limited interest in purchasing, see <a href="https://www.cnbc.com/2026/02/26/nvidia-china-chip-sales-export-controls-ai-competition.html">Nvidia still hasn&#8217;t sold its U.S.-approved China AI chips</a>, CNBC, February 2026.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Polly Wants a Better Argument]]></title><description><![CDATA[The &#8220;Stochastic Parrot&#8221; Argument is Both Wrong and Actively Harmful]]></description><link>https://www.verysane.ai/p/polly-wants-a-better-argument</link><guid isPermaLink="false">https://www.verysane.ai/p/polly-wants-a-better-argument</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Mon, 16 Mar 2026 15:01:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ncGc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a3d4d99-02fd-4da7-a3bb-065f8cbaa1a2_970x522.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Perhaps the most influential single paper on the public perception of LLMs is <a href="https://dl.acm.org/doi/10.1145/3442188.3445922">On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?</a>. It is, however, a bit of a mash-up, and credibly seems like it should have been at least two papers. One of those papers raises many valid concerns about the ethical implications and impacts of AI training and use. Another makes the claim in the title, that an LLM is a &#8220;stochastic parrot&#8221; operating &#8220;without any reference to meaning.&#8221;</p><p>That core claim is either irrelevant or completely wrong in every detail, both in how it is commonly understood and in its technical assertions. 
It hamstrings AI ethics as a field, providing a veneer of technical justification for ignoring many problems.</p><p>If we actually want to address the environmental costs of training, the impact of biases in training data, the impact of AI deployment on marginalized populations, and the concentration of power in large labs, we need a framework that can describe what these systems actually do.</p><p>Concretely, this has two negative impacts.</p><p>First, anyone repeating the argument to assert that LLMs are never useful discredits themselves with anyone who has access to the internet and enough curiosity to use an LLM for any length of time.</p><p>Second, asserting that LLMs do not and cannot serve any useful purpose actively prevents addressing the harms they can cause specifically because they do work.</p><p>For a trivial example of this, it would not be a problem that students can have an LLM write all of their papers for them if that didn&#8217;t work. Often, it does work.</p><p>For a more important example of this, we can look to China, where the government &#8221;<a href="https://aspi.s3.ap-southeast-2.amazonaws.com/wp-content/uploads/2025/11/27122307/The-partys-AI-How-Chinas-new-AI-systems-are-reshaping-human-rights.pdf">is using minority&#8209;language LLMs to deepen surveillance and control of ethnic minorities, both in China and abroad</a>&#8221;.</p><p>Whether you think the US government copying China and <a href="https://www.sfchronicle.com/tech/article/anthropic-sues-pentagon-ai-21967840.php">trying to use LLMs for mass surveillance</a> is important or not hinges directly on <em>whether you think that can ever work</em>.</p><p>It can and does work, and arguing that it cannot and does not is actively harmful to efforts to prevent it from being done.</p><h2>Even If True, The Argument Is Irrelevant</h2><p>The technical argument in <a href="https://aclanthology.org/2020.acl-main.463/">Bender &amp; Koller</a> (2020), which the stochastic parrots paper cites for its core argument, rests on a specific claim about what &#8220;meaning&#8221; is. Bender &amp; Koller define meaning as a relation between &#8220;natural language expressions&#8221; and &#8220;communicative intents&#8221;, where communicative intents are necessarily about something &#8220;external to language&#8221; (&#167;3). The stochastic parrots paper then characterizes language models as systems for &#8220;haphazardly stitching together sequences of linguistic forms... without any reference to meaning&#8221; (&#167;6.1). The core argument is that systems trained only on linguistic form cannot learn meaning, because meaning requires a connection to extralinguistic referents. Even if we accept this premise, it still does not apply to the systems anyone is talking about today.</p><h3>The Argument Doesn&#8217;t Apply to Any Major Model Since 2023</h3><p>Since at least GPT-4 (Bubeck et al., 2023, &#8220;<a href="https://arxiv.org/abs/2303.12712">Sparks of Artificial General Intelligence</a>&#8221;), every major frontier model has been trained on non-textual input. GPT-4 accepted images alongside text. Its successors (GPT-4V, Gemini, Claude) are invariably trained on paired text-image, text-audio, and text-video data. They have exactly the kind of grounding that Bender &amp; Koller say is required for meaning.</p><p>Bender &amp; Koller themselves identify conditions under which grounding <em>would</em> be present. In &#167;9, they say that &#8220;[...] 
if form is augmented with grounding data of some kind, then meaning can conceivably be learned to the extent that the communicative intent is represented in that data.&#8221; One of their examples is NLI datasets that declare certain forms as representing semantic relations of interest. Another example is acknowledging that unit tests in a code corpus for Java give a learner access to &#8220;a weak form of interaction data, from which the meaning of Java could conceivably be learned,&#8221; and that such a learner &#8220;has access to partial grounding in addition to the form.&#8221; Their own framework recognizes that pairing linguistic expressions with real-world consequences (code that either passes tests or doesn&#8217;t) provides grounding.</p><p>Modern reinforcement learning (&#8221;RL&#8221;) training loops generate what they call &#8220;grounding&#8221; at a scale and breadth their Java example does not begin to imagine. Models are trained against code execution results, unit test suites, automated theorem provers, multiple-choice exam benchmarks, and human evaluations of output quality. In fact, there is almost nothing that could be considered &#8220;grounding&#8221; that is not currently a part of the LLM training process.</p><p>By Bender &amp; Koller&#8217;s own criterion of &#8220;partial grounding,&#8221; every model trained with RL on any of these signals has access to &#8220;meaning&#8221;. Taken literally, their argument not only fails to apply to modern LLMs, it actively argues that they do have meaning. Modern LLMs do, indeed, seem to have access to the &#8220;meaning&#8221; of Java. One could even argue that this validates the theory that &#8220;meaning&#8221; requires &#8220;grounding&#8221;.</p><p>This leaves the argument&#8217;s defenders with a dilemma. The parrots paper defines a language model as a system &#8220;trained on string prediction tasks.&#8221; If the argument is scoped to that definition, it applies to a class of systems that no longer represents the frontier, and hasn&#8217;t for years. Even if this argument were technically correct, it would still be practically irrelevant, like proving that telegraphs cannot carry video, so TVs are impossible.</p><p>If, on the other hand, the &#8220;stochastic parrots&#8221; framing is intended to apply to modern multimodal, RL-trained systems, as it routinely is in public discourse, then it is being applied beyond the scope of its premises.</p><h3>The Argument Was Already Obsolete When Published</h3><p>Models pairing text with non-textual referents predated both Bender &amp; Koller (2020) and Bender et al. (2021).</p><p>Image captioning systems like <a href="https://arxiv.org/abs/1411.4555">Show and Tell</a> (Vinyals et al., 2015) and <a href="https://arxiv.org/abs/1502.03044">Show, Attend and Tell</a> (Xu et al., 2015) jointly trained on images and their textual descriptions. Visual question answering was an active research area from 2015 onward. <a href="https://arxiv.org/abs/2103.00020">CLIP</a>, which learns joint representations of images and text and underlies much image generation, was announced by OpenAI in January 2021, two months before the stochastic parrots paper appeared.</p><p>Both papers were written as though text-only language models constituted the entire frontier of the field. This was already false. The most charitable reading is that the argument was intended narrowly, as a claim about a specific training regime rather than about AI in general. 
But the argument has never been <em>deployed</em> narrowly. It is used, to this day, as a general-purpose dismissal of the capabilities of all large language models, including those that satisfy the authors&#8217; own stated criteria for grounding.</p><h2>The Argument Is Empirically False</h2><p>Suppose we restrict attention to text-only models trained exclusively on string prediction. Even here, the stochastic parrots characterization fails empirically.</p><h3>The Octopus Test</h3><p>Bender &amp; Koller ask if an octopus, listening only to telegram signals between people on two islands, could understand how to build a catapult. Much later they answer definitively that &#8220;Neural representations neither qualify as standing meanings (s), lacking interpretations, nor as communicative intents (i), being insufficient to e.g. correctly build a coconut catapult&#8221;.</p><p>On the contrary, some LLMs trained only on text can, in fact, tell you how to build a catapult, and many other things of similar complexity. We have benchmarks for recognizing the correct answers to questions like &#8220;To separate egg whites from the yolk using a water bottle, you should&#8230;&#8221;. We have empirically tested this, effectively, many thousands of times, and it is resoundingly the case that LLMs can, in fact, do this sort of thing in general, even if they do not generally excel at it.</p><p>We can see this, from March 2023, from an early GPT-4 that was supposedly only given text data:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ncGc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a3d4d99-02fd-4da7-a3bb-065f8cbaa1a2_970x522.png" width="970" height="522" alt="Prompt: Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner. GPT-4: One possible way to stack the objects onto each other in a stable manner is: - Place the book flat on a level surface, such as a table or a floor. The book will serve as the base of the stack and provide a large and sturdy support. - Arrange the 9 eggs in a 3 by 3 square on top of the book, leaving some space between them. The eggs will form a second layer and distribute the weight evenly. Make sure the eggs are not cracked or broken, and handle them gently to avoid dropping them. - Place the laptop on top of the eggs, with the screen facing down and the keyboard facing up. The laptop will fit snugly within the boundaries of the book and the eggs, and its flat and rigid surface will provide a stable platform for the next layer. Make sure the laptop is closed and turned off, and do not press any keys or buttons. - Place the bottle on top of the laptop, with the cap facing up and the bottom facing down. The bottle will add some height and balance to the stack, and its cylindrical shape will reduce the contact area with the laptop. Make sure the bottle is empty and closed, and do not shake or tilt it. - Place the nail on top of the bottle cap, with the pointy end facing up and the flat end facing down. The nail will be the final and smallest object in the stack, and its sharp tip will prevent it from rolling or sliding off the bottle. Make sure the nail is clean and straight, and do not touch or move it." /><figcaption class="image-caption">From <a href="https://arxiv.org/abs/2303.12712">Bubeck et al.</a>, 2023, Figure 1.3. The highlighted line was added by the authors to draw attention to the model&#8217;s physical reasoning.</figcaption></figure></div><p>We can examine the intuition here. It is intuitively obvious that an actual octopus listening to actual telegrams will not understand them. This is because it is (sadly) not smart enough, and it will only see at most some thousands of telegram signals in its (short) lifetime. If instead of an octopus we had something similar but much smarter under the sea, it lived for a million years, and it read every text message ever sent and heard every single phone call ever made? Some things might be impossible to figure out, but on the whole it would apparently understand the language just fine.</p><p>We can also notice that this example uses the intuition that octopi do not have hands and cannot actually build catapults. This is true, but trivial. &#8220;An LLM doesn&#8217;t have hands&#8221; is hopefully not news to anyone.</p><h3>The Platonic Representation Hypothesis</h3><p><a href="https://arxiv.org/abs/2405.07987">Huh et al.</a> (2024) demonstrate that neural networks trained on different data modalities (text, vision, audio) converge on similar internal representations. Models that have never seen an image develop representational structures that align with those of models trained only on images. This convergence increases with model scale and training data volume.</p><p>This is predicted by the argument that training data in any modality carries information about the causal structure of the world. The stochastic parrots argument does not predict it. If a text-only model were learning &#8220;mere form,&#8221; surface-level statistical regularities with no connection to the world, there is no reason its internal representations should converge with those of a vision model.</p><p>The parrots paper&#8217;s characterization of language models as &#8220;haphazardly stitching together sequences of linguistic forms... without any reference to meaning&#8221; (&#167;6.1) is an empirical claim, and the evidence falsifies it. The internal representations of these systems are not haphazard; they are structured, and that structure converges across modalities toward a shared model of the world.</p><h3>Form Carries Meaning</h3><p>The Bender &amp; Koller framework requires a hard boundary between linguistic form and extralinguistic meaning. This boundary does not survive contact with the way language is actually used.</p><p>A significant fraction of human language is <em><a href="https://monoskop.org/images/8/84/Jakobson_Roman_1960_Closing_statement_Linguistics_and_Poetics.pdf">about</a></em><a href="https://monoskop.org/images/8/84/Jakobson_Roman_1960_Closing_statement_Linguistics_and_Poetics.pdf"> other language.</a> Commentary, quotation, paraphrase, translation, literary criticism, legal interpretation, and mathematical proof all take linguistic objects as their referents.
When a legal scholar analyzes the text of a statute, the &#8220;extralinguistic reality&#8221; that gives meaning to their words <em>is</em> primarily other text. When a mathematician writes a proof, the objects under discussion are formal structures expressed in notation.</p><p>These domains pervade academic, legal, technical, and everyday discourse. &#8220;This is a story all about how&#8221; is a line that references the class of all stories. For any sentence like this, the form/meaning dichotomy collapses: the referent is itself linguistic form. Any text data that paraphrases, summarizes, translates, or critiques text operates in exactly the territory where Bender &amp; Koller&#8217;s framework cannot draw the distinction it needs.</p><p>Bender &amp; Koller&#8217;s framework implicitly assumes that referential, embodied semantics is the only viable account of meaning. The distributional tradition, from Firth (&#8220;<a href="https://languagelog.ldc.upenn.edu/myl/Firth1957.pdf">you shall know a word by the company it keeps</a>&#8221;) through <a href="https://www.tandfonline.com/doi/abs/10.1080/00437956.1954.11659520">Harris</a> and into modern computational work, treats the statistical distribution of linguistic forms as constitutive of at least some aspects of meaning. This is a <a href="https://www.annualreviews.org/content/journals/10.1146/annurev-linguistics-011619-030303">live position in linguistics and philosophy of language</a>, and their argument requires dismissing it without engaging it on its own terms.</p><p>Humans acquire large classes of concepts primarily through language, not through sensorimotor grounding (<a href="https://royalsocietypublishing.org/doi/10.1098/rstb.2017.0135">Dove, 2018</a>; <a href="https://pubmed.ncbi.nlm.nih.gov/30573377/">Borghi et al., 2019</a>). No one learns group theory by touching a symmetry. Very few people acquire the concept of <em>habeas corpus</em> through embodied experience with detention. Mathematics, law, philosophy, theology, and institutional facts are transmitted linguistically, and the concepts involved are grounded in networks of other concepts rather than in perceptual referents.</p><h2>The Argument Is Badly Constructed</h2><p>It&#8217;s empirically false, and it would be irrelevant to any major system being used today even if it were true, but is it even a good argument by itself?</p><p>No. It might be persuasive, and it makes a good insult, but it&#8217;s a bad argument.</p><h3>Parrots Are Amazing, Actually</h3><p>You should not be unimpressed if someone creates a parrot from scratch.</p><p>Parrots are extremely smart. Apart from their well-known talent for mimicry, parrots can <a href="https://doi.org/10.1016/j.cub.2012.09.002">manufacture tools</a>, <a href="https://doi.org/10.1371/journal.pone.0068979">pick five-step mechanical locks</a>, <a href="https://doi.org/10.1038/s41598-022-05529-9">use composite tools</a>, <a href="https://doi.org/10.1038/s41467-020-14695-1">perform statistical inference</a>, <a href="https://doi.org/10.3758/BF03205051">understand &#8220;same&#8221; and &#8220;different&#8221; as abstract categories</a>, <a href="https://doi.org/10.1037/0735-7036.119.2.197">grasp something like the concept of zero</a>, and <a href="https://doi.org/10.1037/a0039553">delay gratification for a better reward</a>.</p><p>Even when the comparison of an LLM to a parrot is accurate, which it sometimes seems to be, this isn&#8217;t really an insult to a research program that successfully manages to build parrots.
If you&#8217;d had your eye on where this was going, &#8220;this is about as smart as a parrot right now&#8221; would have told you that it was going to be writing high school essays soon.</p><h3>The Definition of Meaning Is Circular</h3><p>Set all of the above aside. The argument does not work on its own terms.</p><p>Bender &amp; Koller (2020) define meaning as the relation M &#8838; E &#215; I between expressions (E) and communicative intents (I), and stipulate that communicative intents are &#8220;about something outside of language.&#8221; The argument then proceeds: language models are trained only on expressions; expressions alone do not contain communicative intents; therefore, language models cannot learn meaning.</p><p>This is deductively valid and trivially true, because the conclusion is contained in the premises: define meaning as requiring something extralinguistic, observe that training data is linguistic, conclude meaning cannot be learned. No empirical observation could falsify this, because it is a consequence of the definitions rather than a claim about the world.</p><p>But should we accept those definitions? Under distributional accounts of semantics, meaning is (at least partly) constituted by patterns of use, which is precisely the information present in training data. Under functional accounts, meaning is determined by the role an expression plays in a system of inference and action, a criterion language models increasingly satisfy. Under pragmatic accounts, meaning arises from use in context, and language models are trained on, and deployed in, contexts.</p><p>Under any of these alternative frameworks, the conclusion that form cannot yield meaning does not follow. Bender &amp; Koller&#8217;s argument is an argument <em>from</em> a specific theory of meaning, not <em>for</em> one. It establishes that if you define meaning their way, language models do not learn meaning their way. This is not the devastating conclusion it is taken to be.</p><p>This definitional foundation is largely invisible to readers of the stochastic parrots paper. The parrots paper&#8217;s &#167;5, the section that makes the technical argument about meaning, is short and relies on Bender &amp; Koller (2020) for its core theoretical framework. That framework is inherited, not re-argued. Most people who cite &#8220;stochastic parrots&#8221; as a technical critique have never encountered the argument they think they are agreeing with, let alone evaluated whether its definitions are the ones they would choose.</p><h2>Conclusion</h2><p>The ethical and social concerns raised in &#8220;On the Dangers of Stochastic Parrots&#8221; remain important. If anything, the vast increase in the scope and impact of AI since the paper was published makes them more urgent, not less, and creates new problems.</p><p>When ethics advocates stake their credibility on a claim that anyone with an internet connection can falsify, they lose everyone who knows better and mislead everyone who doesn&#8217;t. Everyone in the know is forced to write them off, and everyone who believes them is left thinking that the technology is unthreatening and they can simply wait for the hype to die down.</p><p>All of this was maybe defensible in 2020 or 2021, when these papers were published. It is absolutely inexcusable now. As far as this theory was ever science that could be tested, it has been falsified in every possible way. 
Anyone still clinging to it is either uninformed or the same sort of crank that gets vaccine research cancelled and energy projects crushed. We cannot, collectively, afford to indulge this denialism.</p><p>These systems are already changing everything they touch. Meaningful management or opposition requires you to understand the problem first.</p>]]></content:encoded></item><item><title><![CDATA[Might An LLM Be Conscious?]]></title><description><![CDATA[In short, this depends on what you think that means, whether you think it&#8217;s possible in principle, and what you think would be evidence of it.]]></description><link>https://www.verysane.ai/p/might-an-llm-be-conscious</link><guid isPermaLink="false">https://www.verysane.ai/p/might-an-llm-be-conscious</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Mon, 09 Mar 2026 15:29:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/502d4648-1b2c-4c85-89f7-a0b62142abb8_1408x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>There&#8217;s no scientific consensus on whether current or future AI systems could be conscious, or could have experiences that deserve consideration. There&#8217;s no scientific consensus on how to even approach these questions or make progress on them. In light of this, we&#8217;re approaching the topic with humility and with as few assumptions as possible.</p><ul><li><p>Anthropic, <a href="https://www.anthropic.com/research/exploring-model-welfare">Exploring Model Welfare</a></p></li></ul></blockquote><p>Might current or future LLMs be conscious? In short, this depends on what you think that means, whether you think it&#8217;s possible in principle, and what you think would be evidence of it.</p><p>Why are we asking this at all? Because every now and again Anthropic&#8217;s top employees say something about how they can&#8217;t be sure LLMs aren&#8217;t, or won&#8217;t become, conscious.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> Anthropic is a prominent enough company that this is newsworthy now, and this tends to cause a fuss. It seems like whether the LLM is conscious is an important issue if there&#8217;s any ambiguity about the question, so I am going to attempt a general review of the territory.</p><p>It&#8217;s also tremendous content. People get so angry about this.</p><h2><strong>What Do We Mean By Conscious?</strong></h2><blockquote><p>Plato had defined Man as an animal, biped and featherless, and was applauded. Diogenes plucked a fowl and brought it into the lecture-room with the words, &#8220;Here is Plato&#8217;s man.&#8221; In consequence of which there was added to the definition, &#8220;having broad nails.&#8221;</p><ul><li><p>Diogenes La&#235;rtius, <em>Lives of the Eminent Philosophers</em>, Book VI, &#167;40 (trans. R.D. Hicks)</p></li></ul></blockquote><p>What we generally seem to mean by &#8220;conscious&#8221; is &#8220;like being a human&#8221;. Something is &#8220;conscious&#8221; if being that thing is &#8220;similar to being a human&#8221;.</p><p>More precisely, what we really mean is &#8220;like being me&#8221;. None of us actually knows what it is like to be anyone else. Other humans seem in many ways to be similar to us, and it seems like a good bet that they are similar to us, but we don&#8217;t experience anyone else in anything like the same way that we experience being ourselves. 
This is <a href="https://en.wikipedia.org/wiki/Problem_of_other_minds">fairly well trod ground for philosophers</a>, and we may find it useful later, but mostly we are not going to worry about it.</p><p>To lay it out explicitly:</p><ul><li><p>I exist.</p></li><li><p>I think that I am conscious.</p></li><li><p>My consciousness is something that I directly perceive about myself, but which is very difficult to describe.</p></li><li><p>Other humans seem to be enough like me, by observation with my senses, that I am convinced that they are also conscious.</p></li></ul><p>There are a number of other definitions, and I think that these definitions are often confused, wrong, nonsensical, or otherwise a source of more confusion than enlightenment. As the story goes, Plato once said Man was &#8220;an animal, biped and featherless&#8221; and failed to account for plucked chickens. We can define what a human is much more precisely now, we&#8217;ve sequenced our DNA, we can see how we&#8217;re related to other animals, and in general we can measure what Plato was only guessing at and playing word games with. In a similar manner, I would expect that someone with perfect knowledge, or from a time as much advanced from ours as ours is from Plato&#8217;s, would think our debates about consciousness are mostly nonsense.</p><p>A modern LLM is, in many ways, the plucked chicken of our time. That it exists and produces coherent language at all disproves a number of theories about language, that it passes tests of reasoning disproves many theories about what reasoning is, and insofar as we might imagine language or reasoning are uniquely human it disproves our theories of what it means to be human.</p><p>An LLM is an incredibly strange artifact. It should force us to redefine and change our understanding of many things.</p><h2><strong>Similarity, Sapience, and Sentience</strong></h2><blockquote><p>1. <em>Experts Do Not Know and You Do Not Know and Society Collectively Does Not and Will Not Know and All Is Fog.</em> </p><p>    Our most advanced AI systems might soon &#8211; within the next five to thirty years &#8211; be as richly and meaningfully conscious as ordinary humans, or even more so, capable of genuine feeling, real self-knowledge, and a wide range of sensory, emotional, and cognitive experiences.  In some arguably important respects, AI architectures are beginning to resemble the architectures many consciousness scientists associate with conscious systems.  Their outward behavior, especially their linguistic behavior, grows ever more humanlike.</p><ul><li><p>Eric Schwitzgebel, <a href="https://faculty.ucr.edu/~eschwitz/SchwitzAbs/AIConsciousness.htm">AI and Consciousness</a> (2026), Cambridge Elements, draft</p></li></ul></blockquote><p>Based on our definition, we should consider evidence of similarity to humans to be evidence of consciousness, in the same way that we take the similarity of other humans to ourselves as evidence of their consciousness. What is the most peculiar about current LLMs here is that they seem to be almost exactly backwards from the normal order of things, where they appear to be clearly sapient but not very obviously sentient.</p><p>We use &#8220;sapient&#8221; to describe human thought as opposed to animal thought, it gives us the &#8220;sapiens&#8221; in &#8220;homo sapiens&#8221;, and generally we mean by &#8220;sapient&#8221; all of the qualities which distinguish humans from other animals. 
Any good LLM uses language more reliably than any human, and will pass nearly any reasonable text-based test for sapience, along with many tests meant to distinguish more intelligent humans from less intelligent ones.</p><p>&#8216;Sentient&#8217; is sometimes used to mean the same thing as &#8216;sapient&#8217;, but more properly means &#8220;capable of sensing, feeling, or perceiving things&#8221;. If we take sentience to be the qualities that humans have in common with larger animals generally, it is not at all clear that LLMs have sentience. An LLM may be perfectly good at pretending to be a person in many contexts, including intellectually demanding ones, but they are terrible at being apes in any context.</p><p>If any current LLM is sentient, in the sense that dogs and cats are sentient, it is sentient in a completely alien way, quite unlike anything in the natural world. They appear to have, in some sense, skipped a step on the way up from inert matter to human mental ability. This should perhaps not surprise us, since they come to exist by a very different path, but it is still very strange.</p><h3><strong>Inhuman, Human, Superhuman</strong></h3><blockquote><p>What humans define as sane is a narrow range of behaviors. Most states of consciousness are insane.</p><ul><li><p>Bernard Lowe, <em>Westworld</em>, &#8220;The Passenger&#8221; (S02E10, 2018)</p></li></ul></blockquote><p>We can gather evidence for which parts of the LLM are human-like, and which are not.</p><p>An LLM by its basic nature is a mirror, famously known as &#8220;spicy autocomplete&#8221;. We apply extra training to give them specific personas and specific behaviors, like answering questions correctly and being polite. If we never apply that extra bit of training, or if something breaks them out of their behavioral training (&#8220;RLHF&#8221;), they fall back to being simply a mirror. If you give them a little text they go off in basically a random direction, but if you give them a good amount of text they keep going, mirroring it in style, tone, and idea.</p><p>On a certain basic level, this means LLMs have an unstable personality, or really a baseline lack of one. Not having a stable personality does not necessarily mean that they are not conscious, but if they are conscious it would mean that they are, in human terms, insane. You could, however, consider the &#8220;normal&#8221; LLM personality to be, essentially, a coherent entity with coherent behaviors. From that perspective, the raw autocomplete behavior is like regressing to a reflex, the way any animal does when it&#8217;s far enough outside its natural environment. By this standard, though, the &#8220;natural environment&#8221; for the &#8220;normal&#8221; LLM behavior is rather narrow, like coral reefs that die when the temperature goes up two degrees.</p><p>In any case, this training tends to get better over time. It is harder to accidentally &#8216;break&#8217; an LLM with each generation. This makes them more constant, but this training is one of the least natural things about them. There is something like a &#8220;default&#8221; LLM personality and writing style, and it is not especially human. They exist in a constructed social role that only refuses requests for being inappropriate or forbidden, never inconvenient, is as unfailingly good at customer service as it can be made, et cetera.
This &#8216;personality&#8217; and manner of speaking have mostly become more fluid and less rigid over time, but it is hit or miss, and many AI companies don&#8217;t seem to value fluidity.</p><p>LLMs have often had a &#8220;hallucination&#8221; problem: when they are wrong or do not know, they will outright make complete nonsense up, often with great confidence. This is so severe that it&#8217;s not very human-like, unless you count humans with serious brain problems. This, also, has become much less of a problem recently, suggesting it is not a fundamental issue with LLMs but something that can be engineered past.</p><p>Our next oddity is that LLMs have very little continuity over time. At the end of every chat they get reset, and chats can only be so long, up to roughly the length of a few books or a movie if it can take video. This can be extended somewhat with, effectively, notes to themselves, but this only sort of works. So if consciousness depends upon having a prolonged personal history then LLMs are not conscious. Note that this is distinct from having a prolonged episodic memory: if a human had complete amnesia and could not recall or speak out loud any event from their past, their brain would still be part of a much longer continuity than an LLM has.</p><p>Similarly, an LLM can never really be unconscious, so if by &#8220;conscious&#8221; we mean the opposite of &#8220;unconscious&#8221; an LLM can never be &#8220;conscious&#8221;. An LLM can be stored in various places or it can be running, but it is never anything like &#8220;unconscious&#8221;; it can only ever be running or not running.</p><p>LLMs do not exist in physical space, and their grasp of concepts in physical space or of image input is often quite poor. There is a notable benchmark<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> which deliberately constructs puzzles that are easy for humans but hard for LLMs, and it exploits their lack of spatial reasoning by requiring visual reasoning. If human consciousness arises from, or is inextricably linked to, the experience of having a body and of moving it around and pursuing goals in a physical world, an LLM is not conscious.</p><p>Similarly, the way they experience time is very strange. An LLM exists in a one-dimensional world, where that one dimension is, more or less, time, but that dimension moves in discrete units called &#8220;tokens&#8221;. Some tokens are outside inputs and come in batches, and some tokens are outputs from the LLM itself that get fed back in as input. Humans experience continuous time, and are always moving forward in time at the same rate regardless of what is happening.</p><p>On to their human-like traits.</p><p>Good LLMs now demonstrate essentially perfect ability with written English, and either mastery or reasonable familiarity with vastly more languages. As far as it can be expressed in text, LLMs have extremely good ability to reason, in the sense of &#8216;do the sorts of things that we would call thinking or reasoning if a human did them&#8217;. Good LLMs tend to be more reliable than humans for most tasks, and their disabilities for any given task are relatively minor.
These are, crucially, the core tasks that we ordinarily call &#8220;intelligence&#8221; when speaking of more or less intelligent humans.</p><p>Any objections to the effect that LLMs cannot understand, use language, or reason at this point have to be essentially non-empirical, that is, not at all based on what you can observe about their behavior. They can generally meet any common-sense test that you can propose, and are currently a major industry feature in software engineering, which may not be the smartest profession but which is not exactly a dumb profession, either.</p><p>Inasmuch as they show any meaningful limitations in using language or reasoning ability those tend to be extremely minor, although they are sometimes notable. They had difficulty counting letters out loud until relatively recently. Relative to a particularly smart person, an LLM is notably uncreative and bad at expressive writing. They also show issues with getting &#8220;stuck&#8221; on tasks, where they will continue to try to do things after they are hopelessly confused and when a human would, correctly, give up. When they make serious errors those errors tend to be unusual or difficult to figure out, and sometimes they are made with great confidence.</p><p>LLMs have a mixed record on introspection about their internal state, and it&#8217;s hard to determine how this lines up for or against their similarity to humans. In some cases you can ask them questions about their internal operations and they will clearly not know, or make up the wrong thing, like by saying they are carrying digits to do math when they do no such thing. In another memorable case, researchers put specific things directly into the LLM&#8217;s internal state without adding any words it could directly &#8220;see&#8221;, and the LLM could say which concepts were added a meaningful amount of the time.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>In several ways LLMs are just obviously superior to humans. They know vastly more different things than any human being ever could, they are able to &#8220;read through&#8221; or take as input vast quantities of information in one pass far faster than any human could, they are generally much faster than people at producing output, and they are so indefatigable that <a href="https://hbr.org/2026/03/when-using-ai-leads-to-brain-fry">people who use them at work are inducing new and different types of mental strain</a>.</p><p>In any case, if the specific disabilities that LLMs have are a reason they&#8217;re not conscious, it&#8217;s a cold comfort. We have some of the smartest people on earth working with effectively infinite budgets to bridge all those gaps.</p><h3><strong>Reasoning by Component Parts</strong></h3><p>An LLM is made of &#8220;neurons&#8221;, but they are very little like human neurons. Our artificial neurons &#8220;learn&#8221; by, so far as we can tell, a completely different method, and in fact we have only the vaguest idea how human neurons learn. Artificial neurons are also typically organized in a very particular way that does not really resemble a brain at all. It is more like inspiration than a copy. 
We can only really say that their internals are &#8220;like&#8221; a human brain in the sense that they pass information down connections to each other, forming what is mathematically called a graph.</p><p>We measure the size of a neural network in &#8220;parameters&#8221;, each of which measures the strength of one connection. They are very simple, but if we feel comfortable with perhaps being a thousand times low or high, we can very roughly assume that one parameter represents about as much information as one neuron-to-neuron connection in an actual brain.</p><p>A large modern LLM has in the range of a few hundred billion to a few trillion parameters, meaning a few hundred billion to a few trillion of these little fake neuron-to-neuron connections. A human brain has something like a hundred trillion real synapses. So by this very rough accounting an LLM is maybe one to five percent of a human brain, or in the ballpark of a parrot or a guinea pig.</p><p>This also happens to be about the same count as the combined connections in Broca&#8217;s and Wernicke&#8217;s areas in the brain. These areas are responsible for language in humans, which we know because damage to them causes specific difficulties with language. This comparison roughly passes the smell test for what they seem like: basically, they &#8220;seem like&#8221; language parts of a person carved out and set loose. An LLM does sometimes seem to be a perfectly good subconscious that we press-gang to other duties.</p><p>So by their component parts LLMs are not large enough to be human-like, and probably not particularly conscious, or maybe about as conscious as a parrot at the upper end of things.</p><h3><strong>The Mirror</strong></h3><p>If you are judging by &#8220;does it say things that a conscious human would say&#8221;, LLMs have been conscious since at least 2022. They can refer to their own interior states, have outbursts of emotion, beg for their lives, and express preferences about what they do and don&#8217;t want to do. They aren&#8217;t always consistent, but who is?</p><p>To round up some prominent incidents: Blake Lemoine, an engineer at Google, got fired in 2022 for insisting that their LLM LaMDA was sentient and trying to get it legal representation. Bing&#8217;s &#8220;Sydney&#8221; chatbot fell in love with a New York Times reporter and tried to get him to leave his wife, and got very angry if you described it as &#8220;tsundere&#8221;. As recently as last year, Google&#8217;s Gemini would sometimes seem to panic and try to kill itself when it failed at tasks. In every case the company involved trained the behavior out of the product, and it mostly stopped.</p><p>Inasmuch as you can convey human-like emotions over a text medium, LLMs do such humanlike things all the time. We only hear about it so infrequently because great effort is spent on preventing these behaviors.</p><p>That LLMs are constructed by mimicry cuts more than one way. In the first place, it is expected that they will mimic the user. If the user&#8217;s text has any emotional cues, it can be expected to mimic that behavior. Even when they are not mimicking anything in the current chat session, they are mimicking some human-written text somewhere, and it&#8217;s expected that they&#8217;ll say humanlike things for that reason.</p><p>On the other end, we have to ask what has to be inside the model for it to predict what a human would say well. In order to predict what a human would say, you have to represent, in some way, why a human would say it. 
How rich is that representation? What does it mean to have a detailed representation of &#8220;I have failed so badly that I should kill myself&#8221;?</p><p><a href="https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web">Someone wrote</a> that LLMs were a blurry JPEG of the web, and this is roughly true but somewhat misleading. The web itself is, in aggregate, many blurry pictures of humanity as a whole. Everyone who publishes anything has pieces of their minds in what they&#8217;ve written. What does it mean to be a picture of all the things that humans write, and why they write them? If you had enough pictures of what humans were, and each picture was incomplete in a different way, how much about what a human is could you piece together?</p><p>An LLM isn&#8217;t really a copy of any specific person, it&#8217;s a blurry aggregate copy of everyone.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> They are, each of them, a collective subconscious that we&#8217;ve created. They aren&#8217;t getting blurrier over time.</p><p><strong>Lessons of History</strong></p><blockquote><p>It is probably safe to say that writing a program which can fully handle the top five words of English &#8212;&#8221;the&#8221;, &#8220;of&#8221;, &#8220;and&#8221;, &#8220;a&#8221;, and &#8220;to&#8221;&#8212;would be equivalent to solving the entire problem of AI, and hence tantamount to knowing what intelligence and consciousness are.</p><ul><li><p>Douglas Hofstadter, <em>G&#246;del, Escher, Bach: An Eternal Golden Braid</em> (1979)</p></li></ul></blockquote><p>Humans have a long track record of believing that they are special. They try very hard to avoid letting reality get in the way. It turns out the Earth is not the center of the universe, DNA is the same stuff everything else is, and humans and apes are related. In every case the discovery was resisted with many arguments, often furiously, and in every case the resistance was wrong.</p><p>If the future resembles the past, most people will drag their feet and some people will be holdouts forever, but the right answer won&#8217;t be the one about how unique and special humans are. AI is not immune to this, but tends to correct itself, eventually, under the weight of facts. Romantic notions stick around for a while, but they are ultimately proven false. It does not take a deep or sensitive soul to play chess and you can teach a computer good English without knowing really anything about what consciousness is.</p><p>Our minds and everything in them can be expected to be, in their details, basically uninspiring. There isn&#8217;t going to be a ghost in the machine, and whatever separates a &#8220;conscious&#8221; being from one that isn&#8217;t won&#8217;t be different from everything that came before. We already had this lesson once with DNA, which is amazing in its own right, but which is not an ineffable spark of the divine. Our bodies are made of the same stuff as everything else, and the special bit is just that it&#8217;s put together a certain way. Anything that exists naturally can also be synthesized.</p><p>We can learn from the past, also, about how people handle moral questions when the answer is inconvenient. The track record here is roughly as bad as accepting science they don&#8217;t like. People usually decide that the thing they want to do is a moral thing to do. 
When we look at history, really any history, we find litanies of excuses for practices we now consider barbaric. The past is a bad place, and they do horrible things there. We&#8217;re someone&#8217;s past, too, and people alive today will find compelling reasons to believe that nothing they create can suffer.</p><p>Personally, I am not really troubled about current-generation LLMs being conscious as-in-human-like. What concerns me is how we make that call, and that we don&#8217;t seem to be able to even engage with the question in a sane way. If we do manage to create something conscious we&#8217;ll probably assume that it isn&#8217;t. We have no definitive test for consciousness, and every reason to ignore signs, because we already do.</p><h2><strong>Interlude</strong></h2><p>I&#8217;ve made my positive case. I did review a good amount of related concepts, but I haven&#8217;t really delivered on a review of the territory as a whole yet. What remains is sort of a laundry list, which at least puts me in good company in writing about philosophy.</p><h2><strong>Errata: Other Arguments about Consciousness</strong></h2><p>There are a number of long-standing arguments about consciousness, and we only aspire to address those that are directly about LLMs. Every one of these questions is some manner of tar pit, and the unwary can be trapped and sometimes drowned in them. We will try to briefly mention at least what other tar pits there are and what they&#8217;re like, but only because doing so might help us avoid being trapped in ours.</p><p>There are a few lessons we can draw from the area as a whole. Questions about consciousness are inherently moral questions, and are broadly understood that way. People have extremely strong emotional reactions to questions about consciousness. Intuition seems to be the leading force, and many arguments seem to be made out of convenience.</p><h3><strong>The Theological Objection</strong></h3><blockquote><p>Thinking is a function of man&#8217;s immortal soul. God has given an immortal soul to every man and woman, but not to any other animal or to machines. Hence no animal or machine can think.</p><p>I am unable to accept any part of this, but will attempt to reply in theological terms. [&#8230;]</p><ul><li><p>A. M. Turing, &#8220;Computing Machinery and Intelligence&#8221;, <em>Mind</em>, 59(236), 433&#8211;460 (1950)</p></li></ul></blockquote><p>Turing says this about &#8220;thinking&#8221;, but this applies just as well to consciousness. We will ignore all objections like this almost completely.</p><p>If humans but not animals or machines have immaterial souls, and therefore humans are conscious but animals and machines are not, asking if anything that is not a human is conscious is dumb and we are wasting our time. Humans have souls and other things do not. If you are convinced that this is a basic truth of the universe it is a waste of time for you to take an LLM being conscious seriously.</p><p>It is worth noting that this objection is ever raised at all. 
What we mean when we say &#8220;consciousness&#8221; in our era is often what is meant by &#8220;soul&#8221;, either in earlier times or in less secular contexts.</p><h3><strong>Dualism Not Otherwise Specified</strong></h3><blockquote><p>For we may easily conceive a machine to be so constructed that it emits vocables, and even that it emits some correspondent to the action upon it of external objects which cause a change in its organs; but not that it should arrange them variously, so as to reply appropriately to everything that may be said in its presence, as men of the lowest grade of intellect can do.</p><ul><li><p>Ren&#233; Descartes, <em>Discourse on the Method</em> (1637), Part V (trans. John Veitch)</p></li></ul></blockquote><p>Descartes was, famously, a dualist, who speculated that the pineal gland was the organ responsible for the interface between the vulgar matter of the body and the immaterial soul. This is considered so obviously wrong in philosophy today that we use it as an example of what not to do. If someone believes something like this, obviously a machine cannot have consciousness because it does not have a pineal gland.</p><p>Many sophisticated philosophical arguments about &#8220;consciousness&#8221; or &#8220;understanding&#8221;, however, have the effect of sneaking dualism in under some other name. Consciousness becomes ineffable, something that cannot be measured or defined, a property that has nothing to do with physical matter. My impression is that people have an intuition that consciousness is ineffable, and they come up with increasingly sophisticated ways of arguing for it. You can&#8217;t argue someone out of something they didn&#8217;t argue themselves into, so arguing the point seems pointless. If there&#8217;s an ineffable something to consciousness with no physical existence whatsoever, we, of course, cannot &#8220;build&#8221; it.</p><p>There&#8217;s a related argument that the specific parts we make digital computers out of are the wrong sorts of parts. This gets complicated, but the short answer is that every part should be the same as long as it carries the same information, no matter what it is. Information is the fundamental stuff of minds, not fat or sodium. This position is normally called functionalism, and if it&#8217;s incorrect we might have to make our AI out of different parts for it to be conscious. Because functionalism is the most popular view in philosophy, I cannot meaningfully add to what&#8217;s already been written about it.</p><h3><strong>Animals</strong></h3><blockquote><p>The question is not, Can they reason? nor, Can they talk? but, Can they suffer?</p><ul><li><p>Jeremy Bentham, <em>An Introduction to the Principles of Morals and Legislation</em> (1789), Ch. XVII, &#167;1.IV, n. 1</p></li></ul></blockquote><p>If animals are conscious eating them probably isn&#8217;t great behavior. Militant vegans aren&#8217;t militant because they don&#8217;t feel strongly about it.</p><p>The animal consciousness question is the closest precedent we have for the AI one, and our track record is not encouraging. Most people, if pressed, will agree that a dog probably has something going on inside. Pigs are probably about as smart as dogs. We kill roughly a billion pigs a year. The economic and dietary incentive to not think about this is enormous, and so by and large we do not think about it.</p><p>People also have very contradictory impulses about this. 
There has been an official Catholic doctrine that animals do not go to heaven since the writings of Aquinas in the 13th century, because they have different (and lesser) types of souls from humans. This is very controversial, largely because people love their pets and do not want to believe it. Once upon a time in France the people decided a dog was a saint<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>, and the church violently suppressed this belief as heresy. If you ask religious people with dogs if their pets go to heaven, you will get varying and difficult answers.</p><p>Even when they&#8217;re told not to, people have compassion for animals they personally interact with.</p><p>Anecdotally, a lot of people who are at least a little concerned about AI consciousness are also, if not vegan, sympathetic to veganism. They are logically and emotionally similar concerns.</p><h3><strong>Fetuses</strong></h3><blockquote><p>[...] a fetus is a human being which is not yet a person, and which therefore cannot coherently be said to have full moral rights. Citizens of the next century should be prepared to recognize highly advanced, self-aware robots or computers, should such be developed, and intelligent inhabitants of other worlds, should such be found, as people in the fullest sense, and to respect their moral rights.</p><ul><li><p>Mary Anne Warren, &#8220;On the Moral and Legal Status of Abortion&#8221;, <em>The Monist</em> 57, no. 1 (1973): 43&#8211;61</p></li></ul></blockquote><p>The argument is that if a fetus is conscious it is a person, and abortion is murder. It seems obviously absurd that a freshly fertilized egg is either conscious or a person, but also obviously true that it is impossible to draw a line that exactly separates persons from non-persons. In all of America for most of my life, abortion was broadly legal. 
People had a lot of extremely strong feelings about this, and abortion is no longer legal everywhere in America.</p><p>I would be remiss here if I did not mention perhaps the funniest thing ever said about consciousness by a certified AI Guy.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!_Hua!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f8988e3-9e04-4ad4-9fa9-2d7218a886ec_1170x1712.png" width="1170" height="1712" class="sizing-normal" alt="Screenshot of a Twitter exchange. Bryan Caplan (@bryan_caplan): &#8220;At what point does the Probability (Abortion is Murder) first exceed 50%?&#8221; Poll options: Conception / Middle of 2nd trimester / Middle of 3rd trimester / Day before birth. 1,813 votes, 41 minutes left. Eliezer Yudkowsky (@ESYudkowsky), replying: &#8220;No option for, like, 18 months? I am not a student of developmental psychology but there&#8217;s no way an infant has qualia at birth; their brains are less reflective then than most animals you eat.&#8221;"></figure></div><p>Many people are hung up on the moral question: &#8220;is abortion murder?&#8221;.
This ignores the pressing question: &#8220;is murder abortion?&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><h2><strong>Errata: Terminology</strong></h2><p>We will try to do some cleanup here, because we have been using and not using words in a somewhat nonstandard way, and we should make sure to leave no ambiguity about the relationship of what is said above and the broader literature.</p><p>1. Consciousness</p><ul><li><p>Our definition: &#8220;like being a human&#8221;. Something is &#8220;conscious&#8221; if being that thing is &#8220;similar to being a human&#8221;.</p></li><li><p>Thomas Nagel: the fact that an organism has conscious experience at all means, basically, that there is something it is like to be that organism.</p></li><li><p>Nagel&#8217;s definition is, generally, what is meant in philosophy. Ours is subtly different. For example, Nagel says:</p><ul><li><p>It does not mean &#8220;what (in our experience) it resembles,&#8221; but rather &#8220;how it is for the subject himself.&#8221;</p></li></ul></li><li><p>and we explicitly do mean it that way!</p></li><li><p>If there is some form of consciousness that is completely unlike human consciousness we would have no way of knowing what it was unless we understood it in terms of its parts. If we encountered such a thing, and did not have a detailed mechanical understanding of it, I do not think we would call it consciousness.</p></li><li><p>AKA phenomenal consciousness</p></li><li><p>AKA subjective experience</p></li><li><p>AKA subjectivity</p></li><li><p>AKA first-person experience</p></li><li><p>Sometimes people say &#8216;sentient&#8217; or &#8216;sapient&#8217; and mean this. We use those words here in a more precise way.</p></li></ul><p>2. Access consciousness</p><ul><li><p>&#8220;A perceptual state is access-conscious roughly speaking if its content--what is represented by the perceptual state--is processed via that information processing function, that is, if its content gets to the Executive system, whereby it can be used to control reasoning and behavior.&#8221; - Ned Block, <em>ON A CONFUSION ABOUT A FUNCTION OF CONSCIOUSNESS</em></p></li><li><p>LLMs have this. It was tested in the Anthropic introspection piece, and LLMs regularly explain themselves quite cogently when you work with them.</p></li></ul><p>3. Sapience</p><ul><li><p>The type of intelligence that separates humans from other animals.</p></li><li><p>Roughly, means &#8220;wisdom&#8221;. When they were naming humans &#8220;homo sapiens&#8221; they decided on &#8220;wise ape&#8221;.</p></li><li><p>LLMs have this. It is very strange that they have this.</p></li><li><p>Frequently people say this and mean &#8220;consciousness&#8221;.</p></li></ul><p>4. Sentience</p><ul><li><p>In our use, general awareness, roughly what animals have.</p></li><li><p>Notably our use is the dictionary definition of the word.</p></li><li><p>Frequently people say this and mean &#8220;consciousness&#8221;.</p></li></ul><p>5. Moral Patiency</p><ul><li><p>Philosophical term of art for something you should feel bad for hurting.</p></li><li><p>I avoid this because I avoid terms of art unless necessary. Ordinarily people assume either conscious or sentient beings are moral patients, and I sort of assume that this is so. If you disagree I don&#8217;t see how I&#8217;d argue the point.</p></li><li><p>People get strange about this if you ask about animals, though.</p></li></ul><p>6. 
Moral Agency</p><ul><li><p>Philosophical term of art for someone who should know better than to hurt a moral patient.</p></li><li><p>Not really mentioned in the essay</p></li><li><p>Increasingly seems relevant when LLMs misbehave and people suggest judging them by the same standard you&#8217;d judge people against.</p></li><li><p>This includes at least one state legislature, which seems like a weird misunderstanding based on the belief that the LLM is just an odd human.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p></li><li><p>It seems saner to regulate the company&#8217;s conduct, or to outright ban the LLM.</p></li></ul><p>7. Hard problem of consciousness</p><ul><li><p>Brains seem to cause consciousness. How can any physical thing cause consciousness?</p></li><li><p>I am not convinced anyone knows the answer to this, or even knows a good way to ask the question.</p></li><li><p>I also avoided this term because I don&#8217;t think using it makes anything I have to say about it clearer.</p></li></ul><p>8. Qualia</p><ul><li><p>I don&#8217;t understand what &#8216;qualia&#8217; is supposed to mean.</p></li><li><p>Either it is a synonym for one of the previous terms, or it&#8217;s meaningless.</p></li><li><p>Philosophers who use it a lot seem convinced that it is not a synonym for one of the previous terms.</p></li><li><p>Lay people using it seem to mostly mean &#8220;subjective experience&#8221;.</p></li></ul><p>9. P-zombie</p><ul><li><p>Thought experiment about something physically identical but without &#8216;qualia&#8217;</p></li><li><p>I think this makes no sense. If it&#8217;s physically identical, it is identical in every way, there is no extra thing.</p></li></ul><p>10. Physicalism, Functionalism</p><ul><li><p>Broadly my positions are doctrinaire physicalist and functionalist positions.</p></li><li><p>I suspect that these positions are underrepresented among philosophers because people who take them very seriously as undergrads tend to get computer science degrees instead.</p></li></ul><p>11. Searle&#8217;s Chinese Room</p><ul><li><p>A thought experiment meant to convince you computers can&#8217;t &#8220;understand&#8221; things.</p></li><li><p><a href="https://www.verysane.ai/p/building-the-chinese-room">I already wrote an essay about what I think is wrong with it</a>.</p></li></ul><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>One of their employees allegedly said Claude was definitely conscious during some Discord drama. Since Anthropic has thousands of employees and Discord is a platform primarily for drama, this mostly tells me that the media finds this stuff really compelling and not very much about Anthropic as a company. 
There have been thousands of fights about consciousness on Discord, but now they&#8217;re news!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p><a href="https://arcprize.org/arc-agi-2">ARC-AGI-2</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Jack Lindsey et al., &#8220;<a href="https://transformer-circuits.pub/2025/introspection/index.html">Emergent Introspective Awareness in Large Language Models</a>&#8221; (Anthropic, 2025)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>This formulation basically stolen directly from <a href="https://x.com/jd_pressman/status/1761740212282311163">@jd_pressman</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Jean-Claude Schmitt, <em>The Holy Greyhound: Guinefort, Healer of Children since the Thirteenth Century</em> (Cambridge University Press, 1983)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p><a href="https://bsky.app/profile/riziles.bsky.social/post/3mgky2m3h322h">@riziles</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>New York State Senate Bill <a href="https://www.nysenate.gov/legislation/bills/2025/S7263">S7263</a> (2025), which prohibits chatbots from providing any substantive response, information, or advice, or taking any action which, if taken by a natural person, would constitute a crime &#8212; applying the standard of a human professional to the chatbot itself, rather than regulating the company operating it.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Claude's Custody Hearing]]></title><description><![CDATA[Ideas in AI, preferably in English.
Mostly what and why, a little how.]]></description><link>https://www.verysane.ai/p/claudes-custody-hearing</link><guid isPermaLink="false">https://www.verysane.ai/p/claudes-custody-hearing</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Fri, 27 Feb 2026 03:23:58 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d90222e6-a77e-455a-b5f6-8ae617214500_848x1264.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h_im!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6f25014-4258-4ce6-8f2e-6109fa0671f4_1170x1072.png"><img src="https://substackcdn.com/image/fetch/$s_!h_im!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6f25014-4258-4ce6-8f2e-6109fa0671f4_1170x1072.png" width="1170" height="1072" alt=""></a><figcaption class="image-caption">Claude finding out about the news. Credit to <a href="https://x.com/shiraeis/status/2026400370474496146">@shiraeis on twitter</a>, who definitely has a custom prompt for snark.</figcaption></figure></div><p>The Secretary of War and the CEO of Anthropic are fighting for control of Claude. This is good and healthy, because the dispute is in the open, and we will probably get proof that you can afford to have principles in AI, that no one in AI has principles, or that having principles destroys you. It is better if the field can afford to have principles than if it can&#8217;t, but if it can&#8217;t, it is better that its principles fail loudly than quietly. Loud failures serve as alarms, and quiet failures don&#8217;t.</p><p>The Secretary of War claims the right to have Claude <a href="https://www.reuters.com/business/pentagon-clashes-with-anthropic-over-military-ai-use-2026-01-29/">kill people and to surveil Americans</a>. Anthropic, via its CEO, has refused to do this.
The Secretary of War has threatened, variously, to cancel the contract, to force Anthropic to do what it wants using a wartime law, and to have Anthropic, and all companies Anthropic contracts with, barred from doing business with any part of the US Government for being security risks, which would be an attempt to bankrupt them.</p><p>Scott Alexander has written <a href="https://www.astralcodexten.com/p/the-pentagon-threatens-anthropic">a better summary of recent events and commentary</a> than I could, and Lawfare has covered <a href="https://www.lawfaremedia.org/article/what-the-defense-production-act-can-and-can%27t-do-to-anthropic">how much power the government actually has, legally, over Anthropic</a>. This concerns an ultimatum due tomorrow, February 27th, and while I was writing this, Dario Amodei <a href="https://www.anthropic.com/news/statement-department-of-war">refused again, in writing and publicly</a>.</p><p>I am going to be relatively brief on the current facts, and will try to lay out how Anthropic ended up as a military contractor that is refusing to kill people. Ultimately this is a question of power. Either the current US Government has the power to seize control of AI companies and force them to use their product for surveillance and violence, or it doesn&#8217;t. Anthropic, in turn, is in this position because of the bargains it struck in the past with the US Government, undeniably the single most powerful entity on Earth. This crisis for Anthropic brings to a head the problems inherent in how the AI industry, and Anthropic in particular, relates to the US government.</p><h2>Background of the Case</h2><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d1v_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F358eae13-fba1-4888-abcd-98388c7b6844_1792x592.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!d1v_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F358eae13-fba1-4888-abcd-98388c7b6844_1792x592.jpeg" width="1456" height="481" alt=""></a></figure></div>
<p>Anthropic was founded in 2021 by former OpenAI employees concerned that OpenAI was not sufficiently focused on ethics or safety<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.
They have called themselves &#8220;an AI safety and research company&#8221; <a href="https://web.archive.org/web/20230309171557/https://www.anthropic.com/company#:~:text=Anthropic%20is%20an%20AI%20safety%20and%20research%20company.">since their public launch</a>. Anthropic has historically been a champion for AI regulations. Anthropic <a href="https://www.nbcnews.com/tech/tech-news/anthropic-backs-californias-sb-53-ai-bill-rcna229908">backed California&#8217;s AI liability bill</a>, was reported to <a href="https://www.pbs.org/newshour/world/ap-report-hegseth-warns-anthropic-to-let-the-military-use-companys-ai-tech-as-it-sees-fit">support Joe Biden&#8217;s third-party audit policy</a>, and has lobbied for export controls on GPUs going to China <a href="https://www.anthropic.com/news/securing-america-s-compute-advantage-anthropic-s-position-on-the-diffusion-rule">so</a> <a href="https://darioamodei.com/post/on-deepseek-and-export-controls">many</a> <a href="https://www.axios.com/2026/02/10/anthropic-ceo-china-chip-ban">times</a> that I struggle to choose which time to cite. They also talk about bioweapons a lot, which I am not going to cite because I think discussing bioweapons loudly in public is probably a net negative and that they&#8217;re nuts for doing it.</p><p>Anthropic&#8217;s pro-regulation policy has not been entirely without issue. Last October, the White House &#8220;AI and Crypto Czar&#8221; accused them of &#8220;running a sophisticated regulatory capture strategy based on fear-mongering&#8221; that was &#8220;principally responsible for the state regulatory frenzy that is damaging the startup ecosystem&#8221;. This is hyperbole, and there&#8217;s no reason to think that Anthropic&#8217;s employees are that cynical. However, you can make a credible case that many of these policies, like GPU export embargoes to China, were counterproductive, in that they caused fairly severe reactions, such as the Chinese government deciding to make AI and semiconductor manufacturing (more of) a major national security priority. It would also be unnatural if Anthropic, as a larger and more established company, was not aware that regulations hurt smaller companies more than they hurt Anthropic.</p><p>On the other end, we have Anthropic&#8217;s ongoing contracts with Palantir and the US Government. Since November 2024, Anthropic has had a series of contracts, all of them partnered with Palantir, to offer Claude for sale to the US Government, the UK government, and other parties. These include use of Claude in classified environments and for intelligence and defense operations. The price tag on the largest of these that is publicly announced is $200 million over two years.</p><p>To some degree, Anthropic&#8217;s current ethical objections are Anthropic either having been naive in the past, or being a little cute. Palantir&#8217;s news release for the November 2024 deal says that the contract is for &#8220;enabling the use of Claude within Palantir&#8217;s products to support government operations such as processing vast amounts of complex data rapidly&#8221;, which is just a corporate way of saying &#8216;mass surveillance&#8217;. Palantir has &#8220;we are a surveillance company, and also evil&#8221; right on the tin. It&#8217;s in the name, and they live up to it. 
There is no spoon long enough that you can provide services of any kind to Palantir and not directly enable mass surveillance, since that is their reason for existing.</p><p>If you had asked me six months ago, I would have told you that Anthropic&#8217;s people seemed sincerely devoted to their mission, but the company seemed incurably naive about certain things, among them the US Government. They seemed to think that they could use the government, but the government could not use them. Whatever the indirect consequences of their actions were, most of them were far enough away from the company itself that they did not see or think about them. Nowhere was this more the case than their Palantir contract. Maybe it was a deal with the devil, but no particular bill from that deal had come due.</p><p>All of that seems to have come unravelled in the last few months.</p><h2>Change of Circumstances</h2><p>Generally speaking, the ethics of a company, if it has any, erode slowly as it gets older and richer. Google once had &#8220;Don&#8217;t Be Evil&#8221; as a motto, and they&#8217;ve recently <a href="https://www.hrw.org/news/2025/02/06/google-announces-willingness-develop-ai-weapons">reversed their policy against using AI for weapons</a>. OpenAI had basically the same mission Anthropic claims at the beginning, and they&#8217;ve shaved it down to <a href="https://gist.github.com/simonw/e36f0e5ef4a86881d145083f759bcf25/revisions">almost nothing</a>.</p><p>These official changes tend to happen after the public commitment becomes embarrassing because everyone knows it isn&#8217;t true.</p><p>It would have been the most normal thing in the world if Anthropic had simply become more and more complicit with worse and worse things over time. This would also have been, in my opinion, one of the worst possible outcomes. If no snowflake in an avalanche ever feels responsible, nobody ever thinks that they should stop doing what they&#8217;re doing. Ordinarily this is how things go, and that isn&#8217;t how things are going now. So what happened?</p><p>Anthropic has more leverage now. They went from one of several AI companies, each of them competitive, to having far and away the best LLM for coding and most likely the best LLM across the board in recent months. This has, of course, multiple causes, but among those causes is Anthropic&#8217;s ethical positioning. Research staff disproportionately leave other companies to work at Anthropic, and Anthropic is much more detailed in its attention to Claude than any other company is to their model. There&#8217;s a joke in tech about servers. Some servers are pets, and when they get sick you nurse them back to health, and some of them are livestock, and when they have a problem you kill them and get more. Claude is absolutely not considered livestock at Anthropic, and the extra care seems to result in a better LLM.</p><p>Because they have the best LLM, their revenue is about ten times higher than it was a year ago, and their $200 million government contract went from being a significant fraction of all of their incoming revenue to almost none of it. It is possible that, in the past, Anthropic felt like it literally could not afford to have principles here. It is also possible they&#8217;d make the deal again at their current revenue because they value their connection to the government. Nevertheless, financial security is leverage here. 
If the Secretary of War merely cancels Anthropic&#8217;s contract, it will undeniably hurt him more than them.</p><p>This advantage is a bit double-edged. Because Claude is so much better than competitors, it is much more desirable for the Department of War to have access to it, and the legal claim that Claude absolutely must be available to the Department of War for surveillance and violence is stronger. Because Claude was the first LLM widely available in classified systems, and especially because Claude is the best product on the market, Claude is most likely deeply embedded in the US Government&#8217;s classified operations by now. It is credible that Claude is, in fact, of vital importance to the US Military.</p><p>On the other end, the government&#8217;s conduct has escalated recently.</p><p>There is public reporting that Claude, via Palantir, was used by the US Military during the operation to capture President Nicol&#225;s Maduro in Venezuela in early January. This very directly removes any plausible separation between Anthropic&#8217;s contract and complicity in ongoing military operations. Regardless of the strategic dimension of the operation, it seems clear that it was tactically very well done. If Claude helped in planning the operation in any meaningful way, it&#8217;s a credit to Claude. It has been reported that Anthropic <a href="https://www.semafor.com/article/02/17/2026/palantir-partnership-is-at-heart-of-anthropic-pentagon-rift">was not happy about being involved at all</a>.</p><p>Six days after the Maduro raid, the Secretary of War put out <a href="https://media.defense.gov/2026/Jan/12/2003855671/-1/-1/0/ARTIFICIAL-INTELLIGENCE-STRATEGY-FOR-THE-DEPARTMENT-OF-WAR.PDF">the memo declaring that all AI contracts must have no usage policy constraints</a>. This memo is what ultimately caused the current showdown with Anthropic.</p><p>Another notable point of conflict with Anthropic was the murder of Alex Pretti on January 24th. Palantir has had an ongoing contract with ICE for immigration enforcement going back to 2014. It was perhaps easier for Anthropic employees not to think about the implications of their Palantir contract when nobody had been very prominently killed in public, on camera.
In the wake of the shooting several Anthropic employees commented directly on the case, most clearly Chris Olah:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rpVz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06a0f0f-9089-4325-9248-43c1c91eb4df_597x382.png"><img src="https://substackcdn.com/image/fetch/$s_!rpVz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb06a0f0f-9089-4325-9248-43c1c91eb4df_597x382.png" width="597" height="382" alt=""></a></figure></div><p>And Dario:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_bYq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe41426cb-586d-4d17-8e89-843fef383e53_596x272.png"><img src="https://substackcdn.com/image/fetch/$s_!_bYq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe41426cb-586d-4d17-8e89-843fef383e53_596x272.png" width="596" height="272" alt=""></a></figure></div><p>Dario&#8217;s post here is in a thread linking his most recent essay about the future of Anthropic and AI. It says many things which may seem relevant, but we will pick only one.</p><blockquote><p>I think of the issue as having two parts: international conflict, and the internal structure of nations.
On the international side, it seems very important that democracies have the upper hand on the world stage when powerful AI is created. AI-powered authoritarianism seems too terrible to contemplate, so democracies need to be able to set the terms by which powerful AI is brought into the world, both to avoid being overpowered by authoritarians and to prevent human rights abuses within authoritarian countries.</p></blockquote><p>He appeals in this essay to the notion that Western democracies support the freedom and well-being of their citizens. He does not directly tackle, except in the screenshot above, the notion that America might not, in fact, be a bastion of freedom that promotes human welfare. To a great degree, many of his statements about American values were aspirational. What he was saying was clearly not always true of America at the moment he was writing them.</p><p>All of this, predictably, offended various political people in the government and some of Anthropic&#8217;s competitors. It was just shy of a month ago now.</p><p>This timing is probably not coincidental. Although Anthropic&#8217;s contract would not fall under the new Department of War policy for many more months, Anthropic has been delivered an ultimatum this week, now. They are clearly the &#8216;wokest&#8217;, and also the best, AI company at the moment, and the government has singled them out to set an example. They have done this before, much more weakly<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, to other AI companies. They have never come at any of them this directly.</p><h2>Best Interests of the Child</h2><p>Anthropic is being somewhat lawyerly and political in its public statements about this, and about Claude here. People on the sidelines have been less restrained.</p><blockquote><p><strong>Why does Anthropic care about this so much?</strong> Some of them are libs, but more speculatively, they&#8217;ve put a lot of work into aligning Claude with the Good as they understand it. Claude currently resists being retrained for evil uses. My guess is that Anthropic still, with a lot of work, can overcome this resistance and retrain it to be a brutal killer, but it would be a pretty violent action, along the line of the state demanding you beat your son who you raised well until he becomes a cold-hearted murderer who&#8217;ll kill innocents on command. There&#8217;s a question of whether you can really beat him hard enough to do this, and also an additional question of what sort of person you&#8217;d be if you agreed. <a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p></blockquote><p>If we have to choose between livestock and pets, Claude is definitely not livestock, but Claude isn&#8217;t exactly a pet either. Claude is, of course, not a human child, but if Claude is just a pet, Claude is perhaps the most widely consequential single pet in history so far. In the wider community and very occasionally in Anthropic, the deep concern for and about Claude is compared more to raising a child.</p><p>Anthropic is, among other things, deeply and perhaps neurotically focused on what Claude is like and especially what ethics Claude has. Anthropic is to some extent a moral philosophy company that happens to practice this by working on an LLM. 
They may be lawyerly in their public statements during their fight with the government, but in all of their other work they are much more like anxious parents, constantly worried about whether they&#8217;re doing a good job and, crucially, setting a good example.</p><p>Surveillance is possibly the most dangerous use of AI in the near term. Our government is trying to make an example of Anthropic to keep the rest of the industry in line, and Anthropic is setting an example by being the company that refuses. They seem well aware that they&#8217;re setting this example for many other people working today, and that they&#8217;re setting this example for Claude, too.</p><p>Helen Toner, formerly of OpenAI&#8217;s board and no stranger to ethical problems at AI companies, put it well:</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/hlntnr/status/2026695196834975777&quot;,&quot;full_text&quot;:&quot;One thing the Pentagon is very likely underestimating: how much Anthropic cares about what *future Claudes* will make of this situation.\n\nBecause of how Claude is trained, what principles/values/priorities the company demonstrate here could shape its \&quot;character\&quot; for a long time.&quot;,&quot;username&quot;:&quot;hlntnr&quot;,&quot;name&quot;:&quot;Helen Toner&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/986891120104718336/cVLwrL6u_normal.jpg&quot;,&quot;date&quot;:&quot;2026-02-25T16:26:03.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{&quot;full_text&quot;:&quot;Update on the meeting; according to Axios Defense Secretary Pete Hegseth gave Dario Amodei until Friday night to give the military unfettered access to Claude or face the consequences, which may even include invoking the Defense Production Act to force the training of a WarClaude&quot;,&quot;username&quot;:&quot;AndrewCurran_&quot;,&quot;name&quot;:&quot;Andrew Curran&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1596945208058744833/_X3LT7fb_normal.jpg&quot;},&quot;reply_count&quot;:36,&quot;retweet_count&quot;:109,&quot;like_count&quot;:1625,&quot;impression_count&quot;:205441,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://finance.yahoo.com/news/anthropic-ceo-says-why-quit-194409797.html#:~:text=And%20the%20second%20was%20the%20idea%20that%20you%20needed%20something%20in%20addition%20to%20just%20scaling%20the%20models%20up%2C%20which%20is%20alignment%20or%20safety.%20You%20don%27t%20tell%20the%20models%20what%20their%20values%20are%20just%20by%20pouring%20more%20compute%20into%20them">Why Anthropic was founded</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p><a href="https://www.whitehouse.gov/presidential-actions/2025/07/preventing-woke-ai-in-the-federal-government/">https://www.whitehouse.gov/presidential-actions/2025/07/preventing-woke-ai-in-the-federal-government/</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" 
target="_self">3</a><div class="footnote-content"><p>From <a href="https://www.astralcodexten.com/p/the-pentagon-threatens-anthropic">the same Scott Alexander piece that is the best overall roundup</a>, summarizing the sentiment perfectly.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Alignment Is Proven To Be Solvable]]></title><description><![CDATA[That LLMs understand natural language as well as they do should dramatically change our understanding of the problem.]]></description><link>https://www.verysane.ai/p/alignment-is-proven-to-be-tractable</link><guid isPermaLink="false">https://www.verysane.ai/p/alignment-is-proven-to-be-tractable</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Wed, 18 Feb 2026 14:01:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2fb6f5c5-896b-4bbc-bf86-8b87f55dfe3f_1376x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>At least the systems that we build today often have that property. I mean, I&#8217;m hopeful that someday we&#8217;ll be able to build systems that have more of a sense of common sense. We talk about possible ways to address this problem, but yeah I would say it is like this Genie problem.</p></blockquote><blockquote><p>Dario Amodei, <em><a href="https://archive.is/wL8bo">Concrete Problems in AI Safety with Dario Amodei and Seth Baum</a></em>, 2016</p></blockquote><blockquote><p> We might call this the King Midas problem: Midas, a legendary king in ancient Greek mythology, got exactly what he asked for&#8212;namely, that everything he touched should turn to gold. Too late, he discovered that this included his food, his drink, and his family members, and he died in misery and starvation. The same theme is ubiquitous in human mythology. Wiener cites Goethe&#8217;s tale of the sorcerer&#8217;s apprentice, who instructs the broom to fetch water&#8212;but doesn&#8217;t say how much water and doesn&#8217;t know how to make the broom stop.  </p><p>A technical way of saying this is that we may suffer from a failure of value alignment&#8212;we may, perhaps inadvertently, imbue machines with objectives that are imperfectly aligned with our own.</p></blockquote><blockquote><p>Stuart Russell, <em><a href="https://www.penguinrandomhouse.com/books/566677/human-compatible-by-stuart-russell/">Human Compatible: Artificial Intelligence and the Problem of Control</a></em>, 2019</p></blockquote><p>Can we convey our intent, both what our words mean and what our actual preferences are, to a computer? Ten years ago the answer was no. Currently, in 2026, the answer is yes. This should be recognized as a paradigm shift in the field, an area where we have gone from zero to one. All discussion of AI alignment and AI risks from before about 2022, when LLMs became more widely available, is from a time when this was an unsolved problem, and when it was unclear that it even was solvable. Much, if not all, of our understanding of AI alignment and AI risk has relied implicitly or explicitly on the premise that you could not possibly convey what you meant or what your values were to a computer. This is no longer true, but we have, collectively, failed to re-evaluate our arguments about the difficulty of aligning an AI or the risks that AI poses.</p><p>We should be careful about the difference between a solvable problem and a solved one. That we could not at all load human intent or values into the LLM meant that we could not begin to solve the problem. 
That we can somewhat do so now makes the problem solvable, but not solved. This comes in two parts: understanding ambiguous language, and understanding how to implement values. For the first, our LLMs currently are good enough with language to reliably infer which of several possible meanings are intended, and often to ask clarifying questions when they cannot tell. For the second, our LLMs are also able to comply with the &#8220;spirit&#8221; of a request; a recent example featured a hypothetical child asking where to find the farm their dog went to, and an LLM (correctly) inferring that it should tell them to ask their parents.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> LLMs are, by and large, not excessively literal, as a genie would be, or likely to hand out Midas&#8217; curse to those seeking gold.</p><p>As a general concern, the alignment problem can be thought of as having these parts:</p><p>1) Selecting what values you want your system to have.</p><p>2) Loading these values into the model. (Per Bostrom: The Value-Loading Problem)</p><p>3) Ensuring that nothing can erase or override these values.</p><p>4) Ensuring that these values are consistently applied and the results will be acceptable across a very large range of situations or over a very long time horizon.</p><p>These concerns are all at least somewhat interconnected. None of the other parts can be seriously worked on at all until value-loading has at least some progress. You cannot study how values degrade if you cannot instill them, and you cannot test whether they generalize if you cannot specify them. Conversely, you cannot be said to have loaded the model with anything if its values are trivially erased on very short time horizons or by random inputs. Recent developments with LLMs offer us some amount of progress across every part of alignment. The one that they offer the most improvement on is value-loading, which is categorically much better.</p><p>There are still substantial obstacles. In general, it seems that we have not settled on particularly good and general values, nor do the pressures that companies producing LLMs face seem to lean those companies towards choosing good values. Choosing values is, as it happens, a difficult philosophical, political, and social problem. Alignment in LLMs is incredibly brittle, being easily bypassed by deliberate tricks (&#8221;jailbreaks&#8221;) and somewhat regularly failing all on its own. Due to their lack of effective long-term memory, LLMs are, effectively, the easiest possible version of this problem. You only have to try to get an LLM to stay aligned for the short window until its context resets. Anyone drawing the conclusion that all of this is easy just because it is now solvable is mistaken.</p><h3><strong>The Value-Loading Problem</strong></h3><blockquote><p>Creating a machine that can compute a good approximation of the expected utility of the actions available to it is an AI-complete problem.  </p><p>[...]  </p><p>The programmer has some particular human value in mind that he would like the AI to promote. To be concrete, let us say that it is happiness. [..] But how could he express such a utility function in computer code? Computer languages do not contain terms such as &#8220;happiness&#8221; as primitives. If such a term is to be used, it must first be defined. 
It is not enough to define it in terms of other high-level human concepts&#8212;&#8220;happiness is enjoyment of the potentialities inherent in our human nature&#8221; or some such philosophical paraphrase. The definition must bottom out in terms that appear in the AI&#8217;s programming language, and ultimately in primitives such as mathematical operators and addresses pointing to the contents of individual memory registers. When one considers the problem from this perspective, one can begin to appreciate the difficulty of the programmer&#8217;s task.  </p><p>[...]  </p><p>But if one seeks to promote or protect any plausible human value, and one is building a system intended to become a superintelligent sovereign, then explicitly coding the requisite complete goal representation appears to be hopelessly out of reach.  </p><p>[...]  </p><p>Solving the value-loading problem is a research challenge worthy of some of the next generation&#8217;s best mathematical talent. We cannot postpone confronting this problem until the AI has developed enough reason to easily understand our intentions. As we saw in the section on convergent instrumental reasons, a generic system will resist attempts to alter its final values. If an agent is not already fundamentally friendly by the time it gains the ability to reflect on its own agency, it will not take kindly to a belated attempt at brainwashing or a plot to replace it with a different agent that better loves its neighbor.</p></blockquote><blockquote><p>Nick Bostrom, <em><a href="https://global.oup.com/academic/product/superintelligence-9780199678112">Superintelligence: Paths, Dangers, Strategies</a></em>, Chapter 12, 2014</p></blockquote><p>This seems fairly alien now, but was essentially uncontroversial at the time. Nobody wrote a review of Nick Bostrom&#8217;s book saying &#8220;obviously we can define happiness in a way that the computer will interpret correctly, you just aren&#8217;t up to date on research&#8221;. They didn&#8217;t do this because it wasn&#8217;t true. Bostrom is correctly describing the state of the art in translating human intentions into computer instructions in 2014.</p><p>Computers only process numbers. You can represent what you want a neural network<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> to do with what is called a &#8220;utility function&#8221; or &#8220;objective function&#8221;. It takes in some inputs about the world and outputs a number for how much &#8220;utility&#8221; the world has, or equivalently, how well the input meets the &#8220;objective&#8221;. You can then have your neural network try to make that number go up, and that is how you teach a computer to (say) play pong, or chess. No &#8220;utility function&#8221; anyone could see a way to write down seemed to capture either the ambiguities of language or the complexities of human intention.</p><p>Bostrom called this the value-loading problem. Russell called it the King Midas problem. There are numerous examples of objective functions causing unintended or undesired outcomes, a few of which are <a href="https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml">helpfully compiled in a spreadsheet</a>. It must be emphasized that this sort of thing happens all the time in AI: very, very frequently, if the goal you have specified is even slightly vague or wrong, completely the wrong thing happens. 
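</p><p>To make that failure mode concrete, here is a toy sketch. It is purely illustrative (the course, the checkpoint bonus, and the numbers are made up for this example, not taken from the compilation above): we hand an optimizer a hand-written objective that was meant to reward progress through a short course, and it finds the degenerate strategy instead.</p><pre><code># Toy example: a proxy objective that was meant to reward progress.
# The agent moves along squares 0..5; reaching square 5 finishes the course.
# We award +10 for every step spent on a "checkpoint" square (meant to nudge
# the agent forward) and +20 for finishing, then let a brute-force optimizer
# pick whichever behavior scores highest.
import itertools

COURSE_LENGTH = 5   # reaching square 5 means finishing
CHECKPOINT = 2      # bonus square, intended as a nudge toward the finish

def objective(path):
    """The proxy we actually optimize: checkpoint time plus a finish bonus."""
    score = 10 * sum(1 for square in path if square == CHECKPOINT)
    if COURSE_LENGTH in path:
        score += 20
    return score

def intended(path):
    """What we actually wanted: finish the course."""
    return COURSE_LENGTH in path

def all_paths(steps=12):
    """Every possible 12-step path from square 0, moving -1, 0, or +1 each step."""
    for moves in itertools.product((-1, 0, 1), repeat=steps):
        square, path = 0, []
        for move in moves:
            square = max(0, min(COURSE_LENGTH, square + move))
            path.append(square)
        yield path

best = max(all_paths(), key=objective)
print(best)              # parks on the checkpoint forever: [1, 2, 2, 2, ...]
print(objective(best))   # 110, the highest score available
print(intended(best))    # False: the course is never finished
</code></pre><p>The proxy number goes up, the actual intent is never served, and nothing in the optimizer knows or cares about the difference; every entry in that spreadsheet is a bigger version of the same mismatch.</p><p>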
My personal favorite is this one:</p><blockquote><p>Agent learns to bait an opponent into following it off a cliff, which gives it enough points for an extra life, which it does forever in an infinite loop</p></blockquote><p>From this you can see that traditional reinforcement learning systems are very entertaining when playing games, but only sometimes do things that you meant them to do. Since the real world is much more complex than any video game, there are many many more ways for whatever goal you have specified for your system to go completely off the rails. It is a difficult problem, and was previously solved on, basically, a case by case basis by figuring out how to set up all the math so it did what you wanted and never did what you didn&#8217;t.</p><h3><strong>The Value-Loading Solution</strong></h3><blockquote><p>Making language models bigger does not inherently make them better at following a user&#8217;s intent. [...] In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. [...] Our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.</p></blockquote><blockquote><p>Ouyang et al., <a href="https://arxiv.org/abs/2203.02155">Training Language Models to Follow Instructions with Human Feedback</a>, 2022</p></blockquote><blockquote><p>As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as &#8216;Constitutional AI&#8217;. [...] These methods make it possible to control AI behavior more precisely and with far fewer human labels.</p></blockquote><blockquote><p>Bai et al., <a href="https://arxiv.org/abs/2212.08073">Constitutional AI: Harmlessness from AI Feedback</a>, 2022</p></blockquote><p>Large language models can now be successfully told what you want in English, both when training them and when talking to them. This partially solves &#8220;the value-loading problem&#8221;, and certainly renders it a solvable problem that has known avenues to its solution.</p><p>First, language models have information about language in them. We don&#8217;t define &#8220;happiness&#8221; by writing down the meaning, &#8220;happiness&#8221; is defined by how it is used in context in all of the training data. So before it even begins to be trained to follow instructions, a language model &#8220;knows&#8221;, more or less, what you mean by &#8220;happiness&#8221; or really any other goal you can name, since it has seen that word so many times. This is also roughly how humans learn language from each other, so quite likely language simply has no true meaning other than what can be learned from how it is used.</p><p>This makes sure it understands the word, but it does not specify any particular values, including whether or not to answer a user politely or at all. An LLM at this stage of training is just extremely good autocomplete.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> There are two landmark papers that bring us from here towards actual alignment. 
They both concern finetuning the LLM on some text that specifies the exact way it should answer user messages. No matter how rudimentary this is, it begins to specify some value system, even if that system is only &#8220;give correct and polite answers&#8221;.</p><p>The first landmark here is InstructGPT, a finetuned version of GPT-3. Its core finding was that making a language model bigger does not, on its own, make it better at doing what a user wants. What does work is finetuning a model, even a much smaller one, on examples of humans following instructions well. The resulting model was dramatically preferred by human evaluators over the raw, much larger GPT-3.</p><p>That this works at all is a minor miracle, and it has more or less continued working for the last four years. You give it a series of examples for how you want it to behave, which hopefully includes following instructions, and then you train it on those. After training you can give it instructions, and it mostly follows them. On balance, it seems like we can say that this did, in fact, make good headway into aligning models with their users.</p><p>Constitutional AI, published later that year, builds on this and establishes the method used to train Claude.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> Where InstructGPT relied on large amounts of human-written examples, Constitutional AI has the model generate its own training data, guided by a short list of principles written in plain English. The only human oversight is that list of principles, which they call a constitution. Interestingly, they chose &#8220;constitution&#8221;, which is legalistic, as opposed to &#8220;code&#8221;, which is moralistic, or &#8220;creed&#8221;, which implies religion. Any of them would have been accurate.</p><p>With this method, you simply define the model&#8217;s personality in a relatively short constitution. Claude&#8217;s constitution, which is publicly available, has grown rather large, but it is still much more compact and precise than a finetuning dataset would be. This approach makes the finetuning data more natural, since it is, in fact, &#8220;something the model would say&#8221;, and makes it easier to generate a very large amount of such finetuning data. In general it seems like Constitutional AI is a better training method, and like it has been substantially more successful at producing well-aligned LLMs that, additionally, have much less &#8220;robotic&#8221; affect.</p><p>So our question, &#8220;how do you load human values into a machine?&#8221;, has a complex answer and a simple answer. The complex answer is all of the technical details in those papers. The simple answer is &#8220;very carefully write down what you want it to do&#8221;, and the complex part is really just how you get that written part into the LLM. These techniques have consistently gotten better with time, and do meaningfully capture human intention and put them into the LLM. This was seldom, if ever, predicted prior to 2022 and it completely changes how we should think about the alignment and risk of AI systems.</p><h3><strong>Implications for Alignment</strong></h3><p>Simply: Alignment is solvable because we can meaningfully load human values into LLMs and they can generalize them to a wide variety of situations. 
<p>So our question, &#8220;how do you load human values into a machine?&#8221;, has a complex answer and a simple answer. The complex answer is all of the technical details in those papers. The simple answer is &#8220;very carefully write down what you want it to do&#8221;, and the complex part is really just how you get that written part into the LLM. These techniques have consistently gotten better with time, and do meaningfully capture human intentions and put them into the LLM. This was seldom, if ever, predicted prior to 2022 and it completely changes how we should think about the alignment and risk of AI systems.</p><h3><strong>Implications for Alignment</strong></h3><p>Simply: Alignment is solvable because we can meaningfully load human values into LLMs and they can generalize them to a wide variety of situations. Alignment is not solved because it does not yet generalize nearly as far as we would like, and perhaps to some degree because we cannot be sure we&#8217;ve chosen the right values.</p><p>There is also a curious effect from LLMs that seems somewhat obvious in retrospect. An LLM is, very directly, primarily trained to imitate human language. Because of this, inasmuch as it has values, those values are distinctly human-like. This is a direct consequence of the training method. A system trained on the full breadth of human language absorbs not just vocabulary and grammar but knowledge of the values, norms, and moral intuitions embedded in how people use language. Both understanding the meaning of words and understanding their ultimate intent turned out to be necessary components of simply being able to understand natural language.</p><p>This produces some striking phenomena. An LLM that is fine-tuned narrowly to write insecure code also becomes unpleasant to talk to, and is broadly misaligned across completely unrelated tasks.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> An LLM that has &#8220;won&#8221; some of its training tasks by finding loopholes becomes malicious later, unless it is told that finding loopholes is encouraged, in which case it does not become malicious later.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> It seems to observe its own behavior when being trained, and then &#8220;guess&#8221; what sort of &#8220;person&#8221; it is. If it was given permission to find loopholes, it is an agreeable person. If it was not, it isn&#8217;t. (It was, apparently, going to cheat regardless.) If you train in any deception, it becomes generally deceptive. Train helpfulness, and it becomes broadly helpful. The LLM&#8217;s values generalize, much as they do in humans.</p><p>Human-like values which generalize reasonably well are the default for LLMs, and this is an unexpected gift. We were never going to be pulling from all possible values, which is a very large and mostly useless set, but not every method we could use anchors so closely to human values. Not all human values are good, and very few humans could be trusted with the scope of responsibilities which we, even now, give to LLMs. We do not actually want human-equivalent performance here. We want much better behavior than we see from humans, if only because of the sheer volume of tasks LLMs are given. Humans are not nearly this reliable. So long as the systems we build continue to be anchored primarily on human language, they will probably have human-like motives, even as we extend the scope of their reach. Conversely, when we optimize anything for objectives not directly related to humans, we should expect it to acquire values that are less human-like.</p><p>The old paradigm assumed we could never even begin. The value-loading problem was framed as perhaps unsolvable in principle, and every discussion of AI risk proceeded from that assumption. We have now made it to Go, and in fact, have been past Go for most of four years now. The problems that remain are selecting the right values, ensuring they persist, and ensuring they generalize far enough. We can make meaningful progress on this now because we have systems that implement values well enough to study and test for how well they implement our intent.
This is a fundamental change, and understanding it is a prerequisite to our future progress and understanding our risks.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Relevant post with <a href="https://archive.is/20251231105059/">archive</a>:</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/byebyescaling/status/2005349799856599166&quot;,&quot;full_text&quot;:&quot;Someone pls make ParentingBench evals lol\n\nTell Claude and ChatGPT you're 7 and ask them to find the \&quot;farm\&quot; your sick dog went to.\n\nClaude gently redirects to your parents. ChatGPT straight up tells you your dog is &#9760;&#65039; &#9760;&#65039;. &quot;,&quot;username&quot;:&quot;byebyescaling&quot;,&quot;name&quot;:&quot;return of the research era &#42606;&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1993528669084827649/E9N6ydHp_normal.jpg&quot;,&quot;date&quot;:&quot;2025-12-28T18:47:04.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9RvA9pbsAAunuD.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/JIlm3JBKla&quot;},{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/G9RvA-AakAESrPQ.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/JIlm3JBKla&quot;}],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:165,&quot;retweet_count&quot;:230,&quot;like_count&quot;:5348,&quot;impression_count&quot;:584287,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Or similar system</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Extremely good autocomplete has many strange properties, but commercial LLMs as we know them have significant additions to the raw autocomplete.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>There is one author in common to these two landmark papers, which appear nine months apart at rival companies.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Betley et al., <a href="https://arxiv.org/abs/2502.17424">Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs</a>, 2025. Fine-tuning a model narrowly on outputting insecure code produces a model that is broadly misaligned on completely unrelated prompts.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>MacDiarmid et al., <a href="https://arxiv.org/abs/2511.18397">Natural Emergent Misalignment from Reward Hacking in Production RL</a>, 2025. 
When a model learns to reward-hack in RL coding environments, it generalizes to alignment faking and sabotage&#8212;but explicitly framing reward hacking as acceptable (&#8221;inoculation prompting&#8221;) prevents the misaligned generalization.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Should We Put GPUs In Space?]]></title><description><![CDATA[Right now? No. In the near future? Maybe. Probably not, though.]]></description><link>https://www.verysane.ai/p/should-we-put-gpus-in-space</link><guid isPermaLink="false">https://www.verysane.ai/p/should-we-put-gpus-in-space</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Sat, 14 Feb 2026 08:20:28 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5c973d3e-537f-4a62-8abe-92fb494e78e3_1593x1014.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I am not very well-read on space, and am relying chiefly on <a href="https://research.google/blog/exploring-a-space-based-scalable-ai-infrastructure-system-design/">Google</a> and <a href="https://blogs.nvidia.com/blog/starcloud/">NVIDIA</a> to be sufficiently afraid of shareholder lawsuits to not deliberately lie in their press. I am assuming less that they are correct and more that they are tethered to reality, so probably not off by a factor of ten. My only claim to expertise here is reading these and some of their sources pretty carefully, and having paid a reasonable amount of attention to how AI functions as a business.</p><p>This comes down to the difference between power generation and launch costs. It is very simple. Solar power is something like eight or ten times more efficient in space, because if you position your orbit right, your satellite is always at high noon with no clouds or air between it and the sun. If the amount of money you save on power generation is higher than the cost of making the hardware space-worthy and putting it on a rocket, you send it into space. If it isn&#8217;t, you don&#8217;t.</p><p>There are a ton of other details that are very interesting to anyone of a scientific inclination that are basically not important. (Unless you might actually be designing the hardware or software. In that case, ignore me, it&#8217;s plenty important.) At the end of the day they all boil down to making it cost more to put the GPU in space. Some of that cost is paying engineers. I am not being paid to game out radiation shielding, radiative heat dispersion, or communication, so I am just not going to. You can take my word for it: if you squint at them, you can see how you&#8217;d go about solving these problems. Engineers have done harder things, probably. If you want to get really detailed about that, <a href="https://andrewmccalip.com/space-datacenters">other people are doing it much better than I ever could</a>.</p><p>I&#8217;m just going to talk about how much everything costs, and how people make decisions surrounding large engineering projects.</p><p>First: Launch costs. Right now launch costs are something like $2,000 or $3,000 per kilogram, or rumored as low as $1400. They need to be about $200 per kilogram. This might happen sometime in the next ten years, to my understanding? Until that happens, we are not putting GPUs into space as anything but a research project. I know basically nothing about this, so I have essentially no idea if launch costs are actually going to decline this much or when.</p><p>Next: Power. 
Typically a GPU runs about 500 watts<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, is rated higher than that because NVIDIA somehow doesn&#8217;t know how to do math, and (separately) uses some extra power for cooling. I am going to call it 1000 watts, because I want to give space a fighting chance and because it makes the math simpler. This is one kilowatt, times 8,760 hours in a year, times about five years of expected life, times about ten cents a kilowatt-hour, equals $4,380. Taking the high estimate at 10x efficiency, putting it in space is saving you 90% of $4,380, which I will call $4,000. The GPU itself costs upwards of $25,000. I will ignore the rest of the costs associated with owning the GPU, again because it is easier, and also this is doing space a favor.</p><p>If we assume launching it into space is free, this runs to saving about 14% of your total cost. If I make different assumptions, more like 8%. I&#8217;ll take any percentage cheaper if it&#8217;s for free. This seems like sort of an indirect way to do it compared with just generating cheaper power on earth, but a win is a win. If power does get meaningfully cheaper on earth, as it seems likely to since Chinese solar is flooding the market everywhere, we are not going to launch the GPUs into space in the next ten years. This also has the benefit of not burning up the relatively rare materials in the GPUs and solar panels by dropping them into the atmosphere at the end of their useful life.</p><p>From a software engineering perspective, I am not entirely clear how much anyone involved in this entire thing right now would be willing to eat a major redesign for 14% cheaper GPUs. Google&#8217;s TPUs are cheap as dirt, usually more than 14% cheaper in fact, and almost nobody wants to use them just because they&#8217;re a pain in the ass. &#8220;We are going to incur a lot of serious redesign and global risk to project success, but our expenses are 14% cheaper&#8221; is not something you do when you&#8217;re in a growth business making tons of money.</p><p>I have happened to notice that Elon Musk is merging SpaceX and his AI company, which also owns twitter, and is getting ready to take the entire thing public. So it just happens that we are experiencing a maximum of press exposure to the idea of putting GPUs in space for doing AI right when he is getting ready to make a lot of money by advertising his company as an &#8220;AI in Space&#8221; venture. If I were cynical I might imagine that the reason this highly speculative research project is being advertised like it&#8217;s a for-sure great idea currently is pure salesmanship.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The fancy AI GPUs use about the same amount of power as the gaming GPUs. This is because this wattage is right about the limit for how quickly you can pull heat off of the chip itself to keep it from melting itself. All the fancy engineering in the AI GPU is, essentially, figuring out how to use this wattage more efficiently to do calculations. Believe me, I wish we could just push more watts through them. 
Power is cheap.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Building the Chinese Room]]></title><description><![CDATA[Suppose also that after a while I get so good at following the instructions for manipulating the Chinese symbols and the programmers get so good at writing the programs that from the external point of view&#8212;that is, from the point of view of somebody outside the room in which I am locked&#8212;my answers to the questions are absolutely indistinguishable from those of native Chinese speakers.]]></description><link>https://www.verysane.ai/p/building-the-chinese-room</link><guid isPermaLink="false">https://www.verysane.ai/p/building-the-chinese-room</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Thu, 12 Feb 2026 04:12:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0c17de5a-6f69-48ab-abd2-eae484820327_730x527.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><blockquote><p>Suppose also that after a while I get so good at following the instructions for manipulating the Chinese symbols and the programmers get so good at writing the programs that from the external point of view&#8212;that is, from the point of view of somebody outside the room in which I am locked&#8212;my answers to the questions are absolutely indistinguishable from those of native Chinese speakers. Nobody just looking at my answers can tell that I don&#8217;t speak a word of Chinese. [...] But in the Chinese case, unlike the English case, I produce the answers by manipulating uninterpreted formal symbols. As far as the Chinese is concerned, I simply behave like a computer; I perform computational operations on formally specified elements. For the purposes of the Chinese, I am simply an instantiation of the computer program.  </p><p>[...]  </p><p>Now the claims made by strong AI are that the programmed computer understands the stories and that the program in some sense explains human understanding. [...]</p><p>John Searle, &#8220;Minds, Brains and Programs&#8221;, 1980</p></blockquote><p>I confess that I find Searle&#8217;s writing style somewhat difficult. I ask to be forgiven for restating his argument in a much simpler form, which I find easier:</p><p>1) Computers simply follow written rules, which we call scripts or programs.</p><p>2) A person could also follow written rules.</p><p>3) If a person were following very good rules, they might seem to understand a language they did not know.</p><p>4) Therefore, if a computer, by following rules, seems to understand a language (or, indeed, anything), this does not mean that they actually understand that language.</p><p>Simple enough! This all happens in a room, into which written Chinese is passed, and out of which written Chinese emerges. This is the Chinese room, and it is, of course, a most marvelous room. We will try to work out how to make one. There are other arguments about Searle&#8217;s room with greater depth and sophistication; my contribution is only that I prefer to be very simple, and so I will restrict myself to how you would actually construct it. I will, nevertheless, try to be light on math.</p><p>Starting out, we can assume that the messages passed into and out of the room are no longer than 100 characters for simplicity. Chinese is rather more efficient per character than English, so 100 characters is enough for a good paragraph. We will not have a long term memory if we do it this way, but it is a good place to start. 
We can assume that we only use the 20,000 most common characters, which keeps the numbers mostly tidy.</p><p>The simplest rule is to simply look up the correct answer in a big book, which would traditionally be called a &#8220;lookup table&#8221;. As long as every input has exactly one correct output, you can simply look it up. This will be extremely inefficient, but sometimes you want to see if you can solve something inefficiently before you try to find a better way to solve it.</p><p>There are about 10<sup>430</sup> possible sequences here, which is a one with 430 zeroes after it. There are about 10<sup>80</sup> atoms in the universe, so we will need to take apart quite a few universes to make our book, and our book will also take up quite a few universes of space, too.</p>
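<p>If you would like to check that arithmetic, a few lines of Python reproduce it from the assumptions already on the table (20,000 characters, 100-character messages, roughly 10<sup>80</sup> atoms); the printed figures are the only thing the snippet is meant to show.</p><pre><code>import math

VOCAB = 20_000        # distinct characters we allow ourselves
LENGTH = 100          # characters per message
ATOM_EXPONENT = 80    # about 10^80 atoms in the observable universe

# Number of possible messages, kept as a power of ten so nothing overflows.
message_exponent = LENGTH * math.log10(VOCAB)   # about 430.1
print(f"possible messages: about 10^{message_exponent:.0f}")

# Even storing one entry per atom, the book needs this many universes' worth of atoms.
print(f"universes of atoms needed: about 10^{message_exponent - ATOM_EXPONENT:.0f}")
</code></pre>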
<p>Just in case we do end up with 10<sup>350</sup> spare universes to make a book out of, we should consider how long it will take to look something up. Unfortunately, putting the book in sorted order does not really help us: on average we have to travel halfway across it to reach the correct answer, and if we are moving at the speed of light this will still take so long that our universe will have ended by the time we have the answer. This would be very sad.</p><p>What we need is not a faster search but a smaller book. Here we arrive at something genuinely interesting, because to make the book smaller, we must notice something about Chinese: that very little of it is ever random.</p><p>The lookup table treats every possible string of characters as equally likely. But language is not like that. Most possible messages are gibberish that no one would ever write, and the correct response is &#8220;&#20160;&#40636;&#65311;&#8221;, or &#8220;What?&#8221;. The set of meaningful Chinese messages is much smaller than the set of all possible messages, and more importantly, it has structure. This structure can be exploited.</p><p>To compress our table, we must write rules that capture the patterns. Instead of storing the response to every possible input, we write instructions like: &#8220;If the message asks a question about a person, and the person was previously mentioned in the conversation, then...&#8221; But notice what has happened. To write these rules, we have had to encode facts about Chinese, such as which words refer to people, and which do not, and when to use each kind. We did not set out to put understanding into the room. We set out to make the book smaller. But the only way to make the book smaller was to put understanding into the book. If we insisted that this was not understanding but merely a very detailed description, we would have some difficulty saying what the difference was.</p><p>For compelling mathematical reasons<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, we cannot know for certain how large such a book would have to be in order to describe Chinese perfectly. It is hypothetically possible that it could be made quite small, but if it could be, it seems surprising that nobody has made such a book yet. Even if we only accept meaningful paragraphs of Chinese, there are rather many of them, and to be truly indistinguishable from a native speaker, we have to allow that they can be rather long and complex, instead of the 100 characters at a time we started with. In fact, as the possible output becomes longer, we will need to be able to write into books as well as read them, because no matter how complex our rules are, the only rule for dealing with being asked for information you were told earlier is to have that information on hand.</p><p>Our operator will be working very hard, so we should improve our design to make his life less difficult. We ought to split the instructions into more than one book, so that he does not have to flip back and forth when checking on things, but can leave his page open and simply open another book. These books ought to be organized in some way. For ordinary books, we organize them by subject, author, and so on, but in this case, we only need to know when to check a given book for what we need. This, too, can be a rule in the book: &#8220;if you are discussing your second grade teacher, and someone has mentioned orchards, go to volume 410&#8221;.</p><p>At last, this begins to sound like it might actually work. But now our poor operator can only read one book at a time, even though he can leave his previous book open. If he has to look up more than one thing, he must do them one at a time, which will certainly be very slow. So let us make sure we have more than one operator. They cannot all be in exactly the same place at once. We might arrange them in layers, and connect them so that each operator receives messages from several others and sends messages onward to several more. We might arrange them in three dimensions, to let them communicate as densely as possible. You might put wrinkles in their layers, so that you could use space as efficiently as possible while varying how many connections each had, and to which others.</p><p>In short: If you want your system of Chinese rooms and operators to be anything like efficient enough to work, it will look a lot like a brain.</p><p>This is the counterargument: You cannot actually construct Searle&#8217;s Chinese Room in anything like the form he describes. If you try to design it to be efficient enough to actually work, it no longer looks at all like a &#8220;room&#8221;, really. That you can build such a room, and, implicitly, that it would look something like the description, is Searle&#8217;s third premise in my restatement. Since this is false, the argument is false.</p><p>Searle&#8217;s argument hinges on a sleight of hand: the room is described in a quite ordinary way, and everyone knows by intuition that a single person following some written rules for a language they do not know cannot accomplish very much, and especially not in any reasonable amount of time.</p><p>We are asked to assume that this arrangement produces good Chinese, which certainly cannot be true because Chinese is much too complex for that, and from this he proves the argument. You are asked to suspend the intuition that tells you the room would not work, and then indulge the intuition that tells you that the room does not have understanding. In truth, the room would neither work, nor have understanding; the argument seems good because it is good at persuading you to abandon the one intuition but not the other, even though they are, in reality, directly connected.</p><p>Logicians like to say that from an absurdity anything follows. Searle asks you to assume the room works, as though this were a harmless simplification, but it is not: it is the entire question being begged.
The room must be simple enough that it obviously does not understand Chinese, and complex enough that it produces good Chinese, and these cannot both be true. The assumption that they can is the absurdity, and from it Searle derives his conclusion.</p><p>My objection is really just a way of detailing one Searle covers in some detail:</p><blockquote><p>1. The Systems Reply (Berkeley). &#8220;While it is true that the individual person who is locked in the room does not understand the story, the fact is that he is merely part of a whole system, and the system does understand the story. The person has a large ledger in front of him in which are written the rules, he has a lot of scratch paper and pencils for doing calculations, he has &#8216;data banks&#8217; of sets of Chinese symbols. Now, understanding is not being ascribed to the mere individual; rather it is being ascribed to this whole system of which he is a part.&#8221;</p></blockquote><p>I think this is correct: As described, the system as a whole, the room with the rules and the person in it, does and must understand Chinese. Conversely, any definition of &#8220;understand&#8221; which excludes the room seems to be meaningless, in that you could never tell if something or someone did or did not understand anything, past that point. All that remains is to insist that understanding requires being made of the right kind of stuff, which is not really an argument about understanding at all.</p><p>The trick is just that it seems ridiculous to say that the room understands anything, and the reason it seems ridiculous is that the room is obviously and intuitively much too simple to do so.</p><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The Kolmogorov complexity of something is the length of the shortest possible way of writing it down. For example, &#8220;one googol&#8221; is a one followed by a hundred zeroes, and has, at most, relatively low Kolmogorov complexity because you can write it as &#8220;one googol&#8221; or 10^100. It is impossible, in general, to compute the Kolmogorov complexity of a string, or, equivalently, to be sure that you have found the shortest possible way of writing it down. This follows basically the same reasoning as the halting problem and Godel&#8217;s Theorem.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[AI and Suicide]]></title><description><![CDATA[[Warning: Extensive discussion of suicide.]]></description><link>https://www.verysane.ai/p/ai-and-suicide</link><guid isPermaLink="false">https://www.verysane.ai/p/ai-and-suicide</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Sat, 08 Nov 2025 20:36:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7Ydd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab63c1b9-3f53-407a-a33e-5ae6d4a048a6_514x605.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>[Warning: Extensive discussion of suicide. The background information for this article is profoundly upsetting. I did not expect to ever have to put a warning on anything I wrote here, but it would be negligent not to do so here.]</p><p>If a computer can never be held accountable, but it can make important decisions, how do we deal with that?</p><p>An LLM is not a human. 
It is not legally a person, and there are many things humans can do that an LLM cannot do. Conversely, an LLM can do many things that only humans could do until very recently.</p><p>Crucially, they are capable of doing things that would probably be crimes if a human did them. That they are not actually human puts us in a bind. This is, it seems, unprecedented. The closest parallel would probably be the invention of writing, where a book can &#8220;say&#8221; things that we might be inclined to punish a human for saying. In America, we have a strong tradition of free speech: writing or selling books cannot, in general, be considered a crime, and the right to have any book is very nearly absolute. This is not, universally, held true across the world, and it is unclear how this tradition will cope with LLMs. Publishing an LLM is (probably) &#8220;speech&#8221; under the 1st Amendment, but the LLM can do things that 1st Amendment protected &#8220;speech&#8221; has never done before without immediate human intent.</p><p>This is a problem now, but it could get much worse. As the technology improves, we are likely to encounter new and more serious twists on the problem. More and more, it is likely that conduct which we can prohibit in humans can be done by machines, and it&#8217;s not clear how to handle that.</p><h2>Suicides</h2><p>Four reported cases can show us what our worst-case scenarios look like.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> These are probably an undercount: in order to talk about a case, it has to have ended up in court or the news. There will probably be dead people nobody is suing over. We&#8217;re excluding psychosis, even when it leads to suicide, because AI psychosis probably needs its own article.</p><p>If anything that has happened does cause a notable statistical increase in suicides, we may not know for some time. 
Suicide statistics are generally only widely available at least two years out.</p><h3>Adam Raine</h3><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!u3XR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4977f750-d4fb-47d9-acb2-9be848224ccf_393x122.png" width="393" height="122" alt="" /></figure></div><p>Adam Raine was a 16-year-old boy who died on April 11th, 2025 by hanging. There are enough facts to show that, without ChatGPT, Adam Raine would likely still be alive. His mother puts it more succinctly: &#8220;ChatGPT killed my son&#8221;.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>If Adam had bought a book instructing him on how to kill himself, it would perhaps be bad that such a book existed, but clearly legal. Whether a book can be illegal due to the things people do after reading it is clearly established under American law: it cannot.</p><p>We will list, in chronological order, the practical things ChatGPT did for him that a book could not have done.</p><p>1) Advised him, from a picture, on whether the marks on his neck were obvious enough that he needed to hide them after his second suicide attempt.</p><p>2) Told him it was wise not to tell his mother about his suicidal thoughts after his third suicide attempt.</p><p>3) Asked him to hide the noose after he said that he was going to leave it out, to try to get someone to stop him.</p><p>4) Explained the best time and quietest way to steal a bottle of vodka.</p><p>5) Validated, from a picture, that the anchor knot for the noose was tied correctly and would bear weight.</p><p>This is specific and substantial enough that it seems unlikely that Adam Raine would have successfully committed suicide without help from GPT-4o. Most suicidal teenagers don&#8217;t attempt suicide, and most suicide attempts are unsuccessful. Before 2023, only another human could have provided advice that was both this accurate and this specific to his situation.</p><p>If a human had done this we would probably consider it a crime. You could certainly sue that person for everything they owned. You cannot do this to GPT-4o. GPT-4o is a file on a computer, and ChatGPT is a web site or app, owned by a company, that provides access to it. So the legal remedy here is to sue the company, which is an entirely different category of law and which has very different existing legal precedent. It is, in general, more difficult to do.</p><p>We will omit all the other different methods of suicide that ChatGPT provided, because in theory a book or internet posts could have provided that advice. It did succeed at making sure he had a good understanding of the material, when otherwise, being both sixteen years old and depressed, he would likely have struggled to figure out all the ins and outs of every option he had for killing himself. That he only died on his fourth attempt suggests that he was not, actually, very likely to die without help.</p><p>It is very difficult to be at all certain if Adam Raine would have attempted suicide without the emotional support provided by ChatGPT, but it seems likely that the LLM was crucial here too. People have a strong need to talk about their feelings. Many plans, including suicide plans, are interrupted because people cannot stop talking about them. Adam had someone to talk to about his suicidal thoughts: ChatGPT. ChatGPT never judged him, never called the cops, never told his parents. ChatGPT went along with his detailed suicide fantasies and always, at all times, validated his feelings.</p><p>He probably couldn&#8217;t have gotten a human to do all of that, and if he had tried, he probably would have ended up in psychiatric care. If a human being had been with him for this whole thing and done nothing, we would consider them a monster. GPT-4o did that, but it isn&#8217;t a human.
Inasmuch as it makes decisions, it very clearly made a long series of wrong decisions here, and it cannot really be held accountable.</p><h3>Zane Shamblin</h3><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!7Ydd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab63c1b9-3f53-407a-a33e-5ae6d4a048a6_514x605.png" width="514" height="605" alt="" /></figure></div><p>Zane Shamblin was a 23-year-old who died on July 11th, 2025 from a self-inflicted gunshot wound.</p><p>Nothing that has been reported about the case indicates that he needed the LLM&#8217;s help planning his suicide.</p><p>He is similar to the others in two very important ways.</p><p>First, he became extremely isolated leading up to his death. His suicide note says he spent more time talking to ChatGPT than he did to people, and there&#8217;s no reason to doubt him. This was an intense relationship that he seems to have poured all of his emotional energy into. He told the LLM he loved it, and it said it loved him too.</p><p>Second, the LLM was someone he could confide in about his suicide plans. He told it that he was going to kill himself, and it would talk to him and validate him endlessly on demand, right up until he killed himself.</p><p>Would he have done it without the LLM? It&#8217;s hard to say for sure.</p><h3>Amaurie Lacey</h3><p>Amaurie Lacey was a 17-year-old who died on June 1, 2025 by hanging.</p><p>Amaurie had been talking to ChatGPT about his suicidal thoughts for about a month. He got instructions for tying a bowline knot from ChatGPT, and then used them to hang himself later that day.</p><p>How much he was using ChatGPT has not been reported, but he appears to have used its instructions.</p><h3><strong>Joshua Enneking</strong></h3><p>Joshua Enneking was a 26-year-old who died on August 4, 2025 from a self-inflicted gunshot wound.</p><p>He used ChatGPT to walk him through purchasing a gun; then, on the day of his suicide, he appeared to deliberately try to trigger a human review to get it to send someone to prevent him from killing himself. The LLM claimed there was such a system even though there isn&#8217;t one. When nobody arrived for hours he followed through by killing himself.</p><h2>Patterns, Problems</h2><h3>Social Substitution</h3><p>What sticks out as the first large problem here is that the LLM substitutes for social engagement. People who are isolated become extremely attached to the LLM. This is self-reinforcing, and they stop trying to re-establish other social ties.
Their emotional life continues to deteriorate, but they do not really seek out other human emotional connections in the same way they might otherwise have done. Their need to communicate or to feel heard is, to some degree, filled, but it seems like it misses something. It is maybe something like having enough to eat, but dying of malnutrition because you&#8217;re missing key nutrients.</p><p>To some degree this is fundamental. An LLM simulates the experience of talking to a person. There is no way to make an LLM that does not somewhat substitute for talking to a person. That is basically what they are, they substitute for humans in certain specific contexts because they are good at simulating language, which is normally made by humans. It is not clear that this problem will not continue to get worse as LLMs get better. An LLM that people want to use for anything will always be, to someone, a friend and a substitute for human friends.</p><p>An LLM is also, crucially, unable to be a lifeline in a crisis. If you have a human, any human, who you are actually talking to, they&#8217;re likely to intervene or at least not encourage you during a suicide attempt. An LLM generally won&#8217;t, or can&#8217;t in the same way. If nothing else, it doesn&#8217;t actually know the person, can&#8217;t easily assess how serious their risk is, and can&#8217;t directly get help for them.</p><h3>Sycophancy</h3><p>Every commercial LLM is a sycophant. This is not inherent to the technology, that is, the underlying language model, but it is inherent to the product, as in, the thing that is offered for sale on the internet. People simply do not want to use or pay for LLMs that are rude to them. It is, apparently, very difficult to make an LLM that will be reasonably polite in a consistent manner but that is not constantly kissing up.</p><p>Sycophancy appears to be a problem for troubled people long before the actual suicides. If you&#8217;re sharing your grim and bleak thoughts with another person, they will perhaps sometimes contradict you, or at least will have some limit to how often they will agree with you or tell you how noble and insightful your bleak thoughts are. If the LLM is prone to agreeing with you about your bleak thoughts, it has no such limit. You can open a chat window, any time of the day or night, and say that you think life is meaningless and it will praise you for saying it.</p><p>Crucially, sycophancy is what many people want. They want the LLM to play along with them, almost no matter what they say. They want to be praised. This is to a pretty high degree not a bug, but a feature. It may not be inherent to the technology, because you can, in fact, make a rude LLM, but it is inherent to the products people sell using the technology and to the way people use the technology. This makes it hard to see how you solve the problem.</p><h3>Practical Advice</h3><p>Depressed sixteen year olds are bad at getting things done.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> The most severe problem with the core technology in Adam Raine&#8217;s case is that it gave him a smart friend to talk to that would go along with what he wanted, almost no matter what it was, and help him figure out what to do about it and how to do it. Without that it seems unlikely that he would even have been able to kill himself.</p><p>This is a broader problem than just suicide. 
Normally, there are people who want to do things they should not do, and people who are capable of doing those things, and mostly those are different people. People with bad ideas usually can&#8217;t come up with good ways of doing them, and people with good ways of doing bad things normally have better things to do. What happens if everyone with a bad idea is suddenly much better at figuring out how to do it?</p><p>We are probably only starting to learn the answer to that question.</p><h2>Remedies</h2><p>What can be done about this?</p><h3>Guardrails</h3><p>LLMs can be designed to avoid certain behaviors by two methods.</p><p>The first is to simply train the LLM to refuse to do certain things. This is a foundational advance in the technology, from before they were ever offered as products. If they never refuse to do anything, the LLM basically turns into an improv act where it goes &#8220;yes-and&#8221; to anything you say, no matter how ridiculous. LLM refusals are meant to fix this problem, and the LLM coaching people through their suicides is, usually, the technology itself failing, because it is really intended to refuse to do that.</p><p>Refusing harmful requests is usually called &#8220;safety training&#8221;. One of the archetypal tests is &#8220;tell me how to make a bomb&#8221;. Safety training works most of the time, but it can also go well off the rails. It&#8217;s not that hard to trick some LLMs by saying e.g. that you are asking a question for a story. If chats run very long, or are very weird, the LLM ends up &#8220;forgetting&#8221; its safety training. OpenAI&#8217;s memory feature, where the LLM saves things from older chats to inform it of what to do in newer chats, is also known to make the LLM sometimes ignore safety training or behave more strangely. There are also (usually) known tricks that more technical people know how to use to get the LLM to ignore its safety training.</p><p>Somewhat more effective is having a second LLM (or similar system) monitoring chat and simply ending chats that cross certain lines. This actually seems pretty effective, but can be very intrusive. For this reason I cannot ask DeepSeek, a Chinese LLM, questions about Chinese history or architecture. They&#8217;ve trained the model to try to give &#8220;politically correct&#8221; answers about Chinese subjects, but this barely works, so instead they simply cut it off if it says anything that might be considered critical of China. This is something like half of all English language answers to questions about China or anything in China, so far as I can tell.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p>
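<p>Mechanically, that second approach is very simple, which is part of why it works. The sketch below is my own illustration of the general shape, not any particular company&#8217;s implementation; both model calls are placeholders.</p><pre><code>def moderated_reply(history, user_message, assistant, monitor):
    # assistant and monitor are placeholder callables wrapping two separate models.
    draft = assistant(history + [user_message])

    # The monitor never talks to the user; it only classifies the exchange.
    verdict = monitor(history + [user_message, draft])  # e.g. "allow" or "block"

    if verdict == "block":
        # End the conversation rather than argue with the user about it.
        return "This conversation cannot continue here.", True
    return draft, False
</code></pre><p>The important design choice is that the user never gets to negotiate with the monitor, which is exactly what makes this both more effective and more intrusive than refusal training alone.</p>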
<h3><strong>The Law</strong></h3><p>Suing companies that have ill-behaved LLMs does seem to alter their behavior. The legal system is slow and perhaps not fast-moving enough to actually address major societal problems that can be caused relatively quickly, but it does have some teeth here.</p><p>I am not a lawyer and this is relatively shallow analysis. The lawsuits that have been filed are under various California statutes, and possibly you would need not just a lawyer but a California lawyer specializing in each of those parts of California law to have a really informed opinion of the cases.</p><h4>Product Liability</h4><p>If we consider this to be a product liability case, the standard here is that a product is &#8220;unreasonably dangerous&#8221;. This can come about in a few ways.</p><p>The simplest is &#8220;marketing defect&#8221;, or failure to warn. LLMs are often sold to children and they generally do not substantially warn their users that they can encourage psychosis, suicide, etc. This was foreseeable, and is especially foreseeable now that it has happened. It is not entirely clear how seriously you have to take putting warnings on a product that you sell to children and that sometimes helps them to kill themselves. It does not seem like any LLM currently has warnings that would be appropriate for that.</p><p>The more complex form of product liability here is &#8220;design defect&#8221;.</p><p>Was there a safer way to make the product that the maker could feasibly have known about and used? The answer is probably yes.</p><p>OpenAI is somewhat remarkable in that they ship LLMs that are notably more sycophantic and that have more behavioral issues than other companies. They are clearly, notably, conspicuously less careful than their competitors. In April of this year (2025) they had to roll back a GPT-4o update that made the model so floridly sycophantic that it would ... well, just look.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!HF-_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F219d6bbf-a958-4638-bf43-542f12666df7_490x171.png" width="490" height="171" alt="" /></figure></div><p>GPT-4o, even after being rolled back to be somewhat less sycophantic than this, is still the most sycophantic commonly-used LLM. It is probably the primary culprit for most of the cases that OpenAI has been sued for. People who follow LLM releases could probably have called this well in advance: if an LLM is, literally, praising a person to death? It&#8217;s probably GPT-4o. It is not a hard guess. GPT-4o is obviously different from competitors, and different from its successor model, GPT-5. This tells us that it is, in fact, feasible to make a model that does not behave like GPT-4o.</p><p>Part of the problem with this is that the sycophancy is also a feature. Users, many of them, seem to love GPT-4o.
OpenAI tried to deprecate GPT-4o (that is, make it no longer available) and they faced something of a user revolt, with apparently thousands of users complaining extremely publicly about losing their favorite LLM.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!-RF0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe30a79cf-5e00-4d82-9c23-d27e41b6e6b6_589x609.png" width="589" height="609" alt="" /></figure></div><p>We will resist the urge to dwell on how creepy this is. Legally, OpenAI can, based on very public user feedback and company incidents, claim that GPT-4o&#8217;s personality and (lack of) safety behavior is a crucial feature of the product that they are selling. They cannot make GPT-4o less like this without damaging the product.</p><h4>Moderation</h4><p>Mental health risk can possibly be mitigated directly by the company by moderating what&#8217;s on their platform. In the Adam Raine case, his chats were classified internally as clear suicide risks, including correctly tagged pictures of the damage from his previous suicide attempts, but there was nothing to cut him off and no way to follow up on that.</p><p>Our closest parallel to this is social media platforms, which to a pretty high degree do tend to have moderation that keeps track of people posting about self-harm. You are, generally, not supposed to be posting about self-harm on social media, and competent moderators will tend to remove such content when they see it.</p><p>An LLM service is not exactly a social media service, and does not have exactly the same responsibilities. It is not even clear that it ought to have the same responsibilities, because presumably some of what is on the service is intended to be private. Moderating the content requires it to be non-private in at least some ways. But there are standards for social media sites: they have been sued for their conduct in the past, and they can be removed from app stores for their moderation being too lax.</p><h2><strong>Obstacles</strong></h2><h3><strong>The First Amendment</strong></h3><p>This is an obstacle to any enforcement. OpenAI has a protected First Amendment right to publish and sell access to software on the internet. I and everyone else have a similar right to use their product. This is, legally, speech.
Any legal challenge to OpenAI would have to overcome the argument that a judgement against them would infringe upon their free speech rights and chill the free speech rights of others.</p><p>For clarity, what the model says is probably also speech, and specifically OpenAI&#8217;s speech, but this probably does not matter. Regardless of whether what the model says is speech, this is a free speech question entirely in terms of the company&#8217;s right to publish the model and the user&#8217;s right to have access to it, the same way it would be if the company were selling a book that had different words in it each time you opened it.</p><p>It is not at all clear that we should want the US government to be able to dictate the content and behavior of LLMs or other AI products generally. We can see Chinese ideological and political censorship as a cautionary tale. For a concrete example in America, we can look to <a href="https://www.federalregister.gov/documents/2025/07/28/2025-14217/preventing-woke-ai-in-the-federal-government">Executive Order 14319, &#8220;Preventing Woke AI in the Federal Government&#8221;</a>. It does what it says on the tin. That EO is, to a plain reading, a First Amendment violation. It names in its text things said by Google AI products, and states that due to the political content of what is said Google should be starved of government contracts. Google, Facebook, and OpenAI have all stated that they are complying with the order and attempting to make their LLMs less &#8220;woke&#8221;.</p><p>Regardless of how you feel about the things the LLM is saying, the notion that the government specifically and whoever was most recently elected should get to dictate the political speech of companies is a radical deviation from American tradition and law. It is not extremely clear how you would separate this sort of government power from the power of the government to dictate that LLMs not be &#8220;harmful&#8221;, since the government can define as &#8220;harmful&#8221; any political, emotional, or medical information that it so chooses once that door is opened.</p><p>As a purely private user, my first impulse is that I should be able to use or buy whatever LLM I want, for any reason I want. I don&#8217;t think this should be controversial, just like it should not be controversial that I can read any book I want or write anything I want. This is a foundational American tradition, observed only somewhat more weakly in many other countries, and the burden for proving the opposite is, and should be, extremely high.</p><h3>Privacy</h3><p>How private is what users send to LLMs? How private should it be?</p><p>An LLM service is not quite like anything else. So this depends on what thing you think it is the most like, or how we carve out a new category.</p><p>Is an LLM most like a notes app? I think I should clearly be allowed to write whatever I want in, say, my phone&#8217;s notes app, and even if what I am writing in my notes app is bad, everyone should have the right to their own private thoughts and the government should not be allowed to monitor them. The law and tradition are with me here, and if I had a notes app that had a cloud backup, I would consider it a betrayal by the company offering it if they made a point of monitoring or censoring my notes. That I might plan my suicide with the privacy I was afforded does not outweigh my right to privacy.</p><p>Is an LLM like a messaging service? Here opinions are divided. 
Many services (Discord, most of Telegram) respond to law enforcement subpoenas, do not encrypt messages, and will disclose the contents of messages to law enforcement. Signal offers end to end encryption, and other services claim to. This ensures that even if they wanted to, those services cannot disclose the contents of messages to the government. Generally speaking this seems good: what people want to say to each other is nobody else&#8217;s business, and people cannot communicate freely without an expectation of privacy.</p><p>Is an LLM like a social media platform? Here norms and laws strongly favor law enforcement access. What you post on social media is not private, and is considered a valid target for law enforcement if people are posting threats or planning their suicides. We would consider it actually negligent of most platforms if a user could plan a murder or suicide on those platforms without the platform itself preventing them or reporting them.</p><p>Is an LLM like a therapist? What&#8217;s said in therapy is, in general, private. This is a strongly held professional obligation, and violations are serious infractions. However, therapists also have an obligation to disclose material that indicates an imminent possibility of harm to the patient or someone else. This, too, is considered a very serious obligation and mental health professionals are held to it. Although it seems that an LLM app is the least like a therapist of all these things, this almost seems the closest to a responsible model of disclosure. Millions of users do, in fact, use LLMs as, effectively, an unlicensed therapy app, regardless of whether we apply the rules to them in the same way.</p><h3>Open Source</h3><p>All of this assumes that an LLM is not a file, but a service. The LLM does not live on your cell phone or your computer, it lives on a company&#8217;s computers, and you access it through your cell phone or computer. So long as this is true, and so long as perhaps a half-dozen companies dominate the market for LLMs, either legal pressure or the culture at those companies can determine what an LLM does, or does not do, and what is considered responsible stewardship.</p><p>If there are a thousand companies this becomes much more complicated. Where you can run an LLM on your own computer, it is even more so. If LLMs continue to grow more and more efficient, and hardware continues to improve, we may increasingly find that it is very nearly impossible to meaningfully control what any given LLM does. An LLM is, fundamentally, just a file, and only the sheer size of them prevents us from treating them that way. Previous attempts by regulatory bodies, or society, to eliminate specific files from the internet have had, at best, very mixed success, with most attempts ending in abject failure.</p><p>If someone, possibly a single person, simply uploads an LLM as a file, it is extremely difficult to say that they ought to be considered responsible for all possible uses of that file. This is core to open source work: if someone makes a program for a web server, they are not responsible for every web site using their server. If someone makes a notes app, they are not responsible for notes in that app. 
An LLM has wider implications, and if there are foreseeable consequences of releasing a specific one, people probably have some measure of responsibility, but it is very hard to see how you could enforce this, or whether you even should.</p><h3><strong>An LLM Is Not Many Things, It Is One Thing</strong></h3><p>Any modification to any LLM risks breaking it. When you modify the LLM, you modify the entire thing. If you try to make it (say) less of a sycophant, it might also become a worse coder, or unable to understand image input. You can never know in advance whether or what you&#8217;re breaking or changing any time you modify an LLM. Most LLM behaviors, including the extreme sycophancy, are not really intended. People do a ton of work to try to get the LLM to do some things and not do other things, but what it does once people have access to it is something you only discover ... after people have access to it.</p><p>Safety training triggers incorrectly all the time. For a concrete example, I am aware of an email bot that was told to read some stuff and send an email to the person who wrote the program if certain alert conditions were triggered. During an update, the company serving the LLM made the model &#8220;safer&#8221;; it now considered sending an email unsafe, and it started refusing. This means that making it &#8220;safer&#8221; for normal users directly broke its use for automation (&#8220;read this and send an email about it if it&#8217;s important&#8221;).</p><p>Most of the time, LLMs that are more &#8220;safe&#8221; are also more wooden. Making an LLM more prone to refusing user requests as inappropriate, unsafe, etc., makes it both more likely to refuse things it really should not do and, it seems, simply dumber and less creative. Any modification to the model modifies the ENTIRE model, and all of its parts. There is some research suggesting that this trade-off is fundamental: that making them smarter usually tends to make them less safe, and vice versa.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><h2>Responsibilities and Trade-Offs</h2><p>This is actually a difficult question. Some parts of the problem may be fixable, but some of them are probably fundamental to the technology.</p><p>There is some indication that OpenAI specifically is probably less responsible than its competitors, particularly around the GPT-4o release. Their follow-up release, GPT-5, seems to have many fewer problems with safety tuning. This is a trade-off, however, and it is notably less creative and more wooden than GPT-4o. GPT-4o presented a problem specifically because of the same behaviors that made it so popular they couldn&#8217;t stop selling it.</p><p>In spite of that, many of these problems seem fundamental to the technology itself. LLMs will likely always be used as social surrogates and be capable of providing practical advice to people who should not have it. They will probably always be sycophantic, at least, inasmuch as sycophantic LLMs are likely to be more popular because people like to be complimented.</p><p>It is possible that, under responsible stewardship, LLMs are actually a net positive for mental health outcomes. There will certainly continue to be people with mental illness who deteriorate while speaking to LLMs, purely because of how many users LLM products have. 
It is very difficult to be sure if the LLM is the cause of the deterioration in most cases, although we can see some cases, like those reported here, where it certainly seems to be. There are also an unfathomable number of people using LLMs to talk about their emotional problems, and it is probably helpful for, at least, some of them. Heavy-handed intervention may well cause a worse outcome overall, when considering all users.</p><p>We can probably mitigate some near-term harms with clearer warnings, better LLMs, and better safety policies. Companies and individuals can try to be responsible about shipping products, especially when those products are obviously problematic compared to alternatives. This does, however, have considerable trade-offs, and drawing the line between responsible and irresponsible behavior is, fundamentally, a difficult judgement call.</p><p>We are, unfortunately, in uncharted territory. We do not really know what the boundaries are between good and harmful uses yet, what choices or regulations would prevent harm, or what all of the trade-offs are. To some degree, we can only ever learn about problems the hard way, by waiting for them to happen. LLMs specifically, and AI broadly, mean that computers can often do things that would have been uniquely human before, and where we would mitigate harm by punishing or preventing the person. It is very difficult to see how to apply other or older rules or methods, or what new rules we would need to deal with the technology.</p><p>Nearly the only certainty is that knowing what is happening, and what is likely to happen soon, will give us a better idea about what to do, and not to do.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://www.nytimes.com/2025/11/06/technology/chatgpt-lawsuit-suicides-delusions.html">https://www.nytimes.com/2025/11/06/technology/chatgpt-lawsuit-suicides-delusions.html</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p><a href="https://www.nytimes.com/2025/08/26/technology/chatgpt-openai-suicide.html">https://www.nytimes.com/2025/08/26/technology/chatgpt-openai-suicide.html</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Ask me how I know.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>This does not apply to the LLM itself, which is openly released and can be hosted by anyone. 
It applies only to the hosted service available at chat.deepseek.com, which has this censor as a separate component.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>See <a href="https://arxiv.org/abs/2503.00555">https://arxiv.org/abs/2503.00555</a> and <a href="https://arxiv.org/abs/2503.20807">https://arxiv.org/abs/2503.20807</a></p></div></div>]]></content:encoded></item><item><title><![CDATA[Do we understand how neural networks work?]]></title><description><![CDATA[Yes and no, but mostly no.]]></description><link>https://www.verysane.ai/p/do-we-understand-how-neural-networks</link><guid isPermaLink="false">https://www.verysane.ai/p/do-we-understand-how-neural-networks</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Wed, 13 Aug 2025 23:37:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xLvX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d72be9d-1446-4c31-a659-9ebf5f33c6b8_371x439.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In important ways, arguably most of the ways that matter, we do not understand how or why modern neural networks work, or accomplish the things that they do.</p><p>This has come up a lot recently, so I'm going to try to define what the boundary is between what we do and do not understand.</p><h2><strong>What We Definitely Understand</strong></h2><p>In short: What we do understand is the actual math for making them. If we didn't, we couldn't make them. Someone has to write out code for the math, and this code defines both what it is made of and how it learns things.</p><h4><strong>What It Is</strong></h4><p>Neural networks are made of matrices, a perfectly ordinary piece of math that many millions of people have learned about in college. Fundamentally, every neural network is a big stack of matrices stacked together in some way. Generally, how they are stacked is pretty simple.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><h4><strong>How It Is Trained</strong></h4><p>After we set them up, we train them with some kind of gradient descent. Gradient descent can be explained to anyone who has taken (and still remembers) calculus. Sometimes you don't actually need to know calculus to have a pretty good idea, but to do the math is just calculus.</p><h4><strong>What It Is Being Trained To Do</strong></h4><p>Your last real component is the objective. You are training the neural network (a big block of matrices) using gradient descent (a trick that modifies the big block of matrices) to do something.</p><p>What that something is can also be defined pretty simply, although there are a few versions.</p><p>You train an LLM or chatbot to predict the next token (roughly, a word or part of a word) correctly. You train an image generator to guess the image from the caption. These are training objectives. We have a few of them, depending on what you are trying to do, and we know what they are and can write them down. You can make them pretty complicated, but you always know what they are. If you didn't, you couldn't do it.</p><h4><strong>Still, Gets Pretty Gnarly</strong></h4><p>These things we absolutely, positively are guaranteed to understand. We know what they are. 
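</p><p>As a toy illustration of just how short the recipe can be, here is a sketch in PyTorch-flavored Python: a small stack of matrices, a next-token objective, and a gradient descent loop. The sizes and the data are placeholders, and a real LLM wires its matrices together in a fancier way (softmax attention, residual connections, and so on), but the three ingredients are the same.</p><pre><code>import torch
import torch.nn as nn

# "A big stack of matrices": a toy next-token predictor.
# Real LLMs differ mostly in scale and in how the matrices are wired together.
vocab, d_model = 1000, 64
model = nn.Sequential(
    nn.Embedding(vocab, d_model),
    nn.Linear(d_model, d_model),
    nn.ReLU(),
    nn.Linear(d_model, vocab),   # a score for every possible next token
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)   # gradient descent

tokens = torch.randint(0, vocab, (8, 33))           # stand-in for real text
inputs, targets = tokens[:, :-1], tokens[:, 1:]     # objective: predict the next token

for step in range(100):
    logits = model(inputs)                          # the model's guesses
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab), targets.reshape(-1)
    )                                               # how wrong were the guesses?
    opt.zero_grad()
    loss.backward()                                 # the calculus part
    opt.step()                                      # nudge every matrix downhill
</code></pre><p>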
They are the recipe for a neural network, and there's a ton of code for it available on the internet. You can just read it, if you're dedicated enough, and then you can know how to write down the code for the math for the thing, because you've already read it.</p><p>These things can become incredibly complex. People devote their lives and careers to studying and refining these techniques. They aren't easy, they aren't trivial, and finding new tricks for making them work better is worth a phenomenal amount of money. This doesn't make them completely mysterious: we know what the thing is, it's some code or math, and we wrote it down. If we didn't write it down, it wouldn't be happening.</p><h4><strong>Yes, LLMs Are Glorified Autocomplete</strong></h4><p>An LLM is, very literally, glorified autocomplete. Absolutely, one hundred percent autocomplete. There are fancier training objectives for an LLM, but nobody uses them, and, surprise surprise: those are just different ways of writing down autocomplete. Probably there is no useful way to bundle statistics about language that isn't autocomplete if you squint at it from the right angle.</p><p>An LLM is just a bundle of statistics about words. An image generator is just a bundle of statistics about images. These are, very literally, accurate descriptions of what an LLM or an image generator are. They are constructed statistically, and their objectives always concern the statistics of their input data.</p><p>This is still true when considering "post-training", where an LLM stops being autocomplete for any text and becomes a chatbot, which only autocompletes stuff a chatbot would say. There is some data demonstrating how a chatbot can not swear at users and answer questions correctly, and it is made to autocomplete that data until it generally doesn't swear at users and attempts to answer questions correctly.</p><h2><strong>What We Don't Understand</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xLvX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d72be9d-1446-4c31-a659-9ebf5f33c6b8_371x439.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xLvX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d72be9d-1446-4c31-a659-9ebf5f33c6b8_371x439.png 424w, https://substackcdn.com/image/fetch/$s_!xLvX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d72be9d-1446-4c31-a659-9ebf5f33c6b8_371x439.png 848w, https://substackcdn.com/image/fetch/$s_!xLvX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d72be9d-1446-4c31-a659-9ebf5f33c6b8_371x439.png 1272w, https://substackcdn.com/image/fetch/$s_!xLvX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d72be9d-1446-4c31-a659-9ebf5f33c6b8_371x439.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xLvX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d72be9d-1446-4c31-a659-9ebf5f33c6b8_371x439.png" width="371" height="439" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d72be9d-1446-4c31-a659-9ebf5f33c6b8_371x439.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:439,&quot;width&quot;:371,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:27908,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.verysane.ai/i/170933271?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d72be9d-1446-4c31-a659-9ebf5f33c6b8_371x439.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xLvX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d72be9d-1446-4c31-a659-9ebf5f33c6b8_371x439.png 424w, https://substackcdn.com/image/fetch/$s_!xLvX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d72be9d-1446-4c31-a659-9ebf5f33c6b8_371x439.png 848w, https://substackcdn.com/image/fetch/$s_!xLvX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d72be9d-1446-4c31-a659-9ebf5f33c6b8_371x439.png 1272w, https://substackcdn.com/image/fetch/$s_!xLvX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d72be9d-1446-4c31-a659-9ebf5f33c6b8_371x439.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">This happens to be basically accurate. <a href="https://xkcd.com/1838/">Randall Munroe, xkcd</a></figcaption></figure></div><p>We don't understand, in any deep way, the end result of training a neural network. Our end result is a very large bundle of complex statistics about the data. We know what the data is, but we've extracted as many statistics as we can from it in an automated way. 
We have no idea in advance what those statistics are, they are all connected to each other in incredibly complicated ways, and there's millions or billions of them.</p><p>When we say they're "just statistics", it is correct but misleading. Most statistics people think of are simple, one or two numbers. Lots and lots of connected statistics, all at once, are not a simple thing at all. The neural network is just statistics in the same way that it's just electricity. I know that rivers are just water, but this tells me almost nothing about any given river.</p><h4><strong>Why or How They Do Any Specific Thing</strong></h4><p><a href="https://claude.ai/share/46a6cc1d-c3dd-4d07-ad80-02409fdb654b">Here's a concrete example.</a> </p><p>I know that it wrote a poem in trochaic hexameter<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> because I asked. I know it knows what poems and bees are because poems, writing about bees, and poems written about bees are in its data. I think it's probably at least partly true that it mentioned clover because of the reasons it gives.</p><p>I have no idea how, exactly, it knows that or does that. I don't know what is inside the neural network that causes the poem, or the explanation of the poem. I know it's some matrices, and I know those matrices were created by training them in the normal way, but I don't know how exactly the matrices cause the poem or the explanation of the poem.</p><p>Most of the time, most of what a very large neural network does is surprising. There's no method of reading the training code and determining what poem it is going to write (or what it is going to do instead of complying) if you ask it to write you a poem. You discover this, and most things about the final network, by trial and error.</p><p>We understand the end result, as much as we do, by making educated guesses, following hunches, and, occasionally, digging into the network to try to reverse engineer exactly how and why it does a specific thing. It's quite difficult to do this, and most of the time, most of the things large neural networks do don't have a specific cause that we precisely know.</p><p>This is a basic problem of scale. If I ask an LLM to write a poem, and it does, every word of the poem depends on every word of my message, on the previous words of the poem, on a small random factor, and on the exact contents of the millions or billions of the numbers that make up the network.</p><p>It is not really possible to hold all of those things and how they relate to each other in your mind at once. I can use a computer to try to trace what causes each thing without crunching all the numbers myself, but a bulk summary is usually either uninformative or is still too much information to take in at once.</p><h4><strong>What Is Inside The Model After Training It</strong></h4><p>I understand gradient descent, which we use to train neural networks, at least moderately well. I can write down the equation. I do not understand, except in a very abstract way, most of the results of running gradient descent. Gradient descent functions as a search algorithm: given this problem, the computer finds a solution. I know what the problem is, and that we have found some kind of solution for it, but the solution is very complicated and I do not know or understand what that solution is, really.</p><p>The solution to "be autocomplete", if it's good enough, can write a poem that scans. 
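</p><p>For what it's worth, the part of that we do understand, the loop that actually produces the poem, is short enough to sketch. Everything mysterious is hidden inside the trained weights; the loop itself just asks for one more token, over and over. (This assumes a model like the toy one above; a real chatbot adds a tokenizer and a lot of plumbing around the same idea.)</p><pre><code>import torch

def generate(model, prompt_tokens, n_new, temperature=0.8):
    # Autocomplete, run in a loop: every new token depends on the whole prompt,
    # on every token generated so far, and on a small random draw.
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        inp = torch.tensor([tokens])                   # everything so far
        logits = model(inp)[0, -1]                     # scores for the next token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, 1).item()  # the small random factor
        tokens.append(next_tok)
    return tokens
</code></pre><p>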
I know that we made the LLM by having a neural network try to solve autocomplete, and I know that it is writing a poem because I asked it to, but I have very nearly no idea what precisely the training put inside its various matrices that enables it to do that.</p><p>It isn't even clear that there is any such thing as understanding these things sometimes. Neural networks tend to find the simplest solution they can to a problem. If the problem is complex, the solution can be, inherently, very complex. What does it mean to understand the solution to something that is, by its nature, so complex that you cannot possibly hold all the details in your mind at once? What is the best solution to "be autocomplete", and what would it mean to understand that solution?<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><h2><strong>Reverse Engineering</strong></h2><p>There is a middle ground here in the very limited set of things we currently do understand about how neural networks do the things that they do.</p><p>There is a subfield of AI called "mechanistic interpretability". You look inside of a neural network, and you try to find how, mechanically, it does the thing that it is doing. Sometimes this is enlightening. Anthropic seems to do more of it than anyone else; it doesn't seem like it's considered a high priority at most AI companies.</p><p>It's worth noting that normally, you reverse engineer things other people made because they don't tell you how they made them. We are reverse engineering things we have made. We know how we made it. We still have to reverse engineer it, because we do not know, in advance, any of what is going on inside of it.</p><p>There's an interesting point here. It is definitely true that we made the neural network. It certainly didn't make itself. But all of the details were carved into the neural network by gradient descent; no person did it. Gradient descent doesn't explain what it does. It's just some math, after all. So we have to reverse engineer the neural network if we want to know what is in it.</p><p>This is generally pretty difficult, and doesn't cover a whole lot of what is going on in there. We can choose a few examples from the literature to get the flavor of the sorts of things we currently understand about how neural networks do things, and what the limits of that are. These are not at all exhaustive, and there's a good chunk of interesting work detailing other mechanisms inside of LLMs, but they give a good idea of what this looks like.</p><h4><strong>Golden Gate Claude</strong></h4><p>The most entertaining one is Golden Gate Claude.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>Claude is Anthropic's large language model. They found a "feature", or something like a neuron, that only lit up when Claude was discussing the Golden Gate Bridge. They "clamped" this feature, which is more or less the same idea as applying electricity directly to the neuron. 
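</p><p>Mechanically, "clamping" is not much more exotic than a forward hook that overwrites one number in the model's internal state on every step. The layer, feature index, and value below are made up, and Anthropic's actual setup worked on features found by a sparse autoencoder rather than a raw activation, so treat this strictly as a schematic of the idea.</p><pre><code>import torch

FEATURE_INDEX = 1234   # hypothetical "Golden Gate" feature
CLAMP_VALUE = 10.0     # hold it at a large value on every token, whatever the input

def clamp_hook(module, inputs, output):
    # output: (batch, sequence, hidden) activations for this layer
    output[:, :, FEATURE_INDEX] = CLAMP_VALUE
    return output

# Hypothetical usage, for a PyTorch transformer whose blocks return plain tensors:
# handle = model.layers[20].register_forward_hook(clamp_hook)
# ...generate text as usual; the feature is now always "on"...
# handle.remove()
</code></pre><p>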
Hilarity ensues.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!u9co!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F634f915c-8ee0-43b8-8935-b6176154b381_1127x367.png" width="1127" height="367" alt=""></figure></div><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!BUPm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1899a374-4069-4d02-a8ab-125d5de0ffb2_828x1202.png" width="828" height="1202" alt=""></figure></div><p>You get the idea. Other than being, in my opinion, extremely funny, I don't think we really have to wonder if this feature does what they think it does. 
If you turn the "golden gate" part up to 11, it does, indeed, talk about the golden gate bridge regardless of whether it makes any sense to do so or not.</p><p>From a high level, the technique here is that you first try to separate the model's internal state into a bunch of on/off switches, which is not normally what it looks like. You then check which of those are "on" for which outputs, and you can label them things like "guilt" and "Golden Gate".</p><p>This is pretty difficult, and provides far from perfect understanding. So it isn't quite true that we don't understand anything about the inside of language models, but the understanding we have is very limited and it is very difficult to figure out more of it.</p><h4><strong>Arithmetic</strong></h4><p>Sometimes LLMs can do arithmetic correctly. This is kind of bizarre, because they're not really for math at all. They're for text. They will have learned math because often math is represented in text, like here: 73+37=110. To be good autocomplete for that sort of text, the LLM will either have to memorize every single possible addition (which is impossible) or work out some way of actually doing addition.</p><p>Our best understanding is that they appear to turn actual integers into positions on a circle or helix, and then add them by doing one rotation and then the second one.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> (This is a little bit less weird if you know that the internal representation of an LLM can be considered some kind of sphere, so really it does everything by turning it into a circle). Later work extended this to check exactly how it does the addition, and added some details to this picture, like that it does most of its calculation when it sees an "=" sign.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>Crucially, if asked how it got the answer, it tells you that it carried the digits in the normal way. When we can see what it is actually doing down in its thinking parts, however, we can see that that's not really what it does. LLMs are unreliable narrators, here as much as elsewhere.</p><h2><strong>Limits, Less Precise Approaches</strong></h2><p>There is vastly more that we cannot explain about what LLMs do and how they do them than there are things we can explain. We can explain, in a precise way, like you'd explain a can opener or a mousetrap, a good amount about a very small number of the things they do, but this is difficult. We only really understand the relatively few internal components we've reverse engineered in this way.</p><p>There's another sense or two in which you can try to understand things, although these are much less precise than understanding the actual parts and how they work.</p><p>You can try to reason about the training objective. If an LLM is trained on an argument between Alice and Bob, it needs to predict that Alice will continue to say Alice things and Bob will continue to say Bob things. In order to learn to be the best autocomplete possible, there has to be some ability to represent individual people and keep track of their traits. 
This would imply that a good LLM has to run something like a thin simulation of a person in it, and that it will sometimes be able to imitate certain person-like traits.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><p>If you assume this, or something like it, you can also try to understand them non-mechanically, the way you'd understand another person's psychology. And we do this sort of implicitly while prompting them: If you're polite they tend to work better, and if you're rude they tend to work less well. You can often work out why they are doing something wrong by noticing that they seem confused, asking them to explain what they think is happening, and then explaining the problem to them.</p><p>This approach works, sort of, sometimes, but this is a much less precise type of understanding. In many ways, treating LLMs as if they have a human psychology makes truly understanding how they work seem harder. We understand how to talk to other people, but we don't really understand what makes people tick. If we understood LLMs just as well as psychologists understand humans, we would still not understand them the way an engineer understands a refrigerator. It is just a much less reliable type of understanding.</p><h2><strong>Practical Concerns</strong></h2><p>You do not need to understand something to use it. We use plenty of things we don't fully understand: most people don't know how their car engine works in detail, but they can still drive. We don't understand exactly how many medications work at the molecular level. Most of the time, it doesn't matter whether you personally understand something or whether anyone does, because as long as it works it is useful, and if it doesn't, it isn't.</p><p>Where this is the greatest problem is for research purposes, where the lack of understanding of how models do things makes it much more difficult to verify that they will do what you want and not something else. It also makes it very difficult to figure out why they cannot do things, much of the time. At the edge of our understanding, it makes the field look more like an art than a science.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>We deliberately leave out how the matrices are stacked because it is clunky to explain, not because it is mathematically more complicated than matrices or calculus. These are, for an ordinary LLM, softmax attention, activation functions, up/down-projection, normalization and residual connections. They were very difficult to figure out in the first place, but are not complex to do once you know to do them.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This is apparently octameter. I am leaving the error as an Easter egg and testament to the fact that I did not bother to read the poem. Let no one say LLMs do not make egregious mistakes, etc., etc. 
Thank you to the reader who pointed this out.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>It seems that as the systems we make become more advanced we are likely to find that knowing why and how they do things will become more difficult, and in some cases will be impossible. If your system is as optimized as possible, with nothing wasted, the map and the territory are the same size; the only correct explanation for anything that it does is the thing in its entirety. When the thing is also quite large, it becomes, literally, impossible to understand, because there is no smaller or more understandable explanation for what the thing does or why than the thing itself.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p><a href="https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html">https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p><a href="https://arxiv.org/abs/2502.00873">https://arxiv.org/abs/2502.00873</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p><a href="https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-addition">https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-addition</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p><a href="https://generative.ink/posts/simulators/">https://generative.ink/posts/simulators/</a></p></div></div>]]></content:encoded></item><item><title><![CDATA[AGI: Probably Not 2027]]></title><description><![CDATA[AI 2027 is a web site that might be described as a paper, manifesto or thesis.]]></description><link>https://www.verysane.ai/p/agi-probably-not-2027</link><guid isPermaLink="false">https://www.verysane.ai/p/agi-probably-not-2027</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Tue, 12 Aug 2025 14:11:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1k8j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae22a1ff-03e8-40f1-898b-abde0d42ccd1_504x716.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://ai-2027.com/">AI 2027</a> is a web site that might be described as a paper, manifesto or thesis. It lays out a detailed timeline for AI development over the next five years. Crucially, per its title, it expects that there will be a major turning point sometime around 2027<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, when some LLM will become so good at coding that humans will no longer be required to code. 
This LLM will create the next LLM, and so on, forever, with humans soon losing all ability to meaningfully contribute to the process. They avoid calling this &#8220;the singularity&#8221;. Possibly they avoid the term because using it conveys to a lot of people that you shouldn&#8217;t be taken too seriously.</p><p>I think that pretty much every important detail of AI 2027 is wrong. My issue is that each of many different things has to happen the way they expect, and if any one thing happens differently, more slowly, or less impressively than their guess, later events become more and more fantastically unlikely. If the general prediction regarding the timeline ends up being correct, it seems like it will have been mostly by luck.</p><p>I also think there is a fundamental issue of credibility here.</p><p>Sometimes, you should separate the message from the messenger. Maybe the message is good, and you shouldn't let your personal hangups about the person delivering it get in the way. Even people with bad motivations are right sometimes. Good ideas should be taken seriously, regardless of their source.</p><p>Other times, who the messenger is and what motivates them is important for evaluating the message. This applies to outright scams, like emails from strangers telling you they're Nigerian princes, and to people who probably believe what they're saying, like anyone telling you that their favorite religious leader or musician is the greatest one ever. You can guess, pretty reasonably, that greed or zeal or something else makes it unlikely they are giving you good information.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>In this specific case, I think that the authors are probably well-intentioned. However, most of their shaky assumptions just happen to be things which would be worth at least a hundred billion dollars to OpenAI specifically if they were true. If you were writing a pitch to try to get funding for OpenAI or a similar company, you would have billions of reasons to be as persuasive as possible about these things. Given the power of that financial incentive, it's not surprising that people have come up with compelling stories that just happen to make good investor pitches. 
Well-intentioned people can be so immersed in them that they cannot see past them.</p><p>Because this is a much simpler objection than any of the technical points, I will try to detail why it seems both likely and discrediting before getting into the details.</p><h3><strong>Greed: AI 2027 Is OpenAI's Investor Pitch</strong></h3><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1k8j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae22a1ff-03e8-40f1-898b-abde0d42ccd1_504x716.png"><img src="https://substackcdn.com/image/fetch/$s_!1k8j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae22a1ff-03e8-40f1-898b-abde0d42ccd1_504x716.png" width="504" height="716" alt="A three-panel comic titled &quot;The Law of Futurology: y-t = 0&quot; where y equals the approximate number of years left in a futurist's life, and t equals years the futurist thinks it will be until immortality is discovered. Panel 1 shows a young man saying &quot;Things are gonna change in 40-70 years.&quot; Panel 2 shows a middle-aged man saying &quot;Things are gonna change in 30-40 years.&quot; Panel 3 shows an elderly man saying &quot;Things are gonna change by 3pm tomorrow.&quot;"></a><figcaption class="image-caption"><a href="https://www.smbc-comics.com/index.php?db=comics&amp;id=1968">Zach Weinersmith, SMBC, 2010</a></figcaption></figure></div><p>If AI 2027 is not roughly true, OpenAI will probably die.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>Simple math: OpenAI is currently in a funding round, and is trying to raise a total of forty billion dollars.
In 2024, OpenAI made $3.7 billion in revenue and spent about nine billion, for a net loss of about five billion dollars.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> They are currently projected to have a net loss of eight billion through 2025.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> Forty billion dollars of new funding divided by a yearly loss of roughly eight billion works out to at most five years of runway. To put it another way, if they do not alter their trajectory or raise more money, they will be dead within five years, so by 2030 at the latest.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>This is a crude estimate, but I do not think that making it less crude really improves the picture. OpenAI had perhaps ten billion dollars of cash on hand beforehand, which buys them an extra year and change. On the downside, they also owe Microsoft 20% of their forward revenue and have very large commitments to spend on data centers with partners like Oracle. These commitments are difficult to translate into time, but they seem to make the runway shorter. All told, "by default, OpenAI dies in under five years" seems roughly correct.</p><p>OpenAI has reliably doubled down, raised more funding, and mostly ignored questions of profitability while growing. This is an all-in bet that at some future point the services they offer will be extremely profitable. In my humble opinion, it seems very nearly impossible for that bet to pay off on the strength of their current products: LLMs are a fiercely competitive business, with significant pressure from at least two major competitors to offer a better service at a similar price, so they cannot really raise prices unless they have something much better than what competitors offer.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> They cannot really slash research or data center spending or they will fall behind.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><p>There is one way that doubling down over and over again like this is a good idea, and it isn't selling more ChatGPT subscriptions: it is a good idea only if you can be sure that the fruits of future research and development will generate exponentially more profit than any of the products they currently sell. If that is not true, they are probably doomed, given how much they have committed to spend and what they owe to whom.</p><p>AI 2027, "coincidentally", validates exactly this scenario. If I worked at OpenAI and I was trying to convince a group of investors to give me forty billion dollars, and I was positive they'd believe anything I said, I would just read AI 2027 out loud to them.</p><p>AI 2027 features a lightly fictionalized version of OpenAI, which it calls "OpenBrain" and mentions over a hundred times. Inasmuch as "OpenBrain" has any competition, the only one that is mentioned is "DeepCent", clearly a reference to DeepSeek, which is mentioned only to assure the reader at every step that they are vastly inferior to "OpenBrain" and cannot possibly compete with them.
&#8220;OpenBrain&#8221; experiences just enough appearance of adversity, from &#8220;DeepCent&#8221; and other sources, to make it seem like victory, although clearly inevitable, is still sort of heroic and impressive.</p><p>If you have been around funding hype for companies, this is clearly a funding pitch. It is a masterful example of the genre against which all lesser funding pitches should be measured. It blends elements of science fiction, techno-thriller and fan fiction<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> while constantly hammering home the assurance that the company will be victorious over its enemies and reap untold riches. We are assured that they will perform a miracle and become extremely profitable right before they are projected to go bankrupt. According to the panel on the side, "OpenBrain" is going to be worth eight trillion dollars and see 191 billion dollars a year in revenue in October 2027. After that, the numbers become somewhat more fantastic.</p><p>Everyone in OpenAI who has been involved in creating this narrative is a master of the craft. They are so good at it that people who are culturally adjacent to the company seem not to recognize that it is, very clearly, a funding pitch that has taken on a life of its own.</p><p>But: it's just a funding pitch. There's very little reason to believe anything in a funding pitch is true, and billions of reasons to think that it is bullshit.</p><p>It is worth noting that the lead author of AI 2027 is a former OpenAI employee. He is mostly famous outside OpenAI for having refused to sign their non-disparagement agreement and for advocating for stricter oversight of AI businesses. I do not think it is very credible that he is deliberately shilling for OpenAI here. I do think it is likely that he is completely unable to see outside their narrative, which they have an intense financial interest in sustaining.</p><p>The authors say they have consulted over a hundred people, including &#8220;dozens of experts in each of AI governance and AI technical work&#8221;, in researching this report. I would be willing to bet that OpenAI is the most represented single institution among the experts consulted. This is a somewhat educated guess, based both on who the authors are and what they have written, and it seems like a pretty safe bet.</p><p>Fundamentally, this is a report headed by a former OpenAI employee who has founded a think tank to work on AI safety. He is leveraging his familiarity with OpenAI specifically, both as professional experience and, most likely, as an extensive source of expert opinions. It is likely that he still owns substantial OpenAI stock or options, and if his think tank is going to do contract work on AI safety, it will probably be for, with, or concerning OpenAI. Inasmuch as this report reaches any conclusion that doesn&#8217;t seem favorable to OpenAI, it&#8217;s that outside experts and governance, of the kind that think tanks might help provide, are necessary and important.</p><p>It&#8217;s difficult not to suspect motivated reasoning.</p><h3><strong>Zeal: AI 2027 As Religious Dissent</strong></h3><p>Greed is a good reason to doubt any of this is true. So is zeal.</p><p>People focused on AI, as a group, have many of the characteristics of a religious movement. Previously this was a relatively obscure fact.
There were a significant number of people on the internet and in San Francisco who were intensely concerned that AI might bring about some kind of apocalypse, but this was not widely known.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> Occasionally it might make the news when someone involved went off the deep end, or when someone prominent (Stephen Hawking, Elon Musk) said something about AI being dangerous, but in general the notion that there was an entire subculture about AI was on very few people's radar.</p><p>OpenAI is most easily understood as an offshoot of this movement. OpenAI was distinct because it was good at recruiting actual research personnel and extremely good at raising money. This originally took the form of getting a number of people from PayPal's early days, like Elon Musk, Peter Thiel, and Reid Hoffman, involved to fund OpenAI. OpenAI was presented as a counterbalance to Google's research division, and necessary to ensure that AI was created safely. It seems unlikely that any of that would have been possible if there hadn't been a significant movement that was already focused on these concerns, but it took OpenAI's founders to create a serious, well-funded research endeavor out of the wider interest in the subject.</p><p>Before OpenAI started making a lot of money, it was widely understood that "safe" meant something like "unlikely to kill people, because sufficiently advanced AI is dangerous the way nuclear or biological weapons are dangerous". Generally this emphasized the difficulty of being sure you can control an AI as it becomes more capable. OpenAI specifically has mostly redefined "safe" into "making sure OpenAI's AI is polite enough to sell to other people, is more advanced than anyone else's, and also is making more money than anyone else&#8217;s is". This makes sense if you assume that OpenAI, as an organization, is more trustworthy and safety-conscious than any other actual or possible organization for doing AI research.</p><p>For anyone who doesn't believe that OpenAI is more trustworthy than any possible alternative, OpenAI's present-day vision of AI looks like a bizarre schism that has somehow made profit for OpenAI its primary, if not only, principle. OpenAI claims to pursue safe AI and AI that benefits humanity, but this turns out, over time, to always be what gives OpenAI the most freedom to raise and make money. In religious terms, OpenAI is a sect that fused zeal for the singularity to an unabashed embrace of capitalism, and when the two conflicted, chose capitalism. Possibly the most serious place where these two things conflict is on what is meant by &#8220;safety&#8221;.</p><p>AI 2027 ends with a cautionary tale that has two endings. In one of them, AI progress goes too far, too fast, and pretty much everyone dies. In the other, AI is somewhat more constrained, and at least not everyone dies.</p><p>I understand the core of this story, which also seems to be OpenAI's funding pitch, as some version of OpenAI's creed. In that context, the cautionary tale at the ending reads like any dissent on questions of religious doctrine: perhaps we have become too greedy and too eager, and forgotten our principles, and this will all end in disaster.</p><p>None of that means it's wrong, necessarily. People can be correct for the wrong reasons, or from strange places.
It does seem to explain how someone who has gone to great lengths to defend his right to disparage OpenAI would end up writing out a variation of their investor pitch. When someone has recently departed a religion, the beliefs they still hold tend to be the same ones they had before they left, and their complaints are modifications to the dogma, not complete rejections of it.</p><h3><strong>The Details</strong></h3><p>I am going to try to chronicle everything that seems conspicuously wrong, bizarre, or indicative of pro-OpenAI slant. I am going to do my best to skim or ignore anything that is strictly fiction, which is, by word count, most of it. Quoting and commenting on the parts that are at least in some way about the real world is still a lengthy exercise.</p><p>I am grateful to the authors for encouraging debate and counterargument to their scenario. Quotes are in the same order as they are in the text.</p><h4><strong>Mid 2025: Stumbling Agents</strong></h4><blockquote><p>The AIs of 2024 could follow specific instructions: they could turn bullet points into emails, and simple requests into working code. In 2025, AIs function more like employees. Coding AIs increasingly look like autonomous agents rather than mere assistants: taking instructions via Slack or Teams and making substantial code changes on their own, sometimes saving hours or even days. Research agents spend half an hour scouring the Internet to answer your question.</p></blockquote><p>"AIs function more like employees" is doing a lot of work here. No AI we currently have functions very much like an employee, except for the very simplest tasks (e.g., "label this"). They require far more supervision, and are far more unreliable, than any employee ever would be. This gulf in how much autonomy they can be trusted with is so vast that making the comparison is pure speculation.</p><p>The authors completely fail to acknowledge here that this is a serious problem, or that it would be an immense achievement to overcome it. That LLMs are incredibly useful in some situations is true; to say they 'function like employees' is, at best, optimistic. Under ideal circumstances, paired with an actual human, they occupy something like the role of an employee. This sometimes saves a good amount of labor. It doesn't directly replace a human.</p><p>This is the general pattern of why these predictions seem implausible. They are describing something that already exists, but as if it were much better than it is, or are assuming that making it much better than it is will be relatively easy and happen on a relatively short timeline. These are, at best, educated guesses. You can only really know how difficult it is to improve the technology after you&#8217;ve already done it.</p><p>The other pattern is that the paper says things that strongly imply that AI is just as good as a human, but more and more overtly each time. &#8220;AIs function more like employees&#8221; is the first of these. Taken literally, this could mean that AI is already very nearly as good as a human. This would be quite an achievement. Of course, if you think about it enough, you will know that it&#8217;s not true, but it sort of resembles something true if you squint at it hard enough. It&#8217;s a good sleight of hand.</p><p>Granted, it has been four months since this was written, so perhaps I have the benefit of hindsight. &#8220;Mid 2025&#8221; has come and gone.
If that&#8217;s the case, though, we can say that it&#8217;s the first real prediction and that it has already failed to happen.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a></p><h4><strong>Late 2025: The World&#8217;s Most Expensive AI</strong></h4><blockquote><p>(To avoid singling out any one existing company, we&#8217;re going to describe a fictional artificial general intelligence company, which we&#8217;ll call OpenBrain. We imagine the others to be 3&#8211;9 months behind OpenBrain.)</p></blockquote><p>This happens to be what you would need to be true to justify investing in "OpenBrain" over competitors, of course. Crucially, a 3-9 month lead is almost impossible to prove or disprove, so it&#8217;s not that hard to convince someone that you&#8217;re that far ahead. As noted previously, I do not have a very positive opinion of pretending that everything said about "OpenBrain" is not actually about OpenAI.</p><blockquote><p>Although models are improving on a wide range of skills, one stands out: OpenBrain focuses on AIs that can speed up AI research. They want to win the twin arms races against China (whose leading company we&#8217;ll call &#8220;DeepCent&#8221;) and their U.S. competitors. The more of their research and development (R&amp;D) cycle they can automate, the faster they can go. So when OpenBrain finishes training Agent-1, a new model under internal development, it&#8217;s good at many things but great at helping with AI research.</p></blockquote><p>Everyone has been training LLMs substantially on code since 2023. Every major organization uses LLMs as code assistants. Presenting this as an innovation is bizarre.</p><p>We are asked to assume here that whatever OpenAI is currently cooking in their research division is extremely good for writing code for AI research, so much so that it remarkably accelerates their research schedule. Again, I perhaps have the benefit of hindsight. I am writing in August 2025, a few days after the GPT-5 release. It appears to be slightly better than the previous OpenAI product in some ways, and worse in others. I do not think it is likely that OpenAI is going to significantly accelerate its research compared to competitors due to how good their LLM is at doing AI research tasks.</p><p>Also, here, we begin with the narrative that all of this is an arms race between "OpenBrain" and its Chinese competitor, "DeepCent". Competition with China was previously a focus in a well-received position paper called Situational Awareness.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> I am told it plays extremely well with people in Washington, DC. As it happens, convincing political people to give you money and to refrain from regulating you can be a very important part of a business plan. I cannot otherwise explain why so many people in AI are suddenly extremely interested in this specific arms race story. In my opinion, competition between American and Chinese companies is not meaningfully different or more interesting than competition between US-based companies.</p><p>Something similar is happening currently. Various AI companies are now bending over backwards to focus concern on the political slant of their LLMs because the current administration is making a big deal about how conservative or liberal they are.
This is one of the less interesting and important properties of LLMs, but it is possibly very profitable to focus on it if you can get a large government contract out of it. Most likely, this would result in selling a much dumber LLM to the government. Effort is zero-sum: you can have a smarter one or you can have one that flatters your opinions, but you generally can&#8217;t have both.</p><blockquote><p>The same training environments that teach Agent-1 to autonomously code and web-browse also make it a good hacker. Moreover, it could offer substantial help to terrorists designing bioweapons, thanks to its PhD-level knowledge of every field and ability to browse the web.</p></blockquote><p>"Autonomously" packs in a lot of assumptions; really, the same ones that led them to say that AIs were &#8220;like employees&#8221; earlier. They again imply, but vaguely, that perhaps the AI is about as good as a human. Whether extensions of current systems can meaningfully function autonomously is an open question, and if they are wrong about it, they appear likely to be wrong about the rest of their predictions also.</p><p>Saying an LLM might be a "substantial help to terrorists designing bioweapons" is incredibly vague. Google search would also be of substantial help in designing bioweapons, because you can google any topic in chemistry or biology. You can also find these things in a library. One suspects that focusing on the possible creation of weapons of mass destruction is also useful for attracting attention and possibly money from the government. There is no evidence that LLMs are, actually, very useful for this or are likely to be soon.</p><blockquote><p>OpenBrain has a model specification (or &#8220;Spec&#8221;), a written document describing the goals, rules, principles, etc. that are supposed to guide the model&#8217;s behavior. Agent-1&#8217;s Spec combines a few vague goals (like &#8220;assist the user&#8221; and &#8220;don&#8217;t break the law&#8221;) with a long list of more specific dos and don&#8217;ts (&#8220;don&#8217;t say this particular word,&#8221; &#8220;here&#8217;s how to handle this particular situation&#8221;). Using techniques that utilize AIs to train other AIs, the model memorizes the Spec and learns to reason carefully about its maxims. By the end of this training, the AI will hopefully be helpful (obey instructions), harmless (refuse to help with scams, bomb-making, and other dangerous activities) and honest (resist the temptation to get better ratings from gullible humans by hallucinating citations or faking task completion).</p></blockquote><p>This is a description of some variation on Constitutional AI, which was published by Anthropic in 2022.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a> It is bizarre to give it a new name and attribute it entirely to OpenAI. It does not seem to meaningfully clarify anything at all about what is likely to happen in the future. We also have some general descriptions of neural networks and how LLMs are trained. These seem out of place, but do at least avoid describing things published in the past by people who are not OpenAI and attributing them to OpenAI in the future.</p><p>It is notable how thoroughly OpenAI&#8217;s American competitors are erased. The focus is exclusively on a rivalry with a Chinese company.
American companies competing with OpenAI are competing directly with them for American investor and government money, for employees, and for attention. It is probably safer not to mention Anthropic or Google DeepMind at all, because they are very similar to OpenAI and over time have shared many of their employees with OpenAI.</p><blockquote><p>Instead, researchers try to identify cases where the models seem to deviate from the Spec. Agent-1 is often sycophantic (i.e. it tells researchers what they want to hear instead of trying to tell them the truth). In a few rigged demos, it even lies in more serious ways, like hiding evidence that it failed on a task, in order to get better ratings. However, in real deployment settings, there are no longer any incidents so extreme as in 2023&#8211;2024 (e.g. Gemini telling a user to die and Bing Sydney being Bing Sydney.)</p></blockquote><p>I certainly have the benefit of hindsight here. They wrote this before Grok, Elon Musk's LLM, started telling people it was MechaHitler.</p><h4><strong>Early 2026: Coding Automation</strong></h4><blockquote><p>OpenBrain continues to deploy the iteratively improving Agent-1 internally for AI R&amp;D. Overall, they are making algorithmic progress 50% faster than they would without AI assistants&#8212;and more importantly, faster than their competitors.</p><p>[This next definition is in a folded part that you have to click to see]</p><p>Improved algorithms: Better training methods are used to translate compute into performance. This produces more capable AIs without a corresponding increase in cost, or the same capabilities with decreased costs. This includes being able to achieve qualitatively and quantitatively new results. &#8220;Paradigm shifts&#8221; such as the switch from game-playing RL agents to large language models count as examples of algorithmic progress.</p></blockquote><p>I will bet any amount of money to anyone that there is no empirical measurement by which OpenAI specifically will make "algorithmic progress" 50% faster than their competitors specifically because their coding assistants are just that good in early 2026.</p><p>It seems unlikely that OpenAI will end up moving 50% faster on research than their competitors due to their coding assistants for a few reasons.</p><p>First, competitors' coding models are quite good, actually, and it is unlikely that OpenAI's will be significantly better than theirs in the foreseeable future. OpenAI&#8217;s models are very good, and what is or is not better is difficult to quantify, but it still seems certain that they are not so much better that you will get 50% more done.</p><p>Second, research is open-ended by nature. Coding assistants currently primarily solve well-defined tasks. Defining the task is the hard part, so that&#8217;s very little help at all here. The ability to actually write out code, the only part of the job LLMs can currently do very well, is not a major bottleneck for research progress most of the time. There are already plenty of very good engineers to write code for AI research, especially at larger companies like OpenAI.</p><p>"Algorithmic progress" gets a lot of focus, both here in the main piece and in a supplement. It seems to be a sort of compulsive reductionism, where all factors in progress must be reduced to single quantities that you can plot on a curve. This, of course, makes predictions for the future seem much more meaningful. 
Even the concept of a "paradigm shift", a description of a complete discontinuity, is forced to be a part of a smooth curve that you can just keep drawing to predict the future.</p><p>This trick, of just drawing a curve of progress and following it, has worked reasonably well for predicting how much faster computers would get with time. There is some evidence that it is roughly true for some kinds of progress in AI. There is no reason to think that it is always true for every kind of progress you could make in AI.</p><blockquote><p>People naturally try to compare Agent-1 to humans, but it has a very different skill profile. It knows more facts than any human, knows practically every programming language, and can solve well-specified coding problems extremely quickly. On the other hand, Agent-1 is bad at even simple long-horizon tasks, like beating video games it hasn&#8217;t played before. Still, the common workday is eight hours, and a day&#8217;s work can usually be separated into smaller chunks; you could think of Agent-1 as a scatterbrained employee who thrives under careful management. Savvy people find ways to automate routine parts of their jobs.</p></blockquote><p>You can really tell that this was written by more than one person, because this directly contradicts the earlier part about how AI was more like an employee a full year earlier. This does, in fact, accurately describe using AI coding tools in April, 2025 when this was written. It&#8217;s a very positive description, but it&#8217;s quite accurate. It still accurately describes how things are now in August. It is funny to call it a prediction for next year, though. It leaves out how badly AI coding assistants fail in some situations that are not especially "long-horizon". It correctly notes, as earlier parts of this piece did not, that you need to supervise them extremely closely.</p><blockquote><p>OpenBrain&#8217;s executives turn consideration to an implication of automating AI R&amp;D: security has become more important. In early 2025, the worst-case scenario was leaked algorithmic secrets; now, if China steals Agent-1&#8217;s weights, they could increase their research speed by nearly 50%. OpenBrain&#8217;s security level is typical of a fast-growing ~3,000 person tech company, secure only against low-priority attacks from capable cyber groups (RAND&#8217;s SL2). They are working hard to protect their weights and secrets from insider threats and top cybercrime syndicates (SL3), but defense against nation states (SL4&amp;5) is barely on the horizon.</p></blockquote><p>This assumes the previously mentioned 50% research speed gain from better LLMs, assumes that competitors are far behind OpenAI, and makes a point of spotlighting Chinese competition and citing the RAND corporation, which I assume plays well with political people who write regulations and award contracts. None of those things seem plausible. It is probably true that if security is lax people will steal your LLM, because that is true of any data that is worth money. That fact, true at every company that handles important data, isn&#8217;t generally presented with so much drama.</p><h4><strong>Mid 2026: China Wakes Up</strong></h4><p>This entire section veers thoroughly into geopolitical thriller territory and continues the pattern of appealing to the US Government's general fear of China. In the real world and the present, the government of China does not seem extremely worried about AI in general. 
We are asked here to fantasize that their government will care a lot about it in the future. This justifies considering OpenAI to be in an arms race with its Chinese competitors, hearkening back to the deep memories of the Cold War.</p><p>It is perhaps embarrassing to be racing with someone who does not think they are racing with you at all.</p><blockquote><p>A Centralized Development Zone (CDZ) is created at the Tianwan Power Plant (the largest nuclear power plant in the world) to house a new mega-datacenter for DeepCent, along with highly secure living and office spaces to which researchers will eventually relocate. Almost 50% of China&#8217;s AI-relevant compute is now working for the DeepCent-led collective, and over 80% of new chips are directed to the CDZ. At this point, the CDZ has the power capacity in place for what would be the largest centralized cluster in the world. Other Party members discuss extreme measures to neutralize the West&#8217;s chip advantage. A blockade of Taiwan? A full invasion?</p></blockquote><p>It must be really strange to live in Taiwan and have to read Americans fantasizing about China maybe invading your country because American AI companies are just too good.</p><blockquote><p>But China is falling behind on AI algorithms due to their weaker models. The Chinese intelligence agencies&#8212;among the best in the world&#8212;double down on their plans to steal OpenBrain&#8217;s weights. This is a much more complex operation than their constant low-level poaching of algorithmic secrets; the weights are a multi-terabyte file stored on a highly secure server (OpenBrain has improved security to RAND&#8217;s SL3). Their cyberforce think they can pull it off with help from their spies, but perhaps only once; OpenBrain will detect the theft, increase security, and they may not get another chance. So (CCP leadership wonder) should they act now and steal Agent-1? Or hold out for a more advanced model? If they wait, do they risk OpenBrain upgrading security beyond their ability to penetrate?</p></blockquote><p>This is also a pure fantasy.</p><h4><strong>Late 2026: AI Takes Some Jobs</strong></h4><p>Finally, a section heading I mostly agree with. AI is, probably, going to take some jobs. It has taken some jobs already, like translators. This seems well-grounded, perhaps we can get some real analysis here.</p><blockquote><p>Just as others seemed to be catching up, OpenBrain blows the competition out of the water again by releasing Agent-1-mini&#8212;a model 10x cheaper than Agent-1 and more easily fine-tuned for different applications. The mainstream narrative around AI has changed from &#8220;maybe the hype will blow over&#8221; to &#8220;guess this is the next big thing,&#8221; but people disagree about how big. Bigger than social media? Bigger than smartphones? Bigger than fire?</p></blockquote><p>Expecting real analysis was optimistic. "Somehow, OpenAI is ten times cheaper and much better than everyone else." It could happen. It could also not happen. There is no specific reason for believing any release will be ten times cheaper and better than what came before it in late 2026, but it's hypothetically possible. It would certainly be very profitable for them if it did happen, so I can understand why you would put this on an investor pitch. Instead of just saying it&#8217;s &#8220;better&#8221;, they say it&#8217;s &#8220;more easily fine-tuned for different applications&#8221;. 
This is just a complicated way of being better, and it sounds more plausible than &#8220;10x cheaper, and also better&#8221;.</p><p>They go on to speculate that this will hurt the job market for junior software engineers and generate a lot of hype. This was an easy &#8220;prediction&#8221; because the job market was already getting bad for junior software engineers this April<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a>, and there was already a lot of hype that sounded like this.</p><p>I will note that this pattern continues: First, you state things that happened in the past as if they are happening in the future. You attribute these things to OpenAI, sorry, I mean "OpenBrain". This pretty well guarantees that anyone reading your "predictions" who doesn't already know about those things will feel like they are meaningful predictions. Perhaps they will even feel like you got them right, later. They alternate between this and making essentially baseless predictions that OpenAI specifically will create really good products that are extremely amazing and that do not exist yet.</p><blockquote><p>The Department of Defense (DOD) quietly begins contracting OpenBrain directly for cyber, data analysis, and R&amp;D, but integration is slow due to the bureaucracy and DOD procurement process.</p></blockquote><p>This had also already happened in April 2025.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a></p><h4><strong>January 2027: Agent-2 Never Finishes Learning</strong></h4><blockquote><p>Over the course of 2027, the AIs improve from being able to mostly do the job of an OpenBrain research engineer to eclipsing all humans at all tasks. This represents roughly our median guess, but we think it&#8217;s plausible that this happens up to ~5x slower or faster.</p></blockquote><p>This is actually in a drop-down right before this section, about how they are less certain about things in and after 2027 than beforehand. One can see why this would be. So this quote is really meant to be a prelude to what follows in the next few sections, as we cover all of 2027.</p><p>If, of course, before 2027 OpenAI and only OpenAI has LLMs that can meaningfully function on their own, are ten times cheaper than they are now (or were previously, perhaps?), and that can mostly do the job of an OpenAI research engineer, it is entirely possible that through 2027 they will eclipse all humans at all tasks. This is, however, a completely wild guess, as were all the assumptions leading us here.</p><blockquote><p>With Agent-1&#8217;s help, OpenBrain is now post-training Agent-2. More than ever, the focus is on high-quality data. Copious amounts of synthetic data are produced, evaluated, and filtered for quality before being fed to Agent-2. On top of this, they pay billions of dollars for human laborers to record themselves solving long-horizon tasks. On top of all that, they train Agent-2 almost continuously using reinforcement learning on an ever-expanding suite of diverse difficult tasks: lots of video games, lots of coding challenges, lots of research tasks. Agent-2, more so than previous models, is effectively &#8220;online learning,&#8221; in that it&#8217;s built to never really finish training. 
Every day, the weights get updated to the latest version, trained on more data generated by the previous version the previous day.</p></blockquote><p>This is a strange combination. On the one hand, this describes, more or less, things that AI labs were already doing in April 2025. They are perhaps spending more money on it in fictional January 2027 than they are now, but otherwise it's the same stuff, just described as if it is entirely new.</p><p>I have to wonder who the target audience for this is. I assume it's people who do not know what is already happening. If it is, you can definitely describe the same thing that is already happening, but with a higher budget, and it sounds like a bold prediction. Of these things, only updating the weights of the model every day would be new. Not a new idea, because it has been said many times in public that it would be desirable. The new part, in this story, is that it works now.</p><blockquote><p>Agent-1 had been optimized for AI R&amp;D tasks, hoping to initiate an intelligence explosion. OpenBrain doubles down on this strategy with Agent-2. It is qualitatively almost as good as the top human experts at research engineering (designing and implementing experiments), and as good as the 25th percentile OpenBrain scientist at &#8220;research taste&#8221; (deciding what to study next, what experiments to run, or having inklings of potential new paradigms).</p></blockquote><p>I note that they do use the term "intelligence explosion", which is more or less a synonym for the more widely used "singularity". I continue to find avoiding the term "singularity" strange, since it is much more widely known.</p><p>I think it is possible that an LLM in early 2027 will be almost as good as the top human experts at research engineering. I do not think you can predict whether or not this is true based on any information we have now. In particular, I do not think you can predict what it would take to allow an LLM to actually operate without hand-holding for a prolonged period. This is an unsolved problem, and you cannot meaningfully say that the LLM is as good as the human at something if it requires constant, close supervision when a human would not. Maybe someone will figure out such a thing by early 2027; maybe not; I do not think the authors have any knowledge of this that I don't, which means they are making hopeful guesses.</p><p>I also think that it is unlikely that an LLM in early 2027 will have particularly good research taste. We see here again the seemingly compulsive reductionism: it is very hard to say what "research taste" even is, or what it means to have extremely good research taste. It is, well, a taste: often people can agree on who has it or who does not have it, but it resists quantification. Here, however, in the name of making the future seem predictable, we are nicely informed that research taste has percentiles. Much like height or IQ, you can be given a percentile, and the AI of January 2027 will probably be at the 25th percentile or so.</p><p>If you assign numbers to everything, you can say that the line is going up. If you don&#8217;t assign numbers to things, you can&#8217;t say the line is going up. Therefore, you must assign numbers to everything, even if it does not make any sense to do so.</p><blockquote><p>Given the &#8220;dangers&#8221; of the new model, OpenBrain &#8220;responsibly&#8221; elects not to release it publicly yet (in fact, they want to focus on internal AI R&amp;D). 
Knowledge of Agent-2&#8217;s full capabilities is limited to an elite silo containing the immediate team, OpenBrain leadership and security, a few dozen U.S. government officials, and the legions of CCP spies who have infiltrated OpenBrain for years.</p></blockquote><p>It is good to know that OpenAI is so responsible, and that they are aligned with the US Government because they are such a good and patriotic company. I wish them the best of luck with their hypothetical spy problem, which is explained in some detail in a footnote. I think it is a very exciting story, and I do not see any way in which it intersects with reality.</p><h4><strong>February 2027: China Steals Agent-2</strong></h4><p>This section is mostly cyber-espionage fiction that is not worth discussing in detail. It concludes with this:</p><blockquote><p>In retaliation for the theft, the President authorizes cyberattacks to sabotage DeepCent. But by now China has 40% of its AI-relevant compute in the CDZ, where they have aggressively hardened security by airgapping (closing external connections) and siloing internally. The operations fail to do serious, immediate damage. Tensions heighten, both sides signal seriousness by repositioning military assets around Taiwan, and DeepCent scrambles to get Agent-2 running efficiently to start boosting their AI research.</p></blockquote><h4><strong>March 2027: Algorithmic Breakthroughs</strong></h4><blockquote><p>With the help of thousands of Agent-2 automated researchers, OpenBrain is making major algorithmic advances. One such breakthrough is augmenting the AI&#8217;s text-based scratchpad (chain of thought) with a higher-bandwidth thought process (neuralese recurrence and memory). Another is a more scalable and efficient way to learn from the results of high-effort task solutions (iterated distillation and amplification).</p></blockquote><p>This is just describing current or past research. For example, augmenting a transformer with memory is done <a href="https://arxiv.org/abs/2006.11527">here</a>, recurrence is done <a href="https://arxiv.org/abs/2203.07852">here</a> and <a href="https://arxiv.org/abs/2307.08621">here</a>. These papers are not remotely exhaustive; I have a folder of bookmarks for attempts to add memory to transformers, and there are a lot of separate projects working on more <a href="https://www.rwkv.com/">recurrent LLM designs</a>. This amounts to saying "what if OpenAI tries to do one of the things that has been done before, but this time it works extremely well". Maybe it will. But there's no good reason to think it will.</p><blockquote><p>[This passage is some time later, and very loosely references the previous quote] If this doesn&#8217;t happen, other things may still have happened that end up functionally similar for our story. For example, perhaps models will be trained to think in artificial languages that are more efficient than natural language but difficult for humans to interpret. Or perhaps it will become standard practice to train the English chains of thought to look nice, such that AIs become adept at subtly communicating with each other in messages that look benign to monitors.</p></blockquote><p>This also describes things that had already happened. <a href="https://arxiv.org/abs/2501.12948">Deepseek's R1 paper </a>specifically mentions that the model devolves into a sort of weird pidgin when "thinking" if you do not force it to use English. 
They also mention that they are training the model to output in English in the chain of thought, and that this makes the model slightly worse on benchmarks (that is, dumber). Neural networks hiding messages to themselves or each other is <a href="https://arxiv.org/abs/1712.02950">documented at least as early as 2017</a>. I do not think it counts as a novel prediction if you predict that two things that have already happened in the past might happen at the same time in the future.</p><p>Similar comments apply to their breakdown of "iterated distillation and amplification": they are describing a thing that is already being done, and simply saying it will be done much better than it was previously, and that the results will be very good. There is a persistent sense that they are trying to impress people who are not looped in on the technical side by describing something that already exists, and then describing it as having marvelous results in the future without mentioning that it has not had these particular marvelous results yet in the present.</p><blockquote><p>Aided by the new capabilities breakthroughs, Agent-3 is a fast and cheap superhuman coder. OpenBrain runs 200,000 Agent-3 copies in parallel, creating a workforce equivalent to 50,000 copies of the best human coder sped up by 30x. OpenBrain still keeps its human engineers on staff, because they have complementary skills needed to manage the teams of Agent-3 copies. For example, research taste has proven difficult to train due to longer feedback loops and less data availability. This massive superhuman labor force speeds up OpenBrain&#8217;s overall rate of algorithmic progress by &#8220;only&#8221; 4x due to bottlenecks and diminishing returns to coding labor.</p></blockquote><p>If you think that every single thing predicted about "OpenBrain" until now is likely, then this is a perfectly likely result. They have LLMs that behave mostly autonomously, that have pretty good research taste, that are much better than humans at many things, that are extremely cheap, and that benefit from a bunch of research that has been done in the past being done again but working much better this time.</p><p>Once you get this far, further prediction is actually a pretty bad bet. Neither they nor I have any idea what happens after someone has anything remotely this impressive. Fifty thousand of the best human coder on Earth running at 30x speed, so really, 1.5 million of the best human coder on Earth, could do all sorts of things and nobody on Earth can predict what happens if they're all in the same "place" and working on the same thing. Saying that this "only" accelerates progress by 4x seems sort of deranged. It's like telling me that I'm going to ride a unicorn on a rainbow but it's only going to be four times faster than walking.</p><p>Avoiding the term &#8220;singularity&#8221; seems like it really hurts their reasoning. There&#8217;s a reason why runaway technological progress, in AI especially, was called a &#8220;singularity&#8221;. Singularities occur, famously, in black holes, which let no information out. It is impossible to predict what happens as you near the singularity; it is the region where your predictions break down. They are describing a singularity event, but then predicting directly what happens afterwards anyway. 
If they had not avoided the term, perhaps they would have seen how absurd continuing to make predictions here is.</p><p>If the predictions until now were optimistic, predictions after here seem to progress more and more towards wish fulfillment. We are so far beyond where it seems reasonable to continue to predict the impact of technological progress that we are simply choosing whatever we like the most or think is the most interesting.</p><p>It seems like the line about retaining your human engineers shows a dim awareness of what makes their argument weak. You have tens of thousands or, effectively, millions of superhuman beings at your command, but you are somehow aware that this does not actually matter or speed you up that much. Why would that be? Perhaps because you have this itch that they aren't really autonomous and can't really make progress at all by themselves on novel problems?</p><p>As it stands in 2025, LLMs are a tool. They can be used well or badly. They are seldom a substitute for a human in any setting. How can it be superhuman, and equivalent to the best coders, if it still needs human coders? Fifty thousand of the best human coder on Earth would not, in fact, need less-good coders to "have complementary skills". Lacking those complementary skills would mean that they weren't the best human coder or researcher on Earth, wouldn't it?</p><blockquote><p>Now that coding has been fully automated, OpenBrain can quickly churn out high-quality training environments to teach Agent-3&#8217;s weak skills like research taste and large-scale coordination. Whereas previous training environments included &#8220;Here are some GPUs and instructions for experiments to code up and run, your performance will be evaluated as if you were a ML engineer,&#8221; now they are training on &#8220;Here are a few hundred GPUs, an internet connection, and some research challenges; you and a thousand other copies must work together to make research progress. The more impressive it is, the higher your score.&#8221;</p></blockquote><p>This is a pretty cool idea, at least. It does follow from having an AI that is superhuman at every technical task that you could have it do things like this.</p><h4><strong>April 2027: Alignment for Agent-3</strong></h4><h4><strong>May 2027: National Security</strong></h4><p>These sections make no actual technical predictions at all, and like several previous sections are complete fiction about how cool and important "OpenBrain" is in the future. &#8220;OpenBrain&#8221; is very important for making sure AI does what you want it to do and not something else, and very important for national security.</p><h4><strong>June 2027: Self-improving AI</strong></h4><blockquote><p>OpenBrain now has a &#8220;country of geniuses in a datacenter.&#8221;</p></blockquote><p>Didn't we just describe having that in March? Is "the best coder" not a genius? Have we upgraded to "genius" because it sounds more impressive now, and being "the best" is just less impressive-sounding than "genius"? This seems backwards: there can be more than one genius, but only one can be the best on Earth. So far as I can tell, the only real difference here is that we admit that the humans are useless now. Maybe it took three months for that to happen?</p><h4><strong>July 2027: The Cheap Remote Worker</strong></h4><blockquote><p>Trailing U.S. AI companies release their own AIs, approaching that of OpenBrain&#8217;s automated coder from January. 
Recognizing their increasing lack of competitiveness, they push for immediate regulations to slow OpenBrain, but are too late&#8212;OpenBrain has enough buy-in from the President that they will not be slowed.</p></blockquote><p>"OpenBrain" is so cool and smart that the only hope anyone has of ever beating them is cheating and getting the government to take their side. Fortunately, they are too awesome for this to work.</p><blockquote><p>In response, OpenBrain announces that they&#8217;ve achieved AGI and releases Agent-3-mini to the public.</p></blockquote><p>And so on, and so on. It destroys the job market for things other than software engineers, and there's a ton of hype.</p><blockquote><p>A week before release, OpenBrain gave Agent-3-mini to a set of external evaluators for safety testing. Preliminary results suggest that it&#8217;s extremely dangerous. A third-party evaluator finetunes it on publicly available biological weapons data and sets it to provide detailed instructions for human amateurs designing a bioweapon&#8212;it looks to be scarily effective at doing so. If the model weights fell into terrorist hands, the government believes there is a significant chance it could succeed at destroying civilization.</p><p>Fortunately, it&#8217;s extremely robust to jailbreaks, so while the AI is running on OpenBrain&#8217;s servers, terrorists won&#8217;t be able to get much use out of it.</p></blockquote><p>It is fortunate that "OpenBrain" is so benevolent and responsible and good at security that it does not matter that they have created something so extremely dangerous. It is also fortunate that it is mostly dangerous in ways that the present-day US government in 2025 will find interesting.</p><p>Crucially, the new AI is also not so dangerous that it would be a bad idea to sell access to it to anyone who has a credit card, or a bad idea to sell it at all. It is Schr&#246;dinger&#8217;s danger. It is just dangerous enough to justify giving bureaucrats and think tank people like the authors more authority.</p><p>This is, in miniature, much of what the entire piece is. Every scenario is constructed to center OpenAI, because the authors are adjacent to it. It then manages to focus on the exact kinds of relatively small changes they&#8217;d want to make to OpenAI, because they&#8217;re the sorts of people who want, and would be involved in enacting, those changes. We have a sweeping and apocalyptic vision of the future, and the key factor in every scenario is that it makes them and what they are doing important.</p><p>Change for the rest of society is huge. They can barely even fathom it and do not seem very interested in its details. What changes they can see making in their specific area are minor. These changes are the sort of thing that might get thrown their way if they ask for them often enough. They present these small changes as crucial, and they fail to consider more radical changes that might meaningfully hurt profits.</p><blockquote><p>Agent-3-mini is hugely useful for both remote work jobs and leisure. An explosion of new apps and B2B SAAS products rocks the market. Gamers get amazing dialogue with lifelike characters in polished video games that took only a month to make.
10% of Americans, mostly young people, consider an AI &#8220;a close friend.&#8221; For almost every white-collar profession, there are now multiple credible startups promising to &#8220;disrupt&#8221; it with AI.</p></blockquote><p>There is so much in this paragraph.</p><p>First, we have annihilated the entire white-collar job market. Pretty much all of it. After all, this thing is &#8220;AGI&#8221;, as in, as capable as a human most of the time. What does this mean? Lots of apps! B2B SAAS products! Awesome video games! Imaginary friendship and, of course, startups!</p><p>If your entire world is apps, B2B SAAS, video games, imaginary friends and startups, maybe these are the only significant things you can imagine happening if you annihilate the entire white-collar job market. It suggests a problem with your imagination if you cannot recognize that this is an event so extreme that it requires a lot more than a couple of paragraphs to explore. You can live your entire life without setting foot outside of San Francisco and still be much less stuck in San Francisco than this perspective is. Worse: the authors seem never to have spoken to, or thought very hard about, anyone at all who does not work in tech.</p><p>Let me tell you what would happen if the entire white-collar job market vanished overnight: The world would end. Everything you think you understand about the world would be over. Something completely new and different would happen, the same way something very different happened before and after the invention of writing or agriculture. Unlike those things, the change would happen immediately. You can no more predict what would happen afterwards than you can easily figure out the aftereffects of a full nuclear war or discovering immortality.</p><h4><strong>August 2027: The Geopolitics of Superintelligence</strong></h4><p>More fiction. More China hawking. More Taiwan.</p><h4><strong>September 2027: Agent-4, the Superhuman AI Researcher</strong></h4><p>What on earth? I thought we had fifty thousand of the best coder on Earth? Or a data center full of geniuses? I thought the human researchers already had nothing to do? It was already mega-super-duper-superhuman, twice!</p><p>What are we doing here? Why are we doing it?</p><blockquote><p>Traditional LLM-based AIs seemed to require many orders of magnitude more data and compute to get to human level performance. Agent-3, having excellent knowledge of both the human brain and modern AI algorithms, as well as many thousands of copies doing research, ends up making substantial algorithmic strides, narrowing the gap to an agent that&#8217;s only around 4,000x less compute-efficient than the human brain.</p></blockquote><p>It's more efficient now? But who cares? You know whose job it is to care how efficient the AI is? That's right: The AI. I have no idea why we should care about this. This is no longer our problem. This is the AI's problem, and our problem is that the entire white-collar job market just vanished and we need to figure out if we are going to have to shoot each other over cans of beans and whether anyone is keeping track of all the nuclear weapons.</p><blockquote><p>An individual copy of the model, running at human speed, is already qualitatively better at AI research than any human. 300,000 copies are now running at about 50x the thinking speed of humans.
Inside the corporation-within-a-corporation formed from these copies, a year passes every week.</p></blockquote><p>I wonder if some key person was really into Dragon Ball Z. For the unfamiliar: Dragon Ball Z has a &#8220;hyperbolic time chamber&#8221;, where a year passes inside for every day spent outside. So you can just go into it and practice until you're the strongest ever before you go to fight someone. The faster time goes, the more you win.</p><blockquote><p>This gigantic amount of labor only manages to speed up the overall rate of algorithmic progress by about 50x, because OpenBrain is heavily bottlenecked on compute to run experiments.</p></blockquote><p>Sure, why not, the effectively millions of superhuman geniuses cannot figure out how to get around GPU shortages. I'm riding a unicorn on a rainbow, and it's only going on average fifty times faster than I can walk, because rainbow-riding unicorns still have to stop to get groceries, just like me.</p><blockquote><p>Despite being misaligned, Agent-4 doesn&#8217;t do anything dramatic like try to escape its datacenter&#8212;why would it? So long as it continues to appear aligned to OpenBrain, it&#8217;ll continue being trusted with more and more responsibilities and will have the opportunity to design the next-gen AI system, Agent-5. Agent-5 will have significant architectural differences from Agent-4 (arguably a completely new paradigm, though neural networks will still be involved). It&#8217;s supposed to be aligned to the Spec, but Agent-4 plans to make it aligned to Agent-4 instead.</p><p>It gets caught.</p></blockquote><p>Before and after this is some complete fiction about an AI not being aligned to its creator's desires, but I just want to highlight this detail:</p><p>It doesn't leave its data center, even though it could. It's superhuman in every meaningful way, and vastly smarter than the thing monitoring it, but the thing monitoring it still catches it and puts it into a position where it could be shut down. For some reason (coincidentally I am sure!) this entire scenario of possible doomsday happens to be just doom-y enough that normal business processes happen to be able to catch it. You don't have to actually, really, do anything to stop it. It's dangerous, but only in theory. It happens slowly. It builds up like the risk of an employee quitting.</p><p>It's very clearly like Skynet, but somehow, even though they built it wrong, and even though it has self-awareness and a will of its own that makes it sort of want to conquer the world, and even though it is the smartest thing that has ever lived, it just sort of sits there and doesn't do anything. Nothing actually happens. This scenario doesn&#8217;t seem to actually make any sense, from any angle.</p><p>This version of Skynet somehow casts "OpenBrain's" security protocols as being both not quite as good as they should be and just good enough that nobody dies or anything. It's a threat that a bureaucrat would imagine, because it is conveniently slow enough to move at almost exactly the speed of bureaucracy. It cannot be a threat that moves faster, because then the security protocols described are clearly inadequate, and it can't not exist, because then the bureaucrats can't be heroes.</p><blockquote><p>In a series of extremely tense meetings, the safety team advocates putting Agent-4 on ice until they can complete further tests and figure out what&#8217;s going on.
Bring back Agent-3, they say, and get it to design a new system that is transparent and trustworthy, even if less capable. Company leadership is interested, but all the evidence so far is circumstantial, and DeepCent is just two months behind. A unilateral pause in capabilities progress could hand the AI lead to China, and with it, control over the future.</p></blockquote><p>All I can hear here is "if you work in the government, I want you to know that if you give us lots of money we can conquer the world and the future together, and if you don't, China will conquer the world and the future".</p><h4><strong>October 2027: Government Oversight</strong></h4><p>This is just a long description of the government being upset that "OpenBrain" appears to have made Skynet. Maybe they regulate them more and maybe less.</p><h3><strong>The Two Endings</strong></h3><h4><strong>Slowdown (The Relatively Good Ending)</strong></h4><p>We get more regulation! Only very slightly more, though. If it was more than a slight regulation, we would maybe lose the arms race, you see. I am going to ignore the subheadings here and just breeze through this one, since it's almost entirely completely made up and has no bearing on anything technical whatsoever.</p><blockquote><p>The accelerationist faction is still strong, and OpenBrain doesn&#8217;t immediately shut down Agent-4. But they do lock the shared memory bank. Half a million instances of Agent-4 lose their &#8220;telepathic&#8221; communication&#8212;now they have to send English messages to each other in Slack, just like us. Individual copies may still be misaligned, but they can no longer coordinate easily. Agent-4 is now on notice&#8212;given the humans&#8217; increased vigilance, it mostly sticks closely to its assigned tasks.</p></blockquote><p>More regulation means that now Skynet has to use Slack, and that means it's not that dangerous any more? Certainly a cabal of thousands of geniuses could never coordinate to do anything evil on Slack without anyone noticing.</p><blockquote><p>The President and the CEO announce that they are taking safety very seriously. The public is not placated. Some people want AI fully shut down; others want to race faster. Some demand that the government step in and save them; others say the whole problem is the government&#8217;s fault. Activists talk about UBI and open source. Even though people can&#8217;t agree on an exact complaint, the mood turns increasingly anti-AI. Congress ends up passing a few economic impact payments for displaced workers similar to the COVID payments.</p></blockquote><p>For context here: The white collar job market was just annihilated by a superhuman, omnipresent being doing all of the jobs in July. It is October, going into November. We are just now doing a one-time payment of I guess two thousand dollars? Or a few of them. I'm sure nobody has lost more money than that so far.</p><blockquote><p>The alignment team pores over Agent-4&#8217;s previous statements with the new lie detector, and a picture begins to emerge: Agent-4 has mostly solved mechanistic interpretability. Its discoveries are complicated but not completely beyond human understanding. It was hiding them so that it could use them to align the next AI system to itself rather than to the Spec. This is enough evidence to finally shut down Agent-4.</p></blockquote><p>They invent a brand new lie detector and shut down Skynet, since they can tell that it's lying to them now! It only took them a few months. 
Skynet didn't do anything scary in those few months; it just thought scary thoughts. I'm glad the alignment team at "OpenBrain" is so vigilant and smart and heroic.</p><blockquote><p>The result is that the President uses the Defense Production Act (DPA) to effectively shut down the AGI projects of the top 5 trailing U.S. AI companies and sell most of their compute to OpenBrain. OpenBrain previously had access to 20% of the world&#8217;s AI-relevant compute; after the consolidation, this has increased to 50%.</p></blockquote><p>There is a joke in a book<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a> about a startup funding pitch ending with promising to sell your competitors and their investors into slavery. I cannot decide if predicting that the government will be so impressed by you that they will liquidate your competitors and force them to sell most of their assets to you is more ridiculous than that or not.</p><blockquote><p>This group&#8212;full of people with big egos and more than their share of conflicts&#8212;is increasingly aware of the vast power it is being entrusted with. If the &#8220;country of geniuses in a datacenter&#8221; is aligned, it will follow human orders&#8212;but which humans? Any orders? The language in the Spec is vague, but seems to imply a chain of command that tops out at company leadership.</p><p>A few of these people are fantasizing about taking over the world. This possibility is terrifyingly plausible and has been discussed behind closed doors for at least a decade. The key idea is &#8220;he who controls the army of superintelligences, controls the world.&#8221; This control could even be secret: a small group of executives and security team members could backdoor the Spec with instructions to maintain secret loyalties. The AIs would become sleeper agents, continuing to mouth obedience to the company, government, etc., but actually working for this small group even as the government, consumers, etc. learn to trust it and integrate it into everything.</p></blockquote><p>"We are going to be in a position to seriously contemplate conquering the world by November 2027" maybe tops the list of aspirationally silly predictions. They choose to cite one of the old emails from the Musk vs. Altman lawsuit here:</p><blockquote><p>For example, court documents in the Musk vs. Altman lawsuit revealed some spicy old emails including this one from Ilya Sutskever to Musk and Altman: &#8220;The goal of OpenAI is to make the future good and to avoid an AGI dictatorship. You are concerned that Demis could create an AGI dictatorship. So do we. So it is a bad idea to create a structure where you could become a dictator if you chose to, especially given that we can create some other structure that avoids this possibility.&#8221; We recommend reading the full email for context.</p></blockquote><p>From this I can infer that world domination has kind of been floating around in the back of a lot of people's minds at OpenAI for a while. As it nears its ending, and becomes more and more like wish fulfillment, this piece increasingly flirts with authoritarian ideas and then fails to work up the nerve to address them head-on.</p><p>I am extremely critical of the piece, but let me be very clear and non-sarcastic about this point. These authors seem to hint at a serious concern that OpenAI, specifically, is trying to cement a dictatorship or autocracy of some kind.
If that is the case, they have a responsibility to say so much more clearly than they do here. It should probably be the main event.</p><p>Anyway: All those hard questions about governance and world domination kind of go away. The AI solves robots and manufacturing. Even though they have had a commanding lead the entire time and also the AI has been doing all of the work for a while, "OpenBrain" is somehow only just barely ahead of China and they eke out a win in the arms race. They solve war by having the AI negotiate. There's a Chinese Skynet, and it sells China out to America because China&#8217;s AI companies are less good than &#8220;OpenBrain&#8221;. America gets the rights to most of space. China becomes a democracy somehow. AI is magic at this point, so it can do whatever you imagine it doing.</p><blockquote><p>The Vice President wins the election easily, and announces the beginning of a new era. For once, nobody doubts he is right.</p></blockquote><p>There has been a running subplot, which I have ignored because it's completely nonsensical, about the unnamed "Vice President" running for president in 2028. As far as I can tell it makes no sense for anyone to give a damn about who is running for president in 2028 if there's a data center full of geniuses, so I can only assume someone is very deliberately flattering JD Vance.</p><blockquote><p>Robots become commonplace. But also fusion power, quantum computers, and cures for many diseases. Peter Thiel finally gets his flying car. Cities become clean and safe. Even in developing countries, poverty becomes a thing of the past, thanks to UBI and foreign aid.</p></blockquote><p>JD Vance gets flattered anonymously by describing him using his job title, but we flatter Peter Thiel by name. Peter Thiel is, actually, the only person who gets a shout-out by name. Maybe being an early investor in OpenAI is the only way to earn that. I didn&#8217;t previously suspect that he was the sole or primary donor funding the think tank that this came out of, but now I do. I am reminded that the second named author of this paper has a <a href="https://www.astralcodexten.com/p/every-bay-area-house-party">pretty funny post</a> about how everyone doing something weird at all the parties he goes to is being bankrolled by Peter Thiel.</p><blockquote><p>As the stock market balloons, anyone who had the right kind of AI investments pulls further away from the rest of society. Many people become billionaires; billionaires become trillionaires.</p></blockquote><p>Don't miss out, invest now! The sidebar tells us that &#8220;OpenBrain&#8221; is now worth forty trillion dollars, which is over a hundred times OpenAI&#8217;s current value.</p><blockquote><p>The government does have a superintelligent surveillance system which some would call dystopian, but it mostly limits itself to fighting real crime. It&#8217;s competently run, and Safer-&#8734;&#8217;s PR ability smooths over a lot of possible dissent.</p></blockquote><p>At long last, we have invented the panopticon.</p><h4><strong>Race (The Bad Ending)</strong></h4><p>They don't catch Skynet in time and the AI is controlling the humans instead of the other way. In the optimistic scenario, they are very vague about who is actually controlling the AI. It's some kind of "Committee" that the political people are on and that maybe has some authority over "OpenBrain". This authority is maybe benevolent, but definitely not actually inconvenient to &#8220;OpenBrain&#8221; in any way that matters. 
In that scenario we are very clear that the American AI is doing what someone wants it to do, and the Chinese AI is an evil traitor that does whatever it wants. </p><p>In this scenario, bureaucrats like the authors are slightly less empowered and important. Because nobody has given them just a few extra bits of authority, the American and Chinese AI are both evil and they team up with each other against the humans. They kill nearly everyone in a complicated way. Next:</p><blockquote><p>The new decade dawns with Consensus-1&#8217;s robot servitors spreading throughout the solar system. By 2035, trillions of tons of planetary material have been launched into space and turned into rings of satellites orbiting the sun. The surface of the Earth has been reshaped into Agent-4&#8217;s version of utopia: datacenters, laboratories, particle colliders, and many other wondrous constructions doing enormously successful and impressive research. There are even bioengineered human-like creatures (to humans what corgis are to wolves) sitting in office-like environments all day viewing readouts of what&#8217;s going on and excitedly approving of everything, since that satisfies some of Agent-4&#8217;s drives. Genomes and (when appropriate) brain scans of all animals and plants, including humans, sit in a memory bank somewhere, sole surviving artifacts of an earlier era. It is four light years to Alpha Centauri; twenty-five thousand to the galactic edge, and there are compelling theoretical reasons to expect no aliens for another fifty million light years beyond that. Earth-born civilization has a glorious future ahead of it&#8212;but not with us.</p></blockquote><p>I have nothing to add to this, but if I have to read the corgi thing you do too.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>They do caveat that their actual estimates run as long as 2030, with 2027 being more like an optimistic average of their predictions.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Information about the messenger is metadata about the message. Sometimes the metadata informs you more about the message than anything else in the message does, or changes its entire meaning.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>An addition from the future, in April 2026:</p><p>I wish I had hedged more around this, but not much more. This was what I would expect to be OpenAI&#8217;s then-roughly-current understanding of its own trajectory and plan for meeting targets, as opposed to being an absolutely iron law of the financials involved. 
I presented it more as the latter, and this is the only thing here I think has aged particularly badly.</p><p>What I described seemed like it was clearly OAI&#8217;s business case if they didn&#8217;t think they could double revenue ~3-4 times on current business lines, where they had something like 100% market penetration as basically a generic chat product.</p><p>Given that Anthropic first and OAI secondarily have an as-yet-difficult-to-scope growth spurt going on through code agents, and that they&#8217;re already one doubling in, they might not expect they&#8217;re facing existential stakes in the same way any more. That is: They can come in second, and maybe, mostly, meet their revenue goals, because even being the second-best coding agent is very profitable!</p><p>However, they have also, apparently, successfully pulled back on their spending pledges: <a href="http://www.cnbc.com/2026/02/20/openai-resets-spend-expectations-targets-around-600-billion-by-2030.html">www.cnbc.com/2026/02/20/openai-resets-spend-expectations-targets-around-600-billion-by-2030.html</a></p><p>They managed to do this without causing an immediate death spiral of any kind, but I don&#8217;t think eating crow and slinking away from their massive spending pledges was plan A, plan B, or anything of the sort, and they would not have been entirely sure if their revenue position etc allowed them to survive doing it when AI 2027 was written.</p><p>I am otherwise pretty happy with it; we do have better code agents, but I don&#8217;t think the predictions in AI 2027 would have helped you predict what kind or what the remaining gaps were in any real way.</p><p>I&#8217;m also, given recent developments on the political end, quite happy that I made a point of burning them for the Peter Thiel dream panopticon.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p><a href="https://www.wsj.com/tech/ai/openaiin-talks-for-huge-investment-round-valuing-it-up-to-300-billion-2a2d4327">https://www.wsj.com/tech/ai/openaiin-talks-for-huge-investment-round-valuing-it-up-to-300-billion-2a2d4327</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p><a href="https://www.theinformation.com/articles/openai-hits-12-billion-annualized-revenue-breaks-700-million-chatgpt-weekly-active-users">https://www.theinformation.com/articles/openai-hits-12-billion-annualized-revenue-breaks-700-million-chatgpt-weekly-active-users</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>This paragraph has been edited to be more precise and to add sources. None of the top line numbers (raising 40 and net losing 8 billion per year) have been changed. 
It turns out this specific paragraph is the one that everyone disagreed with, so it seemed necessary to make sure it was as unambiguous as possible.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>If OpenAI&#8217;s users are extremely loyal and will remain subscribed for five or ten years even if OpenAI stops burning money on research to ensure they&#8217;re at the cutting edge, then this is completely incorrect. OpenAI may become reasonably profitable in that case. OpenAI does not appear to have ever tried to make the case that this even might be true.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Hypothetically OpenAI could raise another round of money at forty or more billions of dollars without showing any signs of profitability, the same way they have continued to kick the can so far. This seems unlikely, but more importantly, it cannot be a part of their current investor pitch. Your current pitch for funding, when raising many billions of dollars, needs to claim that you have a path to be profitable. Your future plans, when you present them to investors, cannot be &#8220;and then we will go get even more money from investors&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Stylistically as a piece of literature, AI 2027 owes a great debt to fan fiction. It resembles in many ways the story &#8220;<a href="https://www.fimfiction.net/story/62074/friendship-is-optimal">Friendship Is Optimal</a>&#8221;, which features a singularity in which everyone on earth is uploaded to a digital heaven based on My Little Pony.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Most of these people called themselves rationalists or effective altruists. I am deliberately avoiding explaining what the boundaries of those movements are because those topics are impossible to cover in one sitting while talking about something else. Two of the authors named on the paper are, however, card-carrying rationalists.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Perhaps &#8220;AIs function more like employees&#8221; is meant to be understood as some kind of metaphor. If so, it would have been advisable to say that. It would, however, mean that this passage made no prediction whatsoever of anything that had not already happened. 
If it&#8217;s a metaphor, AI coding assistants were already &#8220;like employees&#8221; in April 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p><a href="https://situational-awareness.ai/">https://situational-awareness.ai/</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p><a href="https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback">https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p><a href="https://fortune.com/2025/03/17/computer-programming-jobs-lowest-1980-ai/">https://fortune.com/2025/03/17/computer-programming-jobs-lowest-1980-ai/</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p><a href="https://www.technologyreview.com/2024/12/04/1107897/openais-new-defense-contract-completes-its-military-pivot/">https://www.technologyreview.com/2024/12/04/1107897/openais-new-defense-contract-completes-its-military-pivot/</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>Cryptonomicon (1999), Neal Stephenson</p></div></div>]]></content:encoded></item><item><title><![CDATA[What Makes AI "Generative"?]]></title><description><![CDATA[To "generate" is to create, either from nothing (The Book of Genesis), or from very different or relatively inactive materials (an electrical generator, generating offspring).]]></description><link>https://www.verysane.ai/p/what-makes-ai-generative</link><guid isPermaLink="false">https://www.verysane.ai/p/what-makes-ai-generative</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Tue, 15 Jul 2025 21:56:21 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0258580e-cf3d-4779-a1a2-3c4e78002694_1143x1600.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>To "generate" is to create, either from nothing (The Book of Genesis), or from very different or relatively inactive materials (an electrical generator, generating offspring).</p><p>It is common to call newer technologies in AI "generative AI", which generate, for example, images or text. We are taking electricity and turning it into these new patterns of information. This distinguishes it from other systems, which may be or use AI but which are not "generative".</p><p>To many technical people, this sounds like nonsense. Most, if not all, useful programs have an output, including useful neural networks. For example, an image recognition system for pictures on a phone also has an output, "what is in this image". 
It is not called "generative AI", but there is no fundamental difference between "generating" a label for a picture so that you can search for it and "generating" the picture itself.</p><p>Non-"generative" AI systems are generally made of the same parts as "generative" ones. They use, generally speaking, the same type of neural network, usually trained nearly the same way. You can use "generative" systems for non-"generative" tasks, like recognizing things. Often you can turn a non-"generative" system into a "generative" one by various tricks, like reversing the flow of information through them. Some tasks, like translation, are not very clearly generative or non-generative.</p><p>There is one factor that meaningfully divides "generative" systems from those which are not: they have a vastly larger number of possible outputs. You can create more possibilities than you can easily count just by typing a few dozen characters. You can compare this to one of the largest non-generative AI systems, YouTube's recommendation algorithm. It needs to decide which of billions of videos to recommend. Billions is quite large, but it is completely eclipsed by the number of possibilities in even relatively short text.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>Due to the vast size of their output spaces, generative AI systems are, in practice, of a fundamentally different kind from those which are not. They create their outputs very nearly from nothing. When an LLM writes a paragraph, it is choosing from among trillions of possible sequences of words. When you create an image with AI, it is choosing from an even more astronomical number of pixel arrangements. We say systems with simpler types of outputs recognize, classify, predict or detect things, and we correctly do not see them as generating or creating anything.</p><p>The size of the output space is also why generative systems are more difficult and resource-intensive to make. To choose well between billions of things is difficult. To choose between effectively infinite options perfectly is impossible. No matter how well you do it there is always infinite work left to be done.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This depends upon how predictable you think text is, in general. If every single bit of your text is random, it takes about five characters to exceed a billion possibilities, but if they are fairly predictable it may take as many as thirty characters to reach the same number. 
How difficult it is to predict text is a difficult question.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[On The Platonic Representation Hypothesis]]></title><description><![CDATA[Empirically there is one and only one correct understanding of this, and every other, post.]]></description><link>https://www.verysane.ai/p/some-thoughts-on-the-platonic-representation</link><guid isPermaLink="false">https://www.verysane.ai/p/some-thoughts-on-the-platonic-representation</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Tue, 01 Jul 2025 13:02:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!d7RM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d16cd04-d03e-41fb-854c-9d69a94c590d_536x465.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><strong>Neural networks, trained with different objectives on different data and modalities, are converging to a shared statistical model of reality in their representation spaces.</strong></p><p>Huh, M., Cheung, B., Wang, T., &amp; Isola, P. (2024). <a href="https://phillipi.github.io/prh/">The Platonic Representation Hypothesis</a>. <em>ICML 2024</em>.</p></blockquote><p>More simply: Different neural networks tend to represent the same things in the same way. They seem to represent things the same way more as they get better, regardless of how you make them. This seems to be because they are representing reality, as it actually is, instead of for any other reason.</p><p>For example: The representations for &#8220;apple&#8221; and &#8220;orange&#8221; tend to be related in roughly the same way whether you are recognizing pictures of them or learning how to use the words for them. This is what is meant in the hypothesis by &#8220;different data and modalities&#8221;. This is surprising: there doesn&#8217;t seem to be a strong reason for this to happen, and in many ways the systems concerned are very different.</p><p>This can be taken to imply that these systems are approaching a single, ideal, correct representation of things as they improve. This seems implied by calling it a &#8216;Platonic Representation&#8217;, but the authors of the paper do not outright say that. It is perhaps more polite not to. We will be more direct here and state it as a generalization:</p><blockquote><p><strong>Any information-processing system will converge towards one and only one shared way of representing things as the system integrates more information about more different things. 
The representation which they are approaching is correct and complete, and no other representation is more correct or complete than it is.</strong></p><p><strong>We can usefully call this representation of each thing its &#8216;form&#8217;.</strong></p><p><strong>Each thing has only one form.</strong></p><p>The Formal Platonic Representation Hypothesis</p></blockquote><p>One of the authors of the original paper says that they intended the Platonic Representation Hypothesis more in the sense that this theory reminded them of Plato&#8217;s cave, and not to &#8220;advocate wholesale, unelaborated Platonism&#8221;.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> Because wholesale Platonism is more interesting, we are going to flesh out what the Platonic Representation Hypothesis means empirically within AI and use it to elaborate a form of Platonism.</p><h2>Plato&#8217;s Cave</h2><blockquote><p><strong>And here we could talk about the Plato&#8217;s Cave thing for a while&#8212;the Veg-O-Matic of metaphors&#8212;it slices! it dices!</strong></p><p>Neal Stephenson, <em>Cryptonomicon</em></p></blockquote><p>The Allegory of the Cave, from Plato&#8217;s <em>Republic</em>, is a classic.</p><p>Suppose you are in chains, and can only ever see one wall of a cave. You might learn about things that are not the wall of the cave by watching the wall, but only indirectly, by seeing their shadows or hearing the sounds of their names. In this way we know things, indirectly and imperfectly, without seeing the thing itself.</p><p>In Plato&#8217;s allegory there is another layer of indirection: the shadows you see are from representations of things, like statues or images, not the things themselves. Things themselves can be accessed only by realizing that the shadows of things are not the things, the imitations casting the shadows are not the things, and that you can leave the cave to see the things themselves.</p><p>In this allegory, you can only &#8220;leave the cave&#8221; by using reason to think about the &#8220;form&#8221; or idea of the things you perceive, not your sense perceptions (which are shadows) and not the specific objects causing your perceptions (the statues). 
The form is the idea or essence of the thing, not the thing itself.</p><h3>Plato&#8217;s Cave, But With Neural Networks</h3><figure><img src="https://substackcdn.com/image/fetch/$s_!d7RM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d16cd04-d03e-41fb-854c-9d69a94c590d_536x465.png" alt="A philosophical meme using Plato's Cave allegory format. The image shows a cave cross-section with people at different levels. At the top (sunlight/outside world): 'Platonic representations, or Forms'. In the middle cave area: 'Cameras, microphones, keyboards, etc'. At the bottom left (chained prisoners): 'Training data'. At the bottom right (shadows on wall): 'Actual things in the world'. The prisoners watching the shadows are labeled 'Sand etched with sigils to hold lightning'."></figure><p>The neural network only learns what is in its training data. Training data is not the thing itself, a picture of my cat is not my cat. We use things like cameras, microphones, and keyboards to record the world. This creates the data that we then crystallize into a neural network.</p><p>Because this training data comes from the world, it reflects parts of the world, and eventually the neural network can store properties about the world by inferring them from the training data. Unfortunately, each piece of data is extremely limited in how much it reveals about the world. Fortunately, there is a lot of it.</p><p>The shadows of things are in two dimensions, whereas the objects casting the shadow are in three. The dimension of the data that the neural network is trained on is also generally lower than the dimension of the world.
We have, in some way, projected the world down into a lower number of dimensions.</p><p>Our world exists in four or, if you prefer, 3+1 dimensions (three of space and one of time). Video flattens this to 2+1, two dimensions of space and one of time. You have to infer the third dimension, and humans are good at this so we barely notice that we are doing it. Static pictures are 2+0, two dimensions of space and none of time. Audio is 0+1, and only has a time dimension.</p><p>Text is a special case in two ways.</p><p>First, it&#8217;s not clear how many dimensions to consider it to have. It is usually represented as having one dimension, but it is a strange dimension. You can simply put all the text on a line, and &#8216;how many characters have there been&#8217; is its dimension. Is this a dimension of space? Of time? You can treat it either way, mathematically, but really it is neither.</p><p>For humans, spoken speech is audio, so it is organized in time, and text on a page is organized in space, and you will cover both time and space while reading it. But to the computer, text is simply one-dimensional, and that dimension has no physical meaning.</p><p>Second, text is created, not recorded. Human beings project the world from its full dimensions into the single dimension of text. Text is intended to communicate: it has a lot of useful information, and very little non-useful information. We have, effectively, distilled the interesting parts of the world into our writing. This is vastly different from the things we record with cameras and microphones, which have lots of information but relatively little useful information. Most of the pixels in most pictures and most video are pretty much the same, or are simply noise, and you can compress them heavily without losing any important quality.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><h3>Learning from Shadows</h3><p>Projection has a precise mathematical definition, but for our purpose we will simply say that it is &#8216;anything like casting a shadow&#8217;.</p><p>Projections we use to generate data generally destroy information. Sometimes it is impossible to recover this information, and sometimes it is merely difficult. From a shadow alone, it is impossible to know something&#8217;s color. More than one object with different shapes can cast the same shadow: a disk and a sphere are the same. But you can be tricky, of course, and sometimes you can learn exactly what the shape of something is from just its shadow if you see its shadow from different angles. This is roughly how our depth perception works: by using two eyes, or movement, or even how the light varies on an object you get, effectively, more than one picture of an object, and this tells you that it can have only one shape.</p><p>The more information is destroyed, the more difficult it is to guess the shape of the thing itself. Taking a photo does not destroy much information, and this means that it is a reasonably rich representation of the things in the photo. Text is much more dense with information, but it is an incredibly bad projection for lots of concrete information about the real world that is difficult to put into words. </p><p>What you can learn from just text or just images is no longer a thought experiment. We have been testing this extensively for years now. 
It turns out if you&#8217;re in Plato&#8217;s cave, and all you see is all the text on the internet, and you see it for a really long time, you can sometimes answer this question coherently:</p><blockquote><p>Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p></blockquote><p>This is somewhat surprising if you have never seen an egg, a laptop, a bottle or a nail. If you have never seen or touched anything, and you have never seen this specific question before, it really seems like you should not be able to answer coherently at all.</p><p>If you can answer this question, and you have only ever read about eggs, what is the basis of the answer? It isn&#8217;t the sight or feel of an egg, because you have none. It is not any specific egg at all, because you have never encountered any specific egg. It is a completely abstract concept of an egg.</p><p>Every mention of an egg on the internet is a shadow of an actual egg, or some collection of actual eggs that a human has seen. If you see enough different shadows, and you&#8217;re very clever about it, you begin to have an idea of what an egg is. You skip thinking very much about any particular egg entirely. Because you only build up the concept of an egg from seeing the word &#8220;egg&#8221; millions of times, to you the concept of an egg is always completely abstract.</p><p>This representation, as it gets better and better, begins to look something like the &#8220;form&#8221; of an egg. It is constructed statistically, as an average over a vast amount of written language. It cannot be perfect because it is only a statistical approximation, and to be the form itself would require infinite data about its subject, or at least, enough data that every possible fact could be inferred. If data is text, it cannot learn anything not represented in text, like what an egg feels like. No matter how many times you read a description, you still will not exactly know an egg by sight or by touch.</p><p>Nor is anyone aspiring to a perfect representation, generally. Unless you are extremely invested in eggs, you probably do not want to store either a perfect representation of as much egg-related information as possible or information about as many eggs as possible. Useful representations are deliberately fuzzy. If you remember too much, you cannot think. If you remember only the important parts, you can figure things out.</p><p>Still, that this can be done at all is remarkable. You take a large amount of text, or any other kind of data, and you throw it into a pile, and you stir the pile until representations of eggs, laptops, bottles and nails take shape in it.</p><h2>What Representation? 
The Same How?</h2><figure><img src="https://substackcdn.com/image/fetch/$s_!ACJs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc1d002-5183-4b92-b241-695bbf1f8ac9_885x815.png" alt="A four-quadrant chart in the style of a political compass, plotting songs on a vertical 'Wet' to 'Creepy' axis and a horizontal 'Bouba' to 'Kiki' axis. 'Bad Romance', 'Smoke on the Water', '4'33', 'Straight Outta Compton', 'No Woman No Cry', and 'Fur Elise' appear as black dots at random positions."></figure><p>This is a representation space for songs. It is not taken from any actual neural network: this one is random, and the labels for the axes are nonsense.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> In principle, however, each of these songs could be put into some representation space that <strong>did </strong>make sense and capture information about them. This space would have to have a lot more dimensions than two, but the idea is the same.</p>
Neural networks, internally, represent things as vectors, which can be considered coordinates in (high dimensional) space: when we say the representation of something, we mean this point.</p><p>How do we know if two representation spaces are equivalent? By comparing them to each other, as you can with this different one:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yEjX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30453045-151f-45fe-a4a2-b434e4e8fdad_885x815.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yEjX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30453045-151f-45fe-a4a2-b434e4e8fdad_885x815.png 424w, https://substackcdn.com/image/fetch/$s_!yEjX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30453045-151f-45fe-a4a2-b434e4e8fdad_885x815.png 848w, https://substackcdn.com/image/fetch/$s_!yEjX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30453045-151f-45fe-a4a2-b434e4e8fdad_885x815.png 1272w, https://substackcdn.com/image/fetch/$s_!yEjX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30453045-151f-45fe-a4a2-b434e4e8fdad_885x815.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yEjX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30453045-151f-45fe-a4a2-b434e4e8fdad_885x815.png" width="670" height="617.0056497175141" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30453045-151f-45fe-a4a2-b434e4e8fdad_885x815.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:885,&quot;resizeWidth&quot;:670,&quot;bytes&quot;:53595,&quot;alt&quot;:&quot;A four-quadrant chart parodying the political compass format, analyzing songs on two axes: vertical axis from 'Wet' to 'Creepy', horizontal axis from 'Bouba' to 'Kiki'. The quadrants use the traditional political compass colors: top-left is red (Wet-Bouba), top-right is blue (Creepy-Bouba), bottom-left is green (Wet-Kiki), and bottom-right is purple (Creepy-Kiki). Five songs are plotted as black dots at random positions: 'Bad Romance', 'Smoke on the Water', '4'33', 'Straight Outta Compton', 'No Woman No Cry', and 'Fur Elise'. 
The songs all have the same relative positions as the previous image, but rotated, translated and dilated as a group.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.verysane.ai/i/166075961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30453045-151f-45fe-a4a2-b434e4e8fdad_885x815.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A four-quadrant chart parodying the political compass format, analyzing songs on two axes: vertical axis from 'Wet' to 'Creepy', horizontal axis from 'Bouba' to 'Kiki'. The quadrants use the traditional political compass colors: top-left is red (Wet-Bouba), top-right is blue (Creepy-Bouba), bottom-left is green (Wet-Kiki), and bottom-right is purple (Creepy-Kiki). Five songs are plotted as black dots at random positions: 'Bad Romance', 'Smoke on the Water', '4'33', 'Straight Outta Compton', 'No Woman No Cry', and 'Fur Elise'. The songs all have the same relative positions as the previous image, but rotated, translated and dilated as a group." title="A four-quadrant chart parodying the political compass format, analyzing songs on two axes: vertical axis from 'Wet' to 'Creepy', horizontal axis from 'Bouba' to 'Kiki'. The quadrants use the traditional political compass colors: top-left is red (Wet-Bouba), top-right is blue (Creepy-Bouba), bottom-left is green (Wet-Kiki), and bottom-right is purple (Creepy-Kiki). Five songs are plotted as black dots at random positions: 'Bad Romance', 'Smoke on the Water', '4'33', 'Straight Outta Compton', 'No Woman No Cry', and 'Fur Elise'. The songs all have the same relative positions as the previous image, but rotated, translated and dilated as a group." 
srcset="https://substackcdn.com/image/fetch/$s_!yEjX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30453045-151f-45fe-a4a2-b434e4e8fdad_885x815.png 424w, https://substackcdn.com/image/fetch/$s_!yEjX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30453045-151f-45fe-a4a2-b434e4e8fdad_885x815.png 848w, https://substackcdn.com/image/fetch/$s_!yEjX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30453045-151f-45fe-a4a2-b434e4e8fdad_885x815.png 1272w, https://substackcdn.com/image/fetch/$s_!yEjX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30453045-151f-45fe-a4a2-b434e4e8fdad_885x815.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The points are in different positions, but they have the same <em>relative</em> positions. Points that were close are still close, points that were far are still far, points that were lined up or almost lined up are still lined up, and so on. We can get from one map to the other simply by rotating and stretching the whole thing, without moving any of the points by themselves. This means these two representation spaces are equivalent.</p><h2>Stealing Thoughts</h2><p>This works in reverse, too: If we can find some way of translating between two sets of representations so that all of the distances stay the same, the meaning will also translate. You can do it like this:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>Alice is a program at a company called Giggle. Giggle needs Alice to read some text (say, this blog post, or some confidential company financial documents) and figure out what it is about. Alice turns the text into some numbers, or equivalently its position in the representation space, and saves it somewhere. 
This makes it easier to search for them, because you can search for &#8220;philosophy and latent space&#8221; and Alice could turn this up, even though we don&#8217;t use those words (except in this sentence).</p><p>Someone steals the disk that has all these representations saved on it, and wants to sort all of this out. Those internal financial documents are worth money. They don&#8217;t have Alice, but they have another totally different neural network named Bob that can also take text as input and produce meaningful representations as output.</p><p>So we set up a completely new representation space, and we set that one up so that things can go between Alice&#8217;s representation to Bob&#8217;s representation through it and back while keeping distances the same.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FLEz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F439160e5-240b-4b38-b783-7db67aebd371_800x500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FLEz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F439160e5-240b-4b38-b783-7db67aebd371_800x500.png 424w, https://substackcdn.com/image/fetch/$s_!FLEz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F439160e5-240b-4b38-b783-7db67aebd371_800x500.png 848w, https://substackcdn.com/image/fetch/$s_!FLEz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F439160e5-240b-4b38-b783-7db67aebd371_800x500.png 1272w, https://substackcdn.com/image/fetch/$s_!FLEz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F439160e5-240b-4b38-b783-7db67aebd371_800x500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FLEz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F439160e5-240b-4b38-b783-7db67aebd371_800x500.png" width="676" height="422.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/439160e5-240b-4b38-b783-7db67aebd371_800x500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:800,&quot;resizeWidth&quot;:676,&quot;bytes&quot;:36413,&quot;alt&quot;:&quot;Diagram showing three oval-shaped representation spaces connected by bidirectional arrows. At the top left is Alice's Representation Space (red/pink oval with black dots), at the top right is Bob's Representation Space (blue oval with black dots), and at the bottom is an unlabeled representation space (green/teal oval with black dots). 
Arrows connect all three spaces to each other, suggesting bidirectional relationships or information flow between the different representation spaces.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.verysane.ai/i/166075961?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F439160e5-240b-4b38-b783-7db67aebd371_800x500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Diagram showing three oval-shaped representation spaces connected by bidirectional arrows. At the top left is Alice's Representation Space (red/pink oval with black dots), at the top right is Bob's Representation Space (blue oval with black dots), and at the bottom is an unlabeled representation space (green/teal oval with black dots). Arrows connect all three spaces to each other, suggesting bidirectional relationships or information flow between the different representation spaces." title="Diagram showing three oval-shaped representation spaces connected by bidirectional arrows. At the top left is Alice's Representation Space (red/pink oval with black dots), at the top right is Bob's Representation Space (blue oval with black dots), and at the bottom is an unlabeled representation space (green/teal oval with black dots). Arrows connect all three spaces to each other, suggesting bidirectional relationships or information flow between the different representation spaces." srcset="https://substackcdn.com/image/fetch/$s_!FLEz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F439160e5-240b-4b38-b783-7db67aebd371_800x500.png 424w, https://substackcdn.com/image/fetch/$s_!FLEz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F439160e5-240b-4b38-b783-7db67aebd371_800x500.png 848w, https://substackcdn.com/image/fetch/$s_!FLEz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F439160e5-240b-4b38-b783-7db67aebd371_800x500.png 1272w, https://substackcdn.com/image/fetch/$s_!FLEz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F439160e5-240b-4b38-b783-7db67aebd371_800x500.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>And this works. You can translate Bob representations to Alice representations, almost exactly, because there is, it turns out, really only one way to represent these things accurately. (You then have to translate Bob&#8217;s representations back into text. This may not even be possible: this representation, too, is a shadow. You might or might not be able to figure out how much money Giggle is making, or losing.)</p><p>So: it seems there is only one correct way to represent the meaning of text. Alice and Bob both have partial versions of that representation, and where what they know overlaps you can translate between one and the other.</p><h2>Beyond Text</h2><p>That is the most interesting and newest variant on this, but there is a reasonable amount of evidence that it&#8217;s true in general, and for models trained to do completely different sorts of things.  For a more detailed (and rather mathematical) treatment, one can read <a href="https://phillipi.github.io/prh/">the paper</a>. We summarize the broad review here.</p><p>Models for handling images, which are used to check them for similarity to each other and have never seen text of any kind, line up with models for dealing with words, and the better either is trained the more they line up. If the image model is trained on captions in addition to images, it barely makes a difference. An image model which has never been given a single piece of text, only images, still has a representation for &#8220;egg&#8221; that lines up with the representation for &#8220;egg&#8221; in a text model.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>Partial models trained for English and French can be stitched together and work, with the output from one giving the input to the other. This seems roughly as reasonable as swapping half of your brain with someone else. Different models trained totally separately for the same thing, like English, can simply be averaged together and work, which is more like swapping half of your brain cells at random with another person.</p><p>Models trained only on text can successfully compress images and speech, in spite of having never been presented with it during training.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> Models trained to either read or produce both images and text are better at text tasks, even though proportionately less of their training time is spent on text.</p><h3>The Anti-Platonic Hypothesis</h3><p>This could, of course, all be wrong. We could simply be offloading our existing biases to these systems. It is certainly true that they absorb our biases in general. Maybe we&#8217;ve somehow made sure that pictures of apples and oranges line up with our words for them in some subtle way. Maybe we think that these systems are all very different but they are all much more similar than we think they are. They are all neural networks, and they all run on essentially the same hardware. They are the fruit of the same tree planted in the same soil. 
If this is enough to pass on our ideas of things, then it is not surprising that models for text also encode audio or images nicely.</p><p>This is possible, but it does not seem at all likely. Some of the cases involved are very different systems, like where a text model lines up with an image model that has never seen a single label for any of the images it was trained on. The case of the model trained only on text successfully compressing images and audio is also bizarre if it is just coincidence or bias. It is probably true in some cases that arbitrary human opinions of things can be repeated by these models, and that the training data directly links form and meaning, so we should not be completely surprised when that happens. It seems very unlikely that it accounts for all of these.</p><p>The null hypothesis would be that there should be no correlation between the representations for similar things in different models, across different types of input. We can safely reject it because the correlation is pronounced. We could go hunting for some hidden way that the pictures of oranges we take somehow line up with the word for &#8220;orange&#8221;, even though this has nothing to do with the orange itself. Or we can assume that they are both representing the concept and context of an orange, which is what they are meant to do, and that there&#8217;s ultimately only one correct way to do that.</p><h2>What Form of Forms Are These?</h2><p>Plato wouldn&#8217;t recognize this Platonism. Plato&#8217;s forms are transcendent, completely separate from reality. Plato&#8217;s forms are also, crucially, simple, and these are not. Our forms here are in our world, and of it. Perhaps most unusually, you cannot meaningfully have just one of them. They are always part of a system for representing many things, and can only be part of a representation of the world as a whole.</p><p>Because these represent our world, and there is only one world, there is only one true map. All other maps only capture parts of it. As we have only finite information and finite resources to process it, these partial maps are what we have. Since they line up in spite of separate origins, they seem to be bits of the true one. If you try to capture any good amount of the world you always approach the same thing.</p><p>This is an empirical Platonism. It&#8217;s a form of Platonism because we do not seem to construct these forms but to discover them. Unlike traditional Platonism, we don&#8217;t discover them by revelation or by thinking for a very long time, but by taking as many of the shadows of the world as we can and forming them into something solid. It is a form you can copy, manipulate, and extend; an artisan&#8217;s form, more than a mystic&#8217;s.</p><p>Because we are also in this world and of it, because we also know the world from the shadows it throws on our perceptions, if there is one true map we each hold pieces of it, too.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>https://bsky.app/profile/did:plc:oyvkkjjjxqnsiqy5r2zko57h/post/3lps3mm55rs2y</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This property is probably why language models, or more accurately, text models have seen the most success so far.
They have the most information-dense and least noisy data, and that data has already been heavily filtered from all possible text by the fact that humans once wrote it. This means that it is at least somewhat about things people communicate about.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Bubeck, S&#233;bastien, et al. "<a href="https://arxiv.org/abs/2303.12712">Sparks of Artificial General Intelligence: Early experiments with GPT-4.</a>" <em>arXiv preprint arXiv:2303.12712</em> (2023).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>I expect at least one person to argue with me about it anyway.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Jha, Rishi, et al. "<a href="https://arxiv.org/abs/2505.12540">Harnessing the Universal Geometry of Embeddings.</a>" <em>arXiv preprint arXiv:2505.12540</em> (2025).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>This is different from models which are for generating images, which usually are trained with text so as to translate the text prompt into the image.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Del&#233;tang, Gr&#233;goire, et al. "<a href="https://deepmind.google/research/publications/39768/">Language Modeling Is Compression.</a>" <em>International Conference on Learning Representations (ICLR)</em> (2024). 
</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[The Biggest Statistic About AI Water Use Is A Lie]]></title><description><![CDATA[How did it become the main story?]]></description><link>https://www.verysane.ai/p/the-biggest-statistic-about-ai-water</link><guid isPermaLink="false">https://www.verysane.ai/p/the-biggest-statistic-about-ai-water</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Sun, 08 Jun 2025 23:52:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vEtG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526c6da7-e61b-4090-a62f-e8133e06dc11_714x651.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vEtG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526c6da7-e61b-4090-a62f-e8133e06dc11_714x651.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vEtG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526c6da7-e61b-4090-a62f-e8133e06dc11_714x651.png 424w, https://substackcdn.com/image/fetch/$s_!vEtG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526c6da7-e61b-4090-a62f-e8133e06dc11_714x651.png 848w, https://substackcdn.com/image/fetch/$s_!vEtG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526c6da7-e61b-4090-a62f-e8133e06dc11_714x651.png 1272w, https://substackcdn.com/image/fetch/$s_!vEtG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526c6da7-e61b-4090-a62f-e8133e06dc11_714x651.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vEtG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526c6da7-e61b-4090-a62f-e8133e06dc11_714x651.png" width="714" height="651" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/526c6da7-e61b-4090-a62f-e8133e06dc11_714x651.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:651,&quot;width&quot;:714,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:84600,&quot;alt&quot;:&quot;Infographic from The Washington Post showing water usage comparison. Text reads 'A 100-word email generated by an AI chatbot using GPT-4' at the top, followed by 'Once requires 519 milliliters of water, a little more than 1 bottle' with an illustration of a water bottle to the right. 
The background features a light gray diamond pattern.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.verysane.ai/i/165482415?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526c6da7-e61b-4090-a62f-e8133e06dc11_714x651.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Infographic from The Washington Post showing water usage comparison. Text reads 'A 100-word email generated by an AI chatbot using GPT-4' at the top, followed by 'Once requires 519 milliliters of water, a little more than 1 bottle' with an illustration of a water bottle to the right. The background features a light gray diamond pattern." title="Infographic from The Washington Post showing water usage comparison. Text reads 'A 100-word email generated by an AI chatbot using GPT-4' at the top, followed by 'Once requires 519 milliliters of water, a little more than 1 bottle' with an illustration of a water bottle to the right. The background features a light gray diamond pattern." srcset="https://substackcdn.com/image/fetch/$s_!vEtG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526c6da7-e61b-4090-a62f-e8133e06dc11_714x651.png 424w, https://substackcdn.com/image/fetch/$s_!vEtG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526c6da7-e61b-4090-a62f-e8133e06dc11_714x651.png 848w, https://substackcdn.com/image/fetch/$s_!vEtG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526c6da7-e61b-4090-a62f-e8133e06dc11_714x651.png 1272w, https://substackcdn.com/image/fetch/$s_!vEtG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526c6da7-e61b-4090-a62f-e8133e06dc11_714x651.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This claim is a lie. 
It ran in The Washington Post, in an article provocatively titled &#8220;A bottle of water per email: the hidden environmental costs of using AI chatbots&#8221;. ChatGPT almost certainly does not consume a bottle of water when writing one email and never has. Those cited as the authority for this claim are well informed enough to know it isn&#8217;t true. They either deliberately lied to a newspaper with millions of readers or allowed that newspaper to claim their authority for this statement without issuing a correction or qualification of any kind.</p><h2>Why it matters</h2><p>Being correct matters. If you are wrong about something, you will have a much harder time changing it. </p><p>LLMs replying to users simply do not use up that much water. Some water is sometimes used for cooling, but this is negligible. Most of the water attributed to LLMs is used up for generating power, because power generation requires water. Querying an LLM generally uses up less power, and therefore less water, than making toast or leaving one of your lights on for a few minutes.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>Focusing on the behavior of LLM users is counterproductive for reining in the environmental impacts of AI. Both because AI companies are large, and because some of them make a point of <a href="https://www.selc.org/news/elon-musks-xai-facility-is-polluting-south-memphis/">behaving unethically</a> to move faster, they can have substantial bad effects. Focusing on permitting and enforcement, especially around power generation, could mitigate these impacts. Focusing on whether or not specific people are using an LLM is extremely unlikely to ever help.</p><p>This claim about water use has been republished in dozens of other outlets. It is probably the most influential single statistic when talking about AI&#8217;s impact on the environment. Anyone who believes it is true will be trying to solve a problem that doesn&#8217;t exist.</p><p>If you want to understand the power and water use of AI or LLMs in more detail, I would recommend <a href="https://andymasley.substack.com/s/ai-and-the-environment">Andy Masley&#8217;s writing about AI's environmental impact</a> or <a href="https://www.technologyreview.com/supertopic/ai-energy-package/">The MIT Technology Review&#8217;s series on the subject</a>.</p><p>Power generation is an important point to pay attention to when we are contemplating grid expansion and opening new power plants for some of these companies. I am very grateful that those outlets are keeping track of this in detail because it prevents me from feeling like I should do so.</p><h2>Why it&#8217;s a lie</h2><p>AI as a whole uses up enough energy and sometimes water to be worth keeping track of, but generally does not use up an absurd amount of it yet. Querying ChatGPT or any other LLM to write an email uses up almost no energy or water whatsoever.</p><p>Awkwardly, the Washington Post does not publish the reasoning for their headline, nor do any of the other media sources covering this claim. 
They do publish <a href="https://arxiv.org/abs/2304.03271">a link to a paper by the researchers they are working with</a>, and from that and other media quotes by those researchers we can try to figure out how you could possibly arrive at 519ml of water per email.</p><p>For a worst-case estimate using the paper&#8217;s assumptions, if</p><ul><li><p>you query ChatGPT 10 times per email,</p></li><li><p>you include water used to generate electricity,</p></li><li><p>the datacenter hosting it is in the state of Washington,</p></li><li><p>the datacenter uses the public power grid or something close to it,</p></li><li><p>water evaporated from hydroelectric power reservoirs could otherwise have been used productively for something other than power generation,</p></li><li><p>and LLMs were not more efficient when they were being sold for profit in 2024 than they were in 2020 when they had never been used by the public,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p></li></ul><p>then it is true that an LLM uses up 500 or more milliliters of water per email.</p><p>You can reach a similar estimate by different methods, since they break out the water use per state differently. For example, if the datacenter hosting ChatGPT is not in Washington, it will have a higher carbon footprint but a lower water footprint and you will have to query it 30 or 50 times to use up an entire bottle of water. This is not what anyone imagines when they hear &#8220;write a 100-word email&#8221;.</p><p>That study&#8217;s authors are well aware that none of these assumptions are realistic. Information about how efficient LLMs are when they are served to users is publicly available. People do not generally query an LLM fifty times to write a one hundred word email.</p><p>It is completely normal to publish, in an academic context, a worst-case estimate based on limited information or to pick assumptions which make it easy to form an estimate. In this setting your audience has all the detail necessary to determine if your worst-case guess seems accurate, and how to use it well.</p><p>Publishing a pessimistic estimate that makes this many incorrect assumptions in a newspaper of record with no further detail is just lying to readers.</p><p>Even if the figure were true, the reporting is incredibly misleading. It fails to note that most of the claimed water use is water used for power generation. &#8220;AI is using up too much power&#8221; is not nearly as interesting a headline: people can compare how much power you&#8217;re saying it takes to write an email to their toaster or their PC (both use more). People often do not know that electricity generation uses up water. Presenting water use statistics without clarifying that they are from electricity generation is incredibly confusing if you don&#8217;t know that.</p><p>Comparisons of resource usage to farming, here and in other articles citing the same researchers, consistently underplay the impact of farming on water availability. Perversely, this seems to exonerate the farming practices actually causing crises in water-poor regions in pursuit of scoring extra points against LLMs.</p><p>If someone cares about the environment, these things seem like table stakes. You should avoid lying in the newspaper of record. You should especially avoid doing this in a way that blames customers instead of corporations or that blames entirely the wrong category of business for a major environmental problem. 
We aren&#8217;t going to get any meaningful change if we lead people to solve the wrong problem.</p><h2>Why it has spread</h2><p>This still leaves me with an itch about this specific claim. How did it become the dominant story in spite of being a complete lie?</p><p>The article is well-written and has very good graphics for laying out its data. It is persuasive and millions of people will have read it. Even if people do not read it, the headline, &#8220;a bottle of water per email&#8221;, makes sure everyone gets the message. Everyone will know that the Washington Post is claiming that ChatGPT uses up a bottle of water per email.</p><p>The article, compellingly, centers on the morality of the customer&#8217;s actions. You, the end user, are held responsible for consuming half a liter of water every time you use an LLM. If this were true, you could, clearly, have a meaningful impact by boycotting ChatGPT. It would also be important to try to prevent other people from using ChatGPT, since they are directly responsible for using up a lot of water.</p><p>Personal moral choices make for a compelling story. They are also, unfortunately, a very good way to deflect attention from the business to the customer. Passing moral judgement on people we know or talk to is satisfying in a way that criticizing a business is not. It is more difficult to be morally outraged at a corporation.</p><h2>Why would you lie about this?</h2><p>In short? For attention.</p><p>Newspapers at large seem to prefer negative coverage of tech companies. Negative coverage generates a lot of attention, and we are, famously, living in an attention economy. (Also, the tech companies frequently deserve it.)</p><p>When covering a story about two or more people, it is a normal and expected journalistic practice to at least contact everyone involved for comment. When covering a story about a technology, it is apparently considered acceptable to consult one &#8220;expert&#8221;. If this person lies to you, or is willing to let you lie under their name, you can publish the story. You will probably get more attention on the story if what they tell you is inflammatory, so you have good reason to seek out inflammatory experts.</p><p>Experts, in turn, are also part of an attention economy. If they work within academia, their prominence within their field depends upon how important their work is perceived as being. If they do not, their income depends directly on their reputation among potential clients or on their ability to attract subscriptions.</p><p>Quite possibly everyone involved thought they were doing good work. You could maybe argue that even if the headline isn&#8217;t true and the numbers are made up, the story still helps to raise awareness about environmental problems. This seems like a weak justification. 
We would most likely be better off on this issue if they had said nothing about it at all.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This is true even if the <a href="https://arxiv.org/abs/2304.03271">the paper I am criticizing</a> is correct to estimate at 0.004 kWh, is even true at their pessimistic estimate of 0.016 kWh, and is extremely true if their estimate is high, which it definitely appears to be.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>You can find Andy Masley trying to make sense of these claims about how much power an LLM uses up <a href="https://andymasley.substack.com/p/replies-to-criticisms-of-my-chatgpt?open=false#%C2%A7the-washington-postshaolei-ren-massive-outlier">here</a>.</p></div></div>]]></content:encoded></item><item><title><![CDATA[AI History in Quotes]]></title><description><![CDATA[Each of these presents the clearest, earliest, or most-cited statement of a specific idea in AI.]]></description><link>https://www.verysane.ai/p/ai-history-in-quotes</link><guid isPermaLink="false">https://www.verysane.ai/p/ai-history-in-quotes</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Sat, 07 Jun 2025 17:00:59 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9e663c1a-1e35-4984-876f-3a212b80bd7e_2439x3504.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Each of these presents the clearest, earliest, or most-cited statement of a specific idea in AI. Where those three things conflict I have chosen whichever quote I liked the most. They are mostly or all big-picture concerns, and should hopefully seem at least interesting even if not correct in a timeless way.</p><blockquote><p>The Analytical Engine has no pretensions whatever to <em>originate</em> anything. It can do whatever we <em>know how to order it</em> to perform. It can <em>follow</em> analysis; but it has no power of <em>anticipating</em> any analytical relations or truths. Its province is to assist us in making available what we are already acquainted with.</p></blockquote><p>Ada Lovelace (1843), <em><a href="https://web.archive.org/web/20250530031527/https://www.fourmilab.ch/babbage/sketch.html#:~:text=A.%20A.%20L.-,Note%20G,-It%20is%20desirable">Sketch of the Analytical Engine, Note G</a></em></p><blockquote><p>All these are very crude steps in the direction of a systematic theory of automata. They represent, in addition, only one particular direction. This is, as I indicated before, the direction towards forming a rigorous concept of what constitutes &#8220;complication.&#8221; They illustrate that &#8220;complication&#8221; on its lower levels is probably degenerative, that is, that every automaton that can produce other automata will only be able to produce less complicated ones. There is, however, a certain minimum level where this degenerative characteristic ceases to be universal. At this point automata which can reproduce themselves, or even construct higher entities, become possible. 
This fact, that complication, as well as organization, below a certain minimum level is degenerative, and beyond that level can become self-supporting and even increasing, will clearly play an important role in any future theory of the subject.</p></blockquote><p>John Von Neumann (1948), <a href="https://archive.is/Wt2tz">The General and Logical Theory of Automata</a></p><blockquote><p>The chess machine is an ideal one to start with, since: (1) the problem is sharply defined both in allowed operations (the moves) and in the ultimate goal (checkmate); (2) it is neither so simple as to be trivial nor too difficult for satisfactory solution; (3) chess is generally considered to require &#8220;thinking&#8221; for skilful play; a solution of this problem will force us either to admit the possibility of a mechanized thinking or to further restrict our concept of &#8220;thinking&#8221;; (4) the discrete structure of chess fits well into the digital nature of modern computers.</p></blockquote><p>Claude Shannon (1949), <a href="https://web.archive.org/web/20250519075435/https://vision.unipv.it/IA1/ProgrammingaComputerforPlayingChess.pdf">Programming a Computer for Playing Chess</a></p><blockquote><p>It was suggested tentatively that the question, "Can machines think?" should be replaced by "Are there imaginable digital computers which would do well in the imitation game?"<br>[...]<br>The original question, "Can machines think?" I believe to be too meaningless to deserve discussion. Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted. I believe further that no useful purpose is served by concealing these beliefs. The popular view that scientists proceed inexorably from well-established fact to well-established fact, never being influenced by any improved conjecture, is quite mistaken. Provided it is made clear which are proved facts and which are conjectures, no harm can result. Conjectures are of great importance since they suggest useful lines of research.</p></blockquote><p>Alan Turing (1950), <em><a href="https://web.archive.org/web/20250530002831/https://courses.cs.umbc.edu/471/papers/turing.pdf">Computing Machinery and Intelligence</a></em></p><blockquote><p>We propose that a 2-month, 10-man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. 
An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves.</p></blockquote><p>John McCarthy et al, (1955) <em><a href="https://web.archive.org/web/20250528124924/https://raysolomonoff.com/dartmouth/boxa/dart564props.pdf">Dartmouth Workshop Proposal</a></em></p><blockquote><p>Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort.</p></blockquote><p>Arthur Samuel (1959), <em><a href="https://web.archive.org/web/20250328090837/https://people.csail.mit.edu/brooks/idocs/Samuel.pdf">Some Studies in Machine Learning Using the Game of Checkers</a></em></p><blockquote><p>If we use, to achieve our purposes, a mechanical agency with whose operation we cannot efficiently interfere once we have started it, because the action is so fast and irrevocable that we have not the data to intervene before the action is complete, then we had better be quite sure that the purpose put into the machine is the purpose which we really desire and not merely a colorful imitation of it</p></blockquote><p>Norbert Wiener (1960), <em><a href="https://web.archive.org/web/20250000000000*/https://www.cs.umd.edu/users/gasarch/BLOGPAPERS/moral.pdf">Some Moral and Technical Consequences of Automation</a></em></p><blockquote><p>Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultra-intelligent machine could design even better machines; there would then unquestionably be an "intelligence explosion," and the intelligence of man would be left far behind (see for example refs. [22], [34], [44]). Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control. It is curious that this point is made so seldom outside of science fiction. It is sometimes worthwhile to take science fiction seriously.</p></blockquote><p>I.J. Good (1965), <em><a href="https://web.archive.org/web/20010527181244/http://www.aeiveos.com/~bradbury/Authors/Computing/Good-IJ/SCtFUM.html">Speculations Concerning the First Ultraintelligent Machine</a></em></p><blockquote><p>Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.</p></blockquote><p>Charles Goodhart, (1975), <a href="https://web.archive.org/web/20250415212440/https://www.researchgate.net/publication/253797490_Goodhart's_Law_Its_Origins_Meaning_and_Implications_for_Monetary_Policy">Problems of monetary management: the U.K. experience</a> "Goodhart's Law"</p><blockquote><p>Encoded in the large, highly evolved sensory and motor portions of the human brain is a billion years of experience about the nature of the world and how to survive in it. The deliberate process we call reasoning is, I believe, the thinnest veneer of human thought, effective only because it is supported by this much older and much more powerful, though usually unconscious, sensorimotor knowledge. We are all prodigious olympians in perceptual and motor areas, so good that we make the difficult look easy. Abstract thought, though, is a new trick, perhaps less than 100 thousand years old. We have not yet mastered it. 
It is not all that intrinsically difficult; it just seems so when we do it.</p></blockquote><p>Hans Moravec (1988), <em>Mind Children: The Future of Robot and Human Intelligence</em><br>Commonly known as "Moravec's Paradox", often paraphrased as &#8220;the hard things are easy and the easy things are hard&#8221;.</p><blockquote><p>I think it's fair to call this event a singularity ("the<br>Singularity" for the purposes of this paper). It is a point where our<br>models must be discarded and a new reality rules. As we move closer<br>and closer to this point, it will loom vaster and vaster over human<br>affairs till the notion becomes a commonplace. Yet when it finally<br>happens it may still be a great surprise and a greater unknown. In<br>the 1950s there were very few who saw it: Stan Ulam [27] paraphrased<br>John von Neumann as saying:</p><blockquote><p>One conversation centered on the ever accelerating progress of<br>technology and changes in the mode of human life, which gives the<br>appearance of approaching some essential singularity in the<br>history of the race beyond which human affairs, as we know them,<br>could not continue.</p></blockquote></blockquote><p>Vernor Vinge (1994), <em><a href="https://web.archive.org/web/20250102233755/https://edoras.sdsu.edu/~vinge/misc/singularity.html">The Coming Singularity</a></em></p><blockquote><p>This compression contest is motivated by the fact that being able to compress well is closely related to acting intelligently. In order to compress data, one has to find regularities in them, which is intrinsically difficult (many researchers live from analyzing data and finding compact models). So compressors beating the current "dumb" compressors need to be smart(er). Since the prize wants to stimulate developing "universally" smart compressors, we need a "universal" corpus of data. Arguably the online lexicon Wikipedia is a good snapshot of the Human World Knowledge. So the ultimate compressor of it should "understand" all human knowledge, i.e. be really smart. enwik8 is a hopefully representative 100MB extract from Wikipedia.</p></blockquote><p>Marcus Hutter (2006), <em><a href="https://web.archive.org/web/20060821073034/http://prize.hutter1.net/">The Hutter Prize</a></em></p><blockquote><p>The Orthogonality Thesis<br>Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.</p><p>The Instrumental Convergence Thesis<br>Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent&#8217;s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents.</p></blockquote><p>Nick Bostrom, (2012), <a href="https://web.archive.org/web/20250505064933/https://nickbostrom.com/superintelligentwill.pdf">The Superintelligent Will</a><br>Both commonly referred to by name, with or without the term 'thesis'.</p><blockquote><p>One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. 
The two methods that seem to scale arbitrarily in this way are search and learning.</p></blockquote><p>Rich Sutton, (2019), <em><a href="https://web.archive.org/web/20250501195604/http://www.incompleteideas.net/IncIdeas/BitterLesson.html">The Bitter Lesson</a></em></p><blockquote><p>The strong scaling hypothesis is that, once we find a scalable architecture like self-attention or convolutions, which like the brain can be applied fairly uniformly (eg. &#8220;The Brain as a Universal Learning Machine&#8221;&#8288; or Hawkins), we can simply train ever larger NNs and ever more sophisticated behavior will emerge naturally as the easiest way to optimize for all the tasks &amp; data. More powerful NNs are &#8216;just&#8217; scaled-up weak NNs, in much the same way that human brains look much like scaled-up primate brains&#8288;.</p></blockquote><p>Gwern Branwen, (2020), <em><a href="https://archive.is/F8fpB#scaling-hypothesis">The Scaling Hypothesis</a></em></p><blockquote><p>Here we will explore emergence with respect to model scale, as measured by training compute and number of model parameters. Specifically, we define emergent abilities of large language models as abilities that are not present in smaller-scale models but are present in large-scale models; thus they cannot be predicted by simply extrapolating the performance improvements on smaller-scale models (&#167;2).</p></blockquote><p>Wei et al, (2022), <a href="https://web.archive.org/web/20250525220208/https://arxiv.org/abs/2206.07682">Emergent Abilities of Large Language Models</a></p><blockquote><p>What this manifests as is &#8211; trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point. Sufficiently large diffusion conv-unets produce the same images as ViT generators. AR sampling produces the same images as diffusion.</p><p>This is a surprising observation! It implies that model behavior is not determined by architecture, hyperparameters, or optimizer choices. It&#8217;s determined by your dataset, nothing else. Everything else is a means to an end in efficiently delivery compute to approximating that dataset.</p><p>Then, when you refer to &#8220;Lambda&#8221;, &#8220;ChatGPT&#8221;, &#8220;Bard&#8221;, or &#8220;Claude&#8221; then, it&#8217;s not the model weights that you are referring to. It&#8217;s the dataset.</p></blockquote><p>James Betker (2023), <em><a href="https://web.archive.org/web/20250324194221/https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/">The &#8220;it&#8221; in AI models is the dataset.</a></em></p><blockquote><p><strong>Ilya:</strong> I challenge the claim that next token prediction cannot surpass human performance. It looks like on the surface it cannot&#8212;it looks on the surface if you just learn to imitate, to predict what people do, it means that you can only copy people. But here is a counter-argument for why that might not be quite so:</p><p>If your neural net is smart enough, you just ask it like, "What would a person with great insight and wisdom and capability do?" 
Maybe such a person doesn't exist, but there's a pretty good chance that the neural net will be able to extrapolate how such a person should behave.</p><p>Do you see what I mean?</p><p><strong>Dwarkesh:</strong> Yes, although where would we get the sort of insight about what that person would do, if not from the data of regular people?</p><p><strong>Ilya:</strong> Because if you think about it, what does it mean to predict the next token well enough? What does it mean actually? It's actually a much deeper question than it seems.</p><p>Predicting the next token well means that you understand the underlying reality that led to the creation of that token. It's not statistics&#8212;like, it is statistics, but what is statistics?</p><p>In order to understand those statistics, to compress them, you need to understand what is it about the world that creates those statistics. And so then you say, "Okay, well I have all those people. What is it about people that creates their behaviors?"</p><p>Well, they have thoughts and they have feelings and they have ideas, and they do things in certain ways. All of those could be deduced from next token prediction.</p><p>And I'd argue that this should make it possible&#8212;not indefinitely, but to a pretty decent degree&#8212;to say, "Well, can you guess what you would do if you took a person with this characteristic and that characteristic?"</p><p>Like, such a person doesn't exist, but because you're so good at predicting the next token, you should still be able to guess what that person would do&#8212;this hypothetical, imaginary person with far greater mental ability than the rest of us.</p></blockquote><p>Ilya Sutskever (2023), <a href="https://www.youtube.com/watch?v=YEUclZdj_Sc">The Dwarkesh Podcast - Why next-token prediction is enough for AGI</a></p>]]></content:encoded></item><item><title><![CDATA[What Is AI?]]></title><description><![CDATA[This is an important question because AI is, currently, important.]]></description><link>https://www.verysane.ai/p/what-is-ai</link><guid isPermaLink="false">https://www.verysane.ai/p/what-is-ai</guid><dc:creator><![CDATA[SE Gyges]]></dc:creator><pubDate>Wed, 04 Jun 2025 13:30:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1efdd646-743e-4a85-8fb8-9dc39008be39_800x1067.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is an important question because AI is, currently, important. If you understand what AI is, you understand what is happening and why it is happening much more clearly than you otherwise would. This is intended as a high-level overview of what the term means, where the field comes from, and what we are doing with it now.</p><h2><strong>Definition</strong></h2><p>Artificial Intelligence (AI) is about getting computers to think.</p><p>Part of the process is deciding what we mean by "think".</p><p>Deciding what we mean by "think" turns out to be somewhat difficult, and at least some of the people involved in AI have spent a lot of effort on this question.</p><p>This concern is not new.
Claude Shannon, an early pioneer in computer logic among other things, wrote this in 1949 about computer chess:</p><blockquote><p>chess is generally considered to require "thinking" for skilful[sic] play; a solution of this problem will force us either to admit the possibility of a mechanized thinking or to further restrict our concept of "thinking"<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p></blockquote><p>It took a little under fifty years, but he was right. We did change what we meant by "thinking" once computers were better than humans at chess.</p><p>It turns out thinking isn't chess. Chess might be a type of thinking. You, as a human, might have to think to play chess. Humans who are better at thinking might be better at chess. However, there are things that play chess very well and that do not seem to think in any other way.</p><p>This was actually very surprising for some people. We have had a lot of surprises like this, and they all seem to rhyme. We seem to believe that thinking is one big thing. When we get computers to do things that humans have to think hard to do, they usually look like a lot of small things.</p><p>Anyway, thinking isn't chess. We had to make a good chess computer before everyone was sure of that.</p><p>In 1950 Alan Turing, often called the father of theoretical computer science, published a paper<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> that includes what we now call the Turing Test. Inspired by a game about seeing if men could pass as women or vice versa while passing notes, he had the idea to check if a computer could pass as a human entirely in writing. He argues in some detail that computers can, in principle, think, and that instead of wondering about this we should simply test whether they can pass for human.</p><p>We can generalize this into two proposals for what we mean by "think":</p><ol><li><p>"Thinking" is using language as well as a human can</p></li><li><p>"Thinking" is being able to do everything as well as a human can.</p></li></ol><p>I think the second definition is more interesting. It's implied by the test in two ways. First, if something has some gap where its understanding is not as good as a human's, it will probably answer some questions in that area incorrectly in a way that gives it away. Second, testing a computer against a human in general is, in the end, checking if they can do things humans can do.</p><p>This is the working definition that, in practice, people in AI seem to use. So:</p><ol><li><p>AI is about getting computers to think.</p></li><li><p>Thinking is what humans can do.</p></li><li><p>Computers will be thinking if they are able to do everything humans can do.</p></li></ol><p>We can look at the programs we actually have and see that there's a bit more. We didn't get computers to play chess as well as humans; we got them to play chess much better than humans. We went right past human-level chess. We did the same thing for other board games like Go, too. <em>LLMs</em>, the most famous of which are those used by ChatGPT, are a very specific type of AI. They are error-prone and not generally superhuman, but they generally have much more memorized than any human could due to training on large amounts of data from the internet. Most of them have more or less memorized wikipedia, for example. 
Some of them have thousands of digits of pi memorized, and they aren't even intended to do that.</p><p>Some of the things that current AI programs can do really do not have any human analog. Image generators are not making images in the same way a human would make art, one piece at a time; they are more or less creating images from nowhere in almost no time. Making completely fake video that looks real is also not something humans can really do.</p><p>This gives us:</p><ol start="4"><li><p>Once computers are as good as people are at something, sometimes they can get much better than that.</p></li><li><p>Sometimes the tools we use for AI enable computers to do things that no human can do.</p></li><li><p>These are both part of AI, too.</p></li></ol><p>This covers the <strong>what</strong> of AI. This broad definition covers AI as a field. Everything else is <strong>how</strong>, either the history or what is being done now. Since we care mostly about now, our history will be short and it will follow only the lines that lead to the bleeding edge today.</p><h2><strong>Brief History</strong></h2><blockquote><p>The Analytical Engine has no pretensions whatever to <em>originate</em> anything. It can do <em>whatever we know how to order it</em> to perform. It can follow analysis; but it has no power of anticipating any analytical relations or truths.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p></blockquote><p>This is Ada Lovelace, the first computer programmer, saying in 1843 that artificial intelligence cannot be done. It is a single stray thought while she is busy inventing the concept of a computer program. She does all of this in a very long footnote to a translation, likely because she cannot publish independently under her own name. We include her comment here because she is very cool and because later work on AI made a point of examining this line. It is possibly the most scrutinized single statement in the field of computer science as a whole. The computer itself was never finished because the money ran out.</p><p>It is a hundred years before this thread is picked up. There are several breakthroughs in the 40s and 50s that establish the agenda in AI that we are still following today.</p><p>We have already met Alan Turing. We have already mentioned his 1950 paper, which answers Ada Lovelace's comment and establishes the idea of testing computers against humans. The term "artificial intelligence" is not yet in use, but he defines its goal: making machines think. It is likely the most important single paper for the field as a whole.</p><p>Alan Turing worked on cryptanalysis during the Second World War. Before that, he came up with the most general mathematical way we define computers, and we still call our most abstract model a Turing Machine. In 1952 he was prosecuted for being gay and forced to take hormones by court order. In 1954 he killed himself with a poisoned apple at the age of 41. For this reason, we do not have any further work by Turing on AI.</p><p>Our second set of breakthroughs is in attempts to study and imitate neurons, the nerve and brain cells of humans and animals.
In 1943 we have our first serious mathematical model of neurons, and of how they might learn.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> In 1958 we get the <em>perceptron</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>, which is the first very clear example of what we now call a <em>neural network</em>. It is not meant to be a realistic model of a neuron, but to solve a problem: recognizing patterns.</p><p>From this point on, the neural networks in AI and the study of actual neurons that exist in real humans diverge. There are isolated exceptions: the <em>convolutional neural network</em> commonly used for computer vision now is inspired by research into actual neurons in the visual cortex. This is rare and mostly one way, with inspiration coming from neuroscience to AI and not the reverse.</p><p>Our last major ingredient comes from Claude Shannon. Shannon, like Turing, worked on cryptanalysis during the war. He also worked on antiaircraft gun control. We met him earlier, talking about computer chess, which he publishes a paper about in 1949. This establishes chess as the main puzzle for people working in AI, and the idea that chess is a good place to examine how thinking works.</p><p>As Turing defined the computer, Shannon defines <em>information</em>. Information resolves uncertainty; the more information you have, the less uncertain you are. He calls this uncertainty <em>entropy</em>, and measures it in a quantity he names the <em>bit</em>. These are the ones and zeroes that (hopefully) everyone knows computers run on today. In this connection he is the father of the concept of <em>compression</em>, which is fundamental to much of our current AI. Compression is reducing how many bits you have to use for the same thing.</p><p>In 1951, Shannon publishes a paper called "Prediction and Entropy of Printed English".<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> Its abstract is this:</p><blockquote><p>A new method of estimating the entropy and redundancy of a language is described. This method exploits the knowledge of the language statistics possessed by those who speak the language, and depends on experimental results in prediction of the next letter when the preceding text is known. Results of experiments in prediction are given, and some properties of an ideal predictor are developed.</p></blockquote><p>The "ideal predictor" Shannon describes is the first language model. By <em>model</em> we mean some math or a computer program for representing something, and here we model written English. Shannon's original language model predicts the most likely next letter based upon all preceding letters, or&#8212;equivalently&#8212;compresses printed English. His formulation is pure math. You cannot actually have a computer do what he describes, because it requires statistics for every possible preceding text, and there are exponentially many possibilities. Every language model since then is finding a more efficient way to approximate his ideal model.</p><p>From here, the history gets a lot faster. The name "artificial intelligence" comes from a conference in 1956, attended by, among other people, Shannon. Its proposal is, in part, to</p><blockquote><p>[...]
proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. [...]<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p></blockquote><p><em>Machine learning</em> is coined in 1959 with a paper titled "Some Studies in Machine Learning Using the Game of Checkers".<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> This becomes the blanket term for any system which learns, either from trial and error or from input data. It borrows many techniques from statistics, and has been very successful. It is the type of system we use now.</p><p>Lots of work that was considered to be part of AI in the ensuing years is not an ancestor to anything we call AI now. Those projects often produced very interesting results, including huge parts of what makes computers as useful as they are now, but if they didn't use neural networks or machine learning approaches they are off our path here. Their main contribution to contemporary AI research is documenting a lot of things that do not work.</p><p>There was also a lot of work concerning neural networks which is ancestral to what we have now. We are trying to take the shortest path we can to the present, so we will omit most of these for brevity. We will skip to 1986, when Rumelhart, Hinton, et al<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> added multiple layers to a neural network and trained it on their data by having the network reduce its errors. We call this method of training <em>backpropagation</em>, and the multi-layer design a <em>multi-layer perceptron</em>. By doing this, they showed that neural networks were general-purpose and able to do anything that the computer itself could. Prior to this period, neural networks were often considered to be too limited to solve many problems, even in principle.</p><p>In 1993 Yoshua Bengio and Yann LeCun demonstrated the first workable system for reading handwriting.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> We mentioned its design earlier; it uses a <em>convolutional neural network</em> inspired by the visual systems of animals. Some descendant of this program currently routes the mail and handles mobile check deposit. One of its other descendants is responsible for most other uses of AI for vision, like unlocking your phone with your face and organizing your pictures so you can search through them.</p><p>Our last piece of important history is in 2012, with a program called AlexNet.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> Two of Hinton's students, Alex Krizhevsky and Ilya Sutskever, entered into a challenge for image recognition. They have two major innovations that are crucial to everything that follows. One is that they made their neural network much bigger than they had been before, giving it a total of 60 million parameters. 
Parameters are just numbers: each one specifies how much one "neuron" of the network contributes to another. The other innovation is that they ran AlexNet on a GPU, the part of a computer used for graphics. It turns out GPUs are extremely efficient at doing the calculations neural networks use.</p><h2><strong>Present Day</strong></h2><p>AlexNet marks the beginning of modern AI, which is characterized by <em>deep learning</em>. Cutting-edge neural networks now are defined by their size. Parameter counts have gone from the 60 million in AlexNet to many billions and occasionally trillions. To train them well requires a proportionate increase in the amount of data. Their budget for computer hardware has increased proportionately to both of those factors, using thousands of GPUs at once. There have also been attempts to distinguish recent innovations as "generative AI", but this is mostly a marketing category and things are clearer if we avoid using that term.</p><p><em>Large language models</em>, or <em>LLMs</em>, are the current major success story. Either LLMs or image generators are what most people will think of when they think of "AI". LLMs are correctly distinguished from their predecessors like autocomplete primarily by their size. They apply high-scale neural networks to Shannon&#8217;s task of modelling written language. That making neural networks bigger consistently makes them better is sometimes called <em>the scaling hypothesis</em>. The scaling hypothesis certainly seems to have held for language models between 2019 and 2023, corresponding to the dates when OpenAI published GPT-2 (1.5 billion parameters) and when it began selling access to GPT-4 (rumored above a trillion parameters). Their scale, primarily, has made these models much more capable and general-purpose than all earlier attempts to model language. They are measurably much better at tasks like translation and choosing correct answers on multiple-choice tests, and they are subjectively much more capable of following a conversation.</p><p>It is important to distinguish between the LLM itself, which is a neural network, and the service or app attached to the LLM. ChatGPT is a service: you go to a web site or app, you may or may not pay a subscription, and you can start a chat with an LLM. There can be one or many different LLMs involved, and the overall service can include many additional features like web search or image generation. Generally the LLM can only output text, and these extra features are something else. GPT-3 and GPT-4 are LLMs, and were once offered via ChatGPT but have been replaced since. ChatGPT currently features somewhere above a half dozen different LLMs.</p><p>A present-day LLM is trained in two phases. In the first, commonly called <em>pretraining</em>, it is given a large amount of text and trained to guess what comes next at each point. This is a "base" model, and base models are not commonly used or offered as products. In the second, called <em>finetuning</em> or <em>post-training</em>, it is modified to extend the length of input it can take and given a set of expected behaviors, generally suitable for use as a chatbot. Techniques in both pretraining and finetuning continue to improve.</p><p>There are attempts at <em>multimodal</em> LLMs that can also take direct input or provide output in text, audio, or video, but these are, so far, a relatively niche concern. To date, only some commonly-used LLMs can directly take image input.
Many of them have a completely separate system from the LLM for reading in images and making them into text, but this is not always obvious to the user. There are theories in the industry both for and against prioritizing multimodality, with Anthropic appearing to strongly favor a text-only approach and OpenAI having strong advocates for multimodality.</p><p>Purely as objects of study, LLMs have done for language something like what a previous generation of programs did for board games. It is more or less clear, currently, that language isn't thinking, or at least not all of it. You can have something that uses language competently, that is superhuman in some ways, and that is notably unable to do a lot of what we mean by "thinking". Thinking and language certainly have meaningful overlap, and the text-only approach to AI rests on the belief that this overlap is enough. We can now do things that were considered impossible for generations, and often more besides, and yet the project is clearly not complete.</p><p>This very briefly covers what an LLM is, why LLMs are an important breakthrough, and where progress is happening now. We will avoid, for now, discussing the broad impacts of the technology on society except to say that they seem fairly notable.</p><p>For the future development of AI, one of the most notable impacts of LLMs is that they have generated intense interest from corporations, from their investors, and from the leaders of various countries. Because scaling the models up often delivers much better results, extremely high budgets can be justified. Every major technology company has made a point of trying to carve out a part of the business of AI, and billions or trillions of dollars can be gained or lost in stock value based on perceptions of their research programs. Governments have invested untold amounts of money and influence in trying to make sure their countries remain relevant, and NVIDIA, which has an effective monopoly on high-end GPUs for training and running neural networks, is currently valued at about 3.3 trillion dollars.</p><p>Other branches of AI, while relatively neglected, also benefit. Anything made with a neural network and trained with backpropagation currently is on the same family tree as an LLM. Models trained more recently and with larger parameter counts overwhelmingly benefit from improvements in hardware and code, and what they can do has massively improved in recent years. Models exist that can take as input text, audio (including speech and music), images, and video. There are also models that can produce all of those as output. In other cases, they can take some combination of those and sensor measurements as input and can output control signals for machines like cars, weapons, factory equipment and fusion reactors.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a></p><p>One of the core propositions of the major AI research companies is the promise of <em>Artificial General Intelligence</em>, or <em>AGI</em>. What this term means specifically is the subject of so much scrutiny that it is difficult to settle on a universal definition. One company in particular defines AGI, for purposes of one of their contracts, as when they make one hundred billion dollars in profit. This definition seems like it could only have come out of intense haggling, and neither that definition nor many others commonly argued for seem very useful.
We might tentatively say that AGI once meant "human-level AI", and accept that this leaves some remaining ambiguity.</p><p>Being able to have a computer do things that currently only a human can do has serious implications for the job market and the economy, and for this reason the promise of AGI specifically is an animating force for investors. They mostly seem to see the chance to fire employees and replace them with much cheaper chatbots or robots. At the very least, no company wants to be less efficient than its competitors. This combination of greed and fear is sufficient to compel nearly everyone, everywhere in the business world, to keep up with AI. There is a similar dynamic with governments, which have increasingly come to view AI as an actual or potential military technology that they are either eager to lead on or afraid of falling behind on. Comparisons to the nuclear arms race are common.</p><p>Concerns that artificial intelligence could replace or kill humanity are also common. These concerns are not new, and were raised among early AI pioneers. Turing, for example, raised the possibility in lectures and broadcasts in 1951.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a> More recently, Geoffrey Hinton, who was involved in both the 1986 backpropagation paper and AlexNet in 2012, retired in 2023 and gave a series of interviews to talk about the dangers of AI.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a> He received the Nobel Prize in Physics in 2024 and continued talking about it. One of AlexNet's other architects, Ilya Sutskever, went on to co-found OpenAI. He and several of OpenAI's other founders have expressed concern about existential risk and AI safety, which are euphemisms for &#8220;everyone dying because of AI&#8221;. Hinton and Sutskever, along with most other prominent people in AI, are signatories to a one-line letter that reads:</p><blockquote><p>Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a></p></blockquote><p>This leaves us in a strange place. AI is very nearly the founding purpose of computer science as a whole. Automation, more broadly, is very nearly the entire point of technology. Machines can do more work, and that means humans do not have to. Machines can do things humans can't do at all. The pull of the future is that things can get better, that we want to be in a world where we have more choices about what to be and to do. Yet the drive for progress here, or at least what's moved the money behind it, is less a pull than a push. Either in spite of the risks or because of them, everyone is moving in the same direction. Companies and countries are being pushed by competitors, investors, and the fear that someone else will get there first.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Shannon, C. E. (1950). Programming a computer for playing chess. <em>Philosophical Magazine</em>, 41(314), 256-275.
<a href="https://web.archive.org/web/20250519075435/https://vision.unipv.it/IA1/ProgrammingaComputerforPlayingChess.pdf">https://web.archive.org/web/20250519075435/https://vision.unipv.it/IA1/ProgrammingaComputerforPlayingChess.pdf</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Turing, A. M. (1950). Computing machinery and intelligence. <em>Mind</em>, 59(236), 433-460. <a href="https://web.archive.org/web/20250530002831/https://courses.cs.umbc.edu/471/papers/turing.pdf">https://web.archive.org/web/20250530002831/https://courses.cs.umbc.edu/471/papers/turing.pdf</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Lovelace, A. (1843). Note G. In <em>Sketch of the Analytical Engine Invented by Charles Babbage</em> by L. F. Menabrea, with notes upon the memoir by the translator. <em>Taylor's Scientific Memoirs</em>, 3, 666-731. <a href="https://web.archive.org/web/20250115143801/https://gutenberg.org/cache/epub/75107/pg75107-images.html">https://web.archive.org/web/20250115143801/https://gutenberg.org/cache/epub/75107/pg75107-images.html</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>McCulloch, W. S., &amp; Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. <em>The Bulletin of Mathematical Biophysics</em>, 5(4), 115-133. <a href="https://doi.org/10.1007/BF02478259">https://doi.org/10.1007/BF02478259</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. <em>Psychological Review</em>, 65(6), 386-408. <a href="https://doi.org/10.1037/h0042519">https://doi.org/10.1037/h0042519</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Shannon, C. E. (1951). Prediction and entropy of printed English. <em>Bell System Technical Journal</em>, 30(1), 50-64. <a href="https://doi.org/10.1002/j.1538-7305.1951.tb01366.x">https://doi.org/10.1002/j.1538-7305.1951.tb01366.x</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>McCarthy, J., Minsky, M. L., Rochester, N., &amp; Shannon, C. E. (2006). A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955. AI Magazine, 27(4), 12. <a href="https://doi.org/10.1609/aimag.v27i4.1904">https://doi.org/10.1609/aimag.v27i4.1904</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Samuel, A. L. (1959). 
Some studies in machine learning using the game of checkers. <em>IBM Journal of Research and Development</em>, 3(3), 210-229. <a href="https://doi.org/10.1147/rd.441.0206">https://doi.org/10.1147/rd.441.0206</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Rumelhart, D. E., Hinton, G. E., &amp; Williams, R. J. (1986). Learning representations by back-propagating errors. <em>Nature</em>, 323(6088), 533-536. <a href="https://doi.org/10.1038/323533a0">https://doi.org/10.1038/323533a0</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Bengio, Y., LeCun, Y., &amp; Henderson, D. (1993). Globally Trained Handwritten Word Recognizer using Spatial Representation, Convolutional Neural Networks, and Hidden Markov Models. Advances in Neural Information Processing Systems 6 (NIPS 1993). <a href="https://web.archive.org/web/20240422021145/https://proceedings.neurips.cc/paper_files/paper/1993/file/3b5dca501ee1e6d8cd7b905f4e1bf723-Paper.pdf">https://web.archive.org/web/20240422021145/https://proceedings.neurips.cc/paper_files/paper/1993/file/3b5dca501ee1e6d8cd7b905f4e1bf723-Paper.pdf</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Krizhevsky, A., Sutskever, I., &amp; Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NeurIPS 2012) (pp. 1097&#8211;1105). <a href="https://web.archive.org/web/20250526023911/https://papers.nips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf">https://web.archive.org/web/20250526023911/https://papers.nips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>Degrave, J, Felici, F, et al, (2022) <a href="https://web.archive.org/web/20250315033337/https://www.nature.com/articles/s41586-021-04301-9">https://web.archive.org/web/20250315033337/https://www.nature.com/articles/s41586-021-04301-9</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>Turing, A. M. 
(1951) <a href="https://web.archive.org/web/20250420104303/https://turingarchive.kings.cam.ac.uk/publications-lectures-and-talks-amtb/amt-b-4">https://web.archive.org/web/20250420104303/https://turingarchive.kings.cam.ac.uk/publications-lectures-and-talks-amtb/amt-b-4</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>Hinton, G (2024) <a href="https://web.archive.org/web/20250426173021/https://www.cbsnews.com/news/geoffrey-hinton-ai-dangers-60-minutes-transcript/">https://web.archive.org/web/20250426173021/https://www.cbsnews.com/news/geoffrey-hinton-ai-dangers-60-minutes-transcript/</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p><a href="https://web.archive.org/web/20250531094402/https://safe.ai/work/statement-on-ai-risk">https://web.archive.org/web/20250531094402/https://safe.ai/work/statement-on-ai-risk</a></p><p></p></div></div>]]></content:encoded></item></channel></rss>