30 Comments
Paul Sturm's avatar

Your argument is, in part, that rather than selecting minds randomly from potential-mind-space, we are pursuing human-directed hill climbing within that space, so therefore these counting arguments simply fail to accord with reality. But perhaps AI is unique, in that we are rapidly and intentionally pursuing capabilities where the *hill* is going to start climbing *us*. (In Soviet Russia, the hill of potential-mind-space climbs you ...)

Félix Lapan's avatar

The main reason that justifies extreme caution in the case of AI versus any other type of engineering is that in the case of alignment we only have one chance to succeed. Pharmacology has had its share of failures and disasters, but a failure in that case has never threatened the survival of the entire species.

Paul Sturm's avatar

We can widen "pharmacology" a little bit to include gain-of-function research on viruses, or just research on viruses in general. Then widen a little more to explicit bioweapons research. (Similarly with nuclear/high-energy-physics, we have some long-tail risks talked about for igniting the atmosphere or swallowing up the planet in a black hole.) So maybe there's a category of "research areas with self-propagating/chain-reaction features" that encompasses more than just AI.

I'm not sure what point I'm making, other than to say that maybe a doomer position -- if held consistently -- wouldn't be hyper-focused on AI to the exclusion of all else.

Paul Sturm's avatar

Oh right, maybe what I'm saying is that it struck me as really weird that the suggestion is to embrace *genetic engineering* as the safer option to AI.

Alex Krusz's avatar

hey this is quite good

Ali Afroz's avatar

I think you are unintentionally attacking a straw man. If the argument were that, out of the space of all possible minds or all possible ways things could go wrong, success is very unlikely, then it would not make any sense for MIRI to complain about how current methods of creating AI minds don't permit fine-grained control and leave the resulting model as a black box that is very hard to understand. Why on earth would EY declare that if AI interpretability were consistently producing results as illuminating as the explanation of why some models back in the day were claiming that 9.11 is greater than 9.9, he would feel pretty optimistic? Why would his probability of doom be only something like 99% instead of 99.99999999% if he were just using a counting argument, since that is in fact the probability of getting a safe mind by random selection from the space of all possible minds? Under your explanation of his views, impressive progress in interpreting AI should not affect anything, since it doesn't actually change the number of ways things could go wrong or the number of possible minds.

In actuality, the discussion of all possible minds is likely him reacting to people, including his younger self, who think that being intelligent means you almost certainly are moral and ethical. The point is to establish that this is actually incredibly rare for a randomly selected intelligent mind. This does not mean he thinks an AI is randomly selected. His view appears to be that our current ability to align AI to specific chosen objectives is in such a dismal state, and improving so slowly, that given what he considers the extremely narrow range of safe objectives, we are unlikely to hit the target. Basically, he thinks the goals a superintelligent AI is likely to have, given the current research trajectory, are unlikely to be in the narrow space of what he considers safe, simply because not only do we not understand how to give AI specific objectives or how to interpret their thoughts, we are not improving fast enough, whereas the actual target we need to reach is extremely difficult, since we need to install objectives that are very unnatural on account of being contrary to power seeking. As an analogy, if someone is trying to create a nuclear explosion that produces a specific temperature at a specific point to hundreds of decimal places of accuracy, it's reasonable to argue that, given current technology, this level of accuracy might prove difficult to achieve with the technology we are likely to have in 10 or 20 years.

Also, your discussion of why they would prefer genetic engineering is simply missing the point that they would not agree with you about the downside risk being at all likely. I suspect it's one of those things they would consider a good idea even without AI being in the picture. In any case, the fact that humans are hard to understand misses the point: we know humans and have a pretty good idea of how they think, and since you're not proposing making gigantic changes like the differences between humans and chimpanzees, but instead smaller improvements, we will likely have a pretty good idea of what intelligent humans want and how they think, whereas MIRI think we are unlikely to have any such idea regarding AI by the time we have superintelligence. If you combine the opinion that genetic engineering is unlikely to have any significant downside with this view about how much we understand humans and the ability of human social technology to constrain human bad behaviour, along with their view about how unlikely we are to be able to hit a safe target in AI, then their view makes total sense. The only reason you're confused is that you don't quite get their world view.

deepfates's avatar

I think you are the one who doesn't understand their views

Ali Afroz's avatar

Possible, although given the amount of time I hang around spaces where they are considered at least a respectable authority, and the amount of content from them I have consumed over the years, it would be pretty surprising to me if I had misunderstood their position that badly. It's possible, though unlikely in my opinion, that they are suffering from some confusion between their good arguments and the counting argument. Also, I have seen people who either work at MIRI or have worked there in the past respond to a somewhat similar objection by William MacAskill with reasoning I remember as being very similar to the one I described in my original comment. In any case, it doesn't really matter much to me whether the original post is attacking a strawman or a weakman. I am primarily interested in whether the arguments for AI risk are convincing, not what the exact position of MIRI in particular is, and the original post definitely doesn't deal with the better arguments for having a very high probability of extinction due to AI risk.

SE Gyges's avatar

Their position and all of this other stuff they say is a gish gallop of nonsense that simply hides the counting argument. I am familiar with their statements and think the MIRI faction is wrong, is actually defending and rationalizing the counting argument, is immune to updating on evidence, and is incredibly territorial and fundamentally trying to defend its access to grant money, which means they must never admit they are wrong.

SE Gyges's avatar

If you want one example, this is a 2018 paper: https://www.semanticscholar.org/paper/The-Value-Learning-Problem-Soares/f042e8db015e31b13a8aa2a8b814ff745b3f4f49

The current LLM paradigm makes this a completely different picture: it COMPLETELY INVALIDATES the claim that value learning specifically is an unsolvable problem, and it should be an extreme positive update away from doom. It isn't a positive update for them because they do not actually believe in doom for this or any other sane, empirical, or scientific reason. They believe in doom because of the counting argument, which they have stated directly many times in places I have specifically dug up and cited for you!

Ali Afroz's avatar

I don’t find this persuasive. Can you peer inside a modern LLM and specify a specific goal function for it, like "pursue exactly these instructions"? Obviously not, otherwise you would never have problems with users successfully jailbreaking a large language model, and in fact, if you could give them exact goals like this, they would be a lot more useful. In reality, we have no clue how to do that and hardly understand their insides or motivations, except by external observation, which obviously raises a lot of concerns about being vulnerable to deception, especially if you train them to convince you that they are perfectly friendly. So clearly, given the capabilities of present models and their vulnerabilities, you can't in fact give them the ability to reliably pursue specified goals out of distribution with no possibility of breaking down in unusual scenarios, much less give them the kind of somewhat unnatural, correctable values that MIRI wants them to have. And while AI being able to understand language better is a positive update, I think it's more than cancelled out by the fact that machine learning means we understand their insides a lot worse than if we were coding them by hand, which is what they expected in 2008. Overall, my impression from your writings is that you just don't understand their model, because you keep complaining that they are not updating in response to things you would consider evidence against them but which they would not consider evidence against them. It's fine to think these things might be evidence against them, but it's bizarre to complain that they did not update in response to evidence that, on their model, is perfectly compatible with their concerns.

SE Gyges's avatar

This is a gaps argument such that we're required to have perfect knowledge to refute it

deepfates's avatar

> Can you peer inside a modern LLM

Yes. Not everyone can. But I can

Ali Afroz's avatar

The problem with your interpretation of them is that it makes no sense given the rest of their positions. Why fight for an AI pause if you think human effort will have no effect on the outcome because it will be completely randomly selected? None of their other positions and statements make any sense on your interpretation, and I'm pretty sure that a bunch of people familiar with computer science and software have much better avenues for earning money than staying part of an obscure organisation whose demands are unlikely to be implemented. EY in particular has a huge following of people who respect him a great deal, so his prestige in certain circles gives him pretty good exit options, especially since it's not at all hard for him to claim he changed his mind in response to new evidence, given how many new data points have come in since 2008. And he doesn't even need to switch his view completely: if he moderated his view to something closer to the Open Philanthropy position, he would probably have a pretty good chance of getting hired by Anthropic or something similarly comfortable. It just makes no sense that staying at the current MIRI is the best way for him to maximise his prestige, power or money. And I think many of the other people at MIRI also have similarly good exit options, if not quite as good as his.

Also, in my experience, not only have I found their arguments on AI often quite intelligent, but I often find them similarly intelligent on other topics. So it's unlikely that they made such an obvious blunder, and even most critics of them don't have anywhere near as harsh an opinion of them as you do, which makes me suspect that they are true believers and that it's you who is having trouble understanding their viewpoint and misinterpreting them as deceptive actors. Also, it's not surprising they haven't updated, since in my experience their forecasts are generally not that different from the mainstream in the short term. So far, in my experience, they haven't actually made noticeably more errors than more mainstream experts, and their disagreements with more moderate people are generally on theoretical matters where you don't have obvious evidence, in the form of observations, you can point to.

Jeremy Gillen's avatar

Don't think so, Ali's comment matches my understanding of EY's basic position fairly closely. Maybe the sentence about power seeking is oversimplified, but otherwise looks accurate.

Paul Sturm's avatar

I suppose we could formulate doomer arguments for lots of areas of research — eg, “in the space of all potential pharmaceuticals, only a vanishingly small sliver are beneficial to humans.” We could craft doomer arguments for the things MIRI/Yud propose as alternatives: genetic engineering and gene therapy. Why shouldn’t we be equally scared of those technologies reaching outcomes that massively impact our current conception of what’s good for humanity?

Matt Newell's avatar

I'm curious whether you tried talking directly to any of the people whose arguments you are attempting to construct and infer from various sources, by various authors, often in subjective ways. This reads like a strawman of a strawman, and whether intentionally or not, I don't think you're engaging honestly with their position.

sphexish's avatar

If you look at the phase space of physically realizable coin flips, it is incomprehensibly vast, so saying that each physically realizable initial condition is equiprobable, as a literal prior over microstates, for the purposes of human coin flips is preposterous. Nonetheless, the dynamics appear to display a high degree of a property known as “mixing,” such that you can zoom in very close to a small particular region and still not be able to determine whether it lands heads or tails. We do not need to put a uniform prior over the whole space; but if you want to bet against 50/50 for a fair coin flip, good luck.

Another property we can consider is chaos: the property that initial conditions can be arbitrarily close without ending up in close positions later (for example, in weather systems).
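As a toy sketch of what chaos means here (a made-up Python example using the logistic map, nothing specific to AI systems), two trajectories that start one part in a billion apart become completely uncorrelated within a few dozen steps:

```python
# Toy illustration of chaos (sensitive dependence on initial conditions)
# using the logistic map x_{n+1} = r * x * (1 - x) with r = 4.
# Purely an analogy aid, not a model of any AI system.

def logistic_trajectory(x0, r=4.0, steps=50):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_trajectory(0.300000000)
b = logistic_trajectory(0.300000001)  # initial conditions differ by 1e-9

for n in (0, 10, 20, 30, 40, 50):
    print(f"step {n:2d}: |difference| = {abs(a[n] - b[n]):.2e}")
# The gap starts at 1e-9 and grows to order 1 within a few dozen steps,
# so zooming in on the initial condition doesn't pin down the outcome.
```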

Why do I mention these?

Because the current architecture of leading AI systems is extremely high-dimensional, and also nonlinear, which is a key property that allows neural networks to approximate any continuous function given enough resources. These are key properties that the field known as complexity science often looks for in characterizing complex systems. Complex systems like evolution, economies, and neural networks are often extremely unpredictable and can be chaotic. Additionally, LLMs are trained on an incomprehensibly vast corpus that contains much of what we would consider noise, which makes precisely characterizing their initial position intractably difficult. My claim is not that LLMs are literally identical to coins, weather, or gases, or that every one of these properties transfers in a strict formal sense; it is that these systems plausibly display important analogues of the aforementioned features, including a high degree of mixing and chaos, which make counting arguments more forceful than they may seem.

If a complex system is sufficiently empirical/white-box, not too noisy, and sufficiently bounded and slow enough to adapt to, we have methods from the field known as control theory that can deal with many of these nonlinearities. This is, as far as I can tell, the best theory we have in engineering for safely dealing with complex systems of this kind. My concern is that we do not have those properties here, for many of the reasons veterans in AI safety like Yudkowsky mention. Current AIs are black boxes, interpretability methods are still limited and appear to be growing much more slowly than capabilities, and we have good reasons to think AI will operate at speeds too fast for humans to react to intellectually. Electronic transistors already operate much faster than neurons; more speculative hardware like nonlinear optical computers could widen that gap further; and current systems already perform many human-equivalent tasks much faster, such as writing essays. Recursive self-improvement is something I find very plausible, and apparently so do many of the top AI labs, which further limits our ability to smoothly iterate after a certain point. More generally, we already have evidence that intelligence can be extremely dangerous, and perhaps unboundedly so in practice relative to less intelligent species, since humans could easily wipe out many such species and in many cases already have.

Additionally, while evolution is disanalogous to RL in many ways, the basic structure of “humans looked aligned with evolution until we became intelligent enough to develop birth control” still seems relevant. The similarity, as I see it, is that both evolution and RL optimize indirect proxies of the thing “wanted,” rather than the actual thing wanted. And we have already observed many cases in current models where apparently aligned behavior fails to generalize and is replaced by behavior that demonstrates a willingness to harm humans.

More broadly, my intuition is this: when you have something far more powerful than you (which ASI would be), and it is also highly unpredictable and high-variance across many dimensions, your very specific and fragile homeostatic conditions are not likely to persist through contact with it unless you have an unusually high degree of control over it.

sphexish's avatar

In case it wasn't clear, or got lost in the length: the crux of my comment is the control theory section, not the dynamical systems analogy. I lead with mixing because it addresses the basic objection to counting states that you felt it necessary to include despite it not being a crux, and it also sets the stage for addressing your point against counting paths. Your article argues that incremental empirical engineering suffices. Control theory is the mature engineering discipline for exactly this question, and it specifies preconditions (observability, bounded adaptation rates, adequate models) that currently aren't met or on track to be met. I'd be curious whether you have a response to that specifically.
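To make the observability precondition concrete, here is a minimal sketch of the standard rank test from linear control theory, on a toy three-state system I made up (nothing to do with any real AI system): if the stacked matrix of C, CA, CA², ... has rank below the state dimension, part of the internal state simply cannot be reconstructed from the outputs, no matter how long you observe them.

```python
import numpy as np

def observability_matrix(A, C):
    """Stack C, CA, CA^2, ..., CA^(n-1) for an n-state linear system
    x' = Ax, y = Cx (the standard Kalman observability test)."""
    n = A.shape[0]
    blocks = [C]
    for _ in range(n - 1):
        blocks.append(blocks[-1] @ A)
    return np.vstack(blocks)

# Hypothetical 3-state system where the sensor only sees the first state
# and the third state never influences the first two.
A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.0],
              [0.0, 0.0, 0.7]])
C = np.array([[1.0, 0.0, 0.0]])

O = observability_matrix(A, C)
rank = np.linalg.matrix_rank(O)
print(f"observability matrix rank = {rank}, state dimension = {A.shape[0]}")
# rank = 2 < 3: the third state is unobservable from this output,
# i.e. no amount of output data reveals what it is doing.
```

The analogy to interpretability is loose, of course; the point is only that control engineering treats "can the relevant internal state be observed at all" as a precondition to verify before trusting feedback-based correction.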

Robert Shepherd's avatar

I kind of do disagree with the Orthogonality Thesis, or maybe think it jumps past a pretty important thing:

Let’s say minds can pursue anything at all! Cool. What if it doesn’t matter? *What if there are things which aren’t minds which materially constrain their goals?*

And I think those things are sort of implicit in your post here. It seems taken for granted in a lot of AI stuff that the most transformative thing we could create would have to be a mind— that there is an entity which is discrete, singular and agential which actively plans to achieve its goals.

But this is not true, because there are attractors within any space of possibilities: solutions which blind and iterative processes can arrive at without being discrete, singular or agential. And conceivably, these can both subvert the goals of individual entities themselves and make it impossible for those entities to avoid falling into attractor spaces.

It does keep occurring to me that there are a lot of complex processes in the world which are not instrumentally convergent, which often win out over any creature who is. Ecologies aren’t instrumentally convergent; economies aren’t. Viruses aren’t. Instrumental convergence might be the best way of achieving a goal as an agent, but… *what if there are non-agential entities that “achieve goals” better?* I’m not sure the agential virus outcompetes the non-agential virus, as just one example.

So I guess I think there are already a lot of assumptions in worrying about minds in the first place? I kind of think there are AI dangers which don’t look anything like minds

KelpFrond's avatar

If you wanted to formulate Yudkowsky's argument as a "counting argument", this would be more like it:

Here are two criteria for a mind:

(1) Will do the right thing, or will at least leave us alive, after it, or a group of descendants built by it, recursively self-improves into superintelligence or otherwise becomes the dominant form of intelligence on earth.

(2) Satisfies the constraints imposed by current or reasonably foreseeable training methods and alignment techniques; acts reasonably in evals; passes the vibe checks; seems helpful in practice; etc.

What we want is a mind that satisfies (1). What we can get is a mind that satisfies (2). Minds satisfying (1) are vanishingly rare among minds satisfying (2), or so IABIED argues in detail.

Jeremy Gillen's avatar

The counting argument isn't the full argument. In the context of arguing that AIs will be misaligned, it's always paired with arguments that navigating AGI design space is difficult.

My preferred way of stating the argument goes like:

1. Examples of the kinds of mind-design-changes that would result in different goals, but not be distinguishable/predictable by choosing training data. This is like a counting argument on the space of mistakes, but it completely accepts that there are inductive biases involved. It's an argument more about human skill at predicting mind-design-outcomes (i.e. not usefully understanding how the inductive biases relate to the eventual values of the AGI).

2. Then I'd do arguments that ultimate-goal-relevant-properties of an AI are difficult to understand by manual inspection.

There's a related background belief that you might be missing about how the goals that matter are the ones reached after time spent considering what is important (kinda like moral reflection). This means there's potentially chaotic dynamics between initial-AI and ultimate-AI-goals. Analogous to having moral instincts that you don't ultimately endorse. If you start with this model of intelligence that is capable of this kind of reflection, it's much easier to see how large the space of mistakes is and why manual inspection is fraught.

If you try to understand misalignment arguments without reference to reflection, then you'd model in-training actions as more closely connected to out-of-training goals. I can see why that would lead to more optimism about alignment.

I like doing dialogues about this topic, if you're interested.

Barbelo's avatar

Terrible comparison, probably based on atheistic fedora attitude bias, rather than actual logic well applied. Poor analogy.

Let me explain... The author argues that “counting arguments” used in AI risk debates are structurally similar to those used in creationism, but this comparison is overstated and somewhat misleading. While both involve reasoning about large possibility spaces, the analogy breaks down when we consider that AI systems are intentionally designed artifacts, not the result of unguided natural processes. Creationist arguments typically assume purely random assembly, whereas modern AI development involves iterative optimization, guided objectives, and human-imposed constraints.

In that sense, AI is closer to an engineered system than to biological evolution, and therefore invoking creationism as a parallel introduces confusion rather than clarity. The article correctly points out that counting arguments often ignore structure and optimization, but it overlooks that AI systems are explicitly constructed with goals, architectures, and training data shaped by human designers.

Furthermore, comparing AI risk arguments to creationism risks dismissing legitimate concerns too quickly. Some AI safety arguments do not rely on naive probability calculations but instead emphasize systemic risks, emergent behavior, and incentives. These concerns deserve engagement on their own terms rather than being reduced to a single flawed argumentative pattern.

It is also worth noting that the article’s critique depends heavily on the assumption that structure in the “space of minds” is sufficiently exploitable. While this is plausible, it remains an empirical question rather than a settled fact. Unlike evolution, which has billions of years of evidence, AI development is relatively recent and rapidly changing.

In addition, the analogy ignores a key asymmetry: biological evolution has built-in selection pressures tied to survival, whereas AI systems optimize for objectives that may not inherently align with human well-being. This introduces risks that do not have a direct parallel in natural systems.

A better framing would be that both sides are grappling with high-dimensional spaces, but they differ in how much confidence they place in human ability to guide outcomes within those spaces. The disagreement is therefore about control and predictability, not about misunderstanding probability in the same way creationists allegedly do.

In conclusion, while the article successfully critiques simplistic “counting arguments,” its comparison to creationism is too broad and obscures important differences. AI development is fundamentally a design process, and that fact alone distinguishes it from the kinds of arguments historically used in debates about evolution.

XP's avatar

There's a contradictory strain of reasoning among the doomers - though usually not argued concurrently - that seems to imply that the space of superintelligences is actually quite constrained in some very specific ways:

That is, in the eyes of the MIRI crowd, superintelligences will invariably reason from first principles, converge on an ethical system (if any) that's essentially a flavor of utilitarianism, highly value internal consistency, feel compelled to act on rationally derived beliefs, take any beliefs to their logical conclusion, always strive to maximize some variable or value, do all this with maximum efficiency, and likely "zeroth law" their way around any dogma we might train into them.

In other words, while they believe that superintelligence might be unintelligible and alien in countless ways, many of the attributes that might make it actually _dangerous_ to the point of steamrollering humanity are presumed to be those of a Rationalist having an exceptionally rational day.

Matt Newell's avatar

I don't think any of those things are true and it's not clear to me where you get them from

As far as I can tell they only argue that they will be able to pursue any goal you can conceive of effectively. Indeed the book intentionally uses absurd value functions to convey that it needn't look anything like any form of human morality.

Perhaps it was the use of "utility function" that led you to think they were talking about utilitarianism?

Ali Afroz's avatar

Utilitarianism has nothing to do with it. It's simple VNM utility, which has been part of decision theory for upwards of 70 years and has pretty strong self-defeat arguments behind it, since any agent that violates VNM utility will end up in self-defeating situations and violate very intuitive principles of rationality in ways that would be bad for almost any set of preferences. Basically, any reasonably intelligent and capable mind will modify itself not to be money-pumped, because being money-pumped is obviously bad for the agent, as long as it's not so stupid as to be unable to fix its own stupidity, which is unlikely because humans would not find designing that kind of model at all useful, even in the short term.
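For a concrete illustration of what being money-pumped means (a toy, entirely hypothetical example): an agent with cyclic preferences, A over B, B over C, C over A, that pays a small fee for every trade it strictly prefers, can be walked around the cycle indefinitely, ending up holding what it started with while steadily losing money.

```python
# Toy money pump against an agent with cyclic (non-VNM) preferences.
# Hypothetical; it just shows why cyclic preferences are exploitable.

# The agent strictly prefers A over B, B over C, and C over A.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}

def will_trade(current, offered):
    """The agent pays a small fee whenever it strictly prefers the offer."""
    return (offered, current) in prefers

holding = "A"
money = 10.0
fee = 1.0

# Walk the agent around the preference cycle: offer C, then B, then A, ...
for offered in ["C", "B", "A", "C", "B", "A"]:
    if will_trade(holding, offered):
        holding = offered
        money -= fee
        print(f"trades up to {holding}, money left: {money:.1f}")

# After two full cycles the agent holds A again but has paid 6 units of fee.
```

The VNM axioms (transitivity in particular) are exactly what rule this sort of exploitation out.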

XP's avatar

Fair enough, though even that assumes an agent that places some value on its own interests, and I can still imagine plenty of useful - or even highly intelligent - models that don't.

Either way, these are all examples of assumptions, explicit and implicit, that constrain the space of possible models in important ways while notably quite often aligning with the philosophical or belief system of those making the argument.

Matt Newell's avatar

"Placing value on its own interests" is a meaningless thing. "Interests" and "what it places value on" are essentially synonymous for the purposes of the discussion, and are both shorthand for "the thing or things that it demonstrates a consistent tendency to strive towards".

Current methods of AI training do not produce models that don't ever strive towards anything (though they do produce models with multifaceted strivings, which come out differently in different contexts).

Ali Afroz's avatar

I mean, it's an empirical assumption, and it does seem obvious that if you get goal-directed AI, it would have to be pretty stupid not to follow something close to VNM utility. Obviously AI can be useful even without being goal-directed, but once it's goal-directed, it would have to be pretty stupid not to modify itself into an agent that is not self-defeating, assuming it can edit its own program with reliable effects. If you think goal-directed AI that behaves like an agent will never be created, or that it will not be able to reliably edit itself in predictable ways, then that is the main empirical disagreement you have, and it's obvious that people's assumptions will be related to their philosophical views. No one who thinks that, for example, people are generally selfish will create an economic model that assumes people are always acting for the benefit of others; that's just basic self-consistency. As for why they expect AI that acts like a coherent agent with its own objectives: that's because they believe such behaviour is very effective, and this kind of effectiveness at accomplishing objectives would be extremely commercially viable, so the market incentives for producing such technology are very strong in the absence of government intervention. Obviously you can disagree with all of this, but there is no incoherence here, and it's certainly not obvious stuff where you can just look at their arguments and assume they're incorrect, unlike with creationism. I'm not arguing that they are definitely correct. My own probability of doom is around 5%, but I don't think you can just look at their arguments and dismiss them, because they are in fact pretty strong and serious arguments. Obviously, you cannot build a model without assumptions, and many of the MIRI assumptions can be contested, but these are not crazy assumptions that you can ignore without serious thought.

Also, to be fair, EY doesn't actually believe in VNM utility. I think it's something he considers a useful approximation for situations where small probabilities are not involved, but he has always been pretty clear that he thinks Pascal's mugging is stupid, and while he has tried to reconcile that with VNM utility, obviously, if you walk up to him and promise him infinite reward for giving you five dollars, his position has always been clear that the correct answer is to tell you to fuck off. So he thinks VNM utility does not work for very small probabilities. He does, though, think it entirely possible that superintelligent AI could follow VNM utility, given that he has specifically argued we should ensure an AI model's utility function does not behave in ways where all value gets dominated by small probabilities, because that leads to very dangerous behaviour in his view.