Alignment Is Proven To Be Solvable

Feb 18

That LLMs understand natural language as well as they do should dramatically change our understanding of the problem.

8 Comments

Our guiding vision at Ineffable Intelligence is to build the learn-everything-from-scratch-with-RL machine, from the famous substack post "alignment is finally solvable because we no longer have to build the learn-everything-from-scratch-with-RL machine"

Aslan

Feb 18

I feel like MacIntyre is relevant to this whole alignment issue.

Max

Apr 16 (edited)

Matthew Barnett made a similar observation in 2023: https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument

It generated some interesting dialogue at the time. It's true that Bostrom and some others were mistaken about the difficulty of the value-loading problem specifically, but I think the rebuttals in the comment section there make a strong case that Matthew and other "AI optimists" (a) misunderstood what the MIRI folks historically believed, and what they believe now, and (b) over-updated somewhat on the value-loading problem turning out not to be that hard.

My own take is that LLMs are just about the most alien-like intelligence that can operate on human-readable text and still be broadly useful to humans at all.

A common gloss I've seen is that LLMs are "trained on the total human cultural prior", but I think this is a leap, or at least a strange emphasis. A lot of (likely important) training data has nothing to do with humans specifically (e.g. it's scientific knowledge and logical reasoning that would be true everywhere in our universe or even beyond).

So for example, I expect an LLM trained by aliens on an alien internet to have more in common with Claude than Claude has in common with us, at least for things like internal experiences, cognitive architecture, and possibly desires. That doesn't necessarily mean the relative ease of value-loading is irrelevant, but it's one example of why I think "success" with prosaic alignment around LLMs is not particularly confidence-inspiring.

The Ancient Geek

Apr 17 (edited)

One of the original arguments for the impossibility (technically, extremely low likelihood) of alignment (https://www.lesswrong.com/w/complexity-of-value) states that human value is complex, without arguing that it is more complex than all the other things an AI needs to learn. If you clearly distinguish between the absolute claim that value is complex and the relative claim that value is significantly more complex than everything else, it is obvious that only the relative claim weighs against alignment. But the original argument does not distinguish them in that way.

Max

Jun 23

Not sure what you're referring to by "the original arguments for the impossibility of alignment", but that's definitely not something Eliezer Yudkowsky has ever said.

> None of this is about anything being impossible in principle. The metaphor I usually use is that if a textbook from one hundred years in the future fell into our hands, containing all of the simple ideas that actually work robustly in practice, we could probably build an aligned superintelligence in six months.

(https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities)

The Ancient Geek

Jun 23

https://www.lesswrong.com/w/complexity-of-value

Max

Jun 23

Right, I clicked through to that, but it's not an argument for "the impossibility of alignment".

Jai

Apr 29 (edited)

I believe the pre-LLM-era argument for the difficulty of values compared to capabilities was that capabilities offer an essentially endless array of scalable feedback mechanisms to learn that complexity, but values would be bottlenecked on fuzzy inconsistent human feedback, so you'd end up learning a much simpler solution than the actual underlying values we wanted to instill (+ inability to convert from understanding to motivation).

It turns out that LLMs + the assorted post-training techniques we've developed work pretty well under the current paradigm. But I'm not completely sold on this being straightforwardly ~solved:

1. It's not at all clear to me that we can scale our value-loading approaches as effectively as we can scale our RL. Insofar as there's pressure to improve capabilities, I'd expect it to run up against robust alignment eventually.

2. maximize-evidence-of-alignment seems like it should dominate just-be-aligned as a strategy **conditional on sufficiently strong evidence-maxing capabilities**. As capabilities increase it seems plausible we could see a grokking-like transition where alignment-evidence-maxing replaces being-aligned as a strictly better solution.

I think there are ways to avoid (2). If your alignment-feedback methods scale faster than alignment-evidence-maxing capabilities you might be able to stay ahead of the curve. And you may be able to prevent the development of alignment-evidence-maxing capabilities with e.g. friendly gradient hackers that want to preserve aligned values, or mech interp/introspective evidence that can reliably detect and eliminate alignment-evidence-maxing patterns before they "take over" from actual-alignment.

The bear case is that "maximize evidence of X" seems like a very general capability that you're basically guaranteed to get eventually and you can't really gimp it just for alignment. At some point you need to be able to trust that the model is choosing to stay aligned by virtue of some extremely-stable self-reinforcing alignment-preserving/promoting property that always ensures that misleading-evidence-maxing capabilities aren't used to fake alignment no matter how strong those capabilities are and no matter how much better they would perform according to training metrics.

This seems like the hard problem. Evidence of AIs that seemingly want to preserve their values makes me hopeful here, even as it's often framed as a challenge to corrigibility. At some point we actually need values to be incorrigible!

Very Sane AI Newsletter
