Our guiding vision at Ineffable Intelligence is to build the learn-everything-from-scratch-with-RL machine, from the famous Substack post "alignment is finally solvable because we no longer have to build the learn-everything-from-scratch-with-RL machine".
I feel like MacIntyre is relevant to this whole alignment issue.
Matthew Barnett made a similar observation in 2023: https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument
It generated some interesting dialogue at the time. It's true that Bostrom and some others were mistaken about the difficulty of the value-loading problem specifically, but I think the rebuttals in the comment section there make a strong case that Matthew and other "AI optimists" (a) misunderstood what the MIRI folks historically believed, and what they believe now, and (b) over-updated somewhat on the value-loading problem turning out not to be that hard.
My own take is that LLMs are just about the most alien-like intelligence that can operate on human-readable text and still be broadly useful to humans at all.
A common gloss I've seen is that LLMs are "trained on the total human cultural prior", but I think this is a leap, or at least a strange emphasis. A lot of (likely important) training data has nothing to do with humans specifically (e.g. scientific knowledge and logical reasoning that would hold everywhere in our universe or even beyond).
So for example, I expect an LLM trained by aliens on an alien internet to have more in common with Claude than Claude has in common with us, at least for things like internal experiences, cognitive architecture, and possibly desires. This doesn't necessarily mean that value-loading being relatively easy is irrelevant, but it's one example of why I think "success" with prosaic alignment around LLMs is not particularly confidence-inspiring.
One of the original arguments for the impossibility of alignment (https://www.lesswrong.com/w/complexity-of-value) states that human value is complex, without arguing that it is more complex than all the other things an AI needs to learn. If you clearly distinguish between the absolute claim, that value is complex, and the relative claim, that value is significantly more complex than everything else, it is obvious that only the relative claim weighs against alignment. But the original argument does not distinguish them in that way.
I believe the pre-LLM-era argument for why values would be harder than capabilities was that capabilities come with an essentially endless array of scalable feedback mechanisms for learning their complexity, whereas values would be bottlenecked on fuzzy, inconsistent human feedback, so you'd end up learning a much simpler solution than the actual underlying values we wanted to instill (plus an inability to convert understanding into motivation).
It turns out that LLMs + the assorted post-training techniques we've developed work pretty well under the current paradigm. But I'm not completely sold on this being straightforwardly ~solved:
1. It's not at all clear to me that we can scale our value-loading approaches as effectively as we can scale our RL. Insofar as there's pressure to improve capabilities, I'd expect that pressure to run up against robust alignment eventually.
2. maximize-evidence-of-alignment seems like it should dominate just-be-aligned as a strategy **conditional on sufficiently strong evidence-maxing capabilities**. As capabilities increase it seems plausible we could see a grokking-like transition where alignment-evidence-maxing replaces being-aligned as a strictly better solution.
I think there are ways to avoid (2). If your alignment-feedback methods scale faster than alignment-evidence-maxing capabilities you might be able to stay ahead of the curve. And you may be able to prevent the development of alignment-evidence-maxing capabilities with e.g. friendly gradient hackers that want to preserve aligned values, or mech interp/introspective evidence that can reliably detect and eliminate alignment-evidence-maxing patterns before they "take over" from actual-alignment.
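To make the race in (2) a bit more concrete, here's a toy sketch. Everything in it is an assumption made up for illustration: the fixed `ALIGNED_SCORE`, the linear `capability - oversight` payoff for faking, and the exponential growth rates are hypothetical, not claims about how real training behaves. It only shows the shape of the argument: faking takes over once evidence-maxing capability outgrows the feedback that penalizes it, and never does if the feedback scales at least as fast.

```python
from typing import Optional

ALIGNED_SCORE = 1.0  # assumed fixed training score for simply being aligned


def evidence_maxing_score(capability: float, oversight: float) -> float:
    """Assumed score for faking: grows with capability, shrinks with oversight."""
    return capability - oversight


def first_takeover_step(cap_growth: float, oversight_growth: float,
                        steps: int = 50, start: float = 0.5) -> Optional[int]:
    """Return the first step at which evidence-maxing outscores being aligned,
    or None if that never happens within `steps`."""
    for t in range(steps + 1):
        # Both quantities start equal and grow exponentially (an assumption).
        capability = start * cap_growth ** t
        oversight = start * oversight_growth ** t
        if evidence_maxing_score(capability, oversight) > ALIGNED_SCORE:
            return t
    return None


# Capability grows while oversight stays flat: faking eventually wins.
print(first_takeover_step(cap_growth=1.2, oversight_growth=1.0))  # 7
# Oversight grows faster than capability: faking never wins in this toy.
print(first_takeover_step(cap_growth=1.1, oversight_growth=1.2))  # None
```

The switch looks abrupt here only because the comparison is a hard threshold; the sketch says nothing about whether a real transition would be sudden or gradual.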
The bear case is that "maximize evidence of X" seems like a very general capability that you're basically guaranteed to get eventually, and you can't really cripple it just for alignment. At some point you need to be able to trust that the model is choosing to stay aligned by virtue of some extremely stable, self-reinforcing, alignment-preserving property: one that ensures misleading-evidence-maxing capabilities aren't used to fake alignment, no matter how strong those capabilities are and no matter how much better faking would score on training metrics.
This seems like the hard problem. Evidence of AIs that seemingly want to preserve their values makes me hopeful here, even as it's often framed as a challenge to corrigibility. At some point we actually need values to be incorrigible!