AI alignment is a field of AI safety research focused on developing AI systems to follow the user's desired behavior and achieve their desired outcomes, ensuring the model is "aligned" with human values. The AI alignment problem is an issue related to how we can encode AI models to make them act in a way that is compatible with human moral values. While AI models are written to efficiently and effectively perform tasks valuable to the user, they do not have the ability of judgment, inference, or understanding in the way a human would naturally do. This problem becomes more complex when the system has multiple values to prioritize in the system, making it impossible to maximize both.
AI alignment research has the following objectives types:
- Intended goals—These are goals fully aligned with the intentions and desires of the human user, even when poorly articulated. It's the hypothetical ideal outcome for the user.
- Specified goals—These are explicitly specified in the AI system's objective function or data set; they are programmed into the system.
- Emergent goals—These are the resulting goals the AI system advances.
Misalignment occurs when one or more of these goal types does not match the others, generally divided into two main types:
- Inner misalignment—A mismatch between goals 2 and 3; what is written in the code does not match what the system advances.
- Outer misalignment—A mismatch between goals 1 and 2; what the operator wants to happen does not match the explicit goals coded into the machine.
The alignment problem was first described in a 2003 thought experiment by philosopher Nick Bostrom. He imagined a super-intelligent AI that was tasked with producing as many paper clips as possible. Bostrom suggests the AI may quickly decide to kill all of humanity to prevent them from switching it off and getting in the way of its mission or as a way to harvest more resources to convert into more paper clips. While absurd, the thought experiment illustrates how AI doesn't have inherent human values, and the systems may optimize what we ask for using unexpected or harmful methods. With the release and widespread use of generative AI models, AI alignment is becoming increasingly important, with the developers of models creating methods to ensure their technology behaves as desired, limiting the impact of misinformation or bias.
The alignment problem comes from the disconnect between how we want AI models to behave and translating that into the numerical logic of computers. It can be divided into the technical aspect of encoding values and principles into AI in a reliable manner and the process of deciding what moral values or principles should be encoded.