Is the alignment problem only about AI?
No. Any system — human teams, incentive structures, algorithms — can be misaligned. AI made the concept famous, but the pattern is universal.
The challenge of ensuring that the objective a system optimizes truly reflects the designer's intent, including edge cases and side effects.
The alignment problem warns that a system can perform exactly as instructed while completely failing to achieve the intended goal, because the objective was specified incorrectly.
Does this metric or objective fully capture what we actually care about, including the things we forgot to specify?
A recommendation algorithm optimized for engagement time serves increasingly extreme content. It perfectly optimizes its objective, but the objective is misaligned with user wellbeing and brand trust.
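A toy simulation makes the failure concrete. Everything here is hypothetical (the catalog, the engagement model, the wellbeing proxy); the point is only that a greedy optimizer faithfully maximizes the stated objective while the unstated one degrades.

```python
import random

random.seed(42)

# Each item in the catalog: (extremity, quality), both in [0, 1].
catalog = [(random.random(), random.random()) for _ in range(500)]

def predicted_engagement(extremity, quality):
    # Toy assumption: extremity drives watch time more than quality does.
    return 0.8 * extremity + 0.2 * quality

def user_wellbeing(extremity, quality):
    # The thing we actually care about, but it is never part of the objective.
    return quality - extremity

def avg(values):
    return sum(values) / len(values)

# Greedy policy: serve the items with the highest predicted engagement.
ranked = sorted(catalog, key=lambda item: predicted_engagement(*item), reverse=True)
served = ranked[:20]

print("avg engagement, served items:", round(avg([predicted_engagement(*i) for i in served]), 2))
print("avg wellbeing, served items: ", round(avg([user_wellbeing(*i) for i in served]), 2))
print("avg wellbeing, whole catalog:", round(avg([user_wellbeing(*i) for i in catalog]), 2))
```

The optimizer does nothing wrong here; the objective does. Swapping in a wellbeing-aware objective would change the ranking without changing the algorithm.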
Stress-test objectives with edge cases and adversarial scenarios. Ask: what is the worst thing that maximizes this metric? If the answer is harmful, the metric is misaligned.
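That stress test can be run mechanically: search the input space for whatever maximizes the metric, then inspect the winner. Below is a minimal sketch; the engagement metric, the harm check, and all thresholds are invented for illustration, standing in for whatever a real team would use.

```python
from itertools import product

def engagement_metric(session_minutes, daily_visits):
    # Candidate objective: total time on platform.
    return session_minutes * daily_visits

def looks_harmful(session_minutes, daily_visits):
    # Crude stand-in for a harm check: flag compulsive usage patterns.
    return session_minutes > 300 or daily_visits > 10

# Adversarial search: enumerate user behaviors and find the metric's argmax.
candidates = product(range(0, 601, 60), range(0, 15))
worst = max(candidates, key=lambda c: engagement_metric(*c))

print("metric-maximizing behavior:", worst)   # (600, 14)
print("harmful?", looks_harmful(*worst))      # True
```

The metric's best case turns out to be a five-hour, fourteen-visit day, exactly the pattern the harm check flags. That mismatch is the misalignment, and it is visible before deployment.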
People optimize for what they are rewarded for, not what you intend.
Direction matters as much as pace.
Outputs circle back as inputs, amplifying or stabilizing a system.