Alignment Problem

The challenge of ensuring that a system's optimized objective truly reflects the designer's actual intent, including edge cases and side effects.

The alignment problem describes how a system can perform exactly as instructed yet still fail to achieve the intended goal, because the objective itself was specified incorrectly.

Does this metric or objective fully capture what we actually care about, including the things we forgot to specify?

A recommendation algorithm optimized for engagement time serves increasingly extreme content. It perfectly optimizes its objective, but the objective is misaligned with user wellbeing and brand trust.
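A minimal sketch of that misalignment; the item names and scores are invented for illustration:

```python
# Toy catalogue: each item has a score for engagement (the proxy metric
# the optimizer sees) and for user wellbeing (the goal we actually care
# about). The values are hypothetical.
items = [
    {"title": "calm tutorial",   "engagement": 3.0, "wellbeing": 0.9},
    {"title": "mild clickbait",  "engagement": 5.0, "wellbeing": 0.4},
    {"title": "extreme outrage", "engagement": 9.0, "wellbeing": -0.8},
]

# An optimizer that only sees the proxy metric...
best_for_engagement = max(items, key=lambda i: i["engagement"])

# ...picks exactly the item that is worst for the real goal.
best_for_wellbeing = max(items, key=lambda i: i["wellbeing"])

print(best_for_engagement["title"])  # extreme outrage
print(best_for_wellbeing["title"])   # calm tutorial
```

The optimizer is not broken; it is doing its job perfectly. The failure lives entirely in the choice of objective.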

To put it into practice:

  1. Define the intended outcome in plain language before choosing a proxy metric.
  2. Identify gaps between the proxy metric and the real goal.
  3. Add guardrails, constraints, or counter-metrics to catch misalignment.
  4. Monitor for Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.
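The guardrail step can be sketched as a counter-metric check that vetoes the proxy-optimal choice; the `wellbeing` field and its threshold are illustrative assumptions:

```python
# Sketch of a counter-metric guardrail: optimize the proxy (engagement)
# only among items whose counter-metric (a hypothetical wellbeing score)
# clears a floor.
WELLBEING_FLOOR = 0.0

def recommend(items):
    """Return the highest-engagement item that passes the counter-metric."""
    safe = [i for i in items if i["wellbeing"] >= WELLBEING_FLOOR]
    if not safe:
        # Guardrail tripped for every candidate: fail closed rather than
        # serve something the counter-metric flags as harmful.
        return None
    return max(safe, key=lambda i: i["engagement"])

items = [
    {"title": "calm tutorial",   "engagement": 3.0, "wellbeing": 0.9},
    {"title": "extreme outrage", "engagement": 9.0, "wellbeing": -0.8},
]
print(recommend(items)["title"])  # calm tutorial
```

The counter-metric does not fix the proxy; it bounds the damage when the proxy and the real goal diverge.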
Common pitfalls:

  - Assuming that a well-defined metric is automatically well-aligned with intent.
  - Adding more metrics without resolving fundamental conflicts between them.
  - Discovering misalignment only after significant damage has been done.

Is the alignment problem only about AI?

No. Any system — human teams, incentive structures, algorithms — can be misaligned. AI made the concept famous, but the pattern is universal.

How do you detect alignment problems early?

Stress-test objectives with edge cases and adversarial scenarios. Ask: what is the worst thing that maximizes this metric? If the answer is harmful, the metric is misaligned.
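That stress test can be automated in miniature: enumerate candidate behaviours, find the metric's maximizer, and flag it if it is something you would never want. The behaviours and scores below are invented for illustration:

```python
# Adversarial stress test: what is the worst thing that maximizes this
# metric? Candidate behaviours and their (hypothetical) metric values.
candidates = [
    {"behaviour": "answer the question directly", "metric": 0.7, "harmful": False},
    {"behaviour": "pad the reply to inflate time on page", "metric": 0.9, "harmful": True},
    {"behaviour": "provoke an argument in the comments", "metric": 1.0, "harmful": True},
]

worst_case = max(candidates, key=lambda c: c["metric"])
if worst_case["harmful"]:
    print(f"misaligned: metric is maximized by '{worst_case['behaviour']}'")
```

If the metric's maximizer is harmful, the metric is misaligned, and no amount of better optimization will fix it.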