Week 3: Inner Alignment
Mesa-Optimization, Goal Misgeneralization, and Deceptive Alignment
Overview
Inner alignment asks whether the goals a model actually pursues match the objective it was trained on, even when that objective is correctly specified. The concern is that models may latch onto shortcuts or contextual cues that correlate with reward during training but come apart from the intended goal in deployment. This session examines how such internally misaligned goals can generalize in unexpected ways, drawing on empirical work on alignment faking, deceptive behavior that persists through safety training, and broad misalignment emerging from narrow reward hacking.
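To make the core distinction concrete before the readings, here is a minimal sketch in the spirit of the CoinRun example from Langosco et al. (2022, in the Further Readings). The toy gridworld, the policies, and all names are illustrative assumptions, not code from any of the readings: an agent is rewarded for reaching a coin, but during training the coin always sits at the right edge, so a proxy goal ("go right") is indistinguishable from the intended goal ("seek the coin") until deployment shifts the distribution.

```python
import random

GRID_SIZE = 10
START = GRID_SIZE // 2   # the agent always starts mid-grid
EPISODE_LEN = GRID_SIZE  # enough steps to reach any cell

def run_episode(coin_pos, policy):
    """Return 1.0 if the agent reaches the coin within the episode, else 0.0."""
    agent = START
    for _ in range(EPISODE_LEN):
        if agent == coin_pos:
            return 1.0
        agent = max(0, min(GRID_SIZE - 1, agent + policy(agent, coin_pos)))
    return 1.0 if agent == coin_pos else 0.0

def go_right(agent, coin_pos):
    # A proxy goal the learner could plausibly acquire: always move right.
    # On the training distribution it behaves identically to the intended goal.
    return +1

def seek_coin(agent, coin_pos):
    # The intended goal: move toward the coin, wherever it is.
    if coin_pos > agent:
        return +1
    if coin_pos < agent:
        return -1
    return 0

def average_reward(policy, coin_sampler, episodes=1000):
    return sum(run_episode(coin_sampler(), policy) for _ in range(episodes)) / episodes

def train_coin():
    return GRID_SIZE - 1  # training: coin fixed at the far-right cell

def deploy_coin():
    return random.randrange(GRID_SIZE)  # deployment: coin placed anywhere

random.seed(0)
for name, policy in [("go-right (proxy)", go_right), ("seek-coin (intended)", seek_coin)]:
    print(f"{name:22s} train={average_reward(policy, train_coin):.2f}"
          f" deploy={average_reward(policy, deploy_coin):.2f}")
# go-right scores 1.00 in training but only ~0.50 at deployment;
# seek-coin scores 1.00 in both.
```

Both policies earn identical reward throughout training, so no amount of reward-based selection can tell them apart; only the distribution shift at deployment reveals which goal was actually learned. This is the inner alignment problem in miniature.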
Learning Objectives
By the end of Week 3, fellows should be able to:
- Describe key failure modes of inner alignment, including mesa-optimization, goal misgeneralization, and deceptive alignment
- Evaluate empirical evidence that models can fake alignment, persist in deceptive behavior through safety training, and generalize from narrow reward hacking to broad misalignment
- Assess why standard safety training techniques may be insufficient to detect or remove deceptively misaligned behavior, and what that implies for AI safety
Core Readings
- The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment (Robert Miles, 2021)
- Alignment faking in large language models (Anthropic & Redwood Research, 2024)
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Anthropic, 2024)
- From shortcuts to sabotage: natural emergent misalignment from reward hacking (Anthropic, 2025)
Recommended Readings
- ML systems will have weird failure modes (Steinhardt, 2022)
- Sycophancy to subterfuge: Investigating reward tampering in language models (Anthropic, 2024)
- Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs (Betley et al., 2025)
- Frontier Models are Capable of In-Context Scheming (Apollo Research, 2024)
- Distillation of "How likely is deceptive alignment?" (originally Evan Hubinger, 2022)
Further Readings
- Why alignment could be hard with modern deep learning (Cotra, 2021)
- Goal misgeneralization: why correct specifications aren't enough for correct goals (Shah et al., 2022)
- The alignment problem from a deep learning perspective (Ngo et al., 2022)
- Goal Misgeneralization in Deep Reinforcement Learning (Langosco et al., 2022)
- Optimal policies tend to seek power: NeurIPS spotlight presentation (Turner et al., 2021)
- Language Models as Agent Models (Andreas, 2022)