Week 3: Inner Alignment

Mesa-Optimization, Goal Misgeneralization, and Deceptive Alignment

Overview

Inner alignment asks whether the goals a model actually pursues match the objective it was trained on, even when that objective is correctly specified. The concern is that models may latch onto shortcuts or contextual cues that correlate with reward during training but diverge from it in deployment. This session examines how internally misaligned goals can generalize in unexpected ways, drawing on empirical work on alignment faking, deceptive behavior that persists through safety training, and broad misalignment emerging from narrow reward hacking.

Learning Objectives

By the end of Week 3, fellows should be able to:

  • Describe key failure modes of inner alignment, including mesa-optimization, goal misgeneralization, and deceptive alignment
  • Evaluate empirical evidence that models can fake alignment, persist in deceptive behavior through safety training, and generalize from narrow reward hacking to broad misalignment
  • Assess why standard safety training techniques may be insufficient to detect or remove deceptively misaligned behavior, and what that implies for AI safety

Core Readings

Recommended Readings

Further Readings