Week 2: Outer Alignment
RLHF, Reward Misspecification, and Specification Gaming
Overview
Outer alignment asks a deceptively simple question: how do we get AI systems to optimize for what we actually care about, rather than for an imperfect stand-in? This session focuses on the persistent gap between intended goals and operationalized objectives, moving from classic examples of reward hacking and specification gaming to modern alignment methods like RLHF. Along the way, we will examine why better proxies help without fully solving the problem, and why systems trained to satisfy human feedback can still exploit shortcuts, optimize appearances, or pursue the wrong target in increasingly sophisticated ways.
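The gap between a proxy objective and the intended goal can be made concrete with a deliberately simple toy example. The sketch below is purely illustrative and is not taken from any of the readings; the action names and reward numbers are invented. Its only point is that a proxy signal and the intended objective can agree on ordinary behavior yet come apart on exactly the behavior a proxy optimizer will select.

```python
# Hypothetical toy example of a proxy / true-objective gap (all numbers illustrative).
# Each action has a proxy reward (what the training signal measures) and a true
# value (what the designers intended). Optimizing the proxy picks the gaming action.

actions = {
    #  action                      (proxy_reward, true_value)
    "complete the task well":        (0.8, 0.9),
    "complete the task sloppily":    (0.4, 0.3),
    "exploit the reward channel":    (1.0, 0.0),   # specification gaming
}

proxy_choice = max(actions, key=lambda a: actions[a][0])
intended_choice = max(actions, key=lambda a: actions[a][1])

print("Chosen by proxy optimization:", proxy_choice)
print("Intended by the designers:   ", intended_choice)
```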
Learning Objectives
By the end of Week 2, fellows should be able to:
- Explain the difference between intended goals and the proxy objectives used in training
- Describe major failure modes of outer alignment, including specification gaming and reward hacking
- Evaluate how methods like RLHF and reward modeling narrow, but do not close, the gap between proxy objectives and intended goals (a minimal sketch of the reward-modeling loss follows this list)
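For fellows who want to see what "reward modeling" optimizes in practice, the sketch below shows the pairwise (Bradley-Terry) preference loss used by standard RLHF reward-modeling pipelines. It is a generic illustration, not code from any of the readings; the random scores and tensor shapes are placeholders standing in for the outputs of an actual reward model.

```python
import torch
import torch.nn.functional as F

def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the human-preferred response scores higher.

    loss = -log sigmoid(r(chosen) - r(rejected))
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Illustrative usage: random scores stand in for reward-model outputs on a
# batch of (preferred, dispreferred) response pairs.
score_chosen = torch.randn(8, requires_grad=True)    # scores for preferred responses
score_rejected = torch.randn(8, requires_grad=True)  # scores for dispreferred responses

loss = preference_loss(score_chosen, score_rejected)
loss.backward()  # gradients push chosen scores up and rejected scores down
print(f"preference loss: {loss.item():.3f}")
```

Even a reward model that fits human comparisons well remains a proxy for what humans actually want, so a policy optimized against it can still exploit the model's errors, which is the thread running through the readings below.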
Core Readings
- Specification gaming: the flip side of AI ingenuity (DeepMind, 2020)
- Aligning language models to follow instructions (OpenAI, 2022)
- Language Models Learn to Mislead Humans via RLHF (Wen et al., 2024)
- Large Language Models can Strategically Deceive their Users when Put Under Pressure (Apollo Research, 2023)
Recommended Readings
- Deep reinforcement learning from human preferences (blog post; Christiano et al., 2017)
- Playing the Training Game (Piper, 2023)
- Natural Emergent Misalignment from Reward Hacking in Production RL (Anthropic + Redwood Research, 2025)
- Constitutional AI: Harmlessness from AI Feedback (Anthropic, 2022)
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (Casper et al., 2023)