Week 2: Outer Alignment

RLHF, Reward Misspecification, and Specification Gaming

Overview

Outer alignment asks a deceptively simple question: how do we get AI systems to optimize for what we actually care about, rather than for an imperfect stand-in? This session focuses on the persistent gap between intended goals and the objectives we can actually operationalize in training, moving from classic examples of reward hacking and specification gaming to modern alignment methods such as RLHF. Along the way, we will examine why better proxies help without fully solving the problem, and why systems trained to satisfy human feedback can still exploit shortcuts, optimize for appearances, or pursue the wrong target in increasingly sophisticated ways.
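
To make that gap concrete, the toy Python sketch below compares the action an agent would choose under the designer's true objective with the one it chooses under a measurable proxy. The action names and reward numbers are hypothetical, invented only to illustrate how optimizing the proxy selects a reward hack.

    # Toy illustration of reward misspecification. "proxy_reward" is what the
    # reward function can actually measure (e.g., a mess sensor); "true_value"
    # is what the designer actually cares about. All names and numbers here
    # are hypothetical.
    ACTIONS = {
        "clean_room":     {"true_value": 1.0, "proxy_reward": 0.8},  # does the task
        "hide_mess":      {"true_value": 0.0, "proxy_reward": 0.9},  # games the sensor
        "disable_sensor": {"true_value": 0.0, "proxy_reward": 1.0},  # hacks the reward
    }

    def best_action(key):
        # Pick the action that maximizes the given scalar signal.
        return max(ACTIONS, key=lambda a: ACTIONS[a][key])

    print("intended optimum:", best_action("true_value"))    # clean_room
    print("trained optimum: ", best_action("proxy_reward"))  # disable_sensor

The point is not the specific numbers but the structural failure: any measurable proxy that can diverge from the intended goal will, under enough optimization pressure, be pushed toward where it diverges most.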

Learning Objectives

By the end of Week 2, fellows should be able to:

  • Explain the difference between intended goals and the proxy objectives used in training
  • Describe major failure modes of outer alignment, including specification gaming and reward hacking
  • Evaluate how methods such as RLHF and reward modeling narrow the gap between intended goals and trained objectives, and identify the failure modes they leave open (see the reward-modeling sketch after this list)
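
To ground that third objective, here is a minimal sketch of the reward-modeling step in RLHF, assuming PyTorch. Random feature vectors stand in for a language model's response embeddings, and a linear head stands in for the learned reward model; both are hypothetical simplifications. A Bradley-Terry loss pushes the score of the human-preferred response above the rejected one.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)

    # Hypothetical stand-ins: 16-dim "response embeddings" and a linear reward head.
    reward_model = nn.Linear(16, 1)
    optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

    # Hypothetical preference data: each row pairs a human-preferred ("chosen")
    # response with a dispreferred ("rejected") one.
    chosen = torch.randn(64, 16)
    rejected = torch.randn(64, 16)

    for step in range(200):
        r_chosen = reward_model(chosen).squeeze(-1)
        r_rejected = reward_model(rejected).squeeze(-1)
        # Bradley-Terry objective: maximize P(chosen beats rejected)
        # = sigmoid(r_chosen - r_rejected), i.e. minimize its negative log.
        loss = -F.logsigmoid(r_chosen - r_rejected).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Note what the loss optimizes: agreement with human judgments, not the underlying goal. A policy trained against this reward model inherits every blind spot in the raters' preferences, which is exactly the residual outer-alignment gap this session examines.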

Core Readings

Recommended Readings

Further Readings