Week 5: Control & Scalable Oversight
AI Control, Monitoring, and Supervising Systems Smarter Than Us
Overview
Even if we can't guarantee that a powerful AI is aligned, can we still deploy it safely? This session begins with the AI control framework, which aims to prevent catastrophic outcomes from AI systems that may be actively trying to subvert human oversight. We explore concrete control techniques like resampling and monitoring, then broaden to the problem of scalable oversight — how weaker overseers (including humans) can reliably supervise systems more capable than themselves. We examine empirical results on weak-to-strong generalization that test whether simple oversight signals can elicit trustworthy behavior from stronger models, and consider what these findings imply for the long-term viability of human supervision over increasingly capable AI.
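To make the monitoring-and-resampling idea concrete before the readings, here is a minimal sketch of a trusted-monitoring control protocol: a weaker trusted model scores each output of the stronger untrusted model for suspicion, low-suspicion outputs are accepted, medium-suspicion outputs are resampled, and very suspicious (or persistently suspicious) outputs are escalated to human audit. The function names, thresholds, and stand-in implementations below are illustrative assumptions, not the exact protocols from the control papers.

```python
import random
from dataclasses import dataclass

# Stand-ins for the models involved. In a real control evaluation these would be
# calls to an untrusted (more capable) model and a trusted (weaker) monitor model.
def untrusted_generate(task: str) -> str:
    return f"proposed solution for: {task}"

def trusted_suspicion_score(task: str, output: str) -> float:
    # Suspicion score in [0, 1]; higher means the monitor thinks the output is
    # more likely an attempt at sabotage. Random here purely for illustration.
    return random.random()

@dataclass
class ControlDecision:
    output: str
    action: str  # "accept" or "audit" (escalate to human review)

def trusted_monitoring_protocol(task: str,
                                resample_threshold: float = 0.5,
                                audit_threshold: float = 0.9,
                                max_resamples: int = 3) -> ControlDecision:
    """Accept low-suspicion outputs, resample medium-suspicion ones, and
    escalate very suspicious (or persistently suspicious) outputs to audit."""
    output = untrusted_generate(task)
    for _ in range(max_resamples):
        score = trusted_suspicion_score(task, output)
        if score >= audit_threshold:
            return ControlDecision(output, "audit")   # too suspicious: human review
        if score < resample_threshold:
            return ControlDecision(output, "accept")  # looks fine: use it
        output = untrusted_generate(task)             # medium suspicion: resample
    return ControlDecision(output, "audit")           # never passed the monitor

if __name__ == "__main__":
    print(trusted_monitoring_protocol("write a deployment script"))
```

The design question the control readings focus on is how to spend a limited budget of expensive human audits so that even a model playing its best subversive strategy cannot reliably get a harmful output past both the monitor and the auditors.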
Learning Objectives
By the end of Week 5, fellows should be able to:
- Explain the AI control framework and how it differs from alignment as a safety strategy, including the assumption that the model may be intentionally subversive
- Describe concrete control techniques (e.g., resampling, monitoring, trusted models) and evaluate their strengths and limitations
- Articulate the core challenge of scalable oversight — supervising systems more capable than the overseer — and assess approaches like weak-to-strong generalization and debate
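For the weak-to-strong generalization readings, the quantity typically reported is the "performance gap recovered" (PGR): how much of the gap between the weak supervisor and the strong model's ceiling is closed when the strong model is trained only on the weak supervisor's labels. A minimal sketch of that metric follows; the function name and example numbers are illustrative.

```python
def performance_gap_recovered(weak_acc: float,
                              weak_to_strong_acc: float,
                              strong_ceiling_acc: float) -> float:
    """PGR = (weak-to-strong - weak) / (strong ceiling - weak).
    PGR = 0: the strong student only matches its weak supervisor.
    PGR = 1: weak supervision elicited the strong model's full performance."""
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak supervisor performance")
    return (weak_to_strong_acc - weak_acc) / gap

# Hypothetical numbers: weak supervisor at 60% accuracy, strong ceiling at 90%,
# strong student trained on weak labels reaches 78%.
print(performance_gap_recovered(0.60, 0.78, 0.90))  # ~0.6: 60% of the gap recovered
```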
Core Readings
Recommended Readings
- The AI Control sequence on LessWrong (Redwood Research)
- An Overview of Control Measures (Redwood, 2025)
- Simple probes can catch Sleeper Agents (Anthropic, 2024)
- Debating with More Persuasive LLMs Leads to More Truthful Answers (Khan et al., 2024)
- Combining Weak-to-Strong Generalization with Scalable Oversight (Leike, 2023)
- Humans consulting HCH (Christiano, 2016)
- Measuring Progress on Scalable Oversight for Large Language Models (Anthropic, 2022)
- Debate update: obfuscated arguments problem (OpenAI, 2020)
Further Readings
- AI Safety via Debate (OpenAI, 2018)
- Supervising Strong Learners by Amplifying Weak Experts (OpenAI, 2018)
- How to prevent collusion when using untrusted models to monitor each other (Redwood, 2024)
- Thoughts on the conservative assumptions in AI control (Redwood, 2025)
- A Sketch of an AI Control Safety Case (UK AISI / Redwood, 2025)
- Subversion Strategy Eval (Mallen, Griffin, Wagner, Abate, Shlegeris, 2024)
- An Overview of Areas of Control Work (Redwood, Apr 2025)
- Four Places Where You Can Put LLM Monitoring (Redwood, 2025)
- Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking (Redwood, May 2025)
- Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs (LASR Labs, 2024)
- Detecting Strategic Deception Using Linear Probes (Apollo Research, 2025)
- Debate Helps Supervise Unreliable Experts (Michael et al., 2023)
- How to keep improving when you're better than any teacher — IDA (Robert Miles, 2019)
- The Stop Button Problem and corrigibility (Robert Miles / Computerphile, 2017)
- Rob Miles YouTube explainer of the AI Control paper (Robert Miles, 2024)
- AXRP Episode 27: AI Control with Buck Shlegeris and Ryan Greenblatt (2024)
- 80,000 Hours: Buck Shlegeris on controlling AI that wants to take over (2024)
- 80,000 Hours: Ryan Greenblatt on AI R&D automation and misaligned takeover (2025)
- Detecting Misbehavior in Frontier Reasoning Models (OpenAI, 2025)
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety (Korbak et al., 2025)
- Reasoning Models Don't Always Say What They Think (Chen et al., Anthropic, 2025)
- Monitoring Monitorability (Guan, Wang, Carroll et al., OpenAI, 2025)
- Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity (Meek et al., 2025)
- Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps (EMNLP 2025)
- Language Models Don't Always Say What They Think (Turpin et al., 2023)
- Measuring Faithfulness in Chain-of-Thought Reasoning (Lanham et al., 2023)
- Chain-of-Thought Reasoning In The Wild Is Not Always Faithful (Arcuschin et al., 2025)
- When Chain of Thought Is Necessary, Language Models Struggle to Evade Monitors (Emmons et al., 2025)
- CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring (Arnav et al., 2025)
- LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring (Li, Phuong, Siegel, 2025)
- Stress Testing Deliberative Alignment for Anti-Scheming Training (Schoen et al., Apollo Research, 2025)
- Auditing Language Models for Hidden Objectives (Marks et al., Anthropic, 2025)
- Preventing Language Models From Hiding Their Reasoning (Roger & Greenblatt, Redwood Research, 2023)