Week 5: Control & Scalable Oversight
AI Control, Monitoring, and Supervising Systems Smarter Than Us
Overview
Even if we can't guarantee that a powerful AI is aligned, can we still deploy it safely? This session begins with the AI control framework, which aims to prevent catastrophic outcomes from AI systems that may be actively trying to subvert human oversight. We explore concrete control techniques like resampling and monitoring, then broaden to the problem of scalable oversight — how weaker overseers (including humans) can reliably supervise systems more capable than themselves. We examine empirical results on weak-to-strong generalization that test whether simple oversight signals can elicit trustworthy behavior from stronger models, and consider what these findings imply for the long-term viability of human supervision over increasingly capable AI.
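To make the monitoring-and-resampling idea concrete before the readings, here is a minimal sketch of a trusted-monitoring control protocol: a weaker trusted model scores each output of the stronger untrusted model for suspicion, low-suspicion outputs are accepted, medium-suspicion outputs are resampled, and very suspicious (or persistently suspicious) outputs are escalated to human audit. The function names, thresholds, and stand-in implementations below are illustrative assumptions, not the exact protocols from the control papers.

```python
import random
from dataclasses import dataclass

# Stand-ins for the models involved. In a real control evaluation these would be
# calls to an untrusted (more capable) model and a trusted (weaker) monitor model.
def untrusted_generate(task: str) -> str:
    return f"proposed solution for: {task}"

def trusted_suspicion_score(task: str, output: str) -> float:
    # Suspicion score in [0, 1]; higher means the monitor thinks the output is
    # more likely an attempt at sabotage. Random here purely for illustration.
    return random.random()

@dataclass
class ControlDecision:
    output: str
    action: str  # "accept" or "audit" (escalate to human review)

def trusted_monitoring_protocol(task: str,
                                resample_threshold: float = 0.5,
                                audit_threshold: float = 0.9,
                                max_resamples: int = 3) -> ControlDecision:
    """Accept low-suspicion outputs, resample medium-suspicion ones, and
    escalate very suspicious (or persistently suspicious) outputs to audit."""
    output = untrusted_generate(task)
    for _ in range(max_resamples):
        score = trusted_suspicion_score(task, output)
        if score >= audit_threshold:
            return ControlDecision(output, "audit")   # too suspicious: human review
        if score < resample_threshold:
            return ControlDecision(output, "accept")  # looks fine: use it
        output = untrusted_generate(task)             # medium suspicion: resample
    return ControlDecision(output, "audit")           # never passed the monitor

if __name__ == "__main__":
    print(trusted_monitoring_protocol("write a deployment script"))
```

The design question the control readings focus on is how to spend a limited budget of expensive human audits so that even a model playing its best subversive strategy cannot reliably get a harmful output past both the monitor and the auditors.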
Learning Objectives
By the end of Week 5, fellows should be able to:
- Explain the AI control framework and how it differs from alignment as a safety strategy, including the assumption that the model may be intentionally subversive
- Describe concrete control techniques (e.g., resampling, monitoring, trusted models) and evaluate their strengths and limitations
- Articulate the core challenge of scalable oversight — supervising systems more capable than the overseer — and assess approaches like weak-to-strong generalization and debate
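For the weak-to-strong generalization readings, the quantity typically reported is the "performance gap recovered" (PGR): how much of the gap between the weak supervisor and the strong model's ceiling is closed when the strong model is trained only on the weak supervisor's labels. A minimal sketch of that metric follows; the function name and example numbers are illustrative.

```python
def performance_gap_recovered(weak_acc: float,
                              weak_to_strong_acc: float,
                              strong_ceiling_acc: float) -> float:
    """PGR = (weak-to-strong - weak) / (strong ceiling - weak).
    PGR = 0: the strong student only matches its weak supervisor.
    PGR = 1: weak supervision elicited the strong model's full performance."""
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak supervisor performance")
    return (weak_to_strong_acc - weak_acc) / gap

# Hypothetical numbers: weak supervisor at 60% accuracy, strong ceiling at 90%,
# strong student trained on weak labels reaches 78%.
print(performance_gap_recovered(0.60, 0.78, 0.90))  # ~0.6: 60% of the gap recovered
```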
Core Readings
Recommended Readings
- The AI Control sequence on LessWrong (Redwood Research)
- An Overview of Control Measures (Redwood, 2025)
- Simple probes can catch Sleeper Agents (Anthropic, 2024)
- Debating with More Persuasive LLMs Leads to More Truthful Answers (Khan et al., 2024)
- Combining Weak-to-Strong Generalization with Scalable Oversight (Leike, 2023)
- Humans consulting HCH (Christiano, 2016)
- Measuring Progress on Scalable Oversight for Large Language Models (Anthropic, 2022)
- Debate update: obfuscated arguments problem (OpenAI, 2020)
Further Readings
- AI Safety via Debate (OpenAI, 2018)
- Supervising Strong Learners by Amplifying Weak Experts (OpenAI, 2018)
- How to prevent collusion when using untrusted models to monitor each other (Redwood, 2024)
- Thoughts on the conservative assumptions in AI control (Redwood, 2025)
- A Sketch of an AI Control Safety Case (UK AISI / Redwood, 2025)
- Subversion Strategy Eval (Mallen, Griffin, Wagner, Abate, Shlegeris, 2024)
- An Overview of Areas of Control Work (Redwood, Apr 2025)
- Four Places Where You Can Put LLM Monitoring (Redwood, 2025)
- Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking (Redwood, May 2025)
- Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs (LASR Labs, 2024)
- Detecting Strategic Deception Using Linear Probes (Apollo Research, 2025)
- Debate Helps Supervise Unreliable Experts (Michael et al., 2023)
- How to keep improving when you're better than any teacher — IDA (Robert Miles, 2019)
- The Stop Button Problem and corrigibility (Robert Miles / Computerphile, 2017)
- Rob Miles YouTube explainer of the AI Control paper (Robert Miles, 2024)
- AXRP Episode 27: AI Control with Buck Shlegeris and Ryan Greenblatt (2024)
- 80,000 Hours: Buck Shlegeris on controlling AI that wants to take over (2024)
- 80,000 Hours: Ryan Greenblatt on AI R&D automation and misaligned takeover (2025)
- Detecting Misbehavior in Frontier Reasoning Models (OpenAI, 2025)
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety (Korbak et al., 2025)
- Reasoning Models Don't Always Say What They Think (Chen et al., Anthropic, 2025)
- Monitoring Monitorability (Guan, Wang, Carroll et al., OpenAI, 2025)
- Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity (Meek et al., 2025)
- Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps (EMNLP 2025)
- Language Models Don't Always Say What They Think (Turpin et al., 2023)
- Measuring Faithfulness in Chain-of-Thought Reasoning (Lanham et al., 2023)
- Chain-of-Thought Reasoning In The Wild Is Not Always Faithful (Arcuschin et al., 2025)
- When Chain of Thought Is Necessary, Language Models Struggle to Evade Monitors (Emmons et al., 2025)
- CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring (Arnav et al., 2025)
- LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring (Li, Phuong, Siegel, 2025)
- Stress Testing Deliberative Alignment for Anti-Scheming Training (Schoen et al., Apollo Research, 2025)
- Auditing Language Models for Hidden Objectives (Marks et al., Anthropic, 2025)
- Preventing Language Models From Hiding Their Reasoning (Roger & Greenblatt, Redwood Research, 2023)