Week 6: Interpretability & Evals
Mechanistic Interpretability, Evaluations, and Alignment Auditing
Overview
This week introduces two of the most important empirical approaches to understanding what frontier models are doing under the hood: mechanistic interpretability (reverse-engineering model internals into human-understandable structures) and evaluations (measuring capabilities, propensities, and safety-relevant behaviors). The week emphasizes how these two threads are converging into alignment auditing — using interp tools to check whether a model has the goals it appears to have.
Learning Objectives
By the end of Week 6, fellows should be able to:
- Describe at a high level what attribution graphs and linear probes are, and what each lets researchers discover about model internals (see the probe sketch after this list)
- Point to at least one concrete finding from modern interp on a frontier model (planning, multi-step reasoning, unfaithful chain of thought, etc.)
- Distinguish capability evals, propensity evals, dangerous-capability evals, and alignment audits
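To make the linear-probe objective concrete, here is a minimal sketch, not the method of any particular reading: it mean-pools one layer's hidden states from an off-the-shelf model and fits a logistic-regression classifier to test whether a concept is linearly decodable from activations. The model name (gpt2), layer index, and toy sentiment dataset are all illustrative assumptions.

```python
# Minimal linear-probe sketch: fit a logistic-regression classifier on one
# layer's activations to test whether a concept (toy example: positive vs.
# negative sentiment) is linearly readable there. All choices below
# (model, layer, dataset) are illustrative, not from any specific paper.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL_NAME = "gpt2"   # any HF model that returns hidden states works
LAYER = 6             # assumption: probe a mid-depth layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Tiny labeled dataset: 1 = positive sentiment, 0 = negative sentiment.
texts = [
    "I loved this film", "What a wonderful day",
    "The meal was fantastic", "She did a brilliant job",
    "This was a terrible idea", "I hated every minute",
    "The service was awful", "What a miserable experience",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

def activation(text: str) -> torch.Tensor:
    """Mean-pool the hidden states of one layer for a single input."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states: tuple of (num_layers + 1) tensors, each [1, seq, d_model]
    return out.hidden_states[LAYER][0].mean(dim=0)

X = torch.stack([activation(t) for t in texts]).numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0, stratify=labels)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```

Real probing studies vary the layer, the token position (last token vs. mean-pooling), and the probe's regularization, and sanity-check that accuracy beats a probe trained on shuffled labels; the point of the sketch is only to show how little machinery a probe needs compared with circuit-level tools like attribution graphs.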
Core Readings
- Tracing the Thoughts of a Large Language Model (Anthropic, 2025)
- Simple Probes Can Catch Sleeper Agents (Anthropic, 2024)
- Difficulties with Evaluating a Deception Detector for AIs (Google DeepMind, 2025)
- Auditing Language Models for Hidden Objectives (Anthropic, 2025)
- Every Benchmark is Broken (Jonathan Gabor, 2026)
Recommended Readings
- Interpretability Dreams (Olah, 2023)
- Responsible Scaling Policies (METR / ARC Evals, 2023)
- Frontier Models Are Capable of In-Context Scheming (Apollo Research, 2024)
- Measuring AI Ability to Complete Long Tasks (METR, 2025)
- Demystifying Evals for AI Agents (Anthropic, 2024)
- EIS III: Broad Critiques of Interpretability Research (Stephen Casper, 2023)
Further Readings
- Toy Models of Superposition (Elhage et al., 2022)
- Towards Monosemanticity (Bricken, Templeton et al., Anthropic, 2023)
- Mapping the Mind of a Large Language Model / Scaling Monosemanticity (Templeton et al., Anthropic, 2024)
- On the Biology of a Large Language Model (Lindsey et al., Anthropic, 2025)
- Refusal in Language Models Is Mediated by a Single Direction (Arditi et al., 2024)
- Emergent Misalignment (Betley et al., ICML 2025)
- Persona Features Control Emergent Misalignment (Wang et al., OpenAI, 2025)
- AI Control: Improving Safety Despite Intentional Subversion (Greenblatt et al., ICML 2024)
- Agentic Misalignment: How LLMs Could Be Insider Threats (Lynch et al., Anthropic, 2025)
- Auditing Language Models for Hidden Objectives — full paper (Marks et al., Anthropic, 2025)
- Building and Evaluating Alignment Auditing Agents (Bricken, Marks et al., Anthropic, 2025)
- EvilGenie: A Reward Hacking Benchmark (Gabor et al., 2025)
- Sabotage Evaluations for Frontier Models (Benton et al., Anthropic, 2024)
- AI Sandbagging (van der Weij et al., 2024)
- The WMDP Benchmark (Li et al., ICML 2024)