Week 6: Interpretability & Evals

Mechanistic Interpretability, Evaluations, and Alignment Auditing

Overview

This week introduces two of the most important empirical approaches to understanding what frontier models are doing under the hood: mechanistic interpretability (reverse-engineering model internals into human-understandable structures) and evaluations (measuring capabilities, propensities, and safety-relevant behaviors). The week emphasizes how these two threads are converging into alignment auditing: using interpretability tools to check whether a model actually has the goals it appears to have.

Learning Objectives

By the end of Week 6, fellows should be able to:

  • Describe at a high level what attribution graphs and linear probes are, and what each lets researchers discover about model internals
  • Point to at least one concrete finding from modern interp on a frontier model (planning, multi-step reasoning, unfaithful chain of thought, etc.)
  • Distinguish capability evals, propensity evals, dangerous-capability evals, and alignment audits
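To make the linear-probe objective concrete, here is a minimal, self-contained sketch. All names and data are illustrative assumptions, not from any real model: we fabricate "residual stream" activations in which a binary concept is linearly encoded along a single direction, then fit a logistic-regression probe with plain gradient descent and check that it recovers both the concept and the direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 64, 2000  # hypothetical hidden size and number of examples

# Fabricate activations: isotropic noise plus a concept direction whose
# sign encodes a binary label (this stands in for cached model activations).
concept_dir = rng.normal(size=d_model)
concept_dir /= np.linalg.norm(concept_dir)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + np.outer(2.0 * (2 * labels - 1), concept_dir)

# Train a logistic-regression probe: a single linear map from activations
# to a concept logit, optimized by gradient descent on the cross-entropy loss.
w, b = np.zeros(d_model), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # predicted P(concept present)
    grad = p - labels                           # dL/dlogit for cross-entropy
    w -= 0.1 * (acts.T @ grad) / n
    b -= 0.1 * grad.mean()

acc = (((acts @ w + b) > 0) == labels).mean()
cos = w @ concept_dir / np.linalg.norm(w)
print(f"probe accuracy: {acc:.2f}")
print(f"cosine similarity of probe weights to true direction: {cos:.2f}")
```

The point of the exercise is the second printout: when a concept is linearly represented, the probe's weight vector approximately recovers the encoding direction, which is what lets researchers use probes to locate concepts inside a model rather than merely classify its outputs.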

Core Readings

Recommended Readings

Further Readings