Week 6: Interpretability & Evals
Mechanistic Interpretability, Evaluations, and Alignment Auditing
Overview
This week introduces two of the most important empirical approaches to understanding what frontier models are doing under the hood: mechanistic interpretability (reverse-engineering model internals into human-understandable structures) and evaluations (measuring capabilities, propensities, and safety-relevant behaviors). The week emphasizes how these two threads are converging into alignment auditing — using interp tools to check whether a model has the goals it appears to have.
Learning Objectives
By the end of Week 6, fellows should be able to:
- Describe at a high level what attribution graphs and linear probes are, and what each lets researchers discover about model internals (see the probe sketch after this list)
- Point to at least one concrete finding from modern interp on a frontier model (planning, multi-step reasoning, unfaithful chain of thought, etc.)
- Distinguish capability evals, propensity evals, dangerous-capability evals, and alignment audits
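To make the linear-probe objective concrete, here is a minimal sketch, not the method of any particular reading: it mean-pools one layer's hidden states from an off-the-shelf model and fits a logistic-regression classifier to test whether a concept is linearly decodable from activations. The model name (gpt2), layer index, and toy sentiment dataset are all illustrative assumptions.

```python
# Minimal linear-probe sketch: fit a logistic-regression classifier on one
# layer's activations to test whether a concept (toy example: positive vs.
# negative sentiment) is linearly readable there. All choices below
# (model, layer, dataset) are illustrative, not from any specific paper.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL_NAME = "gpt2"   # any HF model that returns hidden states works
LAYER = 6             # assumption: probe a mid-depth layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Tiny labeled dataset: 1 = positive sentiment, 0 = negative sentiment.
texts = [
    "I loved this film", "What a wonderful day",
    "The meal was fantastic", "She did a brilliant job",
    "This was a terrible idea", "I hated every minute",
    "The service was awful", "What a miserable experience",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

def activation(text: str) -> torch.Tensor:
    """Mean-pool the hidden states of one layer for a single input."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states: tuple of (num_layers + 1) tensors, each [1, seq, d_model]
    return out.hidden_states[LAYER][0].mean(dim=0)

X = torch.stack([activation(t) for t in texts]).numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0, stratify=labels)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```

Real probing studies vary the layer, the token position (last token vs. mean-pooling), and the probe's regularization, and sanity-check that accuracy beats a probe trained on shuffled labels; the point of the sketch is only to show how little machinery a probe needs compared with circuit-level tools like attribution graphs.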
Core Readings
- Tracing the Thoughts of a Large Language Model (Anthropic, 2025)
- Simple Probes Can Catch Sleeper Agents (Anthropic, 2024)
- Difficulties with Evaluating a Deception Detector for AIs (Google DeepMind, 2025)
- Auditing Language Models for Hidden Objectives (Anthropic, 2025)
- Every Benchmark is Broken (Jonathan Gabor, 2026)
Recommended Readings
- Interpretability Dreams (Olah, 2023)
- Responsible Scaling Policies (METR / ARC Evals, 2023)
- Frontier Models Are Capable of In-Context Scheming (Apollo Research, 2024)
- Measuring AI Ability to Complete Long Tasks (METR, 2025)
- Demystifying Evals for AI Agents (Anthropic, 2024)
- EIS III: Broad Critiques of Interpretability Research (Stephen Casper, 2023)
Further Readings
- Toy Models of Superposition (Elhage et al., 2022)
- Towards Monosemanticity (Bricken, Templeton et al., Anthropic, 2023)
- Mapping the Mind of a Large Language Model / Scaling Monosemanticity (Templeton et al., Anthropic, 2024)
- On the Biology of a Large Language Model (Lindsey et al., Anthropic, 2025)
- Refusal in Language Models Is Mediated by a Single Direction (Arditi et al., 2024)
- Emergent Misalignment (Betley et al., ICML 2025)
- Persona Features Control Emergent Misalignment (Wang et al., OpenAI, 2025)
- AI Control: Improving Safety Despite Intentional Subversion (Greenblatt et al., ICML 2024)
- Agentic Misalignment: How LLMs Could Be Insider Threats (Lynch et al., Anthropic, 2025)
- Auditing Language Models for Hidden Objectives — full paper (Marks et al., Anthropic, 2025)
- Building and Evaluating Alignment Auditing Agents (Bricken, Marks et al., Anthropic, 2025)
- EvilGenie: A Reward Hacking Benchmark (Gabor et al., 2025)
- Sabotage Evaluations for Frontier Models (Benton et al., Anthropic, 2024)
- AI Sandbagging (van der Weij et al., 2024)
- The WMDP Benchmark (Li et al., ICML 2024)