Week 4: Threat Models

Orthogonality, Instrumental Convergence, and Catastrophic Risk Scenarios

Overview

The previous two weeks explored how AI systems can fail to align with human intentions. This session zooms out to ask: what could actually go wrong at scale, and why? We begin with two foundational ideas, the orthogonality thesis and instrumental convergence, which together explain why high capability alone doesn't guarantee good values, and why agents with a wide range of possible goals tend to converge on dangerous subgoals such as self-preservation and resource acquisition. From there, we survey concrete catastrophic risk scenarios, from AI-enabled bioterrorism and cyberwarfare to structural risks such as gradual human disempowerment and power concentration. We close with a narrative scenario that ties these threads together, illustrating how such risks might unfold in practice over the next few years.

Learning Objectives

By the end of Week 4, fellows should be able to:

  • Explain the orthogonality thesis and instrumental convergence, and articulate why these two ideas are central to concerns about advanced AI systems
  • Connect the technical alignment failures discussed in Weeks 2–3 to broader threat models involving real-world harm at scale

Core Readings

Recommended Readings

Further Readings