Week 4: Threat Models
Orthogonality, Instrumental Convergence, and Catastrophic Risk Scenarios
Overview
The previous two weeks explored how AI systems can fail to align with human intentions. This session zooms out to ask: what could actually go wrong at scale, and why? We begin with two foundational ideas, the orthogonality thesis and instrumental convergence, which together explain why high capability alone doesn't guarantee good values, and why a wide range of possible goals converges on dangerous subgoals like self-preservation and resource acquisition. From there, we survey concrete catastrophic risk scenarios, from AI-enabled bioterrorism and cyberwarfare to structural risks like gradual human disempowerment and power concentration. We close with a narrative scenario that ties many of these threads together, illustrating how these risks might unfold in practice over the next few years.
Learning Objectives
By the end of Week 4, fellows should be able to:
- Explain the orthogonality thesis and instrumental convergence, and articulate why these ideas are central to concerns about advanced AI systems
- Connect the technical alignment failures discussed in Weeks 2–3 to broader threat models involving real-world harm at scale
Core Readings
- What is the orthogonality thesis? (AISafety.info)
- Why Would AI Want to do Bad Things? Instrumental Convergence (Robert Miles, 2018)
- Statement on AI Risk (Center for AI Safety, 2023)
- An Overview of Catastrophic AI Risks (Center for AI Safety, 2023)
- We're Not Ready for Superintelligence (AI in Context, 2025)
Recommended Readings
- AI 2027 (AI Futures Project, 2025)
- The Problem (MIRI, 2025)
- Intelligence and Stupidity: The Orthogonality Thesis (Robert Miles, 2018)
- The Basic AI Drives (Steve Omohundro, 2008)
- Gradual Disempowerment (Kulveit et al., 2025)
- The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents (Nick Bostrom, 2012)
- Existential Risk from Power-Seeking AI (Joe Carlsmith, 2022)
Further Readings
- Instrumental convergence (Eliezer Yudkowsky, 2025)
- Why Would AI "Aim" To Defeat Humanity? (Cold Takes, 2022)
- AI Could Defeat All Of Us Combined (Cold Takes, 2022)
- Two types of AI existential risk: decisive and accumulative (Atoosa Kasirzadeh, 2025)
- International AI Safety Report 2026
- The Vulnerable World Hypothesis (Nick Bostrom, 2019)
- AI-Enabled Coups: How a Small Group Could Use AI to Seize Power (Forethought, 2025)
- Impact of AI on cyber threat from now to 2027 (NCSC, 2025)
- How AI Threatens Democracy (Kreps & Kriner, 2023)
- Can Democracy Survive the Disruptive Power of AI? (Carnegie Endowment for International Peace, 2024)
- The Authoritarian Risks of AI Surveillance (Lawfare, 2025)
- The Operational Risks of AI in Large-Scale Biological Attacks (RAND, 2024)