6.7920 Fall 2025      

Reinforcement Learning: Foundations and Methods

 

Note: The schedule is subject to minor changes and will be updated periodically with lecture slides and readings.
Note: All lecture slides will be posted in Canvas.
Note: Lectures meet on Tuesdays and Thursdays; recitations (R01-R11) meet on Fridays.
PART 1: Dynamic Programming
09/04 L01. What is sequential decision making?
Introduction, finite-horizon problem formulation, Markov decision processes, course overview
Readings: N1 §3, N2 §1-3, N3 §1; DPOC 1.1-1.3, 2.1; SB Ch1 (skim)
R01. Inventory control.
Readings: N3 §3; DPOC 3.2
09/09 L02. Dynamic programming: What makes sequential decision-making hard?
Finite horizon dynamic programming (DP) algorithm, sequential decision making as shortest path
Readings: N2 §1-3; DPOC 3.3-3.4
HW1 assigned [1,2]
09/11 L03. Special structures: What makes some sequential decision-making problems easy?
DP arguments, inventory problem, optimal stopping
Readings: N2 §1-3, N3 §2-3; DPOC 3.2
R02. Combinatorial optimization as DP.
Readings: DPOC App-B
09/16 L04. Special structures
Linear quadratic regulator
Readings: N3 §1-2; DPOC 3.1
HW1 due, HW2 assigned [3,4]
09/18 L05. Non-discounted infinite horizon problems
Linear quadratic regulator
Readings: N3 §1-2; DPOC 1.4, 4.1-4.2
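For quick reference, a minimal NumPy sketch (illustrative only, not course material) of the finite-horizon LQR solution from L04-L05 via the backward Riccati recursion; the toy double-integrator system is an assumed example:

```python
import numpy as np

def lqr_finite_horizon(A, B, Q, R, Q_T, T):
    """Backward Riccati recursion for x_{t+1} = A x_t + B u_t with stage
    cost x'Qx + u'Ru and terminal cost x'Q_T x. Returns gains K_0..K_{T-1}
    so that u_t = -K_t x_t is optimal."""
    P = Q_T
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    gains.reverse()                      # gains[t] is the gain at stage t
    return gains

# Assumed toy example: a double integrator
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
K = lqr_finite_horizon(A, B, Q=np.eye(2), R=np.array([[0.1]]),
                       Q_T=10 * np.eye(2), T=50)
```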
No class (student holiday)
09/23 L06. Discounted infinite horizon problems - tl;dr: DP still works
Bellman equations, value iteration
Readings: N5 §1-8; DPOC2 1.1-1.2, 1.5, 2.1-2.3
HW2 due, HW3 assigned [5,6]
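A minimal sketch (not course material) of the value iteration algorithm from L06, assuming a tabular MDP given as a NumPy (S, A, S) transition array and an (S, A) reward array:

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, tol=1e-8):
    """P: (S, A, S) transition probabilities; r: (S, A) expected rewards.
    Iterates the Bellman optimality operator until the sup-norm residual
    falls below tol, then extracts the greedy policy."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = r + gamma * (P @ V)          # (S, A) state-action values
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```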
09/25 L07. Discounted infinite horizon problems - tl;dr: DP still works (Part 2)
Policy iteration, geometric interpretation
Readings: N5 §1-8; DPOC2 1.1-1.2, 1.5, 2.1-2.3
R03. Modified policy iteration.
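A companion sketch for L07 (same assumed tabular interface as above): exact policy iteration, alternating policy evaluation by a linear solve with greedy improvement:

```python
import numpy as np

def policy_iteration(P, r, gamma=0.95):
    """Exact policy iteration for a tabular discounted MDP.
    P: (S, A, S) transitions; r: (S, A) expected rewards."""
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly
        P_pi = P[np.arange(S), pi]                 # (S, S)
        r_pi = r[np.arange(S), pi]                 # (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to V
        pi_new = (r + gamma * (P @ V)).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return V, pi
        pi = pi_new
```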
PART 2: Reinforcement Learning
09/30 L08. Model-free policy evaluation - Policy evaluation without knowing how the world works
Monte Carlo, policy evaluation, stochastic approximation of a mean, temporal differences (TD), TD(λ)
Readings: NDP 5.1-5.3; SB 12.1-12.2
HW3 due, HW4 assigned [7,8]
Project guidelines posted
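A minimal sketch of TD(0) policy evaluation from L08; the env_step(s, a) -> (next_state, reward, done) interface and the start state are assumptions for illustration:

```python
import numpy as np

def td0(env_step, pi, n_states, alpha=0.1, gamma=0.95, episodes=500):
    """TD(0) evaluation of a fixed policy pi (array mapping state -> action).
    env_step(s, a) -> (next_state, reward, done) is an assumed interface."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        s, done = 0, False                   # assumed start state
        while not done:
            s2, rew, done = env_step(s, pi[s])
            target = rew + (0.0 if done else gamma * V[s2])
            V[s] += alpha * (target - V[s])  # TD-error update
            s = s2
    return V
```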
10/02 L09. Model-free policy learning - Policy learning without knowing how the world works
State-action value function, Q-learning
Readings: NDP 5.6, 4.1-4.3
R04. Convergence of stochastic value iteration
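A minimal sketch of tabular Q-learning from L09 with epsilon-greedy exploration; the env_step interface is again an assumption:

```python
import numpy as np

def q_learning(env_step, n_states, n_actions, alpha=0.1, gamma=0.95,
               eps=0.1, episodes=1000, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration.
    env_step(s, a) -> (next_state, reward, done) is an assumed interface."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = 0, False                     # assumed start state
        while not done:
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s2, rew, done = env_step(s, a)
            target = rew + (0.0 if done else gamma * Q[s2].max())
            Q[s, a] += alpha * (target - Q[s, a])  # off-policy TD update
            s = s2
    return Q
```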
10/07 L10. Convergence of TD methods - tl;dr: Noisy, bootstrapped updates work
Stochastic approximation of a fixed point
Readings: NDP 4.1-4.3, 6.1-6.2; DPOC2 6.3
HW4 due, HW5 assigned [9,10]
10/09 L11. Convergence of TD methods - tl;dr: Noisy, bootstrapped updates work (Part 2)
Stochastic approximation of a fixed point
Readings: NDP 4.1-4.3, 6.1-6.2; DPOC2 6.3
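To illustrate the stochastic approximation viewpoint of L10-L11, a tiny sketch of a Robbins-Monro fixed-point iteration with noisy updates and 1/n step sizes (the contraction H here is a made-up example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Robbins-Monro iteration for the fixed point x* = E[H(x*, W)].
# Made-up example: H(x, w) = 0.5 * x + w with E[W] = 1, so x* = 2.
x = 0.0
for n in range(1, 100_001):
    w = rng.normal(loc=1.0, scale=1.0)        # noisy sample
    x += (1.0 / n) * ((0.5 * x + w) - x)      # steps sum to infinity,
                                              # squared steps are summable
print(x)  # converges to 2.0
```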
R05. Approximate value iteration.
10/14 L12. Approximate value-based RL - How to approximately solve an RL problem
Function approximation, approximate policy evaluation
Readings: DPOC2 2.5.3; NDP 3.1-3.2; SB 16.5
HW5 due, HW6 assigned [11,12]
10/16 L13. Approximate value-based RL - How to approximately solve an RL problem (Part 2)
Approximate VI, fitted Q iteration, DQN, DDQN and friends
Readings: DPOC2 2.5.3; NDP 3.1-3.2; SB 16.5
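A minimal sketch of fitted Q iteration from L13 on a fixed batch of transitions; the choice of scikit-learn's ExtraTreesRegressor is an assumption, and any regressor would do:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor   # assumed choice of regressor

def fitted_q_iteration(transitions, n_actions, gamma=0.95, iters=50):
    """Fitted Q iteration on a fixed batch of (s, a, r, s2, done) tuples,
    where states are feature vectors. A fresh regressor approximates
    Q(s, a) at every iteration."""
    s, a, r, s2, done = (np.array(x) for x in zip(*transitions))
    X = np.column_stack([s, a])            # regress on (state, action) pairs
    y = r.astype(float)                    # first target: immediate reward
    for _ in range(iters):
        model = ExtraTreesRegressor(n_estimators=50).fit(X, y)
        # Bootstrapped target: r + gamma * max_a' Qhat(s', a')
        q_next = np.column_stack([
            model.predict(np.column_stack([s2, np.full(len(s2), act)]))
            for act in range(n_actions)])
        y = r + gamma * (1 - done) * q_next.max(axis=1)
    return model
```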
R06. Performance difference lemma.
10/21 L14. Policy gradient - Simplicity at the cost of variance
Approximate policy iteration, policy gradient, variance reduction
Readings: NDP 6.1; SB Ch13.1-13.4
HW6 due, HW7 assigned [13,14]
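A minimal sketch of the basic REINFORCE policy gradient estimator from L14 for a softmax-over-features policy; the episode interface is an assumption, and no baseline is used, so the estimate is unbiased but high-variance (matching the lecture title):

```python
import numpy as np

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE update. theta: (d, n_actions) parameters;
    episode: list of (features, action, reward) tuples from one rollout
    (assumed interface)."""
    grads, rewards = [], []
    for x, a, rew in episode:
        logits = x @ theta
        p = np.exp(logits - logits.max())
        p /= p.sum()
        g = np.outer(x, -p)                # grad log pi(a|x) = x (e_a - p)'
        g[:, a] += x
        grads.append(g)
        rewards.append(rew)
    G = 0.0
    for t in reversed(range(len(rewards))):        # returns-to-go
        G = rewards[t] + gamma * G
        theta = theta + alpha * (gamma ** t) * G * grads[t]
    return theta
```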
10/23 L15. Actor-critic methods - Bringing together value-based and policy-based RL
Compatible function approximation, A2C, A3C, DDPG, SAC
Readings: SB Ch13.5-13.8
R07.
Project proposal due
10/28 L16. Advanced policy gradient methods - Managing exploration vs exploitation
Conservative policy iteration, natural policy gradient (NPG), TRPO, PPO
Readings: RLTA 11.1-11.2, Ch12
HW7 due, HW8 assigned [15,16,18]
10/30 L17. Multi-armed bandits - The prototypical exploration-exploitation dilemma
Readings: Bandits Ch 1, Ch 8
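A minimal sketch of UCB1 for the stochastic multi-armed bandit setting of L17; the pull(arm) -> reward interface is an assumption:

```python
import numpy as np

def ucb1(pull, n_arms, horizon):
    """UCB1 for a stochastic bandit with rewards in [0, 1].
    pull(arm) -> reward is an assumed interface."""
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                                 # play each arm once
        else:
            arm = int(np.argmax(means + np.sqrt(2 * np.log(t) / counts)))
        rew = pull(arm)
        counts[arm] += 1
        means[arm] += (rew - means[arm]) / counts[arm]  # running average
    return means, counts
```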
R08.
PART 3: Special topics
11/04 L18. Quiz
11/06 L19. Evaluation in RL - Is the RL method working?
Sensitivity analysis of RL, benchmarking, statistical methods, overfitting to benchmarks, model-based transfer learning
Readings: Lecture Appendices A-C
R09.
11/11 No class (student holiday)
11/13 L20. Guest lecture

HW8 due
R10.
11/18 L21. Guest lecture
11/20 L22. Guest lecture

R11
11/25 L23. Monte Carlo Tree Search
Online planning, MCTS, AlphaGo, AlphaGoZero, MuZero
Readings: SB 16.6
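A minimal sketch of UCT-style Monte Carlo tree search from L23 (selection by the UCB rule, one-node expansion, random rollout, mean-return backup); the deterministic step(state, action) simulator is an assumption:

```python
import math, random

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}                 # action -> Node
        self.visits, self.value = 0, 0.0

def uct_search(root_state, step, actions, n_sims=1000, c=1.4, depth=20):
    """Minimal UCT. step(state, action) -> (next_state, reward) is an
    assumed deterministic simulator; returns are undiscounted here."""
    root = Node(root_state)
    for _ in range(n_sims):
        node, path, total = root, [root], 0.0
        while True:                        # selection / expansion
            untried = [a for a in actions if a not in node.children]
            if untried:                    # expand one new child, then stop
                a = random.choice(untried)
                s2, rew = step(node.state, a)
                node.children[a] = Node(s2)
                node, total = node.children[a], total + rew
                path.append(node)
                break
            # all children tried: descend by the UCB selection rule
            a = max(actions, key=lambda act: node.children[act].value
                    + c * math.sqrt(math.log(node.visits)
                                    / node.children[act].visits))
            s2, rew = step(node.state, a)
            node, total = node.children[a], total + rew
            path.append(node)
        s = node.state                     # random rollout from the new node
        for _ in range(depth):
            s, rew = step(s, random.choice(actions))
            total += rew
        for n in path:                     # backup: mean return along path
            n.visits += 1
            n.value += (total - n.value) / n.visits
    return max(root.children, key=lambda a: root.children[a].visits)
```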
11/27 No class (Thanksgiving)
No class (Friday)
12/02 L24. Guest lecture

12/04 Final Project Presentations (~3 hours)
Presentations due before class; reports due Wed 12/10, 5pm

12/09 Final Project Presentations (~3 hours)