6.7950 Fall 2022 (formerly 6.246)      

Reinforcement Learning: Foundations and Methods

 

Note: The schedule is subject to minor changes, and will be updated periodically with lecture slides and readings.
Lectures meet on Tuesdays and Thursdays; recitations meet on Fridays.
PART 1: Dynamic Programming
09/08 L01. Dynamic programming: What makes sequential decision making hard?
Introduction, finite-horizon problem formulation, Markov decision processes, dynamic programming algorithm, sequential decision making as shortest path, course overview [L] [R]
Readings: N1 §3, N2 §1-3, N3 §1; DPOC 1.1-1.3, 2.1; SB Ch1 (skim)
R01. Inventory control. [H]
Readings: N3 §3; DPOC 3.2
09/13 L02. Special structures: What makes some sequential decision-making problems easy?
DP arguments, optimal stopping [L] [R]
Readings: DPOC 3.3-3.4
09/15 L03. Special structures: What makes some sequential decision-making problems easy? (Part 2)
Linear quadratic regulator [L] [R]
Readings: N3 §2; DPOC 3.1
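
Aside (not part of the official course materials): a minimal finite-horizon LQR sketch via the backward Riccati recursion, assuming linear dynamics x_{t+1} = A x_t + B u_t and stage cost x'Qx + u'Ru; the function name and interface are illustrative only.

import numpy as np

def lqr_finite_horizon(A, B, Q, R, T):
    """Backward Riccati recursion for x_{t+1} = A x + B u with stage cost
    x'Qx + u'Ru over horizon T. Returns feedback gains K_t such that the
    optimal control is u_t = -K_t x_t."""
    P = Q.copy()                                   # terminal cost-to-go matrix
    gains = []
    for _ in range(T):
        # K = (R + B'PB)^{-1} B'PA, obtained via a linear solve
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)              # Riccati backward step
        gains.append(K)
    return gains[::-1]                             # gains ordered t = 0, ..., T-1
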
R02. Combinatorial optimization as DP. LQR. [H]
Readings: DPOC App-B
HW0 (not due)
09/20 L04. Discounted infinite horizon problems: tl;dr: DP still works
Bellman equations, value iteration [L] [R]
Readings: DPOC2 1.1-1.2, 1.5, 2.1-2.3
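
Aside (not course material): a minimal value-iteration sketch for the discounted problem of L04, assuming a small tabular MDP given as a transition array P[s, a, s'] and a reward array r[s, a]; all names are illustrative.

import numpy as np

def value_iteration(P, r, gamma=0.95, tol=1e-8):
    """P: transition probabilities, shape (S, A, S); r: rewards, shape (S, A).
    Iterates the Bellman optimality operator until approximate convergence."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = r + gamma * P @ V                      # (S, A): one-step lookahead values
        V_new = Q.max(axis=1)                      # greedy Bellman backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)         # value function and greedy policy
        V = V_new
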
09/22 L05. Discounted infinite horizon problems: tl;dr: DP still works (Part 2)
Policy iteration, geometric interpretation [L] [R]
Readings: DPOC2 1.1-1.2, 1.5, 2.1-2.3
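
Aside (not course material): the corresponding policy-iteration sketch for L05, under the same assumed tabular MDP representation, alternating exact policy evaluation (a linear solve) with greedy improvement.

import numpy as np

def policy_iteration(P, r, gamma=0.95):
    """P: transitions, shape (S, A, S); r: rewards, shape (S, A)."""
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly
        P_pi = P[np.arange(S), pi]                 # (S, S) transitions under pi
        r_pi = r[np.arange(S), pi]                 # (S,) rewards under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to the Q-values of V
        pi_new = (r + gamma * P @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return V, pi
        pi = pi_new
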
R03. Infinite horizon [H]
HW1 [1,2,3]
PART 2: Reinforcement Learning
09/27 L06. MDPs and (PO)MDPs: Nuances, simplifications, generalizations
MDP assumptions, policy classes, imperfect state information, separation principle [L] [R]
Readings: DPOC 1.4, 4.1-4.2
09/29 NO CLASS
R04. Reward shaping, solving MDPs with linear programming. [H]
10/04 L07. From DP to RL: Policy evaluation without knowing how the world works
Monte Carlo, stochastic approximation of a mean, temporal differences [L] [R]
Readings: NDP 5.1-5.3, SB 12.1-12.2
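
Aside (not course material): a minimal tabular TD(0) policy-evaluation sketch for L07; the step(s, a) -> (reward, next_state, done) interface and the fixed policy function are assumptions for illustration.

import numpy as np

def td0_evaluate(step, policy, n_states, n_episodes=1000,
                 gamma=0.95, alpha=0.1, start_state=0):
    """Tabular TD(0) policy evaluation.
    step(s, a) is assumed to return (reward, next_state, done);
    policy(s) returns the action of the fixed policy being evaluated."""
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        s, done = start_state, False
        while not done:
            r, s_next, done = step(s, policy(s))
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])        # move toward the bootstrapped target
            s = s_next
    return V
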
10/06 L08. Value-based reinforcement learning: All about "Q"
Q-learning, stochastic approximation of a fixed point [L] [R]
Readings: NDP 5.6, 4.1-4.3
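
Aside (not course material): a tabular Q-learning sketch for L08, i.e., stochastic approximation of the fixed point of the Bellman optimality operator; the step(s, a) interface is assumed as above.

import numpy as np

def q_learning(step, n_states, n_actions, n_episodes=5000,
               gamma=0.95, alpha=0.1, eps=0.1, start_state=0):
    """Tabular Q-learning with epsilon-greedy exploration.
    step(s, a) is assumed to return (reward, next_state, done)."""
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()
    for _ in range(n_episodes):
        s, done = start_state, False
        while not done:
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            r, s_next, done = step(s, a)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])  # stochastic fixed-point update
            s = s_next
    return Q
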
R05. Supermartingale convergence theorem. [pH]
HW2 [4,5,6]
10/11 NO CLASS (Student Holiday)
10/13 L09. Value-based reinforcement learning: All about "Q" (Part 2)
Stochastic approximation of a fixed point (cont.) [L] [R]
Readings: NDP 4.1-4.3, 6.1-6.2; DPOC2 6.3
R06. Recap, approximate value iteration. [pH]
10/18 L10. Advanced value-based RL methods
Function approximation, approximate PI / VI / PE, fitted Q iteration, DQN and friends [L] [R]
Readings: DPOC2 2.5.3; NDP 3.1-3.2; SB 16.5
10/20 L11. Policy space methods: Simplicity at the cost of variance
Policy gradient [L] [R]
Readings: DPOC2 6.1; SB Ch13
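
Aside (not course material): a minimal REINFORCE sketch for the basic policy-gradient method of L11, using a tabular softmax policy; run_episode is an assumed rollout helper that returns a list of (state, action, reward) tuples.

import numpy as np

def reinforce_tabular(run_episode, n_states, n_actions,
                      n_iters=2000, gamma=0.99, lr=0.01):
    """REINFORCE (likelihood-ratio policy gradient) with a tabular softmax policy."""
    theta = np.zeros((n_states, n_actions))        # policy logits
    rng = np.random.default_rng()

    def policy(s):
        p = np.exp(theta[s] - theta[s].max())
        p /= p.sum()
        return int(rng.choice(n_actions, p=p))

    for _ in range(n_iters):
        episode = run_episode(policy)
        G = 0.0
        for s, a, r in reversed(episode):          # returns-to-go, computed backward
            G = r + gamma * G
            p = np.exp(theta[s] - theta[s].max())
            p /= p.sum()
            grad_log = -p
            grad_log[a] += 1.0                     # gradient of log softmax w.r.t. logits
            theta[s] += lr * G * grad_log          # score-function update
    return theta
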
R07.
HW3 [7,8,9,10]
10/25 L12. Policy space methods: Simplicity at the cost of variance (Part 2)
Variance reduction, actor-critic [L] [R]
Readings: DPOC2 6.1; SB Ch13
10/27 L13. Advanced policy space methods: Managing the learning rate
Conservative policy space methods, TRPO, PPO [L] [R]
Readings: Bandits Ch1-3, 5
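
Aside (not course material): the PPO clipped surrogate objective from L13 in a few lines of NumPy; ratio and advantage are assumed to be precomputed arrays.

import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate (to be maximized): min(r*A, clip(r, 1-eps, 1+eps)*A),
    where ratio = pi_new(a|s) / pi_old(a|s) and advantage estimates A(s, a)."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage).mean()
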
R08. Performance difference lemma [pH]
Project proposal
PART 3: Special topics
11/01 L14. RL in healthcare (Adam Yala, Berkeley)
[L] [R]
11/03 L15. Multi-armed bandits: The exploration-exploitation dilemma
[L] [R]
Readings: Bandits Ch1-3, 5
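
Aside (not course material): a minimal UCB1 sketch for the exploration-exploitation dilemma of L15; pull(arm) is an assumed function that samples a reward for the chosen arm.

import numpy as np

def ucb1(pull, n_arms, horizon=10000):
    """UCB1: pull each arm once, then always pull the arm maximizing
    empirical mean + sqrt(2 ln t / n_pulls)."""
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    for a in range(n_arms):                        # initialization: one pull per arm
        means[a] = pull(a)
        counts[a] = 1
    for t in range(n_arms + 1, horizon + 1):
        bonus = np.sqrt(2.0 * np.log(t) / counts)  # optimism in the face of uncertainty
        a = int(np.argmax(means + bonus))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]     # incremental mean update
    return means, counts
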
R09. Concentration inequalities, quiz review [pH]
HW4 [11,12]
11/08 L16. Quiz
11/10 L17. Empirical design in RL (Adam White, UAlberta / DeepMind)
[L] [R]
Readings: Cookbook

11/15 L18. Safe RL (Jaime Fisac, Princeton)
[L] [R]
11/17 L19. Solving MDP families (Taylor Killian, University of Toronto)
[L] [R]
11/22 L20. Learning for combinatorial optimization (Elias Khalil, University of Toronto)
[L] [R]
11/24 NO CLASS (Thanksgiving)
11/29 NO CLASS (NeurIPS) — Work on projects
HW5 [13,15,17]
12/01 NO CLASS (NeurIPS) — Work on projects
12/06 L21. Multi-agent RL (Eugene Vinitsky, Apple Special Projects / NYU)
[L]
12/08 L22. Recent theoretical results in RL (Sasha Rakhlin, MIT)
12/13 L23. Project presentations (extended session)
Slides due before class; reports due Wed 12/14, 5pm