Note: The schedule is subject to minor changes, and will be updated periodically with lecture slides and readings.
| Date | Tuesday | Date | Thursday | Friday |
|---|---|---|---|---|
| | | 09/08 | L01. Dynamic programming: What makes sequential decision making hard? Introduction, finite-horizon problem formulation, Markov decision processes, dynamic programming algorithm, sequential decision making as shortest path, course overview [L] [R] Readings: N1 §3, N2 §1-3, N3 §1; DPOC 1.1-1.3, 2.1; SB Ch1 (skim) | R01. Inventory control. [H] Readings: N3 §3; DPOC 3.2 |
| 09/13 | L02. Special structures: What makes some sequential decision-making problems easy? DP arguments, optimal stopping [L] [R] Readings: DPOC 3.3-3.4 | 09/15 | L03. Special structures: What makes some sequential decision-making problems easy? (Part 2) Linear quadratic regulator [L] [R] Readings: N3 §2; DPOC 3.1 | R02. Combinatorial optimization as DP. LQR. [H] Readings: DPOC App-B. HW0 (not due) |
| 09/20 | L04. Discounted infinite horizon problems: tl;dr: DP still works. Bellman equations, value iteration (a short sketch appears below the table) [L] [R] Readings: DPOC2 1.1-1.2, 1.5, 2.1-2.3 | 09/22 | L05. Discounted infinite horizon problems: tl;dr: DP still works (Part 2). Policy iteration, geometric interpretation [L] [R] Readings: DPOC2 1.1-1.2, 1.5, 2.1-2.3 | R03. Infinite horizon. [H] HW1 [1,2,3] |
| 09/27 | L06. MDPs and (PO)MDPs: Nuances, simplifications, generalizations. MDP assumptions, policy classes, imperfect state information, separation principle [L] [R] Readings: DPOC 1.4, 4.1-4.2 | 09/29 | NO CLASS | R04. Reward shaping, MDPs with LP. [H] |
| 10/04 | L07. From DP to RL: Policy evaluation without knowing how the world works. Monte Carlo, stochastic approximation of a mean, temporal differences [L] [R] Readings: NDP 5.1-5.3; SB 12.1-12.2 | 10/06 | L08. Value-based reinforcement learning: All about "Q". Q-learning, stochastic approximation of a fixed point [L] [R] Readings: NDP 5.6, 4.1-4.3 | R05. Supermartingale convergence theorem. [pH] HW2 [4,5,6] |
| 10/11 | NO CLASS (Student Holiday) | 10/13 | L09. Value-based reinforcement learning: All about "Q" (Part 2). Stochastic approximation of a fixed point (cont.) [L] [R] Readings: NDP 4.1-4.3, 6.1-6.2; DPOC2 6.3 | R06. Recap, approximate value iteration. [pH] |
| 10/18 | L10. Advanced value-based RL methods. Function approximation, approximate PI / VI / PE, fitted Q iteration, DQN and friends [L] [R] Readings: DPOC2 2.5.3; NDP 3.1-3.2; SB 16.5 | 10/20 | L11. Policy space methods: Simplicity at the cost of variance. Policy gradient [L] [R] Readings: DPOC2 6.1; SB Ch13 | R07. HW3 [7,8,9,10] |
| 10/25 | L12. Policy space methods: Simplicity at the cost of variance (Part 2). Variance reduction, actor-critic [L] [R] Readings: DPOC2 6.1; SB Ch13 | 10/27 | L13. Advanced policy space methods: Managing the learning rate. Conservative policy space methods, TRPO, PPO [L] [R] Readings: Bandits Ch1-3, 5 | R08. Performance difference lemma. [pH] Project proposal |
| 11/01 | L14. RL in healthcare (Adam Yala, Berkeley) [L] [R] | 11/03 | L15. Multi-arm bandits: Exploration-exploitation dilemma [L] [R] Readings: Bandits Ch1-3, 5 | R09. Concentration inequalities, quiz review [pH] HW4 [11,12] |
| 11/08 | L16. Quiz | 11/10 | L17. Empirical design in RL (Adam White, UAlberta / DeepMind) [L] [R] Readings: Cookbook | |
| 11/15 | L18. Safe RL (Jaime Fisac, Princeton) [L] [R] | 11/17 | L19. Solving MDP families (Taylor Killian, University of Toronto) [L] [R] | |
| 11/22 | L20. Learning for combinatorial optimization (Elias Khalil, University of Toronto) [L] [R] | 11/24 | NO CLASS (Thanksgiving) | |
| 11/29 | NO CLASS (NeurIPS). Work on projects. HW5 [13,15,17] | 12/01 | NO CLASS (NeurIPS). Work on projects | |
| 12/06 | L21. Multi-agent RL (Eugene Vinitsky, Apple Special Projects / NYU) [L] | 12/08 | L22. Recent theoretical results in RL (Sasha Rakhlin, MIT) | |
| 12/13 | L23. Project presentations (extended session). Slides due before class; reports due Wed 12/14, 5pm | | | |
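
For readers skimming the topic list, here is a minimal sketch of the value iteration algorithm named in L04 (Bellman equations, value iteration). It is illustrative only: the two-state MDP, its transition tensor `P`, reward matrix `R`, and discount factor are made-up placeholders, not course materials.

```python
# Minimal value iteration sketch (cf. L04: Bellman equations, value iteration).
# The MDP below (P, R, gamma) is a made-up placeholder, not course material.
import numpy as np

# P[s, a, s'] = probability of landing in state s' after taking action a in state s
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0 and 1
    [[0.0, 1.0], [0.5, 0.5]],   # transitions from state 1 under actions 0 and 1
])
# R[s, a] = expected one-step reward
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])
gamma = 0.95  # discount factor

V = np.zeros(2)
for _ in range(10_000):
    # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') V(s')
    Q = R + gamma * (P @ V)
    V_next = Q.max(axis=1)
    if np.max(np.abs(V_next - V)) < 1e-10:  # the backup is a contraction, so V converges to V*
        V = V_next
        break
    V = V_next

greedy_policy = Q.argmax(axis=1)  # greedy policy with respect to the (near-)optimal values
print("V* =", V, "greedy policy =", greedy_policy)
```

Policy iteration (L05) alternates a policy evaluation step with the same greedy improvement step used on the last line.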