Note: The schedule is subject to minor changes, and will be updated periodically with lecture slides and readings.
Date | Tuesday | Date | Thursday | Friday
---|---|---|---|---
| | 09/07 | L01. What is sequential decision making? Introduction, finite-horizon problem formulation, Markov decision processes, dynamic programming algorithm, sequential decision making as shortest path, course overview. Readings: N1 §3, N2 §1-3, N3 §1; DPOC 1.1-1.3, 2.1; SB Ch1 (skim) [L] [R] | R01.
09/12 | L02. Dynamic programming: What makes sequential decision-making problems hard? DP algorithm, sequential decision making as shortest path. Readings: DPOC 3.3-3.4 [L] [R] | 09/14 | L03. Special structures: What makes some sequential decision-making problems easy? DP arguments, optimal stopping, linear quadratic regulator. Readings: N3 §2; DPOC 3.1 [L] [R] | R02. HW0 (not due)
09/19 | L04. Discounted infinite-horizon problems: tl;dr: DP still works. Bellman equations, value iteration. Readings: DPOC2 1.1-1.2, 1.5, 2.1-2.3 [L] [R] | 09/21 | L05. Discounted infinite-horizon problems: tl;dr: DP still works (Part 2). Policy iteration, geometric interpretation. Readings: DPOC2 1.1-1.2, 1.5, 2.1-2.3 [R] HW1 [1,2,3] | Student Holiday
09/26 | L06. MDPs and (PO)MDPs: Nuances, simplifications, generalizations. MDP assumptions, policy classes, imperfect state information, separation principle. Readings: DPOC 1.4, 4.1-4.2 [L] [R] | 09/28 | L07. Model-free methods: From DP to RL. Monte Carlo, stochastic approximation of a mean, temporal differences. Readings: NDP 5.1-5.3, SB 12.1-12.2 [L] [R] | R03.
10/03 | L08. Value-based reinforcement learning: All about "Q". Q-learning, stochastic approximation of a fixed point. Readings: NDP 5.6, 4.1-4.3 [L] [R] | 10/05 | L09. Value-based reinforcement learning: All about "Q" (Part 2). Stochastic approximation of a fixed point (cont.). Readings: NDP 4.1-4.3, 6.1-6.2; DPOC2 6.3 [R] | R04. HW2 [4,5,6]
10/10 | NO CLASS (Student Holiday) | 10/12 | L10. Advanced value-based RL methods. Function approximation, approximate policy evaluation. Readings: DPOC2 2.5.3; NDP 3.1-3.2, SB 16.5 [L] [R] | R05.
10/17 | L11. Advanced value-based RL methods. Approximate VI, fitted Q iteration, DQN, DDQN, and friends. Readings: DPOC2 2.5.3; NDP 3.1-3.2, SB 13 [R] | 10/19 | L12. Policy gradient: Simplicity at the cost of variance. Approximate policy iteration, policy gradient, variance reduction, actor-critic. Readings: NDP 6.1; SB Ch13 [L] [R] | R06. HW3 [7,8,9,10,11]
10/24 | L13. Advanced policy gradient methods: Managing exploration vs. exploitation. Conservative policy space methods, TRPO, PPO. Readings: RLTA 11.1-11.2, Ch12 [L] [R] | 10/26 | L14. Design and analysis of experiments in RL. DOX process, factors. Readings: Montgomery (Ch1-2), Cookbook [L] [R] | R07. Project proposal
10/31 | L15. Design and analysis of experiments in RL (Part 2). Statistical testing, implementation best practices, response variables, choice of designs. Readings: Montgomery (Ch1-2), Cookbook [R] | 11/02 | L16. Monte Carlo tree search. Online planning, MCTS, AlphaGo, AlphaGo Zero, MuZero. Readings: SB 16.6 [L] [R] | R08. HW4 [12, 13]
11/07 | L17. Quiz | 11/09 | L18. Multi-armed bandits: The prototypical exploration-exploitation dilemma. Readings: Bandits Ch1, Ch8 [L] [R] |
11/14 | L19. Recommendation systems: Between the theory and practice of contextual bandits (Guest Lecture: Lihong Li, Amazon). Readings: News article recommendation [L] [R] | 11/16 | L20. Connections between control and learning (Guest Lecture: Prof. Anuradha Annaswamy, MIT). Readings: Adaptive control and RL [L] [R] | R09.
11/21 | L21. Learning for discrete optimization. Discrete optimization, integer programming, attention, learning-guided vs. construction [L] [R] HW5 [14, 15, 18] | 11/23 | NO CLASS (Thanksgiving) |
11/28 | L22. Distributional RL (Guest Lecture: Marc G. Bellemare, Reliant AI; formerly Google Brain). Readings: Bellemare, Dabney, Rowland [L] [R] | 11/30 | L23. Reinforcement learning from human feedback (Guest Lecture: Dylan Hadfield-Menell, MIT). Inverse RL, RLHF. Readings: Open problems in RLHF [L] [R] |
12/05 | NO CLASS (see merged class on 12/07) | 12/07 | L24+25. Project presentations (double lecture). Slides due before class; reports due Wed 12/13, 5pm | R10. Project presentations (2x)
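For a quick taste of the value iteration material covered in L04-L05, here is a minimal sketch of the Bellman optimality backup on a toy two-state, two-action MDP. All numbers (transition matrices, rewards, discount factor) are invented for illustration and are not course material:

```python
import numpy as np

# Toy MDP: states {0, 1}, actions {0, 1} -- all values made up for illustration.
# P[a][s, s'] = probability of moving from s to s' under action a; R[s, a] = expected reward.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),   # transitions under action 0
     np.array([[0.5, 0.5], [0.1, 0.9]])]   # transitions under action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor

# Value iteration: repeatedly apply V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ].
V = np.zeros(2)
for _ in range(500):
    Q = np.stack([R[:, a] + gamma * P[a] @ V for a in range(2)], axis=1)  # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:  # sup-norm stopping test (contraction => convergence)
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy with respect to the converged values
print(V, policy)
```

Because the Bellman operator is a gamma-contraction in the sup norm, the loop converges geometrically to the unique fixed point V*, and acting greedily with respect to V* gives an optimal stationary policy.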