Note: The schedule is subject to minor changes, and will be updated periodically with lecture slides and readings.
| Date | Tuesday | Date | Thursday | Friday |
|---|---|---|---|---|
| | | 09/08 | L01. Dynamic programming: What makes sequential decision making hard? Introduction, finite-horizon problem formulation, Markov decision processes, dynamic programming algorithm, sequential decision making as shortest path, course overview [L] [R] Readings: N1 §3, N2 §1-3, N3 §1; DPOC 1.1-1.3, 2.1; SB Ch1 (skim) | R01. Inventory control. [H] Readings: N3 §3; DPOC 3.2 |
| 09/13 | L02. Special structures: What makes some sequential decision-making problems easy? DP arguments, optimal stopping [L] [R] Readings: DPOC 3.3-3.4 | 09/15 | L03. Special structures: What makes some sequential decision-making problems easy? (Part 2) Linear quadratic regulator [L] [R] Readings: N3 §2; DPOC 3.1 | R02. Combinatorial optimization as DP. LQR. [H] Readings: DPOC App-B. HW0 (not due) |
| 09/20 | L04. Discounted infinite horizon problems: tl;dr: DP still works. Bellman equations, value iteration (a short sketch appears below the table) [L] [R] Readings: DPOC2 1.1-1.2, 1.5, 2.1-2.3 | 09/22 | L05. Discounted infinite horizon problems: tl;dr: DP still works (Part 2). Policy iteration, geometric interpretation [L] [R] Readings: DPOC2 1.1-1.2, 1.5, 2.1-2.3 | R03. Infinite horizon. [H] HW1 [1,2,3] |
| 09/27 | L06. MDPs and (PO)MDPs: Nuances, simplifications, generalizations. MDP assumptions, policy classes, imperfect state information, separation principle [L] [R] Readings: DPOC 1.4, 4.1-4.2 | 09/29 | NO CLASS | R04. Reward shaping, MDPs with LP. [H] |
| 10/04 | L07. From DP to RL: Policy evaluation without knowing how the world works. Monte Carlo, stochastic approximation of a mean, temporal differences [L] [R] Readings: NDP 5.1-5.3; SB 12.1-12.2 | 10/06 | L08. Value-based reinforcement learning: All about "Q". Q-learning, stochastic approximation of a fixed point [L] [R] Readings: NDP 5.6, 4.1-4.3 | R05. Supermartingale convergence theorem. [pH] HW2 [4,5,6] |
| 10/11 | NO CLASS (Student Holiday) | 10/13 | L09. Value-based reinforcement learning: All about "Q" (Part 2). Stochastic approximation of a fixed point (cont.) [L] [R] Readings: NDP 4.1-4.3, 6.1-6.2; DPOC2 6.3 | R06. Recap, approximate value iteration. [pH] |
| 10/18 | L10. Advanced value-based RL methods. Function approximation, approximate PI / VI / PE, fitted Q iteration, DQN and friends [L] [R] Readings: DPOC2 2.5.3; NDP 3.1-3.2; SB 16.5 | 10/20 | L11. Policy space methods: Simplicity at the cost of variance. Policy gradient [L] [R] Readings: DPOC2 6.1; SB Ch13 | R07. HW3 [7,8,9,10] |
| 10/25 | L12. Policy space methods: Simplicity at the cost of variance (Part 2). Variance reduction, actor-critic [L] [R] Readings: DPOC2 6.1; SB Ch13 | 10/27 | L13. Advanced policy space methods: Managing the learning rate. Conservative policy space methods, TRPO, PPO [L] [R] Readings: Bandits Ch1-3, 5 | R08. Performance difference lemma. [pH] Project proposal |
| 11/01 | L14. RL in healthcare (Adam Yala, Berkeley) [L] [R] | 11/03 | L15. Multi-arm bandits: Exploration-exploitation dilemma [L] [R] Readings: Bandits Ch1-3, 5 | R09. Concentration inequalities, quiz review [pH] HW4 [11,12] |
| 11/08 | L16. Quiz | 11/10 | L17. Empirical design in RL (Adam White, UAlberta / DeepMind) [L] [R] Readings: Cookbook | |
| 11/15 | L18. Safe RL (Jaime Fisac, Princeton) [L] [R] | 11/17 | L19. Solving MDP families (Taylor Killian, University of Toronto) [L] [R] | |
| 11/22 | L20. Learning for combinatorial optimization (Elias Khalil, University of Toronto) [L] [R] | 11/24 | NO CLASS (Thanksgiving) | |
| 11/29 | NO CLASS (NeurIPS). Work on projects. HW5 [13,15,17] | 12/01 | NO CLASS (NeurIPS). Work on projects | |
| 12/06 | L21. Multi-agent RL (Eugene Vinitsky, Apple Special Projects / NYU) [L] | 12/08 | L22. Recent theoretical results in RL (Sasha Rakhlin, MIT) | |
| 12/13 | L23. Project presentations (extended session). Slides due before class; reports due Wed 12/14, 5pm | | | |
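
For readers skimming the topic list, here is a minimal sketch of the value iteration algorithm named in L04 (Bellman equations, value iteration). It is illustrative only: the two-state MDP, its transition tensor `P`, reward matrix `R`, and discount factor are made-up placeholders, not course materials.

```python
# Minimal value iteration sketch (cf. L04: Bellman equations, value iteration).
# The MDP below (P, R, gamma) is a made-up placeholder, not course material.
import numpy as np

# P[s, a, s'] = probability of landing in state s' after taking action a in state s
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0 and 1
    [[0.0, 1.0], [0.5, 0.5]],   # transitions from state 1 under actions 0 and 1
])
# R[s, a] = expected one-step reward
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])
gamma = 0.95  # discount factor

V = np.zeros(2)
for _ in range(10_000):
    # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') V(s')
    Q = R + gamma * (P @ V)
    V_next = Q.max(axis=1)
    if np.max(np.abs(V_next - V)) < 1e-10:  # the backup is a contraction, so V converges to V*
        V = V_next
        break
    V = V_next

greedy_policy = Q.argmax(axis=1)  # greedy policy with respect to the (near-)optimal values
print("V* =", V, "greedy policy =", greedy_policy)
```

Policy iteration (L05) alternates a policy evaluation step with the same greedy improvement step used on the last line.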