6.7920 Fall 2024      

Reinforcement Learning: Foundations and Methods

 

Note: The schedule is subject to minor changes and will be updated periodically with lecture slides and readings.
Schedule: Tuesday and Thursday lectures (dated), followed by the Friday recitation for that week.
PART 1: Dynamic Programming
09/05 L01. What is sequential decision making?
Introduction, finite-horizon problem formulation, Markov decision processes, course overview
Readings: N1 §3, N2 §1-3, N3 §1; DPOC 1.1-1.3, 2.1; SB Ch1 (skim) [R]
R01. Inventory control.
Readings: N3 §3; DPOC 3.2
09/10 L02. Dynamic Programming: What makes sequential decision-making hard?
Finite horizon dynamic programming algorithm, sequential decision making as shortest path
Readings: DPOC 3.3-3.4 [R]
HW1 assigned [1, 2]
09/12 L03. Special structures: What makes some sequential decision-making problems easy?
DP arguments, inventory problem, optimal stopping
Readings: N3 §3; DPOC 3.2 [R]
R02. Combinatorial optimization as DP.
Readings: DPOC App-B
09/17 L04. Special Structures
Linear quadratic regulator
Readings: N3 §2; DPOC 3.1 [R]
HW1 due, HW2 assigned [3,4]
09/19 L05. Non-discounted Infinite Horizon Problem
Linear quadratic regulator
Readings: DPOC 1.4, 4.1-4.2 [R]
No class (student holiday)
09/24 L06. Discounted infinite horizon problems - tl;dr: DP still works
Bellman equations, value iteration
Readings: DPOC2 1.1-1.2, 1.5, 2.1-2.3 [R]
HW2 due, HW3 assigned [5,6]
09/26 L07. Discounted infinite horizon problems - tl;dr: DP still works (Part 2)
Policy iteration, geometric interpretation
Readings: DPOC2 1.1-1.2, 1.5, 2.1-2.3 [R]
R03. Linear quadratic regulator
Project ideas posted
PART 2: Reinforcement Learning
10/01 L08. Model-free methods - From DP to RL
Monte Carlo, policy evaluation, stochastic approximation of a mean, temporal differences
Readings: NDP 5.1-5.3, SB 12.1-12.2 [L] [R]
HW3 due, HW4 assigned [7,8]
10/03 L09. Value-based reinforcement learning - Policy learning without knowing how the world works
State-action value function, Q-learning
Readings: NDP 5.6, 4.1-4.3 [L] [R]
R04.
10/08 L10. Value-based reinforcement learning - Policy learning without knowing how the world works (Part 2)
Stochastic approximation of a fixed point
Readings: NDP 4.1-4.3, 6.1-6.2; DPOC2 6.3 [L] [R]
HW4 due, HW5 assigned [9,10]
10/10 L11. Approximate value-based RL - How to approximately solve an RL problem
Function approximation, approximate policy evaluation
Readings: DPOC2 2.5.3; NDP 3.1-3.2, SB 16.5 [L] [R]
R05.
10/15 NO CLASS (Student Holiday)
10/17 L12. Approximate value-based RL - How to approximately solve an RL problem (Part 2)
Approximate VI, fitted Q iteration, DQN, DDQN and friends
Readings: DPOC2 2.5.3; NDP 3.1-3.2, SB 16.5 [L] [R]
HW5 due, HW6 assigned [11,12]
R06.
10/22 L13. Policy gradient - Simplicity at the cost of variance
Approximate policy iteration, policy gradient, variance reduction
Readings: NDP 6.1; SB Ch13.1-13.4 [L] [R]
10/24 L14. Actor-critic Methods - Bringing together value-based and policy-based RL
Compatible function approximation, A2C, A3C, DDPG, SAC
Readings: SB Ch13.5-13.8 [L] [R]
HW6 due, HW7 assigned [13,14]
R07.
Project proposal
10/29 L15. Advanced policy gradient methods - Managing exploration vs exploitation
Conservative policy iteration, NPG, TRPO, PPO
Readings: RLTA 11.1-11.2, Ch12 [L] [R]
10/31 L16. Multi-armed bandits - The prototypical exploration-exploitation dilemma
Readings: Bandits Ch 1, Ch 8 [L] [R]
HW7 due
R08.
PART 3: Special topics
11/05 L17. Quiz
11/07 L18. Evaluation in RL - Is the RL method working?
Sensitivity analysis of RL, benchmarking, statistical methods, overfitting to benchmarks, model-based transfer learning
Readings: Lecture Appendices A-C [L] [R]
HW8 assigned [15,16,18]
No class
11/12 L19. Applications of Reinforcement Learning in Criminal Justice and Healthcare (Guest lecture: Pengyi Shi, Purdue)
Readings: (a) (b) [L] [R]
11/14 L20. Monte Carlo Tree Search
Online planning, MCTS, AlphaGo, AlphaGo Zero, MuZero
Readings: SB 16.6 [L] [R]
HW8 due
No class
11/19 L21. Rethinking the Theoretical Foundation of Reinforcement Learning (Guest lecture: Nan Jiang, UIUC)
Readings: (a) (b) [L] [R]
11/21 L22. Representation-based Reinforcement Learning (Guest lecture: Bo Dai, Georgia Tech)
Readings: (a) (b) [L] [R]
R09.
11/26 L23. Finite-time Guarantees of Contractive Stochastic Approximation (Guest lecture: Siva Theja Maguluri, Georgia Tech)
Readings: (a) (b) [L] [R]
11/28 NO CLASS (Thanksgiving)
No class
12/03 NO CLASS
See merged class on Thu
12/05 L24+25. Project presentations (double lecture)
Slides due (before class), Reports due (Wed 12/11 5pm)

R10. Project presentations (2x)
12/10 NO CLASS
See session on previous Friday