ROLLOUT, POLICY ITERATION, AND DISTRIBUTED REINFORCEMENT LEARNING BOOK:

Athena Scientific, August 2020

The book is available from the publishing company Athena Scientific, and from Amazon.com. It is also available as an Ebook from Google Books.

Click here for a free .pdf copy of the book.

This is a research monograph at the forefront of research on reinforcement learning, also referred to by other names such as approximate dynamic programming and neuro-dynamic programming. The purpose of the monograph is to develop in greater depth some of the methods from the author's recently published textbook on Reinforcement Learning (Athena Scientific, 2019). In particular, we present new research, relating to systems involving multiple agents, partitioned architectures, and distributed asynchronous computation. We pay special attention to the contexts of dynamic programming/policy iteration and control theory/model predictive control. We also discuss in some detail the application of the methodology to challenging discrete/combinatorial optimization problems, such as routing, scheduling, assignment, and mixed integer programming, including the use of neural network approximations within these contexts.

The book focuses on the fundamental idea of policy iteration, i.e., start from some policy, and successively generate one or more improved policies. If just one improved policy is generated, this is called rollout, which, based on broad and consistent computational experience, appears to be one of the most versatile and reliable of all reinforcement learning methods. Among others, it can be applied on-line using easily implementable simulation, and it can be used for discrete deterministic combinatorial optimization, as well as for stochastic Markov decision problems. Moreover, rollout can make on-line use of the policy produced off-line by policy iteration or by any other method (including a policy gradient method), and improve on the performance of that policy,

Much of the new research is inspired by the remarkable AlphaZero chess program, where policy iteration, value and policy networks, approximate lookahead minimization, and parallel computation all play an important role. In addition to the fundamental process of successive policy iteration/improvement, this program includes the use of deep neural networks for representation of both value functions and policies, the extensive use of large scale parallelization, and the simplification of lookahead minimization, through methods involving Monte Carlo tree search and pruning of the lookahead tree. In this book, we also focus on policy iteration, value and policy neural network representations, parallel and distributed computation, and lookahead simplification. Thus while there are significant differences, the principal design ideas that form the core of this monograph are shared by the AlphaZero architecture, except that we develop these ideas in a broader and less application-specific framework.

Among its special features, the book:

Presents new research relating to distributed asynchronous computation, partitioned architectures, and multiagent systems, with application to challenging large scale optimization problems, such as partially observed Markov decision problems.

Describes variants of rollout and policy iteration for problems with a multiagent structure, which allow a dramatic reduction of the computational requirements for lookahead minimization.

Establishes a connection of rollout with model predictive control, one of the most prominent control system design methodologies.

Describes the application of constrained and multiagent forms of rollout to challenging discrete and combinatorial optimization problems.

DISTRIBUTED AND MULTIAGENT REINFORCEMENT LEARNING COURSE MATERIAL

ASU, 2020

Dimitri P. Bertsekas

The book was developed during the 2020 offering of the Topics in Reinforcement Learning course at Arizona State University (abbreviated due to the corona virus health crisis).

Lecture slides from the 2020 course, and two related videolectures:

Slides-Lecture 1, Slides-Lecture 2, Slides-Lecture 3, Slides-Lecture 4, Slides-Lecture 5, Slides-Lecture 6, Slides-Lecture 8.

Video of an Overview Lecture on Distributed RL from IPAM workshop at UCLA, Feb. 2020 (Slides).

Video of an Overview Lecture on Multiagent RL from a lecture at ASU, Oct. 2020 (Slides).

RELATED MATERIAL

The following papers and reports have a strong connection to material in the book, and amplify on its analysis and its range of applications.

Bertsekas, D., "Multiagent Reinforcement Learning: Rollout and Policy Iteration," ASU Report Sept. 2020; published in IEEE/CAA Journal of Automatica Sinica, Vol. 8, 2021, pp. 249-271.

Bertsekas, D., "Multiagent Value Iteration Algorithms in Dynamic Programming and Reinforcement Learning," arXiv preprint, arXiv:2005.01627, April 2020; Results in Control and Optimization J., Vol. 1, 2020.

Bertsekas, D., "Multiagent Rollout Algorithms and Reinforcement Learning," arXiv preprint arXiv:1910.00120, September 2019 (revised April 2020).

Bertsekas, D., "Constrained Multiagent Rollout and Multidimensional Assignment with the Auction Algorithm," arXiv preprint, arXiv:2002.07407 February 2020.

Bhattacharya, S., Badyal, S., Wheeler, W., Gil, S., Bertsekas, D.,"Reinforcement Learning for POMDP: Partitioned Rollout and Policy Iteration with Application to Autonomous Sequential Repair Problems," IEEE Robotics and Automation Letters, Vol. 5, pp. 3967-3974, 2020.

Bhattacharya, S., Kailas, S., Badyal, S., Gil, S., Bertsekas, D.,"Multiagent Rollout and Policy Iteration for POMDP with Application to Multi-Robot Repair Problems," Proc. CORL, 2020; arXiv preprint, arXiv:2011.04222, November 2020.

D. P. Bertsekas, "Distributed Asynchronous Policy Iteration for Sequential Zero-Sum Games and Minimax Control," arXiv preprint arXiv:2107.10406, July 2021.

D. P. Bertsekas, "On-Line Policy Iteration for Infinite Horizon Dynamic Programming," arXiv preprint arXiv:2106.00746, May 2021.

D. P. Bertsekas and H. Yu, "Distributed Asynchronous Policy Iteration in Dynamic Programming," Proc. of 2010 Allerton Conference on Communication, Control, and Computing, Allerton Park, ILL, Sept. 2010. (Related Lecture Slides) (An extended version with additional algorithmic analysis) (A counterexample by Williams and Baird that motivates in part this paper).

LINK TO THE 2025 "COURSE ON REINFORCEMENT LEARNING" BOOK

Reinforcement Learning Course Material

LINK TO THE 2022 LESSONS FROM ALPHAZERO BOOK

Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control

Visits since February 15, 2020