FORTHCOMING RESEARCH MONOGRAPH: LESSONS FROM ALPHAZERO


"Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control," draft of a forthcoming (2022) research monograph; an August 2021 version went on line as arXiv preprint arXiv:2108.10315.

This draft will be periodically updated until shortly before its formal publication. Check this site for the latest version.

A brief description: Some of the most exciting success stories in artificial intelligence and reinforcement learning have been in the area of games. Primary examples are the recent AlphaZero program (which plays chess) and the similarly structured, earlier (1990s) TD-Gammon program (which plays backgammon). These programs were trained off-line extensively using sophisticated self-play/approximate policy iteration algorithms and neural networks. Yet the AlphaZero player obtained off-line is not used directly during on-line play (it is too inaccurate due to approximation errors that are inherent in off-line neural network training). Instead, a separate on-line player is used, which is based on multistep lookahead and a terminal position evaluator that was trained off-line. The on-line player performs a form of policy improvement, which, unlike the off-line player, is not degraded by neural network approximations. As a result, it has greatly improved performance.
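As a much simplified illustration of this on-line scheme, the sketch below shows multistep lookahead minimization with an off-line obtained terminal cost approximation, for a deterministic problem with a finite set of controls at each state. The names next_state, stage_cost, actions, and J_tilde are illustrative placeholders (a model and a trained evaluator supplied by the user), not code from the monograph.

    # Minimal sketch: approximation in value space with multistep lookahead.
    # Assumes a deterministic model next_state(x, u), a cost stage_cost(x, u),
    # a finite control set actions(x), and an off-line trained terminal cost
    # approximation J_tilde(x). All names are illustrative assumptions.

    def lookahead_cost(x, depth, J_tilde, next_state, stage_cost, actions):
        # Cost of the best depth-step control sequence starting at x, with the
        # off-line approximation J_tilde applied at the end of the lookahead.
        if depth == 0:
            return J_tilde(x)
        return min(stage_cost(x, u)
                   + lookahead_cost(next_state(x, u), depth - 1,
                                    J_tilde, next_state, stage_cost, actions)
                   for u in actions(x))

    def online_control(x, depth, J_tilde, next_state, stage_cost, actions):
        # On-line player: apply the first control of the best depth-step
        # sequence; this is the policy improvement step performed on-line.
        return min(actions(x),
                   key=lambda u: stage_cost(x, u)
                   + lookahead_cost(next_state(x, u), depth - 1,
                                    J_tilde, next_state, stage_cost, actions))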

Similarly, TD-Gammon performs on-line a policy improvement step using lookahead minimization that is not degraded by neural network approximations. To this end, it uses a terminal position evaluator trained off-line by a neural network, and, importantly, it also extends its on-line lookahead by rollout (simulation with the one-step lookahead player that is based on the position evaluator).
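In the same simplified deterministic setting as the previous sketch, truncated rollout can be outlined as follows: each candidate control is scored by simulating the base policy (here the one-step lookahead player defined by J_tilde) for a limited horizon and then adding the terminal approximation. This is only a schematic sketch; in a stochastic game such as backgammon the simulation would instead average over Monte Carlo trajectories.

    # Minimal sketch: truncated rollout with a base policy given by one-step
    # lookahead on an off-line trained evaluator J_tilde. The model functions
    # next_state, stage_cost, actions are illustrative assumptions, as before.

    def base_policy(x, J_tilde, next_state, stage_cost, actions):
        # One-step lookahead player defined by the position evaluator J_tilde.
        return min(actions(x),
                   key=lambda u: stage_cost(x, u) + J_tilde(next_state(x, u)))

    def rollout_control(x, J_tilde, next_state, stage_cost, actions, horizon=20):
        # Score each candidate control by simulating the base policy for a
        # truncated horizon, then adding the terminal approximation J_tilde.
        def score(u):
            cost, y = stage_cost(x, u), next_state(x, u)
            for _ in range(horizon):
                v = base_policy(y, J_tilde, next_state, stage_cost, actions)
                cost, y = cost + stage_cost(y, v), next_state(y, v)
            return cost + J_tilde(y)
        return min(actions(x), key=score)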

An important lesson from AlphaZero and TD-Gammon is that the performance of an off-line trained controller can be greatly improved by on-line play, with long lookahead (involving minimization, or rollout with a policy obtained off-line, or both) and terminal cost approximation obtained off-line. This performance enhancement is often dramatic and is due to a simple fact, which is the central point of our work: on-line play amounts to a step of Newton's method for solving Bellman's equation, while the starting point for the Newton step is based on the results of off-line training and may be enhanced by longer lookahead and on-line rollout. This process can be understood in terms of abstract models of dynamic programming and simple geometrical constructions. It manifests itself to some extent in model predictive control, but it seems that it has yet to be fully appreciated within the decision and control community.
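In standard Bellman operator notation (a sketch in our own notation, not a quotation from the monograph), the idea can be outlined as follows. Let $T$ be the Bellman operator, $T_{\mu}$ the operator of a policy $\mu$, and $\tilde J$ the cost approximation produced by off-line training. The one-step lookahead policy $\tilde\mu$ and its cost $J_{\tilde\mu}$ satisfy

\[
T_{\tilde\mu}\tilde J \;=\; T\tilde J,
\qquad
J_{\tilde\mu} \;=\; T_{\tilde\mu} J_{\tilde\mu},
\]

so that $J_{\tilde\mu}$ is obtained by solving the policy-restricted equation $J = T_{\tilde\mu} J$, which (under the conditions discussed in the monograph) is a Newton iteration for the fixed point problem $J = TJ$ started at $\tilde J$; with $\ell$-step lookahead the starting point is effectively shifted to $T^{\ell-1}\tilde J$.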

In this work we aim to provide analysis and insights (often based on visualization), which explain the beneficial effects of on-line decision making on top of off-line training. In particular, through a unifying abstract dynamic programming framework, we show that the principal AlphaZero/TD-Gammon ideas of approximation in value space and rollout apply very broadly to deterministic and stochastic optimal control problems, involving both discrete and continuous search spaces. Moreover, these ideas can be effectively integrated with other important methodologies such as model predictive control, adaptive control, decentralized control, discrete and Bayesian optimization, neural network-based value and policy approximations, and heuristic algorithms for discrete optimization.



Abstract Dynamic Programming, 2nd Edition, 2018

by Dimitri P. Bertsekas



The print version of the 2nd edition of the book is available from the publishing company Athena Scientific, or from Amazon.com. It is also available as an e-book from Google Books.

The 2nd edition aims primarily to amplify the presentation of the semicontractive models of Chapter 3 and Chapter 4 of the first (2013) edition, and to supplement it with a broad spectrum of research results that I obtained and published in journals and reports since the first edition was written (see below). As a result, the size of this material more than doubled, and the size of the book increased by nearly 40%.

The book can be downloaded and used freely for noncommercial purposes. The version below corrects a few errata from the book's first printing, and is identical to the book's second printing (to appear in 2021).


Abstract Dynamic Programming, 2nd Edition, Complete

In addition to the changes in Chapters 3 and 4, I have also eliminated from the second edition the material of the first edition that deals with restricted policies and Borel space models (Chapter 5 and Appendix C). These models are motivated in part by the complex measurability questions that arise in mathematically rigorous theories of stochastic optimal control involving continuous probability spaces. The restricted policies framework aims primarily to extend abstract DP ideas to Borel space models. Since this material is fully covered in Chapter 6 of the 1978 monograph by Bertsekas and Shreve, and follow-up research on the subject has been limited, I decided to omit Chapter 5 and Appendix C of the first edition from the second edition and just post them below.

Chapter 5 of 1st Edition

Appendix C of 1st Edition


Supporting series of five video lectures on Semicontractive Dynamic Programming


The following papers and reports have a strong connection to the book and, among other things, amplify the analysis and the range of applications of the semicontractive models of Chapters 3 and 4:

  • D. P. Bertsekas, "Regular Policies in Abstract Dynamic Programming", Lab. for Information and Decision Systems Report LIDS-P-3173, MIT, May 2015 (revised August, 2016); to appear in SIAM J. on Control and Optimization (Related Lecture Slides); (Related Video Lectures).

  • D. P. Bertsekas, "Value and Policy Iteration in Deterministic Optimal Control and Adaptive Dynamic Programming", Lab. for Information and Decision Systems Report LIDS-P-3174, MIT, May 2015 (revised Sept. 2015); IEEE Transactions on Neural Networks and Learning Systems, Vol. 28, 2017, pp. 500-509.

  • D. P. Bertsekas and H. Yu, "Stochastic Shortest Path Problems Under Weak Conditions", Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, January 2016.

  • D. P. Bertsekas, "Robust Shortest Path Planning and Semicontractive Dynamic Programming", Lab. for Information and Decision Systems Report LIDS-P-2915, MIT, Jan. 2016; revised in June 2016; to appear in Naval Research Logistics.

  • D. P. Bertsekas, "Affine Monotonic and Risk-Sensitive Models in Dynamic Programming", Lab. for Information and Decision Systems Report LIDS-3204, MIT, June 2016; to appear in IEEE Transactions on Aut. Control.

  • D. P. Bertsekas, "Stable Optimal Control and Semicontractive Dynamic Programming", Lab. for Information and Decision Systems Report LIDS-P-3506, MIT, May 2017; to appear in SIAM J. on Control and Optimization (Related Lecture Slides).

  • D. P. Bertsekas, "Proper Policies in Infinite-State Stochastic Shortest Path Problems," arXiv preprint arXiv:1711.10129. (Related Lecture Slides).

  • Bertsekas, D., "Multiagent Value Iteration Algorithms in Dynamic Programming and Reinforcement Learning," ASU Report, April 2020, arXiv preprint, arXiv:2005.01627.

  • An updated version of Chapter 4 of the author's Dynamic Programming book, Vol. II, which incorporates recent research on a variety of undiscounted problems and relates to abstract DP topics; (Related Lecture Slides).