LESSONS FROM ALPHAZERO FOR OPTIMAL, MODEL PREDICTIVE, AND ADAPTIVE CONTROL (2022)


The print copy of the book is available from the publishing company, Athena Scientific, and from Amazon.

Click here for the EBOOK version from Google Play.

Click here for a .pdf copy of the complete book.

Lessons from AlphaZero Videolecture, MIT: A summary one-hour presentation at MIT, Oct. 2022. It includes a discussion of the solution of Wordle, the NY Times puzzle, using reinforcement learning methods.

Lessons from AlphaZero Videolecture, KTH: A fairly detailed two-hour presentation at KTH, Nov. 2021.

Model Predictive Control and Reinforcement Learning: A Unified Framework Based on Dynamic Programming, Plenary talk at IFAC NMPC, August 2024.

Slides from the MIT Lecture

Slides from the KTH Lecture

Slides from the IFAC NMPC Lecture

A brief description: Some of the most exciting success stories in artificial intelligence and reinforcement learning have been in the area of games. Primary examples are the recent AlphaZero program (which plays chess) and the similarly structured, earlier (1990s) TD-Gammon program (which plays backgammon). These programs were trained off-line extensively, using sophisticated self-play/approximate policy iteration algorithms and neural networks. Yet the AlphaZero player that has been obtained off-line is not used directly during on-line play (it is too inaccurate due to approximation errors that are inherent in off-line neural network training). Instead, a separate on-line player is used, which is based on multistep lookahead and a terminal position evaluator that was trained off-line. The on-line player performs a form of policy improvement, which, unlike the off-line player, is not degraded by neural network approximations. As a result, it has greatly improved performance.

Similarly, TD-Gammon performs on-line a policy improvement step using lookahead minimization that is not degraded by neural network approximations. To this end, it uses a terminal position evaluator that was trained off-line with a neural network, and, importantly, it also extends its on-line lookahead by rollout (simulation with the one-step lookahead player that is based on the position evaluator).
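As a heavily simplified illustration of this scheme, here is a minimal sketch of one-step lookahead with truncated rollout, written in the cost-minimization convention of the author's books. It is not code from the book; the functions legal_moves, next_position, stage_cost, and evaluator are hypothetical placeholders for a problem-specific simulator and an off-line trained terminal position evaluator, and the discount factor and game-specific details are omitted.

    # A minimal sketch (not from the book) of TD-Gammon-style on-line play:
    # one-step lookahead over the legal moves, with each successor position
    # scored by truncated rollout with the base one-step lookahead player,
    # followed by the off-line trained terminal cost approximation.
    # legal_moves, next_position, stage_cost, and evaluator are hypothetical
    # placeholders for a problem-specific simulator and an off-line trained
    # neural network.

    def rollout_score(position, evaluator, next_position, legal_moves,
                      stage_cost, horizon=10):
        """Simulate the base (one-step lookahead) player for up to `horizon`
        steps, then apply the off-line trained terminal cost approximation."""
        total = 0.0
        for _ in range(horizon):
            moves = legal_moves(position)
            if not moves:
                break
            # The base player's greedy move: minimize stage cost plus the
            # evaluator applied to the resulting position.
            move = min(moves, key=lambda u: stage_cost(position, u)
                       + evaluator(next_position(position, u)))
            total += stage_cost(position, move)
            position = next_position(position, move)
        return total + evaluator(position)

    def online_move(position, evaluator, next_position, legal_moves, stage_cost):
        """On-line play: one-step lookahead with rollout, i.e., the policy
        improvement step performed at every encountered position."""
        return min(legal_moves(position),
                   key=lambda u: stage_cost(position, u)
                   + rollout_score(next_position(position, u), evaluator,
                                   next_position, legal_moves, stage_cost))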

An important lesson from AlphaZero and TD-Gammon is that performance of an off-line trained controller can be greatly improved by on-line play, with long lookahead (involving minimization or rollout with an off-line obtained policy, or both), and terminal cost approximation that is obtained off-line. This performance enhancement is often dramatic and is due to a simple fact, which is the central point of our work: on-line play amounts to a step of Newton's method for solving Bellman's equation, while the starting point for the Newton step is based on the results of off-line training and may be enhanced by longer lookahead and on-line rollout. This process can be understood in terms of abstract models of dynamic programming and simple geometrical constructions. It manifests itself to some extent in model predictive control, but it seems that it has yet to be fully appreciated within the decision and control community.
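In more precise terms (a sketch for a deterministic discounted problem with dynamics x_{k+1} = f(x_k, u_k), cost per stage g, and discount factor \(\alpha\), in the standard notation of the author's books):

\[
(T_\mu J)(x) = g\big(x,\mu(x)\big) + \alpha\, J\big(f(x,\mu(x))\big),
\qquad
(T J)(x) = \min_{u \in U(x)} \Big[\, g(x,u) + \alpha\, J\big(f(x,u)\big) \Big],
\]

and Bellman's equation is the fixed point equation \(J^* = T J^*\). If \(\tilde J\) is the cost approximation produced by off-line training (possibly enhanced by multistep lookahead and rollout), the on-line one-step lookahead policy \(\tilde\mu\) satisfies

\[
T_{\tilde\mu} \tilde J = T \tilde J,
\]

so \(T_{\tilde\mu}\) is the linearization of \(T\) at \(\tilde J\) (note that \(T\) is the pointwise minimum of the operators \(T_\mu\), each of which is affine in \(J\)). The cost function \(J_{\tilde\mu}\) of the on-line player is the fixed point of \(T_{\tilde\mu}\), i.e., the result of a single Newton step for solving \(J = T J\), starting from \(\tilde J\).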

In this work we aim to provide insights (often based on visualization) that explain the beneficial effects of on-line decision making on top of off-line training. While we deemphasize mathematical proofs, there is considerable related analysis that supports our conclusions; it can be found in the author's recent RL books. One of our principal aims is to show, through the unifying principles of abstract DP, that the AlphaZero/TD-Gammon ideas of approximation in value space and rollout apply very broadly to deterministic and stochastic optimal control problems, involving both discrete and continuous search spaces. Moreover, these ideas can be effectively integrated with other important methodologies such as model predictive control, adaptive control, decentralized control, discrete and Bayesian optimization, neural network-based value and policy approximations, and heuristic algorithms for discrete optimization.


A short early version (August 2021) appeared on-line as the arXiv preprint arXiv:2108.10315.

A survey paper summarizing the book: D. P. Bertsekas, "Newton's Method for Reinforcement Learning and Model Predictive Control," Results in Control and Optimization, Vol. 7, 2022, Article 100121.

The book relies on the abstract DP framework developed in the author's research monograph "Abstract Dynamic Programming," 2022 (3rd Edition).



Abstract Dynamic Programming, 3rd Edition, 2022

by Dimitri P. Bertsekas



Click here for a .pdf copy of the 3rd Edition of the book.

Click here for the EBOOK version from Google Play.

The print version of the 3rd edition of the book is available from the publishing company, Athena Scientific, and from Amazon.

The 3rd edition extends the 2nd edition (2018) with a new chapter on sequential zero-sum games. It also provides a connecting link with the "Lessons from AlphaZero ..." monograph.


A brief description: This research monograph provides a synthesis of old research on the foundations of dynamic programming (DP), with the modern theory of approximate DP and new research on semicontractive models. It aims at a unified and economical development of the core theory and algorithms of total cost sequential decision problems, based on the strong connections of the subject with fixed point theory. The analysis focuses on the abstract mapping that underlies DP and defines the mathematical character of the associated problem. The discussion centers on two fundamental properties that this mapping may have: monotonicity and (weighted sup-norm) contraction. It turns out that the nature of the analytical and algorithmic DP theory is determined primarily by the presence or absence of these two properties, and the rest of the problem's structure is largely inconsequential. New research is focused on two areas: 1) The ramifications of these properties in the context of algorithms for approximate DP, and 2) The new class of semicontractive models, exemplified by stochastic shortest path problems, where some but not all policies are contractive.
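To fix notation (a brief sketch in the book's standard abstract framework, not a substitute for the precise assumptions given there): the development revolves around a mapping H and the associated operators

\[
(T_\mu J)(x) = H\big(x,\mu(x),J\big),
\qquad
(T J)(x) = \inf_{u \in U(x)} H(x,u,J),
\]

with Bellman's equation taking the abstract fixed point form \(J = T J\). Monotonicity means that \(J \le J'\) implies \(H(x,u,J) \le H(x,u,J')\) for all \(x\) and \(u\), while the contraction property means that, for some \(\alpha \in (0,1)\) and a weighted sup-norm \(\|J\| = \sup_x |J(x)|/v(x)\),

\[
\|T_\mu J - T_\mu J'\| \le \alpha\, \|J - J'\| \qquad \text{for all } J,\ J', \text{ and policies } \mu.
\]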

The 3rd edition is similar to the 2nd edition, except for two additions:

  • A new Section 1.3 on abstract visualizations of Bellman's equation and Newton's method. This is the major underlying theme of the "Lessons from AlphaZero ..." book, and allows a fundamental interpretation of the approximation in value space methodology in reinforcement learning in terms of Newton and Newton/SOR methods.
  • A new 40-page Chapter 5, which introduces a contractive abstract DP framework and related policy iteration algorithms, specifically designed for sequential zero-sum games and minimax problems with a general structure, and based on a recent paper by the author. Our algorithms, aside from greater generality, resolve some long-standing convergence difficulties of the natural PI algorithm, which have been known since the Pollatschek and Avi-Itzhak method for finite-state Markov games. Mathematically, this natural algorithm is a form of Newton's method for solving Bellman's equation, but, contrary to the case of single-player DP problems, Newton's method is not globally convergent in the case of a minimax problem, because of an additional difficulty: the Bellman operator may have components that are neither convex nor concave (see the sketch after this list).
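To illustrate this last point (a brief sketch of our own, under the simplifying assumption that H is affine in J once the decisions of both players are fixed, as in finite-state Markov games): in the single-player case, the Bellman operator

\[
(T J)(x) = \min_{u \in U(x)} H(x,u,J)
\]

is a pointwise minimum of functions that are affine in \(J\), hence concave in \(J\), and this concavity underlies the favorable global behavior of the Newton step interpretation. In the minimax case,

\[
(T J)(x) = \min_{u \in U(x)} \max_{v \in V(x)} H(x,u,v,J),
\]

the inner maximum is convex in \(J\) (a pointwise maximum of affine functions), and a pointwise minimum of convex functions need be neither convex nor concave; this is the source of the convergence difficulties noted above.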

    Supporting series of five videolectures on Semicontractive Dynamic Programming


    The following papers and reports have a strong connection to the book and, among other things, amplify the analysis and the range of applications of the semicontractive models of Chapters 3 and 4:

  • D. P. Bertsekas, "Regular Policies in Abstract Dynamic Programming", Lab. for Information and Decision Systems Report LIDS-P-3173, MIT, May 2015 (revised August, 2016); SIAM J. on Optimization, 2017, Vol. 27, pp. 1694-1727, (Related Lecture Slides); (Related Video Lectures).

  • D. P. Bertsekas, "Value and Policy Iteration in Deterministic Optimal Control and Adaptive Dynamic Programming", Lab. for Information and Decision Systems Report LIDS-P-3174, MIT, May 2015 (revised Sept. 2015); IEEE Transactions on Neural Networks and Learning Systems, Vol. 28, 2017, pp. 500-509.

  • D. P. Bertsekas and H. Yu, "Stochastic Shortest Path Problems Under Weak Conditions", Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, January 2016.

  • D. P. Bertsekas, "Robust Shortest Path Planning and Semicontractive Dynamic Programming", Lab. for Information and Decision Systems Report LIDS-P-2915, MIT, Jan. 2016; revised in June 2016; Naval Research Logistics (NRL), 2019, Vol. 66, pp.15-37.

  • D. P. Bertsekas, "Affine Monotonic and Risk-Sensitive Models in Dynamic Programming", Lab. for Information and Decision Systems Report LIDS-3204, MIT, June 2016; IEEE Transactions on Aut. Control, Vol. 64, 2019, pp.\ 3117-3128.

  • D. P. Bertsekas, "Stable Optimal Control and Semicontractive Dynamic Programming", Lab. for Information and Decision Systems Report LIDS-P-3506, MIT, May 2017; SIAM J. on Control and Optimization, Vol. 56, 2018, pp. 231-252 (Related Lecture Slides).

  • D. P. Bertsekas, "Proper Policies in Infinite-State Stochastic Shortest Path Problems," arXiv preprint arXiv:1711.10129. (Related Lecture Slides).

  • D. P. Bertsekas, "Multiagent Value Iteration Algorithms in Dynamic Programming and Reinforcement Learning," ASU Report, April 2020, arXiv preprint, arXiv:2005.01627; Results in Control and Optimization J., Vol. 1, 2020.

  • D. P. Bertsekas, "Multiagent Reinforcement Learning: Rollout and Policy Iteration," IEEE/CAA Journal of Automatica Sinica, Vol. 8, 2021, pp. 249-271; Video of an overview lecture.

  • D. P. Bertsekas, "Distributed Asynchronous Policy Iteration for Sequential Zero-Sum Games and Minimax Control," arXiv preprint arXiv:2107.10406, July 2021.

  • An updated version of Chapter 4 of the author's Dynamic Programming book, Vol. II, which incorporates recent research on a variety of undiscounted problems and relates to abstract DP topics; (Related Lecture Slides).



    Abstract Dynamic Programming, 2nd Edition, 2018

    by Dimitri P. Bertsekas



    The print version of the 2nd edition of the book is available from Amazon.com. It is also available as an Ebook from Google Books.

    The book can be downloaded and used freely for noncommercial purposes: Abstract Dynamic Programming, 2nd Edition.

    The 2nd edition aims primarily to amplify the presentation of the semicontractive models of Chapter 3 and Chapter 4 of the first (2013) edition, and to supplement it with a broad spectrum of research results that I obtained and published in journals and reports since the first edition was written (see below). As a result, the size of this material more than doubled, and the size of the book increased by nearly 40%.


    In addition to the changes in Chapters 3 and 4, I have also eliminated from the second edition the material of the first edition that deals with restricted policies and Borel space models (Chapter 5 and Appendix C). These models are motivated in part by the complex measurability questions that arise in mathematically rigorous theories of stochastic optimal control involving continuous probability spaces. The restricted policies framework aims primarily to extend abstract DP ideas to Borel space models. Since this material is fully covered in Chapter 6 of the 1978 monograph by Bertsekas and Shreve, and followup research on the subject has been limited, I decided to omit Chapter 5 and Appendix C of the first edition from the second edition and just post them below.

    Chapter 5 of 1st Edition

    Appendix C of 1st Edition