**"Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control,"** draft of the forthcoming (2022) research monograph

This draft will be periodically updated until shortly before its formal publication. Check this site for the latest version.

A short early version (August 2021) went on line as arXiv preprint arXiv:2108.10315.

A journal paper will be publshed shortly: D. P. Bertsekas, "Newton's Method for Reinforcement Learning and Model Predictive Control," ASU Report, January 2022, to appear in Results in Control and Optimization J.

** Lessons from AlphaZero Videolecture**: A fairly detailed presentation at KTH, Nov. 2021

**A brief description:** Some of the most exciting success stories in artificial intelligence and reinforcement learning have been in the area of games. Primary examples are the recent AlphaZero program (which plays chess), and the similarly structured and earlier (1990s) TD-Gammon program (which plays backgammon).
These programs were trained off-line extensively using sophisticated self-play/approximate policy iteration algorithms and neural networks. Yet the AlphaZero player that has been obtained off-line is not used directly during on-line play (it is too inaccurate due to approximation errors that are inherent in off-line neural network training). Instead a separate on-line player is used, which is based on multistep lookahead and a terminal position evaluator that was trained off-line. The on-line player performs a form of policy improvement, which unlike the off-line player, is not degraded by neural network approximations. As a result, it has greatly improved performance.

Similarly, TD-Gammon performs on-line a policy improvement step using lookahead minimization that is not degraded by neural network approximations. To this end it uses an off-line neural-network trained terminal position evaluator, and importantly it also extends its on-line lookahead by rollout (simulation with the one-step lookahead player that is based on the position evaluator).

An important lesson from AlphaZero and TD-Gammon is that performance of an off-line trained controller can be greatly improved by on-line play, with long lookahead (involving minimization or rollout with an off-line obtained policy, or both), and terminal cost approximation that is obtained off-line. This performance enhancement is often dramatic and is due to a simple fact, which is the central point of our work: on-line play amounts to a step of Newton's method for solving Bellman's equation, while the starting point for the Newton step is based on the results of off-line training and may be enhanced by longer lookahead and on-line rollout. This process can be understood in terms of abstract models of dynamic programming and simple geometrical constructions. It manifests itself to some extent in model predictive control, but it seems that it has yet to be fully appreciated within the decision and control community.

In this work we aim to provide insights (often based on visualization), which explain the beneficial effects of on-line decision making on top of off-line training. While we will deemphasize mathematical proofs, there is considerable related analysis, which supports our conclusions and can be found in the author's recent RL books [Ber19a], [Ber20a]. One of our principal aims is to show through the unifying principles of abstract DP that the AlphaZero/TD-Gammon ideas of approximation in value space and rollout apply very broadly to deterministic and stochastic optimal control problems, involving both discrete and continuous search spaces. Moreover, these ideas can be effectively integrated with other important methodologies such as model predictive control, adaptive control, decentralized control, discrete and Bayesian optimization, neural network-based value and policy approximations, and heuristic algorithms for discrete optimization.

The 3rd edition of the book is available as an Ebook from Google Books. The print version of the 3rd edition of the book will soon be available from the publishing company, Athena Scientific, and Amazon.com.

This research monograph provides a synthesis of old research on the foundations of dynamic programming (DP), with the modern theory of approximate DP and new research on semicontractive models. It aims at a unified and economical development of the core theory and algorithms of total cost sequential decision problems, based on the strong connections of the subject with fixed point theory. The analysis focuses on the abstract mapping that underlies DP and defines the mathematical character of the associated problem. The discussion centers on two fundamental properties that this mapping may have: monotonicity and (weighted sup-norm) contraction. It turns out that the nature of the analytical and algorithmic DP theory is determined primarily by the presence or absence of these two properties, and the rest of the problem's structure is largely inconsequential. New research is focused on two areas: 1) The ramifications of these properties in the context of algorithms for approximate DP, and 2) The new class of semicontractive models, exemplified by stochastic shortest path problems, where some but not all policies are contractive.

The 3rd edition is very similar to the 2nd edition, except for the addition of a new 40-page Chapter 5, which introduces a contractive abstract DP framework and related policy iteration algorithms, specifically designed for sequential zero-sum games and minimax problems with a general structure, and based on a recent paper by the author. Aside from greater generality, the advantage of our algorithms over alternatives is that they resolve some long-standing convergence difficulties of the natural PI algorithm, which have been known since the Pollatschek and Avi-Itzhak method for finite-state Markov games. Mathematically, this natural algorithm is a form of Newton's method for solving Bellman's equation, but Newton's method, contrary to the case of single-player DP problems, is not globally convergent in the case of a minimax problem, because of an additional difficulty: the Bellman operator may have components that are neither convex nor concave. The algorithms address this difficulty by introducing alternating player choices, and by using a policy-dependent mapping with a uniform sup-norm contraction property, similar to earlier works by Bertsekas and Yu, which is described in part in Chapter 2. Moreover, our algorithms allow a convergent and highly parallelizable implementation, which is based on state space partitioning, and distributed asynchronous policy evaluation and policy improvement operations within each set of the partition. They are also suitable for approximations based on an aggregation approach.

The book can be downloaded and used freely for noncommercial purposes: Abstract Dynamic Programming, 3RD EDITION, Complete

The print version of the 2nd edition of the book is available from the publishing company Athena Scientific, or from Amazon.com. It is also available as an Ebook from Google Books.

The 2nd edition aims primarily to amplify the presentation of the semicontractive models of Chapter 3 and Chapter 4 of the first (2013) edition, and to supplement it with a broad spectrum of research results that I obtained and published in journals and reports since the first edition was written (see below). As a result, the size of this material more than doubled, and the size of the book increased by nearly 40%.

The book can be downloaded and used freely for noncommercial purposes. The version below corrects a few errata from the book's first printing, and is identical to the book's second printing (to appear in 2021).

Abstract Dynamic Programming, 2ND EDITION, Complete

In addition to the changes in Chapters 3, and 4, I have also eliminated from the second edition the material of the first edition that deals with restricted policies and Borel space models (Chapter 5 and Appendix C). These models are motivated in part by the complex measurability questions that arise in mathematically rigorous theories of stochastic optimal control involving continuous probability spaces. The restricted policies framework aims primarily to extend abstract DP ideas to Borel space models. Since this material is fully covered in Chapter 6 of the 1978 monograph by Bertsekas and Shreve, and followup research on the subject has been limited, I decided to omit Chapter 5 and Appendix C of the first edition from the second edition and just post them below.

**Supporting series of five videolectures on Semicontractive Dynamic Programming**

**The following papers and reports have a strong connection to the book, and among others amplify on the analysis and the range of applications of the semicontractive models of Chapters 3 and 4:**