"Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control," draft of a forthcoming (2022) research monograph; an August 2021 version went on line as arXiv preprint arXiv:2108.10315.
This draft will be periodically updated until shortly before its formal publication. Check this site for the latest version.
A brief description: Some of the most exciting success stories in artificial intelligence and reinforcement learning have been in the area of games. Primary examples are the recent AlphaZero program (which plays chess), and the similarly structured and earlier (1990s) TD-Gammon program (which plays backgammon). These programs were trained off-line extensively using sophisticated self-play/approximate policy iteration algorithms and neural networks. Yet the AlphaZero player that has been obtained off-line is not used directly during on-line play (it is too inaccurate due to approximation errors that are inherent in off-line neural network training). Instead a separate on-line player is used, which is based on multistep lookahead and a terminal position evaluator that was trained off-line. The on-line player performs a form of policy improvement, which unlike the off-line player, is not degraded by neural network approximations. As a result, it has greatly improved performance.
Similarly, TD-Gammon performs on-line a policy improvement step using lookahead minimization that is not degraded by neural network approximations. To this end it uses an off-line neural-network trained terminal position evaluator, and importantly it also extends its on-line lookahead by rollout (simulation with the one-step lookahead player that is based on the position evaluator).
An important lesson from AlphaZero and TD-Gammon is that performance of an off-line trained controller can be greatly improved by on-line play, with long lookahead (involving minimization or rollout with an off-line obtained policy, or both), and terminal cost approximation that is obtained off-line. This performance enhancement is often dramatic and is due to a simple fact, which is the central point of our work: on-line play amounts to a step of Newton's method for solving Bellman's equation, while the starting point for the Newton step is based on the results of off-line training and may be enhanced by longer lookahead and on-line rollout. This process can be understood in terms of abstract models of dynamic programming and simple geometrical constructions. It manifests itself to some extent in model predictive control, but it seems that it has yet to be fully appreciated within the decision and control community.
In this work we aim to provide analysis and insights (often based on visualization), which explain the beneficial effects of on-line decision making on top of off-line training. In particular, through a unifying abstract dynamic programming framework, we show that the principal AlphaZero/TD-Gammon ideas of approximation in value space and rollout apply very broadly to deterministic and stochastic optimal control problems, involving both discrete and continuous search spaces. Moreover, these ideas can be effectively integrated with other important methodologies such as model predictive control, adaptive control, decentralized control, discrete and Bayesian optimization, neural network-based value and policy approximations, and heuristic algorithms for discrete optimization.
The print version of the 2nd edition of the book is available from the publishing company Athena Scientific, or from Amazon.com. It is also available as an Ebook from Google Books.
The 2nd edition aims primarily to amplify the presentation of the semicontractive models of Chapter 3 and Chapter 4 of the first (2013) edition, and to supplement it with a broad spectrum of research results that I obtained and published in journals and reports since the first edition was written (see below). As a result, the size of this material more than doubled, and the size of the book increased by nearly 40%.
The book can be downloaded and used freely for noncommercial purposes. The version below corrects a few errata from the book's first printing, and is identical to the book's second printing (to appear in 2021).
Abstract Dynamic Programming, 2ND EDITION, Complete
In addition to the changes in Chapters 3, and 4, I have also eliminated from the second edition the material of the first edition that deals with restricted policies and Borel space models (Chapter 5 and Appendix C). These models are motivated in part by the complex measurability questions that arise in mathematically rigorous theories of stochastic optimal control involving continuous probability spaces. The restricted policies framework aims primarily to extend abstract DP ideas to Borel space models. Since this material is fully covered in Chapter 6 of the 1978 monograph by Bertsekas and Shreve, and followup research on the subject has been limited, I decided to omit Chapter 5 and Appendix C of the first edition from the second edition and just post them below.
Chapter 5 of 1st Edition
Appendix C of 1st Edition
Supporting series of five videolectures on Semicontractive Dynamic Programming
The following papers and reports have a strong connection to the book, and among others amplify on the analysis and the range of applications of the semicontractive models of Chapters 3 and 4: