The subject matter is sequential decision making under uncertainty (stochastic control). We have a dynamical system whose evolution is influenced (controlled) by our decisions or control actions. The decision made at any given time can, in general, depend on the state of the system. The objective is to select a decision making rule (feedback policy) that optimizes a certain performance criterion. Such problems can be solved, in principle, using the classical methods of dynamic programming. In practice, however, the applicability of dynamic programming to many important problems is limited by the enormous size of the underlying state spaces (Bellman's "curse of dimensionality"). Neuro-dynamic programming (or "Reinforcement Learning", which is the term used in the Artificial Intelligence literature) uses neural network and other approximation architectures to overcome such bottlenecks to the applicability of dynamic programming. The methodology allows systems to learn about their behavior through simulation, and to improve their performance through iterative reinforcement.
In one approach (value function approximation), simulation is used to tune the parameters of a "value function" that quantifies the relative desirability of different states in the state space. In mathematical terms, the objective is to compute an approximate solution to Bellman's equation, which is then used to construct near-optimal policies. This approach is studied in the 1996 book Neuro-Dynamic Programming, which includes many results that are not available elsewhere. Another approach (optimization in policy space) involves tuning the policy parameters in a direction of performance improvement.
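To make Bellman's equation concrete, here is a small-scale sketch in which the equation can be solved exactly by value iteration, with no approximation needed; the MDP (transition probabilities, costs, discount factor) is invented for illustration:

```python
import numpy as np

# Tiny made-up MDP: 2 states, 2 actions (all numbers are illustrative).
P = np.array([                  # P[a, s, s'] = transition probability
    [[0.9, 0.1], [0.2, 0.8]],   # under action 0
    [[0.5, 0.5], [0.7, 0.3]],   # under action 1
])
g = np.array([[1.0, 2.0],       # g[a, s] = expected one-stage cost
              [0.5, 3.0]])
alpha = 0.9                     # discount factor

# Value iteration: repeatedly apply the Bellman operator
#   (TV)(s) = min_a [ g(a, s) + alpha * sum_s' P(a, s, s') V(s') ].
V = np.zeros(2)
for _ in range(500):
    V = (g + alpha * P @ V).min(axis=0)

# A near-optimal policy is obtained by acting greedily with respect to V.
policy = (g + alpha * P @ V).argmin(axis=0)
```

With large state spaces the table V is replaced by a parametric approximation tuned by simulation, which is the subject of the work described here.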
Some of this research is theoretical in nature, aiming at understanding the convergence and degree of suboptimality of different algorithms, while some involves the application of this methodology to specific problem domains.
Policy space and actor-critic algorithms
Instead of tuning the parameters of a value function, can we directly tune the parameters of a policy, assuming a parametrically described class of policies? A class of methods that can be interpreted in terms of estimated Q-factors is studied in:
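A minimal sketch of the policy-space idea, assuming a softmax policy over two actions and an invented reward structure (a REINFORCE-style stochastic gradient step; the step size and horizon are arbitrary):

```python
import numpy as np

# Toy problem: two actions with fixed (made-up) rewards; the policy is a
# softmax over two tunable parameters, and we ascend the expected reward.
rng = np.random.default_rng(0)
theta = np.zeros(2)             # policy parameters, one per action
rewards = np.array([1.0, 0.0])  # deterministic reward of each action
step = 0.1

for _ in range(2000):
    p = np.exp(theta) / np.exp(theta).sum()   # softmax policy
    a = rng.choice(2, p=p)                    # sample an action
    grad_log = -p                             # gradient of log pi(a; theta)...
    grad_log[a] += 1.0                        # ...for the sampled action a
    theta += step * rewards[a] * grad_log     # stochastic gradient ascent

p = np.exp(theta) / np.exp(theta).sum()       # learned action probabilities
```

After training, the policy concentrates nearly all probability on the higher-reward action.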
Average cost temporal difference learning
Temporal difference methods can be applied to average cost problems. The convergence and approximation error guarantees are essentially the same as for discounted problems. Thus, there is no need to use discounted formulations as a proxy for undiscounted ones:
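A sketch of tabular TD(0) for the average cost criterion, on a made-up two-state chain under a fixed policy; the method tracks an estimate eta of the average cost together with a differential value function h:

```python
import numpy as np

# Made-up 2-state Markov chain under a fixed policy (numbers illustrative).
rng = np.random.default_rng(1)
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])      # transition matrix under the policy
g = np.array([1.0, 3.0])       # one-stage costs
h = np.zeros(2)                # differential value estimates (lookup table)
eta = 0.0                      # running estimate of the average cost
s = 0
for _ in range(20000):
    s_next = rng.choice(2, p=P[s])
    d = g[s] - eta + h[s_next] - h[s]   # average-cost temporal difference
    h[s] += 0.01 * d                    # update the visited state's value
    eta += 0.01 * (g[s] - eta)          # track the long-run average cost
    s = s_next
```

For this chain the average cost is 2 and the differential values satisfy h(1) - h(0) = 2, which the iterates approach (up to stochastic fluctuation from the constant step size).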
The properties of average and discounted criterion temporal difference methods are compared in more detail in:
Convergence of methods based on value function learning
The convergence of a method that uses a lookup table representation of the value function, simulation under a greedy policy, and plain "Monte Carlo" (averaging) for learning the value function is established in:
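The plain Monte-Carlo component can be sketched as follows, on a made-up episodic chain with a fixed policy (state 2 is terminal, unit cost per step); each table entry V(s) is simply the average of the observed costs-to-go from s:

```python
import numpy as np

# Made-up episodic chain: from state 0 move to state 1 or terminate (state 2)
# with equal probability; from state 1 always terminate. Unit cost per step.
rng = np.random.default_rng(2)
returns_sum = np.zeros(2)
visits = np.zeros(2)
for _ in range(5000):
    s, traj, total = 0, [], 0.0
    while s != 2:
        traj.append((s, total))                        # cost accrued before s
        total += 1.0                                   # one-stage cost
        s = int(rng.choice([s + 1, 2], p=[0.5, 0.5]))  # advance or terminate
    for state, before in traj:
        returns_sum[state] += total - before           # observed cost-to-go
        visits[state] += 1
V = returns_sum / visits                               # Monte-Carlo averages
```

Here the exact values are V(0) = 1.5 and V(1) = 1, and the averages converge to them.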
Temporal difference methods (for the single policy case, and with linearly parametrized function approximators) are guaranteed to converge. The approximation error obtained in the limit is not too far from the best possible approximation error under the particular approximation architecture:
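A sketch of TD(0) with a linearly parametrized approximator V(s) ~ phi(s)'w, for a single fixed policy on a made-up three-state chain; the features here happen to make the true value function exactly representable, so the limit is exact rather than merely near-best:

```python
import numpy as np

# Made-up 3-state chain under a fixed policy (all numbers illustrative).
rng = np.random.default_rng(3)
phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0]])          # phi[s] = feature vector of state s
P = np.array([[0.1, 0.8, 0.1],
              [0.3, 0.4, 0.3],
              [0.1, 0.8, 0.1]])       # transition matrix under the policy
g = np.array([0.0, 1.0, 2.0])        # one-stage costs
alpha = 0.9                          # discount factor

w = np.zeros(2)                      # linear parameters: V(s) ~ phi[s] @ w
s = 0
for _ in range(100000):
    s_next = rng.choice(3, p=P[s])
    d = g[s] + alpha * phi[s_next] @ w - phi[s] @ w   # temporal difference
    w += 0.01 * d * phi[s]                            # TD(0) update
    s = s_next
```

For this chain the true value function is (9, 10, 11), which lies in the span of the features, and the TD(0) iterates converge to it.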
Convergence results and approximation error bounds for Q-learning type methods for certain special types of function approximation (e.g., state aggregation):
Q-learning and the TD(0) temporal difference methods (with a lookup table representation) are viewed as stochastic approximation methods for solving Bellman's equation. Their convergence is established by first developing a stochastic approximation theory for the case where the iteration mapping is a contraction with respect to a weighted maximum norm.
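The lookup-table Q-learning update can be sketched as a stochastic approximation scheme for Bellman's equation Q(s,a) = g(s,a) + alpha E[min_a' Q(s',a')], here on a made-up two-state MDP with uniform exploration:

```python
import numpy as np

# Made-up 2-state, 2-action MDP (minimization of discounted cost).
rng = np.random.default_rng(4)
P = np.array([                 # P[s, a, s'] = transition probability
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.6, 0.4]],
])
g = np.array([[1.0, 2.0],      # g[s, a] = expected one-stage cost
              [0.5, 3.0]])
alpha = 0.9

Q = np.zeros((2, 2))
s = 0
for _ in range(200000):
    a = int(rng.integers(2))                      # explore uniformly
    s_next = int(rng.choice(2, p=P[s, a]))
    target = g[s, a] + alpha * Q[s_next].min()    # sampled Bellman backup
    Q[s, a] += 0.01 * (target - Q[s, a])          # stochastic approximation
    s = s_next
```

In the limit (with a suitably diminishing step size) the Q-factors satisfy Bellman's equation; with the constant step size used here they hover near the solution.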
Starting with a good heuristic and carrying out what is essentially a single policy iteration (in the dynamic programming sense) provides a systematic method for improving the performance of a heuristic, and has great promise in practical settings.
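The single policy-iteration idea can be sketched as a rollout scheme: evaluate a heuristic base policy (here exactly, by solving a linear system on a tiny made-up MDP) and then act greedily against its cost-to-go; the policy improvement property guarantees the result is no worse than the heuristic:

```python
import numpy as np

# Tiny made-up MDP (minimization of discounted cost; numbers illustrative).
P = np.array([                  # P[a, s, s'] = transition probability
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.7, 0.3]],
])
g = np.array([[1.0, 2.0],       # g[a, s] = expected one-stage cost
              [0.5, 3.0]])
alpha = 0.9
base = np.array([1, 1])         # heuristic base policy: always pick action 1

# Policy evaluation: solve V = g_base + alpha * P_base V for the heuristic.
states = np.arange(2)
P_base = P[base, states]        # row s is P[base[s], s, :]
g_base = g[base, states]
V_base = np.linalg.solve(np.eye(2) - alpha * P_base, g_base)

# One policy-improvement step: act greedily against the heuristic's values.
improved = (g + alpha * P @ V_base).argmin(axis=0)
```

In practice the evaluation step is done by simulation rather than by a linear solve, but the improvement guarantee is the same.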
S. Mannor, D. I. Simester, P. Sun, and J. N. Tsitsiklis, "Bias and Variance Approximation in Value Function Estimates," Management Science, Vol. 53, No. 2, February 2007, pp. 308-322; Appendix.
D. I. Simester, P. Sun, and J. N. Tsitsiklis, "Dynamic Catalog Mailing Policies," Management Science, Vol. 52, No. 5, May 2006, pp. 683-696.
Finance (pricing of complex American options):
J. N. Tsitsiklis and B. Van Roy, "Regression Methods for Pricing Complex American-Style Options," IEEE Trans. on Neural Networks, Vol. 12, No. 4, July 2001, pp. 694-703.
J. N. Tsitsiklis and B. Van Roy, "Optimal Stopping of Markov Processes: Hilbert Space Theory, Approximation Algorithms, and an Application to Pricing Financial Derivatives", IEEE Transactions on Automatic Control; Vol. 44, No. 10, October 1999, pp. 1840-1851.
B. Van Roy, D. P. Bertsekas, Y. Lee, and J. N. Tsitsiklis, "A Neuro-Dynamic Programming Approach to Retailer Inventory Management," November 1996. Short version in Proceedings of the 36th IEEE Conference on Decision and Control, San Diego, California, December 1997, pp. 4052-4057.
Communication networks:
P. Marbach, O. Mihatsch, and J. N. Tsitsiklis, "Call Admission Control and Routing in Integrated Service Networks Using Neuro-Dynamic Programming," IEEE Journal on Selected Areas in Communications, Vol. 18, No. 2, February 2000, pp. 197-208.
Preliminary versions of the previous paper:
P. Marbach and J. N. Tsitsiklis, "A Neuro-Dynamic Programming Approach to Call Admission Control in Integrated Service Networks: The Single Link Case," Technical Report LIDS-P-2402, Laboratory for Information and Decision Systems, M.I.T., November 1997. Short version in Proceedings of the 2003 IEEE Conference on Decision and Control, Maui, Hawaii, December 2003.
P. Marbach, O. Mihatsch, M. Schulte, and J. N. Tsitsiklis, "Reinforcement Learning for Call Admission Control and Routing in Integrated Service Networks," presented at the Neural Information Processing Systems Conference, Denver, Colorado, November 1997.