The subject matter is sequential decision making under uncertainty (stochastic control). We have a dynamical system whose evolution is influenced (controlled) by our decisions or control actions. The decision made at any given time can, in general, depend on the state of the system. The objective is to select a decision making rule (feedback policy) that optimizes a certain performance criterion. Such problems can be solved, in principle, using the classical methods of dynamic programming. In practice, however, the applicability of dynamic programming to many important problems is limited by the enormous size of the underlying state spaces (Bellman's "curse of dimensionality"). Neuro-dynamic programming (or "Reinforcement Learning", which is the term used in the Artificial Intelligence literature) uses neural network and other approximation architectures to overcome such bottlenecks to the applicability of dynamic programming. The methodology allows systems to learn about their behavior through simulation, and to improve their performance through iterative reinforcement.
In one approach (value function approximation), simulation is used to tune the parameters of a "value function" that quantifies the relative desirability of different states in the state space. In mathematical terms, the objective is to compute an approximate solution to Bellman's equation, which is then used to construct near-optimal policies. This approach is studied in the 1996 book Neuro-Dynamic Programming, which includes many results that are not available elsewhere. Another approach (optimization in policy space) involves tuning the policy parameters in a direction of performance improvement.
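To make Bellman's equation concrete, here is a small-scale sketch in which the equation can be solved exactly by value iteration, with no approximation needed; the MDP (transition probabilities, costs, discount factor) is invented for illustration:

```python
import numpy as np

# Tiny made-up MDP: 2 states, 2 actions (all numbers are illustrative).
P = np.array([                  # P[a, s, s'] = transition probability
    [[0.9, 0.1], [0.2, 0.8]],   # under action 0
    [[0.5, 0.5], [0.7, 0.3]],   # under action 1
])
g = np.array([[1.0, 2.0],       # g[a, s] = expected one-stage cost
              [0.5, 3.0]])
alpha = 0.9                     # discount factor

# Value iteration: repeatedly apply the Bellman operator
#   (TV)(s) = min_a [ g(a, s) + alpha * sum_s' P(a, s, s') V(s') ].
V = np.zeros(2)
for _ in range(500):
    V = (g + alpha * P @ V).min(axis=0)

# A near-optimal policy is obtained by acting greedily with respect to V.
policy = (g + alpha * P @ V).argmin(axis=0)
```

With large state spaces the table V is replaced by a parametric approximation tuned by simulation, which is the subject of the work described here.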
Some of this research is theoretical in nature, aiming at understanding the convergence and degree of suboptimality of different algorithms, while some involves the application of this methodology to specific problem domains.
Policy space and actor-critic algorithms
Instead of tuning the parameters of a value function, can we directly tune the parameters of a policy, assuming a parametrically described class of policies? A class of methods that can be interpreted in terms of estimated Q-factors is studied in:
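A minimal sketch of the policy-space idea, assuming a softmax policy over two actions and an invented reward structure (a REINFORCE-style stochastic gradient step; the step size and horizon are arbitrary):

```python
import numpy as np

# Toy problem: two actions with fixed (made-up) rewards; the policy is a
# softmax over two tunable parameters, and we ascend the expected reward.
rng = np.random.default_rng(0)
theta = np.zeros(2)             # policy parameters, one per action
rewards = np.array([1.0, 0.0])  # deterministic reward of each action
step = 0.1

for _ in range(2000):
    p = np.exp(theta) / np.exp(theta).sum()   # softmax policy
    a = rng.choice(2, p=p)                    # sample an action
    grad_log = -p                             # gradient of log pi(a; theta)...
    grad_log[a] += 1.0                        # ...for the sampled action a
    theta += step * rewards[a] * grad_log     # stochastic gradient ascent

p = np.exp(theta) / np.exp(theta).sum()       # learned action probabilities
```

After training, the policy concentrates nearly all probability on the higher-reward action.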
Average cost temporal difference learning
Temporal difference methods can be applied to average cost problems. The convergence and approximation error guarantees are essentially the same as for discounted problems. Thus, there is no need to use discounted formulations as a proxy for undiscounted ones:
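A sketch of tabular TD(0) for the average cost criterion, on a made-up two-state chain under a fixed policy; the method tracks an estimate eta of the average cost together with a differential value function h:

```python
import numpy as np

# Made-up 2-state Markov chain under a fixed policy (numbers illustrative).
rng = np.random.default_rng(1)
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])      # transition matrix under the policy
g = np.array([1.0, 3.0])       # one-stage costs
h = np.zeros(2)                # differential value estimates (lookup table)
eta = 0.0                      # running estimate of the average cost
s = 0
for _ in range(20000):
    s_next = rng.choice(2, p=P[s])
    d = g[s] - eta + h[s_next] - h[s]   # average-cost temporal difference
    h[s] += 0.01 * d                    # update the visited state's value
    eta += 0.01 * (g[s] - eta)          # track the long-run average cost
    s = s_next
```

For this chain the average cost is 2 and the differential values satisfy h(1) - h(0) = 2, which the iterates approach (up to stochastic fluctuation from the constant step size).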
The properties of average and discounted criterion temporal difference methods are compared in more detail in:
Convergence of methods based on value function learning
The convergence of a method that uses a lookup table representation of the value function, simulation under a greedy policy, and plain "Monte Carlo" (averaging) for learning the value function is established in:
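The plain Monte-Carlo component can be sketched as follows, on a made-up episodic chain with a fixed policy (state 2 is terminal, unit cost per step); each table entry V(s) is simply the average of the observed costs-to-go from s:

```python
import numpy as np

# Made-up episodic chain: from state 0 move to state 1 or terminate (state 2)
# with equal probability; from state 1 always terminate. Unit cost per step.
rng = np.random.default_rng(2)
returns_sum = np.zeros(2)
visits = np.zeros(2)
for _ in range(5000):
    s, traj, total = 0, [], 0.0
    while s != 2:
        traj.append((s, total))                        # cost accrued before s
        total += 1.0                                   # one-stage cost
        s = int(rng.choice([s + 1, 2], p=[0.5, 0.5]))  # advance or terminate
    for state, before in traj:
        returns_sum[state] += total - before           # observed cost-to-go
        visits[state] += 1
V = returns_sum / visits                               # Monte-Carlo averages
```

Here the exact values are V(0) = 1.5 and V(1) = 1, and the averages converge to them.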
Temporal difference methods (for the single policy case, and with linearly parametrized function approximators) are guaranteed to converge. The approximation error obtained in the limit is not too far from the best possible approximation error under the particular approximation architecture:
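A sketch of TD(0) with a linearly parametrized approximator V(s) ~ phi(s)'w, for a single fixed policy on a made-up three-state chain; the features here happen to make the true value function exactly representable, so the limit is exact rather than merely near-best:

```python
import numpy as np

# Made-up 3-state chain under a fixed policy (all numbers illustrative).
rng = np.random.default_rng(3)
phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0]])          # phi[s] = feature vector of state s
P = np.array([[0.1, 0.8, 0.1],
              [0.3, 0.4, 0.3],
              [0.1, 0.8, 0.1]])       # transition matrix under the policy
g = np.array([0.0, 1.0, 2.0])        # one-stage costs
alpha = 0.9                          # discount factor

w = np.zeros(2)                      # linear parameters: V(s) ~ phi[s] @ w
s = 0
for _ in range(100000):
    s_next = rng.choice(3, p=P[s])
    d = g[s] + alpha * phi[s_next] @ w - phi[s] @ w   # temporal difference
    w += 0.01 * d * phi[s]                            # TD(0) update
    s = s_next
```

For this chain the true value function is (9, 10, 11), which lies in the span of the features, and the TD(0) iterates converge to it.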
Convergence results and approximation error bounds for Q-learning type methods for certain special types of function approximation (e.g., state aggregation):
Q-learning and the TD(0) temporal difference methods (with a lookup table representation) are viewed as stochastic approximation methods for solving Bellman's equation. Their convergence is established by first developing a stochastic approximation theory for the case where the iteration mapping is a contraction with respect to a weighted maximum norm.
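The lookup-table Q-learning update can be sketched as a stochastic approximation scheme for Bellman's equation Q(s,a) = g(s,a) + alpha E[min_a' Q(s',a')], here on a made-up two-state MDP with uniform exploration:

```python
import numpy as np

# Made-up 2-state, 2-action MDP (minimization of discounted cost).
rng = np.random.default_rng(4)
P = np.array([                 # P[s, a, s'] = transition probability
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.6, 0.4]],
])
g = np.array([[1.0, 2.0],      # g[s, a] = expected one-stage cost
              [0.5, 3.0]])
alpha = 0.9

Q = np.zeros((2, 2))
s = 0
for _ in range(200000):
    a = int(rng.integers(2))                      # explore uniformly
    s_next = int(rng.choice(2, p=P[s, a]))
    target = g[s, a] + alpha * Q[s_next].min()    # sampled Bellman backup
    Q[s, a] += 0.01 * (target - Q[s, a])          # stochastic approximation
    s = s_next
```

In the limit (with a suitably diminishing step size) the Q-factors satisfy Bellman's equation; with the constant step size used here they hover near the solution.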
Starting with a good heuristic and carrying out what is essentially a single policy iteration (in the dynamic programming sense) provides a systematic method for improving the performance of a heuristic, and has great promise in practical settings.
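The single policy-iteration idea can be sketched as a rollout scheme: evaluate a heuristic base policy (here exactly, by solving a linear system on a tiny made-up MDP) and then act greedily against its cost-to-go; the policy improvement property guarantees the result is no worse than the heuristic:

```python
import numpy as np

# Tiny made-up MDP (minimization of discounted cost; numbers illustrative).
P = np.array([                  # P[a, s, s'] = transition probability
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.7, 0.3]],
])
g = np.array([[1.0, 2.0],       # g[a, s] = expected one-stage cost
              [0.5, 3.0]])
alpha = 0.9
base = np.array([1, 1])         # heuristic base policy: always pick action 1

# Policy evaluation: solve V = g_base + alpha * P_base V for the heuristic.
states = np.arange(2)
P_base = P[base, states]        # row s is P[base[s], s, :]
g_base = g[base, states]
V_base = np.linalg.solve(np.eye(2) - alpha * P_base, g_base)

# One policy-improvement step: act greedily against the heuristic's values.
improved = (g + alpha * P @ V_base).argmin(axis=0)
```

In practice the evaluation step is done by simulation rather than by a linear solve, but the improvement guarantee is the same.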
S. Mannor, D. I. Simester, P. Sun, and J. N. Tsitsiklis, "Bias and Variance Approximation in Value Function Estimates," Management Science, Vol. 53, No. 2, February 2007, pp. 308-322; Appendix.
D. I. Simester, P. Sun, and J. N. Tsitsiklis, "Dynamic Catalog Mailing Policies," Management Science, Vol. 52, No. 5, May 2006, pp. 683-696.
Finance (pricing of complex American options):
J. N. Tsitsiklis and B. Van Roy, "Regression Methods for Pricing Complex American-Style Options," IEEE Trans. on Neural Networks, Vol. 12, No. 4, July 2001, pp. 694-703.
J. N. Tsitsiklis and B. Van Roy, "Optimal Stopping of Markov Processes: Hilbert Space Theory, Approximation Algorithms, and an Application to Pricing Financial Derivatives", IEEE Transactions on Automatic Control; Vol. 44, No. 10, October 1999, pp. 1840-1851.
B. Van Roy, D. P. Bertsekas, Y. Lee, and J. N. Tsitsiklis, "A Neuro-Dynamic Programming Approach to Retailer Inventory Management," November 1996. Short version in Proceedings of the 36th IEEE Conference on Decision and Control, San Diego, California, December 1997, pp. 4052-4057.
Communication networks:
P. Marbach, O. Mihatsch, and J. N. Tsitsiklis, "Call Admission Control and Routing in Integrated Service Networks Using Neuro-Dynamic Programming," IEEE Journal on Selected Areas in Communications, Vol. 18, No. 2, February 2000, pp. 197-208.
Preliminary versions of the previous paper:
P. Marbach and J. N. Tsitsiklis, "A Neuro-Dynamic Programming Approach to Call Admission Control in Integrated Service Networks: The Single Link Case," Technical Report LIDS-P-2402, Laboratory for Information and Decision Systems, M.I.T., November 1997. Short version in Proceedings of the 2003 IEEE Conference on Decision and Control, Maui, Hawaii, December 2003.
P. Marbach, O. Mihatsch, M. Schulte, and J. N. Tsitsiklis, "Reinforcement Learning for Call Admission Control and Routing in Integrated Service Networks," presented at the Neural Information Processing Systems Conference, Denver, Colorado, November 1997.