Simulation-Based Optimization of Markov Reward Processes
(Joint work with Peter Marbach)

John Tsitsiklis
Professor of Electrical Engineering and Computer Science
MIT

We discuss simulation-based methods for optimizing the average reward in large-scale Markov reward processes. We compare two main approaches: value function approximation and optimization in policy space. We focus on the latter and introduce a stochastic gradient-like method for optimizing the parameters of a parametrically described policy. The algorithm involves the simulation of a single sample path and can be implemented on-line. A convergence result (with probability 1) is provided.
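
To make the idea of a single-path, on-line, gradient-like update concrete, the following is a minimal sketch in Python. It is not taken from the talk: the two-state process, the logistic policy, the constant step size, the resetting of the trace at a reference state, and all names (simulate_policy_gradient, theta, lam, z) are assumptions made for illustration. Convergence results of the kind mentioned above typically rely on diminishing step sizes, which this sketch omits for brevity.

```python
import math
import random


def sigmoid(x):
    """Logistic function, used here as a two-action parametric policy."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_policy_gradient(num_steps=200_000, step=0.01, seed=0):
    """Tune policy parameters along one simulated sample path.

    Hypothetical toy process (an assumption of this sketch):
      - two states; the reward equals the state index (0 or 1);
      - action 1 moves to state 1 with probability 0.9,
        action 0 moves to state 1 with probability 0.1;
      - the policy picks action 1 in state s with probability
        sigmoid(theta[s]).
    """
    rng = random.Random(seed)
    theta = [0.0, 0.0]   # policy parameters, one per state
    lam = 0.0            # running estimate of the average reward
    z = [0.0, 0.0]       # trace of score terms since the last reset
    state = 0

    for _ in range(num_steps):
        reward = float(state)
        delta = reward - lam

        # Gradient-like step: reward relative to the average-reward
        # estimate, weighted by the accumulated trace.
        for s in range(2):
            theta[s] += step * delta * z[s]
        lam += step * delta

        # Sample an action from the current policy, then a transition.
        p1 = sigmoid(theta[state])
        action = 1 if rng.random() < p1 else 0
        p_to_1 = 0.9 if action == 1 else 0.1
        next_state = 1 if rng.random() < p_to_1 else 0

        if next_state == 0:
            # State 0 serves as the reference state: restart the trace.
            z = [0.0, 0.0]
        else:
            # Add the score d/dtheta[state] of log pi(action | state).
            z[state] += action - p1

        state = next_state

    return theta, lam


if __name__ == "__main__":
    theta, lam = simulate_policy_gradient()
    print("P(action 1 | state):", [round(sigmoid(t), 3) for t in theta])
    print("average reward estimate:", round(lam, 3))
```

In this toy process action 1 is the better choice in both states, so the parameters should drift toward favoring it while the estimate lam tracks the average reward of the current policy. Every update uses only the current transition and the running trace, so no second simulation or stored trajectory is needed.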