An Analysis of Temporal-Difference Learning with Function Approximation

Technical report LIDS-P-2322, March 1996.

John N. Tsitsiklis and Benjamin Van Roy

Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139; e-mail: jnt@mit.edu, bvr@mit.edu

Abstract

We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain, using a function approximator involving linear combinations of fixed basis functions. The algorithm we analyze performs on-line updating of a parameter vector during a single endless trajectory of an ergodic Markov chain with a finite or infinite state space. We present a proof of convergence (with probability 1), a characterization of the limit of convergence, and a bound on the resulting approximation error. In addition to proving new and stronger results than those previously available, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. Finally, we prove that on-line updates, based on entire trajectories of the Markov chain, are in a certain sense necessary for convergence. This fact reconciles positive and negative results that have been discussed in the literature, regarding the soundness of temporal-difference learning.