An Analysis of Temporal-Difference Learning with Function Approximation
Technical report LIDS-P-2322, March 1996.
John N. Tsitsiklis and Benjamin Van Roy
Laboratory for Information and Decision Systems,
Massachusetts Institute of Technology,
Cambridge, MA 02139;
e-mail: jnt@mit.edu, bvr@mit.edu
Abstract
We discuss the temporal-difference learning algorithm, as applied to
approximating the cost-to-go function of an infinite-horizon discounted
Markov chain, using a function approximator involving linear combinations
of fixed basis functions. The algorithm we analyze performs on-line updating
of a parameter vector during a single endless trajectory of an ergodic
Markov chain with a finite or infinite state space. We present a proof of
convergence (with probability 1), a characterization of the limit of
convergence, and a bound on the resulting approximation error. In addition
to proving new and stronger results than those previously available, our
analysis is based on a new line of reasoning that provides new intuition
about the dynamics of temporal-difference learning. Finally, we prove that
on-line updates, based on entire trajectories of the Markov chain, are in a
certain sense necessary for convergence. This fact reconciles positive and
negative results that have been discussed in the literature, regarding the
soundness of temporal-difference learning.