Overcoming the Curse of Dimensionality with Reinforcement Learning

Dr. Richard S. Sutton
AT&T Shannon Labs

Technological
advances in the last few decades have made computation and memory vastly cheaper
and thus available in massive quantities. The field of reinforcement learning
attempts to take advantage of this trend when solving large-scale stochastic
optimal control problems. Dynamic programming can solve small instances of
such problems, but suffers from Bellman's "curse of dimensionality," the tendency
of the state space and thus computational complexity to scale exponentially
with the number of state variables (and thus to quickly exceed even the "massive"
computational resources now available). Reinforcement learning brings in two
new techniques: 1) parametric approximation of the value function, and 2)
sampling of state trajectories (rather than sweeps through the state space).
These make it possible to find approximate solutions, whose quality improves with the
available computational resources, on problems too large even to be attempted
with conventional dynamic programming. However, these techniques also complicate
the theory, and there remain substantial gaps between the reinforcement learning
methods proven effective and those that appear most effective in practice.
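
To make the two techniques concrete, the following is a minimal sketch of semi-gradient TD(0) policy evaluation with a linear, parametric value function learned from sampled trajectories; memory and computation scale with the number of features rather than with the size of the state space. The environment interface (env.reset(), env.step()), the feature map phi, and the step-size and discount values are illustrative assumptions, not a definitive implementation.

```python
import numpy as np

def td0_linear(env, policy, phi, num_features,
               episodes=1000, alpha=0.05, gamma=0.99):
    """Estimate v_pi(s) ~ w . phi(s) for a fixed policy pi (sketch)."""
    w = np.zeros(num_features)                  # parametric value function
    for _ in range(episodes):
        state = env.reset()                     # sample a trajectory ...
        done = False
        while not done:                         # ... rather than sweeping all states
            action = policy(state)              # follow the policy being evaluated
            next_state, reward, done = env.step(action)
            v = w @ phi(state)
            v_next = 0.0 if done else w @ phi(next_state)
            td_error = reward + gamma * v_next - v
            w += alpha * td_error * phi(state)  # semi-gradient TD(0) update
            state = next_state
    return w
```
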
In this talk we focus on two of the most recent controversies. First, I extend
the convergence result of Tsitsiklis and Van Roy for on-policy evaluation
with linear function approximation to the off-policy case, reviving the possibility
of convergence results for value-based off-policy control methods such as
Q-learning. Second, I present formal and empirical results comparing the efficiency
of methods which directly approximate a policy rather than a value function.
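
For contrast, the sketch below illustrates one such method in the Monte Carlo policy-gradient (REINFORCE) style: the policy itself is parameterized and its parameters are adjusted along sampled trajectories, without ever forming a value function. The linear-softmax parameterization, the feature map phi(state, action), and the environment interface are again assumptions made only for illustration.

```python
import numpy as np

def softmax_probs(theta, phi, state, actions):
    """Action probabilities under a linear softmax policy (sketch)."""
    prefs = np.array([theta @ phi(state, a) for a in actions])
    prefs -= prefs.max()                        # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def reinforce(env, phi, num_features, actions,
              episodes=1000, alpha=0.01, gamma=0.99):
    """Monte Carlo policy gradient: adjust policy parameters directly."""
    theta = np.zeros(num_features)              # policy parameters
    for _ in range(episodes):
        # Sample one trajectory by following the current policy.
        trajectory, state, done = [], env.reset(), False
        while not done:
            probs = softmax_probs(theta, phi, state, actions)
            i = np.random.choice(len(actions), p=probs)
            next_state, reward, done = env.step(actions[i])
            trajectory.append((state, i, reward))
            state = next_state
        # Stochastic gradient ascent on the return along the sampled trajectory.
        G = 0.0
        for state, i, reward in reversed(trajectory):
            G = reward + gamma * G              # return from this step onward
            probs = softmax_probs(theta, phi, state, actions)
            expected_phi = sum(p * phi(state, a) for p, a in zip(probs, actions))
            theta += alpha * G * (phi(state, actions[i]) - expected_phi)
    return theta
```
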
I conclude with a perspective on the relationship of reinforcement learning
to animal learning, neuroscience, and artificial intelligence.