Overcoming the Curse of Dimensionality with Reinforcement Learning

Dr. Richard S. Sutton

AT&T Shannon Labs

Technological advances in the last few decades have made computation and memory vastly cheaper and thus available in massive quantities. The field of reinforcement learning attempts to take advantage of this trend when solving large-scale stochastic optimal control problems. Dynamic programming can solve small instances of such problems, but it suffers from Bellman's "curse of dimensionality": the state space, and with it the computational cost, grows exponentially with the number of state variables, quickly exceeding even the "massive" computational resources now available. Reinforcement learning brings in two new techniques: 1) parametric approximation of the value function, and 2) sampling of state trajectories (rather than sweeps through the state space). These make it possible to find approximate solutions, whose quality improves with the available computational resources, to problems far too large to be attempted with conventional dynamic programming. However, these techniques also complicate the theory, and substantial gaps remain between the reinforcement learning methods that have been proven effective and those that appear most effective in practice. In this talk I focus on two of the most recent controversies. First, I extend the convergence result of Tsitsiklis and Van Roy for on-policy evaluation with linear function approximation to the off-policy case, reviving the possibility of convergence results for value-based off-policy control methods such as Q-learning. Second, I present formal and empirical results comparing the efficiency of methods that directly approximate a policy rather than a value function. I conclude with a perspective on the relationship of reinforcement learning to animal learning, neuroscience, and artificial intelligence.
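
As a concrete illustration of the two techniques named above, the sketch below trains a linear, parametric value function with temporal-difference (TD(0)) updates computed along sampled state trajectories, rather than by dynamic-programming sweeps over the state space. The random-walk problem, state-aggregation features, and step size are illustrative assumptions, not details from the talk.

    # Minimal sketch: linear TD(0) value estimation from sampled trajectories.
    # The environment, feature map, and constants are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    N_STATES = 50      # random-walk states 0..49
    N_FEATURES = 10    # state aggregation: 5 states share each feature
    GAMMA = 1.0        # undiscounted episodic task
    ALPHA = 0.05       # step size


    def features(s):
        """One-hot feature vector for the aggregate containing state s."""
        x = np.zeros(N_FEATURES)
        x[s * N_FEATURES // N_STATES] = 1.0
        return x


    def sample_episode():
        """Symmetric random walk; +1 for exiting right, -1 for exiting left."""
        s = N_STATES // 2
        trajectory = []
        while True:
            s_next = s + rng.choice((-1, 1))
            if s_next < 0:
                trajectory.append((s, -1.0, None))
                return trajectory
            if s_next >= N_STATES:
                trajectory.append((s, 1.0, None))
                return trajectory
            trajectory.append((s, 0.0, s_next))
            s = s_next


    # The weight vector has N_FEATURES entries, not one entry per state.
    w = np.zeros(N_FEATURES)
    for _ in range(200):
        for s, r, s_next in sample_episode():
            v = features(s) @ w
            v_next = 0.0 if s_next is None else features(s_next) @ w
            w += ALPHA * (r + GAMMA * v_next - v) * features(s)

    print("approximate values at states 5, 25, 45:",
          [round(float(features(s) @ w), 2) for s in (5, 25, 45)])

Because the weight vector has one entry per feature rather than per state, memory and per-update cost scale with the number of features, not with the size of the state space.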
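The first controversy concerns estimating the value function of one policy while following another. One standard way to set up such an off-policy evaluation, sketched below, is to re-weight each TD update by the importance-sampling ratio between the target and behavior policies. The ring-world problem, the two policies, and the step size are again assumptions made for illustration; this is not the specific construction or convergence argument presented in the talk.

    # Minimal sketch: off-policy TD(0) with linear features, correcting each
    # update with the importance-sampling ratio pi(a|s) / b(a|s).
    # The problem, policies, and constants are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)

    N_STATES, N_FEATURES = 50, 10
    GAMMA, ALPHA = 0.95, 0.01

    PI_RIGHT = 0.9   # target policy: step right with probability 0.9
    B_RIGHT = 0.5    # behavior policy actually followed: right with probability 0.5


    def features(s):
        x = np.zeros(N_FEATURES)
        x[s * N_FEATURES // N_STATES] = 1.0
        return x


    def step(s, a):
        """Move around a ring of states; reward 1.0 each time state 0 is entered."""
        s_next = (s + 1) % N_STATES if a == 1 else (s - 1) % N_STATES
        return s_next, (1.0 if s_next == 0 else 0.0)


    w = np.zeros(N_FEATURES)
    s = 0
    for _ in range(100_000):
        a = 1 if rng.random() < B_RIGHT else 0        # act with the behavior policy
        rho = PI_RIGHT / B_RIGHT if a == 1 else (1 - PI_RIGHT) / (1 - B_RIGHT)
        s_next, r = step(s, a)
        td_error = r + GAMMA * features(s_next) @ w - features(s) @ w
        w += ALPHA * rho * td_error * features(s)     # importance-weighted TD(0)
        s = s_next

    print("estimated value (under the target policy) near state 0:",
          round(float(features(0) @ w), 2))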
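The second controversy concerns methods that parameterize the policy itself. The sketch below performs a Monte Carlo policy-gradient (REINFORCE-style) update of a softmax policy on a two-armed bandit; the bandit payoffs, parameterization, and step size are illustrative assumptions and do not reproduce the formal or empirical comparisons reported in the talk.

    # Minimal sketch: adjusting policy parameters directly with a Monte Carlo
    # policy-gradient (REINFORCE-style) update on a two-armed bandit.
    # The payoffs, parameterization, and step size are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(2)

    ARM_MEANS = np.array([0.2, 0.8])   # expected payoff of each action
    ALPHA = 0.1                        # step size

    theta = np.zeros(2)                # policy parameters, one per action
    for _ in range(5000):
        probs = np.exp(theta) / np.exp(theta).sum()   # softmax policy pi(a)
        a = rng.choice(2, p=probs)
        r = ARM_MEANS[a] + rng.normal(scale=0.1)      # sampled reward
        grad_log = -probs
        grad_log[a] += 1.0                            # gradient of log pi(a) w.r.t. theta
        theta += ALPHA * r * grad_log                 # REINFORCE update

    probs = np.exp(theta) / np.exp(theta).sum()
    print("learned action probabilities:", np.round(probs, 2))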