Distributed Reinforcement Learning in a Network
I will discuss a distributed reinforcement learning protocol for optimizing the
dynamic behavior of a network of simple electronic components, such as a sensor
network, an ad hoc network of mobile devices, or a network of communication
switches. This protocol requires only local communication and simple
computations that are distributed among the devices. As a motivating example, I
will consider a problem involving the optimization of power consumption, delay, and
buffer overflow in a simplified model of a sensor network.
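
To make the example concrete, a per-node objective in such a model might weigh
these three costs against one another. The following sketch is an illustrative
assumption on my part; the names and weights (node_cost, w_power, w_delay,
w_overflow) are hypothetical, not the model from the talk.

    # Hypothetical per-node cost for the sensor-network example; the
    # weights are illustrative assumptions, not values from the talk.
    def node_cost(power, delay, overflows,
                  w_power=1.0, w_delay=0.5, w_overflow=10.0):
        """Weighted sum of energy use, queueing delay, and dropped
        packets; the network-wide objective would aggregate this
        cost over all nodes."""
        return w_power * power + w_delay * delay + w_overflow * overflows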
This approach builds on policy gradient methods for reinforcement learning. The
protocol can be viewed as an extension of policy gradient methods to a context
involving a team of agents optimizing aggregate performance through
asynchronous distributed communication and computation. The dynamics of the
protocol approximate the solution to an ordinary differential equation that
follows the gradient of the performance objective.
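
As a rough sketch of the flavor of such a protocol, each node might hold its
own policy parameters and apply a REINFORCE-style update driven by a locally
available reward; the class and names below (Node, act_and_trace, local_reward)
are illustrative assumptions, not the protocol itself. In the small-step-size
limit, updates of this form track a gradient-following ODE of the type
described above, d(theta)/dt = grad eta(theta).

    import numpy as np

    class Node:
        """One agent in the network; it sees only its own state,
        action, and a locally computed reward signal."""

        def __init__(self, dim, rng):
            self.theta = np.zeros(dim)        # local policy parameters
            self.eligibility = np.zeros(dim)  # accumulated score function
            self.rng = rng

        def act_and_trace(self, state):
            # Gaussian policy: action ~ N(theta . state, 1); the score
            # function grad log pi is (action - mean) * state.
            mean = self.theta @ state
            action = mean + self.rng.standard_normal()
            self.eligibility += (action - mean) * state
            return action

        def update(self, local_reward, step_size=1e-3):
            # Stochastic gradient step using only local information.
            self.theta += step_size * local_reward * self.eligibility
            self.eligibility[:] = 0.0

    # One asynchronous interaction at a single node:
    rng = np.random.default_rng(0)
    node = Node(dim=4, rng=rng)
    state = rng.standard_normal(4)
    action = node.act_and_trace(state)
    node.update(local_reward=-abs(action))  # e.g., penalize large actions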
A shortcoming of customary policy gradient approaches, in centralized as well
as distributed reinforcement learning, is that the signal-to-noise ratios of
gradient estimates diminish as the network grows. I will discuss how the use of
"local value functions" may alleviate this problem.