Solving Large MDPs Quickly with Partitioned Value Iteration


MIT | BCS | CoCoSci | David Wingate

 

This on-line appendix supports my thesis, "Solving Large MDPs Quickly with Partitioned Value Iteration". Below are links to movies that graphically demonstrate the effect of different prioritization metrics, as well as the source code for the GPS algorithm and the double-arm pendulum simulator.

All videos run at the same "timescale," with a notable caveat. In each video, a new frame is generated every time a new partition is selected. Recall that the GPS algorithm must perform value iteration within the selected partition, which is not shown; thus, what appears to be a single frame in any of the GPS videos may actually involve multiple iterations over that partition. Sometimes a partition is iterated over only once, but sometimes it is iterated over multiple times. This is somewhat unfair to value iteration, which always processes each partition once, but it does not affect the intuition that value iteration will take far longer to solve a problem than any of the other algorithms.
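
To make the caveat concrete, here is a minimal sketch (illustrative Python, not the actual GPS code) of what a single GPS frame corresponds to: the states of the selected partition are backed up repeatedly until that partition's Bellman residual falls below epsilon. All names and data structures below are hypothetical placeholders.

    # V is a dict of state values, P[s][a] is a list of (next_state, prob)
    # pairs, R[s][a] is the immediate reward, and `partition` is the set of
    # states in the currently selected partition.
    def sweep_partition(partition, V, P, R, gamma, epsilon):
        # Back up the partition's states until its Bellman residual < epsilon.
        while True:
            residual = 0.0
            for s in partition:
                # One Bellman backup: best one-step lookahead value for state s.
                new_v = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                            for a in range(len(P[s])))
                residual = max(residual, abs(new_v - V[s]))
                V[s] = new_v
            if residual < epsilon:
                return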

Movies

Mountain Car

To generate these movies, a slightly modified version of the traditional mountain car reward function was employed. Normally, the agent receives a reward upon exiting the space either to the left or to the right; the reward on the left is always -1, and the reward on the right is linear, peaking at a velocity of zero. Instead of this continuous reward, we opted for a point reward: the agent is rewarded only for exiting the space with a velocity within epsilon of zero, and all other rewards are zero. This does not substantially change the optimal control policy, but it clarifies how the algorithms operate.
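
The following is a minimal sketch of that point reward (illustrative Python; the position boundaries, the velocity tolerance, and the reward magnitude are assumptions, not values taken from the GPS sources):

    def point_reward(position, velocity, x_min=-1.2, x_max=0.6, eps=0.01):
        # Nonzero reward only when the agent exits the position range with a
        # velocity within eps of zero; all other rewards are zero.
        exited = position <= x_min or position >= x_max
        if exited and abs(velocity) < eps:
            return 1.0   # reward magnitude is an assumption
        return 0.0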

The X axis represents position, the Z axis represents velocity, and the Y axis represents the value of the states. Red and green colors show different control policies. A blue color indicates that the state has never been backed up. A 400x400 state discretization was used. Partitions were generated by overlaying a coarser 10x10 grid. Epsilon was set at 0.0001.
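
Because the 10x10 partition grid is overlaid on the 400x400 discretization, each partition covers a 40x40 block of cells. A small sketch of the mapping (a hypothetical helper, not code from the GPS distribution):

    N_CELLS = 400                        # cells per dimension (position, velocity)
    N_PARTS = 10                         # partitions per dimension
    CELLS_PER_PART = N_CELLS // N_PARTS  # = 40 cells per partition, per dimension

    def partition_of(cell_x, cell_z):
        # Return the (row, col) of the partition containing cell (cell_x, cell_z).
        return cell_x // CELLS_PER_PART, cell_z // CELLS_PER_PART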


(2.4 MB)
The value iteration algorithm running on the Mountain Car problem. Note how slowly correct information (both value and policy) backpropagates through the state space, and how much effort is wasted as value iteration performs useless backups. Variable reordering was not enabled. Only 752 frames are shown.

(2.2 MB)
The PVI-H1-VRE algorithm running on the Mountain Car problem. Although value function mass is propagated quickly, it is not propagated cleanly. Several "steps" are visible, as well as policy and value imperfections. As a result, the algorithm must "clean up" the value function by working again and again on some regions. This run took 406 frames.

(1.3 MB)
The PVI-H2-VRE algorithm running on the Mountain Car problem. This algorithm tends to ensure that regions are fully converged before moving on. The policy is always very clean as it propagates backwards. This run took 234 frames.

Single-Arm Pendulum

The state space for the single-arm pendulum is described by the angle and angular velocity of the link. The agent must learn to balance the pendulum vertically, at a zero-degree angle with zero angular velocity. There is a single reward in the system, at the point (0,0). Recall that the angle is a circular dimension, so the right-hand and left-hand sides of these movies are actually connected.

The X axis represents the angle (spanning -pi on the left to +pi on the right), and the Z axis represents the angular velocity. The goal state (0,0) is directly in the middle of the picture. The Y axis represents the value of the states. Red and green are different control policies. A blue color indicates that the state has never been backed up. A 400x400 state discretization was used. Partitions were generated by overlaying a coarser 20x20 grid. Epsilon was set at 0.0001.
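
The circular angle dimension is what connects the left and right edges of these movies: cell 0 and cell 399 of the angle axis are neighbors. A sketch of the wrapping (illustrative Python, not the GPS code):

    import math

    def wrap_angle(theta):
        # Wrap an arbitrary angle into the interval [-pi, pi).
        return (theta + math.pi) % (2.0 * math.pi) - math.pi

    def angle_cell(theta, n_cells=400):
        # Discretize a wrapped angle into one of n_cells circular cells.
        frac = (wrap_angle(theta) + math.pi) / (2.0 * math.pi)   # in [0, 1)
        return int(frac * n_cells) % n_cells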


(13.7 MB)
The PVI-H1-VRE algorithm running on the Single Arm Pendulum problem. The "stepping" behavior of the H1 metric is very clear in this video. Note especially how the "arms" on either side of the peak must be processed multiple times. This value function is characterized by many imperfections. The video was terminated after 2230 frames, at which point the remaining error was about 0.0009, still short of the 0.0001 convergence threshold.

(5.3 MB)
The PVI-H2-VRE algorithm running on the Single Arm Pendulum problem. Again, note how clean the policy and value function are as they propagate out from the reward. Note also that once the center of the problem has converged, it is never processed again. This allows PVI-H2-VRE to converge in less than half the time of PVI-H1-VRE: this run is 1220 frames long.

Double-Arm Pendulum

The state space for the double-arm pendulum is described by four variables: two angles and two angular velocities. The agent must learn to balance the pendulum vertically, at a zero-degree angle for both joints and with zero angular velocity for both links. There is a single reward in the system, at the point (0,0,0,0).


(10.7 MB)
Rather than showing the construction of the value function, this video demonstrates the double-arm pendulum simulator in action. At first, the pendulum is allowed to drop straight down. The agent's control policy is then engaged, and the agent balances the pendulum. The pendulum is shown being balanced from several random configurations. This control policy was generated from an MDP with 75,000,000 states.

 

Code

Download GPS 1.1

The GPS source code is designed to operate on general, discrete MDPs. Included in the GPS distribution are several companion discretizers that generate discrete MDPs from continuous-state, continuous-time control problems, as well as supporting scripts designed to aid with partitioning.

Contained in the source code are the models for the Mountain Car, the Single-Arm Pendulum, and the Double-Arm Pendulum. Some limited functionality for dumping Geomview meshes is provided.
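
As a rough illustration of what such a discretizer does (this shows only the general idea, not the GPS discretizers themselves; simulate, cell_center, and cell_of are hypothetical helpers), a continuous-state, continuous-time problem can be converted to a discrete MDP by simulating the dynamics from each cell for one time step and recording where the system lands:

    # Generic grid-discretization sketch.  This is NOT the GPS discretizer;
    # `simulate`, `cell_center`, and `cell_of` are hypothetical helpers.
    def build_transitions(n_cells, actions, simulate, cell_center, cell_of):
        # Return a table mapping (cell, action) to a list of (cell, prob) pairs.
        P = {}
        for s in range(n_cells):
            for a in actions:
                next_state = simulate(cell_center(s), a)  # integrate dynamics for dt
                P[(s, a)] = [(cell_of(next_state), 1.0)]  # deterministic in this sketch
        return P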

Download dap-sim 0.1

This is a double-arm pendulum simulator based on the wxWindows cross-platform GUI library. It can execute control policies, or it can let a user try to balance the pendulum manually.

The distribution comes with a control policy that successfully balances the pendulum from any position.