Example

Consider a grid world in which the goal state (G) yields a reward of 100 and every other state yields a reward of 0.

Let the learning rate be 0.8. The update below contains no discount term, which is equivalent to a discount factor of 1.

+-----+-----+-----+              +-----+-----+-----+
|    -->74 -->100 |              |     |     |     |
|  A <--66  |  G  |  a = right   |     |  A  |  G  |
|     |  |  |     |              |     |     |     |
+-----+--|--+-----+  --------->  +-----+-----+-----+
|     |  v  |     |              |     |     |     |
|     | 82  |     |              |     |     |     |
|     |     |     |              |     |     |     |
+-----+-----+-----+              +-----+-----+-----+

Initial state s1                   State s2

Q(s1, right) = 74 + 0.8(0 + max{Q(s2, left), Q(s2, right), Q(s2, down)} - 74)
             = 74 + 0.8(0 + max{66, 100, 82} - 74)
             = 74 + 0.8(26)
             = 94.8
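
The same arithmetic can be checked with a short Python sketch. The function name q_update and the parameter names alpha and gamma are illustrative choices, not taken from these notes. The discount factor gamma defaults to 1 here as an assumption, because the worked update above applies no discount to the next-state value.

```python
def q_update(q_sa, reward, next_q_values, alpha, gamma=1.0):
    """Apply one Q-learning update to a single (state, action) value.

    q_sa          -- current estimate Q(s, a)
    reward        -- immediate reward received for the move
    next_q_values -- Q(s', a') for every action a' available in s'
    alpha         -- learning rate
    gamma         -- discount factor (assumed to be 1 in this example)
    """
    target = reward + gamma * max(next_q_values)
    return q_sa + alpha * (target - q_sa)


# Values read off the diagram: the Q estimates for the three moves from s2.
q_s2 = {"left": 66, "right": 100, "down": 82}

# Moving right from s1 yields reward 0 and leads to s2.
new_q = q_update(q_sa=74, reward=0, next_q_values=q_s2.values(), alpha=0.8)
print(round(new_q, 1))
```

With a learning rate of 0.8 the estimate moves 80% of the way from the old value 74 toward the target 100, so it does not jump to the target in a single step.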