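Consider a grid world where entering the goal state (G) yields a reward of 100 and every other state yields a reward of 0.
Let the learning rate be 0.8, with no discounting (discount factor 1), as in the update worked out below the diagram.
In the diagram, each arrow is labeled with the current Q-value estimate for that move, and A marks the agent's position.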
+-----+-----+-----+            +-----+-----+-----+
|   -->74 -->100  |            |     |     |     |
|  A  <--66 |  G  | a = right  |     |  A  |  G  |
|     |     |     |            |     |     |     |
+-----+--|--+-----+ ---------> +-----+-----+-----+
|     |  v  |     |            |     |     |     |
|     | 82  |     |            |     |     |     |
|     |     |     |            |     |     |     |
+-----+-----+-----+            +-----+-----+-----+
 Initial state s1                   State s2
Q(s1, right) = 74 + 0.8 (0 + max{Q(s2, left), Q(s2, right), Q(s2, down)} - 74)
             = 74 + 0.8 (0 + max{66, 100, 82} - 74)
             = 74 + 0.8 (26)
             = 94.8
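
The same update can be checked numerically. The sketch below is a minimal illustration, not part of the original example: the function name q_update and the dictionary q_s2 are hypothetical, and gamma = 1.0 is an assumption inferred from the arithmetic above, where no discount factor multiplies the max term.

```python
def q_update(q_sa, reward, next_q_values, alpha=0.8, gamma=1.0):
    """Return the updated Q(s, a) after one tabular Q-learning step.

    alpha is the learning rate from the example; gamma = 1.0 is an
    assumption (the worked update applies no discount).
    """
    # Target: immediate reward plus the best estimated value in the next state.
    td_target = reward + gamma * max(next_q_values)
    # Move the old estimate a fraction alpha of the way toward the target.
    return q_sa + alpha * (td_target - q_sa)


# Current estimates for the actions available in s2, read off the diagram.
q_s2 = {"left": 66.0, "right": 100.0, "down": 82.0}

# Moving right from s1 enters a non-goal cell, so the reward is 0.
new_q = q_update(q_sa=74.0, reward=0.0, next_q_values=q_s2.values())
print(round(new_q, 1))  # 94.8
```

With alpha = 1.0 the old estimate would be replaced by the target (100) outright; alpha = 0.8 moves it only 80% of the way from 74 toward 100.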