- Naive approach: least mean squares (LMS)
- U[i] = RunningAverage(U[i], CumReward, N[i])
- N[i] is the number of times state i has been visited
- Does not exploit the constraints imposed by the transition probabilities
- Converges slowly
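The LMS update above can be sketched as an incremental running average: after each training trial, the observed cumulative reward-to-go from a state is folded into that state's utility estimate. This is a minimal sketch; the helper name and the sample values are illustrative, not from the notes.

```python
def running_average(u, cum_reward, n):
    """Fold one new sample into the running average.

    u: current utility estimate U[i]
    cum_reward: reward-to-go observed from state i on this trial
    n: number of visits to state i, including this one (N[i])
    """
    return u + (cum_reward - u) / n

# Example (made-up data): rewards-to-go 4, 6, 8 observed from one state
U = 0.0
for n, r in enumerate([4.0, 6.0, 8.0], start=1):
    U = running_average(U, r, n)
# U is now the sample mean, 6.0
```

Note that each state is updated independently, which is exactly why this estimator ignores the transition-probability constraints and converges slowly.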
- Adaptive dynamic programming (ADP)
- U(i) = R(i) + Σ_j M_ij U(j), where R(i) is the reward for being in state i and M_ij is the probability of transition from state i to state j
- Solve n equations in n unknowns
- Is this practical?
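Once R and M have been estimated from observed transitions, the equations U(i) = R(i) + Σ_j M_ij U(j) form a linear system, (I − M)U = R, which can be solved directly. A minimal sketch, using a made-up 3-state chain with an absorbing terminal state (the states, rewards, and transitions are illustrative, not from the notes):

```python
import numpy as np

# Transition probabilities M[i][j] for a fixed policy (assumed example)
M = np.array([[0.0, 1.0, 0.0],   # state 0 always moves to state 1
              [0.0, 0.0, 1.0],   # state 1 always moves to state 2
              [0.0, 0.0, 0.0]])  # state 2 is terminal: no transitions

# R[i] = reward for being in state i (assumed example values)
R = np.array([-0.04, -0.04, 1.0])

# Solve (I - M) U = R, i.e. n equations in n unknowns
U = np.linalg.solve(np.eye(3) - M, R)
# U = [0.92, 0.96, 1.0]
```

Solving the full system costs roughly O(n³) per update, which is the crux of the practicality question: exact solution is feasible for small state spaces but not for large ones.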