
Markov Decision Processes

Assume

finite set of states $S$
set of actions $A$
at each discrete time step the agent observes state $i \in S$ and chooses action $a \in A$
then receives an immediate reward
and the state changes to $j \in S$
Markov assumption: the resulting state $j$ depends only on the current state $i$ and the action $a$
- Reward and next state depend only on current state and action
- $M_{ij}^a$ is the probability of reaching state $j$ after executing action $a$ in state $i$
- $M_{ij}^a$ can be estimated from observed state transition frequencies
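
The frequency estimate in the last point can be sketched in code. One common choice, assumed here rather than stated in these notes, is the relative frequency: $M_{ij}^a \approx N_{ij}^a / \sum_k N_{ik}^a$, where $N_{ij}^a$ counts how often action $a$ in state $i$ led to state $j$. The function name, the state and action labels, and the sample data below are hypothetical and for illustration only.

```python
from collections import defaultdict


def estimate_transition_model(transitions):
    """Estimate M[(i, a)][j] from observed (state, action, next_state) triples.

    Uses plain relative frequencies (an assumption of this sketch):
    count(i, a, j) / count(i, a).
    """
    # counts[(i, a)][j] = number of times action a in state i led to state j
    counts = defaultdict(lambda: defaultdict(int))
    for i, a, j in transitions:
        counts[(i, a)][j] += 1

    model = {}
    for (i, a), next_counts in counts.items():
        total = sum(next_counts.values())
        model[(i, a)] = {j: n / total for j, n in next_counts.items()}
    return model


# Hypothetical observations: (state, action, next_state)
observed = [
    ("s0", "right", "s1"),
    ("s0", "right", "s1"),
    ("s0", "right", "s0"),
    ("s1", "right", "s2"),
]

M = estimate_transition_model(observed)
print(M[("s0", "right")])
```

State-action pairs that were never observed do not appear in the model at all, so in practice the estimate is only as good as the coverage of the observed transitions.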