
Active Learning in an Unknown Environment

Consider the available actions, their possible outcomes, and the possible rewards
Select the action that maximizes expected reward
From utility theory, the expected utility of an action given evidence can be calculated as $EU(A\vert E) = \sum_i P(Result_i(A)\vert E,Do(A)) U(Result_i(A))$
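The utility of a state $i$ then satisfies $U(i) = R(i) + \max_a \sum_j M_{ij}^a U(j)$, where $R(i)$ is the reward in state $i$ and $M_{ij}^a$ is the probability of reaching state $j$ when action $a$ is taken in state $i$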
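
The two formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the notes' own implementation: the model `M`, rewards `R`, utility estimates `U`, and the function names are all hypothetical, and terminal states are modeled here as rows of zeros (no successors) so that their utility equals their reward.

```python
# Hypothetical 3-state, 2-action world (all values are made up for illustration).
# M[a][i][j] is the assumed probability of reaching state j from state i under action a.
# States 1 and 2 are terminal: their rows are all zeros (an assumption of this sketch).
M = [
    [[0.0, 0.9, 0.1], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # action 0
    [[0.0, 0.2, 0.8], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # action 1
]
R = [-0.1, 1.0, -1.0]  # reward R(i) for each state
U = [0.0, 1.0, -1.0]   # current utility estimates U(j)


def expected_utility(M, U, i, a):
    """sum_j M_ij^a * U(j): expected utility of taking action a in state i."""
    return sum(p * U[j] for j, p in enumerate(M[a][i]))


def best_action(M, U, i):
    """Index of the action that maximizes expected utility in state i."""
    return max(range(len(M)), key=lambda a: expected_utility(M, U, i, a))


def backup(M, R, U, i):
    """One application of U(i) = R(i) + max_a sum_j M_ij^a * U(j)."""
    return R[i] + max(expected_utility(M, U, i, a) for a in range(len(M)))


print(best_action(M, U, 0))          # 0
print(round(backup(M, R, U, 0), 3))  # 0.7
```

In state 0, action 0 has expected utility $0.9 \cdot 1 + 0.1 \cdot (-1) = 0.8$ and action 1 has $0.2 \cdot 1 + 0.8 \cdot (-1) = -0.6$, so the agent picks action 0 and the backed-up utility is $-0.1 + 0.8 = 0.7$. In an unknown environment `M` is not given and would have to be estimated from observed transitions before these functions could be applied.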