MODELING OF USER BEHAVIOR IN MATCHING TASK BASED
ON PREVIOUS REWARD HISTORY AND PERSONAL RISK FACTOR
By: Helen Belogolova and Amy Daitch
Abstract:
To optimize the reward received in a given situation, people must consider the outcomes of past decisions. In this experiment, subjects performed a matching task in which they chose between two buttons and received a reward based on predetermined reward functions. These reward functions computed a reward after each trial based on the ratio of the subject's button presses within the past 40 trials. We chose to model the human behavior using leaky integration to account for the subject's memory decay. In addition, we defined a constant personal risk factor and a variable risk factor based on the reward history. Our model performed the task best when the risk factors were at either extreme. By varying these parameters, we generated data that covered the spectrum produced by the actual subjects.
Methods:
General Model:
Part I. Exploratory Phase
Since subjects are initially unaware of the reward functions associated with their choices, their first few trials are largely random. To account for this, we designated the first t trials as exploratory, meaning there was an equal probability of choosing either button:
P(button A) = 0.5
P(button B) = 0.5
For all of our tests, we set the length of this period to 10 trials.
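A minimal sketch of this phase in Python (the trial count t = 10 is from the text; the function name and use of Python's random module are our own choices):

```python
import random

def exploratory_choices(t=10, seed=None):
    """Generate the first t exploratory trials, choosing A or B with
    equal probability."""
    rng = random.Random(seed)
    return ['A' if rng.random() < 0.5 else 'B' for _ in range(t)]
```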
Part II. Choices based on past reward history
The reward function used in this experiment took into account the choices from a 40-trial buffer, which was updated after each trial. We created a vector of these rewards, then weighted them according to the leaky integrator model with decay parameter d:
weighted_rewards_vector = [exp(1*d) exp(2*d) … exp(40*d)]' .* reward_vector
The weights multiply the rewards element-wise, with index 40 corresponding to the most recent trial. This way, the most recently received rewards carried the most influence on the subject's subsequent decision.
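As a sketch of this weighting step in Python (the 40-trial buffer and decay parameter d are from the text; NumPy and the function name are our assumptions), with the last entry of the buffer holding the most recent reward:

```python
import numpy as np

def weight_rewards(reward_vector, d=1.0):
    """Apply leaky-integrator weights to a buffer of past rewards.

    reward_vector[-1] is assumed to be the most recent reward, so it
    receives the largest weight exp(n*d); older rewards decay toward
    exp(1*d).
    """
    n = len(reward_vector)                      # buffer length (40 in the experiment)
    weights = np.exp(d * np.arange(1, n + 1))   # [exp(1*d), exp(2*d), ..., exp(n*d)]
    return weights * np.asarray(reward_vector, dtype=float)
```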
To choose between the two buttons on a given trial, our model sums the weighted rewards received in previous trials after A button presses (rewA) and after B button presses (rewB). The probability of choosing button A is then:
p(A) = rewA/(rewA + rewB)
p(B) = 1 - p(A)
Based on these aggregate rewards, the next choice is generated as follows:
if rand(1) < p(A) --> choice A
else --> choice B
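Putting these pieces together, the decision rule might be sketched in Python as follows (rewA and rewB are from the text; the 'A'/'B' label coding and the NumPy generator are our assumptions):

```python
import numpy as np

def choose_next(weighted_rewards, past_choices, rng=np.random.default_rng()):
    """Sample the next button press from weighted past rewards.

    `past_choices` holds the 'A'/'B' label of each buffered trial,
    aligned with `weighted_rewards`; rewA and rewB sum the weighted
    rewards that followed each button.
    """
    weighted_rewards = np.asarray(weighted_rewards, dtype=float)
    is_a = np.array([c == 'A' for c in past_choices])
    rewA = weighted_rewards[is_a].sum()
    rewB = weighted_rewards[~is_a].sum()
    pA = rewA / (rewA + rewB)                 # p(B) = 1 - p(A)
    return 'A' if rng.random() < pA else 'B'  # same role as rand(1) < p(A)
```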
Model accounting for risk:
In our model, we included two parameters associated with risk: essentially, the subject's willingness to deviate from the choice that past trials suggest is optimal. The first, which is kept constant throughout the experiment, is a function of the subject's personality and general willingness to take risks (it can range from 0 to 1). The second risk parameter increases as the subject's cumulative reward increases:
cumulative_risk(trial) = cumulative_reward(trial)/max_cumulative_reward
The maximum cumulative reward is determined by the reward functions; for the experiment we ran on the subjects, the maximum reward was 6 Hershey's kisses.
The weighted personal risk factor and cumulative-reward risk factor add up to the total risk factor, another variable that changes for each subject:
total_risk = personal_risk*personal_risk_weight + cumulative_risk*cumulative_risk_weight
Given the total risk parameter along with the weighted rewards from past trials, each decision is made as follows:
p(A) = rewA/(rewA + rewB) - (rewA/(rewA + rewB) - 0.5)*2*total_risk
p(B) = 1 - p(A)
The choice of button is then generated in the same way as in the general model.
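A sketch of the risk-adjusted rule in Python (the formula and total_risk are from the text; the default weights and the function name are illustrative assumptions):

```python
def risk_adjusted_pA(rewA, rewB, personal_risk, cumulative_risk,
                     personal_risk_weight=0.5, cumulative_risk_weight=0.5):
    """Pull p(A) toward (and, for high risk, past) 0.5.

    total_risk = 0 reproduces the general model, total_risk = 0.5
    gives pure guessing, and total_risk = 1 inverts the learned
    preference.
    """
    total_risk = (personal_risk * personal_risk_weight
                  + cumulative_risk * cumulative_risk_weight)
    pA = rewA / (rewA + rewB)
    return pA - (pA - 0.5) * 2 * total_risk
```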
Results:
To see how the different parameters in our model ultimately affected the results, we ran the experiment using our model, varying one parameter at a time. First, we set both risk parameters to zero. Since our model assumes a stochastic decision-making process, there is some randomness in the decisions made; theoretically, one subject would produce somewhat different results if he or she took the experiment more than once (although that obviously introduces other problems, such as prior knowledge of the reward functions). We therefore ran the experiment on the model many times to get an idea of how a subject with certain characteristics would generally behave. After this base case, we ran the experiment on the model again with the varied parameters. After each case below is a link to a graph of the fraction of button A presses produced by the model vs. trial number.
Case I: Personal Risk = 0, Cumulative Risk Weight = 0, Decay rate = 1
http://web.mit.edu/helenb/Public/norisk.jpg
Case II: Personal Risk = 0.25, Cumulative Risk Weight = 0, Decay rate = 1
http://web.mit.edu/helenb/Public/prisk25.jpg
Case III: Personal Risk = 0.5, Cumulative Risk Weight = 0, Decay rate = 1
http://web.mit.edu/helenb/Public/prisk5.jpg
Case IV: Personal Risk = 0.75, Cumulative Risk Weight = 0, Decay rate = 1
http://web.mit.edu/helenb/Public/prisk75.jpg
Case V: Personal Risk = 1, Cumulative Risk Weight = 0, Decay rate = 1
http://web.mit.edu/helenb/Public/prisk1.jpg
Case VI: Personal Risk = 0, Cumulative Risk Weight = 0.25, Decay rate = 1
http://web.mit.edu/helenb/Public/cummrisk25.jpg
Case VII: Personal Risk = 0, Cumulative Risk Weight = 0.50, Decay rate = 1
http://web.mit.edu/helenb/Public/cummrisk5.jpg
Case VIII: Personal Risk = 0, Cumulative Risk Weight = 0.75, Decay rate = 1
http://web.mit.edu/helenb/Public/cumrisk75.jpg
Case IX: Personal Risk = 0, Cumulative Risk Weight = 1.0, Decay rate = 1
http://web.mit.edu/helenb/Public/cummrisk.jpg
Case X: Personal Risk = 0, Cumulative Risk Weight = 0, Decay rate = 2
http://web.mit.edu/helenb/Public/decay-2.jpg
Case XI: Personal Risk = 0, Cumulative Risk Weight = 0, Decay rate = 0.5
http://web.mit.edu/helenb/Public/decay-5.jpg
From the graphs, which plot the ratio of the subject's button presses within the buffer vs. trial number, we observed general trends in how the parameters affected how well the model performed the task. When varying only the personal risk factor, the model was most successful (the fraction of A button presses converged to either 1 or 0) when the risk factor was either very low or very high. In this particular experiment, high and low risk factors are essentially equivalent, since one receives the most reward by pressing only one button or the other. We found that increasing the cumulative-reward risk factor weight increased the model's success. We tested the model with decay rates of 0.5, 1.0, and 2.0, keeping both risk factors at 0. The model with decay rate 2.0 succeeded most often of the three, and the one with decay rate 0.5 succeeded least often. This indicates that the most important rewards to remember for this experiment are those in the immediate past, and that trials further in the past are less relevant to future outcomes.
After generating these results from our model, we compared the choices the model produced with those of the tested subjects. We took the cross-correlation of each subject's choice vector with the choice vectors generated by our model in the various cases indicated above. Strong correlations appear as peaks or inverse peaks at the center of the cross-correlation graphs. We noticed relatively strong correlations between our subjects and the models with very high or very low personal risk factors. As the model's cumulative-reward risk weight increased (with the personal risk factor held at zero), the correlation between the subject's and the model's choices improved from moderate to fairly strong.
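The comparison described above might look like this in Python (coding A/B as +1/-1 and mean-centering before correlating are our assumptions; a peak or inverse peak at the center index corresponds to zero lag):

```python
import numpy as np

def choice_xcorr(subject_choices, model_choices):
    """Cross-correlate two 'A'/'B' choice sequences of equal length."""
    to_num = lambda seq: np.array([1.0 if c == 'A' else -1.0 for c in seq])
    x = to_num(subject_choices)
    y = to_num(model_choices)
    x -= x.mean()                              # mean-center so the sign of the
    y -= y.mean()                              # peak reflects (anti)similarity
    return np.correlate(x, y, mode='full')     # index len(x)-1 is zero lag
```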
The following are graphs of the cross-correlation between our model's choices and those of the actual subjects. Refer to the cases above for the specific conditions of each graph:
http://web.mit.edu/helenb/Public/corrcase1.jpg
http://web.mit.edu/helenb/Public/corrcase2.jpg
http://web.mit.edu/helenb/Public/corrcase3.jpg
http://web.mit.edu/helenb/Public/corrcase4.jpg
http://web.mit.edu/helenb/Public/corrcase5.jpg
http://web.mit.edu/helenb/Public/corrcase6.jpg
http://web.mit.edu/helenb/Public/corrcase7.jpg
http://web.mit.edu/helenb/Public/corrcase8.jpg
http://web.mit.edu/helenb/Public/corrcase9.jpg
http://web.mit.edu/helenb/Public/corrcase10.jpg
The following is our code for the model and data analysis: