MODELING OF USER BEHAVIOR IN MATCHING TASK BASED ON PREVIOUS REWARD HISTORY AND PERSONAL RISK FACTOR

 

By: Helen Belogolova and Amy Daitch

 

Abstract:

 

To maximize the reward received in a given situation, people must consider the outcomes of past decisions. In this experiment, subjects were given a matching task in which they chose between two buttons and received a reward determined by predetermined reward functions. These reward functions, which computed a reward after each trial, were based on the ratio of the subject’s button presses over the past 40 trials. We modeled human behavior using leaky integration to account for the subject’s memory decay. In addition, we defined a constant personal risk factor and a variable risk factor based on the reward history. Our model performed the task best when the risk factors were at either extreme. By varying these parameters, we generated data that covered the range of behavior produced by the actual subjects.

 

Methods:

 

General Model:

 

Part I. Exploratory Phase

 

Since subjects are initially unaware of the reward functions associated with their choices, their first few choices are essentially random. To account for this, we treated the first t trials as exploratory, meaning that the model chose either button with equal probability.

P(button A) = 0.5

P(button B) = 0.5

 

For all of our tests, we assigned this period a length of 10 trials.

 

Part II. Choices based on past reward history

 

The reward function used in this experiment took into account the choices from a 40-trial buffer, which was updated after each trial. We created a vector of these rewards and weighted them according to the leaky integrator model with decay parameter d.

 

weighted_rewards_vector = [exp(1*d) exp(2*d) … exp(240*d)]' * reward_vector

 

Because the weights grow with the trial index when d is positive, the most recent rewards carry the most influence on the subject’s subsequent decision.
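The weighting step can be illustrated with a short Python sketch. It assumes that the product in the formula above is element-wise and that the reward history is stored oldest first; the function names and the sample rewards are hypothetical and are not taken from our MATLAB code.

```python
import math

def leaky_weights(n_trials, d):
    """Return the weights exp(1*d), ..., exp(n_trials*d).

    The last entry belongs to the most recent trial (assumed ordering).
    """
    return [math.exp(k * d) for k in range(1, n_trials + 1)]

def weight_rewards(rewards, d):
    """Multiply each reward by its leaky weight (assumed element-wise)."""
    weights = leaky_weights(len(rewards), d)
    return [w * r for w, r in zip(weights, rewards)]

# Made-up reward history, oldest trial first.
rewards = [1.0, 1.0, 1.0]
weighted = weight_rewards(rewards, d=1.0)
```

With equal raw rewards and d = 1, each weighted reward is larger than the one before it, so the latest trial dominates the sum.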

 

To choose between the two buttons in a given trial, our model sums up the weighted rewards received in previous trials after A button presses (rewA) and B button presses (rewB). Then, the probability of choosing button A is:

p(A) = rewA/(rewA + rewB)

p(B) = 1- p(A)

 

Based on these aggregate rewards, the next choice is generated as follows:

 

if rand(1) < p(A) -->choice A

else --> choice B
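The general model, including the exploratory phase, can be sketched in Python as shown here. The function names, the zero-based trial index, and the fallback to 0.5 when no reward has been collected yet are assumptions made for illustration; they are not taken from our MATLAB code.

```python
import random

def prob_a(rew_a, rew_b):
    """p(A) = rewA / (rewA + rewB)."""
    total = rew_a + rew_b
    if total == 0:
        return 0.5  # assumption: no reward information yet, so no preference
    return rew_a / total

def choose(trial, rew_a, rew_b, n_explore=10, rng=random):
    """Pick 'A' or 'B' for a zero-based trial index.

    During the first n_explore trials both buttons are equally likely;
    afterwards the choice follows the weighted reward history.
    """
    p = 0.5 if trial < n_explore else prob_a(rew_a, rew_b)
    return "A" if rng.random() < p else "B"

# Example: after the exploratory phase, A has earned three times as much as B.
example_p = prob_a(3.0, 1.0)
example_choice = choose(trial=20, rew_a=3.0, rew_b=1.0)
```

Passing `rng=random.Random(seed)` makes a simulated run reproducible, which is convenient when the same parameter setting is simulated many times.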

 

 

Model accounting for risk:

 

In our model, we included two parameters associated with risk, that is, the subject’s willingness to deviate from the optimal choice based on past trials. The first, which is kept constant throughout the experiment, reflects the subject’s personality and general willingness to take risks; it can range from 0 to 1. The second risk parameter increases as the subject’s cumulative reward increases.

 

cumulative_risk(trial) = cumulative_reward(trial)/max_cumulative_reward

 

The maximum cumulative reward is determined by the reward functions; for the experiment we ran on the subjects, the maximum reward was 6 Hershey’s kisses.

 

The total risk factor is the weighted sum of the personal risk factor and the cumulative reward risk factor. The two weights are additional parameters that vary from subject to subject.

 

total_risk = personal_risk*personal_risk_weight + cumulative_risk*cumulative_risk_weight
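As a rough Python sketch of the two formulas above: the function names are hypothetical, and the numbers in the example are chosen only for illustration (apart from the maximum of 6, which is the value stated for our experiment). The sketch does not clamp the result, since no bound on total_risk is stated here.

```python
def cumulative_risk(cumulative_reward, max_cumulative_reward):
    """Fraction of the maximum possible reward collected so far."""
    return cumulative_reward / max_cumulative_reward

def total_risk(personal_risk, personal_risk_weight,
               cumulative_reward, max_cumulative_reward,
               cumulative_risk_weight):
    """Weighted sum of the personal and cumulative risk factors."""
    return (personal_risk * personal_risk_weight
            + cumulative_risk(cumulative_reward, max_cumulative_reward)
            * cumulative_risk_weight)

# Illustrative values: half of the maximum reward (3 of 6) collected so far.
example_risk = total_risk(personal_risk=0.5, personal_risk_weight=1.0,
                          cumulative_reward=3.0, max_cumulative_reward=6.0,
                          cumulative_risk_weight=0.5)
```

For these inputs the personal term contributes 0.5 and the cumulative term contributes 0.25, giving a total risk of 0.75.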

 

Given the total risk parameter along with the weighted rewards from past trials, each decision is made as follows:

 

p(A) = rewA/(rewA + rewB)-(rewA/(rewA + rewB)-0.5)*2*total_risk

p(B) = 1-p(A)

 

The choice of button is then generated the same way as in the general model.
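A short Python sketch makes the effect of total_risk on p(A) concrete. The function name and the sample reward sums are hypothetical. With total_risk = 0 the model follows the reward history exactly, with total_risk = 0.5 it chooses uniformly at random, and with total_risk = 1 it inverts the preference.

```python
def risk_adjusted_prob_a(rew_a, rew_b, total_risk):
    """p(A) = r - (r - 0.5) * 2 * total_risk, where r = rewA / (rewA + rewB)."""
    base = rew_a / (rew_a + rew_b)
    return base - (base - 0.5) * 2 * total_risk

# Made-up weighted reward sums giving a base probability of 0.8 for button A.
p_no_risk = risk_adjusted_prob_a(4.0, 1.0, total_risk=0.0)    # about 0.8
p_half_risk = risk_adjusted_prob_a(4.0, 1.0, total_risk=0.5)  # about 0.5
p_full_risk = risk_adjusted_prob_a(4.0, 1.0, total_risk=1.0)  # about 0.2
```

The symmetry between total_risk = 0 and total_risk = 1 around 0.5 is one way to see why very low and very high risk factors can lead to similar behavior in a task where pressing one button exclusively is optimal.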

 

Results:

 

To see how the different parameters in our model ultimately affected the results, we ran the experiment using our model, varying only one parameter at a time. First, we set both risk parameters to zero. Since our model assumes a stochastic decision-making process, there is some randomness in the decisions made. Therefore, in theory, one subject would produce somewhat different results if he or she took the experiment more than once (although that obviously introduces other problems, such as previous knowledge of the reward functions). As a result, we ran the experiment on the model many times to get an idea of how a subject with certain characteristics would generally behave. After this base case, we ran the experiment on the model again with the varied parameters. Each case below is followed by a link to a graph of the fraction of button A presses produced by the model vs. trial.

 

Case I: Personal Risk = 0, Cumulative Risk Weight = 0, Decay rate = 1

http://web.mit.edu/helenb/Public/norisk.jpg

Case II: Personal Risk = 0.25, Cumulative Risk Weight = 0, Decay rate = 1

http://web.mit.edu/helenb/Public/prisk25.jpg

Case III: Personal Risk = 0.5, Cumulative Risk Weight = 0, Decay rate = 1

http://web.mit.edu/helenb/Public/prisk5.jpg

Case IV: Personal Risk = 0.75, Cumulative Risk Weight = 0, Decay rate = 1

http://web.mit.edu/helenb/Public/prisk75.jpg

Case V: Personal Risk = 1, Cumulative Risk Weight = 0, Decay rate = 1

http://web.mit.edu/helenb/Public/prisk1.jpg

Case VI: Personal Risk = 0, Cumulative Risk Weight = 0.25, Decay rate = 1

http://web.mit.edu/helenb/Public/cummrisk25.jpg

Case VII: Personal Risk = 0, Cumulative Risk Weight = 0.50, Decay rate = 1

http://web.mit.edu/helenb/Public/cummrisk5.jpg

Case VIII: Personal Risk = 0, Cumulative Risk Weight = 0.75, Decay rate = 1

http://web.mit.edu/helenb/Public/cumrisk75.jpg

Case IX: Personal Risk = 0, Cumulative Risk Weight = 1.0, Decay rate = 1

http://web.mit.edu/helenb/Public/cummrisk.jpg

Case X: Personal Risk = 0, Cumulative Risk Weight = 0, Decay rate = 2

http://web.mit.edu/helenb/Public/decay-2.jpg

Case XI: Personal Risk = 0, Cumulative Risk Weight = 0, Decay rate = 0.5

http://web.mit.edu/helenb/Public/decay-5.jpg

 

From the graphs, which plot the ratio of the subject’s button presses within the buffer vs. trial number, we saw general trends in how the parameters affected how well the model performed the task. When varying only the personal risk factor, the model was most successful (the fraction of A button presses converged to either 1 or 0) when the risk factor was either very low or very high. In this particular experiment, though, high and low risk factors are essentially equivalent, since one receives the most reward by pressing one button exclusively. We found that increasing the weight of the cumulative reward risk factor increased the success of the model. We tested the model with decay rates of 0.5, 1.0, and 2.0, keeping the risk factors at 0. The model with a decay rate of 2.0 succeeded most often of the three, and the one with a decay rate of 0.5 had the fewest successes. This indicates that the most important previous rewards to remember in this experiment are those in the immediate past, and that trials further in the past are less relevant to future outcomes.

 

After generating these results from our model, we compared the choices the model produced with those of the tested subjects. We took the cross correlation of the subject’s choice vector with the choice vectors generated by our model in the various cases listed above. Strong correlations were indicated by peaks or inverse peaks in the center of the cross correlation graphs. We noticed relatively strong correlations between our subjects and the models with very high or very low personal risk factors. As the cumulative reward risk weight of the model increased (with the personal risk factor held at zero), the correlation between the subject’s and model’s choices improved from moderate to fairly strong.
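One way such a comparison could be computed is sketched below in Python. This is not our corr.m script: the coding of choices as 1 for button A and 0 for button B, the mean subtraction, and the normalization are all assumptions made for illustration, and the sample choice vector is made up.

```python
def cross_correlation(x, y):
    """Normalized cross-correlation of two equal-length sequences.

    Returns a dict mapping each lag to a value in [-1, 1]; lag 0 is the
    center of the cross-correlation graph.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    xc = [v - mean_x for v in x]
    yc = [v - mean_y for v in y]
    norm = (sum(v * v for v in xc) * sum(v * v for v in yc)) ** 0.5
    result = {}
    for lag in range(-(n - 1), n):
        total = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                total += xc[i] * yc[j]
        result[lag] = total / norm if norm else 0.0
    return result

# Made-up choice vector (1 = button A, 0 = button B) and its exact inverse.
subject = [1, 0, 1, 1, 0, 0, 1, 0]
inverse = [1 - c for c in subject]
same = cross_correlation(subject, subject)
opposite = cross_correlation(subject, inverse)
```

In this sketch, identical choice vectors give a peak of 1 at lag 0 and exactly opposite vectors give an inverse peak of -1 at lag 0, which corresponds to the central peaks and inverse peaks described above.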

 

The following are graphs of the cross correlation between our model’s choices and those of the actual subjects. Refer to the cases above for the specific conditions for each graph:

http://web.mit.edu/helenb/Public/corrcase1.jpg

http://web.mit.edu/helenb/Public/corrcase2.jpg

http://web.mit.edu/helenb/Public/corrcase3.jpg

http://web.mit.edu/helenb/Public/corrcase4.jpg

http://web.mit.edu/helenb/Public/corrcase5.jpg

http://web.mit.edu/helenb/Public/corrcase6.jpg

http://web.mit.edu/helenb/Public/corrcase7.jpg

http://web.mit.edu/helenb/Public/corrcase8.jpg

http://web.mit.edu/helenb/Public/corrcase9.jpg

http://web.mit.edu/helenb/Public/corrcase10.jpg

 

 

This is our code for our model and data analysis:

http://web.mit.edu/helenb/Public/corr.m

http://web.mit.edu/helenb/Public/realdatamidproj.m