Landscape Exploration and the Influence of Beliefs

Jason Carver and Kevin Matulef

9.29 Midterm Project

Spring 2004, MIT



Abstract

The question of quantifying an individual's willingness to explore the unknown, and to take risks in the hopes of achieving better rewards, is a basic and important one. To address this question, we propose a new class of experiments, based on the notion of a "payoff landscape." We present the results for two of these experiments, one in which subjects are told nothing a priori about the possible payoffs they can achieve, and one in which they are given some extra knowledge about the potential payoffs. We conclude with the suggestion that in such situations, "reinforcement" learning may be too limited to explain human behavior.

Introduction

Recent research in behavioral economics has attempted to explain how people's preferences for different actions might arise from a series of choices distributed over time (Herrnstein and Prelec, 1991). This includes the model of addiction put forward by Herrnstein, Prelec, and Vaughan (1986). In their experiment, subjects were repeatedly given the option of pressing two buttons, button A and button B. Over a short time period, pressing button A yielded a better reward than pressing button B. However, over a long time period, pressing button A caused the reward for both buttons to go down. Thus, the best strategy was to always press button B (until near the end of the experiment).

In fact, in laboratory settings Herrnstein et al. showed that most of the time, subjects adopted the worst strategy, always pressing button A. We hypothesize that this could be the result of two factors: 1) subjects are averse to adopting long-term strategies that have poor short-term rewards, and 2) subjects have certain expectations, or beliefs, about the way the game works and the effects of button pressing.

In order to explore this hypothesis, we sought to design and implement a simple experiment which would shed light on the following questions: how willing are subjects to explore the unknown in the hopes of achieving better rewards, and how do prior beliefs about the potential payoffs influence that willingness?

Discrete Payoff Landscapes

In order to formally study the answers to these questions, we restrict the context in which we study them to one-dimensional "payoff landscapes." A one-dimensional payoff landscape is simply a function f from the real numbers to the real numbers. The function may have peaks, valleys, and so on (in general, it need not be differentiable or even continuous, though many interesting landscapes are).

We consider a discrete version of the landscape in which the payoff function f is from the integers to the integers. Subjects are given a starting point on the landscape, in other words a value x, and the value of the function f(x). Subjects are then allowed to "explore" by querying the value of the function at the point immediately to the left or right of their current point. Whenever they query the value of the function, they accumulate the associated value as a reward.

To illustrate the situation more concretely, consider a single row of boxes, where each box is covering some amount of payment. The subject initially starts out on one of the middle boxes. This box is uncovered, and the payoff underneath is the subject's initial reward. The subject then is allowed to make a limited number of moves. Each move consists of either a step to the left or a step to the right. Upon each move, the box that the subject moves to is uncovered (if it's not uncovered already), and the subject receives the payment underneath. The box remains uncovered throughout the course of subsequent moves. For example, the following represents a possible sequence of movements on a landscape:

Initial Payoff = 5

x x x x x x 5 x x x x x x x

Player moves: right

Total Payoff = 9

x x x x x x 5 4 x x x x x x

Player moves: left

Total Payoff = 14

x x x x x x 5 4 x x x x x x

Player moves: left

Total Payoff = 20

x x x x x 6 5 4 x x x x x x

Player moves: left

Total Payoff = 27

x x x x 7 6 5 4 x x x x x x

Player moves: left

Total Payoff = 33

x x x 6 7 6 5 4 x x x x x x

Player moves: left

Total Payoff = 38

x x 5 6 7 6 5 4 x x x x x x

... etc.

In this sequence of events, the subject began by exploring to the right. Noticing a lower payoff, the subject then doubled back to the left. The player then continued to move to the left until finding the "peak" at 7, and exploring two squares past it. From there, the behavior is unspecified. But it is a reasonable hypothesis that if the above sequence of events were to actually occur in a trial, the subject might choose to spend the remainder of his or her moves oscillating back and forth around the "peak" reward at 7, without ever choosing to uncover any more squares on the board.
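The game rules above can be replayed mechanically. The following Python sketch (the actual experiment used a PHP web interface; this is only an illustration) reproduces the example sequence. The two squares that were never uncovered on the left are filled in with placeholder zeros, since their true values are unknown:

```python
def play(landscape, start, moves):
    """Replay a sequence of 'L'/'R' moves on a discrete payoff landscape.
    Per the game rules, the payoff of a square is received on every visit,
    and a square stays uncovered once revealed."""
    pos = start
    total = landscape[pos]
    totals = [total]
    uncovered = {pos}
    for m in moves:
        pos += 1 if m == 'R' else -1
        uncovered.add(pos)          # the box stays uncovered afterwards
        total += landscape[pos]
        totals.append(total)
    return totals, uncovered

# The landscape fragment from the example above: indices 2..7 hold
# 5 6 7 6 5 4; the 0s stand in for squares that were never uncovered.
example = [0, 0, 5, 6, 7, 6, 5, 4, 0, 0, 0, 0, 0, 0]
totals, uncovered = play(example, 6, "RLLLLL")
print(totals)   # [5, 9, 14, 20, 27, 33, 38] -- the running totals above
```

Note that revisiting an already-uncovered square still pays out, which is why oscillating around a known peak can be an attractive strategy.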

The Influence of Beliefs

Clearly, a player's behavior in this game is dependent upon their beliefs about the structure of the function f. Over time, players in some sense attempt to reconstruct the function f, to predict the values of f on unexplored regions. They must then make decisions about where to go, based on weighing their predictions against the known values that they can receive on parts of the board that they have already uncovered.

The major hypothesis of this paper is that the way in which subjects reconstruct the function f, and hence the way in which subjects decide to move across the payoff landscape, is highly dependent upon the class of functions they believe f to be drawn from. This is in contrast to any simple reinforcement model, which would postulate an "all-purpose" method of reconstruction, based only upon the known values of f, and not any prior beliefs about the structure of f.

Methods

Subjects were divided into two groups, Group A and Group B. Both groups were given the same landscapes, the same starting points, and the same number of moves. They were also given the same set of instructions, with one crucial exception: Group B, unlike Group A, was told the maximum value on the landscape. They were not told where on the landscape the value was achieved, only what the value was. The following represents one of the landscapes that was presented to both groups:

[Fig goes here]

A few aspects of the experiment are worth noting. First, to simplify the experiment, subjects were given a starting point on the left-hand side of the board, so that there were no unexplored squares to the left of the start. This provided a good measure of subjects' "adventurousness," in the form of the maximum distance travelled to the right.

Second, it is important to note that the size of the board was larger than the number of allotted moves. This was to ensure that even the informed subjects, who knew what the maximum value on the board was, were not guaranteed to receive that payoff through exploration. The intention was to see how we could influence subjects' beliefs about the structure of the landscape, without giving them enough information to infer it exactly.

Different landscapes were given to different subjects in both Group A and Group B, with the intention of collecting both training data for building a model, and example data for testing the model. The results which will follow are obtained from data on a single landscape (the one shown above).

The experiment was done using a web interface, programmed in PHP. The experiment can be found at http://bookbuoy.phpwebhosting.com/mit/classes/9.29/project1Game.php.

The subjects were rewarded with a number of Hershey's Chocolate Easter Eggs commensurate with the total point value that they received after 25 moves.

Results

As expected, the subjects in Group B were much more "adventurous" than those in Group A, where adventurousness is quantified by the maximum distance that subjects explored to the right. The subjects in each group can roughly be divided into three categories: 1) those who did not explore far enough to the right to find the "peak" at the 12th square, 2) those who found the "peak" but did not explore much past it, and 3) those who explored far enough past the 12th square to find the next peak. The following table gives the number of people falling into each category for both groups, along with the average exploration distance and the average payoff for each group.

            < 12    13-17    > 17    Avg. distance    St. dev. distance    Avg. payoff
Group A      2        7        1
Group B      0        3        7
Fig 1. Data obtained from the two groups. The first three columns represent the number of subjects whose maximum distance to the right fell within the given range.

As is apparent from the data, the subjects in Group B were more adventurous on average. There was also much more variability in Group B, with some subjects acting just as conservatively as their Group A counterparts, and others exploring as far as they possibly could.

Modeling

A preliminary attempt at modeling the behavior of Group A was implemented in MATLAB.
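As a rough illustration of the kind of model involved, the following Python sketch implements one plausible hill-climbing rule: the simulated subject keeps moving in one direction until it has seen `patience` consecutive payoff decreases, then doubles back toward known higher ground. The patience parameter, the turning rule, and the example landscape are illustrative assumptions, not the fitted model:

```python
def simulate_subject(landscape, start, n_moves, patience=2):
    """Greedy hill-climbing model: keep moving in one direction until
    `patience` consecutive payoff decreases have been seen, then double
    back. (An illustrative assumption, not the authors' fitted model.)"""
    pos = start
    total = landscape[pos]
    best = pos                      # best square found so far
    direction = 1                   # explore to the right first
    decreases = 0
    for _ in range(n_moves):
        nxt = pos + direction
        if nxt < 0 or nxt >= len(landscape):   # hit the edge: turn around
            direction = -direction
            nxt = pos + direction
        if landscape[nxt] < landscape[pos]:
            decreases += 1
        else:
            decreases = 0
        pos = nxt
        total += landscape[pos]
        if landscape[pos] > landscape[best]:
            best = pos
        if decreases >= patience:
            direction = -direction  # double back
            decreases = 0
    return total, best

# A toy landscape with a peak (7) at index 4; start at index 6 (value 5).
toy = [3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1, 0, 1, 2]
total, best = simulate_subject(toy, 6, 10, patience=1)
print(total, best)   # 66 4
```

With patience=1 the simulated subject turns back after a single bad step, finds the local peak, and then oscillates around it rather than exploring further, which is the qualitative behavior hypothesized for the conservative subjects in Group A.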

Discussion and Suggestions for Further Research

The inability of the reinforcement learning model to account for a priori beliefs raises some interesting philosophical questions. There are counterarguments to our claim that reinforcement is insufficient to explain the behavior observed in our experiment. First, one could argue that our experiment was too short, and that if we had run trials on very large boards, with a very large number of moves (say 700 instead of 25), then we might have observed similar behavior in Group A and Group B, because the effects of the a priori beliefs would have been lessened over time. Second, one could argue that reinforcement does explain our results, only reinforcement on a much broader level: namely, that before the experiment took place, the subjects had received earlier reinforcement in their lives which caused them to believe that "maximum" rewards were attainable (and should be attained).

These criticisms have some validity, hence we refrain from making any overly bold claims about our results. Instead, we simply claim that any reasonable model of human behavior ought to be able to predict how humans explore payoff landscapes. The reinforcement model alone appears to lack this ability. Perhaps a more robust model is one in which a set of beliefs about possible landscapes is maintained, and reinforcement occurs over that set.
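One way such a belief-plus-reinforcement model could be sketched is as multiplicative-weights updating over a set of candidate landscapes. The formulation below, including the exponential weighting and the toy hypotheses, is a hypothetical illustration rather than anything we tested:

```python
import math

def update_beliefs(weights, hypotheses, pos, observed):
    """One reinforcement step over a set of candidate landscapes:
    hypotheses whose prediction at `pos` is close to the observed payoff
    are strengthened, others weakened. The exponential penalty on
    prediction error is an illustrative choice."""
    new = [w * math.exp(-abs(h(pos) - observed))
           for w, h in zip(weights, hypotheses)]
    total = sum(new)
    return [w / total for w in new]

# Two toy beliefs: payoffs rise to the right vs. fall to the right.
rising  = lambda x: x
falling = lambda x: 10 - x
weights = [0.5, 0.5]
weights = update_beliefs(weights, [rising, falling], pos=2, observed=8)
# After observing f(2) = 8, nearly all weight shifts to the "falling" belief.
```

Under such a model, exploration decisions would be driven by the weighted predictions of the surviving hypotheses, so prior beliefs about the class of possible landscapes would directly shape behavior, as observed in Group B.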

We believe that the concept of a payoff landscape is a potentially fruitful one. Human behavior on such landscapes is an intriguing area of further research. One can imagine hundreds of different experimental variations. These include, but are not limited to: 1) landscapes with different "slopes" or patterns of reward, 2) landscapes in higher dimensions, and 3) landscapes with probabilistic, rather than deterministic, rewards at each location.

Variation (1) is particularly interesting. Consider a landscape with a high-paying region and a low-paying region, but with no apparent pattern of payoffs within each region. Now juxtapose that with a similar landscape with the same high and low paying regions, but also a very consistent pattern of payoff within each region. If a difference in behavior is observed on these two landscapes, it would give more credence to the theory that beliefs about the structure of the landscape are crucial to a theory of human decision making.

References

Herrnstein, R. J., and D. Prelec, "Melioration: A Theory of Distributed Choice," The Journal of Economic Perspectives, Vol. 5, No. 3 (Summer 1991).

Herrnstein, R. J., D. Prelec, and W. Vaughan Jr., "An Intra-Personal Prisoners' Dilemma," paper presented at the IX Symposium on Quantitative Analysis of Behavior: Behavioral Economics, Harvard University, 1986.