Humans have the remarkable capability to infer the motivations behind other people's actions, likely due to a cognitive skill known in psychophysics as theory of mind. In this paper, we strive to build a computational model that predicts the motivation behind the actions of people in images. To our knowledge, this challenging problem has not yet been extensively explored in computer vision. We present a novel learning-based framework that uses high-level visual recognition to infer why people are performing actions in images. However, the information in an image alone may not be sufficient to solve this task automatically. Since humans can rely on their own experiences to infer motivation, we propose to give computer vision systems access to some of these experiences by using recently developed natural language models to mine knowledge stored in massive amounts of text. While we are still far away from automatically inferring motivation, our results suggest that transferring knowledge from language into vision can help machines understand why a person might be performing an action in an image.
Although you have never seen the people below before, you can likely infer that the man on the left is sitting because he wants to watch television, and the woman on the right is sitting because she wants to see a doctor.
Since humans are able to reliably think about other people's thoughts, we believe machines can infer the motivations behind people's actions too.
However, automatically inferring motivation is a challenging vision problem. The reason people perform actions may be outside of the visible image, either spatially or temporally. To solve this task, we suspect computer vision systems will need access to commonsense knowledge.
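One simple way to illustrate the idea of mining commonsense knowledge from text: given an action recognized in an image, rank candidate motivations by how often the action and motivation co-occur in a text corpus. This is only a toy sketch, not the method from the paper; the inline corpus stands in for the massive text collections a real language model would draw on, and the names (`CORPUS`, `score_motivation`, `candidates`) are illustrative.

```python
# Toy sketch: rank candidate motivations for an observed action by
# counting sentences in a text corpus where the action word and all
# motivation words co-occur. The tiny inline corpus below is a
# stand-in for large-scale text; this is NOT the paper's model.
CORPUS = """
he sat down to watch television after work
she sat in the waiting room to see a doctor
they sat at the table to eat dinner
he sat on the couch to watch television
she sat quietly to read a book
"""

def score_motivation(action, motivation, corpus=CORPUS):
    """Count corpus sentences containing the action and every motivation word."""
    count = 0
    for sentence in corpus.strip().splitlines():
        words = sentence.split()
        if action in words and all(w in words for w in motivation.split()):
            count += 1
    return count

candidates = ["watch television", "see a doctor", "eat dinner"]
ranked = sorted(candidates, key=lambda m: -score_motivation("sat", m))
print(ranked[0])  # prints "watch television" for this toy corpus
```

A real system would replace the raw co-occurrence counts with scores from a language model trained on web-scale text, so that motivations never seen verbatim with an action can still be ranked plausibly.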
Check out our paper to learn more.
We are cleaning up the dataset and expect to release it soon. Please contact us for more information.