Antonio Torralba


Director, MIT Quest for Intelligence (link)
MIT Director, MIT-IBM Watson AI Lab (link)

Computer Science and Artificial Intelligence Laboratory
Dept. of Electrical Engineering and Computer Science
Massachusetts Institute of Technology

Office: 32-D462
Address: 32 Vassar Street, Cambridge, MA 02139
Assistant: Fern Deolivera

My research is in the areas of computer vision, machine learning, and human visual perception. I am interested in building systems that can perceive the world like humans do. Although my work focuses on computer vision, I am also interested in other modalities such as audition and touch. A system able to perceive the world through multiple senses might be able to learn without requiring massive curated datasets. Other interests include understanding neural networks, common-sense reasoning, computational photography, building image databases, and the intersections between visual art and computation, among other topics.

Lab Members

Jonas Wulff

Tianmin Shu

Chuang Gan
IBM Researcher

Adrià Recasens
Grad. student

Ching-Yao Chuang
Grad. student

David Bau
Grad. student

Manel Baradad
Grad. student

Pratyusha Sharma
Grad. student

Shuang Li
Grad. student

Tongzhou Wang
Grad. student

Wei-Chiu Ma
Grad. student

Xavier Puig Fernandez
Grad. student

Yunzhu Li
Grad. student

Past students and postdocs

Hang Zhao (Graduated 2019), Jun-Yan Zhu (Postdoc), Bolei Zhou (Graduated 2018), Carl Vondrick (Graduated 2017), Javier Marin (Postdoc), Yusuf Aytar (Postdoc), Andrew Owens (Graduated 2016), Aditya Khosla (Graduated 2016), Agata Lapedriza (Visiting professor, UOC), Joseph J. Lim (Graduated 2015), Lluis Castrejon (Visiting student, 2015), Hamed Pirsiavash (Postdoc), Zoya Gavrilov (Grad. student), Josep Marc Mingot Hidalgo (Visiting student), Tomasz Malisiewicz (Postdoc), Jianxiong Xiao (Graduated 2013), Dolores Blanco Almazan (Visiting student, 2012), Biliana Kaneva (Graduated 2011), Jenny Yuen (Graduated 2011), Tilke Judd (Graduated 2011), Myung "Jin" Choi (Graduated 2011), James Hays (Postdoc), Hector J. Bernal (Visiting student), Gunhee Kim (Visiting student), Bryan C. Russell (Graduated 2008).


VirtualHome: VirtualHome is a platform for simulating complex household activities via programs. A key aspect of VirtualHome is that it allows complex interactions with the environment, such as picking up objects, switching appliances on and off, opening appliances, etc. The simulator can easily be called through a Python API: write the activity as a simple sequence of instructions, which then get rendered in VirtualHome. You can choose between different agents and environments, as well as modify environments on the fly. You can also stream different kinds of ground truth, such as time-stamped actions, instance/semantic segmentation, optical flow, and depth. Check out more details of the environment and platform on the project page.
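To make the "activity as a sequence of instructions" idea concrete, here is a minimal sketch of the instruction format used in VirtualHome programs (steps written as [ACTION] <object> (id)). The parser below is a hypothetical illustration, not the actual VirtualHome API.

```python
import re

# VirtualHome-style program step: [ACTION] <object> (id).
# The object/id part is optional (some actions, e.g. [Stand], take none).
STEP_RE = re.compile(r"\[(\w+)\](?:\s*<(\w+)>\s*\((\d+)\))?")

def parse_program(steps):
    """Turn instruction strings into (action, object, id) tuples.

    This is an illustrative sketch; the real simulator consumes the
    instruction strings directly and renders them in the environment.
    """
    parsed = []
    for step in steps:
        m = STEP_RE.match(step)
        if not m:
            raise ValueError(f"bad instruction: {step}")
        action, obj, obj_id = m.groups()
        parsed.append((action, obj, int(obj_id) if obj_id else None))
    return parsed

# A simple activity, written as a sequence of instructions:
program = [
    "[Walk] <fridge> (1)",
    "[Open] <fridge> (1)",
    "[Grab] <milk> (2)",
]
print(parse_program(program))
```

In the real platform, a program like this is handed to the simulator, which executes it with the chosen agent in the chosen environment.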

GAN dissection: Visualizing and Understanding Generative Adversarial Networks.

MIT Quest for Intelligence: I have been named inaugural director of the MIT Quest for Intelligence. The Quest is a campus-wide initiative to discover the foundations of intelligence and to drive the development of technological tools that can positively influence virtually every aspect of society.

Network dissection: Quantifying Interpretability of Deep Visual Representations, CVPR 2017 paper, and Code release. Also related to: Object Detectors Emerge in Deep Scene CNNs.

Auditory scene analysis: using vision to teach audition. NIPS paper by Yusuf and Carl. Check also Andrew's ECCV paper on using audition to teach vision.

Multimodal scene recognition. The data for this work has thousands of linedrawings and textual descriptions of scenes, done by AMT workers. The dataset is organized with the same categories as the Places database.

Aligning books and movies. Learning to see and read by watching movies and reading books. Check also the MovieQA dataset: MovieQA: Story Understanding Benchmark.

Gaze following demo, and dataset. It follows the gaze of the people inside a picture or video and predicts what they are looking at. In this video, frames are first processed independently and then the output is smoothed temporally.

Places database and scene recognition demo. More details about the demo appear in: "Learning Deep Features for Scene Recognition using Places Database," B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. NIPS 2014 (pdf). The Places database has two releases: Places release 1 contains 205 scene categories and 2.5 million images; Places release 2 contains 400 scene categories and 10 million images. Pre-trained models available here.

Cool news

Late show with Stephen Colbert on the work by Carl and Hamed, Anticipating Visual Representations from Unlabeled Video. CVPR 2016.

The Marilyn Monroe/Albert Einstein hybrid image by Aude Oliva on BBC.

German TV science show on accidental cameras. Details about accidental cameras and some of our videos are available here.


ADE20K dataset. 22,210 fully annotated images with objects, many of them also annotated with parts. Check the scene parsing challenge website.

The Places Audio Caption Corpus. The Places Audio Caption 400K Corpus contains approximately 400,000 spoken captions for natural images drawn from the Places 205 image dataset. It was collected to investigate multimodal learning schemes for unsupervised co-discovery of speech patterns and visual objects.

Places database. The database contains more than 10 million images comprising 400+ scene categories. The dataset features 5000 to 30,000 training images per class.

360-SUN Database. A database of 360 degrees panoramas organized along the SUN categories. Xiao et al, CVPR 2012. (pdf)

CMPlaces. CMPlaces is designed to train and evaluate cross-modal scene recognition models. It covers five different modalities: natural images, sketches, clip-art, text descriptions, and spatial text images. (pdf)

Out of context objects. The database contains 218 fully annotated images with at least one object out-of-context. Can you detect the out of context object? Project page

3D IKEA dataset. Dataset of IKEA 3D models and aligned images. J. Lim, H. Pirsiavash, and A. Torralba. ICCV 2013.

80 Million tiny images: explore a dense sampling of the visual world. A portion of this dataset was used to create the CIFAR datasets. By the way, since the web page went online, we have been collecting annotations for a portion of the dataset. We haven't used them for anything yet, but you can download them here and here. The annotations contain all the users' votes, as {1, 0, -1} corresponding to {correct, undefined, incorrect}. A very simple visualization of the annotations is available here.
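As a sketch of how the {1, 0, -1} per-user votes might be aggregated, here is a simple majority tally. The (image_id, vote) pair representation is an assumption for illustration, not the actual download format.

```python
from collections import defaultdict

def tally(votes):
    """Aggregate per-user votes into a per-image verdict.

    votes: iterable of (image_id, vote) pairs, vote in {1, 0, -1}
    meaning {correct, undefined, incorrect}. Summing the votes gives
    a simple majority: positive -> correct, negative -> incorrect,
    zero -> undefined.
    """
    totals = defaultdict(int)
    for image_id, vote in votes:
        assert vote in (1, 0, -1), f"unexpected vote: {vote}"
        totals[image_id] += vote
    return {
        img: ("correct" if s > 0 else "incorrect" if s < 0 else "undefined")
        for img, s in totals.items()
    }

print(tally([("img1", 1), ("img1", 1), ("img1", -1), ("img2", 0)]))
```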

Indoor Scene Recognition Database: 67 indoor scene categories. A. Quattoni and A. Torralba. CVPR 2009.