MASSACHUSETTS INSTITUTE OF TECHNOLOGY
OPERATIONS RESEARCH CENTER
FALL 2010 SEMINAR SERIES

DATE: October 21st
LOCATION: E62-650
TIME: 4:15pm
Reception immediately following in the same room

SPEAKER:
Foster Provost

TITLE
Get Another Label? Improving Data Quality and Machine Learning using Multiple, Noisy Labelers

ABSTRACT
I will discuss the repeated acquisition of "labels" for data items when the labeling is imperfect. Labels are values provided by humans for specified variables on data items, such as "PG-13" for "Adult Content Rating on this Web Page." With the increasing popularity of micro-outsourcing systems, such as Amazon's Mechanical Turk, it is often possible to obtain less-than-expert labeling at low cost. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. We present repeated-labeling strategies of increasing complexity, and show several main results: (i) Repeated labeling can improve label quality and model quality (per unit data-acquisition cost), but not always. (ii) Simple strategies can give considerable advantage, and carefully selecting the points to label does even better (we present and evaluate several techniques). (iii) Labeler (worker) quality can be estimated on the fly (e.g., to determine compensation, control quality, or eliminate Mechanical Turk spammers) and systematic biases can be corrected. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data modelers should have in their repertoire. I illustrate the results with a real-life application from on-line advertising: using Mechanical Turk to help classify web pages as being objectionable to advertisers.
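
As a rough illustration of why repeated labeling helps "but not always," the sketch below computes how often a simple majority vote over several noisy labels is correct. It is not taken from the talk: it assumes binary labels, independent labelers who all share the same accuracy p, and an odd number of labels per item, and the function name majority_vote_accuracy is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n labels is correct.

    Assumptions (for illustration only): binary labels, n independent
    labelers, each correct with the same probability p, and n odd so
    that ties cannot occur.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    # Sum the binomial probabilities of getting more than n/2 correct labels.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    # Compare a single label with majority votes over 3, 5, and 11 labels.
    for p in (0.4, 0.5, 0.7):
        row = [round(majority_vote_accuracy(p, n), 3) for n in (1, 3, 5, 11)]
        print(f"p={p}: {row}")
```

Under these assumptions, extra labels raise the quality of the voted label when p is above 0.5, leave it unchanged at exactly 0.5, and lower it when p is below 0.5, which is one simple way to see why the benefit of repeated labeling depends on labeler quality.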

This is joint work with Panos Ipeirotis, Victor S. Sheng, and Jing Wang. An earlier version of the work received the Best Paper Award Runner-up at the ACM SIGKDD Conference.

