Research Statement

Yuyang (Bernie) Wang
Email: ywang02@mit.edu

 

My primary research interests lie in machine learning and statistics, in both their practical and theoretical aspects. The question I have been seeking to answer is how machines or artificial agents can learn from past experience or data. More specifically, how can we find suitable computational approaches to build (or learn) statistical models from data, and then use those models to perform inference?

My research philosophy and methodology belong to the Bayesian school of statistics. That is, once a statistical model is specified, a prior probability distribution over the model parameters represents our initial opinions about what the true relationship might be. After the data are observed, our revised opinions are captured by a posterior distribution over the parameters. My favorite topics involve learning with probabilistic graphical models, where latent (hidden) variables are used to explain observed associations in the data that are due to hidden common causes. Much of the excitement lies in Bayesian inference for complex hierarchical models, whether by variational methods or by sampling-based techniques. I have focused on multi-task learning using nonparametric Bayesian approaches, especially Gaussian processes and Dirichlet processes. I am also interested in open theoretical questions in Bayesian nonparametrics, for example, the consistency of Gaussian process regression.
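The prior-to-posterior update described above is Bayes' rule. In the notation assumed here (not taken from the original text), $\theta$ denotes the model parameters and $\mathcal{D}$ the observed data:

```latex
p(\theta \mid \mathcal{D})
  = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}
         {\int p(\mathcal{D} \mid \theta')\, p(\theta')\, d\theta'}
```

The integral in the denominator is rarely available in closed form for hierarchical models, which is why approximate inference by variational methods or sampling is needed.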

Broadly, my doctoral work aims to build an automated classification system for astrophysics time series data using state-of-the-art machine learning algorithms. This work is a collaboration with the Initiative in Innovative Computing at Harvard University, in conjunction with the Harvard-Smithsonian Center for Astrophysics. The work has three main parts.

1. A major effort in astronomy research is devoted to sky surveys, in which the brightness of stars or other celestial objects is measured over a period of time. Classification and other analyses of stars lead to insights into the nature of our universe. We are concerned with periodic variable stars, that is, stars whose brightness varies as a periodic function of time. The intuition underlying the classification process is that stars in different categories (namely Cepheids (Ceph), eclipsing binaries (EB), and RR Lyrae (RRL)) have different typical shift-invariant shapes, and astrophysicists classify stars based on these shapes. Following this intuition, we developed a novel Bayesian nonparametric multi-task model (which we call the GMT model) that captures a mixture of Gaussian processes, where each task is the sum of a group-specific function and a component capturing individual variation, and each task is additionally phase shifted. We also developed an efficient EM algorithm to learn the parameters of the model. As a special case, we obtain a Gaussian mixture model and EM algorithm for phase-shifted periodic time series. Most recently, we extended the model by placing a Dirichlet process prior over the mixture proportions, which leads to an infinite mixture model (the DP-GMT model) capable of automatic model selection. A variational Bayesian approach is used to perform inference in this model.
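One way to write down the generative structure described in part 1 is sketched below. All symbols are assumptions introduced for illustration: $z_j$ is the group of task (star) $j$, $\alpha$ the mixture proportions, $f_k$ the group-specific function, $g_j$ the individual variation, $\delta_j$ the phase shift, and $K_0$, $K$ periodic covariance functions. Applying the shift to the sum of both components is one reading of the description, not a confirmed detail of the model.

```latex
\begin{aligned}
z_j &\sim \mathrm{Categorical}(\alpha), \\
f_k &\sim \mathcal{GP}(0, K_0), \qquad k = 1, \dots, M, \\
g_j &\sim \mathcal{GP}(0, K), \\
y_j(t) &= f_{z_j}(t - \delta_j) + g_j(t - \delta_j) + \epsilon_j(t),
  \qquad \epsilon_j(t) \sim \mathcal{N}(0, \sigma^2).
\end{aligned}
```

Under this reading, the infinite mixture variant replaces the fixed number of groups $M$ and the finite proportions $\alpha$ with proportions drawn from a Dirichlet process.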

2. Another challenge in classifying astrophysics time series is to determine whether a light curve comes from a periodic variable star and, if so, to estimate its period. The problem of finding the period of a periodic, non-uniformly sampled time series has been studied extensively in the statistics and astronomy literature since the early twentieth century. However, state-of-the-art performance still requires a human to verify that the star is in fact periodic and that the period finder has returned the true period. In a working paper, I introduced this problem to the machine learning literature and treated it as a model selection problem within the Gaussian process regression framework. Several numerical methods (e.g., low-rank updates of the Cholesky factor) are proposed to reduce the computational cost. We also plan to address statistical properties of our estimator, such as consistency and asymptotic efficiency.
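As a rough illustration of period finding as model selection, the sketch below scores each candidate period by the log marginal likelihood of a Gaussian process with a standard periodic kernel and keeps the best one. The kernel form, the fixed hyperparameters, the candidate grid, and the synthetic light curve are all assumptions made for this example. It recomputes a full Cholesky factorization for every candidate and does not implement the low-rank updates mentioned above.

```python
import numpy as np

def periodic_kernel(t1, t2, period, length_scale=1.0, signal_var=1.0):
    """Standard periodic (exp-sine-squared) covariance between two time vectors."""
    d = t1[:, None] - t2[None, :]
    return signal_var * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length_scale ** 2)

def log_marginal_likelihood(t, y, period, noise_var=0.01):
    """GP log evidence log p(y | t, period) for a zero-mean GP with Gaussian noise."""
    n = len(t)
    K = periodic_kernel(t, t, period) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K^{-1} y
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2.0 * np.pi)

def select_period(t, y, candidates, noise_var=0.01):
    """Return the candidate period with the highest log evidence, plus all scores."""
    scores = np.array([log_marginal_likelihood(t, y, p, noise_var) for p in candidates])
    return candidates[int(np.argmax(scores))], scores

# Synthetic, non-uniformly sampled light curve with a hypothetical true period of 2.5.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 30.0, size=60))
y = np.sin(2.0 * np.pi * t / 2.5) + 0.1 * rng.standard_normal(60)

candidates = np.linspace(1.0, 4.0, 61)  # grid step of 0.05
best_period, scores = select_period(t, y, candidates)
print("selected period:", best_period)
```

The grid here deliberately excludes integer multiples of the true period; a practical period finder would need to handle such harmonics and would also optimize the remaining kernel hyperparameters.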

3. The final part to be investigated is abstaining classification and new class discovery. Published star catalogs in astronomy are required to have high fidelity, and it is preferable to leave some stars to be classified by domain experts than to publish a wrong categorization. Another issue that arises in astronomy is class discovery. Typical classification approaches assume that each instance in the test set belongs to one of the predefined classes; however, new data collected from a sky survey might contain examples that originate from classes not yet discovered by astronomers. These two tasks therefore require an intelligent classifier that knows when to abstain (leaving the decision to the experts) and when to declare that a new class has been found.
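Since this part is still to be investigated, the following is only a generic illustration of the two decisions, not the method proposed here. It uses a simple confidence threshold for abstention (a standard reject-option rule) and a threshold on a density score for novelty. The function name `decide`, the sentinel labels, the thresholds, and the example numbers are all hypothetical.

```python
import numpy as np

ABSTAIN = -1    # hypothetical sentinel: leave the star to domain experts
NEW_CLASS = -2  # hypothetical sentinel: candidate member of an undiscovered class

def decide(class_probs, log_density, confidence_threshold=0.9, novelty_threshold=-10.0):
    """Assign a class label, abstain, or flag a possible new class for each instance.

    class_probs: (n, k) posterior probabilities over the k known classes.
    log_density: (n,) log density of each instance under the known-class model.
    """
    class_probs = np.asarray(class_probs, dtype=float)
    log_density = np.asarray(log_density, dtype=float)
    labels = class_probs.argmax(axis=1)
    # Abstain when the most probable known class is not confident enough.
    labels[class_probs.max(axis=1) < confidence_threshold] = ABSTAIN
    # Flag instances that look unlike every known class; this overrides the rest.
    labels[log_density < novelty_threshold] = NEW_CLASS
    return labels

# Three made-up instances: confident, ambiguous, and unlike any known class.
probs = np.array([[0.97, 0.02, 0.01],
                  [0.40, 0.35, 0.25],
                  [0.05, 0.90, 0.05]])
log_dens = np.array([-2.0, -3.0, -25.0])
labels = decide(probs, log_dens)
print(labels)
```

Raising `confidence_threshold` trades catalog coverage for fidelity: more stars are deferred to experts, but fewer wrong labels are published.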