Manolis Kellis - Ph.D. Thesis

Biology foundations
Chapter 1: Genome Correspondence
Chapter 2: Gene Identification
Chapter 3: Regulatory Motif Discovery
Chapter 4: Regulatory Motif Function
Chapter 5: Combinatorial Control
Chapter 6: Genome Evolution
Conclusion: The road ahead
Appendix: Statistical Formulae
Thesis (100-page pdf)

Recipient of the MIT Sprowls Award for the best thesis in Computer Science

Understanding the biological signals encoded in a genome is a key challenge
of computational biology. These signals include protein-coding genes and the
regulatory motifs used to control gene expression. They are encoded in the
four-nucleotide alphabet of DNA and can be hidden amidst millions of
non-functional nucleotides. Over evolutionary time, selective pressure
preserves the sequence of functional elements while mutations that occur in
non-functional elements lead to more rapid sequence divergence. Thus,
comparative genome analysis of related species should provide a general
approach for identifying functional elements, by virtue of their stronger
conservation.
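
The reasoning above, that functional elements diverge more slowly than
non-functional sequence, can be illustrated with a toy computation. The sketch
below is not the method developed in the thesis; it is a minimal illustration
that assumes the sequences are already aligned, scores each alignment column
by the fraction of species sharing the most common nucleotide, and reports
windows that are fully conserved. The function names, the four toy sequences,
the window width, and the threshold are all hypothetical choices made for this
example.

```python
from typing import List


def column_conservation(alignment: List[str]) -> List[float]:
    """Fraction of sequences sharing the most common nucleotide per column.

    Gap characters ('-') never count as agreement, so a gapped column scores lower.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for column in zip(*alignment):
        bases = [b for b in column if b != "-"]
        if not bases:
            scores.append(0.0)
            continue
        top_count = max(bases.count(b) for b in set(bases))
        scores.append(top_count / len(column))
    return scores


def conserved_windows(scores: List[float], width: int, threshold: float) -> List[int]:
    """Start indices of windows whose mean conservation is at least `threshold`."""
    return [
        start
        for start in range(len(scores) - width + 1)
        if sum(scores[start:start + width]) / width >= threshold
    ]


# Hypothetical aligned sequences from four species (illustrative data only).
alignment = [
    "ACGTACGTAA",
    "ACGTTCGAAC",
    "ACGTGCTTAG",
    "ACGTCCATAT",
]

scores = column_conservation(alignment)
windows = conserved_windows(scores, width=4, threshold=1.0)
```

In this toy alignment only the first four columns are identical in all four
sequences, so `windows` contains the single start index 0. Real comparative
analyses must also account for alignment quality, evolutionary distance
between the species, and background mutation rates, none of which this sketch
models.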

In this thesis we address key issues in the comparative analysis of multiple
species. We present novel computational methods in four areas: (1) the
automatic determination of the correspondence of genomic regions across
multiple species; (2) the identification and validation of protein-coding
genes; (3) the systematic de novo identification of regulatory motifs; and
(4) the determination of combinatorial interactions between regulatory motifs.

We applied these methods to the comparative analysis of four yeast genomes,
including the best-studied eukaryote, Saccharomyces cerevisiae, or
baker's yeast. The gene analysis yielded a major revision to the yeast gene
catalog affecting 15% of all genes and reducing the total count by 500
genes. The motif analysis automatically identified 72 genome-wide elements,
including most known regulatory motifs and numerous novel motifs. We
inferred a putative function for most of these motifs and shed light on
their combinatorial interactions. Finally, we defined regions and mechanisms
of rapid evolution, with important biological implications. 

Our results demonstrate the central role of computational analyses in modern
biology. The methods presented in this thesis are general and may present a
new paradigm for understanding the genome of any species, including the

Manolis Kellis