Genes, Regulation, Evolution

Manolis (Kellis) Kamvysselis

Submitted to the Department of Electrical Engineering and Computer Science

on May 23, 2003 in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy in Computer Science

Understanding the biological signals encoded in a genome is a key challenge of computational biology. These signals are encoded in the four-nucleotide alphabet of DNA and are responsible for all molecular processes in the cell. In particular, the genome contains the blueprint of all protein-coding genes and the regulatory motifs used to coordinate the expression of these genes. Comparative genome analysis of related species provides a general approach for identifying these functional elements, by virtue of their stronger conservation across evolutionary time.

In this thesis we address key issues in the comparative analysis of multiple species. We present novel computational methods in four areas: (1) the automatic comparative annotation of multiple species and the determination of orthologous genes and intergenic regions; (2) the validation of computationally predicted protein-coding genes; (3) the systematic de-novo identification of regulatory motifs; and (4) the determination of combinatorial interactions between regulatory motifs.

We applied these methods to the comparative analysis of four yeast genomes, including the best-studied eukaryote, Saccharomyces cerevisiae or baker’s yeast. Our results show that nearly a tenth of currently annotated yeast genes are not real, and we have refined the structure of hundreds of genes. Additionally, we have automatically discovered a dictionary of regulatory motifs without any previous biological knowledge. These include most previously known regulatory motifs, and a number of novel motifs. We have automatically assigned candidate functions to the majority of motifs discovered, and defined biologically meaningful combinatorial interactions between them. Finally, we defined the regions and mechanisms of rapid evolution, with important biological implications.

Our results demonstrate the central role of computational tools in modern biology. The analyses presented in this thesis have revealed biological findings that could not have been discovered by traditional genetic methods, regardless of the time or effort spent. The methods presented are general and may present a new paradigm for understanding the genome of any single species. They are currently being applied to a kingdom-wide exploration of fungal genomes, and the comparative analysis of the human genome with that of the mouse and other mammals.

Thesis Co-Supervisor: Eric Lander, professor of Biology

Thesis Co-Supervisor: Bonnie Berger, professor of Applied Mathematics

TABLE OF CONTENTS

OVERVIEW

Biological Signals

Contributions of this thesis

BACKGROUND

0.1. Molecular biology and the study of life

0.2. Gene regulation and the dynamic cell

0.3. Evolutionary change and comparative genomics

0.4. Sequence alignment and phylogenetic trees

0.5. Model organisms and yeast genetics

0.6. Genome sequencing and assembly

CHAPTER 1: GENOME CORRESPONDENCE

1.1. Introduction

1.2. Establishing gene correspondence

1.3. Overview of the algorithm

1.4. Automatic annotation and graph construction

1.5. Initial pruning of sub-optimal matches

1.6. Blocks of conserved synteny

1.7. Best Unambiguous Subsets

1.8. Performance of the algorithm

1.9. Conclusion

CHAPTER 2: GENE IDENTIFICATION

2.1. Introduction

2.2. Different conservation of genes and intergenic regions

2.3. Reading Frame Conservation Test

2.4. Results: Hundreds of previously annotated genes are not real

2.5. Refining Gene Structure

2.6. Analysis of small ORFs

2.7. Conclusion: Revised yeast gene catalog

CHAPTER 3: REGULATORY MOTIF DISCOVERY

3.1. Introduction

3.2. Regulatory motifs

3.3. Extracting signal from noise

3.4. Conservation properties of known regulatory motifs

3.5. Genome-wide motif discovery

3.7. Results and comparison to known motifs

3.8. Conclusion

CHAPTER 4: REGULATORY MOTIF FUNCTION

4.1. Introduction

4.2. Constructing functionally-related gene sets

4.3. Assigning a function to the genome-wide motifs

4.4. Discovering additional motifs based on gene sets

4.7. Conclusion

CHAPTER 5: COMBINATORIAL REGULATION

5.1. Introduction

5.2. Motifs are shared, reused across functional categories

5.3. Changing specificity of motif combinations

5.4. Genome-wide motif co-occurrence map

5.5. Results

5.6. Conclusion

CHAPTER 6: EVOLUTIONARY CHANGE

6.1. Introduction

6.2. Protein family expansions localize at the telomeres

6.3. Chromosomal rearrangements mediated by specific sequences

6.4. Small number of novel genes separate the species

6.5. Slow evolution suggests novel gene function

6.6. Evidence and mechanisms of rapid protein change

6.7. Conclusion

CONCLUSION

C.1. Summary

C.2. Extracting signal from noise

C.4. The road ahead

REFERENCES

APPENDIX

ACKNOWLEDGEMENTS

I am indebted to Eric Lander, Bonnie Berger and Bruce Birren for their constant help, advice, support, and mentorship in all aspects of my thesis and graduate career. Many thanks to my colleague Nick Patterson, whose help and advice contributed to chapters 3 and 4, to David Gifford and Gerry Sussman for their advice, and to my friends Serafim Batzoglou, Sarah Calvo, James Galagan and Julia Zeitlinger for invaluable advice and support.

I would like to acknowledge the contribution of Matt Endrizzi and the staff of the Whitehead/MIT Center for Genome Research Sequencing Center, who generated the shotgun sequence from the three yeast species; David Botstein, Michael Cherry, Kara Dolinski, Diana Fisk, Shuai Weng and other members of the Saccharomyces Genome Database staff for assistance and discussions, and for making the data available to the community through SGD; Ed Louis and Ian Roberts who provided the yeast strains; Tony Lee, Nicola Rinaldi, Rick Young and the Young Lab for sharing data about chromatin immunoprecipitation experiments and for discussions; Michael Eisen and Audrey Gasch for sharing information about gene expression clusters and for discussions.

Many thanks to Gerry Fink, Martin Kupiec, Sue Lindquist, Andrew Murray and Heather True-Krobb for discussions and understanding of yeast biology. Many thanks to Jon Butler, Gus Cervini, Ken Dewar, Leslie Gaffney, David Jaffe, Joseph Lehar, Li Jun Ma, Abigail Melia, Chad Nusbaum and members of the WICGR for help and discussions.

I owe my gratitude to my parents John and Anna Kamvysselis, and to my siblings Peter and Maria, for their love and constant support.

OVERVIEW

Biological Signals

Understanding the biological signals encoded in a genome is a key challenge of modern biology. These signals are encoded in the four-nucleotide alphabet of DNA and are responsible for all molecular processes in the cell. In particular, the genome contains the blueprint of all protein-coding genes and the control signals used to coordinate the expression of these genes. The well-being of any cell relies on the successful recognition of these signals, and a large number of biological mechanisms have evolved towards this goal. Specific protein complexes are responsible for the copying of a gene segment from DNA to messenger RNA (transcription) and for its eventual translation into protein following the genetic code to assign an amino acid to every tri-nucleotide codon. A specific class of proteins called transcription factors help recruit the transcription machinery to a target gene by binding their specific DNA signals (regulatory motifs) in response to environmental conditions. An abundance of information within the cell guides these processes, involving protein-protein and protein-DNA interactions between a multitude of players, the state of DNA coiling, and other mechanisms that are still not well-understood.

The computational identification of genes, however, can rely only on the primary DNA sequence of the organism. Current programs use properties of the protein-coding potential of DNA segments that are unseen by the transcription machinery. In particular, since genes always start with an ATG (start codon) and end with TAG, TGA, or TAA (one of three stop codons), programs exist that specifically look for these stretches between a start and a stop codon, called ORFs (Open Reading Frames). The basic approach is to identify ORFs that are too long to have likely occurred by chance. Since stop codons occur at a frequency of 3 in 64 in random sequence, ORFs of 60 or even 150 amino acids will occur frequently by chance, but longer ORFs of 300 or thousands of amino acids are virtually always the result of biological selective pressure. Hence, simple computational programs can easily recognize long genes, but many small genes will be indistinguishable from spurious ORFs arising by chance. This is evidenced by the considerable debate over the number of genes in yeast¹⁻⁵, with proposed counts ranging from 4800 to 6400 genes. The situation is worse for organisms with large, complex genomes, such as mammals, where estimated gene counts have ranged from 30,000 to 120,000 genes.
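As an illustration of the length argument above, the following Python sketch computes the chance that a run of random codons contains no stop codon, and scans a sequence for ATG-to-stop stretches. It assumes uniform, independent nucleotides and considers the forward strand only. The function names `prob_no_stop` and `find_orfs` are hypothetical and do not correspond to any of the gene-finding programs referred to here.

```python
# Illustrative sketch only: assumes uniform, independent nucleotides
# and scans the forward strand. Names are hypothetical.
STOP_CODONS = {"TAG", "TGA", "TAA"}


def prob_no_stop(n_codons: int) -> float:
    """Probability that n_codons random codons contain no stop codon.

    Each random codon is a stop with probability 3/64, so it is
    a non-stop with probability 61/64.
    """
    return (61 / 64) ** n_codons


def find_orfs(seq: str, min_codons: int = 0) -> list[tuple[int, int]]:
    """Return (start, end) spans of ATG...stop stretches.

    `end` is exclusive and includes the stop codon. An ORF is kept only
    if it has at least `min_codons` codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for n in (60, 150, 300):
        print(n, prob_no_stop(n))
    print(find_orfs("CCATGAAATAGCC"))
```

Under these assumptions the probability of an uninterrupted stretch falls off geometrically with its length, which is why long ORFs are unlikely to arise by chance while short ones are common.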

The direct identification of the repertoire of regulatory motifs in a genome is even more challenging. Regulatory motifs are short (typically 6-8 nucleotides), and do not obey the simple rules of protein-coding genes. In any single locus, nothing distinguishes these signals from random nucleotides. Traditionally, their discovery relied on deletion studies of consecutive DNA segments until regulation was disrupted and the control region was identified⁶. With the sequence of multiple genes in the same pathway at hand, it became possible to search for the repetition of these signals in genes controlled by the same transcription factor. Computational methods have been developed to search for enriched sequence motifs in predefined sets of genes (for example, using expectation-maximization⁷ or Gibbs sampling⁸, reviewed in ⁹). As microarray analysis provided genome-wide levels of gene expression under various experimental conditions, computational methods of gene clustering have resulted in hundreds of such sets of genes. Various computational methods have been used to mine these sets for regulatory motifs, and dozens of candidate motifs have resulted from each search. The vast majority of these candidate motifs are due to noise, however, and only a total of about 50 real motifs have currently been discovered.

The current methods of motif identification suffer from a number of limitations. (a) First and foremost, the weak signal of small motifs is hidden in the noise of relatively large intergenic regions. This inherently low signal-to-noise ratio prevents even the best programs from recognizing true motifs in the input data. (b) Additionally, the sets of genes searched, and hence the motifs discovered, are limited by our current biological knowledge of co-regulated sets of genes. The current knowledge is based on the experimental conditions reproduced in the lab, which are likely to be a small fraction of the vast array of environmental responses yeast uses to survive in its natural habitat. (c) Finally, an emerging view of gene regulation has put in question the approaches that search for a single motif responsible for a pathway or environmental response. Pathways are not regulated as isolated components in the cell. Genes and transcription factors have multiple functions and are used in multiple pathways and environmental responses. More importantly, transcription factors do not act in isolation, and protein-protein interactions between factors are as important as protein-DNA interactions between each individual factor and its target genes. Hence, individual gene sets will be enriched in multiple motifs, and individual motifs will be enriched in multiple gene sets. A comprehensive understanding of regulatory motifs requires a novel, more powerful approach.

Comparative genome analysis of related species should provide such a general approach for identifying functional elements without prior knowledge of function. Evolution relentlessly tinkers with genome sequence and tests the results by natural selection. Mutations in non-functional nucleotides are tolerated and accumulate over evolutionary time. However, mutations in functional nucleotides are deleterious to the organism that carries them, and become sparse or extinct. Hence, functional elements should stand out by virtue of having a greater degree of conservation across the genomes of related species. Recent studies have demonstrated the potential power of comparative genomic analysis. Cross-species conservation has previously been used to identify putative genes or regulatory elements in small genomic regions¹⁰⁻¹³. Light sampling of whole-genome sequence has been used as a way to improve genome annotation⁴,¹⁴. Complete bacterial genomes have been compared to identify pathogenic and other genes¹⁵⁻¹⁸. Genome-wide comparison has been used to estimate the proportion of the mammalian genome under selection¹⁹.
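The idea that functional elements stand out by their conservation can be made concrete with a minimal Python sketch. It assumes a multiple alignment is already given as equal-length strings with `-` marking gaps, and it simply flags columns that are identical in every species. The helper names `conserved_columns` and `conservation_rate` are hypothetical, and this column-counting measure is only an illustration, not the statistical measure used in later chapters.

```python
# Illustrative sketch only: a naive per-column conservation measure
# over a pre-computed multiple alignment. Names are hypothetical.


def conserved_columns(alignment: list[str]) -> list[bool]:
    """Flag alignment columns where every species has the same nucleotide.

    Columns containing a gap ('-') in any species are not counted
    as conserved.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    return [
        len(set(column)) == 1 and column[0] != "-"
        for column in zip(*alignment)
    ]


def conservation_rate(alignment: list[str], start: int, end: int) -> float:
    """Fraction of conserved columns in the half-open window [start, end)."""
    flags = conserved_columns(alignment)[start:end]
    return sum(flags) / len(flags)


if __name__ == "__main__":
    # Toy four-species alignment (made-up sequences).
    toy = ["ACGTAC", "ACGTTC", "ACG-AC", "ACGTAC"]
    print(conserved_columns(toy))
    print(conservation_rate(toy, 0, 3), conservation_rate(toy, 3, 6))
```

Comparing the rate inside a candidate element with the rate in flanking sequence gives a crude view of whether the element is more conserved than its surroundings.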

Contributions of this thesis

The goal of this thesis is to develop computational comparative methods to understand genomes. We develop and apply general approaches for the systematic analysis of protein-coding and regulatory elements by means of whole-genome comparisons with multiple related species. We apply these methods to Saccharomyces cerevisiae, commonly known as baker’s yeast. S. cerevisiae is a model organism for which many genetic tools and techniques have been developed, leading to a wealth of experimental information. This knowledge has allowed us to validate our biological predictions and assess the power of the methods developed. We generated high-quality draft genome sequences from three Saccharomyces species of yeast related to S. cerevisiae. These data provide us with invaluable comparative information unmatched by previous sequencing efforts. Starting with the raw nucleotide sequence assemblies of the three newly sequenced species and the current sequence and annotation of S. cerevisiae, we set out to discover functional elements in the yeast genome based on the comparison of the four species.

We first present methods for the automatic comparative annotation of the four species and the determination of orthologous genes and intergenic regions (Chapter 1). The algorithms enabled the automatic identification of orthologs for more than 90% of genes despite the large number of duplicated genes in the yeast genome.

Given the gene correspondence, we construct multiple alignments and present comparative methods for gene identification (Chapter 2). These rely on the different patterns of nucleotide change observed in the alignments of protein-coding regions as compared to non-coding regions, specifically the pressure to conserve the reading frame of proteins. The method has high specificity and sensitivity, and enabled us to revisit the current gene catalog of S. cerevisiae with important biological implications.
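To illustrate what pressure to conserve the reading frame means in an alignment, the following Python sketch measures how much of a gapped pairwise alignment stays in the same frame. It is a simplified illustration under our own assumptions and should not be read as the test defined in Chapter 2: the function name `frame_conservation` is hypothetical, the alignment is taken as given, and the score is simply the largest fraction of aligned nucleotide pairs sharing one of the three possible frame offsets.

```python
# Illustrative sketch only: a simplified frame-conservation score for
# a gapped pairwise alignment. Not the test defined in Chapter 2.


def frame_conservation(ref: str, other: str) -> float:
    """Largest fraction of aligned pairs that share one frame offset.

    `ref` and `other` are equal-length aligned strings with '-' for gaps.
    Gaps whose total length is a multiple of three leave the offset
    unchanged; other gaps shift the downstream sequence out of frame.
    """
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have the same length")
    ref_pos = other_pos = 0      # ungapped coordinates in each sequence
    per_offset = [0, 0, 0]       # aligned pairs seen at each frame offset
    aligned = 0
    for a, b in zip(ref, other):
        if a != "-" and b != "-":
            aligned += 1
            per_offset[(other_pos - ref_pos) % 3] += 1
        if a != "-":
            ref_pos += 1
        if b != "-":
            other_pos += 1
    return max(per_offset) / aligned if aligned else 0.0


if __name__ == "__main__":
    ref = "ATGAAACCCGGGTAA"
    print(frame_conservation(ref, "ATGAAA---GGGTAA"))  # 3-nt deletion
    print(frame_conservation(ref, "ATGAAA-CCGGGTAA"))  # 1-nt deletion
```

In this toy example the three-nucleotide deletion keeps every aligned position in frame, whereas the single-nucleotide deletion shifts everything downstream of it, which is the kind of difference one expects between coding and non-coding regions.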

We then turn to the identification of regulatory motifs (Chapter 3). We present statistical methods for their systematic de-novo identification without use of prior biological information. We automatically identified 72 genome-wide sequence elements with strongly non-random conservation properties. To validate our findings, we compared the discovered motifs against a list of known motifs, and found that we had rediscovered virtually all previously known regulatory motifs, along with 41 additional motifs. We assign function to these motifs using sets of functionally related genes (Chapter 4), and we discover additional motifs enriched in these sets.

We further present methods for revealing the combinatorial control of gene expression (Chapter 5). We study the genome-wide co-occurrence of regulatory motifs, and discover significant correlations between pairs of motifs that were not apparent in a single genome. We show that these correspond to biologically meaningful relationships between the corresponding factors and that motif combinations can change the specific functional enrichment of target genes, thus increasing the versatility of gene regulation using only a limited number of regulatory motifs.
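One common way to ask whether two motifs co-occur more often than expected is a hypergeometric tail probability, sketched below in Python. This is an assumption made for illustration: the text above does not specify which statistic is used, and the function name `cooccurrence_pvalue` and its parameters are hypothetical.

```python
# Illustrative sketch only: hypergeometric test for motif co-occurrence.
# The choice of statistic is an assumption, not taken from the thesis.
from math import comb


def cooccurrence_pvalue(n_total: int, n_a: int, n_b: int, n_both: int) -> float:
    """P(overlap >= n_both) if motif B's regions were placed at random.

    n_total: number of intergenic regions considered
    n_a:     regions containing motif A
    n_b:     regions containing motif B
    n_both:  regions containing both motifs
    """
    upper = min(n_a, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_both, upper + 1)
    )
    return tail / comb(n_total, n_b)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(cooccurrence_pvalue(n_total=10, n_a=5, n_b=5, n_both=5))
```

A small probability suggests that the two motifs share regions more often than independent placement would explain, although with many motif pairs tested a correction for multiple comparisons would also be needed.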

We finally focus on the differences between the species compared and discover the regions and mechanisms of evolutionary change (Chapter 6). We study rapid gene family expansions and discover that they localize in the telomeres. We show that chromosomal rearrangements and inversions are mediated by specific sequence elements. We find specific mechanisms of rapid protein change in environment adaptation genes, as well as stretches of unchanged nucleotides suggesting novel functions for uncharacterized genes.

Our results demonstrate the central role of computational tools in modern biology. Our methods are general and applicable to the study of any organism. They are currently being applied to a kingdom-wide exploration of fungal genomes and the comparative analysis of the human genome with that of the mouse and other mammals. Comparison of multiple related species may present a new paradigm for understanding the genome of any single species.