CONCLUSION

C.1. Summary

In this thesis, we explored the ability to extract a wide range of biological information from genome comparison among related organisms. Our results show that comparative analysis with closely related species can be invaluable in annotating a genome. It reveals the way different regions change and the constraints they face, providing clues as to their use. Even in a genome as compact as that of S. cerevisiae, where genes are easily detectable and rarely spliced, much remains to be learned about the gene content. We found that a large number of the annotated ORFs are dubious, adjusted the boundaries of hundreds of genes, and discovered more than 50 novel ORFs and 40 novel introns. Moreover, our comparisons have enabled a glimpse into the dynamic nature of gene regulation and co-regulated genes by discovering most known regulatory motifs as well as a number of novel motifs. The signals for these discoveries are present within the primary sequence of S. cerevisiae, but represent only a small fraction of the genome. Under the lens of evolutionary conservation, these signals stand out from the non-conserved noise. Hence, in studying any one genome, comparative analysis of closely related species can provide the basis for a global understanding of a wide range of functional elements.

Our results demonstrate the central role of computational tools in modern biology. The analyses presented in this thesis have revealed biological findings that cannot be discovered by traditional genetic methods, regardless of the time or effort spent. Isolated deletion of every single yeast gene has been carried out without resolving the debate on the number of functional genes. Promoter analysis of any single gene could not reveal the subtle regulatory signals that become apparent at the genome-wide level. The approach presented is general, and has the advantage that one can increase its power by increasing the number of species studied. As sequencing costs fall and sequencing capacity increases, obtaining additional genomes becomes only a question of time. The comparison of multiple related species may present a new paradigm for understanding the genome of any single species. In particular, our methods are currently being applied to a kingdom-wide exploration of fungal genomes, and to the comparative analysis of the human genome with those of the mouse and other mammals.

C.2. Extracting signal from noise

For S. cerevisiae, our results show that comparative genome analysis of a handful of related species has substantial power to separate signal from noise: to identify genes, define gene structure, highlight rapid and slow evolutionary change, recognize regulatory elements, and reveal combinatorial control of gene regulation. In terms of sensitivity and precision, this power is comparable or superior to that of experimental analysis.

In principle, the approach could be applied to any organism by selecting a suitable set of related species. The optimal choice of species depends on multiple considerations, largely related to the evolutionary tree connecting the species. These include the following:

(1) The branch length t between species should be short enough to permit orthologous sequence to be readily aligned. The yeasts studied here differ by t = 0.23-0.55 substitutions per site and are readily aligned. The strong conservation of synteny (more than 90% of S. cerevisiae chromosomes fall within synteny blocks) allowed the unambiguous correspondence of the vast majority of genes.

(2) The total branch length of the tree should be large enough that non-functional sites will have undergone substantially more drift than functional sites, thereby providing an adequate degree of signal-to-noise enrichment (SNE). For this analysis, the multiple species studied provide a total branch length of 0.83 and a probability of nucleotide identity across all four species in non-coding regions of 49%. The SNE is thus ~2-fold (=1/0.49) for highly constrained nucleotides and correspondingly higher for composite features involving many nucleotides.

(3) The species should represent as narrow a group as possible, subject to the considerations above. Because the comparative analysis above seeks to identify genomic elements common to the species, it can explain only aspects of biology shared across the taxon.  In the present case, the analysis identifies elements shared across Saccharomyces sensu stricto, a closely related set of species such that the vast majority of genes and regulatory elements are shared.
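The arithmetic behind the signal-to-noise enrichment in consideration (2) can be sketched as follows. The single-site value 1/0.49 comes from the text. The extension to composite features assumes that unconstrained sites drift independently, so a feature of n fully constrained nucleotides is enriched by 1/0.49^n. That independence assumption and the function name are illustrative simplifications, not part of the analysis itself.

```python
def signal_to_noise_enrichment(p_identity: float, n_sites: int = 1) -> float:
    """Fold enrichment for a feature of n_sites fully constrained nucleotides.

    p_identity is the probability that an unconstrained site is identical
    across all species compared. Sites are assumed to drift independently
    (an illustrative simplification).
    """
    if not 0.0 < p_identity <= 1.0:
        raise ValueError("p_identity must be in (0, 1]")
    if n_sites < 1:
        raise ValueError("n_sites must be at least 1")
    return 1.0 / p_identity ** n_sites


# Value reported in the text for non-coding regions across the four yeasts.
P_NONCODING_IDENTITY = 0.49

for n in (1, 3, 6):
    sne = signal_to_noise_enrichment(P_NONCODING_IDENTITY, n)
    print(f"{n} constrained site(s): ~{sne:.1f}-fold enrichment")
```

Under this simplification the enrichment grows geometrically with feature length, which is one way to read the statement that composite features involving many nucleotides enjoy correspondingly higher enrichment.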

With these considerations in mind, the question remains as to what is the “right” number of species for comparative analysis. Similarly, one can ask, given a set of previously sequenced species, what is the optimal choice for the next species to sequence. The answer of course depends on the goal at hand. In discovering genes, the number of species required depends on the length of the genes sought. In discovering motifs, the number of species depends on the motif length, its allowed degeneracy, and the total number of conserved instances. In each case, both the evolutionary distance between the species compared and the topology of the phylogenetic tree will determine our ability to extract signal from noise. We found that genome-wide methods could increase the power of comparative analysis that is based on only a handful of species. The answer in the general case merits a much more detailed analysis.

C.3. Analysis of mammalian genomes

What are the implications for the understanding of the human genome?

The present study provides a good model for evolutionary distances (substitutions per site in intergenic regions) relevant to the study of the human genome. The sequence divergence between S. cerevisiae and the most distant relative, S. bayanus (11% indels and 62% nucleotide identity in aligned positions), is similar to that between human and mouse (12% indels and 66% nucleotide identity in aligned positions).

An important difference between yeast and human is the inherent signal-to-noise ratio (SNR) in the genome. Yeast has a high SNR, with ~70% of the genome coding for protein or RNA genes and regulatory elements comprising perhaps ~15% of the intergenic regions. The human has a much lower SNR, with the corresponding figures being perhaps ~2% and ~3% [19]. A lower SNR must be offset by a higher SNE. Some enrichment can also be obtained by filtering out the repeat sequences that comprise half of the human genome. Greater enrichment can be accomplished by increasing the number of species studied, taking advantage both of nucleotide-level divergence and of frequently occurring genomic deletion [19].
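To make the trade-off between SNR and SNE concrete, the following sketch estimates what fraction of fully conserved intergenic sites would be functional. It is a toy model under two loudly stated assumptions that do not come from the text: every functional site is perfectly conserved, and every neutral site is conserved independently with a single probability. The function name is hypothetical. The regulatory fractions (~15% and ~3%) and the 49% identity figure are taken from the text, while applying 49% to the human genome is purely illustrative.

```python
def functional_fraction_after_filtering(f_functional: float,
                                        p_neutral_identity: float) -> float:
    """Fraction of fully conserved sites that are functional.

    Toy model (assumption): functional sites are always conserved, and
    neutral sites are conserved with probability p_neutral_identity.
    """
    conserved_functional = f_functional
    conserved_neutral = (1.0 - f_functional) * p_neutral_identity
    return conserved_functional / (conserved_functional + conserved_neutral)


# Yeast intergenic regions: ~15% regulatory, 49% neutral identity (from the text).
print(f"yeast: {functional_fraction_after_filtering(0.15, 0.49):.3f}")

# Human: ~3% regulatory (from the text); the 49% identity is illustrative only.
print(f"human, same divergence: {functional_fraction_after_filtering(0.03, 0.49):.3f}")

# More species lower the neutral identity; 5% is a hypothetical value.
print(f"human, more species: {functional_fraction_after_filtering(0.03, 0.05):.3f}")
```

In this toy model, a genome with a lower starting fraction of functional sites needs a much lower probability of neutral conservation, that is, a higher SNE, to reach the same purity among conserved sites.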

Such considerations indicate that it should be possible to use comparative analysis, such as explored here for yeast, to directly identify many functional elements in the human genome common to mammals.  More generally, comparative analysis offers a powerful and precise initial tool for interpreting genomes.

C.4. The road ahead

In this thesis, we explored the ability of computational comparative genomics to extract biological signals that govern genes, regulation, and evolution. The nature of these signals, however, had been previously established experimentally. Knowing that genes are translated into amino acids three nucleotides at a time was central to our test of reading frame conservation. Knowing that regulatory motifs appear in multiple intergenic regions was crucial to our genome-wide discovery methods. Knowing the kinds of functional sequences to look for allowed us to examine the ways that they change. In each case, our methods relied on well-posed questions based on currently established biological knowledge.

In the future, however, it will be important to formulate new hypotheses from genomic data. We cannot begin to imagine the types of information encoded in the human genome. The bases for intelligence, psychology, immunity, development, and emotions are all encoded within our cells. New biological paradigms will be needed to explore novel aspects of biology, and their very discovery will reside in genome-wide studies. New technologies, new statistical methods, and new computational tools will need to be developed. An explosion of biological data, and also of novel experimental techniques, has already started. The only way to proceed is a constant marriage between biology and computer science.