BACKGROUND

0.1. Molecular biology and the study of life

It is both humbling and bewildering that what separates humans from bacteria is merely the organization and assembly of the same basic bio-molecules.  It is the study of these shared foundations of life that gave rise to the discipline of molecular biology.  At the microscopic level, complex and simple organisms alike are made up of the same unit of life, the cell.  A cell contains all the information and machinery necessary for its growth, maintenance and replication.  It is delimited from its surroundings by a water-impermeable membrane, and all communication and transport across the membrane is tightly controlled.  Two major types of cells exist: prokaryotic cells, with simple internal organization, and eukaryotic cells, with extensive compartmentalization of functions such as information storage in the nucleus, energy production in mitochondria, metabolism in the cytoplasm, etc.  In unicellular organisms, the cell constitutes the complete organism, whereas multi-cellular organisms (typically eukaryotes) can contain up to trillions of cells, and hundreds of specialized cell types.  In either case though, a cell can rarely be thought of in isolation: it is constantly interacting with its surroundings, sensing environmental changes, and exchanging stimuli with other cells that may be part of the same colony or organism.

Within a cell, virtually all functional roles are fulfilled by proteins, the most versatile type of macromolecule.  Various types of proteins fulfill an immense array of tasks.  For example, enzymes catalyze countless chemical reactions;  transcription factors control the timing of gene usage; transporters carry molecules inside or outside the cell; trans-membrane channels regulate the concentrations of molecules in the cell; structural proteins provide support and shape to the cell; actins can cause motion; receptors recognize intra- or extra-cellular signals.  This incredible versatility of proteins comes from the innumerable combinations of an alphabet of only 20 amino acid building blocks, juxtaposed in a single unbranched chain of hundreds or thousands of such amino acids.  All amino acids share an identical portion of their structure that forms the protein backbone, to which is attached one of 20 possible side chains of variable size, shape, charge, polarity, hydrophobicity.  The precise sequence of amino acids dictates a unique three-dimensional fold that optimizes electrostatic and other interactions between the side-chains and with the solvent.

DNA in turn carries the genetic information that encodes the precise sequence of all proteins, the signals that control their production, and all other inheritable traits.  DNA is also a macromolecule, consisting of the linear juxtaposition of millions of nucleotides.  It encodes the genetic information digitally, like the bits of a digital computer, in the precise ordering of four types of nucleotides.  Like amino acids, these nucleotides share a fixed portion that forms a (phosphate) backbone to which is connected (via a deoxyribose sugar) a variable portion that is one of four bases, abbreviated A, C, G, T.  Unlike proteins however, the structure of DNA is fixed.  It consists of two strands, like the sidepieces of a ladder, connected by pairs of bases, like the steps of a ladder.  The two strands are wrapped around each other and form a double helix.  The two phosphate backbones form the outside of the helix, and the base pairs, connected by weak hydrogen bonds, form the interior of the helix.  Only two pairings of bases are possible, based on shape and charge complementarity:  A always pairs with T and C always pairs with G.  This self-complementarity of the DNA structure forms the very basis of heredity:  during DNA replication, the two strands open locally, and each strand becomes the template for synthesizing the opposite strand, its sequence dictated by base complementarity.  The DNA double helix is rarely exposed.  It is typically wrapped around histone proteins and packaged in a coiled structure referred to as chromatin.
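The pairing rule and the template-based replication described above can be sketched in a few lines of Python.  This is only an illustration: the function names are hypothetical, strands are written position by position as they face each other, and strand orientation is ignored.

```python
# Base-pairing rule: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def paired_strand(template: str) -> str:
    """Return the strand that pairs with `template`, base by base.

    The result is written in the same left-to-right order as the template,
    i.e. as the two strands face each other in the helix.
    """
    return "".join(PAIR[base] for base in template)


def replicate(strand_a: str, strand_b: str):
    """Sketch of replication: each parental strand templates a new partner.

    Returns two daughter duplexes, each made of one old and one new strand.
    """
    daughter_1 = (strand_a, paired_strand(strand_a))
    daughter_2 = (paired_strand(strand_b), strand_b)
    return daughter_1, daughter_2
```

Because each strand fully determines its partner, both daughter duplexes are identical to the parental duplex.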

The complete DNA content of an organism is referred to as its genome, and is contained in one or more large uninterrupted pieces called chromosomes.  Prokaryotic cells typically contain one circular chromosome, and eukaryotic cells contain varying numbers of linear chromosomes (16 in yeast, 23 pairs in human) that are compartmentalized within the cell nucleus.  Each linear chromosome is marked by a well-defined central region, the centromere, and by chromosomal endpoints called telomeres.  In a multi-cellular organism, every cell contains an identical copy of the genome (with extremely few exceptions, such as red blood cells, which do not have a nucleus).  In addition to the chromosomal DNA, cells typically contain additional small pieces of DNA in plasmids (small circular pieces found in bacteria and typically containing antibiotic resistance genes), or in mitochondria and chloroplasts (energy production organelles found in eukaryotes).  Genome size varies widely across species, typically 5kb-200kb (kilo-bases) for viruses [20-22], 500kb to 5Mb for bacteria [15], 10-30Mb for unicellular fungi [23,24], 97Mb for the worm [25], 165Mb for the fly [26], 2-3Gb for mammals [19,27], and 100Mb-100Gb for plants [28].

The amino-acid sequence of every protein is encoded within a single continuous stretch of DNA called a gene.  The transfer of information from the four-letter nucleotide alphabet of DNA to the 20 amino-acid alphabet of proteins is ensured by a process called translation.  Consecutive nucleotide triplets (codons) are translated into consecutive amino-acid residues, according to a precise translation table, referred to as the genetic code.  There are 64 possible codons and only 20 amino acids, hence the genetic code contains degeneracies, and the same amino acid can be encoded by multiple codons.  Additionally, the codon ATG (which codes for methionine) also serves as a special translation initiation signal, and three codons (TGA, TAG, TAA) are dedicated translation termination signals.  These are typically called start and stop codons.  DNA is a directional molecule, and so are proteins.  DNA is always read and synthesized in the 5’ to 3’ direction (named after the 5’ and 3’ carbons in the carbon-ring of the sugar).  Given this directionality of either strand, we can refer to sequences upstream (5’) or downstream (3’) of a particular nucleotide on the same strand.  The two complementary strands run in opposite directions and are called anti-parallel, hence upstream on one strand is complementary to downstream on the opposite strand.  Upstream and downstream are typically used in relation to the coding strand of a gene (containing the sequence ATG).  Proteins are synthesized from the N terminus (encoded by the 5’ part of the gene) to the C terminus (encoded by the 3’ part of the gene).
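As a minimal sketch of these rules, the Python fragment below builds the standard genetic code, translates a coding strand from its first ATG to the first stop codon, and reads a gene from the opposite strand by taking the reverse complement.  The function names and the example sequence are illustrative assumptions; amino acids are written in the usual one-letter code, with `*` marking a stop codon.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, written 5' to 3' (strands are anti-parallel)."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def translate(coding_strand: str) -> str:
    """Translate from the first ATG to the first in-frame stop codon."""
    start = coding_strand.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(coding_strand) - 2, 3):
        amino_acid = CODON_TABLE[coding_strand[i:i + 3]]
        if amino_acid == "*":  # TAA, TAG or TGA
            break
        protein.append(amino_acid)
    return "".join(protein)


# Degeneracy of the code: number of codons per amino acid (or stop signal).
CODONS_PER_AMINO_ACID = Counter(AMINO_ACIDS)
```

For example, `translate("CCATGGCTAAATGACC")` reads the codons ATG, GCT, AAA and then stops at TGA, giving `"MAK"`; the same protein is obtained from the opposite strand with `translate(reverse_complement("GGTCATTTAGCCATGG"))`.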

0.2. Gene regulation and the dynamic cell

DNA is not directly translated into protein, but it is first transferred by complementarity into an intermediary single-stranded information carrier called messenger RNA or mRNA in a process called transcription.  The Central Dogma of biology refers to this transfer of the genetic information from DNA to RNA to protein.  RNA is similar to DNA, but is single-stranded and contains a different type of sugar connector between the phosphate backbone and the variable base (also the four bases are A,C,G,U instead of A,C,G,T).  This difference in structure enables RNA to assume complex three-dimensional folds and perform a variety of cellular functions, only one of which is information transfer between DNA and protein.  In eukaryotic cells, transcription occurs in the nucleus where the DNA resides, and the resulting mRNA molecule is then transferred outside the nucleus where the translation machinery resides.  During this transfer, the transcript undergoes a maturation step, including the excision (called splicing) of untranslated gene portions (called introns), and the joining of the remaining portions of the transcribed gene that are typically translated (called exons).  The splicing of introns is dictated by subtle signals between 6 and 8 bp (base pairs) long that are found mainly at the junctions between exons and introns and within each intron.  In prokaryotic cells, transcripts do not undergo splicing and sometimes contain multiple consecutively translated genes of related function.


The process of protein and RNA production, also called gene expression, is tightly controlled at multiple stages, but mainly at the stage of transcription initiation.  This involves the uncoiling of chromatin structure around the gene to be expressed and the recruitment of a number of protein players that include the transcription machinery.  These processes are regulated by a specific class of DNA-binding proteins called transcription factors.  These bind the double-stranded DNA helix in sequence-specific binding sites, recognizing electrostatic properties of the nucleotides at each contact point.  A regulatory motif describes the sequence specificity of a transcription factor, namely, the nucleotide patterns that are in common to the sites bound.  Transcription factors are classified according to their effect on the expression of their target genes:  an activator increases the level of gene expression when bound, and a repressor decreases that level. Transcription factor binding is modulated by the protein concentration and localization of the transcription factor, the three-dimensional conformation of the transcription factor that may depend on chemical modifications, protein-protein interactions with other factors that may bind cooperatively or competitively, and chromatin accessibility surrounding the binding site.  Finally, in addition to transcription initiation, gene expression is regulated at many stages, including mRNA transport and splicing, translation initiation and efficiency, mRNA stability and degradation, post-translational modifications of a protein, and protein stability.
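One simple way to summarize a regulatory motif is to count the nucleotides observed at each position of the bound sites and report a consensus.  The sketch below assumes that the sites are already aligned and of equal length; the function names, the threshold `min_fraction`, and the example sites are invented for illustration and do not describe any particular transcription factor.

```python
from collections import Counter


def position_counts(sites):
    """Count the nucleotides seen at each position of equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(site[i] for site in sites) for i in range(length)]


def consensus(sites, min_fraction=0.75):
    """Return the most common base per position, or N if none dominates."""
    letters = []
    for counts in position_counts(sites):
        base, count = counts.most_common(1)[0]
        letters.append(base if count / len(sites) >= min_fraction else "N")
    return "".join(letters)


# Hypothetical bound sites, made up for this example.
example_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
```

With these made-up sites, `consensus(example_sites)` gives `"TGACTCA"`, while a stricter threshold, `consensus(example_sites, min_fraction=0.9)`, gives `"TGANTNA"`, showing which positions are perfectly shared by all sites.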

These processes together modulate gene expression in response to environmental changes, and are interlinked in complex regulatory networks, responsible for the dynamic nature of the cell.  These dynamics create the multitude of specific cell responses to varying environmental stimuli.  Gene regulation also creates the incredible variety of cell types found within the same organism.  For example heart, liver, lung, nail, skin, eye, neurons, hair, or bone all have the exact same DNA content, but express a different set of genes.  Changes in gene expression however, can also be responsible for a number of complex diseases.  Understanding the dynamic cell is a major challenge for molecular biology and modern medicine. 

0.3. Evolutionary change and comparative genomics

The evolution of these complex mechanisms was shaped by the forces of random change and natural selection.  Random genomic change can generate new functions or disrupt existing ones, and natural selection favors and keeps the fittest combinations.  The genotypic differences accumulated at the DNA level lead to observed phenotypic differences between individuals of a population.  Genomic changes can be as subtle as the mutation, insertion or deletion of individual nucleotides, and as drastic as the duplication or loss of chromosomal segments, entire chromosomes, or complete genomes.  Changes in a protein-coding gene can lead to multiple co-existing variants, or alleles, of that gene within a population, which differ in specific residues and perform the same function with slight differences.  As the result of mating, the progeny will inherit a combination of paternal and maternal alleles for different genes.  The random mating of individuals within a population and the random segregation of chromosomal segments in gamete formation create new allelic combinations at each generation.  The frequency of these allelic combinations will vary through evolutionary time, either by selection for their evolutionary fitness or by random genetic drift.  As populations segregate and adapt to their environment, different combinations of alleles dominate in each population.  The resulting differences in behavior or chromosomal organization can lead to loss of reproductive ability across sub-populations and the emergence of new species.  The emergence of new functions in these changing species allowed adaptation to all niches on land, in the air, underground, or in the deepest oceans, in species as diverse as dinosaurs and amoebae.  It is thought that all life on the planet descends from a single ancestral cell that lived around 3.5 billion years ago, and the incredible biodiversity observed today resulted from incremental changes of existing life forms.

The genomes of related species exhibit similarities in functional elements that have undergone little change since the species’ common ancestor.  Deleterious mutations in these functional regions have certainly occurred, but the individuals carrying them were at a disadvantage and were eventually eliminated by natural selection.  Mutations in non-functional regions have no effect on an organism’s reproductive fitness, and will accumulate over evolutionary time.  Hence, the combined effects of random mutation and natural selection allow comparative approaches to separate conserved functional regions from diverged non-functional regions.  Comparative genome analysis of related species should provide a general approach for identifying functional elements without prior knowledge of function, by virtue of their greater degree of conservation across the genomes of related species.  When selecting species for a pairwise comparative analysis, we face a tradeoff between closely related species (with many common functional elements but additional spuriously conserved non-functional regions), and distantly related species (with mostly diverged non-functional regions but fewer common functional elements).  The use of multiple closely-related species may present an attractive alternative, exhibiting an accumulation of independent mutations in non-functional regions, while having most biological functions in common.

Recent studies have demonstrated the potential power of comparative genomic analysis.  Cross-species conservation has previously been used to identify putative genes or regulatory elements in small genomic regions [10-13].  Light sampling of whole-genome sequence has been studied as a way to improve genome annotation [4,14].  Complete bacterial genomes have been compared to identify pathogenic and other genes [15-18].  Genome-wide comparison has been used to estimate the proportion of the mammalian genome under selection [19].

0.4. Sequence alignment and phylogenetic trees

The comparison of related sequences is typically represented as a sequence alignment (for an example see figure 3.2).  The correspondence of nucleotides across the sequences compared is given by offsetting the nucleotides of each sequence such that matching nucleotides are stacked at the same index across all sequences.  To represent insertions or deletions (indels), gaps are typically inserted as dashes in the shorter sequence;  these could represent a deletion in the sequence containing the gap, or an insertion in the other sequences.  Typically, no reordering or repetition of nucleotides is allowed within a sequence, and hence no inversions, duplications, or translocations are represented in a sequence alignment.  Constructing an alignment of two sequences is equivalent to finding the optimal path in a two-dimensional grid of cells, and dynamic programming algorithms have been developed to align two sequences in time proportional to the product of their lengths, and space proportional to the sum of their lengths.  The optimal alignment of two sequences minimizes the total cost of insertions, deletions, and nucleotide substitutions (gaps and mismatches), each penalized according to input parameters.  These parameters are set to match estimated rates of insertions, deletions and nucleotide substitutions in well-conserved portions of carefully-constructed alignments.  For example, substitutions between nucleotides of similar structure are more frequent, and hence transitions between purines (A and G) or between pyrimidines (C and T) are penalized less than transversions from a purine to a pyrimidine and vice versa.  Also, it is typical to penalize gaps using affine functions, namely adding a cost proportional to the size of the gap to a fixed cost for starting a gap.  Global alignments span the entire length of the sequences compared, whereas local alignments only align sub-portions of the sequences.
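The grid formulation can be made concrete with a small global-alignment sketch in Python.  It is deliberately simplified and rests on stated assumptions: the cost values (1 for a transition, 2 for a transversion, 2 per gap position) are arbitrary, gaps are penalized linearly rather than with an affine function, and the full grid is stored, so this version uses space proportional to the product of the lengths rather than their sum.

```python
PURINES = {"A", "G"}


def substitution_cost(a, b, transition=1, transversion=2):
    """Cost of aligning base a with base b (illustrative values)."""
    if a == b:
        return 0
    same_class = (a in PURINES) == (b in PURINES)
    return transition if same_class else transversion


def global_align(x, y, gap=2):
    """Minimum-cost global alignment of x and y with a linear gap cost.

    Returns (total cost, aligned x, aligned y), with dashes marking gaps.
    """
    n, m = len(x), len(y)
    # cost[i][j] = best cost of aligning x[:i] with y[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + substitution_cost(x[i - 1], y[j - 1]),
                cost[i - 1][j] + gap,   # gap in y
                cost[i][j - 1] + gap,   # gap in x
            )
    # Trace one optimal path back from the bottom-right corner of the grid.
    aligned_x, aligned_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and cost[i][j] ==
                cost[i - 1][j - 1] + substitution_cost(x[i - 1], y[j - 1])):
            aligned_x.append(x[i - 1])
            aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            aligned_x.append(x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(y[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(aligned_x)), "".join(reversed(aligned_y))
```

For example, `global_align("ACGT", "AGT")` returns a cost of 2 with the alignment `ACGT` over `A-GT`: a single gap is cheaper than any combination of substitutions.  A local variant would additionally allow an alignment to start and end anywhere in the grid.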

The best match of a query sequence can be found in a database of sequences by scoring the local alignments between the query and each sequence in the database.  Constructing the full dynamic programming matrix for each of the sequences in a large database can be costly, and efficient algorithms have been developed to only align a small subset of the database sequences.  These algorithms take advantage of the fact that strong matches of a query sequence will typically contain stretches of perfectly conserved residues, and first select all database sequences that contain such stretches.  To do so, a hash table is first constructed for the database, listing all sequences and positions that contain a particular k-mer.  After this slow step, which need only be performed once, the lookup of all k-mers in a query sequence can be performed rapidly against a large database, constructing a list of hits.  Local alignments are then constructed around each hit, extending the k-mer matches to longer high-scoring local alignments.  These ideas are implemented in the popular program BLAST, which is used thousands of times daily to query the genomes of dozens of sequenced species and millions of sequences.  One modification of the BLAST algorithm, called two-hit BLAST, only constructs a local alignment when at least two nearby hits are found.  This allows the retrieval of more distantly related sequences by searching for shorter k-mers, while still maintaining high specificity by requiring multiple k-mer hits in common.
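The hash-table lookup and the extension of hits can be sketched as follows.  This is not the BLAST implementation: the function names are hypothetical, the extension step only grows a hit while residues match exactly (BLAST extends using alignment scores), and the two-hit filter simply requires two hits on the same diagonal within an assumed `window`, without checking that the hits do not overlap.

```python
from collections import defaultdict


def build_index(database, k):
    """Map every k-mer to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_hits(query, index, k):
    """Look up every k-mer of the query; return (name, query pos, db pos)."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            hits.append((name, qpos, dpos))
    return hits


def extend_hit(query, target, qpos, dpos, k):
    """Grow an exact k-mer match in both directions while residues agree.

    Returns (query start, target start, length) of the extended match.
    """
    left = 0
    while (qpos - left > 0 and dpos - left > 0
           and query[qpos - left - 1] == target[dpos - left - 1]):
        left += 1
    right = k
    while (qpos + right < len(query) and dpos + right < len(target)
           and query[qpos + right] == target[dpos + right]):
        right += 1
    return qpos - left, dpos - left, left + right


def two_hit_diagonals(hits, window=10):
    """Keep (name, diagonal) pairs supported by two hits within `window`."""
    by_diagonal = defaultdict(list)
    for name, qpos, dpos in hits:
        by_diagonal[(name, dpos - qpos)].append(qpos)
    kept = set()
    for key, positions in by_diagonal.items():
        positions.sort()
        if any(b - a <= window for a, b in zip(positions, positions[1:])):
            kept.add(key)
    return kept
```

The index is built once per database and reused for every query, which is what makes the per-query lookup fast.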

Multiple sequence alignments can also be constructed for more than two sequences.  Constructing the full dynamic programming matrix is exponential in the number of sequences compared and typically impractical for long sequences.  Therefore, current algorithms work by extending multiple pairwise alignments between the sequences compared.  The similarities between all pairs of sequences can be used to construct a phylogenetic tree, summarizing the most likely ancestry of the sequences, linking them hierarchically from the most closely related pair to the most distantly related outgroup.  Multiple sequence alignment algorithms typically start by aligning the most closely related sequences, and progressively merge alignments moving up the phylogenetic tree from the leaves to the root.  Algorithms to merge two alignments typically use once-a-gap-always-a-gap methods, but more recent algorithms have been developed to locally re-optimize multiple alignment portions by revisiting previously added gaps and improving the overall alignment score.
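The order in which a progressive method merges alignments can be illustrated with a small guide-tree sketch.  The code below uses average-linkage clustering of pairwise distances, which is one common way to build such a tree and is not necessarily the method used by any particular alignment program; the sequence names and distance values are invented for the example.

```python
from itertools import combinations


def average_distance(members_a, members_b, distance):
    """Mean pairwise distance between the leaves of two clusters."""
    total = sum(distance[frozenset((a, b))]
                for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))


def guide_tree(names, distance):
    """Repeatedly join the two closest clusters; return a nested-tuple tree."""
    clusters = [(name, [name]) for name in names]  # (subtree, leaf names)
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(
                clusters[ij[0]][1], clusters[ij[1]][1], distance),
        )
        (tree_i, members_i), (tree_j, members_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), members_i + members_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


def merge_order(tree):
    """List the internal nodes from the leaves to the root (post-order)."""
    if isinstance(tree, str):
        return []
    left, right = tree
    return merge_order(left) + merge_order(right) + [tree]


# Hypothetical pairwise distances between four sequences A, B, C and D.
example_distance = {
    frozenset(("A", "B")): 0.10,
    frozenset(("A", "C")): 0.20,
    frozenset(("B", "C")): 0.20,
    frozenset(("A", "D")): 0.30,
    frozenset(("B", "D")): 0.30,
    frozenset(("C", "D")): 0.30,
}
```

With these made-up distances, `guide_tree(["A", "B", "C", "D"], example_distance)` joins A with B first, then adds C, and finally the outgroup D; `merge_order` then lists the pairwise merges a progressive aligner would perform, starting with `("A", "B")`.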

0.5. Model organisms and yeast genetics

The shared biology of related species allows one to study a biological process in one organism and apply the knowledge to another organism.  Simpler organisms provide excellent models for developing and testing the procedures needed for studying the much more complex human genome.  Such model organisms include bacteria, yeast, fungi, worms, flies and mice, each teaching us different aspects of human biology.  For example, the study of cancer development has flourished through the use of mouse models, and has led to medical applications in humans.  Mutant strains can be isolated containing specific defects in genes that lead to disease phenotypes.  Controlled crosses can be used to restore lost functions or inhibit genes at particular stages of development and study their effects on the organism.  The shorter the generation time of a model organism, the easier it is to perform multiple crosses.

The yeast Saccharomyces cerevisiae in particular provides a powerful genetic system with the availability of a wide array of tools such as gene replacement, plasmids, deletion strains, two-hybrid systems.  Yeast is also amenable to biochemical methods, such as the purification and characterization of protein complexes.  Because of these experimental advantages, yeast has been the system of choice to study the most basic cellular functions common to eukaryotes such as cell division, cell structure, energy production, cell growth, cell death, cell cycle, gene regulation, transcription initiation, cell signaling, and other basic cell processes.  More recently, yeast has become the organism of choice for the development and testing of modern technologies for genome-wide experimental studies.  The complete parts-list of all genes has radically changed the face of biological research.  If a particular phenotype is due to the function of a single protein, it is necessarily encoded by one of these few thousand genes.  Additionally, the relatively small number of genes (~6000) allows the simultaneous observation of the complete genome for mRNA expression, transcription factor binding, or protein-protein interactions.  The public sharing of yeast strains, materials, and genome-wide experimental data has provided a global view of the dynamic yeast genome unmatched in any other organism.

Yeast also presents an ideal organism for developing computational methods for genome-wide comparative analysis.  It is the best-studied eukaryote, and the vast functional knowledge allows the immediate validation of our findings against previous work.  Additionally, the strong experimental system allows the experimental follow-up of biological hypotheses raised in the comparative work.  The small genome size (250 times smaller than human) allows the sequencing of multiple yeast species at an affordable cost.  Finally, the small number of repetitive elements allows for easy whole-genome-shotgun assembly (see next section).  For all these reasons, we decided to work on yeast.

0.6. Genome sequencing and assembly

We sequenced and assembled the complete genomes of S. paradoxus, S. mikatae and S. bayanus, three yeast species that are close relatives of S. cerevisiae, within the Saccharomyces sensu stricto group [29].  Their divergence times from the S. cerevisiae lineage are approximately 5, 10 and 20 million years respectively (based on sequence divergence of ribosomal DNA sequence).  Like S. cerevisiae, they all have 16 chromosomes and their genomes contain about 12 million bases.  These species were chosen based on their evolutionary relationships (closely enough related that functional elements are conserved, and distant enough that non-functional bases have had enough evolutionary time to diverge).

Reading the order of the nucleotides in any one segment of DNA relies on a technology developed by Sanger in 1977 that uses the central agent of DNA replication, DNA polymerase.  This protein complex recognizes the transition from double-stranded DNA to single-stranded DNA in an incomplete helix, and extends the shorter strand in the 5’ to 3’ direction.  By introducing a small fraction of faulty nucleotides that cause an early termination of the extension reaction, and subsequently comparing the lengths of resulting fragments in each of four reactions, this method infers the sequence of a DNA fragment.  The extension reaction can be initiated at any unique segment of DNA by introducing a complementary segment called a primer.  This primer binds single-stranded DNA by complementarity, creating the double-strand to single-strand transition recognized by DNA polymerase.  Unfortunately, since the Sanger method works by weight separation between fragments of different lengths, it can only determine the sequence of small fragments (currently around 800 nucleotides).  The weight difference between fragments of 800 nucleotides and fragments of 801 nucleotides is too small to be detected reliably.
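The logic of inferring a sequence from fragment lengths can be shown with a toy simulation.  The sketch assumes an idealized experiment in which each of the four reactions yields every possible terminated fragment and lengths are measured exactly; primer length and all experimental noise are ignored, and the function names are hypothetical.

```python
def sanger_fragments(synthesized: str):
    """Simulate the four reactions for a newly synthesized strand.

    Returns, for each terminating base, the set of fragment lengths at which
    extension stopped in the reaction containing that faulty nucleotide.
    """
    lanes = {base: set() for base in "ACGT"}
    for length, base in enumerate(synthesized, start=1):
        lanes[base].add(length)
    return lanes


def read_sequence(lanes):
    """Order all fragments by length and read off their terminating bases."""
    base_at_length = {
        length: base
        for base, lengths in lanes.items()
        for length in lengths
    }
    return "".join(base_at_length[length] for length in sorted(base_at_length))
```

For the strand `GATTACA`, the A reaction yields fragments of lengths 2, 5 and 7, and sorting the fragments of all four reactions by length recovers the original order of bases.  The read-length limit described above corresponds to the point where neighboring lengths can no longer be told apart.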

To obtain the sequence of longer stretches of DNA, two methods are possible.  One is to synthesize a new primer at the end of 800 nucleotides and use it to sequence the subsequent 800 nucleotides (and so on).  Unfortunately, synthesizing new primers is expensive and time-consuming since the primer to be used is not known until the sequence is obtained, and this method is rarely used.  An alternative method is to first make many copies of the longer stretch of DNA and randomly break them into small fragments, and then sequence 800-nucleotide reads from each of these fragments and re-piece them together computationally (each of the fragments is inserted into a common vector whose sequence is known, hence the same primer can be used to sequence the end of each of these fragments).  This alternative method is called shotgun sequencing, in reference to the random breaking of the longer fragment as if struck by a shotgun.  Sequence reads can also be obtained from both ends of a fragment, providing linking information between paired reads.  This method is called paired-end shotgun sequencing.  The shotgun fragments are typically selected to be of a particular size, providing additional information about the genomic distance between paired sequence reads.

Shotgun sequencing depends heavily on the computational ability to correctly assemble the resulting fragments of sequence.  Fragment assembly searches for sequences common between two sequence fragments (also called reads) and unique otherwise, in order to join them into a longer sequence.  This is made harder by sequencing errors, which lead to sequence differences between reads that really come from the same part of the genome, and by repetitive sequences within genomes, which lead to identical sequences between reads that come from different parts of the genome.  Modern assembly programs produce stretches of continuous sequence called contigs, which are linked into supercontigs or scaffolds when their relative order, orientation, and estimated spacing are given by the pairing of reads (Figure 0.4).  To assemble complete genomes, two methods are currently in use.  Whole-genome shotgun (WGS) randomly breaks the complete genome and assembles all fragments computationally.  Clone-based methods first partition the genome into large fragments (clones) and then use shotgun sequencing for each of the fragments.  Clone-based methods are more expensive but more reliable.  WGS methods are cheaper but rely more heavily on the ability of subsequent computational assembly programs.  Hybrids between WGS and clone-based methods are used nowadays in major sequencing projects.  It is also common to use WGS with links of multiple sizes to provide both short-range and long-range connectivity information.
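The core idea of joining reads through shared sequence can be sketched with a greedy overlap-merge procedure.  This is a toy illustration rather than the approach of any modern assembler: it assumes error-free reads that all come from the same strand, ignores paired-read information, and can be misled by repeats; the function names and the `min_len` threshold are assumptions.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Returns the list of contigs left when no overlap of at least
    min_len remains.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)]
        contigs.append(merged)
```

For example, the three reads `ATGGCGTA`, `GCGTACGT` and `ACGTTAGC` are merged into the single contig `ATGGCGTACGTTAGC`, whereas reads that share no overlap of at least `min_len` bases remain as separate contigs.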