Manolis Kellis
Research Interests

Distinguished Alumnus (1964) Career Development Professorship.
Associate Professor, MIT Department of Electrial Engineering and Computer Science (EECS).
Principal Investigator, Computer Science and Artificial Intelligence Laboratory (CSAIL).
Associate Member, Broad Institute of MIT and Harvard (Broad).

My primary research interests are in Computational Biology. My group is developing new algorithms and machine learning techniques for the interpretation of complete genomes, the understanding of gene regulation, and the elucidation of evolutionary mechanisms and embryo development. I am additionally involved in various research projects in Machine Learning, Computational Geometry, Robotics, and Information Retrieval.

I did my Ph.D. thesis on computational genomics, with Eric Lander and Bonnie Berger.
I did my masters thesis on image interpretation and retrieval, with Patrick Winston.

Research topics: Genome Interpretation, Small RNAs, Genes, Chromatin, Regulatory Motifs, Bio Networks, Evolution, Phylogenomics, Algorithms.

Individual projects:

Computational Biology: Genome Alignment, Gene Identification, Regulatory Motif Discovery, Genome Evolution, Genome Duplication, Embryo Development.

Machine Learning: Image interpretation and retrieval, Human Motion Primitives, Handwriting Recognition, Learning Musical Patterns.

Computational Geometry: 3D Surface Reconstruction, 3D polygon morphing, 2D Gaussian curve morphing.

Robotics: iLogo robotic vehicle, Modular Reconfigurable Robots, Autonomous Lego Robot.

Other Projects: Stock Market, Web Crawler, Turing Machine, Ray Tracing, Information Sharing, Meta-Evolution.

Genome Interpretation

23. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures.

Stark, Lin, Kheradpour, Pedersen, Parts, Carlson, Crosby, Rasmussen, Roy, Deoras, Ruby, Brennecke, Hodges, Hinrichs, Caspi, Paten, Park, Han, Maeder, Polansky, Robson, Aerts, van, Hassan, Gilbert, Eastman, Rice, Weir, Hahn, Park, Dewey, Pachter, Kent, Haussler, Lai, Bartel, Hannon, Kaufman, Eisen, Clark, Smith, Celniker, Gelbart, Kellis

Sequencing of multiple related species followed by comparative genomics analysis constitutes a powerful approach for the systematic understanding of any genome. Here, we use the genomes of 12 Drosophila species for the de novo discovery of functional elements in the fly. Each type of functional element shows characteristic patterns of change, or 'evolutionary signatures', dictated by its precise selective constraints. Such signatures enable recognition of new protein-coding genes and exons, spurious and incorrect gene annotations, and numerous unusual gene structures, including abundant stop-codon readthrough. Similarly, we predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes. We provide evidence of miRNA processing and functionality from both hairpin arms and both DNA strands. We identify several classes of pre- and post-transcriptional regulatory motifs, and predict individual motif instances with high confidence. We also study how discovery power scales with the divergence and number of species compared, and we provide general guidelines for comparative studies.

Nature. 2007 Nov 8;450(7167):219-32

2. Sequencing and comparison of yeast species to identify genes and regulatory elements.

Kellis, Patterson, Endrizzi, Birren, Lander

Identifying the functional elements encoded in a genome is one of the principal challenges in modern biology. Comparative genomics should offer a powerful, general approach. Here, we present a comparative analysis of the yeast Saccharomyces cerevisiae based on high-quality draft sequences of three related species (S. paradoxus, S. mikatae and S. bayanus). We first aligned the genomes and characterized their evolution, defining the regions and mechanisms of change. We then developed methods for direct identification of genes and regulatory motifs. The gene analysis yielded a major revision to the yeast gene catalogue, affecting approximately 15% of all genes and reducing the total count by about 500 genes. The motif analysis automatically identified 72 genome-wide elements, including most known regulatory motifs and numerous new motifs. We inferred a putative function for most of these motifs, and provided insights into their combinatorial interactions. The results have implications for genome analysis of diverse organisms, including the human.

Nature. 2003 May 15;423(6937):241-54

12. Genome sequence, comparative analysis and haplotype structure of the domestic dog.

Lindblad-Toh, Wade, Mikkelsen, Karlsson, Jaffe, Kamal, Clamp, Chang, Kulbokas, Zody, Mauceli, Xie, Breen, Wayne, Ostrander, Ponting, Galibert, Smith, DeJong, Kirkness, Alvarez, Biagi, Brockman, Butler, Chin, Cook, Cuff, Daly, DeCaprio, Gnerre, Grabherr, Kellis, Kleber, Bardeleben, Goodstadt, Heger, Hitte, Kim, Koepfli, Parker, Pollinger, Searle, Sutter, Thomas, Webber, Baldwin, Abebe, Abouelleil, Aftuck, Ait-Zahra, Aldredge, Allen, An, Anderson, Antoine, Arachchi, Aslam, Ayotte, Bachantsang, Barry, Bayul, Benamara, Berlin, Bessette, Blitshteyn, Bloom, Blye, Boguslavskiy, Bonnet, Boukhgalter, Brown, Cahill, Calixte, Camarata, Cheshatsang, Chu, Citroen, Collymore, Cooke, Dawoe, Daza, Decktor, DeGray, Dhargay, Dooley, Dooley, Dorje, Dorjee, Dorris, Duffey, Dupes, Egbiremolen, Elong, Falk, Farina, Faro, Ferguson, Ferreira, Fisher, FitzGerald, Foley, Foley, Franke, Friedrich, Gage, Garber, Gearin, Giannoukos, Goode, Goyette, Graham, Grandbois, Gyaltsen, Hafez, Hagopian, Hagos, Hall, Healy, Hegarty, Honan, Horn, Houde, Hughes, Hunnicutt, Husby, Jester, Jones, Kamat, Kanga, Kells, Khazanovich, Kieu, Kisner, Kumar, Lance, Landers, Lara, Lee, Leger, Lennon, Leuper, LeVine, Liu, Liu, Lokyitsang, Lokyitsang, Lui, Macdonald, Major, Marabella, Maru, Matthews, McDonough, Mehta, Meldrim, Melnikov, Meneus, Mihalev, Mihova, Miller, Mittelman, Mlenga, Mulrain, Munson, Navidi, Naylor, Nguyen, Nguyen, Nguyen, Nguyen, Nicol, Norbu, Norbu, Novod, Nyima, Olandt, O'Neill, O'Neill, Osman, Oyono, Patti, Perrin, Phunkhang, Pierre, Priest, Rachupka, Raghuraman, Rameau, Ray, Raymond, Rege, Rise, Rogers, Rogov, Sahalie, Settipalli, Sharpe, Shea, Sheehan, Sherpa, Shi, Shih, Sloan, Smith, Sparrow, Stalker, Stange-Thomann, Stavropoulos, Stone, Stone, Sykes, Tchuinga, Tenzing, Tesfaye, Thoulutsang, Thoulutsang, Topham, Topping, Tsamla, Vassiliev, Venkataraman, Vo, Wangchuk, Wangdi, Weiand, Wilkinson, Wilson, Yadav, Yang, Yang, Young, Yu, Zainoun, Zembek, Zimmer, Lander

Here we report a high-quality draft genome sequence of the domestic dog (Canis familiaris), together with a dense map of single nucleotide polymorphisms (SNPs) across breeds. The dog is of particular interest because it provides important evolutionary information and because existing breeds show great phenotypic diversity for morphological, physiological and behavioural traits. We use sequence comparison with the primate and rodent lineages to shed light on the structure and evolution of genomes and genes. Notably, the majority of the most highly conserved non-coding sequences in mammalian genomes are clustered near a small subset of genes with important roles in development. Analysis of SNPs reveals long-range haplotypes across the entire dog genome, and defines the nature of genetic diversity within and across breeds. The current SNP map now makes it possible for genome-wide association studies to identify genes responsible for diseases and traits, with important consequences for human and companion animal health.

Nature. 2005 Dec 8;438(7069):803-19

1. The genome sequence of the filamentous fungus Neurospora crassa.

Galagan, Calvo, Borkovich, Selker, Read, Jaffe, FitzHugh, Ma, Smirnov, Purcell, Rehman, Elkins, Engels, Wang, Nielsen, Butler, Endrizzi, Qui, Ianakiev, Bell-Pedersen, Nelson, Werner-Washburne, Selitrennikoff, Kinsey, Braun, Zelter, Schulte, Kothe, Jedd, Mewes, Staben, Marcotte, Greenberg, Roy, Foley, Naylor, Stange-Thomann, Barrett, Gnerre, Kamal, Kamvysselis, Mauceli, Bielke, Rudd, Frishman, Krystofova, Rasmussen, Metzenberg, Perkins, Kroken, Cogoni, Macino, Catcheside, Li, Pratt, Osmani, DeSouza, Glass, Orbach, Berglund, Voelker, Yarden, Plamann, Seiler, Dunlap, Radford, Aramayo, Natvig, Alex, Mannhaupt, Ebbole, Freitag, Paulsen, Sachs, Lander, Nusbaum, Birren

Neurospora crassa is a central organism in the history of twentieth-century genetics, biochemistry and molecular biology. Here, we report a high-quality draft sequence of the N. crassa genome. The approximately 40-megabase genome encodes about 10,000 protein-coding genes--more than twice as many as in the fission yeast Schizosaccharomyces pombe and only about 25% fewer than in the fruitfly Drosophila melanogaster. Analysis of the gene set yields insights into unexpected aspects of Neurospora biology including the identification of genes potentially associated with red light photobiology, genes implicated in secondary metabolism, and important differences in Ca2+ signalling as compared with plants and animals. Neurospora possesses the widest array of genome defence mechanisms known for any eukaryotic organism, including a process unique to fungi called repeat-induced point mutation (RIP). Genome analysis suggests that RIP has had a profound impact on genome evolution, greatly slowing the creation of new genes through genomic duplication and resulting in a genome with an unusually low proportion of closely related genes.

Nature. 2003 Apr 24;422(6934):859-68

22. Evolution of genes and genomes on the Drosophila phylogeny.

Drosophila, Clark, Eisen, Smith, Bergman, Oliver, Markow, Kaufman, Kellis, Gelbart, Iyer, Pollard, Sackton, Larracuente, Singh, Abad, Abt, Adryan, Aguade, Akashi, Anderson, Aquadro, Ardell, Arguello, Artieri, Barbash, Barker, Barsanti, Batterham, Batzoglou, Begun, Bhutkar, Blanco, Bosak, Bradley, Brand, Brent, Brooks, Brown, Butlin, Caggese, Calvi, Bernardo, Caspi, Castrezana, Celniker, Chang, Chapple, Chatterji, Chinwalla, Civetta, Clifton, Comeron, Costello, Coyne, Daub, David, Delcher, Delehaunty, Do, Ebling, Edwards, Eickbush, Evans, Filipski, Findeiss, Freyhult, Fulton, Fulton, Garcia, Gardiner, Garfield, Garvin, Gibson, Gilbert, Gnerre, Godfrey, Good, Gotea, Gravely, Greenberg, Griffiths-Jones, Gross, Guigo, Gustafson, Haerty, Hahn, Halligan, Halpern, Halter, Han, Heger, Hillier, Hinrichs, Holmes, Hoskins, Hubisz, Hultmark, Huntley, Jaffe, Jagadeeshan, Jeck, Johnson, Jones, Jordan, Karpen, Kataoka, Keightley, Kheradpour, Kirkness, Koerich, Kristiansen, Kudrna, Kulathinal, Kumar, Kwok, Lander, Langley, Lapoint, Lazzaro, Lee, Levesque, Li, Lin, Lin, Lindblad-Toh, Llopart, Long, Low, Lozovsky, Lu, Luo, Machado, Makalowski, Marzo, Matsuda, Matzkin, McAllister, McBride, McKernan, McKernan, Mendez-Lago, Minx, Mollenhauer, Montooth, Mount, Mu, Myers, Negre, Newfeld, Nielsen, Noor, O'Grady, Pachter, Papaceit, Parisi, Parisi, Parts, Pedersen, Pesole, Phillippy, Ponting, Pop, Porcelli, Powell, Prohaska, Pruitt, Puig, Quesneville, Ram, Rand, Rasmussen, Reed, Reenan, Reily, Remington, Rieger, Ritchie, Robin, Rogers, Rohde, Rozas, Rubenfield, Ruiz, Russo, Salzberg, Sanchez-Gracia, Saranga, Sato, Schaeffer, Schatz, Schlenke, Schwartz, Segarra, Singh, Sirot, Sirota, Sisneros, Smith, Smith, Spieth, Stage, Stark, Stephan, Strausberg, Strempel, Sturgill, Sutton, Sutton, Tao, Teichmann, Tobari, Tomimura, Tsolas, Valente, Venter, Venter, Vicario, Vieira, Vilella, Villasante, Walenz, Wang, Wasserman, Watts, Wilson, Wilson, Wing, Wolfner, Wong, Wong, Wu, Wu, Yamamoto, Yang, Yang, Yorke, Yoshida, Zdobnov, Zhang, Zhang, Zimin, Baldwin, Abdouelleil, Abdulkadir, Abebe, Abera, Abreu, Acer, Aftuck, Alexander, An, Anderson, Anderson, Arachi, Azer, Bachantsang, Barry, Bayul, Berlin, Bessette, Bloom, Blye, Boguslavskiy, Bonnet, Boukhgalter, Bourzgui, Brown, Cahill, Channer, Cheshatsang, Chuda, Citroen, Collymore, Cooke, Costello, D'Aco, Daza, De, DeGray, DeMaso, Dhargay, Dooley, Dooley, Doricent, Dorje, Dorjee, Dupes, Elong, Falk, Farina, Faro, Ferguson, Fisher, Foley, Franke, Friedrich, Gadbois, Gearin, Gearin, Giannoukos, Goode, Graham, Grandbois, Grewal, Gyaltsen, Hafez, Hagos, Hall, Henson, Hollinger, Honan, Huard, Hughes, Hurhula, Husby, Kamat, Kanga, Kashin, Khazanovich, Kisner, Lance, Lara, Lee, Lennon, Letendre, LeVine, Lipovsky, Liu, Liu, Liu, Lokyitsang, Lokyitsang, Lubonja, Lui, MacDonald, Magnisalis, Maru, Matthews, McCusker, McDonough, Mehta, Meldrim, Meneus, Mihai, Mihalev, Mihova, Mittelman, Mlenga, Montmayeur, Mulrain, Navidi, Naylor, Negash, Nguyen, Nguyen, Nicol, Norbu, Norbu, Novod, O'Neill, Osman, Markiewicz, Oyono, Patti, Phunkhang, Pierre, Priest, Raghuraman, Rege, Reyes, Rise, Rogov, Ross, Ryan, Settipalli, Shea, Sherpa, Shi, Shih, Sparrow, Spaulding, Stalker, Stange-Thomann, Stavropoulos, Stone, Strader, Tesfaye, Thomson, Thoulutsang, Thoulutsang, Topham, Topping, Tsamla, Vassiliev, Vo, Wangchuk, Wangdi, Weiand, Wilkinson, Wilson, Yadav, Young, Yu, Zembek, Zhong, Zimmer, Zwirko, Jaffe, Alvarez, Brockman, Butler, Chin, Gnerre, Grabherr, Kleber, Mauceli, MacCallum

Comparative analysis of multiple genomes in a phylogenetic framework dramatically improves the precision and sensitivity of evolutionary inference, producing more robust results than single-genome analyses can provide. The genomes of 12 Drosophila species, ten of which are presented here for the first time (sechellia, simulans, yakuba, erecta, ananassae, persimilis, willistoni, mojavensis, virilis and grimshawi), illustrate how rates and patterns of sequence divergence across taxa can illuminate evolutionary processes on a genomic scale. These genome sequences augment the formidable genetic tools that have made Drosophila melanogaster a pre-eminent model for animal genetics, and will further catalyse fundamental research on mechanisms of development, cell biology, genetics, disease, neurobiology, behaviour, physiology and evolution. Despite remarkable similarities among these Drosophila species, we identified many putatively non-neutral changes in protein-coding genes, non-coding RNA genes, and cis-regulatory regions. These may prove to underlie differences in the ecology and behaviour of these diverse species.

Nature. 2007 Nov 8;450(7167):203-18

30. Genome analysis of the platypus reveals unique signatures of evolution.

Warren, Hillier, Marshall, Birney, Ponting, Gr�tzner, Belov, Miller, Clarke, Chinwalla, Yang, Heger, Locke, Miethke, Waters, Veyrunes, Fulton, Fulton, Graves, Wallis, Puente, L�pez-Ot�n, Ord��ez, Eichler, Chen, Cheng, Deakin, Alsop, Thompson, Kirby, Papenfuss, Wakefield, Olender, Lancet, Huttley, Smit, Pask, Temple-Smith, Batzer, Walker, Konkel, Harris, Whittington, Wong, Gemmell, Buschiazzo, Vargas, Merkel, Schmitz, Zemann, Churakov, Kriegs, Brosius, Murchison, Sachidanandam, Smith, Hannon, Tsend-Ayush, McMillan, Attenborough, Rens, Ferguson-Smith, Lef�vre, Sharp, Nicholas, Ray, Kube, Reinhardt, Pringle, Taylor, Jones, Nixon, Dacheux, Niwa, Sekita, Huang, Stark, Kheradpour, Kellis, Flicek, Chen, Webber, Hardison, Nelson, Hallsworth-Pepin, Delehaunty, Markovic, Minx, Feng, Kremitzki, Mitreva, Glasscock, Wylie, Wohldmann, Thiru, Nhan, Pohl, Smith, Hou, Nefedov, de, Renfree, Mardis, Wilson

We present a draft genome sequence of the platypus, Ornithorhynchus anatinus. This monotreme exhibits a fascinating combination of reptilian and mammalian characters. For example, platypuses have a coat of fur adapted to an aquatic lifestyle; platypus females lactate, yet lay eggs; and males are equipped with venom similar to that of reptiles. Analysis of the first monotreme genome aligned these features with genetic innovations. We find that reptile and platypus venom proteins have been co-opted independently from the same gene families; milk protein genes are conserved despite platypuses laying eggs; and immune gene family expansions are directly related to platypus biology. Expansions of protein, non-protein-coding RNA and microRNA families, as well as repeat elements, are identified. Sequencing of this genome now provides a valuable resource for deep mammalian comparative analyses, as well as for monotreme biology and conservation.

Nature. 2008 May 8;453(7192):175-83

34. Evolution of pathogenicity and sexual reproduction in eight Candida genomes.

Butler, Rasmussen, Lin, Santos, Sakthikumar, Munro, Rheinbay, Grabherr, Forche, Reedy, Agrafioti, Arnaud, Bates, Brown, Brunke, Costanzo, Fitzpatrick, de, Harris, Hoyer, Hube, Klis, Kodira, Lennard, Logue, Martin, Neiman, Nikolaou, Quail, Quinn, Santos, Schmitzberger, Sherlock, Shah, Silverstein, Skrzypek, Soll, Staggs, Stansfield, Stumpf, Sudbery, Srikantha, Zeng, Berman, Berriman, Heitman, Gow, Lorenz, Birren, Kellis, Cuomo

Candida species are the most common cause of opportunistic fungal infection worldwide. Here we report the genome sequences of six Candida species and compare these and related pathogens and non-pathogens. There are significant expansions of cell wall, secreted and transporter gene families in pathogenic species, suggesting adaptations associated with virulence. Large genomic tracts are homozygous in three diploid species, possibly resulting from recent recombination events. Surprisingly, key components of the mating and meiosis pathways are missing from several species. These include major differences at the mating-type loci (MTL); Lodderomyces elongisporus lacks MTL, and components of the a1/2 cell identity determinant were lost in other species, raising questions about how mating and cell types are controlled. Analysis of the CUG leucine-to-serine genetic-code change reveals that 99% of ancestral CUG codons were erased and new ones arose elsewhere. Lastly, we revise the Candida albicans gene catalogue, identifying many new genes.

Nature. 2009 Jun 4;459(7247):657-62

Small regulatory RNAs (and large regulatory RNAs)

26. A single Hox locus in Drosophila produces functional microRNAs from opposite DNA strands.

Stark, Bushati, Jan, Kheradpour, Hodges, Brennecke, Bartel, Cohen, Kellis

MicroRNAs (miRNAs) are approximately 22-nucleotide RNAs that are processed from characteristic precursor hairpins and pair to sites in messages of protein-coding genes to direct post-transcriptional repression. Here, we report that the miRNA iab-4 locus in the Drosophila Hox cluster is transcribed convergently from both DNA strands, giving rise to two distinct functional miRNAs. Both sense and antisense miRNA products target neighboring Hox genes via highly conserved sites, leading to homeotic transformations when ectopically expressed. We also report sense/antisense miRNAs in mouse and find antisense transcripts close to many miRNAs in both flies and mammals, suggesting that additional sense/antisense pairs exist.

Genes Dev. 2008 Jan 1;22(1):8-13

19. Systematic discovery and characterization of fly microRNAs using 12 Drosophila genomes.

Stark, Kheradpour, Parts, Brennecke, Hodges, Hannon, Kellis

MicroRNAs (miRNAs) are short regulatory RNAs that inhibit target genes by complementary binding in 3' untranslated regions (3' UTRs). They are one of the most abundant classes of regulators, targeting a large fraction of all genes, making their comprehensive study a requirement for understanding regulation and development. Here we use 12 Drosophila genomes to define structural and evolutionary signatures of miRNA hairpins, which we use for their de novo discovery. We predict >41 novel miRNA genes, which encompass many unique families, and 28 of which are validated experimentally. We also define signals for the precise start position of mature miRNAs, which suggest corrections of previously known miRNAs, often leading to drastic changes in their predicted target spectrum. We show that miRNA discovery power scales with the number and divergence of species compared, suggesting that such approaches can be successful in human as dozens of mammalian genomes become available. Interestingly, for some miRNAs sense and anti-sense hairpins score highly and mature miRNAs from both strands can indeed be found in vivo. Similarly, miRNAs with weak 5' end predictions show increased in vivo processing of multiple alternate 5' ends and have fewer predicted targets. Lastly, we show that several miRNA star sequences score highly and are likely functional. For mir-10 in particular, both arms show abundant processing, and both show highly conserved target sites in Hox genes, suggesting a possible cooperation of the two arms, and their role as a master Hox regulator.

Genome Res. 2007 Dec;17(12):1865-79. Epub 2007 Nov 7

11. Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals.

Xie, Lu, Kulbokas, Golub, Mootha, Lindblad-Toh, Lander, Kellis

Comprehensive identification of all functional elements encoded in the human genome is a fundamental need in biomedical research. Here, we present a comparative analysis of the human, mouse, rat and dog genomes to create a systematic catalogue of common regulatory motifs in promoters and 3' untranslated regions (3' UTRs). The promoter analysis yields 174 candidate motifs, including most previously known transcription-factor binding sites and 105 new motifs. The 3'-UTR analysis yields 106 motifs likely to be involved in post-transcriptional regulation. Nearly one-half are associated with microRNAs (miRNAs), leading to the discovery of many new miRNA genes and their likely target genes. Our results suggest that previous estimates of the number of human miRNA genes were low, and that miRNAs regulate at least 20% of human genes. The overall results provide a systematic view of gene regulation in the human, which will be refined as additional mammalian genomes become available.

Nature. 2005 Mar 17;434(7031):338-45. Epub 2005 Feb 27

31. An endogenous small interfering RNA pathway in Drosophila.

Czech, Malone, Zhou, Stark, Schlingeheyde, Dus, Perrimon, Kellis, Wohlschlegel, Sachidanandam, Hannon, Brennecke

Drosophila endogenous small RNAs are categorized according to their mechanisms of biogenesis and the Argonaute protein to which they bind. MicroRNAs are a class of ubiquitously expressed RNAs of approximately 22 nucleotides in length, which arise from structured precursors through the action of Drosha-Pasha and Dicer-1-Loquacious complexes. These join Argonaute-1 to regulate gene expression. A second endogenous small RNA class, the Piwi-interacting RNAs, bind Piwi proteins and suppress transposons. Piwi-interacting RNAs are restricted to the gonad, and at least a subset of these arises by Piwi-catalysed cleavage of single-stranded RNAs. Here we show that Drosophila generates a third small RNA class, endogenous small interfering RNAs, in both gonadal and somatic tissues. Production of these RNAs requires Dicer-2, but a subset depends preferentially on Loquacious rather than the canonical Dicer-2 partner, R2D2 (ref. 14). Endogenous small interfering RNAs arise both from convergent transcription units and from structured genomic loci in a tissue-specific fashion. They predominantly join Argonaute-2 and have the capacity, as a class, to target both protein-coding genes and mobile elements. These observations expand the repertoire of small RNAs in Drosophila, adding a class that blurs distinctions based on known biogenesis mechanisms and functional roles.

Nature. 2008 Jun 5;453(7196):798-802. Epub 2008 May 7

18. Evolution, biogenesis, expression, and target predictions of a substantially expanded set of Drosophila microRNAs.

Ruby, Stark, Johnston, Kellis, Bartel, Lai

MicroRNA (miRNA) genes give rise to small regulatory RNAs in a wide variety of organisms. We used computational methods to predict miRNAs conserved among Drosophila species and large-scale sequencing of small RNAs from Drosophila melanogaster to experimentally confirm and complement these predictions. In addition to validating 20 of our top 45 predictions for novel miRNA loci, the large-scale sequencing identified many miRNAs that had not been predicted. In total, 59 novel genes were identified, increasing our tally of confirmed fly miRNAs to 148. The large-scale sequencing also refined the identities of previously known miRNAs and provided insights into their biogenesis and expression. Many miRNAs were expressed in particular developmental contexts, with a large cohort of miRNAs expressed primarily in imaginal discs. Conserved miRNAs typically were expressed more broadly and robustly than were nonconserved miRNAs, and those conserved miRNAs with more restricted expression tended to have fewer predicted targets than those expressed more broadly. Predicted targets for the expanded set of microRNAs substantially increased and revised the miRNA-target relationships that appear conserved among the fly species. Insights were also provided into miRNA gene evolution, including evidence for emergent regulatory function deriving from the opposite arm of the miRNA hairpin, exemplified by mir-10, and even the opposite strand of the DNA, exemplified by mir-iab-4.

Genome Res. 2007 Dec;17(12):1850-64. Epub 2007 Nov 7

15. Discrete small RNA-generating loci as master regulators of transposon activity in Drosophila.

Brennecke, Aravin, Stark, Dus, Kellis, Sachidanandam, Hannon

Drosophila Piwi-family proteins have been implicated in transposon control. Here, we examine piwi-interacting RNAs (piRNAs) associated with each Drosophila Piwi protein and find that Piwi and Aubergine bind RNAs that are predominantly antisense to transposons, whereas Ago3 complexes contain predominantly sense piRNAs. As in mammals, the majority of Drosophila piRNAs are derived from discrete genomic loci. These loci comprise mainly defective transposon sequences, and some have previously been identified as master regulators of transposon activity. Our data suggest that heterochromatic piRNA loci interact with potentially active, euchromatic transposons to form an adaptive system for transposon control. Complementary relationships between sense and antisense piRNA populations suggest an amplification loop wherein each piRNA-directed cleavage event generates the 5' end of a new piRNA. Thus, sense piRNAs, formed following cleavage of transposon mRNAs may enhance production of antisense piRNAs, complementary to active elements, by directing cleavage of transcripts from master control loci.

Cell. 2007 Mar 23;128(6):1089-103. Epub 2007 Mar 8

29. Conservation of small RNA pathways in platypus.

Murchison, Kheradpour, Sachidanandam, Smith, Hodges, Xuan, Kellis, Gr�tzner, Stark, Hannon

Small RNA pathways play evolutionarily conserved roles in gene regulation and defense from parasitic nucleic acids. The character and expression patterns of small RNAs show conservation throughout animal lineages, but specific animal clades also show variations on these recurring themes, including species-specific small RNAs. The monotremes, with only platypus and four species of echidna as extant members, represent the basal branch of the mammalian lineage. Here, we examine the small RNA pathways of monotremes by deep sequencing of six platypus and echidna tissues. We find that highly conserved microRNA species display their signature tissue-specific expression patterns. In addition, we find a large rapidly evolving cluster of microRNAs on platypus chromosome X1, which is unique to monotremes. Platypus and echidna testes contain a robust Piwi-interacting (piRNA) system, which appears to be participating in ongoing transposon defense.

Genome Res. 2008 Jun;18(6):995-1004. Epub 2008 May 7

38. The Tasmanian devil transcriptome reveals Schwann cell origins of a clonally transmissible cancer.

Murchison, Tovar, Hsu, Bender, Kheradpour, Rebbeck, Obendorf, Conlan, Bahlo, Blizzard, Pyecroft, Kreiss, Kellis, Stark, Harkins, Marshall, Woods, Hannon, Papenfuss

The Tasmanian devil, a marsupial carnivore, is endangered because of the emergence of a transmissible cancer known as devil facial tumor disease (DFTD). This fatal cancer is clonally derived and is an allograft transmitted between devils by biting. We performed a large-scale genetic analysis of DFTD with microsatellite genotyping, a mitochondrial genome analysis, and deep sequencing of the DFTD transcriptome and microRNAs. These studies confirm that DFTD is a monophyletic clonally transmissible tumor and suggest that the disease is of Schwann cell origin. On the basis of these results, we have generated a diagnostic marker for DFTD and identify a suite of genes relevant to DFTD pathology and transmission. We provide a genomic data set for the Tasmanian devil that is applicable to cancer diagnosis, disease evolution, and conservation biology.

Science. 2010 Jan 1;327(5961):84-7

Protein-coding genes

17. Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes.

Lin, Carlson, Crosby, Matthews, Yu, Park, Wan, Schroeder, Gramates, St, Roark, Wiley, Kulathinal, Zhang, Myrick, Antone, Celniker, Gelbart, Kellis

The availability of sequenced genomes from 12 Drosophila species has enabled the use of comparative genomics for the systematic discovery of functional elements conserved within this genus. We have developed quantitative metrics for the evolutionary signatures specific to protein-coding regions and applied them genome-wide, resulting in 1193 candidate new protein-coding exons in the D. melanogaster genome. We have reviewed these predictions by manual curation and validated a subset by directed cDNA screening and sequencing, revealing both new genes and new alternative splice forms of known genes. We also used these evolutionary signatures to evaluate existing gene annotations, resulting in the validation of 87% of genes lacking descriptive names and identifying 414 poorly conserved genes that are likely to be spurious predictions, noncoding, or species-specific genes. Furthermore, our methods suggest a variety of refinements to hundreds of existing gene models, such as modifications to translation start codons and exon splice boundaries. Finally, we performed directed genome-wide searches for unusual protein-coding structures, discovering 149 possible examples of stop codon readthrough, 125 new candidate ORFs of polycistronic mRNAs, and several candidate translational frameshifts. These results affect >10% of annotated fly genes and demonstrate the power of comparative genomics to enhance our understanding of genome organization, even in a model organism as intensively studied as Drosophila melanogaster.

Genome Res. 2007 Dec;17(12):1823-36. Epub 2007 Nov 7

25. Distinguishing protein-coding and noncoding genes in the human genome.

Clamp, Fry, Kamal, Xie, Cuff, Lin, Kellis, Lindblad-Toh, Lander

Although the Human Genome Project was completed 4 years ago, the catalog of human protein-coding genes remains a matter of controversy. Current catalogs list a total of approximately 24,500 putative protein-coding genes. It is broadly suspected that a large fraction of these entries are functionally meaningless ORFs present by chance in RNA transcripts, because they show no evidence of evolutionary conservation with mouse or dog. However, there is currently no scientific justification for excluding ORFs simply because they fail to show evolutionary conservation: the alternative hypothesis is that most of these ORFs are actually valid human genes that reflect gene innovation in the primate lineage or gene loss in the other lineages. Here, we reject this hypothesis by carefully analyzing the nonconserved ORFs-specifically, their properties in other primates. We show that the vast majority of these ORFs are random occurrences. The analysis yields, as a by-product, a major revision of the current human catalogs, cutting the number of protein-coding genes to approximately 20,500. Specifically, it suggests that nonconserved ORFs should be added to the human gene catalog only if there is clear evidence of an encoded protein. It also provides a principled methodology for evaluating future proposed additions to the human gene catalog. Finally, the results indicate that there has been relatively little true innovation in mammalian protein-coding genes.

Proc Natl Acad Sci U S A. 2007 Dec 4;104(49):19428-33. Epub 2007 Nov 26

2. Sequencing and comparison of yeast species to identify genes and regulatory elements.

Kellis, Patterson, Endrizzi, Birren, Lander

Nature. 2003 May 15;423(6937):241-54

28. Performance and scalability of discriminative metrics for comparative gene identification in 12 Drosophila genomes.

Lin, Deoras, Rasmussen, Kellis

Comparative genomics of multiple related species is a powerful methodology for the discovery of functional genomic elements, and its power should increase with the number of species compared. Here, we use 12 Drosophila genomes to study the power of comparative genomics metrics to distinguish between protein-coding and non-coding regions. First, we study the relative power of different comparative metrics and their relationship to single-species metrics. We find that even relatively simple multi-species metrics robustly outperform advanced single-species metrics, especially for shorter exons (< or =240 nt), which are common in animal genomes. Moreover, the two capture largely independent features of protein-coding genes, with different sensitivity/specificity trade-offs, such that their combinations lead to even greater discriminatory power. In addition, we study how discovery power scales with the number and phylogenetic distance of the genomes compared. We find that species at a broad range of distances are comparably effective informants for pairwise comparative gene identification, but that these are surpassed by multi-species comparisons at similar evolutionary divergence. In particular, while pairwise discovery power plateaued at larger distances and never outperformed the most advanced single-species metrics, multi-species comparisons continued to benefit even from the most distant species with no apparent saturation. Last, we find that genes in functional categories typically considered fast-evolving can nonetheless be recovered at very high rates using comparative methods. Our results have implications for comparative genomics analyses in any species, including the human.

PLoS Comput Biol. 2008 Apr 18;4(4):e1000067

35. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.

Pruitt, Harrow, Harte, Wallin, Diekhans, Maglott, Searle, Farrell, Loveland, Ruef, Hart, Suner, Landrum, Aken, Ayling, Baertsch, Fernandez-Banet, Cherry, Curwen, Dicuccio, Kellis, Lee, Lin, Schuster, Shkeda, Amid, Brown, Dukhanina, Frankish, Hart, Maidak, Mudge, Murphy, Murphy, Rajan, Rajput, Riddick, Snow, Steward, Webb, Weber, Wilming, Wu, Birney, Haussler, Hubbard, Ostell, Durbin, Lipman

Effective use of the human and mouse genomes requires reliable identification of genes and their products. Although multiple public resources provide annotation, different methods are used that can result in similar but not identical representation of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID), and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates on manually reviewing inconsistent protein annotations between sites, as well as annotations for which new evidence suggests a revision is needed, to progressively converge on a complete protein-coding set for the human and mouse reference genomes, while maintaining a high standard of reliability and biological accuracy. To date, the project has identified 20,159 human and 17,707 mouse consensus coding regions from 17,052 human and 16,893 mouse genes. Three evaluation methods indicate that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups not included in CCDS. The CCDS database thus centralizes the function of identifying well-supported, identically-annotated, protein-coding regions.

Genome Res. 2009 Jul;19(7):1316-23. Epub 2009 Jun 4

43. Optimization of parameters for coverage of low molecular weight proteins.

M�ller, Kohajda, Findei�, Stadler, Washietl, Kellis, vonBergen, Kalkhof

Proteins with molecular weights of <25 kDa are involved in major biological processes such as ribosome formation, stress adaption (e.g., temperature reduction) and cell cycle control. Despite their importance, the coverage of smaller proteins in standard proteome studies is rather sparse. Here we investigated biochemical and mass spectrometric parameters that influence coverage and validity of identification. The underrepresentation of low molecular weight (LMW) proteins may be attributed to the low numbers of proteolytic peptides formed by tryptic digestion as well as their tendency to be lost in protein separation and concentration/desalting procedures. In a systematic investigation of the LMW proteome of Escherichia coli, a total of 455 LMW proteins (27% of the 1672 listed in the SwissProt protein database) were identified, corresponding to a coverage of 62% of the known cytosolic LMW proteins. Of these proteins, 93 had not yet been functionally classified, and five had not previously been confirmed at the protein level. In this study, the influences of protein extraction (either urea or TFA), proteolytic digestion (solely, and the combined usage of trypsin and AspN as endoproteases) and protein separation (gel- or non-gel-based) were investigated. Compared to the standard procedure based solely on the use of urea lysis buffer, in-gel separation and tryptic digestion, the complementary use of TFA for extraction or endoprotease AspN for proteolysis permits the identification of an extra 72 (32%) and 51 proteins (23%), respectively. Regarding mass spectrometry analysis with an LTQ Orbitrap mass spectrometer, collision-induced fragmentation (CID and HCD) and electron transfer dissociation using the linear ion trap (IT) or the Orbitrap as the analyzer were compared. IT-CID was found to yield the best identification rate, whereas IT-ETD provided almost comparable results in terms of LMW proteome coverage. The high overlap between the proteins identified with IT-CID and IT-ETD allowed the validation of 75% of the identified proteins using this orthogonal fragmentation technique. Furthermore, a new approach to evaluating and improving the completeness of protein databases that utilizes the program RNAcode was introduced and examined.

Anal Bioanal Chem. 2010 Aug 28. [Epub ahead of print]

46. Locating protein-coding sequences under purifying selection for additional, overlapping functions in 29 mammalian genomes

Lin, Washietl, Kheradpour, Parker, Pedersen, Kellis,

The degeneracy of the genetic code allows protein-coding DNA and RNA sequences to simultaneously encode additional, overlapping functional elements. A sequence in which both protein-coding and additional overlapping functions have evolved under purifying selection should show increased evolutionary conservation compared to typical protein-coding genes - especially at synonymous sites. In this study, we use genome alignments of 29 placental mammals to systematically locate short regions within human ORFs that show conspicuously low estimated rates of synonymous substitution across these species. The 29-species alignment provides statistical power to locate more than 10,000 such regions with resolution down to nine-codon windows, which are found within more than a quarter of all human protein-coding genes and contain ~2% of their synonymous sites. We collect numerous lines of evidence that the observed synonymous constraint in these regions reflects selection on overlapping functional elements including splicing regulatory elements, dual-coding genes, RNA secondary structures, microRNA target sites, and developmental enhancers. Our results show that overlapping functional elements are common in mammalian genes, despite the vast genomic landscape.

Genome Research, 14 pages, in press.

Chromatin, regulation, and developmental programs

32. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals.

Guttman, Amit, Garber, French, Lin, Feldser, Huarte, Zuk, Carey, Cassady, Cabili, Jaenisch, Mikkelsen, Jacks, Hacohen, Bernstein, Kellis, Regev, Rinn, Lander

There is growing recognition that mammalian cells produce many thousands of large intergenic transcripts. However, the functional significance of these transcripts has been particularly controversial. Although there are some well-characterized examples, most (>95%) show little evidence of evolutionary conservation and have been suggested to represent transcriptional noise. Here we report a new approach to identifying large non-coding RNAs using chromatin-state maps to discover discrete transcriptional units intervening known protein-coding loci. Our approach identified approximately 1,600 large multi-exonic RNAs across four mouse cell types. In sharp contrast to previous collections, these large intervening non-coding RNAs (lincRNAs) show strong purifying selection in their genomic loci, exonic sequences and promoter regions, with greater than 95% showing clear evolutionary conservation. We also developed a functional genomics approach that assigns putative functions to each lincRNA, demonstrating a diverse range of roles for lincRNAs in processes from embryonic stem cell pluripotency to cell proliferation. We obtained independent functional validation for the predictions for over 100 lincRNAs, using cell-based assays. In particular, we demonstrate that specific lincRNAs are transcriptionally regulated by key transcription factors in these processes such as p53, NFkappaB, Sox2, Oct4 (also known as Pou5f1) and Nanog. Together, these results define a unique collection of functional lincRNAs that are highly conserved and implicated in diverse biological processes.

Nature. 2009 Mar 12;458(7235):223-7. Epub 2009 Feb 1

24. RNA polymerase stalling at developmental control genes in the Drosophila melanogaster embryo.

Zeitlinger, Stark, Kellis, Hong, Nechaev, Adelman, Levine, Young

It is widely assumed that the key rate-limiting step in gene activation is the recruitment of RNA polymerase II (Pol II) to the core promoter. Although there are well-documented examples in which Pol II is recruited to a gene but stalls, a general role for Pol II stalling in development has not been established. We have carried out comprehensive Pol II chromatin immunoprecipitation microarray (ChIP-chip) assays in Drosophila embryos and identified three distinct Pol II binding behaviors: active (uniform binding across the entire transcription unit), no binding, and stalled (binding at the transcription start site). The notable feature of the approximately 10% genes that are stalled is that they are highly enriched for developmental control genes, which are either repressed or poised for activation during later stages of embryogenesis. We propose that Pol II stalling facilitates rapid temporal and spatial changes in gene activity during development.

Nat Genet. 2007 Dec;39(12):1512-6. Epub 2007 Nov 11

14. Whole-genome ChIP-chip analysis of Dorsal, Twist, and Snail suggests integration of diverse patterning processes in the Drosophila embryo.

Zeitlinger, Zinzen, Stark, Kellis, Zhang, Young, Levine

Genetic studies have identified numerous sequence-specific transcription factors that control development, yet little is known about their in vivo distribution across animal genomes. We determined the genome-wide occupancy of the dorsoventral (DV) determinants Dorsal, Twist, and Snail in the Drosophila embryo using chromatin immunoprecipitation coupled with microarray analysis (ChIP-chip). The in vivo binding of these proteins correlate tightly with the limits of known enhancers. Our analysis predicts substantially more target genes than previous estimates, and includes Dpp signaling components and anteroposterior (AP) segmentation determinants. Thus, the ChIP-chip data uncover a much larger than expected regulatory network, which integrates diverse patterning processes during development.

Genes Dev. 2007 Feb 15;21(4):385-90

8. Transcriptional regulatory code of a eukaryotic genome.

Harbison, Gordon, Lee, Rinaldi, Macisaac, Danford, Hannett, Tagne, Reynolds, Yoo, Jennings, Zeitlinger, Pokholok, Kellis, Rolfe, Takusagawa, Lander, Gifford, Fraenkel, Young

DNA-binding transcriptional regulators interpret the genome's regulatory code by binding to specific sequences to induce or repress gene expression. Comparative genomics has recently been used to identify potential cis-regulatory sequences within the yeast genome on the basis of phylogenetic conservation, but this information alone does not reveal if or when transcriptional regulators occupy these binding sites. We have constructed an initial map of yeast's transcriptional regulatory code by identifying the sequence elements that are bound by regulators under various conditions and that are conserved among Saccharomyces species. The organization of regulatory elements in promoters and the environment-dependent use of these elements by regulators are discussed. We find that environment-specific use of regulatory elements predicts mechanistic models for the function of a large population of yeast's transcriptional regulators.

Nature. 2004 Sep 2;431(7004):99-104

33. Histone modifications at human enhancers reflect global cell-type-specific gene expression.

Heintzman, Hon, Hawkins, Kheradpour, Stark, Harp, Ye, Lee, Stuart, Ching, Ching, Antosiewicz-Bourget, Liu, Zhang, Green, Lobanenkov, Stewart, Thomson, Crawford, Kellis, Ren

The human body is composed of diverse cell types with distinct functions. Although it is known that lineage specification depends on cell-specific gene expression, which in turn is driven by promoters, enhancers, insulators and other cis-regulatory DNA sequences for each gene, the relative roles of these regulatory elements in this process are not clear. We have previously developed a chromatin-immunoprecipitation-based microarray method (ChIP-chip) to locate promoters, enhancers and insulators in the human genome. Here we use the same approach to identify these elements in multiple cell types and investigate their roles in cell-type-specific gene expression. We observed that the chromatin state at promoters and CTCF-binding at insulators is largely invariant across diverse cell types. In contrast, enhancers are marked with highly cell-type-specific histone modification patterns, strongly correlate to cell-type-specific gene expression programs on a global scale, and are functionally active in a cell-type-specific manner. Our results define over 55,000 potential transcriptional enhancers in the human genome, significantly expanding the current catalogue of human enhancers and highlighting the role of these elements in cell-type-specific gene expression.

Nature. 2009 May 7;459(7243):108-12. Epub 2009 Mar 18

36. Unlocking the secrets of the genome.

Celniker, Dillon, Gerstein, Gunsalus, Henikoff, Karpen, Kellis, Lai, Lieb, MacAlpine, Micklem, Piano, Snyder, Stein, White, Waterston

Despite the successes of genomics, little is known about how genetic information produces complex organisms. A look at the crucial functional elements of fly and worm genomes could change that.

Nature. 2009 Jun 18;459(7249):927-30

39. A comprehensive map of insulator elements for the Drosophila genome.

N�gre, Brown, Shah, Kheradpour, Morrison, Henikoff, Feng, Ahmad, Russell, White, Stein, Henikoff, Kellis, White

Insulators are DNA sequences that control the interactions among genomic regulatory elements and act as chromatin boundaries. A thorough understanding of their location and function is necessary to address the complexities of metazoan gene regulation. We studied by ChIP-chip the genome-wide binding sites of 6 insulator-associated proteins-dCTCF, CP190, BEAF-32, Su(Hw), Mod(mdg4), and GAF-to obtain the first comprehensive map of insulator elements in Drosophila embryos. We identify over 14,000 putative insulators, including all classically defined insulators. We find two major classes of insulators defined by dCTCF/CP190/BEAF-32 and Su(Hw), respectively. Distributional analyses of insulators revealed that particular sub-classes of insulator elements are excluded between cis-regulatory elements and their target promoters; divide differentially expressed, alternative, and divergent promoters; act as chromatin boundaries; are associated with chromosomal breakpoints among species; and are embedded within active chromatin domains. Together, these results provide a map demarcating the boundaries of gene regulatory units and a framework for understanding insulator function during the development and evolution of Drosophila.

PLoS Genet. 2010 Jan 15;6(1):e1000814

42. Discovery and characterization of chromatin states for systematic annotation of the human genome.

Ernst, Kellis

A plethora of epigenetic modifications have been described in the human genome and shown to play diverse roles in gene regulation, cellular differentiation and the onset of disease. Although individual modifications have been linked to the activity levels of various genetic functional elements, their combinatorial patterns are still unresolved and their potential for systematic de novo genome annotation remains untapped. Here, we use a multivariate Hidden Markov Model to reveal 'chromatin states' in human T cells, based on recurrent and spatially coherent combinations of chromatin marks. We define 51 distinct chromatin states, including promoter-associated, transcription-associated, active intergenic, large-scale repressed and repeat-associated states. Each chromatin state shows specific enrichments in functional annotations, sequence motifs and specific experimentally observed characteristics, suggesting distinct biological roles. This approach provides a complementary functional annotation of the human genome that reveals the genome-wide locations of diverse classes of epigenetic function.

Nat Biotechnol. 2010 Aug;28(8):817-25. Epub 2010 Jul 25

44. The NIH Roadmap Epigenomics Mapping Consortium

Bernstein, Stamatoyannopoulos, Costello, Ren, Milosavljevic, Meissner, Kellis, Marra, Beaudet, Ecker, Farnham, Hirst, Lander, Mikkelsen, Thomson,

The NIH Roadmap Epigenomics Mapping Consortium aims to produce a public resource of epigenomic maps for stem cells and primary ex vivo tissues selected to represent the normal counterparts of tissues and organ systems frequently involved in human disease. The maps will detail the genomewide landscapes of DNA methylation, histone modifications and related chromatin features. They are intended to provide a reference for studies of the genetic and epigenetic events that underlie human development, diversity and disease. Here we describe the organizational structure, goals and anticipated deliverables of the consortium.

Nature Biotechnology, 3 pages, in press

Regulatory Motifs

11. Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals.

Xie, Lu, Kulbokas, Golub, Mootha, Lindblad-Toh, Lander, Kellis

Nature. 2005 Mar 17;434(7031):338-45. Epub 2005 Feb 27

2. Sequencing and comparison of yeast species to identify genes and regulatory elements.

Kellis, Patterson, Endrizzi, Birren, Lander

Nature. 2003 May 15;423(6937):241-54

16. Systematic discovery of regulatory motifs in conserved regions of the human genome, including thousands of CTCF insulator sites.

Xie, Mikkelsen, Gnirke, Lindblad-Toh, Kellis, Lander

Conserved noncoding elements (CNEs) constitute the majority of sequences under purifying selection in the human genome, yet their function remains largely unknown. Experimental evidence suggests that many of these elements play regulatory roles, but little is known about regulatory motifs contained within them. Here we describe a systematic approach to discover and characterize regulatory motifs within mammalian CNEs by searching for long motifs (12-22 nt) with significant enrichment in CNEs and studying their biochemical and genomic properties. Our analysis identifies 233 long motifs (LMs), matching a total of approximately 60,000 conserved instances across the human genome. These motifs include 16 previously known regulatory elements, such as the histone 3'-UTR motif and the neuron-restrictive silencer element, as well as striking examples of novel functional elements. The most highly enriched motif (LM1) corresponds to the X-box motif known from yeast and nematode. We show that it is bound by the RFX1 protein and identify thousands of conserved motif instances, suggesting a broad role for the RFX family in gene regulation. A second group of motifs (LM2*) does not match any previously known motif. We demonstrate by biochemical and computational methods that it defines a binding site for the CTCF protein, which is involved in insulator function to limit the spread of gene activation. We identify nearly 15,000 conserved sites that likely serve as insulators, and we show that nearby genes separated by predicted CTCF sites show markedly reduced correlation in gene expression. These sites may thus partition the human genome into domains of expression.

Proc Natl Acad Sci U S A. 2007 Apr 24;104(17):7145-50. Epub 2007 Apr 18

C03. Phylogenetically and Spatially Conserved Word Pairs Associated with Gene-Expression Changes in Yeasts

Chiang, Moses, Kellis, Lander, Eisen

Transcriptional regulation in eukaryotes is often multifactorial, involving multiple transcription factors binding to the same transcription control region (e.g., upstream activating sequences and enhancers), and to understand the regulatory content of eukaryotic genomes it is necessary to consider the cooccurrence and spatial relationships of individual binding sites. The identification of sequences conserved among related species (often known as phylogenetic footprinting) has been successfully used to identify individual transcription factor binding sites. Here, we extend this concept of functional conservation to higher-order features of transcription control regions involved in the multifactorial control of gene expression. We used the genome sequences of four yeast species of the genus Saccharomyces to identify sequences potentially involved in multifactorial control of gene expression. We found 1,117 potential regulatory "templates": pairs of hexameric sequences that are jointly conserved in transcription regulatory regions and also exhibit non-random relative spacing. Many of the individual sequences in these templates correspond to known transcription factor binding sites, and the sets of genes containing a particular template in their transcription control regions tend to be differentially expressed in conditions where the corresponding transcription factors are known to be active. The incorporation of both joint conservation and spacing constraints of sequence pairs predicts groups of target genes that were specific for common patterns of gene expression. Our work suggests that positional information, especially the relative spacing between transcription factor binding sites, may represent a common organizing principle of transcription control regions.

ACM RECOMB, p. 84-93, Apr 13, 2003

4. Position specific variation in the rate of evolution in transcription factor binding sites.

Moses, Chiang, Kellis, Lander, Eisen

The binding sites of sequence specific transcription factors are an important and relatively well-understood class of functional non-coding DNAs. Although a wide variety of experimental and computational methods have been developed to characterize transcription factor binding sites, they remain difficult to identify. Comparison of non-coding DNA from related species has shown considerable promise in identifying these functional non-coding sequences, even though relatively little is known about their evolution. Here we analyse the genome sequences of the budding yeasts Saccharomyces cerevisiae, S. bayanus, S. paradoxus and S. mikatae to study the evolution of transcription factor binding sites. As expected, we find that both experimentally characterized and computationally predicted binding sites evolve slower than surrounding sequence, consistent with the hypothesis that they are under purifying selection. We also observe position-specific variation in the rate of evolution within binding sites. We find that the position-specific rate of evolution is positively correlated with degeneracy among binding sites within S. cerevisiae. We test theoretical predictions for the rate of evolution at positions where the base frequencies deviate from background due to purifying selection and find reasonable agreement with the observed rates of evolution. Finally, we show how the evolutionary characteristics of real binding motifs can be used to distinguish them from artefacts of computational motif finding algorithms. CONCLUSION: As has been observed for protein sequences, the rate of evolution in transcription factor binding sites varies with position, suggesting that some regions are under stronger functional constraint than others. This variation likely reflects the varying importance of different positions in the formation of the protein-DNA complex. The characterization of the pattern of evolution in known binding sites will likely contribute to the effective use of comparative sequence data in the identification of transcription factor binding sites and is an important step toward understanding the evolution of functional non-coding DNA.

BMC Evol Biol. 2003 Aug 28;3:19. Epub 2003 Aug 28

37. Motif Discovery in Physiological Datasets: A Methodology for Inferring Predictive Elements.

Syed, Stultz, Kellis, Indyk, Guttag

In this article, we propose a methodology for identifying predictive physiological patterns in the absence of prior knowledge. We use the principle of conservation to identify activity that consistently precedes an outcome in patients, and describe a two-stage process that allows us to efficiently search for such patterns in large datasets. This involves first transforming continuous physiological signals from patients into symbolic sequences, and then searching for patterns in these reduced representations that are strongly associated with an outcome.Our strategy of identifying conserved activity that is unlikely to have occurred purely by chance in symbolic data is analogous to the discovery of regulatory motifs in genomic datasets. We build upon existing work in this area, generalizing the notion of a regulatory motif and enhancing current techniques to operate robustly on non-genomic data. We also address two significant considerations associated with motif discovery in general: computational efficiency and robustness in the presence of degeneracy and noise. To deal with these issues, we introduce the concept of active regions and new subset-based techniques such as a two-layer Gibbs sampling algorithm. These extensions allow for a framework for information inference, where precursors are identified as approximately conserved activity of arbitrary complexity preceding multiple occurrences of an event.We evaluated our solution on a population of patients who experienced sudden cardiac death and attempted to discover electrocardiographic activity that may be associated with the endpoint of death. To assess the predictive patterns discovered, we compared likelihood scores for motifs in the sudden death population against control populations of normal individuals and those with non-fatal supraventricular arrhythmias. Our results suggest that predictive motif discovery may be able to identify clinically relevant information even in the absence of significant prior knowledge.

ACM Trans Knowl Discov Data. 2010 Jan;4(1):2

Biological Networks

13. Network motif discovery using motif enumeration and symmetry conditions

Grochow, Kellis

The study of biological networks and network motifs can yield significant new insights into systems biology. Previous methods of discovering network motifs - network-centric subgraph enumeration and sampling - have been limited to motifs of 6 to 8 nodes, revealing only the smallest network components. New methods are necessary to identify larger network sub-structures and functional motifs. Here we present a novel algorithm for discovering large network motifs that achieves these goals, based on a novel symmetry-breaking technique, which eliminates repeated isomorphism testing, leading to an exponential speed-up over previous methods. This technique is made possible by reversing the traditional network-based search at the heart of the algorithm to a motif-based search, which also eliminates the need to store all motifs of a given size and enables parallelization and scaling. Additionally, our method enables us to study the clustering properties of discovered motifs, revealing even larger network elements. We apply this algorithm to the protein-protein interaction network and transcription regulatory network of S. cerevisiae, and discover several large network motifs, which were previously inaccessible to existing methods, including a 29-node cluster of 15-node motifs corresponding to the key transcription machinery of S. cerevisiae.

Lecture Notes in Bioinformatics, 2007

27. The evolutionary dynamics of the Saccharomyces cerevisiae protein interaction network after duplication.

Presser, Elowitz, Kellis, Kishony

Gene duplication is an important mechanism in the evolution of protein interaction networks. Duplications are followed by the gain and loss of interactions, rewiring the network at some unknown rate. Because rewiring is likely to change the distribution of network motifs within the duplicated interaction set, it should be possible to study network rewiring by tracking the evolution of these motifs. We have developed a mathematical framework that, together with duplication data from comparative genomic and proteomic studies, allows us to infer the connectivity of the preduplication network and the changes in connectivity over time. We focused on the whole-genome duplication (WGD) event in Saccharomyces cerevisiae. The model allowed us to predict the frequency of intergene interaction before WGD and the post duplication probabilities of interaction gain and loss. We find that the predicted frequency of self-interactions in the preduplication network is significantly higher than that observed in today's network. This could suggest a structural difference between the modern and ancestral networks, preferential addition or retention of interactions between ohnologs, or selective pressure to preserve duplicates of self-interacting proteins.

Proc Natl Acad Sci U S A. 2008 Jan 22;105(3):950-4. Epub 2008 Jan 16

20. Reliable prediction of regulator targets using 12 Drosophila genomes.

Kheradpour, Stark, Roy, Kellis

Gene expression is regulated pre- and post-transcriptionally via cis-regulatory DNA and RNA motifs. Identification of individual functional instances of such motifs in genome sequences is a major goal for inferring regulatory networks yet has been hampered due to the motifs' short lengths that lead to many chance matches and poor signal-to-noise ratios. In this paper, we develop a general methodology for the comparative identification of functional motif instances across many related species, using a phylogenetic framework that accounts for the evolutionary relationships between species, allows for motif movements, and is robust against missing data due to artifacts in sequencing, assembly, or alignment. We also provide a robust statistical framework for evaluating motif confidence, which enables us to translate evolutionary conservation into a confidence measure for each motif instance, correcting for varying motif length, composition, and background conservation of the target regions. We predict targets of fly transcription factors and miRNAs in alignments of 12 recently sequenced Drosophila species. When compared to extensive genome-wide experimental data, predicted targets are of high quality, matching and surpassing ChIP-chip microarrays and recovering miRNA targets with high sensitivity. The resulting regulatory network suggests significant redundancy between pre- and post-transcriptional regulation of gene expression.

Genome Res. 2007 Dec;17(12):1919-31. Epub 2007 Nov 7

45. SubMAP: Aligning metabolic pathways with subnetwork mappings

Ay, Kellis, Kahveciy,

We consider the problem of aligning two metabolic pathways. Unlike traditional approaches, we do not restrict the alignment to one-to-one mappings between the molecules (nodes) of the input pathways (graphs). We follow the observation that in nature different organisms can perform the same or similar functions through different sets of reactions and molecules. The number and the topology of the molecules in these alternative sets often vary from one organism to another. With the motivation that an accurate biological alignment should be able to reveal these functionally similar molecule sets across different species, we develop an algorithm that first measures the similarities between different nodes using a mixture of homology and topological similarity. We combine the two metrics by employing an eigenvalue formulation. We then search for an alignment between the two input pathways that maximizes a similarity score, evaluated as the sum of the similarities of the mapped subsets of size at most a given integer k, and also does not contain any conflicting mappings. Here we prove that this maximization is NP-hard by a reduction from Maximum Weight Independent Set (MWIS) problem. We then convert our problem into an instance of MWIS and use an efficient vertex-selection strategy to extract the mappings that constitute our alignment. We name our algorithm SubMAP (Subnetwork Mappings in Alignment of Pathways). We evaluate its accuracy and performance on real datasets. Our empirical results demonstrate that SubMAP can identify biologically-relevant mappings that are missed by traditional alignment methods and, that it is scalable for metabolic pathways of arbitrary topology, including searching for a query pathway of size 70 against the complete KEGG database of 1,842 pathways.

Journal of Computational Biology, 12 pages, in press

Genome Evolution and Phylogenomics

6. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae.

Kellis, Birren, Lander

Whole-genome duplication followed by massive gene loss and specialization has long been postulated as a powerful mechanism of evolutionary innovation. Recently, it has become possible to test this notion by searching complete genome sequence for signs of ancient duplication. Here, we show that the yeast Saccharomyces cerevisiae arose from ancient whole-genome duplication, by sequencing and analysing Kluyveromyces waltii, a related yeast species that diverged before the duplication. The two genomes are related by a 1:2 mapping, with each region of K. waltii corresponding to two regions of S. cerevisiae, as expected for whole-genome duplication. This resolves the long-standing controversy on the ancestry of the yeast genome, and makes it possible to study the fate of duplicated genes directly. Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair, providing strong support for a specific model of evolution, and allowing us to distinguish ancestral and derived functions.

Nature. 2004 Apr 8;428(6983):617-24. Epub 2004 Mar 7

9. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype.

Jaillon, Aury, Brunet, Petit, Stange-Thomann, Mauceli, Bouneau, Fischer, Ozouf-Costaz, Bernot, Nicaud, Jaffe, Fisher, Lutfalla, Dossat, Segurens, Dasilva, Salanoubat, Levy, Boudet, Castellano, Anthouard, Jubin, Castelli, Katinka, Vacherie, Bi�mont, Skalli, Cattolico, Poulain, De, Cruaud, Duprat, Brottier, Coutanceau, Gouzy, Parra, Lardier, Chapple, McKernan, McEwan, Bosak, Kellis, Volff, Guig�, Zody, Mesirov, Lindblad-Toh, Birren, Nusbaum, Kahn, Robinson-Rechavi, Laudet, Schachter, Qu�tier, Saurin, Scarpelli, Wincker, Lander, Weissenbach, Roest

Tetraodon nigroviridis is a freshwater puffer fish with the smallest known vertebrate genome. Here, we report a draft genome sequence with long-range linkage and substantial anchoring to the 21 Tetraodon chromosomes. Genome analysis provides a greatly improved fish gene catalogue, including identifying key genes previously thought to be absent in fish. Comparison with other vertebrates and a urochordate indicates that fish proteins have diverged markedly faster than their mammalian homologues. Comparison with the human genome suggests approximately 900 previously unannotated human genes. Analysis of the Tetraodon and human genomes shows that whole-genome duplication occurred in the teleost fish lineage, subsequent to its divergence from mammals. The analysis also makes it possible to infer the basic structure of the ancestral bony vertebrate genome, which was composed of 12 chromosomes, and to reconstruct much of the evolutionary history of ancient and recent chromosome rearrangements leading to the modern human karyotype.

Nature. 2004 Oct 21;431(7011):946-57

22. Evolution of genes and genomes on the Drosophila phylogeny.

Nature. 2007 Nov 8;450(7167):203-18

21. Accurate gene-tree reconstruction by learning gene- and species-specific substitution rates across multiple complete genomes.

Rasmussen, Kellis

Comparative genomics provides a general methodology for discovering functional DNA elements and understanding their evolution. The availability of many related genomes enables more powerful analyses, but requires rigorous phylogenetic methods to resolve orthologous genes and regions. Here, we use 12 recently sequenced Drosophila genomes and nine fungal genomes to address the problem of accurate gene-tree reconstruction across many complete genomes. We show that existing phylogenetic methods that treat each gene tree in isolation show large-scale inaccuracies, largely due to insufficient phylogenetic information in individual genes. However, we find that gene trees exhibit common properties that can be exploited for evolutionary studies and accurate phylogenetic reconstruction. Evolutionary rates can be decoupled into gene-specific and species-specific components, which can be learned across complete genomes. We develop a phylogenetic reconstruction methodology that exploits these properties and achieves significantly higher accuracy, addressing the species-level heterotachy and enabling studies of gene evolution in the context of species evolution.

Genome Res. 2007 Dec;17(12):1932-42. Epub 2007 Nov 7

30. Genome analysis of the platypus reveals unique signatures of evolution.

Nature. 2008 May 8;453(7192):175-83

34. Evolution of pathogenicity and sexual reproduction in eight Candida genomes.

Nature. 2009 Jun 4;459(7247):657-62

41. A Bayesian Approach for Fast and Accurate Gene Tree Reconstruction.

Rasmussen, Kellis

Recent sequencing and computing advances have enabled phylogenetic analyses to expand to both entire genomes and large clades, thus requiring more efficient and accurate methods designed specifically for the phylogenomic context. Here we present SPIMAP, an efficient Bayesian method for reconstructing gene trees in the presence of a known species tree. We observe many improvements in reconstruction accuracy, achieved by modeling multiple aspects of evolution, including gene duplication and loss rates, speciation times, and correlated substitution rate variation across both species and loci. We have implemented and applied this method on two clades of fully-sequenced species, 12 Drosophila and 16 fungal genomes as well as simulated phylogenies, and find dramatic improvements in reconstruction accuracy as compared to the most popular existing methods, including those that take the species tree into account. We find that reconstruction inaccuracies of traditional phylogenetic methods overestimate the number of duplication and loss events by as much as 2 to 3 fold, while our method achieves significantly higher accuracy. We feel the results and methods presented here will have many important implications for future investigations of gene evolution.

Mol Biol Evol. 2010 Jul 25. [Epub ahead of print]

B02. A Phylogenomic Approach to the Evolutionary Dynamics of Gene Duplication in Birds

Organ, Rasmussen, Baldwin, Kellis, Edwards

Gene duplication is a fundamental aspect of genome evolution that produces large and small gene families of (usually) related function. We perform a phylogenomic analysis of gene duplication in the chicken (Gallus gallus) to characterize the dynamics and evolution of gene duplication on the evolutionary line to birds. In Gallus, the distribution of the number of paralogs per gene family is heavily skewed towards small families. This finding is in accord with other studies that find gene family size typically follows a power-law distribution in animals, a pattern thought to be produced by differential rates of pseudogenization among families. We also test for within-family evolutionary rate variation in Gallus, finding that the vast majority of gene families exhibit substantial rate variation among lineages. This rate variation probably stems from two sources: natural deviations in the clock as commonly found, for example, in phylogenetic analyses of different species; and bursts of adaptive evolution among newly evolved gene family members. The age of gene duplications in Gallus are distributed exponentially, with most duplications occurring recently, a pattern consistent with analyses on other eukaryotes. Taken together, these results begin to reveal the dynamics of gene family evolution in birds, the most speciose group of living amniotes, though whole genome data are required from more bird and reptile species to fully understand patterns of gene gain and loss in this group.

Evolution after Gene Duplication, Wiley-Blackwell, 17 pages, October 13, 2010

Algorithmic papers in Computational Biology

C02. Whole-Genome Comparative Annotation and Motif Discovery in Multiple Yeast Species

Kellis, Patterson, Birren, Berger, Lander

In Kellis et al 2003 we reported the genome sequences of S. paradoxus, S. mikatae and S. bayanus and compared these three yeast species to their close relative, S. cerevisiae. Genome-wide comparative analysis allowed the identification of functionally important sequences, both coding and non-coding. In this companion paper we describe the mathematical and algorithmic results underpinning the analysis of these genomes. We developed methods for the automatic comparative annotation of the four species and the determination of orthologous genes and intergenic regions. The algorithms enabled the automatic identification of orthologs for more than 90% of genes despite the large number of duplicated genes in the yeast genome, and the discovery of recent gene family expansions and genome rearrangements. We also developed a test to validate computationally predicted protein-coding genes based on their patterns of nucleotide conservation. The method has high specificity and sensitivity, and enabled us to revisit the current annotation of S.cerevisiae with important biological implications. We developed statistical methods for the systematic de-novo identification of regulatory motifs. Without making use of coregulated gene sets, we discovered virtually all previously known DNA regulatory motifs as well as several noteworthy novel motifs. With the additional use of gene ontology information, expression clusters and transcription factor binding profiles, we assigned candidate functions to the novel motifs discovered. Our results demonstrate that entirely automatic genome-wide annotation, gene validation, and discovery of regulatory motifs is possible. Our findings are validated by the extensive experimental knowledge in yeast, confirming their applicability to other genomes

ACM RECOMB, p. 157-166, Apr 13, 2003

3. Phylogenetically and spatially conserved word pairs associated with gene-expression changes in yeasts.

Chiang, Moses, Kellis, Lander, Eisen

Transcriptional regulation in eukaryotes often involves multiple transcription factors binding to the same transcription control region, and to understand the regulatory content of eukaryotic genomes it is necessary to consider the co-occurrence and spatial relationships of individual binding sites. The determination of conserved sequences (often known as phylogenetic footprinting) has identified individual transcription factor binding sites. We extend this concept of functional conservation to higher-order features of transcription control regions. We used the genome sequences of four yeast species of the genus Saccharomyces to identify sequences potentially involved in multifactorial control of gene expression. We found 989 potential regulatory 'templates': pairs of hexameric sequences that are jointly conserved in transcription regulatory regions and also exhibit non-random relative spacing. Many of the individual sequences in these templates correspond to known transcription factor binding sites, and the sets of genes containing a particular template in their transcription control regions tend to be differentially expressed in conditions where the corresponding transcription factors are known to be active. The incorporation of word pairs to define sequence features yields more specific predictions of average expression profiles and more informative regression models for genome-wide expression data than considering sequence conservation alone. CONCLUSIONS: The incorporation of both joint conservation and spacing constraints of sequence pairs predicts groups of target genes that are specific for common patterns of gene expression. Our work suggests that positional information, especially the relative spacing between transcription factor binding sites, may represent a common organizing principle of transcription control regions.

Genome Biol. 2003;4(7):R43. Epub 2003 Jun 26

C04. Network motif discovery using subgraph enumeration and symmetry breaking

Grochow, Kellis

ACM RECOMB, p. 92-106, Apr 21, 2007

5. Methods in comparative genomics: genome correspondence, gene identification and regulatory motif discovery.

Kellis, Patterson, Birren, Berger, Lander

In Kellis et al. (2003), we reported the genome sequences of S. paradoxus, S. mikatae, and S. bayanus and compared these three yeast species to their close relative, S. cerevisiae. Genomewide comparative analysis allowed the identification of functionally important sequences, both coding and noncoding. In this companion paper we describe the mathematical and algorithmic results underpinning the analysis of these genomes. (1) We present methods for the automatic determination of genome correspondence. The algorithms enabled the automatic identification of orthologs for more than 90% of genes and intergenic regions across the four species despite the large number of duplicated genes in the yeast genome. The remaining ambiguities in the gene correspondence revealed recent gene family expansions in regions of rapid genomic change. (2) We present methods for the identification of protein-coding genes based on their patterns of nucleotide conservation across related species. We observed the pressure to conserve the reading frame of functional proteins and developed a test for gene identification with high sensitivity and specificity. We used this test to revisit the genome of S. cerevisiae, reducing the overall gene count by 500 genes (10% of previously annotated genes) and refining the gene structure of hundreds of genes. (3) We present novel methods for the systematic de novo identification of regulatory motifs. The methods do not rely on previous knowledge of gene function and in that way differ from the current literature on computational motif discovery. Based on genomewide conservation patterns of known motifs, we developed three conservation criteria that we used to discover novel motifs. We used an enumeration approach to select strongly conserved motif cores, which we extended and collapsed into a small number of candidate regulatory motifs. These include most previously known regulatory motifs as well as several noteworthy novel motifs. The majority of discovered motifs are enriched in functionally related genes, allowing us to infer a candidate function for novel motifs. Our results demonstrate the power of comparative genomics to further our understanding of any species. Our methods are validated by the extensive experimental knowledge in yeast and will be invaluable in the study of complex genomes like that of the human.

J Comput Biol. 2004;11(2-3):319-55

M02. Computational Comparative Genomics: Genes, Regulation, Evolution

Kellis

Understanding the biological signals encoded in a genome is a key challenge of computational biology. These signals are encoded in the four-nucleotide alphabet of DNA and are responsible for all molecular processes in the cell. In particular, the genome contains the blueprint of all protein-coding genes and the regulatory motifs used to coordinate the expression of these genes. Comparative genome analysis of related species provides a general approach for identifying these functional elements, by virtue of their stronger conservation across evolutionary time. In this thesis we address key issues in the comparative analysis of multiple species. We present novel computational methods in four areas (1) the automatic comparative annotation of multiple species and the determination of orthologous genes and intergenic regions (2) the validation of computationally predicted protein-coding genes (3) the systematic de-novo identification of regulatory motifs (4) the determination of combinatorial interactions between regulatory motifs. We applied these methods to the comparative analysis of four yeast genomes, including the best-studied eukaryote, Saccharomyces cerevisiae or baker's yeast. Our results show that nearly a tenth of currently annotated yeast genes are not real, and have refined the structure of hundreds of genes. Additionally, we have automatically discovered a dictionary of regulatory motifs without any previous biological knowledge. These include most previously known regulatory motifs, and a number of novel motifs. We have automatically assigned candidate functions to the majority of motifs discovered, and defined biologically meaningful combinatorial interactions between them. Finally, we defined the regions and mechanisms of rapid evolution, with important biological implications. Our results demonstrate the central role of computational tools in modern biology. The analyses presented in this thesis have revealed biological findings that could not have been discovered by traditional genetic methods, regardless of the time or effort spent. The methods presented are general and may present a new paradigm for understanding the genome of any single species. They are currently being applied to a kingdom-wide exploration of fungal genomes, and the comparative analysis of the human genome with that of the mouse and other mammals.

Ph.D. Thesis, MIT Libraries, 100 pages, May 25, 2003

45. SubMAP: Aligning metabolic pathways with subnetwork mappings

Ay, Kellis, Kahveciy,

Journal of Computational Biology, 12 pages, in press

C05. Information-Theoretic Inference of Regulatory Networks Using Backward Elimination

Meyer, Marbach, Roy, Kellis

Unraveling transcriptional regulatory networks is essential for understanding and predicting cellular responses in different developmental and environmental contexts. Information-theoretic methods of network inference have been shown to produce high-quality reconstructions because of their ability to infer both linear and non-linear dependencies between regulators and targets. In this paper, we introduce MRNETB an improved version of the previous information-theoretic algorithm, MRNET, which has competitive performance with state-of-the-art algorithms. MRNET infers a network by using a forward selection strategy to identify a maximally-independent set of neighbors for every variable. However, a known limitation of algorithms based on forward selection is that the quality of the selected subset strongly depends on the first variable selected. In this paper, we present MRNETB, an improved version of MRNET that overcomes this limitation by using a backward selection strategy followed by a sequential replacement. Our new variable selection procedure can be implemented with the same computational cost as the forward selection strategy. MRNETB was benchmarked against MRNET and two other information-theoretic algorithms, CLR and ARACNE. Our benchmark comprised 15 datasets generated from two regulatory network simulators, 10 of which are from the DREAM4 challenge, which was recently used to compare over 30 network inference methods. To assess stability of our results, each method was implemented with two estimators of mutual information. Our results show that MRNETB has significantly better performance than MRNET, irrespective of the mutual information estimation method. MRNETB also performs comparably to CLR and significantly better than ARACNE indicating that our new variable selection strategy can successfully infer high-quality networks.

BioComp10, 12 pages, July 13, 2010

Computational Biology

Graph-Theoretic Framework for Genome Alignment

Manolis Kellis, Bonnie Berger, Eric Lander
MIT/Whitehead Center for Genome Research

The first step to comparative genomics is to determine what regions to compare across the multiple species analyzed. This is by no means a simple task, because even close relatives are separated by gene duplication and loss events, and large-scale genomic rearrangements. Previous approaches have used heuristics to find the most similar sequence for each region and provide no guarantees of correctness or uniqueness. I formalized the problem of genome correspondence in a graph-theoretic framework and developed algorithms that work with the full bipartite graph of gene similarities across the species compared. The framework seamlessly incorporates protein sequence similarity and gene order information to resolve ambiguities in gene correspondence and to determine those genes that arose from the same gene in the common ancestor. Applying the graph algorithms to four closely related species of yeast determined that over 90% of genes have one-to-one correspondence. Additionally, the resulting graph pinpointed those genes that underwent recent duplication, loss, and protein-family expansion events, revealing regions of rapid evolution at the ends of chromosomes, and distinct mechanisms of chromosomal change. Finally, by ensuring that aligned genes share a common ancestry, the algorithm allowed me to apply robust evolutionary models of sequence divergence to discover biological signals.

Systematic Gene Identification from Patterns of Nucleotide Change

Manolis Kellis, Bruce Birren, Eric Lander
MIT/Whitehead Center for Genome Research

I developed computational methods that use multiple related genomes to identify protein-coding genes in yeast. Previous studies have identified genes based on the functional effects of their disruption, experimental evidence of their use, or protein sequence similarity to distantly related species. I formalized the question of gene identification as a classification problem between genes and intergenic regions. I developed a computational test to discriminate between the two types of region based on their patterns of nucleotide change in the multiple alignment of the four yeast species. I found distinct conservation signatures between genes and intergenic regions in yeast and implemented a discriminating test with sensitivity and specificity values that exceed 99%, based on 4000 experimentally-validated genes and 300 random intergenic regions. I used this test to evaluate every candidate gene in the yeast genome, and to refine the start, stop and intron boundaries of previously annotated genes. The work led to the largest re-annotation of the genome since its original sequencing, reducing the overall gene count by 10%, and proposing changes that affect nearly 15% of all genes. Numerous independent reports have provided experimental support confirming the changes.

Regulatory Motif Discovery using Genome-Wide Conservation Criteria

Manolis Kellis, Bruce Birren, Eric Lander
MIT/Whitehead Center for Genome Research

I also developed new computational methods to address the greater challenge of de novo regulatory motif discovery. Regulatory motifs are short 6-8 nucleotide sequence tags recognized by specific proteins that control gene expression, dictating when the associated genes will be turned on or off. Regulatory motif discovery has typically relied on extensive experimentation, evaluating gene expression of thousands of genes across dozens of experimental conditions, then constructing clusters of genes that show similar patterns of expression and searching their upstream sequences for over-represented sequence tags. In absence of biological knowledge of functionally related sets of genes, the identification or regulatory motifs has remained an open problem. Previous comparative studies have focused on individual instances of regulatory motifs, searching for conservation islands in intergenic regions. Such methods lack the power to discover the true motifs within these islands, since additional bases surrounding the motifs are spuriously conserved, not by natural selection but due to lack of divergence time between the species compared.

I approached regulatory motif discovery using genome-wide methods, evaluating motif conservation simultaneously across the entire genome. I found that the spuriously conserved bases surrounding a motif varied across its different instances, allowing the signal of the true motifs to stand out. Computationally, the algorithm started with an exhaustive enumeration of short sequence seeds with non-random genome-wide conservation, followed by a motif refinement and collapsing step, leading to a small dictionary of discovered motifs. These included nearly all previously known regulatory motifs, and a similar number of novel motifs, most of which showed functional specificities. Overall, the purely computational method showed unprecedented power in extracting signal from noise. The discovery of such genome-wide signals would have been impossible using traditional gene-by-gene experimental approaches, regardless of the time and effort invested.

Genome Evolution

Manolis Kellis, Bruce Birren, Eric Lander
MIT/Whitehead Center for Genome Research

Genome Duplication and Emergence of New Functions

Manolis Kellis, Bruce Birren, Eric Lander
MIT/Whitehead Institute Center for Genome Research

I have additionally applied computational methods to understand mechanisms of evolutionary innovation. It is generally thought that new functions arise by duplication of existing genes, but the mechanisms of duplication have been poorly understood. For example, the ancestry of hundreds of duplicated human and yeast genes has been under debate, some scientist arguing for complete genome duplication and others for independent duplication events. Previous comparative work has focused on divergence of individual gene families or regions, and the analyses have remained inconclusive.

I have taken a genome-wide approach to address this question in yeast. Instead of using divergence rates of individual genes, I have focused on the conservation of gene order across the entire genome. My approach relied on a key dataset: the sequence of yeast species that diverged prior to the estimated time of duplication. My genome alignment methods provided definitive proof of complete genome duplication: every region of the newly sequenced relative corresponded to exactly two regions of the yeast genome. More than 90% of duplicated genes were lost in one copy or the other, but the remaining genes provided a sufficient signal to show complete duplication. This work has put an end to a long debate on the ancestry of the yeast genome that started nearly seven years ago with the completion of its sequencing. Additionally, the analysis led to the first genome-wide study of the emergence of new gene functions after duplication, with important revelations on the emergence of anti-viral defense mechanisms, environmental stress response, gene silencing and transcriptional regulation.

Genomic Analysis of the Hox developmental genes

Ken Dewar, Manolis Kellis, Eric Lander
MIT/Whitehead Center for Genome Research

Hox genes control the anterior-posterior axis of embryo formation during development. By comparing the intergenic regions of human, baboon, mouse, and rat hox clusters, we are trying to understand the reasons behind the preservation of sequence and spacing.

Machine Learning

Sketch based image retrieval

Manolis Kellis, Ovidiu Marina, Patrick Winston
MIT AI Lab

My masters thesis was on Computer Vision, more precisely, sketch-based image retrieval. The project involved image understanding, segmentation, representation and learning (more about Imagina)

Understanding Primitives in Human Motion

Manolis Kellis, Michael Black
Digital Video Analysis group (DiVA), Xerox PARC

Developped a new represenation for full-body 3D motion data. The goal is to find lower-dimensional manifolds that contain motion patterns for different activities, embedded in a 27-dimensional space defined by the angles of joints. Finding such structure in human motion would allow to describe activities much more compactly for recognition or generation of 3D motion in terms of primitives intrinsic to the dynamics of human anatomy.

Handwritten character recognition

Manolis Kellis, Paul Viola
MIT AI Lab

Presented a compact angle representation for handwritten characters based on online data that filters noise and facilitates recognition. Wavelet decopmosition of the character signal in derivative angle space is invariant to scaling or rotation and remains preserved within samples of the same character. Code in Matlab. Supervised by Paul Viola from the Learning and Vision group of the AI Lab at MIT.

Learning Musical Patterns

Manolis Kellis, Ovidiu Marina, Hooman Vassef, Dedric Carter, Patrick Winston
MIT AI Lab

Under the supervision of Patrick Winston, more interested in understanding the human I/O channels, I wrote a program to recognize music by learning patterns of attentional state. Instead of observing the notes directly, we were more interested in watching the states that a note observer would go through. In other words, the succession of primitive routines that a note pattern-matcher uses in observing a note sequence should tell us more about a musical piece than the patterns extracted themselves. The strength of this project was a clean architecture for using meta-information about an observed phenomenon. It separated the low-level primitive routines (that match patterns, change the focus of attention to salient points, and establish properties about the observed events) from an attentional state observer of those low-level routines (that records and analyzes series of states the pattern matcher enters). This separation that the program exhibited is suggested by Minsky's notion of a B-brain that monitors the activity of our A-brain. (visit the MOOD team)

Computational Geometry

Surface Reconstruction

Nina Amenta, Marshall Bern, Manolis Kellis
Xerox PARC, CSL

My involvement in graphics also grew out of mathematical interest, in computational geometry. At Xerox PARC, I worked with Marshall Bern and Nina Amenta on a surface reconstruction algorithm. The algorithm relies on Voronoi diagrams and Delaunay triangulations to compute a provably correct reconstruction of the connectivity of a set of scattered sample points in space. Furthermore, the sampling density requirement is proportional to the curvature and proximity of other surface patches, which allows less dense sampling in those parts of the figure with fewer features. Our implementation reached speed and correctness benchmarks that were state of the art, and our work was published the subsequent summer at ACM Siggraph 1998, the largest graphics conference. (learn more about the Crust algorithm). Computer graphics was my new passion.

Polygon based 3D morphing

Manolis Kellis, Krzysztof Gajos, Matt Blum, Hooman Vassef, Seth Teller
MIT Graphics Group

Back at MIT I took a computer graphics class and as a final project I proposed and implemented a 3D morphing algorithm. The project explored the question of transforming a three-dimensional polygonal model into another while minimizing the energy lost in creating and destroying polygons and maximizing the uniformity of the polygon flow. Following these constraints, a polygon can spawn another polygon from one of its edges (birth), and a polygon can collapse onto another polygon's edge (death). From there followed two algorithms I called Edge Ordering and Incremental Edge Ordering that solve the general problem of scheduling a multitude of agents to satisfy a set of tasks, for example a set of robots extinguishing small fires. Further constraints were applied on the birth and death edges to preserve the uniformity of the polygon flow, and the aesthetic results were very appealing, yet biased by the cost metric I used. (you can find 3D morphing examples online)

Gaussian curve morphing

Manolis Kellis, Berthold Horn
MIT AI Lab

In search of a purely mathematical solution, I further explored the problem of morphing in the continuous case. Instead of having a discrete set of polygons as the input, this problem addressed an arbitrary convex shape. Inspired by Professor Berthold Horn's Robot Vision class, I mapped the two shapes to be morphed to their corresponding Extended Gaussian Images, and simply matched the normals on the surface of one unit sphere with normals on the other. The matching algorithm is a function that distributes each source to all destinations, the amount displaced between any two normals inversely exponential to a combination of their difference in angle and weight. This combination depended on an input parameter that favored matching normals of similar weights or normals of neighboring angles. This parameter ended up incorporating the different intuitive approaches we have about transforming a square into a triangle: in one case we round off the corners (minimizing length variation of the sides), and in the other case we progressively shrink one of the square's edges until it becomes a triangle (minimizing angle variation). Both approaches were supported by different users of the morphing program. This purely mathematical solution thus yielded intuition on a problem for which there was no base of argumentation. (you can test the Gaussian Morphing applet yourself!).

Robotics

An interacting reprogrammable robot

Manolis Kellis, Jeremy Lueck, Chris Rohrs
MIT

I made this system a reality in my 6.115 final project, RoboLogo, in a team of three people. Inspired from Seymour Papert's work with children and Logo, we developed iLogo (interactive Logo), to extend the Logo language with primitives for interacting with the environment, reading sensors, and reacting accordingly to external events. We designed the new language and its grammar; we built a compiler that translated programs from iLogo to assembly code; we wrote a library of low-level routines and initialization code to be linked with the generated files; we designed and laid out a printed circuit board to incorporate a processor, RAM, ROM, motor controllers, sensor drivers, AD converter; finally we put everything together on a mechanical truck on which we mounted sensors and actuators. Children can control the robot by writing simple high level commands in iLogo, and see a working robot bounce between two walls with only six lines of code. This new abstraction layer allows for more complicated behaviors to be expressed in smaller, simpler programs, with an overhead that is negligible when the speed of the processor is compared to the speed of external events. Building the entire RoboLogo environment from the ground was a very rewarding experience and went a step further than 6.270 where the circuit board was provided along with the iC language and compiler. Making RoboLogo happen was like constructing our own 6.270 class. (on to the RoboLogo page).

Modular Reconfigurable Robots

Mark Yim, Manolis Kellis
Xerox PARC, SPL

At Xerox PARC, the summer of 98, I worked on a larger-scale robotics project and I concentrated on the control of the robots rather than the low-level construction of the hardware. I worked with Mark Yim on Polypod, a unit-modular reconfigurable robot, and I was responsible of the control architecture for coordinating the actions of the different modules. The model I proposed is a combination between a global bus carrying one or multiple values shared among all modules and a message passing mechanism that provides an interpretation for the shared values. The architecture is designed to be adaptive to the environment and robust against isolated module failures. To decentralize the control it distributes the computational power throughout the robot and specifies local behavioral rules at the module level, so that global communication be not necessary for simple decisions that need only local information. Global behavior then emerges at the robot level when Polypod interacts with the environment. While working on Polypod, my graphics background helped in the making of a Java3D simulation to visualize the modules; my geometry background helped me work out the forward and inverse kinematics; my background in AI helped me develop an architecture for decision making and control arbitration; my mathematical background helped me address scalability issues and design algorithms for information propagation and storing; moreover, having dealt with both mechanical and electronics issues during my robotics experience at MIT, I had an intuition about what was feasible to undertake in a mechatronics system. (see more at the PolyPod page)

Old Projects

Predicting the stock market

Manolis Kellis, Shishir Mehrotra, Nimrod Warshawsky, Tomás Lozano Perez
MIT AI Lab

My interest in artificial intelligence arose as a mathematical curiosity first. In an introductory AI class, I was presented with a series of small independent problems and smart ideas with which to approach them. I applied those ideas in different projects in an intensive class by Tomás Lozano Perez, for which I trained a Neural Network to recognize human faces, I used Temporal Difference Learning for a board game in which the machine grew sometimes stronger than the user, and for my final project, I experimented with different machine learning methods to extract patterns from the stock market. (more about Invest)

Autonomous Lego Robot

Panayiotis Kellis, Maria Kellis, Manolis Kellis
MIT

My interest in building and controlling robots was first expressed when I participated in 6.270, the autonomous robot design competition. My brother was a junior in course VI, my sister a sophomore in course II and I was a freshman at the time. Together, we constructed a robot out of Lego pieces, and we programmed it in iC (interactive C) using a provided circuit board. The goal was to create insect-like robots, or more specially, ants. The "ants" were supposed to collect as much "food" as possible during each 60-second face-off. The food was represented by styrofoam blocks. Robots received more points for picking up the blocks and stacking them on top of each other in its "nest," a raised rubber platform. This experience taught me a lot about the difference between programming and reality, and how many things one needed to take care of when programming a system that interacts with the real world, even in an overconstrained environment like the contest table. (visit the RoboAnt contest page). I was thinking that children in high school, and even primary school should have the same opportunity to program autonomous robots. However the overhead from mechanical, electronics and low-level aspects should be reduced for younger children. They should start in a system with instant gratification, where they can write their "Hello, World!" program in a high level language on the first day. Writing complex programs, more complete reactions to the environment, and understanding the tools that make their program really work should be part of further lessons. They should however, immediately see the robot act and react around the lab under their commands and bridge the gap between concrete world actions and abstract programming commands.

Constraint model for a web crawler

Manolis Kellis, Henrik Frystyk Nielsen
World Wide Web Consortium

The summer and fall of 96 i worked at the world wide web consortium, where i developped a simple constraint model for a web robot. The code was written in Tcl and allowed for arbitrarily rules to be expressed in a simple form of recursive deifinitions. (more about the w3c mini Webbot)

Scheme implementation of a Universal Turing Machine

Manolis Kellis
MIT

The fourth core computer science class at MIT, 6.004, also known as Computational Structures, introduces students to Turing Machines. I implemented a Universal Turing Machine that can read in the descriptions of machines and simulate them. The code is written in Scheme. A year later, Mark Hershberg ported this tool online for 6.004 students to use when they want to test their FSM descriptions on sample inputs (the entire Turing code is available for download).

Ray Tracing, Lighting, Modelling

Manolis Kellis, Seth Teller
MIT LCS

The picture on the right was generated by a ray tracer that I wrote using C and OpenGL (see example 1 or 2). Other parts of the graphics pipeline I implemented were color interpolation, normal interpolation for Gourauld shading, casting shadows, and handling transparency. Other graphics work I did included a 3D model of the MIT campus (with some artistic liberties taken), including animations of a rowing race and traffic cars. Here's a contact-inspired graphic. The rings were animated.

Online Information Sharing

Manolis Kellis
MIT

The spring of 1997 i created the HKN Knowledge Market, a online meeting place where students can exchange ideas, information, hints, and advice about classes, problem sets, internships, research. The project is sponsored by HKN (more about Knowledge Market)

How meta-information directs evolution

Manolis Kellis
MIT

At the same time, in Marvin Minsky's Society of Mind I explored a different approach to AI, interested more in understanding how the human brain works from a computational standpoint rather than in creating powerful machines unrelated to human capabilities. For Minsky's class, I did some research on evolution and the human DNA and wrote a paper on how along with the genetic data, meta-information can be propagated to direct the interpretation of the raw data. This theory allowed for the evolving species' DNA to learn not only about surviving, but also about evolving. This twist provided a simple answer to some questions Darwin left unexplained, such as the exponential increase in evolutionary speed as species evolve, and the evolutionary bursts that disturb otherwise steady evolutionary periods. Minsky's class helped me understand the human brain as an engineering project, which inspired intuitions both from the biological world to address engineering problems, and from the engineering world to understand how we think, perceive, act. (read the Evolution paper).

Manolis KellisResearch Interests

Genome Interpretation

Small regulatory RNAs (and large regulatory RNAs)

Protein-coding genes

Chromatin, regulation, and developmental programs

Regulatory Motifs

Biological Networks

Genome Evolution and Phylogenomics

Algorithmic papers in Computational Biology

Computational Biology

Graph-Theoretic Framework for Genome Alignment

Systematic Gene Identification from Patterns of Nucleotide Change

Regulatory Motif Discovery using Genome-Wide Conservation Criteria

Genome Evolution

Genome Duplication and Emergence of New Functions

Genomic Analysis of the Hox developmental genes

Machine Learning

Sketch based image retrieval

Understanding Primitives in Human Motion

Handwritten character recognition

Learning Musical Patterns

Computational Geometry

Surface Reconstruction

Polygon based 3D morphing

Gaussian curve morphing

Robotics

An interacting reprogrammable robot

Modular Reconfigurable Robots

Old Projects

Predicting the stock market

Autonomous Lego Robot

Constraint model for a web crawler

Scheme implementation of a Universal Turing Machine

Ray Tracing, Lighting, Modelling

Online Information Sharing

How meta-information directs evolution

Manolis Kellis
Research Interests