NOTE: The version of GENSCAN on Athena is maintained by Liana Lareau fyfer@mit.edu PLEASE contact me before bothering Dr. Burge, and I'll pass your message on. ********************************************************************** ********************************************************************** ** ** ** DOCUMENTATION FOR COMMAND-LINE VERSION OF GENSCAN ** ** ** ** ** ** Christopher Burge, PhD ** ** ** ** MIT ** ** Department of Biology ** ** 77 Massachusetts Ave., 68-222 ** ** ** ** cburge@mit.edu ** ** ** ********************************************************************** ********************************************************************** ______________________________________________________________________ TABLE OF CONTENTS 1. OVERVIEW OF GENSCAN 2. INSTALLING GENSCAN 3. GENSCAN INPUT 4. GENSCAN OUTPUT 5. HOW GENSCAN WORKS 6. WEB PAGES 7. REFERENCES ______________________________________________________________________ 1. OVERVIEW OF GENSCAN GENSCAN is a general-purpose gene identification program which analyzes genomic DNA sequences from a variety of organisms including human, other vertebrates, invertebrates and plants. For each sequence, the program determines the most likely "parse" (gene structure) under a probabilistic model of the gene structural and compositional properties of the genomic DNA for the given organism. This set of exons/genes is then printed to an output file (the text output) together with the corresponding predicted peptide sequences. A graphical (PostScript) output may also be created which displays the location and DNA strand of each predicted exon. Unlike the majority of other currently available gene prediction programs, the model treats the most general case in which the sequence may contain no genes, one gene, or multiple genes on either or both DNA strands and partial genes as well as complete genes are considered. The most important restrictions are that only protein coding genes are considered (and not tRNA or rRNA genes, for example), and that transcription units are assumed to be non-overlapping. The probabilistic model used by GENSCAN accounts for many of the essential gene structural properties of genomic sequences, e.g., typical gene density, the typical number of exons per gene, the distribution of exon sizes for different types of exon; and also many of the important compositional properties of genes, e.g., the reading frame-specific hexamer composition of coding regions vs the (reading frame-independent) hexamer composition of introns and intergenic regions, and the position-specific composition of the translation initiation (Kozak) and termination signals, and of the TATA box, cap site and poly-adenylation signals. Importantly, novel models of the donor and acceptor splice sites are used which capture potentially important dependencies (interactions) between positions in these signals. For human and vertebrate sequences, separate sets of model parameters are used which account for the many substantial differences in gene density and structure observed in distinct C+G% compositional regions of the human genome and the genomes of other vertebrates. The program and the model that underlies are described in Burge & Karlin, 1997 (see REFERENCES) and in greater detail in my thesis, which is available by anonymous ftp in PostScript format (see http://gnomic.stanford.edu/~chris). ______________________________________________________________________ INSTALLING GENSCAN Make sure that you have requested the correct binary for the computer system on which you intend to install GENSCAN. Also make sure that the computer you have chosen has enough memory to run the types of sequences you want to run efficiently. As a rule, to run sequences of length N kilobases efficiently, your computer should have at least N/2 Megabytes of RAM. Longer sequences can be tried, but results may vary. For example, you might find that the program runs much more slowly or that you may get a "memory fault" error message for long sequences. In the latter case, it is probably best to break up the sequence into two or more smaller pieces and run each one separately. Installing GENSCAN should be fairly easy. Here is a suggested sequence of steps - feel free to modify it depending on the setup of your system. The ">" symbol below represents the tcsh (or other shell) prompt. 1. First, if you have not done so already, make a directory to store files related to GENSCAN, insert the floppy disk provided into the drive and extract the files on it using the Unix bar command: > mkdir GENSCAN > cd GENSCAN > bar -xvf /dev/rfd0 . (Here, /dev/rfd0 refers to the floppy disk drive: on some systems it may have a different name.) 2. List this directory (ls). The following files should be present: README (This file: GENSCAN documentation) HumanIso.smat (Parameter file for human/vertebrates) Arabidopsis.smat (Parameter file for Arabidopsis) Maize.smat (Parameter file for maize) HUMRASH (A sample human genomic sequence) HUMRASH.sample (GENSCAN output for the sample sequence) genscan (The GENSCAN binary) 3. Make sure that you have permission to execute the binary and to read the parameter files: > chmod a+x genscan > chmod a+r *.smat 4. Install the binary in a directory which is in your path, e.g., > mv genscan /usr/bin/genscan (Must be superuser to write to /usr/bin on most systems.) 5. Now make a directory for the GENSCAN parameter files and put them into this directory: > mkdir /usr/lib/GENSCAN > mv *.smat /usr/lib/GENSCAN 6. Type rehash and then try running the program (with no arguments): > rehash > genscan What happened? Hopefully, you got a message like: ******************************************************************* ******************************************************************* ** ** ** N O T I C E ** ** ** ** This program is made available ... ... yadda yadda yadda ... ** ** ******************************************************************* ******************************************************************* usage: genscan parfname seqfname [-v] [-cds] [-subopt cutoff] [-ps psfname scale] parfname : full pathname of parameter file (for appropriate organism) seqfname : full pathname of sequence file (FastA or minimal GenBank format) -v : verbose output (extra explanatory info) -cds : print predicted coding sequences (nucleic acid) -subopt : display suboptimal exons with P > cutoff (optional) cutoff : suboptimal exon probability cutoff (minimum: 0.01) -ps : create Postscript output (optional) psfname : filename for PostScript output scale : scale for PostScript output (bp per line) If not, you might double-check your path to make sure it includes the directory where you put genscan and make sure that all of the permissions are set properly. You have now successfully installed genscan. ______________________________________________________________________ 3. GENSCAN INPUT Typing "genscan" at the prompt displays the usage information for the program (see above). In brief, the program takes two essential command-line arguments, as well as one or more optional arguments (the ones in square brackets above). These arguments are described below. OPTIONAL ARGUMENTS __________________ -v Add some extra explanatory information to the text output. This information may be helpful the first few times you run the program but will soon become tiresome (that's why its optional). -cds Print predicted CDS (coding sequences) as well as predicted peptides. -subopt Identify suboptimal exons. The default output of the program is the optimal "parse" of the sequence, i.e. the highest probability gene structure(s) which is present: the exons in this optimal parse are referred to as "optimal exons" and are always printed out by GENSCAN. Suboptimal exons, on the other hand, are defined as potential exons which have probability above a certian threshold but which are not contained in the optimal parse of the sequence. Suboptimal exons have a variety of potential uses. First, suboptimal exons sometimes correspond to real exons which were missed for whatever reason by the optimal parse of the sequence. Second, regions of a prediction which contain multiple overlapping and/or incompatible optimal and suboptimal exons may in some cases indicate alternatively spliced regions of a gene (Burge & Karlin, in preparation). The argument "cutoff" is the probability cutoff used to determine which potential exons qualify as suboptimal exons. This argument should be a number between 0.01 and 0.99. For most applications, a cutoff value of about 0.10 is recommended. Setting the value much lower than 0.10 will often lead to an explosion in the number of suboptimal exons, most of which will probably not be useful. On the other hand, if the value is set much higher than 0.10, then potentially interesting suboptimal exons may be missed. -ps Create the PostScript (graphical) output, which is a diagram of the locations and DNA strand of all predicted exons/genes. Exons on the "forward" (input) strand of the sequence are displayed above the sequence line; exons on the other strand are displayed below this line. Optimal exons are displayed as solid blue blocks (black on a black-and-white monitor or printer); suboptimal exons are outlined in blue. (If many overlapping suboptimal exons are present, the graphical display may become somewhat cluttered.) The argument "psfname" is the name of the file where you want the PostScript output to go (should end in ".ps"). This argument is required whenever the "-ps" flag is used. The "scale" argument tells the program what scale to make the PostScript image, i.e. how many base pairs to represent per line. This number must be no greater than one fourth of the length of the sequence (because only four lines fit on a page). If this argument is omitted, the program will choose a reasonable scale for the image. The PostScript output may be printed on any PostScript printer: it can be viewed using any of several PostScript interpreters such as ghostscript/ghostview, pageview, xpsview, etc. The PostScript language was developed by Adobe Systems Inc. and the name PostScript is a registered trademark of this company. ESSENTIAL ARGUMENTS ___________________ 1. The Parameter File The parameter file must follow a very specific format to be read correctly by GENSCAN and SHOULD NEVER BE MODIFIED. If you really have a good reason for wanting to change some of the parameters, please send email to: cburge@mit.edu and we can discuss the proposed changes. Note that separate parameter files are provided for different organisms. Currently available parameter files are: HumanIso.smat for human/vertebrate sequences (also Drosophila) Arabidopsis.smat for Arabidopsis thaliana sequences Maize.smat for Zea mays sequences The full path of the parameter file is the first (mandatory) command line argument. You can either type the full path every time you run the program or save some typing by using aliases. For example, assuming that the parameter files have been installed in the directory /usr/lib/GENSCAN, you could put the following aliases in your .cshrc file (normally located in your home directory): alias genvert genscan /usr/lib/GENSCAN/HumanIso.smat alias genarab genscan /usr/lib/GENSCAN/Arabidopsis.smat alias genmaiz genscan /usr/lib/GENSCAN/Maize.smat That way (after you source .cshrc) you can simply type, for example, > genvert SEQFILE to run the program on SEQFILE with the human/vertebrate parameters. 2. The Sequence File The sequence file may be in either FastA or minimal GenBank format. These formats are described below, with examples of each. A sequence in FastA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The sequence data may be upper or lower case. All spaces, tabs, numbers or other non-alphabet characters are ignored by GENSCAN, with the exception of asterisks ("*"), which are treated as unknown nucleotides (N's). GENSCAN does not distinguish between special symbols indicating purine or pyrimidine nucleotides such as R, Y, etc., but treats all letters other than A, C, G, T as unknowns (N). It is usually an easy task to convert sequences stored in other formats such as Intelligenetics, EMBL, etc. to FastA format either by hand or using any of several standard utitilities. A sample FastA format sequence is shown below: >HC2667A cosmid clone from human chromosome 5q22 GGATCCCAGCCTTTCCCCAGCCCGTAGCCCCGGGACCTCCGCGGTGGGCGGCGCCGCGCT GCCGGCGCAGGGAGGGCCTCTGGTGCACCGGCACCGCTGAGTCGGGTTCTCTCGCCGGCC TGTTCCCGGGAGAGCCCGGGGCCCTGCTCGGAGATGCCGCCCCGGGCCCCCAGACACCGG ...... GENSCAN may also be run on files in "minimal GenBank" format. This includes files in proper GenBank format as well as files (perhaps constructed by hand) which contain only partial GenBank annotation (see below). The main reason why GENSCAN has been written so as to accept GenBank annotated as well as unannotated (FastA) files is so that the predictive accuracy of the program can be easily measured on sequences with known gene locations. Thus, when run on a GenBank file containing a feature table, the program automatically compares its predictions to the annotated CDS (coding sequence) features, displays a summary of the annotated as well as predicted exons, and calculates some standard measures of predictive accuracy such as nucleotide- and exon-level sensitivity, specificity and so on. (The conventions used to calculate these statistics are those described in Burset and Guigo, 1996 - see REFERENCES). This makes it relatively easy to check the program's accuracy for any particular set of sequences for which the annotated CDS features are deemed to be reliable and complete. Of course, the program does not actually use the annotation in any way to make its predictions: that would be silly. Therefore, in particular, the set of predicted genes/exons will be identical for a sequence in GenBank format as for the same sequence converted to FastA format, i.e. without the feature annotation. In general, GenBank format files must follow an elaborate set of formatting conventions devised by the National Center for Biotechnology Information (NCBI) in Washington. However, GENSCAN needs only a fraction (described below) of the complete GenBank annotation for its purposes, so only these lines need be present (in the proper order) in an input file which is to be read by the program. This "minimal GenBank" format is as follows. The LOCUS and ORIGIN lines must be present. The first line of the file must be the LOCUS line and this line must be in proper GenBank format (see below). The ORIGIN line must begin with the word ORIGIN but has no other special restrictions. The sequence must begin on the line immediately after the ORIGIN line. All other lines normally present in GenBank files (e.g., ACCESSION, KEYWORDS, etc.) are optional. However, if a feature table is present, it must begin with a line: FEATURES Location/Qualifiers (with exactly the same spacing/capitalization/etc. as in a GenBank file) and must end with a BASE COUNT line (as in a real GenBank file) followed by an ORIGIN line, and then the sequence. The feature table must be in correct GenBank feature table format (including spacing) -- consult the NCBI GenBank format description or look at some real GenBank files if in doubt. All features other than those labeled "CDS" are ignored. All CDS features are read and the complete set of annotated coding exons are compared to the complete set of predicted coding exons. The format for the LOCUS line is: LOCUS SEQNAME seqlen bp seqtype taxgroup date where SEQNAME is the name of the sequence (any string), seqlen is the length of the sequence in base pairs (must match the actual number of sequence characters in the file), and the last three strings describe the type of sequence (e.g., ds-DNA, cDNA), the taxonomic group code, and the date. Again, the sequence may be upper or lower case and all spaces, tabs, numbers, etc. occurring after the ORIGIN line are ignored. An example of a sequence in minimal GenBank format is given below: LOCUS HUMRASH 6453 bp ds-DNA PRI 15-MAR-1988 DEFINITION Human c-Ha-ras1 proto-oncogene, complete coding sequence. ACCESSION J00277 J00206 J00276 K00954 FEATURES Location/Qualifiers prim_transcript <1664..3744 /note="c-Ha-ras1 mRNA" CDS join(1664..1774,2042..2220,2374..2533,3231..3350) /note="c-Ha-ras1 p21 protein; NCBI gi: 190891." /codon_start=1 /translation="MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQV VIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKR VKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLV REIRQHKLRKLNPPDESGPGCMSCKCVLS" source 1..6453 /organism="Homo sapiens" BASE COUNT 946 a 2287 c 2113 g 1107 t ORIGIN 1 bp upstream of BamHI site. 1 ggatcccagc ctttccccag cccgtagccc cgggacctcc gcggtgggcg gcgccgcgct 61 gccggcgcag ggagggcctc tggtgcaccg gcaccgctga gtcgggttct ctcgccggcc 121 tgttcccggg agagcccggg gccctgctcg gagatgccgc cccgggcccc cagacaccgg ... .......... ______________________________________________________________________ 4. GENSCAN OUTPUT By default, the text output of the program is directed to stdout, which means that if you simply run the program without redirecting or piping the output, it will be printed to the screen. The purpose of printing the text output to stdout is so that you (the user) have the option of either redirecting the output to a file, e.g., > genscan HumanIso.smat SEQFILE > SEQFILE.out or piping the text output through some sort of filtering program to put it in a form which is more convenient for your purposes, e.g., > genscan HumanIso.smat SEQFILE | gsfilter > SEQFILE.fil_out Of course, it is your responsibility to create the filtering program. At this point, you may wish to test GENSCAN by running the program on the sample sequence HUMRASH which was provided and redirecting the output to a file, e.g. (assuming you set up the recommended aliases) type: > genvert HUMRASH > HUMRASH.out While running, the program should print out the "notice" message and then the following (everything between the rows of stars): ************************************************************************* reading sequence file HUMRASH... 6453 bp reading parameter file /usr/lib/GENSCAN/HumanIso.smat... 14 matrices scoring sequence... done running Viterbi/forward algorithms... done running backward algorithm... done printing results... 1 predicted genes processing features... done Run time 5 seconds : 0.8 s / kb ************************************************************************* Of course, the run time may vary depending on what type of computer you are using and what other processes were running at the same time. Other than that, the output should be similar to that shown above. If not, then there may have been a problem with the installation. (Check the steps listed above.) These messages are designed to give you an idea of what the program is doing at each particular time so that if a problem occurs you have some idea of what might have gone wrong. They should be fairly self-explanatory (see next section). Now compare the output HUMRASH.out to the output HUMRASH.sample which was provided using the Unix diff command: > diff HUMRASH.out HUMRASH.sample Only one line should differ between the two files: the line beginning GENSCAN 1.0 and containing the date and time the program was run. If any problems occurred up to this point (or in subsequent use of the program), send email to Chris Burge at: cburge@mit.edu Include in the email the type of computer system you have, the exact syntax of the command used to invoke the program, a description of the problem which occurred while running the program, and any error messages which appeared. In addition, if any bugs are found in GENSCAN, please notify me promptly so that I can fix them as soon as possible. Happy gene hunting!!! ______________________________________________________________________ 5. HOW GENSCAN WORKS This section gives an indication of the sequence of steps which the program carries out when run on a sequence. 1) The sequence and parameters are read and stored in dynamically allocated arrays. 2) The sequence is scored using the probabilistic models of coding/non-coding regions, donor and acceptor splice sites, etc. given in the parameter file and the resulting scores are also stored in dynamically allocated arrays. 3) Modified Viterbi, forward and backward recursions are performed which allow determination of the most likely gene structure in the sequence, probability of each predicted exon, and (optionally) all suboptimal exons with probability above a given cutoff. 4) The predicted gene structure(s), suboptimal exons, associated score information and the corresponding predicted peptides are then printed to stdout (the text output). Optionally, a PostScript output file which displays the locations of all predicted exons is also created (the graphical output). 5) The program indicates the amount of time it took to run and exits. ______________________________________________________________________ 6. WEB PAGES GENSCAN web server http://genes.edu/GENSCAN.html (Contains links to other GENSCAN-related pages) GENSCAN email server instructions http://ccr-081.mit.edu/GENSCANM.html My home page: http://genes.mit.edu/chris ______________________________________________________________________ 7. REFERENCES Burge, C. & Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94. Burge, C. & Karlin, S. (1997) Gene structure, exon prediction, and alternative splicing. (in preparation). Burge, C. (1997) Identification of genes in human genomic DNA. PhD thesis, Stanford University, Stanford, CA. Burset, M. & Guigo, R. (1996) Evaluation of gene structure prediction programs. Genomics 34, 353-367. ______________________________________________________________________ Copyright 1997-2001 Christopher Burge ______________________________________________________________________