mhmmscan (unsuported)

Usage:

mhmmscan [options] <MHMM file> <sequence file>

Description

mhmmscan searches a sequence database using a motif-based hidden Markov model (HMM) of the kind produced by mhmm. This program is similar to mhmms, except that mhmmscan

Each sequence-vs-model match is assigned an E-value, and matches that score below a user-specified E-value threshold are printed in order of increasing E-value.

mhmmscan has two modes of computing match scores:

In p-value score mode, the search sequence can be thought of as consisting of three steps:

  1. find motif matches ("hits") with each sequence with p-values less than the user-specified p-value threshold,
  2. coalesce hits found in each sequence into "matches", where hits separated by more than maxgap positions are always separated into distinct matches, and
  3. report matches with E-values less than threshold.

The three parameters p-value threshold, maxgap and threshold are described in more detail under "Options:", below.

In log-odds score mode, the search can be thought of as consisting of two steps:

  1. find local matches between the model and each sequence that maximize the log-odds score and exceed minscore, and
  2. report matches with E-values less than threshold.

The threshold parameter is described in more detail under "Options:", below.

In order for E-values to be computed by mhmmscan, at least 100 matches must be found. If there are too few sequences in the database, or if certain other options are made to stringent (see Options, below), too few matches may exist for E-values to be computed. In this case, the results are sorted by match score, the E-value column is set to "NaN" and all matches are printed.

Input

Motif-based HMM File

A file containing a motif-based HMM.

Sequence File

A file containing FASTA formatted sequences. If the filename is given as '-' then the sequences wil be read from standard input.

Output

The MHMM scan results are written to standard output.

Options

Option Parameter Description Default Behaviour
General Options
--gfffilename Produce an output called filename in GFF2 format.
--p-threshp-value threshold The --p-thresh option activates p-value score mode, motif match scores are converted to their p-values. They are then converted to bit scores as follows:
S = -log2(p/T)
where S is the bit score of the hit, p is the p-value of the log-odds score, and T is the p-value threshold. In this way, only hits more significant than the p-value threshold get positive scores. The p-value threshold, T, must be in the range 0 < T ≤ 1. This mode of scoring automatically activates the -motif-scoring feature (described below under "Advanced Options:") so that partial motif hits are disallowed.

Note

  • If p-value threshold is too small, there may be few (or no) "hits", and, consequently, few (or no) matches. This may cause mhmmscan to be unable to compute match E-values, or to report no matches. Small values of the p-value threshold may also cause the reported E-values to be inaccurate. In this case, the E-values will always be too large (conservative). The proper value for the p-value threshold can only be determined by experimentation since it depends on the number of motifs, the information content of the motifs and the value of maxgap.
  • If p-value threshold is too large, the expected length of a match may be longer than most of the sequences in the database you are searching. This will prevent mhmmscan from being able to compute E-values. Very low values of p-value threshold, when search genomic DNA, tend to give high scores to low-complexity sequence and repeated elements.
--max-gapmax gap The --max-gap option allows you to specify the longest gap allowed before two local matches will be split. Matches separated by a gap longer than max gap will be split into two separate matches. This switch automatically activates the --zselo option (described in the Advanced Options) so that gap emission scores are set to zero. It also sets the --min-score threshold to a small number (1e-6), and sets --gap-open and --gap-extend (described in the Advanced Options) to T/L, where T is the --min-score threshold, and L is max gap.

Note

  • This switch causes mhmmscan to ignore the transition probabilities in the HMM.
  • Large values of max gap combined with large values of p-value threshold may prevent mhmmscan from computing E-values due to the problem described above in the second note for the --p-thresh option.
--e-threshthreshold mhmmscan prints the matches with E-values below the given threshold. If E-values cannot be computed, all matches are printed. The default threshold is 10.
--fancy The --fancy option turns on a more detailed output format that shows, in addition to the score for each sequence, the complete model-to-sequence match. For more details, see the documentation for mhmms.
--widthwidth Specify the width (in characters) of each line in the output. The description of each sequence, which is taken from the input FASTA file, will be truncated as necessary. By default, the output width is 132 characters.
--text Print output as ASCII text to standard output. Print output as HTML to standard output. This makes use of the program mhmm2html.
--nosort Do not sort the output. This reduces the memory requirements.
--bg-filebackground file Read background frequencies from background file. The file should be in MEME background file format. The background letter distribution of the appropriate (DNA or protein) NCBI non-redundant database is used.
--allow-weak-motifs In p-value score mode, weak motifs are defined as ones where the best possible hit has a p-value less than the p-value threshold. Such motifs cannot contribute to a match in p-value score mode. When the --allow-weak-motifs option is supplied the search will proceed, but the weak motifs will never appear in any matches.

Note

This switch only applies to p-value score mode.
Any search containing weak motifs is rejected.
--blocksizeblock size Specify the number of letters that are read from each sequence at once. If your system has a lot of memory, then you can specify a larger number of letters to be read at once, and vice versa. Reads in blocks of 107/N letters, where N is the number of states in the model. Specifying a verbosity of 2 or larger will cause the block size to be output when the program runs.
--synth Generate random synthetic sequences if fewer than 10000 matches are found in the sequence database. These sequences will be searched and their match scores used to estimate the random score distribution. The sequences are generated using the background distribution specified using the --bg switch, above. If the background file contains a higher-order Markov model, the higher-order lines are used when generating synthetic sequences.
--progressn Print to standard error a progress message after every n sequences.
--noheader Do not include a header in the output. Include a header in the output.
-notime Do not include a running time or host name at the end of the output.
--quiet Combine the previous 2 flags and set the verbosity to 1.
Advanced Options
--zselo Specifying the --zselo option causes the spacer emission log-odds scores to be set to zero. This prevents regions of unusual base/residue composition matching spacers well when the spacer emission probabilities are different than the background probabilities. It is particularly useful with DNA models. This option is enabled by the --maxgap option.
--gap-open cost The --gap-open option causes all transitions into a spacer state to be assigned a log-odds score equal to cost. Together with the -gap-extend option, this allows you to specify an affine gap penalty function, overriding the gap penalty implicit in the model (transition probabilities into and out of gap states). This option is enabled by the --maxgap option.
--gap-extendcost The --gap-extend option causes all spacer self-loop log-odds scores to be set to cost. In addition, it causes all other transitions out of a spacer to be set to zero. Together with the --gap-open option, this allows you to specify an affine gap penalty function, overriding the gap penalty implicit in the model (self-loop transition probabilities of gap states). This option is enabled by the --maxgap option.
--min-scoremin score This option allows you to specify the threshold for the repeated match algorithm used by mhmmscan. Matches must have a score of at least min score to be detected. Matches containing internal regions with scores less than minus 'threshold' will be split and reported as two separate matches. This option is enabled by the --maxgap option.
--egcostcost Scale the expected cost of a random gap to be cost times the expected score of a random hit. The larger you set cost, the more gaps will be penalized. This can only be used in conjunction with --max-gap. This may not be used in conjunction with --min-score. This option is enabled by the --maxgap option. By default, gap costs are essentially zero.
--motif-scoring Specifying the --motif-scoring option forces all matches to motifs to be complete. This prevents matches to motifs from overhanging the sequence ends. It also prevents matches from beginning (ending) anywhere but at the start (end) of a motif. This option is enabled by the --p-thresh option. Matches can begin or end anywhere within a motif.
--pseudo-weightbeta The weight on the pseudocount probabilities can be adjusted to any value ≥ 0 using the -pseudo-weight option. The smaller the weight, the less effect the pseudocount probabilities have, and the closer the adjusted probabilities will be to the emission probabilities in the model. The pseudocount probabilities are weighted by beta = 10, and emission probabilities in the model by alpha = 20. (See the formula above for converting letter frequencies to letter scores.)
--pamdistance With the --pam option, you can specify a different integer distance from 1 to 500. This can be overridden with the --score-file option, below. The distance-1 transition/transversion joint probability matrix for DNA is given below:
   A    C    G    T   
A  .990 .002 .006 .002
C  .002 .990 .002 .006
G  .006 .002 .990 .002
T  .002 .006 .002 .990
              
The target probabilities are derived from the distance-250 PAM matrix for proteins, and from a distance-1 transition/transversion matrix for DNA.
--score-filescore file The --score-file option causes a score file (in BLAST format) to be read and used instead of the built-in PAM (for proteins) or transition/transversion (for DNA) score file. The target probabilities for letters are then derived from the score file. Several score files are provided (including BLOSUM62) in directory mhmm/data. Other, user-provided score files may be specified as well, as long as they are in the same format.