mhmmscan [options] <MHMM file> <sequence file>
mhmmscan searches a sequence database using a motif-based hidden Markov model (HMM) of the kind produced by mhmm. This program is similar to mhmms, except that mhmmscan
Each sequence-vs-model match is assigned an E-value, and matches that score below a user-specified E-value threshold are printed in order of increasing E-value.
mhmmscan has two modes of computing match scores:
In p-value score mode, the search sequence can be thought of as consisting of three steps:
The three parameters p-value threshold, maxgap and threshold are described in more detail under "Options:", below.
In log-odds score mode, the search can be thought of as consisting of two steps:
The threshold parameter is described in more detail under "Options:", below.
In order for E-values to be computed by
mhmmscan
, at least 100 matches must be found. If there
are too few sequences in the database, or if certain other options
are made to stringent (see Options, below), too few matches may
exist for E-values to be computed. In this case, the results
are sorted by match score, the E-value column is set to
"NaN" and all matches are printed.
A file containing a motif-based HMM.
A file containing FASTA formatted sequences. If the filename is given as '-' then the sequences wil be read from standard input.
The MHMM scan results are written to standard output.
Option | Parameter | Description | Default Behaviour |
---|---|---|---|
General Options | |||
--gff | filename | Produce an output called filename in GFF2 format. | |
--p-thresh | p-value threshold | The --p-thresh option activates
p-value score mode, motif match scores are converted to
their p-values. They are then converted to bit scores as follows:
S = -log2(p/T)
where S is the bit score of the hit, p
is the p-value of the log-odds score, and T is the
p-value threshold. In this way, only hits more significant
than the p-value threshold get positive scores. The
p-value threshold, T, must be in the range
0 < T ≤ 1. This mode of scoring automatically activates
the -motif-scoring feature (described
below under "Advanced Options:") so that partial motif hits are
disallowed.
Note
|
|
--max-gap | max gap | The --max-gap option allows you to
specify the longest gap allowed before two local matches will be
split. Matches separated by a gap longer than
max gap will be split into two
separate matches. This switch automatically activates the
--zselo option (described in the
Advanced Options) so that gap emission scores are set to zero.
It also sets the --min-score threshold
to a small number (1e-6), and sets
--gap-open and
--gap-extend (described in the
Advanced Options) to T/L , where T is
the --min-score threshold, and
L is max gap.
Note
|
|
--e-thresh | threshold | mhmmscan prints the matches with E-values below the given threshold. If E-values cannot be computed, all matches are printed. | The default threshold is 10. |
--fancy | The --fancy option turns on a more detailed output format that shows, in addition to the score for each sequence, the complete model-to-sequence match. For more details, see the documentation for mhmms. | ||
--width | width | Specify the width (in characters) of each line in the output. The description of each sequence, which is taken from the input FASTA file, will be truncated as necessary. | By default, the output width is 132 characters. |
--text | Print output as ASCII text to standard output. | Print output as HTML to standard output. This makes use of the program mhmm2html. | |
--nosort | Do not sort the output. This reduces the memory requirements. | ||
--bg-file | background file | Read background frequencies from background file. The file should be in MEME background file format. | The background letter distribution of the appropriate (DNA or protein) NCBI non-redundant database is used. |
--allow-weak-motifs | In p-value score mode, weak motifs are defined as ones
where the best possible hit has a p-value less than the
p-value threshold. Such motifs cannot contribute to a
match in p-value score mode. When the
--allow-weak-motifs option is supplied
the search will proceed, but the weak motifs will never appear
in any matches.
NoteThis switch only applies to p-value score mode. |
Any search containing weak motifs is rejected. | |
--blocksize | block size | Specify the number of letters that are read from each sequence at once. If your system has a lot of memory, then you can specify a larger number of letters to be read at once, and vice versa. | Reads in blocks of 107/N letters,
where N is the number of states in the model.
Specifying a verbosity of 2 or larger will cause the block size
to be output when the program runs. |
--synth | Generate random synthetic sequences if fewer than 10000 matches are found in the sequence database. These sequences will be searched and their match scores used to estimate the random score distribution. The sequences are generated using the background distribution specified using the --bg switch, above. If the background file contains a higher-order Markov model, the higher-order lines are used when generating synthetic sequences. | ||
--progress | n | Print to standard error a progress message after every n sequences. | |
--noheader | Do not include a header in the output. | Include a header in the output. | |
-notime | Do not include a running time or host name at the end of the output. | ||
--quiet | Combine the previous 2 flags and set the verbosity to 1. | ||
Advanced Options | |||
--zselo | Specifying the --zselo option causes the spacer emission log-odds scores to be set to zero. This prevents regions of unusual base/residue composition matching spacers well when the spacer emission probabilities are different than the background probabilities. It is particularly useful with DNA models. This option is enabled by the --maxgap option. | ||
--gap-open | cost | The --gap-open option causes all transitions into a spacer state to be assigned a log-odds score equal to cost. Together with the -gap-extend option, this allows you to specify an affine gap penalty function, overriding the gap penalty implicit in the model (transition probabilities into and out of gap states). This option is enabled by the --maxgap option. | |
--gap-extend | cost | The --gap-extend option causes all spacer self-loop log-odds scores to be set to cost. In addition, it causes all other transitions out of a spacer to be set to zero. Together with the --gap-open option, this allows you to specify an affine gap penalty function, overriding the gap penalty implicit in the model (self-loop transition probabilities of gap states). This option is enabled by the --maxgap option. | |
--min-score | min score | This option allows you to specify the threshold for the repeated match algorithm used by mhmmscan. Matches must have a score of at least min score to be detected. Matches containing internal regions with scores less than minus 'threshold' will be split and reported as two separate matches. This option is enabled by the --maxgap option. | |
--egcost | cost | Scale the expected cost of a random gap to be cost times the expected score of a random hit. The larger you set cost, the more gaps will be penalized. This can only be used in conjunction with --max-gap. This may not be used in conjunction with --min-score. This option is enabled by the --maxgap option. | By default, gap costs are essentially zero. |
--motif-scoring | Specifying the --motif-scoring option forces all matches to motifs to be complete. This prevents matches to motifs from overhanging the sequence ends. It also prevents matches from beginning (ending) anywhere but at the start (end) of a motif. This option is enabled by the --p-thresh option. | Matches can begin or end anywhere within a motif. | |
--pseudo-weight | beta | The weight on the pseudocount probabilities can be adjusted to any value ≥ 0 using the -pseudo-weight option. The smaller the weight, the less effect the pseudocount probabilities have, and the closer the adjusted probabilities will be to the emission probabilities in the model. | The pseudocount probabilities are weighted by beta = 10, and emission probabilities in the model by alpha = 20. (See the formula above for converting letter frequencies to letter scores.) |
--pam | distance |
With the --pam option, you can specify
a different integer distance from 1 to 500. This can be
overridden with the --score-file
option, below. The distance-1 transition/transversion
joint probability matrix for DNA is given below:
A C G T A .990 .002 .006 .002 C .002 .990 .002 .006 G .006 .002 .990 .002 T .002 .006 .002 .990 |
The target probabilities are derived from the distance-250 PAM matrix for proteins, and from a distance-1 transition/transversion matrix for DNA. |
--score-file | score file | The --score-file option causes a score file (in BLAST format) to be read and used instead of the built-in PAM (for proteins) or transition/transversion (for DNA) score file. The target probabilities for letters are then derived from the score file. Several score files are provided (including BLOSUM62) in directory mhmm/data. Other, user-provided score files may be specified as well, as long as they are in the same format. |