mhmms [options] <MHMM file> <sequence file>
mhmms searches a sequence database using a motif-based hidden Markov model (HMM) of the kind produced by mhmm. Each sequence in the database is assigned an E-value, and the IDs and scores of sequences whose E-values fall below a given threshold are printed in sorted order.
The E-value of a sequence is the number of sequences that would be expected, by chance, to match the model as well as or better than this sequence in a random database of the same size as the given database. Scores are assigned using a local search algorithm; in other words, the algorithm finds the subsequence that matches a subset of model states with the highest log-odds score.
The emission probabilities in the model are converted to log-odds scores before performing the local search. This is done by combining pseudocount probabilities derived from a score matrix (see the --pam and --score-file options below) with the emission frequencies. You can control the relative weight placed on the emission probabilities versus the pseudocount probabilities (see --pseudo-weight below). The adjusted emission probabilities are then converted to odds by dividing by background probabilities (see --bg-file below). Finally, they are converted to log-odds scores by taking their logarithm.
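As a rough illustration of the conversion just described, the following Python sketch blends one state's emission frequencies with pseudocount probabilities and converts the result to log-odds scores. The weighted-average blend, the base-2 logarithm, and all names (`emission_log_odds`, `alpha`, `beta`) are assumptions made for illustration. The weights 20 and 10 are the defaults listed under --pseudo-weight, but the exact formula used by mhmms is not stated in this document.

```python
import math

def emission_log_odds(emission, pseudo, background, alpha=20.0, beta=10.0):
    """Return per-letter log-odds scores (in bits) for one model state.

    emission, pseudo, background: dicts mapping letter -> probability.
    alpha weights the model's emission probabilities and beta weights the
    pseudocount probabilities (an assumed weighted-average blend).
    """
    scores = {}
    for letter, bg in background.items():
        # Step 1: combine emission and pseudocount probabilities.
        adjusted = (alpha * emission[letter] + beta * pseudo[letter]) / (alpha + beta)
        # Step 2: divide by the background probability, then take the log.
        scores[letter] = math.log2(adjusted / bg)
    return scores

# Hypothetical DNA state that strongly prefers A, with a uniform background.
emission = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
pseudo = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
scores = emission_log_odds(emission, pseudo, background)
```

In this sketch, letters that the state favors over the background receive positive scores and disfavored letters receive negative scores. Setting `beta` to 0 reduces the result to the plain log-odds of the model's own emission probabilities.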
Transition probabilities are converted to log-odds scores by taking their logarithms before searching. This can be overridden and the gap scores can be set explicitly using the --gap-open and --gap-extend switches, below. This allows you to specify a single affine gap cost function for all spacers in the model.
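To make the affine gap function concrete, here is a minimal Python sketch of the score contributed by the transitions of a single spacer when both --gap-open and --gap-extend are set. It assumes that a visit to a spacer consists of one transition in, one self-loop for each additional letter, and a zero-score transition out, and it ignores spacer emission scores. The function name and the example values are hypothetical, and the sign convention for cost (negative numbers as penalties) is an assumption.

```python
def spacer_gap_score(length, gap_open, gap_extend):
    """Transition score for one spacer visit that emits `length` letters.

    Assumes one transition into the spacer (gap_open), length - 1
    self-loops (gap_extend each), and a zero-score transition out.
    """
    if length < 1:
        raise ValueError("a spacer visit is assumed to emit at least one letter")
    return gap_open + (length - 1) * gap_extend

# Hypothetical costs, written as negative log-odds scores.
short_gap = spacer_gap_score(1, gap_open=-5.0, gap_extend=-1.0)
long_gap = spacer_gap_score(4, gap_open=-5.0, gap_extend=-1.0)
```

Because the same two values apply to every spacer, the gap score under these assumptions depends only on the spacer length, not on which spacer in the model is being traversed.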
The filename of a motif-based hidden Markov model. If the filename is given as '-' then mhmms will attempt to read the HMM from standard input.
A file containing FASTA formatted sequences. If the filename is given as '-' then mhmms will attempt to read the sequence database from standard input.
The MHMM scan results are written to standard output.
Option | Parameter | Description | Default Behaviour |
---|---|---|---|
General Options | |||
--paths | single|all | This option determines how mhmms computes raw scores. With the single option, mhmms computes the Viterbi score, which is the log-odds score associated with the single most likely match between the sequence and the model. The all option yields the total log-odds score, which is the sum of the log-odds of all sequence-to-model matches. | Viterbi scoring is used as if --paths single had been specified. |
--global | | Uses global scoring in the Viterbi or forward algorithm. | Uses local scoring. |
--maxseqs | max seqs | The maximum number of sequences to print. | There is no limit on the number of printed sequences. |
--p-thresh | p-value threshold | The --p-thresh option activates p-value score mode, in which motif match scores are converted to their p-values. They are then converted to bit scores as follows: S = -log2(p/T), where S is the bit score of the hit, p is the p-value of the log-odds score, and T is the p-value threshold. In this way, only hits more significant than the p-value threshold get positive scores. The p-value threshold, T, must be in the range 0 < T ≤ 1. This mode of scoring automatically activates the --motif-scoring feature (described below under "Advanced Options") so that partial motif hits are disallowed. | |
--both-strands | | This allows matches to occur on either DNA strand. The --both-strands option implies the --motif-scoring option. | Motif matches are only found on the given strand. |
--e-thresh | threshold | mhmms lists the sequences that have E-values below the given threshold. | The default threshold is 10. |
--fancy | | The --fancy option turns on a more detailed output format that shows, in addition to the score for each sequence, the complete model-to-sequence match. Here is an example of the fancy output format, showing a two-motif model matching a sequence of length 179:
              *.................................................
170K_TRVPS   1 GAHLVPTKSGDADTYNANSDRTLCALLSELPLEKAVMVTYGGDDSLIAF 49
               .........................FDWqKFAGtWH..............
                                         D+  F G+
170K_TRVPS  50 PRGTQFVDPCPKLATKWNFECKIFKYDVPMFCGKFLLKTSSCYEFVPDPV 99
               ..................................................
170K_TRVPS 100 KVLTKLGKKSIKDVQHLAEIYISLNDSNRALGNYMVVSKLSESVSDRYLY 149
               ......GYCPEVKPI...............*
                       C+  K I
170K_TRVPS 150 KGDSVHALCALWKHIKSFTALCTLLPRRKG 179
| |
--width | w | Specify the width (in characters) of each line in the output. The description of each sequence, which is taken from the input FASTA file, will be truncated as necessary. | The output width is 132 characters. |
--nosort | | Do not sort the output. | |
--bg-file | background file | Read background frequencies from background file. The file should be in MEME background file format. | The background letter distribution of the appropriate (DNA or protein) NCBI non-redundant database is used. |
--allow-weak-motifs | | In p-value score mode, weak motifs are defined as ones where the best possible hit has a p-value that is not less than the p-value threshold. Such motifs cannot contribute to a match in p-value score mode. When the --allow-weak-motifs option is supplied the search will proceed, but the weak motifs will never appear in any matches. Note: This option only applies to p-value score mode. | Any search containing weak motifs is rejected. |
--progress | n | Print to standard error a progress message after every n sequences. | |
--noheader | | Do not include a header in the output. | Include a header in the output. |
--noparams | | Do not list the parameters at the end of the output. | |
--notime | | Do not include a running time or host name at the end of the output. | |
--quiet | | Combine the previous three flags and set the verbosity to 1. | |
Advanced Options | |||
--zselo | | Specifying the --zselo option causes the spacer emission log-odds scores to be set to zero. This prevents regions of unusual base/residue composition from matching spacers well when the spacer emission probabilities differ from the background probabilities. It is particularly useful with DNA models. | |
--gap-open | cost | The --gap-open option causes all transitions into a spacer state to be assigned a log-odds score equal to cost. Together with the --gap-extend option, this allows you to specify an affine gap penalty function, overriding the gap penalty implicit in the model (transition probabilities into and out of gap states). | |
--gap-extend | cost | The --gap-extend option causes all spacer self-loop log-odds scores to be set to cost. In addition, it causes all other transitions out of a spacer to be set to zero. Together with the --gap-open option, this allows you to specify an affine gap penalty function, overriding the gap penalty implicit in the model (self-loop transition probabilities of gap states). | |
--motif-scoring | | Specifying the --motif-scoring option forces all matches to motifs to be complete. This prevents matches to motifs from overhanging the sequence ends. It also prevents matches from beginning (ending) anywhere but at the start (end) of a motif. This option is enabled by the --p-thresh option. | Matches can begin or end anywhere within a motif. |
--pseudo-weight | beta | The weight on the pseudocount probabilities can be adjusted to any value ≥ 0 using the --pseudo-weight option. The smaller the weight, the less effect the pseudocount probabilities have, and the closer the adjusted probabilities will be to the emission probabilities in the model. | The pseudocount probabilities are weighted by beta = 10, and emission probabilities in the model by alpha = 20. (See the description above of how emission probabilities are converted to log-odds scores.) |
--pam | distance | With the --pam option, you can specify a different integer distance from 1 to 500. This can be overridden with the --score-file option, below. The distance-1 transition/transversion joint probability matrix for DNA is given below:
     A    C    G    T
A .990 .002 .006 .002
C .002 .990 .002 .006
G .006 .002 .990 .002
T .002 .006 .002 .990
| The target probabilities are derived from the distance-250 PAM matrix for proteins, and from a distance-1 transition/transversion matrix for DNA. |
--score-file | score file | The --score-file option causes a score file (in BLAST format) to be read and used instead of the built-in PAM (for proteins) or transition/transversion (for DNA) score file. The target probabilities for letters are then derived from the score file. Several score files are provided (including BLOSUM62) in directory mhmm/data. Other, user-provided score files may be specified as well, as long as they are in the same format. |
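
The bit-score conversion used in p-value score mode (see --p-thresh above) can be illustrated with a short Python sketch. The function name and the example p-values are hypothetical; only the formula S = -log2(p/T) and the range 0 < T ≤ 1 come from the option description.

```python
import math

def bit_score(p_value, threshold):
    """Convert a motif hit's p-value to a bit score, S = -log2(p / T)."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold T must satisfy 0 < T <= 1")
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must satisfy 0 < p <= 1")
    return -math.log2(p_value / threshold)

T = 1e-4
strong = bit_score(1e-6, T)  # more significant than T: positive score
equal = bit_score(1e-4, T)   # exactly at T: score of zero
weak = bit_score(1e-2, T)    # less significant than T: negative score
```

Only hits with p < T receive a positive score, which is why a motif whose best possible hit cannot reach the threshold can never contribute to a match in this mode.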