mhmms (unsupported)

Usage:

mhmms [options] <MHMM file> <sequence file>

Description

mhmms searches a sequence database using a motif-based hidden Markov model (HMM) of the kind produced by mhmm. Each sequence in the database is assigned an E-value, and the IDs and scores of sequences scoring below a given threshold are printed in sorted order.

The E-value of a given sequence is the number of sequences that would be expected, by chance, to match the given model as well as or better than this sequence in a random database of the same size as the given database. Scores are assigned using a local search algorithm; in other words, the algorithm finds the subsequence that matches a subset of model states with the highest log-odds score.

The emission probabilities in the model are converted to log-odds scores before performing the local search. This is done by combining pseudocount probabilities derived from a score matrix (see the --pam and --score-file options below) with the emission frequencies. You can control the relative weight placed on the emission probabilities versus the pseudocount probabilities (see --pseudo-weight below). The adjusted emission probabilities are then converted to odds by dividing by background probabilities (see --bg-file below). Finally, they are converted to log-odds scores by taking their logarithm.
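The conversion described in the paragraph above can be sketched in Python. This is a minimal illustration, not the mhmms source code: the function name `emission_log_odds`, the linear mixing formula `(alpha * f + beta * g) / (alpha + beta)`, the base-2 logarithm, and the example probabilities are all assumptions. Only the default weights alpha = 20 and beta = 10 come from the --pseudo-weight description.

```python
import math

def emission_log_odds(emission, pseudo, background, alpha=20.0, beta=10.0):
    """Convert one state's emission probabilities to log-odds scores.

    emission, pseudo, background: dicts mapping letter -> probability.
    ASSUMPTION: emission and pseudocount probabilities are mixed linearly
    with weights alpha and beta, and scores are in bits (log base 2).
    """
    scores = {}
    for letter, f in emission.items():
        # Blend the model's emission probability with the pseudocount probability.
        adjusted = (alpha * f + beta * pseudo[letter]) / (alpha + beta)
        # Divide by the background probability to get odds, then take the log.
        # math.log2 raises ValueError if adjusted is 0 (e.g. beta = 0 and f = 0).
        scores[letter] = math.log2(adjusted / background[letter])
    return scores

# Hypothetical values for a single DNA motif column.
emission = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
pseudo = {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2}
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

scores = emission_log_odds(emission, pseudo, background)
unsmoothed = emission_log_odds(emission, pseudo, background, beta=0.0)
```

Under these assumptions, setting beta to 0 leaves the model's emission probabilities unchanged, which matches the statement that a smaller pseudocount weight keeps the adjusted probabilities closer to those in the model.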

Transition probabilities are converted to log-odds scores by taking their logarithms before searching. This can be overridden and the gap scores can be set explicitly using the --gap-open and --gap-extend switches, below. This allows you to specify a single affine gap cost function for all spacers in the model.
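As a rough illustration of the affine gap cost, the sketch below scores a single pass through one spacer using the rules given for --gap-open and --gap-extend under Advanced Options: the transition into the spacer scores `gap_open`, each self-loop scores `gap_extend`, and the transition out scores zero. The function name `spacer_gap_score`, the assumption that a spacer emits at least one letter, and the example values are hypothetical and not taken from mhmms.

```python
def spacer_gap_score(length, gap_open, gap_extend):
    """Log-odds score for passing `length` letters through one spacer.

    ASSUMPTION: one transition into the spacer (scored gap_open),
    length - 1 self-loops (each scored gap_extend), and a zero-scored
    transition out of the spacer.
    """
    if length < 1:
        raise ValueError("a spacer is assumed to emit at least one letter")
    return gap_open + (length - 1) * gap_extend

# Hypothetical log-odds values; negative numbers penalize gaps.
one_letter = spacer_gap_score(1, gap_open=-5.0, gap_extend=-1.0)
four_letters = spacer_gap_score(4, gap_open=-5.0, gap_extend=-1.0)
```

Because the same two values apply to every spacer, the gap score depends only on the spacer length and not on which spacer in the model is used.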

Input

Motif-based HMM file

The filename of a motif-based hidden Markov model. If the filename is given as '-' then mhmms will attempt to read the HMM from standard input.

Sequence File

A file containing FASTA formatted sequences. If the filename is given as '-' then mhmms will attempt to read the sequence database from standard input.

Output

The MHMM scan results are written to standard output.
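One way to drive mhmms from a script is sketched below. It assembles a command line from options documented on this page and captures standard output. The helper names `build_mhmms_command` and `run_mhmms`, and the file name `model.mhmm`, are hypothetical. `run_mhmms` assumes an mhmms executable is available on the PATH and is not called here.

```python
import subprocess

def build_mhmms_command(hmm_file, sequence_file, e_thresh=None,
                        max_seqs=None, fancy=False, quiet=False):
    """Assemble an argument list for mhmms: options first, then the two files."""
    cmd = ["mhmms"]
    if e_thresh is not None:
        cmd += ["--e-thresh", str(e_thresh)]
    if max_seqs is not None:
        cmd += ["--maxseqs", str(max_seqs)]
    if fancy:
        cmd.append("--fancy")
    if quiet:
        cmd.append("--quiet")
    cmd += [hmm_file, sequence_file]
    return cmd

def run_mhmms(cmd):
    """Run the command and return the scan results written to standard output.

    ASSUMPTION: mhmms is installed and on the PATH.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

# '-' tells mhmms to read the sequence database from standard input.
cmd = build_mhmms_command("model.mhmm", "-", e_thresh=0.01, fancy=True)
```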

Options

Option Parameter Description Default Behaviour
General Options
--paths single|all This option determines how mhmms computes raw scores. With the single option, mhmms computes the Viterbi score, which is the log-odds score associated with the single most likely match between the sequence and the model. The all option yields the total log-odds score, which is the sum of the log-odds of all sequence-to-model matches. Default: Viterbi scoring is used, as if --paths single had been specified.
--global Use global scoring in the Viterbi or forward algorithm. Default: local scoring is used.
--maxseqs max seqs The maximum number of sequences to print. Default: there is no limit on the number of printed sequences.
--p-thresh p-value threshold The --p-thresh option activates p-value score mode, in which motif match scores are converted to their p-values. They are then converted to bit scores as follows:
S = -log2(p/T)
where S is the bit score of the hit, p is the p-value of the log-odds score, and T is the p-value threshold. In this way, only hits more significant than the p-value threshold get positive scores. The p-value threshold, T, must be in the range 0 < T ≤ 1. This mode of scoring automatically activates the --motif-scoring feature (described below under "Advanced Options") so that partial motif hits are disallowed.

Note

  • If the p-value threshold is too small, there may be few (or no) "hits" and, consequently, few (or no) matches. This may cause mhmms to be unable to compute match E-values, or to report no matches. Small values of the p-value threshold may also cause the reported E-values to be inaccurate. In this case, the E-values will always be too large (conservative). The proper value for the p-value threshold can only be determined by experimentation, since it depends on the number of motifs, the information content of the motifs and the value of maxgap.
  • If the p-value threshold is too large, the expected length of a match may be longer than most of the sequences in the database you are searching. This will prevent mhmms from being able to compute E-values. Very low values of the p-value threshold, when searching genomic DNA, tend to give high scores to low-complexity sequences and repeated elements.
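The bit-score formula given for --p-thresh, S = -log2(p/T), can be checked numerically with a short sketch. The function name `bit_score`, the check that p lies in (0, 1], and the example values are assumptions made for illustration. The range check on T follows the stated requirement 0 < T ≤ 1.

```python
import math

def bit_score(p_value, threshold):
    """Bit score S = -log2(p / T) for a motif hit in p-value score mode."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold T must satisfy 0 < T <= 1")
    # ASSUMPTION: a hit's p-value is a probability strictly greater than 0.
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must satisfy 0 < p <= 1")
    return -math.log2(p_value / threshold)

# Hypothetical hits scored against a threshold of T = 1e-4.
significant = bit_score(1e-6, 1e-4)   # p < T: positive score
borderline = bit_score(1e-4, 1e-4)    # p = T: score of zero
weak = bit_score(1e-3, 1e-4)          # p > T: negative score
```

Only the hit whose p-value is smaller than T receives a positive score, as the option description states.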
--both-strands Allow matches to occur on either DNA strand. The --both-strands option implies the --motif-scoring option. Default: motif matches are only found on the given strand.
--e-thresh threshold mhmms lists the sequences that have E-values below the given threshold. Default: the threshold is 10.
--fancy The --fancy option turns on a more detailed output format that shows, in addition to the score for each sequence, the complete model-to-sequence match. Here is an example of the fancy output format, showing a two-motif model matching a sequence of length 179:
                 *.................................................
                                                                   
170K_TRVPS     1  GAHLVPTKSGDADTYNANSDRTLCALLSELPLEKAVMVTYGGDDSLIAF    49

                 .........................FDWqKFAGtWH..............
                                           D+  F G+                
170K_TRVPS    50 PRGTQFVDPCPKLATKWNFECKIFKYDVPMFCGKFLLKTSSCYEFVPDPV    99

                 ..................................................
                                                                   
170K_TRVPS   100 KVLTKLGKKSIKDVQHLAEIYISLNDSNRALGNYMVVSKLSESVSDRYLY   149

                 ......GYCPEVKPI...............*
                         C+  K I                
170K_TRVPS   150 KGDSVHALCALWKHIKSFTALCTLLPRRKG    179
            
--width w Specify the width (in characters) of each line in the output. The description of each sequence, which is taken from the input FASTA file, will be truncated as necessary. Default: the output width is 132 characters.
--nosort Do not sort the output.
--bg-file background file Read background frequencies from background file. The file should be in MEME background file format. Default: the background letter distribution of the appropriate (DNA or protein) NCBI non-redundant database is used.
--allow-weak-motifs In p-value score mode, weak motifs are defined as ones where even the best possible hit has a p-value no smaller than the p-value threshold, so that no hit can receive a positive score. Such motifs cannot contribute to a match in p-value score mode. When the --allow-weak-motifs option is supplied, the search will proceed, but the weak motifs will never appear in any matches.

Note

This option only applies to p-value score mode.
Default: any search containing weak motifs is rejected.
--progress n Print a progress message to standard error after every n sequences.
--noheader Do not include a header in the output. Default: a header is included in the output.
--noparams Do not list the parameters at the end of the output.
--notime Do not include a running time or host name at the end of the output.
--quiet Combine the previous three flags (--noheader, --noparams and --notime) and set the verbosity to 1.
Advanced Options
--zselo Specifying the --zselo option causes the spacer emission log-odds scores to be set to zero. This prevents regions of unusual base/residue composition matching spacers well when the spacer emission probabilities are different than the background probabilities. It is particularly useful with DNA models.
--gap-open cost The --gap-open option causes all transitions into a spacer state to be assigned a log-odds score equal to cost. Together with the --gap-extend option, this allows you to specify an affine gap penalty function, overriding the gap penalty implicit in the model (transition probabilities into and out of gap states).
--gap-extend cost The --gap-extend option causes all spacer self-loop log-odds scores to be set to cost. In addition, it causes all other transitions out of a spacer to be set to zero. Together with the --gap-open option, this allows you to specify an affine gap penalty function, overriding the gap penalty implicit in the model (self-loop transition probabilities of gap states).
--motif-scoring Specifying the --motif-scoring option forces all matches to motifs to be complete. This prevents matches to motifs from overhanging the sequence ends. It also prevents matches from beginning (ending) anywhere but at the start (end) of a motif. This option is enabled by the --p-thresh option. Default: matches can begin or end anywhere within a motif.
--pseudo-weight beta The weight on the pseudocount probabilities can be adjusted to any value ≥ 0 using the --pseudo-weight option. The smaller the weight, the less effect the pseudocount probabilities have, and the closer the adjusted probabilities will be to the emission probabilities in the model. (See the description above of how emission probabilities are converted to log-odds scores.) Default: the pseudocount probabilities are weighted by beta = 10, and the emission probabilities in the model by alpha = 20.
--pam distance With the --pam option, you can specify an integer distance from 1 to 500 to use in place of the default. This can be overridden with the --score-file option, below. The distance-1 transition/transversion joint probability matrix for DNA is given below:
   A    C    G    T   
A  .990 .002 .006 .002
C  .002 .990 .002 .006
G  .006 .002 .990 .002
T  .002 .006 .002 .990
              
Default: the target probabilities are derived from the distance-250 PAM matrix for proteins, and from a distance-1 transition/transversion matrix for DNA.
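To give a feel for how a larger --pam distance changes the DNA matrix, the sketch below raises the distance-1 matrix shown above to a power. This is only an illustration under stated assumptions: that each row can be read as substitution probabilities (the rows sum to 1), and that a distance-n matrix is the n-th power of the distance-1 matrix, as in the usual PAM construction. Whether mhmms computes its matrices exactly this way is not stated on this page. The names `mat_mul` and `matrix_at_distance` are hypothetical.

```python
# Distance-1 transition/transversion matrix for DNA, rows and columns in A, C, G, T order.
DNA_DISTANCE_1 = [
    [0.990, 0.002, 0.006, 0.002],
    [0.002, 0.990, 0.002, 0.006],
    [0.006, 0.002, 0.990, 0.002],
    [0.002, 0.006, 0.002, 0.990],
]

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matrix_at_distance(base, distance):
    """ASSUMPTION: the distance-n matrix is the n-th power of the distance-1 matrix."""
    if distance < 1:
        raise ValueError("distance must be a positive integer")
    result = base
    for _ in range(distance - 1):
        result = mat_mul(result, base)
    return result

m10 = matrix_at_distance(DNA_DISTANCE_1, 10)
```

Under these assumptions, the diagonal entries shrink as the distance grows, and transitions (A↔G, C↔T) stay more probable than transversions.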
--score-file score file The --score-file option causes a score file (in BLAST format) to be read and used instead of the built-in PAM (for proteins) or transition/transversion (for DNA) score file. The target probabilities for letters are then derived from the score file. Several score files are provided (including BLOSUM62) in directory mhmm/data. Other, user-provided score files may be specified as well, as long as they are in the same format.