mhmmscan Format

The mhmmscan/MCAST output has up to three sections containing your search results:

Database Search Results
Alignments
Motif Diagrams

All three sections are always present in MCAST output. The last two sections are present in mhmmscan output only if the -fancy option was specified.

The results in all three sections are sorted by increasing E-value if possible, or by decreasing match score if E-values could not be computed.
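The ordering rule above can be sketched in a few lines of Python. This is only an illustration: the `Match` record and the `sort_matches` function are hypothetical names, not part of Meta-MEME, and the sketch assumes that E-values are available either for every match or for none.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Match:
    """Hypothetical record for one match; not a Meta-MEME data structure."""
    seq_id: str
    score: float
    e_value: Optional[float] = None  # None when no E-value could be computed


def sort_matches(matches: List[Match]) -> List[Match]:
    """Sort by increasing E-value if possible, else by decreasing score."""
    # Assumption: E-values are present for all matches or for none of them.
    if matches and all(m.e_value is not None for m in matches):
        return sorted(matches, key=lambda m: m.e_value)
    return sorted(matches, key=lambda m: m.score, reverse=True)


# Made-up example data.
with_e = [Match("seqA", 40.1, 3.0e-4), Match("seqB", 55.0, 1.5e-7)]
without_e = [Match("seqA", 40.1), Match("seqB", 55.0), Match("seqC", 12.3)]

sorted_with_e = [m.seq_id for m in sort_matches(with_e)]
sorted_without_e = [m.seq_id for m in sort_matches(without_e)]
```

In both example lists the best match comes first: the smallest E-value in the first case, the largest score in the second.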

Database Search Results

The "Database Search Results" section contains one line for each match found, with the following fields:

ID - The sequence identifier, as given in the database file.
E-value - The E-value, which is the total number of matches that you would expect with match scores as good as this match if the database contained only sequences unrelated to the query. Thus, a small E-value indicates a good match.
Score - The match score, which is the sum of the scores for hits, minus the penalties for gaps.
Hits - The number of hits in the match.
Span - The length of the match, measured from the start of the first hit to the end of the last hit.
Start - The position in the sequence where the match begins.
End - The position in the sequence where the match ends.
Length - The length of the sequence that contains the match.
Description - The sequence descriptor. This description is taken from the FASTA database file, and is truncated so that the output fits easily on one line.

Alignments

Each alignment lists the sequence identifier, match E-value and log-odds score along the left. On the right, it shows the alignment of the match with the sequence in groups of four segments. An example segment from an alignment is given below, followed by a description of what each line of the segment means. (The example shows p-value score mode. The row of p-values would be replaced by log-odds scores in log-odds score mode. If --motif-scoring is not on, the row of p-values or scores is absent.)

hb_P1_element
1.5e-07
55.02

                             2.4e-04                           2.4e-04            1.3e-04
                             *_____+3__*                       *____-2__*         *___+1_*
                             TTTTTTATGCG.......................TTTTATGACT.........CTAATCCG..................................
                              TTTTTAT+ +                       TTTTAT A T         +TAATC+G
          220 CGGAACATTAAAATGATTTTTATTTCTATGCTAAATCTGTTGTATTTACTTTTATAAATTTAATGTGTTTAATCTGTTCACATTTTTAAATACTTCGTATGCTATCNNNN     329

The bottom-most line in each segment contains 50 letters from the target database sequence, flanked by that segment's start and end locations within the entire sequence. Thus, the first segment would be flanked by "1" and "49", the second by "50" and "99", etc.
Aligned above these 50-letter segments are the motifs corresponding to the hits in the match. The motifs are labeled with numbers in the order they appear in the query. A plus or minus sign preceding the motif number indicates that the hit occurs on the given strand (+) or on the reverse complement (-) of the DNA sequence in the database. Each position within a motif region is indicated by a letter, and each gap position is indicated by a period.
In between the sequence segment and the corresponding match positions is a line that indicates the degree of match between the motifs and the sequence. If the letter in the motif with the largest log-odds score appears in the sequence, then the match row contains that letter. If the sequence letter does not have the largest log-odds score, but does have a positive log-odds score, then the match row contains a plus sign. Otherwise, the match row is empty.
The top-most row of each segment shows the p-value (or log-odds score) of each hit aligned above the start of the hit, depending on the score mode.
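The rule for the match row can be written out as a short Python sketch. The function names and the log-odds values below are invented for illustration, and the handling of ties and of letters that have no score in a column (such as N) is an assumption, not documented behavior.

```python
from typing import Dict, List


def match_symbol(column_scores: Dict[str, float], seq_letter: str) -> str:
    """Return the match-row character for one motif column."""
    # Letter with the largest log-odds score in this column.
    # Assumption: on a tie, the first such letter wins.
    best_letter = max(column_scores, key=column_scores.get)
    if seq_letter == best_letter:
        return seq_letter
    # Assumption: letters missing from the column are treated as score 0.
    if column_scores.get(seq_letter, 0.0) > 0:
        return "+"
    return " "


def match_row(motif_scores: List[Dict[str, float]], sequence: str) -> str:
    """Build the match row for a hit aligned to `sequence`."""
    return "".join(match_symbol(col, s) for col, s in zip(motif_scores, sequence))


# Hypothetical log-odds scores for a three-column motif.
motif = [
    {"A": 1.2, "C": -0.8, "G": 0.3, "T": -1.5},
    {"A": -1.0, "C": 0.4, "G": -1.0, "T": 1.0},
    {"A": -0.7, "C": -0.9, "G": 0.9, "T": -0.6},
]

row_exact = match_row(motif, "ATG")
row_partial = match_row(motif, "GCA")
```

Here `row_exact` reproduces the sequence letters because each one has the largest score in its column, while `row_partial` has plus signs in the first two columns (positive but not largest scores) and a blank in the third.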

Motif Diagrams

The motif diagrams section shows the matches in schematic format. For each match, the two columns on the right show the sequence identifier and the match E-value. The left side shows the positions and spacings of the hits making up the match. Hits are labeled with numbers corresponding to the order in which the motifs were given in the query. A plus or minus sign preceding the motif number indicates that the hit occurs on the given strand (+) or on the reverse complement (-) of the DNA sequence in the database.

Log-odds Scores

The log-odds scores for each motif column are created using prior information on the letters appearing in alignment columns. The prior information is the target frequencies [Karlin,S. and Altschul,S.F., PNAS USA , 87, 2264-2268] implicit in a scoring matrix. Meta-MEME can read a user-specified scoring matrix (in the same format as used by the BLAST family of programs) from a file or generate a PAM matrix. By default, PAM 250 is used for proteins, and PAM 1 is used for DNA. For DNA, the "PAM 1" frequency matrix is

.990 .002 .006 .002
.002 .990 .002 .006
.006 .002 .990 .002
.002 .006 .002 .990

Meta-MEME calculates the target frequencies q_ij = p_i p_j exp(L s_ij) from the scoring matrix s_ij and the background letter frequencies p_i by finding the value of L that makes the q_ij sum to one. These target frequencies are then used to create pseudo-frequencies that are added to the emission frequencies of the column, following the approach of [Henikoff,S. and Henikoff,J.G., JMB, 243, 574-578]. The pseudo-frequency for the i-th letter is computed as g_i = sum_{j in alphabet} (f_j q_ij / p_j), where f_j is the emission frequency of letter j in the column.
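As a rough illustration of this step, the Python sketch below finds L by bisection and then computes the pseudo-frequencies. It is not Meta-MEME's implementation: the function names, the +1/-1 scoring matrix, the uniform background and the emission frequencies are all assumptions chosen so that the result can be checked by hand (for this toy matrix, L = ln 3).

```python
import math
from typing import Dict

Matrix = Dict[str, Dict[str, float]]


def total_q(lam: float, s: Matrix, p: Dict[str, float]) -> float:
    """Sum of p_i p_j exp(lam * s_ij) over all letter pairs."""
    return sum(p[i] * p[j] * math.exp(lam * s[i][j]) for i in p for j in p)


def solve_lambda(s: Matrix, p: Dict[str, float], tol: float = 1e-12) -> float:
    """Find the positive L for which the q_ij sum to one (bisection)."""
    lo, hi = 1e-6, 1.0
    if total_q(lo, s, p) >= 1.0:
        raise ValueError("expected score must be negative for a positive L")
    while total_q(hi, s, p) < 1.0:  # grow the bracket until the sum exceeds 1
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if total_q(mid, s, p) < 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def pseudo_frequencies(f: Dict[str, float], q: Matrix,
                       p: Dict[str, float]) -> Dict[str, float]:
    """g_i = sum_j f_j q_ij / p_j."""
    return {i: sum(f[j] * q[i][j] / p[j] for j in p) for i in p}


# Assumed toy inputs: uniform background, match = +1, mismatch = -1.
letters = "ACGT"
p = {a: 0.25 for a in letters}
s = {a: {b: (1.0 if a == b else -1.0) for b in letters} for a in letters}

lam = solve_lambda(s, p)
q = {a: {b: p[a] * p[b] * math.exp(lam * s[a][b]) for b in letters}
     for a in letters}

f = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}  # made-up emission frequencies
g = pseudo_frequencies(f, q, p)
```

With these inputs the pseudo-frequencies sum to one and pull the column toward the background: g_A is 0.55 and each of the other three letters gets 0.15.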

The pseudo-frequencies, g_i, are then combined with the emission frequencies, f_i, to give the frequency estimates

Q_i = (alpha f_i + beta g_i) / (alpha + beta).

Finally, the log-odds score for a letter in the motif column is computed by dividing by the background frequency of the letter and taking the logarithm,

S_i = log(Q_i / p_i).

In general, alpha should be proportional to the amount of independent information in the emission frequencies. We have set it to the constant 20. The parameter beta is arbitrary and controls the relative importance of prior information. We set it to the constant 10.
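The two formulas above, with alpha = 20 and beta = 10, can be sketched as follows. The function names and the input values are illustrative assumptions, and the natural logarithm is used only because the text does not state the base.

```python
import math
from typing import Dict

ALPHA = 20.0  # weight of the emission frequencies (constant from the text)
BETA = 10.0   # weight of the pseudo-frequencies (constant from the text)


def frequency_estimates(f: Dict[str, float], g: Dict[str, float],
                        alpha: float = ALPHA,
                        beta: float = BETA) -> Dict[str, float]:
    """Q_i = (alpha f_i + beta g_i) / (alpha + beta)."""
    return {a: (alpha * f[a] + beta * g[a]) / (alpha + beta) for a in f}


def log_odds(big_q: Dict[str, float], p: Dict[str, float]) -> Dict[str, float]:
    """S_i = log(Q_i / p_i); natural log is an assumption here."""
    return {a: math.log(big_q[a] / p[a]) for a in big_q}


# Made-up inputs for one motif column.
p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # background frequencies
f = {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10}  # emission frequencies
g = {"A": 0.55, "C": 0.15, "G": 0.15, "T": 0.15}  # pseudo-frequencies

Q = frequency_estimates(f, g)
S = log_odds(Q, p)
```

For this column Q_A = (20 * 0.7 + 10 * 0.55) / 30 = 0.65, so A scores log(2.6), which is positive, while the three letters that are rarer than background receive negative scores.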

Our method is essentially that used in PSI-BLAST [Altschul,S.F. et al., NAR, 25:17, 3389-3402] without

1) sequence weighting, and
2) scaling for the amount of independent information (alpha).

To do 1) and 2) correctly would require having and using alignment information rather than emission frequencies as the starting point.
