motiph

Scan a DNA multiple sequence alignment for occurrences of given motifs, taking into account the phylogenetic tree relating the sequences. Any sequences containing nothing but gaps are removed from the alignment. motiph supports several evolutionary models that it uses to compute the score for aligning the motif to each possible region of the multiple alignment. Read about the algorithm for more details.

Alignment File

A DNA multiple alignment in ClustalW format. Alternatively, if the --list option is used, this file may contain a list of alignment files. In this case each of the alignment files will be scanned for each of the motifs.

Tree File

A phylogenetic tree in Phylip Newick format. This tree may contain additional species not represented in the alignment.

Motif File

A list of MEME formatted motifs.

Output is created as a table in HTML (.html) and text (.txt) formats. Each line in the table represents a match of the motif to a "window" in the alignment. Matches to the motif and its reverse complement are listed separately. The output columns contain the motif name, the alignment name, the start position of the match in the alignment, the end position of the match, the match score, the match p-value, the match q-value, and the matched portion of the first sequence in the alignment, respectively.

Output is also created in GFF format (.gff) and in CISML (.xml) format.

Option Parameter Description Default Behaviour

General Options

--bg mutation rate The mutation rate for sites in the background model. A background mutation rate of 1 is used.

--bgfile background file Read background frequencies from background file. The file should be in MEME background file format. The default is to use frequencies embedded in the application from the non-redundant database. If the argument is the keyword motif-file, then the frequencies will be taken from the motif file.

--fg mutation rate The mutation rate for sites in the foreground model(s). A forground mutation rate of 1 is used.

--gap

skip|fixed|wildcard|minimum|model

Specifies the gap handling strategy.

Value	Name
skip	Skip those sites where any position in the alignment window contains a gap.
fixed	Sites containing gaps are assigned a fixed score, specified by --gap-cost.
wildcard	The gap character matches any base, and the score is the product of the corresponding probabilities.
minimum	The gap character is assigned the score corresponding to the least likely letter at the given position.
model	Use model-specific gap handling. Currently, the only model that supports this is f81_gap.

Behaves as if --gap skip had been specified.

--gap-cost cost Specifies the costs for gaps when using the fixed gap handling strategy. A gap cost of zero is used.

--hb Use the Halpern-Bruno modification to the evolutionary model.

--list Treat the second required input as a list of alignments, rather than a single alignment. The second required argument is treated as a single alignment.

--max-stored-scores storage Set the maximum number of scores that will be stored. Keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped. In this case, the list of reported motifs may be incomplete and the q-value calculation will be approximate. The maximum number of stored matches is set to 100,000.

--model

single|average|jc|k2|f81|f84|hky|tn

The evolutionary model to use.

Value	Name	Description
single	Single Score	score first sequence: compute standard log-odds score of first sequence in the alignment; ignores tree but does NOT remove gaps.
average	Average Score	compute average of standard log-odds score of aligned sites.
jc	Jukes-Cantor	equilibrium base frequencies are all 1/4; the only free parameter is the mutation rate.
k2	Kimura 2-parameter	equilibrium base frequencies are all 1/4; the free parameters are the mutation rate and the transition/transversion rate ratio.
f81	Felsenstein 1981	equilibrium base frequencies are taken from the alignment; the only free parameter is the mutation rate.
f84	Felsenstein 1984	equilibrium base frequencies are taken from the alignment; the free parameters are the mutation rate and the transition/transversion rate ratio. The ratio of purine-purine to pyrimidine->pyrimidine transitions is assumed to be 1.
hky	Hasegawa-Kishino-Yano	equilibrium base frequencies are taken from the alignment; the free parameters are the mutation rate and the transition/transversion rate ratio. The ratio of purine-purine to pyrimidine-pyrimidine transitions is assumed to be equal to the ratio of purines to pyrimidines.
tn	Tamura-Nei	equilibrium base frequencies are taken from the alignment; the free parameters are the mutation rate, the transition/transversion rate ratio, and the ratio of purine-purine transitions to pyrimidine-pyrimidine transitions.

A description of the f81 model is available in chapter 13 of Statistical Methods in Bioinformatics by Ewens and Grant. The other models are described in chapters 9 and 13 of Inferring Phylogenies by Felsenstein.

Behaves as if --model f81 was specified.

--motif ID Use only the motif identified by ID. This option may be repeated. All motifs are used.

--no-qvalue Do not compute a q-value for each p-value. The q-value calculation is that of Benjamini and Hochberg (1995). The program computes q-values.

--norc Do not score the reverse complement DNA strand. Both strands are scored.

--o dir Create a folder called dir and write output files in it. This option is not compatible with --oc as only one output folder is allowed. The program behaves as if --oc motiph_out had been specified.

--oc dir Create a folder called dir but if it already exists allow overwriting the contents. This option is not compatible with --o as only one output folder is allowed. The program behaves as if --oc motiph_out had been specified.

--output-pthresh p-value threshold The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. If both the --output-pthresh and --output-qthresh options appear on the command line, whichever appears later on the command line will be applied. The p-value threshold is set to 1e^-4.

--output-qthresh q-value threshold The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. No q-value threshold is applied.

--pseudocounts pseudocount A pseudocount to be added to each count in the motif matrix, weighted by the background frequencies for the nucleotides (Dirichlet prior), before converting the motif to probabilities. A pseudocount of 0.1 is added.

--pur-pyr ratio The ratio of the purine transition rate to pyrimidine transition rate. This parameter is used by the Tamura-nei model. The ratio is set to 1.0.

--seed n Seed for random number generator. The seed is set to the current time.

--text Limits output to plain text sent to standard out. This mode allows the program to search an arbitrarily large database, because results are not stored in memory.

--transition-transversion ratio The ratio of the transition rate to the transversion rate. This parameter is used by the Kimura 2-parameter, F84, HKY, and Tamura-nei models. The ratio is set to 0.5.

--verbosity 1|2|3|4|5 A number that regulates the verbosity level of the output information messages. If set to 1 (quiet) then it will only output error messages whereas the other extreme 5 (dump) outputs lots of mostly useless information. The verbosity level is set to 2 (normal).

The algorithm is as follows:

A = a multiple alignment
T = a phylogenetic tree
M = a motif frequency matrix
B = the background frequencies of the four bases in A
U = a background evolutionary model with equilibrium frequencies B

// Build evolutionary models.
for j in positions of M {
  E[j] = an evolutionary model with equilibrium frequencies 
    from the jth position of M
}

// Build matrix of all possible scores
for i in all possible alignment columns {
  for j in positions of M {
    foreground = site_likelihood(E[j], A[:][i], T)
    background = site_likelihood(U, A[:][i] T)
    score_matrix[i][j] = log_2(foreground / background)
  }    
}
// Calculate p-values of all possible scores for motif sized windows
// windows in the alignment.
pvalues = get_pv_lookup(score_matrix, B)

// Calculate score for each motif sized window in the alignment.
for i in columns of A {
  score = 0
  for j in positions of M {
    index = calculate the index of A[:][i + j] in the array
      of all possible alignment columns
    score = score + score_matrix[index][j]
  }    
  print pvalues[score]
}

The core of the algorithm is a routine (site_likelihood) for scoring a particular column of the multiple alignment using a given evolutionary model and a given phylogenetic tree. The alignment site provides the observed nucleotides at the base of the tree, and we sum over the unobserved nucleotides in the rest of the tree, conditioning on the equilibrium distribution from the evolutionary model at the root of the tree (Felsenstein Pruning Algorithm). The tree must be a maximum likelihood tree, of the kind generated by DNAml from Phylip or by FastDNAml. Branch lengths in the tree are converted to conditional probabilities using the specified evolutionary model.

The MEME Suite

Motif-based sequence analysis tools

Usage:

Description

Input

Alignment File

Tree File

Motif File

Output

Options

Algorithm