Usage:

ama [options] <motif file> <sequence file> [<background file>]

Description

The name AMA stands for "Average Motif Affinity". The program scores a set of sequences given a binding motif, treating each position in the sequence as a possible binding event. The score is calculated by averaging the likelihood ratio scores for all feasible binding events to the given sequence (and to its reverse strand for complementable alphabets). The binding strength at each potential site is defined as the likelihood ratio of the site under the motif versus under a zero-order background model provided by the user.
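
The averaging described above can be illustrated with a minimal Python sketch. It is not AMA's implementation: the function names, the dictionary-based motif representation, and the choice to pool forward and reverse-strand sites into one average are assumptions made for illustration, and motif pseudocounts are left out.

```python
from typing import Dict, List

Column = Dict[str, float]  # letter -> probability at one motif position

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (upper-case ACGT only)."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def site_odds(site: str, motif: List[Column], background: Column) -> float:
    """Likelihood ratio of one site: P(site | motif) / P(site | background)."""
    ratio = 1.0
    for column, letter in zip(motif, site):
        ratio *= column[letter] / background[letter]
    return ratio


def average_motif_affinity(seq: str, motif: List[Column], background: Column,
                           score_rc: bool = True) -> float:
    """Average the site likelihood ratios over every position of the sequence.

    Assumption: when score_rc is True, sites on the reverse complement are
    pooled with the forward sites before averaging.
    """
    width = len(motif)
    strands = [seq, reverse_complement(seq)] if score_rc else [seq]
    odds = [
        site_odds(strand[i:i + width], motif, background)
        for strand in strands
        for i in range(len(strand) - width + 1)
    ]
    if not odds:
        raise ValueError("sequence is shorter than the motif")
    return sum(odds) / len(odds)
```

Passing score_rc=False in this sketch plays the role of the --norc option: only forward-strand sites contribute to the average.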

By default, AMA reports the average motif affinity score. It can also report p-values, which are estimated analytically using the given zero-order background model or using the GC-content of each sequence. The GC-content options are restricted to alphabets with 4 symbols in 2 complementary pairs, like DNA.

AMA can also compute the sequence-dependent likelihood ratio score used by Clover. The denominator of this score depends on the sequence being scored, and is the likelihood of the site under a Markov model derived from the sequence itself. Unlike Clover, AMA also allows higher-order sequence-derived Markov models (see --sdbg option below).
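
To make the idea of a sequence-derived denominator concrete, here is a hedged Python sketch that estimates an order-n Markov model from a single sequence and evaluates the likelihood of a site under it. The function names, the pseudocount, and the rule that a site must be preceded by at least n letters of context are assumptions for illustration; the source does not say how Clover or AMA handle these details.

```python
from collections import Counter
from itertools import product
from typing import Dict

Model = Dict[str, Dict[str, float]]  # context -> letter -> P(letter | context)


def sequence_markov_model(seq: str, order: int, alphabet: str = "ACGT",
                          pseudocount: float = 1.0) -> Model:
    """Estimate P(letter | preceding `order` letters) from the sequence itself.

    Assumption: a positive pseudocount is added to every count so that
    unseen contexts still yield a valid (uniform) distribution.
    """
    # Count every (order + 1)-mer: `order` context letters plus the next letter.
    counts = Counter(seq[i - order:i + 1] for i in range(order, len(seq)))
    model: Model = {}
    for context_letters in product(alphabet, repeat=order):
        context = "".join(context_letters)
        total = (sum(counts[context + x] for x in alphabet)
                 + pseudocount * len(alphabet))
        model[context] = {
            x: (counts[context + x] + pseudocount) / total for x in alphabet
        }
    return model


def site_likelihood(seq: str, start: int, width: int,
                    model: Model, order: int) -> float:
    """Likelihood of seq[start:start + width] under the sequence-derived model.

    Assumption: the site needs `order` letters of context before it, and the
    sequence contains only letters of the model's alphabet.
    """
    if start < order:
        raise ValueError("site needs `order` letters of preceding context")
    likelihood = 1.0
    for i in range(start, start + width):
        likelihood *= model[seq[i - order:i]][seq[i]]
    return likelihood
```

In a sequence-dependent likelihood ratio, a value such as the one returned by site_likelihood would take the place of the fixed background probability in the denominator.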

If the input file contains more than one motif, the motifs will be processed consecutively.

Full details are given in the supplement to the GOMO paper:

Fabian A. Buske, Mikael Bodén, Denis C. Bauer and Timothy L. Bailey, "Assigning roles to DNA regulatory motifs using comparative genomics", Bioinformatics, 26(7):860-866, 2010.

Inputs

Motif File

A file containing a list of motifs, in MEME Motif format.

Sequence File

A file containing a collection of sequences in FASTA format.

Background File

A file containing a 0-order Markov model in background model format, such as that produced by fasta-get-markov.
Note: This file is required unless --sdbg is specified.

Outputs

AMA writes in CISML format to standard output, unless you specify one of --o or --oc. In that case, the --o-format option (if given) is ignored and two output files are written to the directory you specify. The files are ama.xml in CISML format, and ama.txt in (almost) GFF2 format.

AMA's version of the GFF2 format uses the "sequence strand" field (field 7) to hold the p-value of the sequence:

 <sequence_name> ama sequence 1 <sequence_length> <sequence_score> <sequence_p-value> . <motif_id>
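
A line in this layout could be read with a small parser along the following lines. This is a sketch under stated assumptions: the AmaRecord type and parse_ama_line function are hypothetical names, fields are assumed to be whitespace-separated with no embedded whitespace, and a non-numeric p-value field (for example when no p-value was requested) is mapped to None because the source does not say what is written in that case.

```python
from typing import NamedTuple, Optional


class AmaRecord(NamedTuple):
    sequence_name: str
    sequence_length: int
    score: float
    p_value: Optional[float]  # taken from field 7, normally the strand field
    motif_id: str


def parse_ama_line(line: str) -> AmaRecord:
    """Parse one data line of ama.txt (AMA's GFF2-like output)."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    name, _source, _feature, _start, end, score, p_value, _frame, motif_id = fields
    try:
        parsed_p: Optional[float] = float(p_value)
    except ValueError:
        # Assumption: a non-numeric placeholder means no p-value is available.
        parsed_p = None
    return AmaRecord(name, int(end), float(score), parsed_p, motif_id)
```

Because sequences always start at position 1 in this format, the end field doubles as the sequence length, which is why the sketch keeps only that value.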

Options

General Options

--sdbg n
    Use a sequence-dependent Markov model of order n when computing likelihood ratios. A different sequence-dependent Markov model is computed for each sequence in the input and used to compute the likelihood ratio of all sites in that sequence. This option overrides --pvalues, --gcbins, and --rma.
    Default: The background file is required and is used to compute the likelihood ratio for all sites in all sequences.

--motif id
    Use only the motif identified by id. This option may be repeated.
    Default: All motifs are used.

--motif-pseudo float
    A pseudocount to be added to each count in the motif matrix, after first multiplying by the corresponding background frequency.
    Default: A pseudocount of 0.1 is applied.

--norc
    Do not score the reverse complement strand (when using a complementable alphabet).
    Default: All strands are scored.

--scoring avg-odds|max-odds
    Indicates whether the average or the maximum likelihood ratio (odds) score should be calculated. If max-odds is chosen, no p-value will be printed.
    Default: The average score is calculated.

--rma
    Scale the motif affinity score by the maximum achievable score for each motif. This is termed the Relative Motif Affinity score. This allows for direct comparison between different motifs.
    Default: Affinity scores are not scaled.

--pvalues
    Print the p-value of the average odds score in the output file. The p-value for a score is normally computed (but see --gcbins) assuming the sequences were each generated by the 0-order Markov model specified by the background file frequencies. This option is ignored if max-odds scoring is used.
    Default: No p-value is printed.

--gcbins bins
    Compensate p-values for the complementary pair content (aka GC content) of each sequence independently. This is done by computing the score distributions for a range of complementary pair frequency values. Using 41 bins (recommended) computes distributions at intervals of 2.5% GC content. The computation assumes that, within each of the two complementary pairs (i.e. A & T and G & C for the DNA alphabet), the ratio of the two letters is equal to 1. This assumption will fail if a sequence contains far more of a letter than its complement. This option sets the --pvalues option. This option is ignored if max-odds scoring is used.
    Default: Uncompensated p-values are printed.

--cs
    Enables combining of sequences with the same identifier by taking the average score and the Sidak corrected p-value: 1 − (1 − α)^(1/n). Different sequences with the same identifier are used in GOMO databases if one gene in the reference species has more than one homologous gene in the related species (one-to-many relationship).
    Default: Sequences are processed independently of each other.

--o-format GFF2|cisml
    Set the output file format.
    Default: CISML output format is used.

--o dir
    Create a folder called dir and write output files in it. This option is not compatible with --oc as only one output folder is allowed.
    Default: The program writes to standard output.

--oc dir
    Create a folder called dir but if it already exists allow overwriting the contents. This option is not compatible with --o as only one output folder is allowed.
    Default: The program writes to standard output.

--max-seq-length max
    Set the maximum length allowed for input sequences to max.
    Default: The maximum allowed input sequence length is 250000000.

--last n
    Use only scores of (up to) the last n sequence positions to compute AMA. If the sequence is shorter than this value the entire sequence is scored. If the motif is longer than this value it will not be scored.
    Default: The full sequence is scored.
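
The arithmetic behind the --gcbins recommendation (41 bins giving 2.5% steps) and its symmetry assumption can be sketched in Python as follows. The function names are hypothetical, and two points are inferred rather than stated in the text: that the grid includes both the 0% and 100% endpoints (41 points give 40 intervals of 2.5%), and that a sequence is matched to the bin nearest its own GC fraction.

```python
from typing import Dict, List


def gc_bin_backgrounds(bins: int = 41) -> List[Dict[str, float]]:
    """Zero-order DNA backgrounds on an evenly spaced GC-content grid.

    Each background follows the stated symmetry assumption: A and T share
    the non-GC mass equally, and G and C share the GC mass equally.
    """
    if bins < 2:
        raise ValueError("need at least two bins to span 0% to 100% GC")
    backgrounds = []
    for i in range(bins):
        gc = i / (bins - 1)  # 41 bins -> steps of 1/40 = 2.5%
        backgrounds.append({
            "A": (1.0 - gc) / 2.0,
            "C": gc / 2.0,
            "G": gc / 2.0,
            "T": (1.0 - gc) / 2.0,
        })
    return backgrounds


def gc_fraction(seq: str) -> float:
    """Fraction of G and C letters in an upper-case DNA sequence."""
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def nearest_bin(gc: float, bins: int = 41) -> int:
    """Index of the grid point closest to a GC fraction (assumed matching rule)."""
    return round(gc * (bins - 1))
```

A sequence whose A and T counts differ strongly is poorly described by any background on this grid, which is the failure case the --gcbins description warns about.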