FIMO (Find Individual Motif Occurrences)

The program uses a dynamic programming algorithm to convert log-odds scores into p-values, assuming a zero-order background model. By default the program reports all motif occurrences with a p-value less than 1e-4. The threshold can be set using the --thresh option.
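
The conversion from scores to p-values can be illustrated with a small sketch. It assumes integer-valued scores and a motif given as a list of per-position score dictionaries; the function name and data layout are hypothetical, and FIMO's own implementation may discretize real-valued scores differently.

```python
from collections import defaultdict


def score_pvalues(pssm, background):
    """Map each attainable total score to P(score >= that value).

    pssm: list of columns, each a dict {letter: integer score}.
    background: dict {letter: probability} (zero-order model).
    """
    # dist maps a partial score to its probability under the background.
    dist = {0: 1.0}
    for column in pssm:
        nxt = defaultdict(float)
        for partial, prob in dist.items():
            for letter, s in column.items():
                nxt[partial + s] += prob * background[letter]
        dist = dict(nxt)

    # Accumulate from the highest score downward to get tail probabilities.
    pvalues = {}
    tail = 0.0
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        pvalues[score] = tail
    return pvalues
```

Because each position is independent under a zero-order background, the score distribution can be built one column at a time, which avoids enumerating every possible sequence of the motif's length.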

The p-values for each motif occurrence are converted to q-values following the method of Benjamini and Hochberg ("q-value" is defined as the minimal false discovery rate at which a given motif occurrence is deemed significant). The --qv-thresh option directs the program to use q-values rather than p-values for the threshold.
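
A minimal sketch of a Benjamini-Hochberg style q-value calculation follows. The function name is hypothetical, and the number of tests is taken to be the number of p-values supplied, which is an assumption; the text above does not say how FIMO counts tests.

```python
def bh_qvalues(pvalues):
    """Return Benjamini-Hochberg q-values in the same order as pvalues."""
    n = len(pvalues)
    # Indices of the p-values in increasing order of p-value.
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, so each q-value is the smallest
    # FDR estimate over all thresholds that would still accept this match.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues
```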

If a motif has the strand feature set to +/- (rather than +), then FIMO will search both strands for occurrences.

The --max-stored-scores option sets the maximum number of motif occurrences that will be retained in memory. It defaults to 100,000. If the number of matches found reaches the maximum, FIMO discards the least significant 50% of the stored matches, and any new match less significant than the retained matches is also discarded.
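
The pruning behaviour described above can be sketched as follows. The class and attribute names are hypothetical, and the sketch only mirrors the description in the text; it is not FIMO's actual data structure.

```python
class BoundedMatchStore:
    """Keep at most max_stored matches, pruning the least significant half."""

    def __init__(self, max_stored=100_000):
        self.max_stored = max_stored
        self.matches = []    # list of (pvalue, match) tuples
        self.cutoff = None   # after a prune, larger p-values are rejected

    def add(self, pvalue, match):
        """Store a match; return False if it was discarded."""
        if self.cutoff is not None and pvalue > self.cutoff:
            return False
        self.matches.append((pvalue, match))
        if len(self.matches) >= self.max_stored:
            self._prune()
        return True

    def _prune(self):
        # Sort by increasing p-value and keep the most significant half.
        self.matches.sort(key=lambda m: m[0])
        self.matches = self.matches[: len(self.matches) // 2]
        self.cutoff = self.matches[-1][0] if self.matches else 0.0
```

Once a prune has happened, the stored list is no longer complete, which is why the reported matches may be incomplete and the q-values approximate.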

FIMO can make use of position specific priors (PSPs) to improve its identification of true motif occurrences. When priors are provided FIMO uses log-posterior odds scores instead of log-odds scores. The log-posterior odds score is described in this paper:

Gabriel Cuellar-Partida, Fabian A. Buske, Robert C. McLeay, Tom Whitington, William Stafford Noble, and Timothy L. Bailey,
"Epigenetic priors for identifying active transcription factor binding sites",
Bioinformatics 28(1): 56-62, 2012

To take advantage of PSPs in FIMO you must provide two command line options. The --psp option sets the name of a file containing the PSP, and the --prior-dist option sets the name of a file containing the binned distribution of the PSP.

The PSP can be provided in MEME PSP file format or in wiggle format. The MEME PSP file format requires that a PSP be included for every position in the sequence to be scanned. This format is usually only practical for relatively small sequence databases. The wiggle format accommodates sequence segments with missing PSP values. When no PSP is available for a given position, FIMO will use the median PSP from the PSP distribution file. The wiggle format will work with large sequence databases, including full genomes.

The PSP and PSP distribution files can be generated from raw scores using the create-priors utility.

Motifs

A file containing MEME formatted motifs. Outputs from MEME and DREME are supported, as well as Minimal MEME Format. You can convert many other motif formats to MEME format using conversion scripts available with the MEME Suite.

Database

A file containing a collection of sequences in FASTA format.

If only one motif is supplied to FIMO then a hyphen ('-') can be used to indicate that the sequence data should be read from standard input.

The FASTA header lines are used as the source of sequence names. The sequence name is the string following the initial '>' up to the first whitespace character. If the sequence name is of the form text:number-number, then the text portion will be used as the sequence name. The numbers will be used as genomic coordinates, and the first number will be used as the coordinate of the first position of the sequence. In all other cases the coordinate of the first position of the sequence is taken as 1.
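
The header rule above can be sketched as a small parser. The function name and the regular expression are illustrative assumptions, not FIMO's own code.

```python
import re

# Matches names of the form text:number-number, e.g. chr1:1000-2000.
_COORD_RE = re.compile(r"^(?P<name>.+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_header(header_line):
    """Return (sequence_name, coordinate_of_first_position)."""
    # The name is the text after '>' up to the first whitespace character.
    fields = header_line[1:].split(None, 1)
    name = fields[0] if fields else ""
    m = _COORD_RE.match(name)
    if m:
        # Genomic coordinates found: use the text portion and the start.
        return m.group("name"), int(m.group("start"))
    # In all other cases the first position is taken as 1.
    return name, 1
```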

FIMO will create a directory, named fimo_out by default. Any existing output files in the directory will be overwritten. The directory will contain:

An HTML file named fimo.html.
A plain text file named fimo.txt.
A plain text file in GFF3 format named fimo.gff.
A CISML file named cisml.xml using the CisML schema.
An XML file named fimo.xml referencing the CISML file.

The default output directory can be overridden using the --o or --oc options which are described below.

The --text option will limit output to plain text sent to the standard output. This will disable the calculation and printing of q-values.

The score reported in the GFF3 output is
min(1000, -10*(log10(pvalue))),
and the group name is
<motif_id>_<sequence_id><strand>.

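The GFF3 score and group name given above can be written as two small helpers. The function names are hypothetical, and returning the cap of 1000 for a p-value of zero is an assumption made so that the logarithm is never taken of zero.

```python
import math


def gff_score(pvalue):
    """Return min(1000, -10 * log10(pvalue))."""
    if pvalue <= 0.0:
        return 1000.0  # assumption: a zero p-value gets the capped score
    return min(1000.0, -10.0 * math.log10(pvalue))


def gff_group_name(motif_id, sequence_id, strand):
    """Return the group name <motif_id>_<sequence_id><strand>."""
    return f"{motif_id}_{sequence_id}{strand}"
```

With this formula, a match at the default p-value threshold of 1e-4 scores about 40, and the cap of 1000 is reached for p-values of 1e-100 or smaller.
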
The HTML and plain text output contain the following columns:

The motif identifier
The (optional) alternate identifier for the motif
The sequence identifier
The strand. '+' indicates that the motif matched the forward strand, '-' the reverse strand, and '.' that strand is not applicable (as for amino acid sequences).
The start position of the motif occurrence (closed, 1-based coordinates, unless genomic coordinates are provided)
The end position of the motif occurrence (closed, 1-based coordinates, unless genomic coordinates are provided).
The score for the motif occurrence. The score is computed by summing the appropriate entries from each column of the position-dependent scoring matrix that represents the motif.
The p-value of the motif occurrence. The p-value is the probability of a random sequence of the same length as the motif matching that position of the sequence with a score at least as good.
The q-value of the motif occurrence. The q-value is the estimated false discovery rate if the occurrence is accepted as significant. See Storey JD, Tibshirani R. Statistical significance for genome-wide studies, Proc. Natl. Acad. Sci. USA (2003) 100:9440–9445. Note: This column is omitted if you use the --text switch.
The sequence matched to the motif.
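
The score and the start and end columns described above can be illustrated with a short scanning sketch. It assumes the scoring matrix is a list of per-position dictionaries and scans a single strand only; the function names are hypothetical.

```python
def score_window(pssm, window):
    """Sum the matrix entry for each letter of the window."""
    if len(window) != len(pssm):
        raise ValueError("window length must equal motif width")
    return sum(column[letter] for column, letter in zip(pssm, window))


def scan(pssm, sequence):
    """Yield (start, end, score) in closed, 1-based coordinates."""
    width = len(pssm)
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        yield offset + 1, offset + width, score_window(pssm, window)
```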

The HTML and plain text output is sorted by increasing p-value.

Option	Parameter	Description	Default Behaviour
General Options
--max-strand		If matches on both strands at a given position satisfy the output threshold, only report the match for the strand with the higher score. If the scores are tied, the matching strand is chosen at random.	Both matches are reported.
--max-stored-scores	max	Set the maximum number of scores that will be stored. Keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped. In this case, the list of reported motifs may be incomplete and the q-value calculation will be approximate.	The maximum number of stored matches is 100,000.
--motif	id	Use only the motif identified by id. This option may be repeated.	Use all motifs.
--motif-pseudo	count	A pseudocount to be added to each count in the motif matrix, after first multiplying by the corresponding background frequency.	A pseudocount of 0.1 is used.
--no-qvalue		Do not compute a q-value for each p-value. The q-value calculation is that of Benjamini and Hochberg (1995).	The q-values are calculated.
--norc		Do not score the reverse complement DNA strand.	Both strands are scored.
--parse-genomic-coord		When this option is specified each sequence header will be checked for UCSC style genomic coordinates. These are of the form ">sequence name:starting position-ending position", where sequence name is the name of the sequence, starting position is the index of the first base and ending position is the index of the final base. The sequence name may not contain any whitespace. If genomic coordinates are found they will be used as the coordinates in the output. When no coordinates are found the default behaviour is used.	The first position in the sequence will be assumed to be 1.
--psp	file	File containing position specific priors (PSP) in MEME PSP format or wiggle format. This file can be generated using the create-priors utility.
--alpha	num	The alpha parameter for calculating position specific priors, used in conjunction with the --psp option. Alpha represents the fraction of all transcription factor binding sites that are binding sites for the TF of interest. Alpha must be between 0 and 1.	An alpha value of 1 is used.
--prior-dist	file	File containing binned distribution of priors. This file can be generated using the create-priors utility.
--qv-thresh		Directs the program to use q-values for the output threshold.	The program thresholds on p-values.
--skip-matched-sequence		Like the --text option, this limits output to plain text sent to standard out, but in addition, turns off output of the sequence of motif matches. This speeds up processing considerably.	The program thresholds on p-values.
--text		Limits output to plain text sent to standard out. For FIMO, the text output is unsorted, and q-values are not reported. This mode allows the program to search an arbitrarily large database, because results are not stored in memory.
--thresh	num	The output threshold for displaying search results. Only search results with a p-value less than the threshold will be output. The threshold can be set to use q-values rather than p-values via the --qv-thresh option.	The threshold is a p-value of 1e-4.
--version		Display the version and exit.	Run as normal.
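
As a sketch of how the options in the table combine, the helper below assembles an argument list for a FIMO run. The helper name is hypothetical. The executable name fimo and the positional order (motif file, then sequence database) are assumptions, since the usage line is not reproduced in this text; only option names listed above are used.

```python
def build_fimo_command(motif_file, database, thresh=None, use_qvalues=False,
                       norc=False, text=False, max_stored_scores=None):
    """Return an argument list suitable for subprocess.run()."""
    cmd = ["fimo"]  # assumption: the executable is on PATH under this name
    if thresh is not None:
        cmd += ["--thresh", str(thresh)]
    if use_qvalues:
        cmd.append("--qv-thresh")  # threshold on q-values, not p-values
    if norc:
        cmd.append("--norc")       # do not score the reverse complement
    if text:
        cmd.append("--text")       # plain text to standard output
    if max_stored_scores is not None:
        cmd += ["--max-stored-scores", str(max_stored_scores)]
    # Assumed positional order: motif file first, then the database.
    cmd += [motif_file, database]
    return cmd
```

Building a list rather than a single string keeps file names with spaces intact when the list is passed to subprocess.run().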