MCAST

In order for MCAST to compute statistical confidence estimates, at least 200 matches must be found. If the database contains too few sequences, or if certain other options are made too stringent, then too few matches may exist for significance statistics to be computed. In this case, the p-value, q-value, and E-value columns are set to "NaN", and all matches are printed. This limitation can be overcome by specifying the --synth option. When this option is set, MCAST generates synthetic sequences from a background model whose GC frequency is chosen at random between the observed minimum and maximum GC content. The synthetic sequences are then used to estimate significance statistics.
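The idea behind --synth can be illustrated with a short Python sketch. This is not MCAST's code: the function name synthetic_sequence is hypothetical, and drawing the GC content uniformly and splitting it evenly between G and C (and the remainder evenly between A and T) are assumptions made for illustration.

```python
import random

def synthetic_sequence(length, gc_min, gc_max, rng=None):
    """Draw one sequence from a 0th-order model with a randomly chosen GC content.

    Hypothetical helper: the uniform draw and the even G/C and A/T split
    are assumptions, not documented MCAST behaviour.
    """
    rng = rng or random.Random()
    gc = rng.uniform(gc_min, gc_max)  # GC fraction within the observed range
    # Weights are listed in the order A, C, G, T.
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]
    return "".join(rng.choices("ACGT", weights=weights, k=length))

# Example: one 1000-base sequence with GC content between 35% and 55%.
seq = synthetic_sequence(1000, gc_min=0.35, gc_max=0.55, rng=random.Random(0))
```

Scanning many such sequences yields match scores that arise by chance alone, which is the kind of null distribution needed when the real database provides fewer than 200 matches.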

When computing statistical confidence estimates, MCAST must retain the matches in memory until the final distribution of scores can be estimated. This means that scanning genome-sized datasets has the potential to exhaust all available memory. To avoid this problem, MCAST uses reservoir sampling of the match scores and limits the number of matches that are kept in memory. The default number of matches kept in memory is 100,000, but this value can be adjusted via the --max-stored-scores option. If the maximum number of stored matches is reached, MCAST drops the least significant half of the matches. As a result, some matches may be missing from the MCAST output even though they would have satisfied the user-specified p-value or q-value threshold.
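The drop-half behaviour described above can be sketched in Python as follows. This is an illustration only: the class BoundedScoreStore is hypothetical, it models only the halving step (not the reservoir sampling), and it assumes that a higher score means a more significant match.

```python
class BoundedScoreStore:
    """Hypothetical container that caps the number of match scores held in memory."""

    def __init__(self, max_stored=100_000):
        self.max_stored = max_stored  # mirrors the --max-stored-scores default
        self.scores = []
        self.dropped = False          # True once any scores have been discarded

    def add(self, score):
        self.scores.append(score)
        if len(self.scores) >= self.max_stored:
            # Assumption: higher score = more significant, so keep the top half.
            self.scores.sort(reverse=True)
            del self.scores[len(self.scores) // 2:]
            self.dropped = True

# Example with a tiny limit so the halving is easy to see.
store = BoundedScoreStore(max_stored=8)
for s in range(10):
    store.add(s)
```

Once dropped is set, low-scoring matches seen before the halving are gone, which is why the reported list can be incomplete and the q-values approximate.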

MCAST can make use of position-specific priors (PSPs) to improve its identification of true motif occurrences. To take advantage of PSPs in MCAST you must provide two command line options. The --psp option is used to set the name of a file containing the PSP, and the --prior-dist option is used to set the name of a file containing the binned distribution of the PSP.

The PSP can be provided in MEME PSP file format, or in wiggle format. The MEME PSP file format requires that a PSP be included for every position in the sequence to be scanned. This format is usually only practical for relatively small sequence databases. The wiggle format accommodates sequence segments with missing PSP. When no PSP is available for a given position, MCAST will use the median PSP from the PSP distribution file. The wiggle format will work with large sequence databases, including full genomes.

The PSP and PSP distribution files can be generated from raw scores using the create-priors utility available when you download and install the MEME Suite on your own computer.

A full description of the algorithm may be found in:

Bailey and Noble. "Searching for statistically significant regulatory modules." Bioinformatics (Proceedings of the European Conference on Computational Biology). 19(Suppl. 2):ii16-ii25, 2003.

Motifs

A file containing DNA motifs in MEME format. Outputs from MEME and DREME are supported, as well as Minimal MEME Format. You can also input DNA motifs in TRANSFAC format if you specify the --transfac option. You can convert many other motif formats to MEME format using conversion scripts available with the MEME Suite. Input motifs that are likely to appear in the sequences.

Sequence Database

A collection of DNA sequences in FASTA format.

MCAST will create a directory named mcast_out (the name of this directory can be overridden via the --o or --oc options). The directory will contain:

A file named mcast.html reporting the matches in HTML format (see the description of the HTML output below)
A file named cisml.xml reporting the matches in XML format using the CisML schema
A file named mcast.xml describing the inputs to MCAST in XML format and referencing cisml.xml
A file named mcast.txt reporting the matches in tab-delimited format (see the description of the plain text output below)
A file named mcast.gff reporting the matches in GFF3 format

The score reported in the GFF3 output is min(1000, -10*(log10(pvalue))).
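As a worked illustration of this formula, the following Python sketch computes the GFF3 score from a match p-value. The function name gff_score is hypothetical, and returning the cap of 1000 for a p-value of zero (where the logarithm is undefined) is an assumption rather than documented behaviour.

```python
import math

def gff_score(pvalue):
    """Return min(1000, -10 * log10(pvalue)) for a match p-value."""
    if pvalue <= 0.0:
        # log10 is undefined at 0; treating this as the cap is an assumption.
        return 1000.0
    return min(1000.0, -10.0 * math.log10(pvalue))
```

Under this formula a p-value of 1e-5 gives a score of about 50, and any p-value of 1e-100 or smaller is capped at 1000.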

Option	Parameter	Description	Default Behaviour
General Options
--alpha	alpha	The fraction of all TF binding sites that are binding sites for the TF of interest.	1.0
--hardmask		Nucleotides in lower case will be converted to the wildcard 'N'. This prevents these positions from being considered in motif matches. This is useful when the input sequence file has been soft-masked for tandem repeats. Without hard masking, MCAST may assign sequence segments containing tandem repeats a highly significant score.	Nucleotides in lower case are converted to upper case.
--max-gap	max gap	The value of max gap specifies the longest distance allowed between two hits in a match. Hits separated by more than max gap will be placed in different matches. Note: Large values of max gap combined with large values of pthresh may prevent MCAST from computing E-values.	The maximum gap is set to 50.
--max-stored-scores	max	Set the maximum number of scores that will be stored. Keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped. In this case, the list of reported motifs may be incomplete and the q-value calculation will be approximate.	The maximum number of stored matches is 100,000.
--motif-pthresh	pthresh	Sets the scale for calculating p-scores for motif hits. The p-score for a hit with p-value p is S = -log₂(p/pthresh).	The motif scaling p-value defaults to 0.0005.
--output-ethresh	out E-value	The E-value threshold for displaying search results. If the E-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter.	The E-value threshold is 10.0.
--output-pthresh	out p-value	The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter.	The E-value is used as the threshold. See --output-ethresh option.
--output-qthresh	out q-value	The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command-line will determine the effective output filter.	The E-value is used as the threshold. See --output-ethresh option.
--parse-genomic-coord		When this option is specified, each sequence header will be checked for UCSC-style genomic coordinates. These are of the form: >sequence name:starting position-ending position where sequence name is the name of the sequence, starting position is the index of the first base, and ending position is the index of the final base. The sequence name may not contain any white space. If genomic coordinates are found, they will be used as the coordinates in the output. When no coordinates are found, the default behaviour is used.	The first position in the sequence will be assumed to be 1.
--psp	file	File containing position-specific priors (PSP) in MEME PSP format or wiggle format. This file can be generated using the create-priors utility.	A uniform position-specific prior is used.
--prior-dist	file	File containing binned distribution of priors. This file can be generated using the create-priors utility.	A uniform position-specific prior is used.
--synth		Use synthetic scores for distribution. A 0th-order Markov model of nucleotide frequencies will be created by choosing a GC content at random between the observed minimum and maximum values. This model will be used to generate synthetic sequences, and the synthetic sequences will be used to estimate the distribution of p-values.	No synthetic sequences will be generated.
--text		Limits output to plain text sent to standard output.
--transfac		MCAST will assume that the motif file is in TRANSFAC matrix format.	MCAST assumes the motif file is in MEME format.
--version		Display the version and exit.	Run as normal.
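The p-score formula given for --motif-pthresh in the table above can be checked with a few lines of Python. This is a minimal sketch: the function name p_score is hypothetical, and only the formula and the default pthresh of 0.0005 come from the table.

```python
import math

def p_score(p, pthresh=0.0005):
    """Return S = -log2(p / pthresh) for a motif hit with p-value p."""
    return -math.log2(p / pthresh)
```

A hit whose p-value equals pthresh scores 0, each halving of the p-value adds 1 to the score, and hits with p-values above pthresh receive negative scores.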

The HTML output contains:

A list of the motifs, and the best possible "hit" for each
A list of matches. Each match record contains
- The name of the sequence
- The start position of the motif occurrence (closed, 1-based coordinates, unless genomic coordinates are provided)
- The end position of the motif occurrence (closed, 1-based coordinates, unless genomic coordinates are provided)
- The match score
- The match p-value
- The match E-value
- The match q-value
- A block diagram showing the relative positions of each motif "hit" in the match
- A detailed view of each match is available as a pop-up window. The detailed view shows the full sequence for the match, the alignment to the motifs, and the p-values for the motif hits
The inputs to MCAST
Text describing the MCAST results

The plain text output contains a line for each match. Each line contains the following fields:

An id string for the match
The name of the sequence containing the match
The start position of the motif occurrence (closed, 1-based coordinates, unless genomic coordinates are provided)
The end position of the motif occurrence (closed, 1-based coordinates, unless genomic coordinates are provided)
The match score
The match p-value
The match E-value
The match q-value
The sequence of the matched region

The lines are sorted by score in descending order.
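A small Python sketch for reading this output is shown below. It assumes the nine fields appear as tab-separated columns in the order listed above, and it skips any comment or header line; the names Match and parse_matches are hypothetical, and the sample lines are invented for illustration rather than taken from real MCAST output.

```python
import csv
import io
from typing import NamedTuple

class Match(NamedTuple):
    match_id: str
    seq_name: str
    start: int
    stop: int
    score: float
    pvalue: float
    evalue: float
    qvalue: float
    sequence: str

def parse_matches(handle):
    """Parse tab-delimited match lines, assuming the column order listed above."""
    matches = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 9 or row[0].startswith("#"):
            continue  # blank, short, or comment line
        try:
            matches.append(Match(row[0], row[1], int(row[2]), int(row[3]),
                                 float(row[4]), float(row[5]), float(row[6]),
                                 float(row[7]), row[8]))
        except ValueError:
            continue  # header row or other non-numeric line
    return matches

# Invented sample lines; "NaN" appears when statistics could not be computed.
sample = ("m1\tchr1\t10\t40\t12.5\t1e-05\t0.02\t0.01\tACGT\n"
          "m2\tchr2\t5\t30\t8.0\tNaN\tNaN\tNaN\tTTGA\n")
matches = parse_matches(io.StringIO(sample))
```

Python's float() accepts the string "NaN", so rows written when fewer than 200 matches were found still parse; callers should check for NaN before filtering on p-value, E-value, or q-value.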

The MEME Suite

Motif-based sequence analysis tools

Motif Cluster Alignment and Search Tool
