Usage:

mcast [options] <motifs> <sequence database>

Description

In order for MCAST to compute statistical confidence estimates, at least 200 matches must be found. If the database contains too few sequences, or if certain other options are made too stringent, then too few matches may exist for significance statistics to be computed. In this case, the p-value, q-value, and E-value columns are set to "NaN", and all matches are printed. This limitation can be overcome by specifying the --synth option. When this option is set, synthetic sequences will be generated using a background model generated by choosing a random GC frequency within the range of observed GC minimum and maximum. The synthetic sequences will be used to estimate significance statistics.
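The background model behind --synth can be pictured with a short Python sketch. It is not MCAST's implementation: the function name, the even split of GC mass between G and C (and of the remainder between A and T), and the choice of a single GC value for all sequences are assumptions made for illustration.

```python
import random

def synthetic_sequences(observed_gc, n_seqs, length, seed=0):
    """Generate DNA from a 0th-order model with a randomly chosen GC content.

    observed_gc: GC fractions observed in the real sequences (hypothetical input).
    """
    rng = random.Random(seed)
    # Pick a GC frequency between the observed minimum and maximum.
    gc = rng.uniform(min(observed_gc), max(observed_gc))
    # Assumption: G and C share the GC mass equally, A and T share the rest.
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    seqs = ["".join(rng.choices("ACGT", weights=weights, k=length))
            for _ in range(n_seqs)]
    return gc, seqs

gc, seqs = synthetic_sequences([0.38, 0.45, 0.52], n_seqs=3, length=60)
```

Scanning sequences generated this way yields extra match scores, which is how the 200-match requirement can be met for a small database.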

When computing statistical confidence estimates, MCAST must retain the matches in memory until the final distribution of scores can be estimated. This means that scanning genome-sized datasets has the potential to exhaust all available memory. To avoid this problem, MCAST uses reservoir sampling of the match scores, and limits the number of matches that are kept in memory. The default number of matches kept in memory is 100,000, but this value can be adjusted via the --max-stored-scores option. If the maximum number of stored matches is reached, then MCAST will drop the least significant half of the matches. This behavior may result in matches missing from the MCAST output, even though they would have satisfied the user-specified p-value or q-value threshold.
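The following Python sketch illustrates only the "drop the least significant half" step described above; it does not reproduce the reservoir sampling or MCAST's internal data structures. The function name and the use of raw p-values as the stored quantity are assumptions.

```python
def store_score(scores, p_value, max_stored):
    """Append a match p-value; when the store is full, keep the most significant half."""
    scores.append(p_value)
    if len(scores) >= max_stored:
        scores.sort()                  # smaller p-value = more significant
        del scores[len(scores) // 2:]  # drop the least significant half
    return scores

scores = []
for p in [0.5, 0.01, 0.2, 0.03, 0.9, 0.001, 0.4, 0.07]:
    store_score(scores, p, max_stored=4)
# scores is now [0.001, 0.01]
```

In this toy run the matches with p-values 0.03 and 0.07 are discarded. Either would pass an output threshold of 0.1, which is the kind of loss the paragraph above warns about.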

MCAST can make use of position-specific priors (PSPs) to improve its identification of true motif occurrences. To take advantage of PSPs in MCAST you must provide two command line options. The --psp option is used to set the name of a file containing the PSP, and the --prior-dist option is used to set the name of a file containing the binned distribution of the PSP.

The PSP can be provided in MEME PSP file format, or in wiggle format. The MEME PSP file format requires that a PSP be included for every position in the sequence to be scanned. This format is usually only practical for relatively small sequence databases. The wiggle format accommodates sequence segments with missing PSP. When no PSP is available for a given position, MCAST will use the median PSP from the PSP distribution file. The wiggle format will work with large sequence databases, including full genomes.

The PSP and PSP distribution files can be generated from raw scores using the create-priors utility available when you download and install the MEME Suite on your own computer.

A full description of the algorithm may be found in the paper listed in the Citing section below.

Input

Motifs

A file containing DNA motifs in MEME format. Outputs from MEME and DREME are supported, as well as Minimal MEME Format. You can also input DNA motifs in TRANSFAC format if you specify the --transfac option. You can convert many other motif formats to MEME format using conversion scripts available with the MEME Suite. Input motifs that are likely to appear in the sequences.

Sequence Database

A collection of DNA sequences in FASTA format.

Output

MCAST will create a directory named mcast_out (the name of this directory can be overridden via the --o or --oc options). The directory will contain:

The score reported in the GFF3 output is min(1000, -10*(log10(pvalue))).
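As a worked example of the formula above, here is a minimal Python sketch. The function name is hypothetical, and returning the cap for a p-value of zero is an assumption, since the formula itself is undefined there.

```python
import math

def gff3_score(p_value):
    """Compute min(1000, -10 * log10(p_value)) for a match p-value."""
    if p_value <= 0.0:
        return 1000.0  # assumption: treat a zero p-value as the capped maximum
    return min(1000.0, -10.0 * math.log10(p_value))

score_typical = gff3_score(1e-5)    # about 50
score_capped = gff3_score(1e-120)   # would be 1200 uncapped, so reported as 1000
```

The cap is reached at a p-value of 1e-100, so any more significant match receives the same score of 1000.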

Options

Option Parameter Description Default Behaviour
General Options
--alpha <alpha> The fraction of all TF binding sites that are binding sites for the TF of interest. Default: 1.0.
--hardmask Nucleotides in lower case will be converted to the wildcard 'N'. This prevents these positions from being considered in motif matches. This is useful when the input sequence file has been soft-masked for tandem repeats. Without hard masking, MCAST may assign sequence segments containing tandem repeats a highly significant score. Default: nucleotides in lower case are converted to upper case.
--max-gap <max gap> The value of max gap specifies the longest distance allowed between two hits in a match. Hits separated by more than max gap will be placed in different matches. Note: Large values of max gap combined with large values of pthresh may prevent MCAST from computing E-values. Default: the maximum gap is set to 50.
--max-stored-scores <max> Set the maximum number of scores that will be stored. Keeping a complete list of scores may exceed available memory. Once the number of stored scores reaches the maximum allowed, the least significant 50% of scores will be dropped. In this case, the list of reported motifs may be incomplete and the q-value calculation will be approximate. Default: the maximum number of stored matches is 100,000.
--motif-pthresh <pthresh> The value of pthresh sets the scale for calculating p-scores for motif hits. The p-score for a hit with p-value p is
S = -log2(p/pthresh).
Default: the motif scaling p-value is 0.0005.
--output-ethresh <out E-value> The E-value threshold for displaying search results. If the E-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command line will determine the effective output filter. Default: the E-value threshold is 10.0.
--output-pthresh <out p-value> The p-value threshold for displaying search results. If the p-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command line will determine the effective output filter. Default: the E-value is used as the threshold; see the --output-ethresh option.
--output-qthresh <out q-value> The q-value threshold for displaying search results. If the q-value of a match is greater than this value, then the match will not be printed. If any combination of --output-ethresh, --output-pthresh, or --output-qthresh is given, whichever option occurs last on the command line will determine the effective output filter. Default: the E-value is used as the threshold; see the --output-ethresh option.
--parse-genomic-coord When this option is specified, each sequence header will be checked for UCSC-style genomic coordinates. These are of the form:
>sequence name:starting position-ending position
Where
  • sequence name is the name of the sequence,
  • starting position is the index of the first base and
  • ending position is the index of the final base.
The sequence name may not contain any white space. If genomic coordinates are found they will be used as the coordinates in the output. When no coordinates are found the default behaviour is used.
Default: the first position in the sequence is assumed to be 1.
--psp <file> File containing position-specific priors (PSP) in MEME PSP format or wiggle format. This file can be generated using the create-priors utility. Default: a uniform position-specific prior is used.
--prior-dist <file> File containing binned distribution of priors. This file can be generated using the create-priors utility. Default: a uniform position-specific prior is used.
--synth Use synthetic scores for distribution. A 0th-order Markov model of nucleotide frequencies will be created by choosing a GC content at random between the observed minimum and maximum values. This model will be used to generate synthetic sequences, and the synthetic sequences will be used to estimate the distribution of p-values. Default: no synthetic sequences are generated.
--text Limits output to plain text sent to standard out.
--transfac MCAST will assume that the motif file is in TRANSFAC matrix format. Default: MCAST assumes the motif file is in MEME format.
--version Display the version and exit. Default: run as normal.
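
Two of the options above, --motif-pthresh and --parse-genomic-coord, can be illustrated with a short Python sketch. It is not taken from the MCAST source: the function names and the regular expression are assumptions, and the header parser is only one plausible reading of the format described above (a name without white space, then a colon, start, hyphen, end).

```python
import math
import re

def hit_pscore(p_value, pthresh=0.0005):
    """p-score of a motif hit: S = -log2(p / pthresh)."""
    return -math.log2(p_value / pthresh)

# Hypothetical pattern for headers such as ">chr1:1000-2000".
COORD_RE = re.compile(r"^>?(\S+):(\d+)-(\d+)")

def parse_header(header):
    """Return (name, start, end); fall back to start = 1 when no coordinates are found."""
    m = COORD_RE.match(header)
    if m is None:
        parts = header.lstrip(">").split()
        name = parts[0] if parts else ""
        return name, 1, None  # default behaviour: first position is 1
    return m.group(1), int(m.group(2)), int(m.group(3))

s_strong = hit_pscore(0.0005 / 8)  # eight times more significant than pthresh
s_weak = hit_pscore(0.001)         # less significant than pthresh
with_coords = parse_header(">chr1:1000-2000")
without_coords = parse_header(">seq1 some description")
```

A hit whose p-value equals pthresh scores 0, more significant hits score positive, and less significant ones score negative, which is why pthresh acts as a scale for the p-scores.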

HTML output

The HTML output contains

Text output

The plain text output contains a line for each match. Each line contains the following fields:

The lines are sorted by score in descending order.

Citing

If you use MCAST in your research please cite the following paper:
Timothy Bailey and William Stafford Noble, "Searching for statistically significant regulatory modules", Bioinformatics (Proceedings of the European Conference on Computational Biology), 19(Suppl. 2):ii16-ii25, 2003.