Usage:

spamo [options] <sequences> <primary motif> <secondary motifs>+

Description

Inputs

<sequences>

The name of a FASTA formatted file containing sequences (ideally of about 500bp) centered on a genomic location expected to be relevant to the primary motif. This would typically be generated by expanding either side of a ChIP-seq peak to obtain sequences of about 500 bases in length.

SpaMo scans the central section of each sequence, excluding the margin on either edge, for the primary motif. Because the margin on each edge is excluded, any sequence shorter than two times the margin plus the trimmed length of the primary motif will always be discarded.
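The length constraint above is simple arithmetic, and a small sketch may make it concrete. The function names below are hypothetical and are not part of SpaMo; the margin of 150 used in the example is the documented default for the -margin option, and the motif length of 10 is an arbitrary illustration.

```python
def min_usable_length(margin, trimmed_motif_len):
    """Shortest sequence that is not discarded: one margin per edge plus the motif."""
    return 2 * margin + trimmed_motif_len


def primary_scan_region(seq_len, margin):
    """Half-open (start, end) interval of the sequence scanned for the primary motif."""
    return margin, seq_len - margin


def is_usable(seq_len, margin, trimmed_motif_len):
    """True if the sequence is long enough to survive the margin constraint."""
    return seq_len >= min_usable_length(margin, trimmed_motif_len)


# A 500 base sequence with a margin of 150 leaves the central 200 bases to scan.
start, end = primary_scan_region(500, 150)
print(start, end, end - start)
```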

<primary motif>

The name of a file containing at least one MEME formatted motif. Outputs from MEME and DREME are supported, as well as Minimal MEME Format. You can convert many other motif formats to MEME format using conversion scripts available with the MEME Suite. The primary motif is the motif for which you are trying to find cofactors. If the file contains more than one motif then the first will be selected by default or another can be selected using the -primary or -primaryi options.

<secondary motifs>

The names of one or more MEME formatted motif files containing DNA motifs (see Primary Motifs, above). The secondary motifs are tested for a significant spacing with the primary motif which might imply they act together. If the motif databases contain motifs which you don't wish to scan, the motifs can be filtered based on their name by using the -inc and -exc options.

Outputs

SpaMo writes its output to files in a directory named spamo_out, which it creates if necessary. You can change the output directory using the -o or -oc options.

The main output file is an HTML file named spamo.html, which can be viewed with a web browser. A tab-separated values (TSV) output file named spamo.tsv is also generated; it contains a single line for each significant primary-secondary motif spacing. Detailed documentation on the meaning of the columns is provided at the bottom of the TSV file.
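For downstream processing, the TSV file can be read with a few lines of Python. The sketch below is not part of SpaMo. It assumes that the first non-blank line holds the column names and that the documentation lines at the bottom of the file start with '#'; neither detail is stated above, so check them against your own spamo.tsv. The column names in the sample text are placeholders, not the real SpaMo columns.

```python
import csv
import io


def read_spamo_tsv(handle):
    """Return the data rows of a SpaMo-style TSV file as a list of dicts.

    Assumption: blank lines and lines starting with '#' (the documentation
    footer) are not data and can be skipped.
    """
    data_lines = (
        line for line in handle
        if line.strip() and not line.startswith("#")
    )
    return list(csv.DictReader(data_lines, delimiter="\t"))


# Synthetic example with placeholder column names.
sample = "col_a\tcol_b\nx\t1\ny\t2\n\n# documentation footer\n"
rows = read_spamo_tsv(io.StringIO(sample))
print(rows)

# With a real run one would instead do something like:
#   with open("spamo_out/spamo.tsv") as fh:
#       rows = read_spamo_tsv(fh)
```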

Additional outputs may be requested using the -dumpseqs, -dumpsigs, -eps and -png options, as described below.

Options

Option Parameter Description Default Behaviour
Input/Output
-eps   Output histograms in Encapsulated PostScript format which can be included in publications. This option can be used with the -png option. Image files are not output by default as the webpage is capable of generating the graphs on demand.
-png   Output histograms in Portable Network Graphic format which is good for webpages. This option can be used with the -eps option. Image files are not output by default as the webpage is capable of generating the graphs on demand.
-dumpseqs   Write space separated values in columns, describing the motif matches used to make the histograms, to output files named seqs_<primary_motif>_<secondary_db>_<secondary_motif>.txt. The rows are initially in sequence name order but various command-line tools can be used to sort them on other values. The columns contain:
column(s)   contents
1   Trimmed lowercase sequence with uppercase matches
2   Position of the secondary match within the whole sequence
3   Sequence fragment that the primary matched
4   Strand of the primary match (+/-)
5   Sequence fragment that the secondary matched
6   Strand of the secondary match (+/-)
7   Is the primary match on the same strand as the secondary (s/o)
8   Is the secondary match downstream or upstream (d/u)
9   The gap between the primary and secondary matches
10   The name of the sequence
11   The p-value of the bin containing the match, adjusted for the number of bins
If the sequence names are in Genome Browser position format (e.g., "chr5:36715616-36715623"), the following additional columns appear:
12-14   Position of primary match in BED coordinates
15   Position of primary match in Genome Browser coordinates
16-18   Position of secondary match in BED coordinates
19   Position of secondary match in Genome Browser coordinates
By default, no specific match information is output.
-dumpsigs   Same as -dumpseqs, but only secondary matches in significant bins are dumped. The default behaviour is the same as for -dumpseqs.
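The column layout listed under -dumpseqs lends itself to a small parser. The following is a minimal sketch, not part of SpaMo: the class and field names are hypothetical, and it assumes that columns 2 and 9 are integers and that column 11 parses as a floating point number. The example line is made up for illustration.

```python
from typing import NamedTuple


class DumpedMatch(NamedTuple):
    trimmed_sequence: str     # column 1
    secondary_position: int   # column 2 (assumed to be an integer)
    primary_fragment: str     # column 3
    primary_strand: str       # column 4, "+" or "-"
    secondary_fragment: str   # column 5
    secondary_strand: str     # column 6, "+" or "-"
    same_strand: bool         # column 7, "s" means same strand
    downstream: bool          # column 8, "d" means downstream
    gap: int                  # column 9 (assumed to be an integer)
    sequence_name: str        # column 10
    adjusted_pvalue: float    # column 11
    genome_columns: tuple     # columns 12-19 when present, else empty


def parse_dumpseqs_line(line):
    """Split one space separated line into named fields."""
    f = line.split()
    if len(f) not in (11, 19):
        raise ValueError("expected 11 or 19 columns, got %d" % len(f))
    return DumpedMatch(
        f[0], int(f[1]), f[2], f[3], f[4], f[5],
        f[6] == "s", f[7] == "d", int(f[8]), f[9], float(f[10]),
        tuple(f[11:]),
    )


# Made-up line with the 11 basic columns.
example = "acgtTGACTCAacgtGGCCacgt 15 TGACTCA + GGCC - o d 4 seq1 0.003"
print(parse_dumpseqs_line(example))
```

Once parsed, the records can be sorted on any field, for example by gap or by adjusted p-value.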
Scanning
-numgen seed Specify a number as the seed for initializing the pseudo-random number generator used in breaking scoring ties. The seed is included in the output so experiments can be repeated. If you wish to run multiple experiments with different seeds then you can use the special value 'time' (without the quotes) which sets the seed to the system clock. A seed of 1 is used.
-margin size The distance either side of the primary motif site which makes up the region that can contain the secondary motif site. Additionally it is the minimum gap between the primary motif site and the edge of the sequence. These constraints mean that input sequences shorter than the trimmed length of the primary motif plus two times the margin size can not be used by SpaMo. A margin of 150 is used. For an input sequence of length 500 this means the central 200 bases are scanned for the best primary motif match and then the 300 bases surrounding the best primary site are scanned for the best secondary site.
-minscore value The minimum score accepted as a match to either the primary or secondary motif. This value can greatly affect the results of SpaMo. If it is too high, there will be no matches to the primary motif. If too low, sequences with non-significant matches to the primary and/or secondary motif will reduce the effectiveness of the spacing analysis. Note: If value is in the range [-1,0) then the minimum score is set to the absolute value of value times the maximum possible match score. A minimum score of 7 bits is used.
-bin size The size of the bin used to calculate the histogram and p-values. A bin size of 1 is recommended as it gives better output. A bin size of 1 is used.
-range size The distance from the primary motif site for which p-values are calculated to include in significance tests. A small value for range may miss significant peaks, but this is a trade-off: the larger the range, the more bins have to be tested, leading to a larger factor used in the Bonferroni correction for multiple tests. A range of 150 is used.
-shared fraction Redundant sequences are removed that have more than this fraction of identical residues. After the primary motif site has been selected in each sequence, the sequence is trimmed to include only a region of size margin on either side of the primary motif site. This aligned and trimmed sequence (and its reverse complement) is then compared with all the other sequences and the fraction of shared bases is calculated, not including the bases in the match to the primary motif. If the fraction of bases shared between two sequences (or between one sequence and the reverse complement of the other) is larger than this limit, then the second sequence is eliminated. To disable this feature set the shared fraction to 1. The shared fraction is set to 0.5 which means that the trimmed, aligned sequences must share 50% or more of their bases to be declared redundant.
-odds odds ratio To speed up the elimination of redundant sequences their positions are compared in a random order and comparison stops whenever the number of matches is so small that the odds ratio is greater than this value. The odds ratio is the probability of the given number of matches given that the sequences were generated by the background model, divided by the same probability given they have at least fraction matching positions (as specified by the option -shared). The odds ratio is set to 20.
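The two ways of specifying -minscore described above can be summarised in a short sketch. The function name is hypothetical and this is not SpaMo's own code; it only restates the rule that a value in the range [-1,0) is interpreted as a fraction of the maximum possible match score.

```python
def resolve_min_score(value, max_score):
    """Turn a -minscore argument into an absolute score threshold.

    Values in [-1, 0) are treated as a fraction of the maximum possible
    match score; any other value is used as the threshold directly.
    """
    if -1 <= value < 0:
        return abs(value) * max_score
    return value


# For a motif whose best possible score is 14 bits:
print(resolve_min_score(7, 14.0))      # absolute threshold, used as given
print(resolve_min_score(-0.5, 14.0))   # half of the maximum possible score
```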
Summarizing
-cutoff p-value The p-value cutoff for bins to be considered significant. This is the p-value of the Binomial Test on the number of observed secondary spacings or more falling into the given bin, adjusted for the number of bins tested. Note that the p-value is only calculated and tested for bins within the distance of the primary motif as specified by the option -range. A bin p-value smaller than or equal to 0.05 is considered significant.
-evalue E-value The E-value threshold that a secondary motif must meet for its results to be printed. For each secondary motif, the E-value is the minimum p-value of all tested bins multiplied by the number of secondary motifs. The E-value estimates the expected number of random secondary motifs that would have the given E-value or lower. Results for all secondary motifs with E-value smaller than or equal to 10 are printed.
-overlap size To determine if two motifs are redundant the most significant bin in the tested range for each of the motifs is compared. For the motifs to be considered redundant it needs to be possible that the sites that got counted in the bin could have overlapped, and this parameter sets the minimum overlap. For a bin size larger than 1 the overlap of the bins can not be precisely calculated as the actual site positions are not stored and so the maximum possible overlap is used. A minimum overlap of 2 is required.
-joint fraction To determine if two motifs are redundant the most significant bin in the tested range in each of the motifs is compared. The most significant bin in each motif has the list of sequence identifiers which had a primary and secondary at the correct spacing to go into that bin. To compare the motifs for redundancy this set of sequence identifiers is compared and the size of the intersection is counted. This intersection size is divided by the size of the smaller of the two sequence sets to get the joint sequence fraction. A minimum joint sequence fraction of 0.5 is required for two motifs to be considered redundant.
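The joint sequence fraction described for -joint is a set calculation, sketched below. The function names and sequence identifiers are hypothetical, and the sketch covers only the -joint criterion; the bin overlap requirement controlled by -overlap is a separate condition that is not modelled here.

```python
def joint_fraction(ids_a, ids_b):
    """Size of the intersection divided by the size of the smaller set."""
    a, b = set(ids_a), set(ids_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0  # assumption: an empty set shares nothing
    return len(a & b) / smaller


def passes_joint_test(ids_a, ids_b, min_joint=0.5):
    """True if the joint fraction reaches the required minimum."""
    return joint_fraction(ids_a, ids_b) >= min_joint


# Sequences contributing to the most significant bin of two secondary motifs.
motif_a = {"seq1", "seq2", "seq3", "seq4"}
motif_b = {"seq3", "seq4", "seq5"}
print(joint_fraction(motif_a, motif_b), passes_joint_test(motif_a, motif_b))
```

Dividing by the smaller set means that a motif whose sites are a subset of another motif's sites always gets a joint fraction of 1.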
Motif Loading
-pseudo count The pseudocount added to loaded motifs. A pseudocount of 0.1 is added to loaded motifs.
-trim bits Trim the edges of motifs based on the information content. The positions on the edges of the motifs with information content less than bits will not be used in scanning. Positions on the edges of the motifs with information content less than or equal to 0.25 will be trimmed.
-primary name The name of the motif to select as the primary motif. This option is incompatible with -primaryi as only one primary motif can be selected. The first motif in the file is selected.
-primaryi num The index of the motif to select as the primary motif counting from 1. This option is incompatible with -primary as only one primary motif can be selected. The first motif in the file is selected.
-keepprimary  If the same file is specified for the primary and secondary motifs then by default the primary motif is excluded but specifying this option keeps it. The primary motif is excluded from the secondaries if the same file is used for the primary and secondary motifs.
-inc pattern Select the motifs with names matching the pattern. The pattern can contain shell-like wildcards (e.g., '*') though they must be escaped or quoted to prevent the shell from auto-expanding them. This option may be repeated and all the patterns will be used. Unless the -exc option has been specified all the motifs are used.
-exc pattern Exclude the motifs with names matching the pattern. The pattern can contain shell-like wildcards (e.g., '*') though they must be escaped or quoted to prevent the shell from auto-expanding them. This option may be repeated and all the patterns will be used. Unless the -inc option has been specified all the motifs are used.
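The effect of -inc and -exc patterns can be approximated with Python's fnmatch module, which also implements shell-style wildcards. This is only an illustrative sketch: SpaMo's own matching may differ in detail, and the rule used here for combining the two options (keep a name if it matches any -inc pattern, or if no -inc pattern was given, and then drop it if it matches any -exc pattern) is an assumption that the text above does not spell out. The motif names are made up.

```python
from fnmatch import fnmatchcase


def filter_motif_names(names, inc=(), exc=()):
    """Return the names kept after applying include and exclude patterns."""
    kept = []
    for name in names:
        # With no include patterns, every motif is a candidate.
        if inc and not any(fnmatchcase(name, p) for p in inc):
            continue
        # Any matching exclude pattern removes the motif (assumed precedence).
        if any(fnmatchcase(name, p) for p in exc):
            continue
        kept.append(name)
    return kept


names = ["MA0001.1", "MA0002.1", "UP0001"]
print(filter_motif_names(names, inc=["MA*"]))
print(filter_motif_names(names, exc=["*2.1"]))
print(filter_motif_names(names, inc=["MA*"], exc=["*2.1"]))
```

On the command line the same patterns need quoting, for example 'MA*', so that the shell passes them through unexpanded.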
Miscellaneous
-help  Print out a help message.  
-version  Display the version and exit. Without this option SpaMo runs as normal.
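As a worked illustration of how the -range, -cutoff and -evalue options interact, the sketch below applies a Bonferroni adjustment to raw bin p-values and derives a motif E-value from them. It is not SpaMo's implementation. The function names are hypothetical, the number of bins tested is taken as an input because it depends on the range and bin size, capping the adjusted p-value at 1 is an assumption, and the E-value is assumed to be computed from the adjusted p-values.

```python
def adjust_bin_pvalue(raw_p, n_bins_tested):
    """Bonferroni adjustment: multiply by the number of bins tested (capped at 1)."""
    return min(1.0, raw_p * n_bins_tested)


def motif_evalue(adjusted_pvalues, n_secondary_motifs):
    """Smallest adjusted bin p-value times the number of secondary motifs."""
    return min(adjusted_pvalues) * n_secondary_motifs


def significant_bins(adjusted_pvalues, cutoff=0.05):
    """Indices of bins whose adjusted p-value is at or below the cutoff."""
    return [i for i, p in enumerate(adjusted_pvalues) if p <= cutoff]


# Made-up raw p-values for four tested bins and 100 secondary motifs.
raw = [0.2, 0.001, 0.5, 0.04]
adjusted = [adjust_bin_pvalue(p, len(raw)) for p in raw]
print(adjusted)
print(significant_bins(adjusted))
print(motif_evalue(adjusted, 100))
```

The example shows the trade-off mentioned for -range: testing more bins raises every adjusted p-value, so a bin that would pass the cutoff alone may no longer do so.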

Citing

If you use SpaMo in your research please cite the following paper:
Tom Whitington, Martin C. Frith, James Johnson and Timothy L. Bailey,
"Inferring transcription factor complexes from ChIP-seq data",
Nucleic Acids Research, 39(15):e98, 2011.