beadstring [options] <motifs> <database>
Beadstring builds a linear hidden Markov model (HMM) from the motifs and motif occurrences listed in the motif file, and uses that HMM to search a sequence database for a particular ordered series of motifs. A description of the algorithm is found in:
Grundy, Bailey, Elkan and Baker. "Meta-MEME: Motif-based Hidden Markov Models of Protein Families". Computer Applications in the Biosciences. 13(4):397-406, 1997.
By default, the order and spacing of motifs in the model is determined from the "Summary of Motifs" section of the MEME input file. Beadstring searches the summary for the sequence that contains the maximal number of distinct motif occurrences. If there is a tie, then beadstring selects the sequence with the smallest combined p-value. Beadstring then eliminates all but the most significant occurrence of each motif and uses the resulting order and spacing of motif occurrences to initialize the HMM. This procedure can be overridden by selecting the --motif, --motif-e-thresh, --motif-p-thresh or --order options.
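The default selection procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Occurrence` record, the `choose_order` function and the layout of the `summary` dictionary are hypothetical names invented for the example. They are not part of beadstring, and the sketch does not parse a real MEME file.

```python
from collections import namedtuple

# Hypothetical record for one motif occurrence; not a beadstring data structure.
Occurrence = namedtuple("Occurrence", ["motif", "start", "pvalue"])


def choose_order(summary):
    """Pick the sequence and motif occurrences used to initialize the model.

    summary maps a sequence name to (combined_pvalue, [Occurrence, ...]).
    This layout is an assumption made for the example.
    """
    def rank(item):
        _, (combined_p, occurrences) = item
        distinct = len({occ.motif for occ in occurrences})
        # Most distinct motifs first; ties broken by smallest combined p-value.
        return (-distinct, combined_p)

    name, (_, occurrences) = min(summary.items(), key=rank)

    # Keep only the most significant (lowest p-value) occurrence of each motif.
    best = {}
    for occ in occurrences:
        if occ.motif not in best or occ.pvalue < best[occ.motif].pvalue:
            best[occ.motif] = occ

    # Order the surviving occurrences by position along the sequence.
    return name, sorted(best.values(), key=lambda occ: occ.start)


# Made-up summary: seqA and seqB tie on two distinct motifs each.
summary = {
    "seqA": (1e-12, [Occurrence("1", 5, 1e-6),
                     Occurrence("2", 40, 1e-5),
                     Occurrence("1", 80, 1e-8)]),
    "seqB": (1e-10, [Occurrence("2", 10, 1e-7),
                     Occurrence("1", 30, 1e-6)]),
    "seqC": (1e-20, [Occurrence("1", 3, 1e-9)]),
}
name, ordered = choose_order(summary)
```

In this made-up input, seqA wins the tie with seqB on combined p-value, and only the stronger of its two occurrences of motif 1 is kept before the occurrences are ordered by position.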
The command line option --p-score activates an alternative scoring mode, called "p-value scoring." This scoring method is described in:
Bailey and Noble. "Searching for statistically significant regulatory modules." Bioinformatics 19(Suppl 2):ii16-ii25, 2003.
A file of motifs from MEME (DREME does not have the "Summary of Motifs" section).
A database of sequences in FASTA format.
Beadstring will create a directory, named beadstring_out by default. Any existing output files in the directory will be overwritten. The directory will contain:
beadstring.xml, using the CisML schema.
model.xml, using the MEME_HMM schema.
beadstring.html
beadstring.text
The default output directory can be overridden using the or options which are described below.
Option | Parameter | Description | Default Behaviour |
---|---|---|---|
Input/Output | |||
--bgfile | bfile | Read background frequencies from bfile.
The file should be in
MEME background file format.
The default is to use frequencies embedded in the application from
the non-redundant database. If the argument is the keyword
motif-file , then the frequencies will be taken from
the motif file. |
Use NR frequencies. |
--e-thresh | ev | Only print results with E-values less than ev. | Print results with E-values less than 0.01. |
--max-seqs | max | Print results for no more than max sequences. | All matches are reported, up to the specified E-value threshold (see --e-thresh). |
--model-file | model file | Creation of the HMM will be skipped, and the HMM will be read from the file instead. | The HMM will be created. |
--no-search | This option turns off the search phase of beadstring. The HMM will be stored if the --model option is specified. | The search phase runs as normal. | |
--progress | value | Print to standard error a progress message approximately every value seconds. | No progress message. |
--score-file | score file | Cause a score file (in BLAST format) to be read and used
instead of the built-in PAM (for proteins) or
transition/transversion (for DNA) score file. Several score files
are provided (including BLOSUM62) in the directory
doc . Other, user-provided score files may be
specified as well, as long as they are in the proper format. |
Uses the built-in score file. |
Motif Selection | |||
--motif | id | Use only the motif identified by id. This option may be repeated. | Use all motifs that pass the other motif selection options. |
--motif-e-thresh | ev | Use only motifs with E-values less than ev. | Use all motifs that pass the other motif selection options. |
--motif-p-thresh | pv | Use only motifs with p-values less than pv. | Use all motifs that pass the other motif selection options. |
--order | string | The given string specifies the order and spacing of the motifs within the model, and has the format "l=n=l=n=...=l=n=l", where "l" is the length of a region between motifs, and "n" is a motif index. Thus, for example, the string "34=3=17=2=5" specifies a two-motif linear model, with motifs 3 and 2 separated by 17 letters and flanked by 34 letters and 5 letters on the left and right. If the motif file contains motif occurrences on both strands, then the motif IDs in the order string should be preceded by "+" or "-" indicating the strandedness of the motif. | The order and spacing is determined from the motif file. |
Building The Model | |||
--fim | Gaps between motifs are not penalized. Spacer states between motifs are represented as free-insertion modules (FIM). A FIM is an insert state with 1.0 probability of self-transition and 1.0 probability of exit transition. Thus, traversing such a state has zero transition cost. Specifying this option causes all spacers to be represented using FIMs. | ||
--gap-extend | cost | This switch causes all spacer self-loop log-odds scores to be set to cost. In addition, it causes all other transitions out of a spacer to be set to zero. Together with the --gap-open switch, this allows you to specify an affine gap penalty function, overriding the gap penalty implicit in the model (self-loop transition probabilities of gap states). | |
--gap-open | cost | This switch causes all transitions into a spacer state to be assigned a log-odds score equal to cost. Together with the --gap-extend switch, this allows you to specify an affine gap penalty function, overriding the gap penalty implicit in the model (transition probabilities into and out of gap states). | |
--motif-pseudo | num | A pseudocount to be added to each count in the motif matrix, after first multiplying by the corresponding background frequency. | No pseudocount is added to motif matrix probabilities. |
--nspacer | value | By default each spacer is modeled using a single insert state. The distribution of spacer lengths produced by a single insert state is exponential in form. A more reasonable distribution would be a bell-shaped curve such as a Gaussian. Modeling the length distribution explicitly is computationally expensive; however, a Gaussian distribution can be approximated using multiple insert states to represent a single spacer region. The --nspacer option specifies the number of insert states used to represent each spacer. | A single insert state is used. |
--spacer-pseudo | value | Specify the value of the pseudocount used in converting transition counts to spacer self-loop probabilities. | No pseudocount is added to self-loop probabilities. |
--trans-pseudo | value | Specify the value of the pseudocount used in converting transition counts to transition probabilities. | A pseudocount of 0.1 is added to transition probabilities. |
--zselo | Causes spacer emission log-odds scores to be set to zero. This prevents regions of unusual base/residue composition from matching spacers well when the spacer emission frequencies differ from the background frequencies. It is particularly useful with DNA models. | ||
Scoring | |||
--allow-weak-motifs | In p-value score mode, weak motifs are defined as ones where the best possible hit has a p-value greater than the p-value threshold. Such motifs cannot contribute to a match in p-value score mode. By default, the program rejects any search results containing weak motifs, unless the --allow-weak-motifs switch is given. In that case, the search will proceed, but the weak motifs will never appear in any matches. Note: This switch only applies to p-value score mode. | ||
--global | Scores are computed for the match between the entire sequence and the model. | Use the maximal local score. | |
--pam | distance | By default, target probabilities are derived from the
distance-250 PAM matrix for proteins, and from a
distance-1 transition/transversion
matrix for DNA. With the --pam switch,
you can specify a different integer distance from 1 to 500.
(This can be overridden with the
--score-file switch.) The
distance-1 transition/transversion joint
probability matrix for DNA is given below:
     A    C    G    T
A .990 .002 .006 .002
C .002 .990 .002 .006
G .006 .002 .990 .002
T .002 .006 .002 .990 |
|
--paths | single|all | This option determines how the program computes raw scores. With the single option, the program computes the Viterbi score, which is the log-odds score associated with the single most likely match between the sequence and the model. The all option yields the total log-odds score, which is the sum of the log-odds of all sequence-to-model matches. | Viterbi scoring is used. |
--p-score | num | The --p-score switch activates p-value
score mode with the given threshold. In p-value score mode, motif
match scores are converted to their p-values. They are then
converted to bit scores as follows:
S = -log2(p/T)
where S is the bit score of the hit, p is the p-value of the
log-odds score, and T is the p-value threshold. In this way, only
hits more significant than the p-value threshold get positive
scores. The p-value threshold, T, must be in the range 0<T≤1. |
Log-odds score mode is used. |
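The p-value score conversion given for --p-score, S = -log2(p/T), can be illustrated with a short Python sketch. The function name `pvalue_bit_score` and its argument checks are assumptions made for this example; they are not part of beadstring.

```python
import math


def pvalue_bit_score(p, threshold):
    """Convert a motif-hit p-value to a bit score, S = -log2(p / T).

    Hypothetical helper for illustration only.
    """
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold T must satisfy 0 < T <= 1")
    if not 0.0 < p <= 1.0:
        raise ValueError("p-value must satisfy 0 < p <= 1")
    return -math.log2(p / threshold)


# A hit more significant than the threshold scores positive,
# a hit exactly at the threshold scores zero,
# and a less significant hit scores negative.
strong = pvalue_bit_score(0.25, 1.0)
at_threshold = pvalue_bit_score(0.001, 0.001)
weak = pvalue_bit_score(0.5, 0.25)
```

With T = 1, every p-value below 1 yields a positive score. Lowering T shifts all scores down by the same amount, so only hits with p < T remain positive.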