Beadstring (unsupported)

Usage:

beadstring [options] <motifs> <database>

Description

Beadstring builds a linear hidden Markov model (HMM) from the motifs and motif occurrences listed in the motif file, and uses that HMM to search a sequence database for a particular ordered series of motifs. A description of the algorithm is found in:

Grundy, Bailey, Elkan and Baker. "Meta-MEME: Motif-based Hidden Markov Models of Protein Families". Computer Applications in the Biosciences. 13(4):397-406, 1997.

By default, the order and spacing of motifs in the model is determined from the "Summary of Motifs" section of the MEME input file. Beadstring searches the summary for the sequence that contains the maximal number of distinct motif occurrences. If there is a tie, then beadstring selects the sequence with the smallest combined p-value. Beadstring then eliminates all but the most significant occurrence of each motif and uses the resulting order and spacing of motif occurrences to initialize the HMM. This procedure can be overridden by selecting the --motif, --motif-e-thresh, --motif-p-thresh or --order options.
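The default selection procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Occurrence` record, the `summary` dictionary layout, and the `choose_seed` function are hypothetical names invented here, not part of beadstring or of the MEME file format.

```python
from collections import namedtuple

# Hypothetical record for one motif occurrence in the "Summary of Motifs"
# section; the field names are illustrative, not MEME's file format.
Occurrence = namedtuple("Occurrence", "motif start pvalue")


def choose_seed(summary):
    """Pick the sequence used to initialize the model.

    summary maps sequence name -> (combined_pvalue, [Occurrence, ...]).
    Returns the chosen name and its retained occurrences in sequence order.
    """
    def rank(item):
        _, (combined_p, occs) = item
        distinct = len({o.motif for o in occs})
        # More distinct motifs first; ties broken by smaller combined p-value.
        return (-distinct, combined_p)

    name, (_, occs) = min(summary.items(), key=rank)

    # Keep only the most significant (lowest p-value) occurrence of each motif.
    best = {}
    for o in occs:
        if o.motif not in best or o.pvalue < best[o.motif].pvalue:
            best[o.motif] = o

    # The order and spacing of the survivors follow their start positions.
    return name, sorted(best.values(), key=lambda o: o.start)


summary = {
    "seqA": (1e-20, [Occurrence(1, 10, 1e-8), Occurrence(2, 40, 1e-6)]),
    "seqB": (1e-25, [Occurrence(2, 5, 1e-7), Occurrence(1, 30, 1e-9),
                     Occurrence(1, 80, 1e-4)]),
    "seqC": (1e-30, [Occurrence(3, 12, 1e-10)]),
}
name, occs = choose_seed(summary)
print(name, [(o.motif, o.start) for o in occs])
```

In this toy input, seqA and seqB both contain two distinct motifs, so the tie is broken in favour of seqB, which has the smaller combined p-value; the weaker second occurrence of motif 1 in seqB is then discarded.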

The command line option --p-score activates an alternative scoring mode, called "p-value scoring." This scoring method is described in:

Bailey and Noble. "Searching for statistically significant regulatory modules." Bioinformatics 19(Suppl 2):ii16-ii25, 2003.

Input

Motifs

A file of motifs from MEME (DREME does not have the "Summary of Motifs" section).

Database

A database of sequences in FASTA format.

Output

Beadstring will create a directory, named beadstring_out by default. Any existing output files in the directory will be overwritten. The directory will contain:

The default output directory can be overridden using the or options which are described below.

Options

Option Parameter Description Default Behaviour
Input/Output
--bgfile bfile Read background frequencies from bfile. The file should be in MEME background file format. The default is to use frequencies embedded in the application from the non-redundant database. If the argument is the keyword motif-file, then the frequencies will be taken from the motif file. Use NR frequencies.
--e-thresh ev Only print results with E-values less than ev. Print results with E-values less than 0.01.
--max-seqs max Print results for no more than max sequences. All matches are reported, up to the specified E-value threshold (see --e-thresh).
--model-file model file Creation of the HMM will be skipped, and the HMM will be read from the file instead. The HMM will be created.
--no-search  This option turns off the search phase of beadstring. The HMM will be stored if the --model option is specified. The search phase runs as normal.
--progress value Print to standard error a progress message approximately every value seconds. No progress message.
--score-file score file Read a score file (in BLAST format) and use it instead of the built-in PAM (for proteins) or transition/transversion (for DNA) score file. Several score files are provided (including BLOSUM62) in the directory doc. Other, user-provided score files may be specified as well, as long as they are in the proper format. Uses the built-in score file.
Motif Selection
--motif id Use only the motif identified by id. This option may be repeated. Use all motifs that pass the other motif selection options.
--motif-e-thresh ev Use only motifs with E-values less than ev. Use all motifs that pass the other motif selection options.
--motif-p-thresh pv Use only motifs with p-values less than pv. Use all motifs that pass the other motif selection options.
--order string The given string specifies the order and spacing of the motifs within the model, and has the format "l=n=l=n=...=l=n=l", where "l" is the length of a region between motifs, and "n" is a motif index. Thus, for example, the string "34=3=17=2=5" specifies a two-motif linear model, with motifs 3 and 2 separated by 17 letters and flanked by 34 letters on the left and 5 letters on the right. If the motif file contains motif occurrences on both strands, then the motif IDs in the order string should be preceded by "+" or "-" indicating the strandedness of the motif. The order and spacing are determined from the motif file.
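To make the order-string format concrete, here is a minimal Python sketch that splits such a string into spacer lengths and motif identifiers. The function name `parse_order` and its return layout are assumptions made for illustration; this is not beadstring's own parser, and it only checks the alternating structure described above.

```python
def parse_order(order):
    """Split an order string into spacer lengths and (strand, motif) pairs.

    Returns (spacers, motifs), where spacers is a list of ints and motifs is
    a list of (strand, motif_id) tuples; strand is "+", "-" or None.
    """
    fields = order.split("=")
    # A valid string alternates spacer, motif, ..., motif, spacer, so it has
    # an odd number of fields and at least one motif.
    if len(fields) < 3 or len(fields) % 2 == 0:
        raise ValueError("expected l=n=l=...=n=l, got %r" % order)

    spacers = [int(f) for f in fields[0::2]]

    motifs = []
    for f in fields[1::2]:
        if not f:
            raise ValueError("empty motif field in %r" % order)
        # An optional leading "+" or "-" gives the strand of the motif.
        if f[0] in "+-":
            motifs.append((f[0], f[1:]))
        else:
            motifs.append((None, f))
    return spacers, motifs


print(parse_order("34=3=17=2=5"))
print(parse_order("10=+1=20=-2=0"))
```

Motif identifiers are kept as strings rather than converted to integers, since the sketch makes no assumption about what a motif ID looks like beyond the optional strand prefix.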
Building The Model
--fim  Gaps between motifs are not penalized. Spacer states between motifs are represented as free-insertion modules (FIM). A FIM is an insert state with 1.0 probability of self-transition and 1.0 probability of exit transition. Thus, traversing such a state has zero transition cost. Specifying this option causes all spacers to be represented using FIMs.
--gap-extend cost This switch causes all spacer self-loop log-odds scores to be set to cost. In addition, it causes all other transitions out of a spacer to be set to zero. Together with the --gap-open switch, this allows you to specify an affine gap penalty function, overriding the gap penalty implicit in the model (self-loop transition probabilities of gap states).
--gap-open cost This switch causes all transitions into a spacer state to be assigned a log-odds score equal to cost. Together with the --gap-extend switch, this allows you to specify an affine gap penalty function, overriding the gap penalty implicit in the model (transition probabilities into and out of gap states).
--motif-pseudo num A pseudocount to be added to each count in the motif matrix, after first multiplying by the corresponding background frequency. No pseudocount is added to motif matrix probabilities.
--nspacer value By default each spacer is modeled using a single insert state. The distribution of spacer lengths produced by a single insert state is exponential in form. A more reasonable distribution would be a bell-shaped curve such as a Gaussian. Modeling the length distribution explicitly is computationally expensive; however, a Gaussian distribution can be approximated using multiple insert states to represent a single spacer region. The --nspacer option specifies the number of insert states used to represent each spacer. A single insert state is used.
--spacer-pseudo value Specify the value of the pseudocount used in converting transition counts to spacer self-loop probabilities. No pseudocount is added to self-loop probabilities.
--trans-pseudo value Specify the value of the pseudocount used in converting transition counts to transition probabilities. A pseudocount of 0.1 is added to transition probabilities.
--zselo  Causes spacer emission log-odds scores to be set to zero. This prevents regions of unusual base/residue composition from matching spacers well when the spacer emission frequencies differ from the background frequencies. It is particularly useful with DNA models.
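The effect described under --nspacer can be illustrated numerically. The sketch below is a rough model under stated assumptions, not beadstring's actual parameterization: it assumes the insert states are chained in series, each is visited at least once, all share one self-loop probability, and that probability is chosen so the expected spacer length equals a target mean. The function name `spacer_length_pmf` is hypothetical.

```python
from math import comb


def spacer_length_pmf(k, mean_len, n_states):
    """P(spacer emits exactly k letters) for n_states insert states in series.

    Assumes every state is visited at least once and has the same self-loop
    probability q, chosen so that the expected length is mean_len.
    """
    if mean_len <= n_states:
        raise ValueError("mean_len must exceed n_states")
    if k < n_states:
        return 0.0
    q = 1.0 - n_states / mean_len  # self-loop probability (assumed)
    # Sum of n_states geometric lengths: a shifted negative binomial.
    return comb(k - 1, n_states - 1) * q ** (k - n_states) * (1 - q) ** n_states


for n in (1, 4):
    pmf = [spacer_length_pmf(k, 21.0, n) for k in range(1, 400)]
    mode = 1 + pmf.index(max(pmf))
    print(n, "insert state(s): most likely spacer length =", mode)
```

With one state the most likely length is always the shortest one, which is the exponential-like decay mentioned above. With several states the peak moves toward the mean and the curve becomes bell-shaped, which is why multiple insert states approximate a Gaussian more closely.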
Scoring
--allow-weak-motifs  In p-value score mode, weak motifs are defined as ones where the best possible hit has a p-value greater than the p-value threshold. Such motifs cannot contribute to a match in p-value score mode. By default, the program rejects any search results containing weak motifs, unless the --allow-weak-motifs switch is given. In that case, the search will proceed, but the weak motifs will never appear in any matches. Note: This switch only applies to p-value score mode.
--global  Scores are computed for the match between the entire sequence and the model. Use the maximal local score.
--pam distance By default, target probabilities are derived from the distance-250 PAM matrix for proteins, and from a distance-1 transition/transversion matrix for DNA. With the --pam switch, you can specify a different integer distance from 1 to 500. (This can be overridden with the --score-file switch.) The distance-1 transition/transversion joint probability matrix for DNA is given below:
     A    C    G    T    
A  .990 .002 .006 .002
C  .002 .990 .002 .006
G  .006 .002 .990 .002
T  .002 .006 .002 .990
            
--paths single|all This option determines how the program computes raw scores. With the single option, the program computes the Viterbi score, which is the log-odds score associated with the single most likely match between the sequence and the model. The all option yields the total log-odds score, which is the sum of the log-odds of all sequence-to-model matches. Viterbi scoring is used.
--p-score num The --p-score switch activates p-value score mode with the given threshold. In p-value score mode, motif match scores are converted to their p-values. They are then converted to bit scores as follows:
S = -log2(p/T)
where S is the bit score of the hit, p is the p-value of the log-odds score, and T is the p-value threshold. In this way, only hits more significant than the p-value threshold get positive scores. The p-value threshold, T, must be in the range 0<T≤1.
Log-odds score mode is used.
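The bit-score conversion used in p-value score mode, S = -log2(p/T), can be checked with a short Python sketch. The function name `p_value_bit_score` is a hypothetical helper written for this illustration; it implements only the formula and the stated range of T, not any other part of beadstring's scoring.

```python
from math import log2


def p_value_bit_score(p, threshold):
    """Return S = -log2(p / T) for a hit p-value p and threshold T."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must satisfy 0 < T <= 1")
    if not 0.0 < p <= 1.0:
        raise ValueError("p-value must satisfy 0 < p <= 1")
    return -log2(p / threshold)


T = 2.0 ** -10
print(p_value_bit_score(2.0 ** -20, T))  # hit more significant than T
print(p_value_bit_score(T, T))           # hit exactly at the threshold
print(p_value_bit_score(2.0 ** -5, T))   # hit less significant than T
```

The three calls show the behaviour described above: a hit with a p-value below the threshold scores positive, a hit exactly at the threshold scores zero, and a hit above the threshold scores negative.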