Markov Background Model Format

Description

This file format is used by many programs in the MEME Suite to model Markov background probabilities. A background model file specifies all k-mer frequencies up to a user-chosen maximum k. These frequencies define a Markov model of order k-1. Some of the MEME Suite programs use only the 0-order model and ignore any higher-order information in the file. You can create a Markov model of any order from a FASTA file of sequences using the fasta-get-markov command provided with the downloadable version of the MEME Suite.

Example Background Models

0-order DNA Markov model

#   order 0
a       0.324
c       0.176
g       0.176
t       0.324
      

1st-order DNA Markov model

#   order 0
A       2.563e-01
C       2.437e-01
G       2.437e-01
T       2.563e-01
#   order 1
AA      7.020e-02
AC      5.388e-02
AG      8.089e-02
AT      5.134e-02
CA      7.575e-02
CC      7.050e-02
CG      1.659e-02
CT      8.089e-02
GA      6.280e-02
GC      5.652e-02
GG      7.050e-02
GT      5.388e-02
TA      4.751e-02
TC      6.280e-02
TG      7.575e-02
TT      7.020e-02
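
As an illustration of how the 2-mer frequencies above define a 1st-order model, the sketch below derives transition probabilities from them. It assumes the usual reading P(next | previous) = P(previous next) / P(previous); the names p1, p2 and transition are made up for this example, and the MEME Suite programs may compute these values differently.

```python
# Values copied from the 1st-order DNA example above.
p1 = {"A": 2.563e-01, "C": 2.437e-01, "G": 2.437e-01, "T": 2.563e-01}
p2 = {
    "AA": 7.020e-02, "AC": 5.388e-02, "AG": 8.089e-02, "AT": 5.134e-02,
    "CA": 7.575e-02, "CC": 7.050e-02, "CG": 1.659e-02, "CT": 8.089e-02,
    "GA": 6.280e-02, "GC": 5.652e-02, "GG": 7.050e-02, "GT": 5.388e-02,
    "TA": 4.751e-02, "TC": 6.280e-02, "TG": 7.575e-02, "TT": 7.020e-02,
}


def transition(prev, nxt):
    """Assumed reading: P(nxt | prev) = P(prev + nxt) / P(prev)."""
    return p2[prev + nxt] / p1[prev]


# Print one row of the transition table per preceding letter.
for prev in "ACGT":
    row = {nxt: round(transition(prev, nxt), 4) for nxt in "ACGT"}
    print(prev, row)
```

In this example the transition from C to G is much less likely than the other transitions out of C, which reflects the low frequency of the 2-mer CG in the file.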
      

Format Specification

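Each line contains either a comment, which begins with "#" (such as the "#   order 0" lines in the examples above), or a k-mer followed by whitespace and its probability.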

For each value of k, up to the maximum you choose, the file should have exactly one line for each possible k-mer composed of the core symbols of the standard DNA, RNA, or protein alphabet, or of a custom alphabet. The frequencies of all k-mers must precede the frequencies of all (k+1)-mers.
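
The layout rules above can be checked with a small reader. The following is a minimal sketch, not the parser used by the MEME Suite: parse_background is a hypothetical helper, and it assumes a DNA alphabet, that lines starting with "#" are comments, and that k-mers may be given in either case (as the lowercase 0-order example suggests).

```python
from itertools import product


def parse_background(text, alphabet="ACGT"):
    """Parse background-model text into {k: {kmer: probability}}.

    Assumptions for this sketch: lines starting with '#' are comments,
    k-mers are matched case-insensitively, and the alphabet is DNA.
    """
    model = {}
    last_k = 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        kmer, prob = line.split()
        kmer = kmer.upper()
        k = len(kmer)
        # All k-mers must come before any (k+1)-mer.
        if k < last_k:
            raise ValueError(f"{kmer}: {k}-mer appears after a {last_k}-mer")
        last_k = k
        if kmer in model.setdefault(k, {}):
            raise ValueError(f"{kmer}: listed more than once")
        model[k][kmer] = float(prob)
    # Every k from 1 up to the maximum must be present.
    if sorted(model) != list(range(1, len(model) + 1)):
        raise ValueError("some k-mer lengths are missing")
    # Each k must list exactly the possible k-mers over the alphabet.
    for k, table in model.items():
        expected = {"".join(p) for p in product(alphabet, repeat=k)}
        if set(table) != expected:
            raise ValueError(f"order {k - 1}: missing or unexpected {k}-mers")
    return model


example = """\
#   order 0
a       0.324
c       0.176
g       0.176
t       0.324
"""
print(parse_background(example))
```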

For each value of k, the probabilities of the k-mers must sum to approximately 1.0 (small allowances for rounding are made). To define a consistent Markov model, it is also necessary that, for each value of k, the probabilities of the k-mers that share a particular (k-1)-mer as their suffix sum approximately to the probability of that (k-1)-mer, as given in the file.
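
Both conditions can be tested mechanically. The sketch below applies them to the 1st-order DNA example; check_consistency is a hypothetical helper, and the tolerance of 1e-3 is an arbitrary choice for this illustration, not a value taken from the MEME Suite.

```python
import math
from collections import defaultdict


def check_consistency(model, tol=1e-3):
    """Return a list of problems; an empty list means the model is consistent.

    `model` maps k -> {kmer: probability}. `tol` is an assumed tolerance.
    """
    problems = []
    for k, table in sorted(model.items()):
        # Condition 1: the k-mer probabilities sum to about 1.0.
        total = sum(table.values())
        if not math.isclose(total, 1.0, abs_tol=tol):
            problems.append(f"{k}-mers sum to {total:.6f}, not 1.0")
        # Condition 2: k-mers sharing a (k-1)-mer suffix sum to its probability.
        if k - 1 in model:
            by_suffix = defaultdict(float)
            for kmer, p in table.items():
                by_suffix[kmer[1:]] += p
            for suffix, p in model[k - 1].items():
                if not math.isclose(by_suffix[suffix], p, abs_tol=tol):
                    problems.append(
                        f"{k}-mers ending in {suffix} sum to "
                        f"{by_suffix[suffix]:.6f}, expected {p:.6f}"
                    )
    return problems


# Values copied from the 1st-order DNA example.
model = {
    1: {"A": 2.563e-01, "C": 2.437e-01, "G": 2.437e-01, "T": 2.563e-01},
    2: {
        "AA": 7.020e-02, "AC": 5.388e-02, "AG": 8.089e-02, "AT": 5.134e-02,
        "CA": 7.575e-02, "CC": 7.050e-02, "CG": 1.659e-02, "CT": 8.089e-02,
        "GA": 6.280e-02, "GC": 5.652e-02, "GG": 7.050e-02, "GT": 5.388e-02,
        "TA": 4.751e-02, "TC": 6.280e-02, "TG": 7.575e-02, "TT": 7.020e-02,
    },
}
print(check_consistency(model))
```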

The probabilities are numbers in the range 0 ≤ p ≤ 1. They may be written in simple decimal notation (e.g., 0.00015) or in exponential notation (e.g., 1.5e-4). To be precise, each probability is a number p that matches the regular expression ^([0]|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?$ and lies in the range 0 ≤ p ≤ 1.
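
To make the number syntax concrete, here is a minimal sketch that applies the regular expression above together with the range check. The function name parse_probability is hypothetical; the pattern is the one given in the text, used with a full match so the ^ and $ anchors are implied.

```python
import re

# Pattern from the specification; fullmatch() supplies the ^ and $ anchors.
PROB_RE = re.compile(r"([0]|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def parse_probability(token):
    """Return token as a float if it has valid syntax and lies in [0, 1]."""
    if PROB_RE.fullmatch(token) is None:
        raise ValueError(f"{token!r} is not a valid probability literal")
    p = float(token)
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"{token!r} is outside the range 0 <= p <= 1")
    return p


# Try a few well-formed and malformed tokens.
for token in ["0.00015", "1.5e-4", "2.563e-01", ".5", "01.5", "2.0"]:
    try:
        print(token, "->", parse_probability(token))
    except ValueError as err:
        print(token, "-> rejected:", err)
```

Note that the pattern requires a digit before the decimal point and forbids leading zeros, so tokens such as ".5" and "01.5" are rejected even though many languages would accept them as numbers.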

Note that probabilities of zero (or one) are not allowed, because they cause asymptotic conditions in the equations used by our programs. They are also unlikely to be correct: the fact that the dataset used to calculate a background contains no instances of "CGAAA" does not mean that this k-mer is impossible. For this reason, the tool fasta-get-markov automatically adds pseudocounts to the observed letter counts (unless it is specifically told not to).
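
The effect of pseudocounts can be seen in a small sketch. This uses plain additive smoothing with a made-up pseudocount of 1.0 and made-up counts; it is only an illustration of the idea, and the scheme and default value that fasta-get-markov actually uses may differ.

```python
def smoothed_frequencies(counts, pseudocount=1.0):
    """Additive smoothing: add `pseudocount` to every count, then normalize.

    This is an assumed, generic scheme, not necessarily the one used by
    fasta-get-markov.
    """
    total = sum(counts.values()) + pseudocount * len(counts)
    return {kmer: (n + pseudocount) / total for kmer, n in counts.items()}


# Hypothetical observed letter counts; "C" was never seen.
counts = {"A": 6, "C": 0, "G": 3, "T": 1}
freqs = smoothed_frequencies(counts)
for letter, p in freqs.items():
    print(f"{letter}\t{p:.3e}")
```

Without the pseudocount, "C" would receive a probability of exactly zero, which the format does not allow.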

See Also

A background model file can be created from any FASTA sequence file using the fasta-get-markov program.