Markov Background Model Format

This file format is used by many programs in the MEME Suite to model Markov background probabilities. A background model file specifies the frequencies of all k-mers for every k up to a user-chosen maximum. Together these define a Markov model of order k-1, where k is that maximum. Some of the MEME Suite programs use only the 0-order model and ignore any higher-order information in the file. You can easily create a Markov model of any order from a FASTA file of sequences using the fasta-get-markov command provided with the downloadable version of the MEME Suite.

0-order DNA Markov model

#   order 0
a       0.324
c       0.176
g       0.176
t       0.324

1st-order DNA Markov model

#   order 0
A       2.563e-01
C       2.437e-01
G       2.437e-01
T       2.563e-01
#   order 1
AA      7.020e-02
AC      5.388e-02
AG      8.089e-02
AT      5.134e-02
CA      7.575e-02
CC      7.050e-02
CG      1.659e-02
CT      8.089e-02
GA      6.280e-02
GC      5.652e-02
GG      7.050e-02
GT      5.388e-02
TA      4.751e-02
TC      6.280e-02
TG      7.575e-02
TT      7.020e-02
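
The k-mer frequencies in the example above can be read as a Markov chain by dividing each 2-mer probability by the probability of its first letter. The following Python sketch shows this standard calculation. The names order0, order1 and transition_probability are illustrative only and are not part of the MEME Suite, and individual MEME Suite programs may compute their conditional probabilities differently.

```python
# Values copied from the 1st-order DNA example above.
order0 = {"A": 2.563e-01, "C": 2.437e-01, "G": 2.437e-01, "T": 2.563e-01}
order1 = {
    "AA": 7.020e-02, "AC": 5.388e-02, "AG": 8.089e-02, "AT": 5.134e-02,
    "CA": 7.575e-02, "CC": 7.050e-02, "CG": 1.659e-02, "CT": 8.089e-02,
    "GA": 6.280e-02, "GC": 5.652e-02, "GG": 7.050e-02, "GT": 5.388e-02,
    "TA": 4.751e-02, "TC": 6.280e-02, "TG": 7.575e-02, "TT": 7.020e-02,
}


def transition_probability(prev, nxt):
    """Return P(nxt | prev) = P(prev + nxt) / P(prev)."""
    return order1[prev + nxt] / order0[prev]


# Print one row of conditional probabilities per preceding letter.
for prev in "ACGT":
    row = "  ".join(
        f"{nxt}:{transition_probability(prev, nxt):.3f}" for nxt in "ACGT"
    )
    print(prev, "->", row)
```

With these values, P(G | C) is about 0.068, well below the 0-order frequency of G (0.2437), which is the kind of context dependence a higher-order background captures.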

Each line may contain one of the following:

Only white-space characters (this includes completely empty lines).
A unique k-mer and a probability separated and potentially surrounded by whitespace.
One of the other options followed by a "#" character designating the rest of the line as a comment to be ignored.
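
The line rules above can be sketched as a small reader. This is a minimal Python illustration, not the parser used by the MEME Suite: the function name parse_background and the choice to group k-mers by their length are assumptions made for the example.

```python
from collections import defaultdict


def parse_background(text):
    """Parse background-model text into {k: {kmer: probability}}."""
    model = defaultdict(dict)
    for raw in text.splitlines():
        # Everything after "#" is a comment; surrounding whitespace is ignored.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only line
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"expected '<k-mer> <probability>', got: {raw!r}")
        kmer, prob = fields[0], float(fields[1])
        if kmer in model[len(kmer)]:
            raise ValueError(f"duplicate k-mer: {kmer}")
        model[len(kmer)][kmer] = prob
    return dict(model)


example = """\
#   order 0
a       0.324
c       0.176   # trailing comments are dropped
g       0.176

t       0.324
"""
model = parse_background(example)
print(model[1])
```

Here model[1] holds the 1-mers (the 0-order model), model[2] would hold the 2-mers, and so on.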

For each value of k, up to the maximum you choose, the file should have exactly one line for each possible k-mer composed of the core symbols from either the standard DNA, RNA, or protein alphabet, or from a custom alphabet. The frequencies of all k-mers must precede the frequencies of all (k+1)-mers.

For each value of k, the probabilities of the k-mers must sum to approximately 1.0 (small allowances for rounding are made). To define a consistent Markov model, it is necessary that, for each value of k, the sum of the probabilities of the k-mers whose suffix is a particular (k-1)-mer should approximately equal the probability of that (k-1)-mer, as given in the file.
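
Both conditions can be checked mechanically. The Python sketch below does so for the 1st-order DNA example shown earlier. The function name check_consistency, the {k: {kmer: probability}} layout and the tolerance of 1e-3 are assumptions for illustration; the actual rounding allowance used by MEME Suite programs is not stated here.

```python
import math


def check_consistency(model, tol=1e-3):
    """Return a list of problems found; an empty list means both checks passed.

    model maps k to a dict of {kmer: probability}. tol is an assumed tolerance.
    """
    problems = []
    for k in sorted(model):
        total = sum(model[k].values())
        if not math.isclose(total, 1.0, abs_tol=tol):
            problems.append(f"{k}-mers sum to {total:.6f}, not 1.0")
        if k - 1 in model:
            # Add up the k-mers that share each (k-1)-mer suffix.
            suffix_sums = {}
            for kmer, p in model[k].items():
                suffix_sums[kmer[1:]] = suffix_sums.get(kmer[1:], 0.0) + p
            for sub, p in model[k - 1].items():
                s = suffix_sums.get(sub, 0.0)
                if not math.isclose(s, p, abs_tol=tol):
                    problems.append(
                        f"k-mers ending in {sub} sum to {s:.6f}, expected {p:.6f}"
                    )
    return problems


model = {
    1: {"A": 2.563e-01, "C": 2.437e-01, "G": 2.437e-01, "T": 2.563e-01},
    2: {
        "AA": 7.020e-02, "AC": 5.388e-02, "AG": 8.089e-02, "AT": 5.134e-02,
        "CA": 7.575e-02, "CC": 7.050e-02, "CG": 1.659e-02, "CT": 8.089e-02,
        "GA": 6.280e-02, "GC": 5.652e-02, "GG": 7.050e-02, "GT": 5.388e-02,
        "TA": 4.751e-02, "TC": 6.280e-02, "TG": 7.575e-02, "TT": 7.020e-02,
    },
}
print(check_consistency(model))
```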

The probabilities are numbers in the range 0 ≤ p ≤ 1. They may be written in simple decimal (e.g., 0.00015) or in exponential notation (e.g., 1.5e-4). To be precise, each probability is a number p that matches the regular expression ^([0]|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?$ and lies in the range 0 ≤ p ≤ 1.
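
As a quick illustration of this rule, the following Python sketch applies the regular expression quoted above and then the range check. The helper name is_valid_probability is hypothetical. Note that the pattern requires at least one digit before the decimal point and forbids extra leading zeros.

```python
import re

# Pattern copied verbatim from the format description above.
PROBABILITY_RE = re.compile(r"^([0]|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?$")


def is_valid_probability(token):
    """True if token matches the pattern and its value lies in [0, 1]."""
    if PROBABILITY_RE.fullmatch(token) is None:
        return False
    return 0.0 <= float(token) <= 1.0


for token in ["0.00015", "1.5e-4", "2.563e-01", ".5", "01.5", "1.5", "-0.1"]:
    print(f"{token!r}: {is_valid_probability(token)}")
```

The first three tokens are accepted; ".5" and "01.5" fail the pattern, "1.5" is out of range, and "-0.1" fails because the pattern has no sign before the mantissa.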

Note that although the range above includes its endpoints, probabilities of exactly zero (or one) are not allowed, because these cause asymptotic conditions in the equations used by our programs. They are also unlikely to be correct: just because the dataset used to calculate a background does not contain any instances of "CGAAA" does not mean that this k-mer is impossible. For this reason the tool fasta-get-markov automatically adds pseudocounts to the observed letter counts (unless it is specifically told not to).
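
The effect of pseudocounts can be seen in a small Python sketch. This is one simple variant that adds a constant to every k-mer count before normalizing. It is not the algorithm used by fasta-get-markov, and the function name kmer_frequencies, its parameters and the pseudocount value of 1.0 are all assumptions for illustration.

```python
from itertools import product


def kmer_frequencies(sequences, k, alphabet="ACGT", pseudocount=1.0):
    """Estimate k-mer frequencies, starting every count at `pseudocount`."""
    counts = {"".join(p): pseudocount for p in product(alphabet, repeat=k)}
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:  # skip windows containing non-core symbols
                counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


freqs = kmer_frequencies(["ACGTACGT"], k=2)
print(freqs["AC"], freqs["AA"])  # observed 2-mer vs. unobserved 2-mer
```

The 2-mer "AA" never occurs in the input, yet its estimated frequency is small but non-zero, so the resulting file satisfies the rule above.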

A background model file can be created from any FASTA sequence file using the fasta-get-markov program.
