Custom Alphabet Definition Format

Almost all MEME Suite programs make use of alphabets to understand the meaning of sequences or motifs. The MEME Suite programs support three standard alphabets: the DNA, RNA and protein alphabets. The MEME Suite also allows you to define a custom alphabet so that you can explore motifs containing, for example, modified nucleotides (e.g., methylcytosines) or modified amino acids (e.g., phosphorylated serines and tyrosines).

This document describes how you can specify an alphabet of your own design for use with the MEME Suite programs that support custom alphabets. When you specify a custom alphabet, you can define the symbols that represent each nucleotide or amino acid, their names, their colors and their complements (in the case of alphabets like DNA that have complementary residues). You can also define "ambiguous" symbols that represent more than one symbol in your custom alphabet (e.g., "R" to represent purines in a DNA-based alphabet). You then provide your custom alphabet definition as a text file to the MEME Suite programs to inform them that the sequences you provide use this non-standard alphabet.

Quick Overview

You can look at the examples at the end of this document to get a quick idea of how alphabet definitions look. The first three examples define the standard DNA, RNA and protein alphabets supported by the MEME Suite. Note that everything after the "#" character on a line is treated as a comment and ignored by the MEME Suite, except where "#" appears inside quoted text.
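The comment rule can be sketched in Python as follows. This is only an illustration of the rule described in this document, not the MEME Suite's own parser; the function name strip_comment is hypothetical. It treats "#" as the start of a comment unless it appears inside double-quoted text.

```python
def strip_comment(line):
    """Remove a trailing '#' comment, ignoring '#' inside quoted text."""
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes and ch == "\\":
            i += 2  # skip the escaped character (e.g. \" or \\)
            continue
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "#" and not in_quotes:
            return line[:i].rstrip()
        i += 1
    return line.rstrip()


print(strip_comment("U = T # (U is an alias for T)"))
print(strip_comment('A "C#5 position" CC0000 # note'))
```

The second call keeps the "#" inside the quoted name and drops only the trailing comment.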

Format Specification

The alphabet definition contains the following sections:

Header
Core Symbols
Ambiguous Symbols and Aliases

Header

The header line signals that the file is an alphabet definition, gives the name of the alphabet and specifies if it is like a standard alphabet.

ALPHABET "name" standard-LIKE

The "name" is optional and gives the name that is used to refer to the alphabet in outputs. It follows all the rules of quoted text. If the name is not given, then the alphabet will be referred to by the list of its core symbols.

The standard-LIKE is also optional and can be used to specify a reference alphabet that this alphabet might be sensibly compared to under most circumstances. Adding this optional flag will require that the alphabet you are defining has the uppercase form of all the reference alphabet's core symbols as well as their complements. The possible values for standard include DNA, RNA and PROTEIN.

For example, if you were creating an alphabet that extended the DNA alphabet you might use a header like:

ALPHABET "Extended DNA" DNA-LIKE

This would require that the alphabet you were defining contained A, C, G and T and that A was the complement of T and C was the complement of G.
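As a rough illustration of the header layout, the following Python sketch splits a header line into its optional name and optional standard-LIKE flag. The names HEADER_RE and parse_header are hypothetical, the line is assumed to have had any comment removed already, and the name is returned without decoding escape sequences.

```python
import re

# ALPHABET, then an optional quoted name, then an optional standard-LIKE flag.
HEADER_RE = re.compile(
    r'^ALPHABET'
    r'(?:\s+"(?P<name>(?:[^"\\]|\\.)*)")?'
    r'(?:\s+(?P<like>DNA|RNA|PROTEIN)-LIKE)?'
    r'\s*$'
)


def parse_header(line):
    """Return (name, like); either item is None when it was not given."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an alphabet header: {line!r}")
    return match.group("name"), match.group("like")


print(parse_header('ALPHABET "Extended DNA" DNA-LIKE'))
print(parse_header("ALPHABET"))
```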

Core Symbols

The core symbols of an alphabet can either be defined on their own or two per line with a '~' between them to show them as complements.

For example you can define the symbol 'A' and the symbol 'T' to be complements of each other as follows:

A ~ T

However, if 'A' and 'T' do not have complements then you would define them on separate lines:

A
T

Listing a letter like 'A' or 'T' is the simplest way to define a symbol but you can also specify a name and color for each symbol.
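The complement notation can be modelled with a short Python sketch. It is a simplified illustration that assumes bare single-character symbols (no names or colors) and lines with comments already removed; the function name read_core_symbols is hypothetical.

```python
def read_core_symbols(lines):
    """Return (symbols, complements) from bare core-symbol lines."""
    symbols = []
    complements = {}
    for line in lines:
        parts = [part.strip() for part in line.split("~")]
        if len(parts) == 2:
            # Two symbols joined by '~' are complements of each other.
            first, second = parts
            symbols += [first, second]
            complements[first] = second
            complements[second] = first
        else:
            # A symbol on its own line has no complement.
            symbols.append(parts[0])
    return symbols, complements


print(read_core_symbols(["A ~ T", "C ~ G"]))
print(read_core_symbols(["A", "T"]))
```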

Ambiguous Symbols and Aliases

The ambiguous symbols of an alphabet are listed after all the core symbols have been defined. An ambiguous symbol will have the symbol definition on the left, followed by an equals '=' and a list of core symbols on the right.

For example, you can define the symbol 'N' to represent 'A', 'C', 'G' or 'T' as follows:

N = ACGT

If there is a single core symbol on the right of an ambiguous symbol definition, the symbol on the left is considered to be an alias for that core symbol.

For example, you can define the symbol 'U' to be an alias for 'T' as follows:

U = T # (U is an alias for T)

As with the core symbols you can also specify a name and color for each ambiguous and alias symbol.

How ambiguous symbols and aliases are handled differs for each of the programs in the MEME Suite. See the specific documentation for each program for how they treat ambiguous symbols.
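The difference between an ambiguous symbol and an alias can be sketched in Python as below. This is a simplified illustration for bare symbols only (no names or colors, and no "#" or "=" inside quoted text); the function name parse_ambiguous is hypothetical and says nothing about how individual MEME Suite programs treat these symbols.

```python
def parse_ambiguous(line):
    """Return (symbol, set of core symbols, kind) for a line like 'N = ACGT'."""
    definition = line.split("#", 1)[0]  # drop a trailing comment
    left, right = definition.split("=", 1)
    symbol = left.strip()
    targets = right.strip()
    # A single core symbol on the right makes the left-hand symbol an alias.
    kind = "alias" if len(targets) == 1 else "ambiguous"
    return symbol, set(targets), kind


print(parse_ambiguous("N = ACGT"))
print(parse_ambiguous("U = T # (U is an alias for T)"))
```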

Additional Information

Allowed Symbol Characters

Each symbol is a single letter, number or one of the characters '.', '-', '*' or '?'. Letters may be either upper- or lowercase (see Letter Case below on the interpretation of case by MEME Suite programs). The '?' is a special wildcard character, and if you use it you must define it to match all core symbols (see Wildcard Symbol, below).

Letter Case

If all the letters you define as symbols are in a single case (all uppercase or all lowercase), the programs in the MEME Suite will ignore case when reading sequences. However, if you include both upper- and lowercase letters in your alphabet definition, then upper- and lowercase letters will be treated as distinct symbols.
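A minimal Python sketch of this rule follows. The function names are hypothetical, and folding a sequence to the case used in the definition is only one possible way to "ignore case"; it is an assumption made for illustration.

```python
def is_case_sensitive(symbols):
    """True when the defined symbols mix upper- and lowercase letters."""
    has_upper = any(s.isupper() for s in symbols)
    has_lower = any(s.islower() for s in symbols)
    return has_upper and has_lower


def fold_case(sequence, symbols):
    """Fold a sequence to the alphabet's case when case is not significant."""
    if is_case_sensitive(symbols):
        return sequence  # upper- and lowercase letters are distinct symbols
    if any(s.islower() for s in symbols):
        return sequence.lower()
    return sequence.upper()


print(fold_case("acgt", "ACGTN"))   # single-case alphabet: case is ignored
print(fold_case("ACmG", "ACGTm"))   # mixed-case alphabet: left unchanged
```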

Core Symbol Ordering

MEME Suite programs internally order core symbols so that the uppercase letters A-Z come first, followed by the lowercase letters a-z, then by the numbers 0-9 and finally by the symbols '*', '-' and '.' in that order. (Note that '?' is not included in this list because it is never allowed to be a core symbol.) This ordering is used to determine the order of the columns in motifs output by MEME Suite motif discovery programs. Note: The order in which you specify symbols within the core symbol section of your alphabet file does not matter.
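The ordering described above can be reproduced with a sort key, as in this Python sketch. The function name core_symbol_sort_key is hypothetical, and the sketch only mirrors the stated ordering rather than the MEME Suite's internal code.

```python
def core_symbol_sort_key(ch):
    """Sort key: A-Z, then a-z, then 0-9, then '*', '-' and '.'."""
    if "A" <= ch <= "Z":
        return (0, ord(ch))
    if "a" <= ch <= "z":
        return (1, ord(ch))
    if "0" <= ch <= "9":
        return (2, ord(ch))
    # Raises ValueError for any other character, including '?',
    # which is never allowed to be a core symbol.
    return (3, "*-.".index(ch))


# Core symbols of the "DNA with covalent modifications" example,
# in the order they are written in the definition.
core = "ATCGm1h2f3c4"
print("".join(sorted(core, key=core_symbol_sort_key)))
```

For the core symbols of the "DNA with covalent modifications" example this yields ACGTcfhm1234, which is the same order used on the right-hand side of that example's wildcard definition.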

Wildcard Symbol

A wildcard is an ambiguous symbol that matches any core symbol. To define a wildcard symbol, list all the core symbols after the equals sign. Since many programs in the MEME Suite require that alphabets have a wildcard symbol in order for them to work correctly, if you do not define one the MEME Suite program will automatically define the symbol '?' to be the wildcard. It is strongly recommended that all custom alphabets you define include a wildcard symbol.

If you wish, you may manually define the wildcard as '?'.
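The wildcard rule can be sketched in Python as follows. The function name ensure_wildcard and the dictionary representation of the ambiguous symbols are assumptions made for illustration; the sketch simply looks for an ambiguous symbol that covers every core symbol and falls back to '?' when none is defined.

```python
def ensure_wildcard(core, ambiguous):
    """Return the wildcard symbol, adding '?' to `ambiguous` if none exists.

    `core` is a string of core symbols and `ambiguous` maps each ambiguous
    symbol to the string of core symbols that it represents.
    """
    core_set = set(core)
    for symbol, targets in ambiguous.items():
        if set(targets) == core_set:
            return symbol  # an explicitly defined wildcard
    ambiguous["?"] = "".join(core)  # automatic fallback wildcard
    return "?"


print(ensure_wildcard("ACGT", {"R": "AG", "N": "ACGT"}))
print(ensure_wildcard("ACGT", {"R": "AG"}))
```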

Advanced Symbol Definition

Each symbol definition can have up to three fields as follows:

character "name" color

character

The symbol "character" is a single character chosen from the list of allowed symbol characters. The symbol character is required and is always the first field.

name

The symbol "name" is optional but it will make any outputs generated using this alphabet easier to understand by providing a reference on the meaning of the symbols used. If present, the symbol name must be the second field.

The "name" follows the rules of quoted text.

color

The symbol color is optional and represented by a 6 digit hexadecimal number with digits 1 & 2 defining the red component, digits 3 & 4 defining the green component and digits 5 & 6 defining the blue component.

Here are some example colors: CC0000, 008000, 0000CC, FFB300, FF00FF, FFCCCC, FFFF00 and 33E6CC.

If you do not specify the color of one or more symbols, the MEME Suite will choose evenly spaced colors for them using its own algorithm.
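To make the three fields concrete, here is a Python sketch that splits a single symbol definition into its character, optional name and optional color, and converts the color to red, green and blue components. The names SYMBOL_RE, hex_to_rgb and parse_symbol are hypothetical. The sketch handles one symbol only (not a complement pair joined by '~'), leaves escape sequences in the name undecoded, and accepts lowercase hexadecimal digits, which is an assumption.

```python
import re

# character, then an optional quoted name, then an optional 6-digit hex color.
SYMBOL_RE = re.compile(
    r'^(?P<char>\S)'
    r'(?:\s+"(?P<name>(?:[^"\\]|\\.)*)")?'
    r'(?:\s+(?P<color>[0-9A-Fa-f]{6}))?'
    r'\s*$'
)


def hex_to_rgb(color):
    """Split a 6-digit hex color into (red, green, blue) integers."""
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def parse_symbol(text):
    """Return (character, name, rgb); name and rgb are None when omitted."""
    match = SYMBOL_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a symbol definition: {text!r}")
    color = match.group("color")
    rgb = hex_to_rgb(color) if color else None
    return match.group("char"), match.group("name"), rgb


print(parse_symbol('A "Adenine" CC0000'))
print(parse_symbol("T"))
```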

Quoted Text

Quoted text is used to describe things like the name of a symbol or the name of the alphabet, and it will be displayed in outputs. Names make the alphabet definition self-documenting, and while they aren't required, they are highly recommended.

There are a few restrictions on quoted text:

Must begin and end with a double-quote character which designates the bounds and is not considered part of the text
Must not contain control characters
Must not contain whitespace (i.e., tab, newline, ...) other than the standard 'SPACE' (U+0020)
Contained double-quote characters must be escaped as \"
Contained back-slash characters must be escaped as \\
Contained forward-slash characters may be optionally escaped as \/
Contained Unicode characters may be optionally escaped as \u followed by 4 hexadecimal digits, although UTF-8 encoding is also fine
No other back-slash escape combinations are allowed although \b \f \n \r and \t will be understood by the parser for the purpose of giving a better error message.
Maximum length of 40 Unicode characters (not bytes) after removing the surrounding double-quotes and converting all escape sequences
The comment character # will be ignored within quoted text
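The restrictions above can be checked with a short Python sketch. It is an illustration under stated assumptions, not the MEME Suite's parser: the name decode_quoted is hypothetical, control and whitespace checks are applied only to the raw text (not to characters produced by \u escapes), and surrogate pairs written as two \u escapes are not combined.

```python
import re
import unicodedata

# A backslash followed by 'u' plus 4 hex digits, or by any single character.
_ESCAPE_RE = re.compile(r"\\(u[0-9A-Fa-f]{4}|.)", re.DOTALL)
_SIMPLE_ESCAPES = {'"': '"', "\\": "\\", "/": "/"}
MAX_LENGTH = 40


def _unescape(match):
    esc = match.group(1)
    if len(esc) == 5:  # 'u' followed by 4 hexadecimal digits
        return chr(int(esc[1:], 16))
    if esc in _SIMPLE_ESCAPES:
        return _SIMPLE_ESCAPES[esc]
    raise ValueError(f"unsupported escape sequence: \\{esc}")


def decode_quoted(text):
    """Validate quoted text and return it without quotes or escapes."""
    if len(text) < 2 or not (text.startswith('"') and text.endswith('"')):
        raise ValueError("quoted text must begin and end with a double quote")
    body = text[1:-1]
    # Anything left after removing escape sequences must not be '"' or '\'.
    if re.search(r'["\\]', _ESCAPE_RE.sub("", body)):
        raise ValueError("unescaped double quote or backslash")
    for ch in body:
        if ch != " " and (ch.isspace() or unicodedata.category(ch) == "Cc"):
            raise ValueError("control or whitespace character other than SPACE")
    decoded = _ESCAPE_RE.sub(_unescape, body)
    if len(decoded) > MAX_LENGTH:
        raise ValueError("quoted text is longer than 40 characters")
    return decoded


print(decode_quoted('"Aspartic acid"'))
print(decode_quoted('"a \\"quoted\\" word"'))
```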

Examples

1) Standard DNA alphabet

ALPHABET "DNA" DNA-LIKE

# Core symbols
A "Adenine" CC0000 ~ T "Thymine" 008000
C "Cytosine" 0000CC ~ G "Guanine" FFB300

# Ambiguous symbols
U = T # alias Uracil to Thymine (permit U in input sequences)
R = AG
Y = CT
K = GT
M = AC
S = CG
W = AT
B = CGT
D = GAT
H = ACT
V = ACG
N = ACGT # wildcard symbol
X = ACGT # alias for wildcard symbol

2) Standard RNA alphabet

ALPHABET "RNA" RNA-LIKE

# This alphabet will accept "T" in place of "U"
# in input sequences, but logos will use "U".

# Core symbols
A "Adenine" CC0000
C "Cytosine" 0000CC
U "Uracil" 008000
G "Guanine" FFB300

# Ambiguous symbols
T = U   # (permit T in input sequences)
R = AG
Y = CU
K = GU
M = AC
S = CG
W = AU
B = CGU
D = GAU
H = ACU
V = ACG
N = ACGU # wildcard symbol

3) Standard Protein alphabet

ALPHABET "Protein" PROTEIN-LIKE

# Core symbols
A "Alanine" 0000CC
R "Arginine" CC0000
N "Asparagine" 008000
D "Aspartic acid" FF00FF
C "Cysteine" 0000CC
E "Glutamic acid" FF00FF
Q "Glutamine" 008000
G "Glycine" FFB300
H "Histidine" FFCCCC
I "Isoleucine" 0000CC
L "Leucine" 0000CC
K "Lysine" CC0000
M "Methionine" 0000CC
F "Phenylalanine" 0000CC
P "Proline" FFFF00
S "Serine" 008000
T "Threonine" 008000
W "Tryptophan" 0000CC
Y "Tyrosine" 33E6CC
V "Valine" 0000CC
# These are commented-out because they are not in the default MEME Suite Protein alphabet
# U "Selenocysteine" 0000CC
# O "Pyrrolysine" 0000CC

# Ambiguous symbols
B = ND
Z = QE
J = LI
#X = ARNDCEQGHILKMFPSTWYVUO # wildcard symbol including U and O
X = ARNDCEQGHILKMFPSTWYV # wildcard symbol omitting U and O

4) DNA with covalent modifications alphabet

ALPHABET "DNA with covalent modifications" DNA-LIKE

# Core symbols
A "Adenine" 8510A8 ~ T "Thymine" A89610
C "Cytosine" A50026 ~ G "Guanine" 313695
m "5-Methylcytosine" D73027 ~ 1 "Guanine:5-Methylcytosine" 4575B4
h "5-Hydroxymethylcytosine" F46D43 ~ 2 "Guanine:5-Hydroxymethylcytosine" 74ADD1
f "5-Formylcytosine" FDAE61 ~ 3 "Guanine:5-Formylcytosine" ABD9E9
c "5-Carboxylcytosine" FEE090 ~ 4 "Guanine:5-Carboxylcytosine" E0F3F8

# Ambiguous symbols
z = Cmhfc
9 = G1234
y = Cfc
8 = G34
x = mh
7 = 12
R = AG
Y = CT
K = GT
M = AC
S = CG
W = AT
B = CGT
D = GAT
H = ACT
V = ACG
N = ACGT
? = ACGTcfhm1234 # wildcard symbol

The MEME Suite

Motif-based sequence analysis tools