Custom Alphabet Definition Format

Description

Almost all MEME Suite programs make use of alphabets to understand the meaning of sequences or motifs. The MEME Suite programs support three standard alphabets: the DNA, RNA and protein alphabets. The MEME Suite also lets you define a custom alphabet so that you can explore motifs containing, for example, modified nucleotides (e.g., methylcytosines) or modified amino acids (e.g., phosphorylated serines and tyrosines).

This document describes how you can specify an alphabet of your own design for use with the MEME Suite programs that support custom alphabets. When you specify a custom alphabet, you can define the symbols that represent each nucleotide or amino acid, their names, their colors and their complements (in the case of alphabets like DNA that have complementary residues). You can also define "ambiguous" symbols that represent more than one symbol in your custom alphabet (e.g., "R" to represent purines in a DNA-based alphabet). You then provide your custom alphabet definition as a text file to the MEME Suite programs to inform them that the sequences you provide use this non-standard alphabet.

Quick Overview

You can look at some examples of alphabet definitions to get a quick idea of how they look. The first three examples define the standard DNA, RNA and protein alphabets supported by the MEME Suite. Note that everything after the "#" character anywhere on a line is treated as a comment (ignored) by the MEME Suite.

  1. Standard DNA alphabet
  2. Standard RNA alphabet
  3. Standard protein alphabet
  4. DNA with covalent modifications alphabet

Format Specification

The alphabet definition contains the following sections:

  1. Header
  2. Core Symbols
  3. Ambiguous Symbols and Aliases

Header

The header line signals that the file is an alphabet definition, gives the name of the alphabet and specifies if it is like a standard alphabet.

ALPHABET "name" standard-LIKE

The "name" is optional and gives the name that is used to refer to the alphabet in outputs. It follows all the rules of quoted text. If the name is not given, then the alphabet will be referred to by the list of its core symbols.

The standard-LIKE is also optional and can be used to specify a reference alphabet that this alphabet might be sensibly compared to under most circumstances. Adding this optional flag will require that the alphabet you are defining has the uppercase form of all the reference alphabet's core symbols as well as their complements. The possible values for standard include DNA, RNA and PROTEIN.

For example if you were creating an alphabet that extended the DNA alphabet you might use a header like:

ALPHABET "Extended DNA" DNA-LIKE

This would require that the alphabet you were defining contained A, C, G and T and that A was the complement of T and C was the complement of G.
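The header syntax described above can be sketched as a small Python parser. This is an illustration only: `parse_header` and `HEADER_RE` are hypothetical names, not part of the MEME Suite, and the sketch assumes the name contains no escaped quotes and no "#" character.

```python
import re

# Hypothetical pattern for the header line: the keyword ALPHABET, an optional
# quoted name, and an optional standard-LIKE flag (DNA, RNA or PROTEIN).
HEADER_RE = re.compile(
    r'^ALPHABET'
    r'(?:\s+"(?P<name>[^"]*)")?'
    r'(?:\s+(?P<like>DNA|RNA|PROTEIN)-LIKE)?'
    r'\s*$'
)


def parse_header(line):
    """Return (name, like) for a header line; either value may be None."""
    # Drop a trailing comment. Assumes '#' does not occur inside the name.
    line = line.split("#", 1)[0].strip()
    match = HEADER_RE.match(line)
    if match is None:
        raise ValueError(f"not an alphabet header: {line!r}")
    return match.group("name"), match.group("like")


print(parse_header('ALPHABET "Extended DNA" DNA-LIKE'))
```

Both optional fields come back as `None` when they are omitted, which mirrors the rule that the name and the standard-LIKE flag are each optional.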

Core Symbols

The core symbols of an alphabet can either be defined on their own or two per line with a '~' between them to show them as complements.

For example you can define the symbol 'A' and the symbol 'T' to be complements of each other as follows:

A ~ T

However if 'A' and 'T' do not have complements then you would define them on separate lines:

A
T

Listing a letter like 'A' or 'T' is the simplest way to define a symbol but you can also specify a name and color for each symbol.
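As a rough illustration of the two forms of core symbol line, the following Python sketch reads a line and reports which symbols it defines and what their complements are. The function name `parse_core_line` is hypothetical, and the sketch assumes bare single-character symbols with no name or color fields.

```python
def parse_core_line(line):
    """Return a list of (symbol, complement) pairs defined by one line.

    Assumes bare symbols only (no "name" or color fields).
    """
    # Everything after '#' is a comment.
    line = line.split("#", 1)[0].strip()
    parts = [part.strip() for part in line.split("~")]
    if len(parts) == 1 and len(parts[0]) == 1:
        # A symbol on its own has no complement.
        return [(parts[0], None)]
    if len(parts) == 2 and all(len(part) == 1 for part in parts):
        # 'A ~ T' makes each symbol the complement of the other.
        first, second = parts
        return [(first, second), (second, first)]
    raise ValueError(f"not a core symbol line: {line!r}")


print(parse_core_line("A ~ T"))
print(parse_core_line("A"))
```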

Ambiguous Symbols and Aliases

The ambiguous symbols of an alphabet are listed after all the core symbols have been defined. An ambiguous symbol will have the symbol definition on the left, followed by an equals '=' and a list of core symbols on the right.

For example, you can define the symbol 'N' to represent 'A', 'C', 'G' or 'T' as follows:

N = ACGT

If there is a single core symbol on the right of an ambiguous symbol definition, the symbol on the left is considered to be an alias for that core symbol.

For example, you can define the symbol 'U' to be an alias for 'T' as follows:

U = T # (U is an alias for T)

As with the core symbols you can also specify a name and color for each ambiguous and alias symbol.

How ambiguous symbols and aliases are handled differs for each of the programs in the MEME Suite. See the specific documentation for each program for how they treat ambiguous symbols.
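The distinction between an ambiguous symbol and an alias can be sketched in Python as follows. The helper `parse_ambiguous_line` is a hypothetical name, and the sketch assumes the symbol on the left has no name or color fields.

```python
def parse_ambiguous_line(line):
    """Return (symbol, core_symbols, is_alias) for a line like 'N = ACGT'.

    Assumes a bare symbol on the left (no "name" or color fields).
    """
    # Everything after '#' is a comment.
    line = line.split("#", 1)[0]
    left, separator, right = line.partition("=")
    symbol, core_symbols = left.strip(), right.strip()
    if not separator or len(symbol) != 1 or not core_symbols:
        raise ValueError(f"not an ambiguous symbol line: {line!r}")
    # A single core symbol on the right makes the left symbol an alias.
    return symbol, core_symbols, len(core_symbols) == 1


print(parse_ambiguous_line("N = ACGT"))
print(parse_ambiguous_line("U = T # (U is an alias for T)"))
```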

Additional Information

Allowed Symbol Characters

Each symbol is a single letter, number or one of the characters '.', '-', '*' or '?'. Letters may be either upper- or lowercase (see Letter Case below on the interpretation of case by MEME Suite programs). The '?' is a special wildcard character, and if you use it you must define it to match all core symbols (see Wildcard Symbol, below).
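A minimal Python check for the rule above might look like this. The name `is_allowed_symbol` is hypothetical, and the sketch assumes that "letter" and "number" mean the ASCII letters a-z, A-Z and the digits 0-9.

```python
import string

# Assumption: letters and numbers are restricted to ASCII.
ALLOWED_CHARACTERS = set(string.ascii_letters + string.digits + ".-*?")


def is_allowed_symbol(text):
    """Return True if text is exactly one allowed symbol character."""
    return len(text) == 1 and text in ALLOWED_CHARACTERS


print(is_allowed_symbol("m"), is_allowed_symbol("#"))
```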

Letter Case

If all the letters you define as symbols are in a single case (all uppercase or all lowercase), the programs in the MEME Suite will ignore case when reading sequences. However, if you include both upper- and lowercase letters in your alphabet definition, then upper- and lowercase letters will be treated as distinct symbols.
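The case rule can be illustrated with a short Python sketch. The function names are hypothetical, and folding a sequence to the case used in the alphabet is only one possible way to "ignore case"; the MEME Suite's own handling may differ.

```python
def is_case_sensitive(symbols):
    """Return True if the defined symbols mix upper- and lowercase letters."""
    has_upper = any(symbol.isupper() for symbol in symbols)
    has_lower = any(symbol.islower() for symbol in symbols)
    return has_upper and has_lower


def normalise_sequence(sequence, symbols):
    """Fold a sequence to the alphabet's case when case is ignored."""
    if is_case_sensitive(symbols):
        # Upper- and lowercase letters are distinct symbols: leave as is.
        return sequence
    if any(symbol.isupper() for symbol in symbols):
        return sequence.upper()
    return sequence.lower()


print(normalise_sequence("acgt", "ACGT"))
print(normalise_sequence("ACmGT", "ACGTm"))
```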

Core Symbol Ordering

MEME Suite programs internally order core symbols so that the uppercase letters A-Z come first, followed by the lowercase letters a-z, then by the numbers 0-9 and finally by the symbols '*', '-' and '.' in that order. (Note that '?' is not included in this list because it is never allowed to be a core symbol.) This ordering is used to determine the order of the columns in motifs output by MEME Suite motif discovery programs. Note: The order in which you specify symbols within the core symbol section of your alphabet file does not matter.
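The ordering described above can be reproduced with a short Python sketch, which may help when you need to predict the column order of a motif. The names `CORE_SYMBOL_ORDER` and `order_core_symbols` are hypothetical.

```python
import string

# Uppercase letters, then lowercase letters, then digits, then '*', '-', '.'.
CORE_SYMBOL_ORDER = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "*-."
)


def order_core_symbols(symbols):
    """Sort core symbols into the order described in this section.

    Raises ValueError for characters that cannot be core symbols, such as '?'.
    """
    return sorted(symbols, key=CORE_SYMBOL_ORDER.index)


print(order_core_symbols("m.T1aC*-"))
```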

Wildcard Symbol

A wildcard is an ambiguous symbol that matches any core symbol. To define a wildcard symbol, list all the core symbols after the equals sign. Since many programs in the MEME Suite require that alphabets have a wildcard symbol in order for them to work correctly, if you do not define one the MEME Suite program will automatically define the symbol '?' to be the wildcard. It is strongly recommended that all custom alphabets you define include a wildcard symbol.

If you wish, you may manually define the wildcard as '?'.
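The wildcard rule can be sketched in Python as follows. The function `find_or_add_wildcard` is a hypothetical name, and representing the ambiguous symbols as a dictionary from symbol to its core symbols is an assumption made for the illustration.

```python
def find_or_add_wildcard(core_symbols, ambiguous):
    """Return (wildcard_symbol, ambiguous_symbols).

    core_symbols is a string of core symbols; ambiguous maps each ambiguous
    symbol to the string of core symbols it represents. If no ambiguous
    symbol matches every core symbol, '?' is added as the wildcard.
    """
    core_set = set(core_symbols)
    for symbol, members in ambiguous.items():
        if set(members) == core_set:
            # An existing symbol already matches all core symbols.
            return symbol, dict(ambiguous)
    updated = dict(ambiguous)
    updated["?"] = "".join(core_symbols)
    return "?", updated


print(find_or_add_wildcard("ACGT", {"N": "ACGT", "R": "AG"}))
print(find_or_add_wildcard("ACGT", {"R": "AG"}))
```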

Advanced Symbol Definition

Each symbol definition can have up to three fields as follows:

character "name" color

character

The symbol "character" is a single character chosen from the list of allowed symbol characters. The symbol character is required and is always the first field.

name

The symbol "name" is optional but it will make any outputs generated using this alphabet easier to understand by providing a reference on the meaning of the symbols used. If present, the symbol name must be the second field.

The "name" follows the rules of quoted text.

color

The symbol color is optional and represented by a 6 digit hexadecimal number with digits 1 & 2 defining the red component, digits 3 & 4 defining the green component and digits 5 & 6 defining the blue component.

Here are some example colors: CC0000, 008000, 0000CC, FFB300, FF00FF, FFCCCC, FFFF00 and 33E6CC.


If you do not specify the color of one or more symbols, the MEME Suite will choose evenly spaced colors for them using its own algorithm.
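The six-digit color format described above can be decoded into its red, green and blue components with a short Python sketch. The function name `parse_color` is hypothetical.

```python
import re


def parse_color(text):
    """Convert a 6-digit hexadecimal color into an (red, green, blue) tuple."""
    if re.fullmatch(r"[0-9A-Fa-f]{6}", text) is None:
        raise ValueError(f"not a 6-digit hexadecimal color: {text!r}")
    # Digits 1 & 2 are red, digits 3 & 4 are green, digits 5 & 6 are blue.
    return tuple(int(text[start:start + 2], 16) for start in (0, 2, 4))


print(parse_color("CC0000"))
print(parse_color("FFB300"))
```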

Quoted Text

Quoted text is used to describe things like the name of a symbol or the name of the alphabet and will be displayed in outputs. It makes the alphabet definition self-documenting, and while it is not required, it is highly recommended.

There are a few restrictions on quoted text:

Examples

1) Standard DNA alphabet

2) Standard RNA alphabet

3) Standard protein alphabet

4) DNA with covalent modifications alphabet