update-sequence-db

Download sequence databases.

Creates a SQLite database called fasta_db.sqlite and downloads sequences from multiple sources while storing information about the sequences in the database.

The program starts in status display mode, in which it gives regular updates on what it is doing. Press Enter to switch to command mode. In command mode the two basic commands are "help", which shows the available commands, and "status", which switches back to status mode. While sequences are downloading you may use the command "exit" to stop any further downloading.

Sequence Database Directory

The folder in which to store downloaded database files. The MEME Suite expects to find sequence databases in a folder called fasta_databases, located either inside the folder MEME Install Folder/db or inside the folder passed to the configure script with --with-db DB Install Folder. Depending on how you configured the MEME Suite, specify either MEME Install Folder/db/fasta_databases or DB Install Folder/fasta_databases.

By default, all of the standard sequence databases supported by the MEME Suite will be updated. Specifying one or more specific types of databases overrides this default, and then only the specified types of sequence database will be updated. You can also specify individual types of database to omit using the --no_X options, where X is one of the allowed database types (see the section "Select Databases to Update", below).
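The selection rule described above can be sketched in a few lines of Python. This is an illustration only, not the updater's own code: the function name `select_types` is hypothetical, and the behaviour when a type is both requested and excluded is an assumption, since the text does not specify it.

```python
# Hypothetical sketch of the selection rule; not the updater's actual code.
ALL_TYPES = ("ensembl", "genbank", "ucsc", "rsat", "epd", "misc")


def select_types(include=(), exclude=()):
    """Return the database types to update, in the standard order."""
    unknown = (set(include) | set(exclude)) - set(ALL_TYPES)
    if unknown:
        raise ValueError(f"unknown database type(s): {sorted(unknown)}")
    # Naming any type explicitly overrides the update-everything default.
    chosen = set(include) if include else set(ALL_TYPES)
    # Assumption: --no_X always removes X, even if X was also requested.
    chosen -= set(exclude)
    return [t for t in ALL_TYPES if t in chosen]


print(select_types())                          # no options: every type
print(select_types(include=["ucsc", "epd"]))   # like --ucsc --epd
print(select_types(exclude=["genbank"]))       # like --no_genbank
```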

The program creates a folder called downloads and a folder called logs. It also creates a SQLite database called fasta_db.sqlite. Every sequence database is downloaded into the folder downloads and stays there until the download is complete. It is then decompressed, or merged from multiple sources, as required and written to a sequence file with a .faa extension for protein sequences or a .fna extension for DNA sequences. Once the sequence file has been expanded it is processed by fasta-get-markov to calculate a 1st-order background model, which is stored in a file with the extension .bfile. fasta-get-markov also reports the number of sequences and their shortest, longest and average lengths, and all of this information is stored in the SQLite database.
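To make the stored statistics concrete, here is a minimal Python sketch that computes the same kind of summary from FASTA text. It is not how fasta-get-markov is implemented; the function name `fasta_stats` is hypothetical, and the dictionary keys simply borrow the column names used in the database schema.

```python
def fasta_stats(lines):
    """Summarise FASTA text: sequence count, total, min, max, average length."""
    lengths = []
    current = None  # length of the sequence being read; None before any header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header closes the previous sequence and opens a new one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    if not lengths:
        raise ValueError("no sequences found")
    total = sum(lengths)
    return {
        "sequenceCount": len(lengths),
        "totalLen": total,
        "minLen": min(lengths),
        "maxLen": max(lengths),
        "avgLen": total / len(lengths),
    }


# Two made-up sequences of length 8 and 3.
example = """>seq1
ACGTAC
GT
>seq2
ACG
""".splitlines()
stats = fasta_stats(example)
print(stats)
```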

Configuration files that tweak the behaviours of the sequence database downloaders will be automatically generated in the conf/ subdirectory within the specified sequence database directory.

Additionally the miscellaneous source downloader will check the conf/ subdirectory for any files ending with the extension .csv which it reads to determine sequence sources. The MEME Suite includes two files db_general.csv and db_other_genomes.csv in the distribution's etc folder which may be moved into the conf folder, though this is not done automatically during install.

Option	Parameter	Description	Default Behaviour
Help
--help		Display a help message and exit.	Run like normal.
Select Databases to Update
--[no_]ensembl		[Do not] update genomes from Ensembl.	Update all sequence databases.
--[no_]genbank		[Do not] update genomes from GenBank.	Update all sequence databases.
--[no_]ucsc		[Do not] update the genomes from UCSC.	Update all sequence databases.
--[no_]rsat		[Do not] update the upstream sequence databases from RSAT.	Update all sequence databases.
--[no_]epd		[Do not] update the Eukaryotic Promoter Database.	Update all sequence databases.
--[no_]misc		[Do not] update the miscellaneous sequence databases specified in `.csv` files in the database subdirectory `conf/`. There are two example `.csv` files in the MEME Suite `etc/` directory.	Update all sequence databases.
--updater	classname	Experimental. Specify the classname of a custom updater.
File Cleanup
--obsolete	file pattern	Mark any sequence databases that match the given glob syntax file pattern as obsolete causing them to be hidden from the interface. This option may be repeated to specify multiple patterns. After the files are obsoleted the updater exits.	Run as normal.
--delete_old		Sequence databases marked as obsolete (on a previous update) will be deleted.	Sequence databases marked as obsolete will be left untouched.
--retain_missing		Database entries for missing files are retained.	Database entries for missing files are removed.
Backwards compatibility
--csv:directory		Create a csv file and index file that lists all the databases to enable backwards compatibility with older releases. The directory to create the csv and index file can be specified if desired but if it is not specified then the csv and index file will be placed in the sequence database directory.	Don't create a csv or index file.
Miscellaneous
--bin	directory	Specify the location to find the fasta-get-markov tool.	The program will search the configured bin directory and if fasta-get-markov is not present it will search the path.
--log	log file	Specify the file to write logs.	A log will be written to the `logs` directory below the sequence database directory.
-v	log level	Specify the logging level [1-8].	A default logging level of 3 is used which outputs errors, warnings and summary information.
--priors	tsv file	Specify a tab separated values file listing all the priors that should be listed in the database. The updater will exit after changing the priors. Note that pre-existing priors will be removed!	Run as normal.

MCAST and FIMO support priors for sequence databases, but adding them is still a manual process. The process will probably be automated in future; until then, add priors as follows.

Create priors from prior sources such as DNase I hypersensitivity sequence tag counts using create-priors. This will create two files, priors.wig and priors.dist, which you should rename in a way that makes sense.
Run gzip on each of the files. This should leave them with the extension ".gz". It is important that you keep this extension so that the webservice script can decompress them later.
Move the gzip-ed ".wig" and ".dist" files into the sequence database directory. They may be at the top level or nested within folders, but they must be accessible by a relative path without any ".." elements.

Create a file listing all the priors one per line with the fields separated by tabs.

Field	Description
Sequence File	The path to the sequence file relative to the sequence database directory.
Wig File	The path to the gzip-ed ".wig" file relative to the sequence database directory.
Dist File	The path to the gzip-ed ".dist" file relative to the sequence database directory.
Biosample	A short descriptive name for the sample used in the experiment that the priors were derived from.
Assay	A short descriptive name for the experiment that the priors were derived from.
Source	A short descriptive name of the lab or group that performed the experiment that the priors were derived from.
URL	A URL linking to further information on the experiment.
Description	A description of the experiment which may contain HTML.
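As an illustration of the file format just described, the following Python sketch builds one tab-separated line and enforces the rule that paths must be relative and free of ".." elements. It assumes the columns appear in the order of the field table above, which the text implies but does not state outright. The function names and all example values (paths, sample name, URL) are hypothetical.

```python
from pathlib import PurePosixPath


def check_relative(path):
    """Reject absolute paths and paths containing '..' elements."""
    parts = PurePosixPath(path).parts
    if PurePosixPath(path).is_absolute() or ".." in parts:
        raise ValueError(f"path must be relative without '..': {path}")
    return path


def priors_line(seq, wig, dist, biosample, assay, source, url, description):
    """Build one line of the priors file (assumed column order: as tabulated)."""
    fields = [check_relative(seq), check_relative(wig), check_relative(dist),
              biosample, assay, source, url, description]
    for field in fields:
        # A tab or newline inside a field would corrupt the row.
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields)


# Hypothetical example values.
line = priors_line(
    "example/genome.fna",
    "priors/sample.wig.gz",
    "priors/sample.dist.gz",
    "Example sample",
    "Example assay",
    "Example Lab",
    "http://example.org/experiment",
    "An <b>example</b> description.",
)
print(line)
```

Writing one such line per prior to a file gives the TSV that is passed to the --priors option.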

Finally run
`update-sequence-db --priors <priors tsv> <sequence database directory>`
which will replace the existing priors in the database with those listed in the TSV file. Check the log file generated by update-sequence-db to ensure that all the priors were added without error.

As well as downloading the sequence files from many sources, the updater tracks the files using a SQLite database. The schema of the database is given below.

tblCategory

Column	Type	Constraint	Description
id	INTEGER	PRIMARY KEY	An auto-generated unique identifier for the category. Other tables reference this field.
name	TEXT	UNIQUE NOT NULL	The unique name of the category as shown to users.

tblListing

Column	Type	Constraint	Description
id	INTEGER	PRIMARY KEY	An auto-generated unique identifier for the listing. Other tables reference this field.
categoryId	INTEGER	NOT NULL REFERENCES tblCategory (id)	The identifier of the category that contains this listing.
name	TEXT	NOT NULL	The name of the listing shown to users.
description	TEXT	NOT NULL	The description of the listing shown to users.

The combination of the fields categoryId and name is unique.
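The two tables above can be tried out with Python's built-in sqlite3 module. The CREATE TABLE statements below are reconstructed from the column tables and the uniqueness note. They are an assumption about the DDL, not a copy of what the updater executes, and the inserted names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce REFERENCES by default
conn.executescript("""
CREATE TABLE tblCategory (
  id INTEGER PRIMARY KEY,
  name TEXT UNIQUE NOT NULL
);
CREATE TABLE tblListing (
  id INTEGER PRIMARY KEY,
  categoryId INTEGER NOT NULL REFERENCES tblCategory (id),
  name TEXT NOT NULL,
  description TEXT NOT NULL,
  UNIQUE (categoryId, name)
);
""")

# Hypothetical example data.
cat_id = conn.execute(
    "INSERT INTO tblCategory (name) VALUES (?)", ("Example Category",)
).lastrowid
insert_listing = (
    "INSERT INTO tblListing (categoryId, name, description) VALUES (?, ?, ?)"
)
conn.execute(insert_listing, (cat_id, "Example Listing", "first entry"))

# A second listing with the same category and name violates the UNIQUE rule.
try:
    conn.execute(insert_listing, (cat_id, "Example Listing", "duplicate"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```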

tblSequenceFile

Column	Type	Constraint	Description
id	INTEGER	PRIMARY KEY	An auto-generated unique identifier for the sequence file.
retriever	INTEGER	NOT NULL	An identifier for the code module that downloaded this sequence. It allows the individual code modules to ensure they don't change the records of files downloaded by other modules.
listingId	INTEGER	NOT NULL REFERENCES tblListing (id)	The identifier of the listing that contains this sequence file.
alphabet	INTEGER	NOT NULL CHECK (alphabet IN (1, 2, 4))	Represents the alphabet as powers of 2 so they can be combined into a bitset. RNA = 1, DNA = 2, Protein = 4.
edition	INTEGER	NOT NULL	A machine readable version. This field is used for sorting. Larger numbers are considered newer.
version	TEXT	NOT NULL	A human readable version which is displayed to the user.
description	TEXT	NOT NULL	The description of the sequence file, often containing information about the source.
fileSeq	TEXT	UNIQUE NOT NULL	The relative path to the sequence file.
fileBg	TEXT	UNIQUE NOT NULL	The relative path to the background file.
sequenceCount	INTEGER	NOT NULL	The number of sequences.
totalLen	INTEGER	NOT NULL	The total end-to-end combined length of the sequences.
minLen	INTEGER	NOT NULL	The length of the shortest sequence.
maxLen	INTEGER	NOT NULL	The length of the longest sequence.
avgLen	REAL	NOT NULL	The average length of the sequences.
stdDLen	REAL	NOT NULL	Currently unused! Intended to store the standard deviation of the sequence lengths.
obsolete	INTEGER	DEFAULT 0	Used to flag sequences as obsolete. Sequences flagged as obsolete are hidden from the interface.

The combination of the fields listingId, alphabet and edition is unique.
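Because the alphabet values are powers of 2, several alphabets can be combined into one mask and tested with a bitwise AND. The sketch below shows this against a cut-down version of tblSequenceFile that keeps only the columns needed for the example. The query, the constant names and the file names are illustrative assumptions rather than anything taken from the MEME Suite code.

```python
import sqlite3

RNA, DNA, PROTEIN = 1, 2, 4  # values from the alphabet column description

conn = sqlite3.connect(":memory:")
# Cut-down table: only a subset of the documented columns.
conn.execute("""
CREATE TABLE tblSequenceFile (
  id INTEGER PRIMARY KEY,
  listingId INTEGER NOT NULL,
  alphabet INTEGER NOT NULL CHECK (alphabet IN (1, 2, 4)),
  edition INTEGER NOT NULL,
  fileSeq TEXT UNIQUE NOT NULL,
  obsolete INTEGER DEFAULT 0,
  UNIQUE (listingId, alphabet, edition)
)""")

# Hypothetical rows: (listingId, alphabet, edition, fileSeq, obsolete).
rows = [
    (1, DNA, 1, "a_v1.fna", 1),
    (1, DNA, 2, "a_v2.fna", 0),
    (1, PROTEIN, 2, "a_v2.faa", 0),
]
conn.executemany(
    "INSERT INTO tblSequenceFile (listingId, alphabet, edition, fileSeq, obsolete)"
    " VALUES (?, ?, ?, ?, ?)",
    rows,
)

# Select visible files whose alphabet is DNA or RNA, newest edition first.
mask = DNA | RNA
found = [
    r[0]
    for r in conn.execute(
        "SELECT fileSeq FROM tblSequenceFile"
        " WHERE (alphabet & ?) != 0 AND obsolete = 0"
        " ORDER BY edition DESC",
        (mask,),
    )
]
print(found)
```

The obsolete DNA file and the protein file are both filtered out, so only the current DNA file is returned.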

tblPriorFile

Column	Type	Constraint	Description
id	INTEGER	PRIMARY KEY	An auto-generated unique identifier for the prior.
sequenceId	INTEGER	NOT NULL REFERENCES tblSequenceFile (id)	The identifier of the sequence that is associated with this prior.
filePrior	TEXT	UNIQUE NOT NULL	The relative path to the wig file (which may be gzipped).
fileDist	TEXT	UNIQUE NOT NULL	The relative path to the dist file (which may be gzipped).
biosample	TEXT	NOT NULL	A short descriptive name for the sample used in the experiment that the priors were derived from.
assay	TEXT	NOT NULL	A short descriptive name for the experiment that the priors were derived from.
source	TEXT	NOT NULL	A short descriptive name of the lab or group that performed the experiment that the priors were derived from.
url	TEXT	NOT NULL	A URL linking to further information on the experiment.
description	TEXT	NOT NULL	A description of the experiment which may contain HTML.
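The --priors option is described as replacing all existing priors with those listed in the TSV file. The Python sketch below shows one way such a replacement could be carried out against this schema, using a single transaction so that a bad line leaves the old priors intact. It is a guess at the behaviour, not the updater's implementation: the function `replace_priors`, the cut-down tblSequenceFile table and all example values are hypothetical, and the TSV column order is assumed to match the field table in the priors instructions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblSequenceFile (id INTEGER PRIMARY KEY, fileSeq TEXT UNIQUE NOT NULL);
CREATE TABLE tblPriorFile (
  id INTEGER PRIMARY KEY,
  sequenceId INTEGER NOT NULL REFERENCES tblSequenceFile (id),
  filePrior TEXT UNIQUE NOT NULL,
  fileDist TEXT UNIQUE NOT NULL,
  biosample TEXT NOT NULL, assay TEXT NOT NULL, source TEXT NOT NULL,
  url TEXT NOT NULL, description TEXT NOT NULL
);
INSERT INTO tblSequenceFile (fileSeq) VALUES ('example/genome.fna');
INSERT INTO tblPriorFile
  (sequenceId, filePrior, fileDist, biosample, assay, source, url, description)
  VALUES (1, 'old.wig.gz', 'old.dist.gz', 'old', 'old', 'old', 'http://example.org', 'old');
""")


def replace_priors(conn, tsv_lines):
    """Delete every existing prior, then insert one row per TSV line."""
    with conn:  # one transaction: commit on success, roll back on any error
        conn.execute("DELETE FROM tblPriorFile")
        for line in tsv_lines:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 8:
                raise ValueError(f"expected 8 fields, got {len(fields)}")
            # The first field names the sequence file the prior belongs to.
            row = conn.execute(
                "SELECT id FROM tblSequenceFile WHERE fileSeq = ?", (fields[0],)
            ).fetchone()
            if row is None:
                raise LookupError(f"unknown sequence file: {fields[0]}")
            conn.execute(
                "INSERT INTO tblPriorFile (sequenceId, filePrior, fileDist,"
                " biosample, assay, source, url, description)"
                " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                (row[0], *fields[1:]),
            )


replace_priors(conn, [
    "example/genome.fna\tnew.wig.gz\tnew.dist.gz\tExample sample\t"
    "Example assay\tExample Lab\thttp://example.org\tExample description",
])
priors = [r[0] for r in conn.execute("SELECT filePrior FROM tblPriorFile")]
print(priors)
```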
