The Matrisome Project: Annotated Workflow

Commented workflow for deriving the in silico matrisome

From Naba et al., 2011 (Supplemental Information, Extended experimental protocols)
Comments

Domain-based definition of the in silico matrisome

First, we defined lists of InterPro domains commonly found in i) known ECM glycoproteins and proteoglycans (55 domains, Figure 3), ii) ECM-affiliated proteins (6 domains; namely, syndecan, glypican, semaphorin or Sema, plexin/semaphorin/integrins or PSI, galectin carbohydrate recognition and annexin domains), iii) ECM regulators, including ECM-remodeling enzymes and their regulators (25 domains) and iv) secreted factors (39 domains), including growth factors, cytokines, etc. These defining domain lists were compiled independently based on previous knowledge, data from the literature and iterative query of UniProt to ensure efficient capture of known candidate proteins. 
Second, we defined lists of “excluding domains” whose presence excludes a protein from i) ECM glycoproteins and proteoglycans (20 domains), ii) ECM regulators (12 domains), or iii) secreted factors (17 domains). 
Because of the specificity of the defining domains designed to identify ECM-affiliated proteins, no “excluding domains” were necessary for that category. These “inclusion” and “exclusion” domain lists were refined by iterative cycles to optimize efficiency of both capture and exclusion. It is worth noting that the presence of certain domains disqualifies a protein from being part of the matrisome, for example the tyrosine-protein kinase, catalytic domain (IPR020635) or the serine/threonine/tyrosine-protein kinase domain (IPR001245). On the other hand, some domains are excluding domains for one category and defining domains for another. This is the case for the ADAM-TS Spacer 1 domain, which, if present in a protein, excludes it from the core matrisome but serves as a defining domain for ECM regulators.
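The inclusion/exclusion logic described above can be sketched as a small set-based filter. This is only an illustration: the category keys and the labels DOMAIN_A and ADAM_TS_SPACER_1 are hypothetical placeholders, not the published domain lists, and placing the two kinase domains in both exclusion sets is an assumption made for the example.

```python
from typing import Dict, Set

# Placeholder lists. Only IPR020635 and IPR001245 are identifiers taken from
# the text; DOMAIN_A and ADAM_TS_SPACER_1 are hypothetical labels.
INCLUDE: Dict[str, Set[str]] = {
    "core_glycoproteins_proteoglycans": {"DOMAIN_A"},
    "ecm_regulators": {"ADAM_TS_SPACER_1"},
}
EXCLUDE: Dict[str, Set[str]] = {
    "core_glycoproteins_proteoglycans": {"ADAM_TS_SPACER_1", "IPR020635", "IPR001245"},
    "ecm_regulators": {"IPR020635", "IPR001245"},
}


def candidate_categories(domains: Set[str]) -> Set[str]:
    """Return categories with at least one defining and no excluding domain."""
    hits = set()
    for category, defining in INCLUDE.items():
        has_defining = bool(domains & defining)
        has_excluding = bool(domains & EXCLUDE.get(category, set()))
        if has_defining and not has_excluding:
            hits.add(category)
    return hits
```

In this sketch, a protein carrying both DOMAIN_A and ADAM_TS_SPACER_1 is rejected from the core category but retained as an ECM regulator candidate, mirroring the dual role of the ADAM-TS Spacer 1 domain.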

Bioinformatic procedures for deriving the in silico matrisome

It is important to keep in mind that our workflow relies on the availability of high-quality gene models for the organism at hand, because only then can fragmentary sequences be reliably associated with the genes they are derived from. Consequently, we anticipate that modified procedures will become necessary to establish the matrisomes of organisms for which the state of genome-wide gene identification and annotation is not as advanced as for human and mouse. This is the case even for some established model organisms, e.g. the zebrafish or the clawed frog.
The downloadable InterPro index file protein2ipr.dat (ftp://ftp.ebi.ac.uk/pub/databases/interpro/, downloaded February 4, 2010), linking UniProt protein entries to InterPro domain information, was searched independently for the presence of each set of “inclusion domains”. This was done in parallel for both human and murine protein databases [1]. The resulting lists were highly redundant from a gene perspective, since the UniProt protein database comprises sequences of both intact protein isoforms and fragments. Thus, the collection of UniProt accessions was made gene-centric. UniProt comprises both manually curated entries and automatically translated ones (formerly SwissProt and TrEMBL, respectively). We decided against the option of exclusively considering curated UniProt entries, because important protein isoforms might be missed that way, negatively impacting sensitivity.
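A minimal sketch of the domain search over protein2ipr.dat follows. It assumes a tab-separated layout in which the first column is the UniProt accession and the second the InterPro identifier; that layout should be checked against the release actually downloaded. The function name is ours, and restriction to human or mouse entries is not handled here.

```python
from typing import Dict, Iterable, Set


def accessions_with_domains(lines: Iterable[str], wanted: Set[str]) -> Dict[str, Set[str]]:
    """Map each UniProt accession to the wanted InterPro IDs found on it.

    Assumes tab-separated lines: accession, InterPro ID, then further columns.
    """
    hits: Dict[str, Set[str]] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        accession, interpro_id = fields[0], fields[1]
        if interpro_id in wanted:
            hits.setdefault(accession, set()).add(interpro_id)
    return hits
```

Because the function accepts any iterable of lines, it can be fed an open file handle, so the large index file is streamed rather than loaded into memory.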
Because direct cross-referencing between the protein database UniProt and the gene database Entrez Gene [2] is incomplete, we chose a strategy that relied on GenPept [3] and Ensembl [4] as intermediary protein databases with the best cross-index coverage in both UniProt and Entrez Gene. This step - an additional "detour" via GenPept and Ensembl - adds significant complexity. As the quality of UniProt cross-referencing has been improving steadily, we recommend assessing if the consideration of these additional databases is warranted for a given organism of interest.
Applying Perl scripts to manipulate and “join” (sensu databases) cross-reference flat files gene_info, gene2accession.txt, gene2ensembl (ftp://ftp.ncbi.nih.gov/gene/DATA/, downloaded March 29, 2010) and idmapping_selected.tab (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/, downloaded April 7, 2010), we derived index files linking GenPept and Ensembl gene accessions to both UniProt accessions and Entrez Gene IDs. These Perl scripts are our own research-grade developments. Alternatives achieving similar functionality can for example be found in Galaxy (free) or in Spotfire (commercial).
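The database-style join can be illustrated in Python rather than Perl. This is a sketch under stated assumptions, not the scripts used in the study: it presumes the relevant columns have already been extracted from the flat files as simple (UniProt accession, intermediary protein accession) and (intermediary protein accession, Entrez Gene ID) pairs, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def join_uniprot_to_gene(
    uniprot_to_protein: Iterable[Tuple[str, str]],
    protein_to_gene: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Join two cross-reference tables on the intermediary protein accession."""
    genes_by_protein: Dict[str, Set[str]] = defaultdict(set)
    for protein_acc, gene_id in protein_to_gene:
        genes_by_protein[protein_acc].add(gene_id)

    genes_by_uniprot: Dict[str, Set[str]] = defaultdict(set)
    for uniprot_acc, protein_acc in uniprot_to_protein:
        # Unmapped accessions are kept with an empty set so they stay visible.
        genes_by_uniprot[uniprot_acc] |= genes_by_protein.get(protein_acc, set())
    return dict(genes_by_uniprot)
```

In the resulting table, UniProt accessions that map to no gene ID or to more than one are the natural candidates for closer inspection.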

As a first step, to make working with these files easier, we recommend creating species-specific subsets of the index files using the NCBI taxonomy identifiers.
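Such a species-specific subset could be produced with a simple line filter, sketched below. The NCBI taxonomy identifiers 9606 (human) and 10090 (mouse) are standard; the position of the taxonomy column is an assumption that differs between files (first column in the NCBI gene files, a later column in idmapping_selected.tab) and should be verified against each file's documentation.

```python
from typing import FrozenSet, Iterable, Iterator

HUMAN_AND_MOUSE: FrozenSet[str] = frozenset({"9606", "10090"})


def filter_by_taxon(
    lines: Iterable[str],
    tax_ids: FrozenSet[str] = HUMAN_AND_MOUSE,
    tax_column: int = 0,
) -> Iterator[str]:
    """Yield header lines and rows whose taxonomy column is in tax_ids."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep the header so the subset stays self-describing
            continue
        fields = line.split("\t")
        if len(fields) > tax_column and fields[tax_column].strip() in tax_ids:
            yield line
```

Writing the yielded lines to a new file once, up front, keeps every later join small.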
The resulting correspondence tables were manually curated.
Ambiguous cases, arising either from assignment of multiple GenPept accessions to one UniProt accession, from disagreements between GenPept- and Ensembl-based results or from missing gene names, were resolved by direct protein-to-genomic sequence comparison, using BLAT [5].
The need for manual curation steps remains an important bottleneck in this entire procedure. It is not easily overcome and works against a fully automated implementation of this workflow.
The resulting candidate gene lists were used to derive protein sequence data from RefSeq [6] and Ensembl (via the Entrez Gene gene2refseq and gene2ensembl index files) and UniProt (directly via the original accession numbers). UniProt accessions were then searched for the presence of “excluding domains” (via the data in the InterPro file protein2ipr.dat, see above). In the gene-centric representation of the matrisome lists, genes were demoted to non-member status if at least one of their UniProt members had an “excluding domain”.
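The gene-level demotion rule (a gene loses membership if at least one of its UniProt entries carries an excluding domain) can be sketched as follows. The function and argument names are hypothetical, and the inputs are assumed to be plain dictionaries built in earlier steps.

```python
from typing import Dict, Set, Tuple


def demote_genes(
    gene_to_uniprot: Dict[str, Set[str]],
    uniprot_domains: Dict[str, Set[str]],
    excluding: Set[str],
) -> Tuple[Set[str], Set[str]]:
    """Split genes into (kept, demoted) using the any-member exclusion rule."""
    kept: Set[str] = set()
    demoted: Set[str] = set()
    for gene, accessions in gene_to_uniprot.items():
        # One offending UniProt entry is enough to demote the whole gene.
        if any(uniprot_domains.get(acc, set()) & excluding for acc in accessions):
            demoted.add(gene)
        else:
            kept.add(gene)
    return kept, demoted
```

The rule is deliberately strict: a single fragment or isoform with an excluding domain removes the gene, which is one reason the results still benefit from manual review.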
In addition, protein sequences were subjected to two transmembrane prediction programs, TMHMM 2.0 (http://www.cbs.dtu.dk/services/TMHMM/) and Phobius (http://phobius.sbc.su.se/), the latter also predicting the presence of signal peptides within the protein sequence [7, 8]. Results obtained helped guide decisions on non-obvious candidate genes. With the exception of a few known transmembrane collagens, we considered the presence of a transmembrane domain as incompatible with the definition of core matrisome protein.
Finally, genes were assigned to one of two divisions, core matrisome or matrisome-associated, and within these divisions to a category: ECM glycoproteins, collagens or proteoglycans within the core matrisome division, or ECM-affiliated proteins, ECM regulators or secreted factors within the matrisome-associated division.
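The two-level classification can be captured in a small lookup table, sketched here. The division and category names follow the text; the function name and the gene symbol in the usage note are illustrative only.

```python
from typing import Dict, Tuple

# Category -> division, as described in the text.
DIVISION_BY_CATEGORY: Dict[str, str] = {
    "ECM glycoproteins": "core matrisome",
    "collagens": "core matrisome",
    "proteoglycans": "core matrisome",
    "ECM-affiliated proteins": "matrisome-associated",
    "ECM regulators": "matrisome-associated",
    "secreted factors": "matrisome-associated",
}


def annotate(gene: str, category: str) -> Tuple[str, str, str]:
    """Return (gene, division, category); raises KeyError for unknown categories."""
    return gene, DIVISION_BY_CATEGORY[category], category
```

For example, annotate("GENE1", "collagens") yields a record placed in the core matrisome division, while an unrecognized category fails loudly instead of being silently misfiled.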
A schematic representation of the complete bioinformatic pipeline is presented below.

Workflow for deriving the in silico matrisome (Naba et al., 2011):

[Figure: workflow graphics]


REFERENCES

  1. The Universal Protein Resource (UniProt). (2009) Nucleic Acids Res. 37, D169-174
  2. Maglott D., Ostell J., Pruitt K.D. and Tatusova T. (2011) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 37, D26-31
  3. Benson D.A., Karsch-Mizrachi I., Lipman D.J., Ostell J. and Sayers E.W. (2009) GenBank. Nucleic Acids Res. 37, D26-31
  4. Flicek P., Aken B.L., Ballester B., Beal K., Bragin E., Brent S., Chen Y., Clapham P., Coates G., Fairley S., et al. (2010) Ensembl’s 10th year. Nucleic Acids Res. 38, D557-562
  5. Kent W.J. (2002) BLAT--the BLAST-like alignment tool. Genome Res. 12, 656-664
  6. Pruitt K.D., Tatusova T., Klimke W. and Maglott D.R. (2009) NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res. 37, D32-36
  7. Käll L., Krogh A. and Sonnhammer E.L.L. (2004) A combined transmembrane topology and signal peptide prediction method. J. Mol. Biol. 338, 1027-1036
  8. Krogh A., Larsson B., von Heijne G. and Sonnhammer E.L. (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305, 567-580