Finding Structure and Sequence Information on the 'Net

Downloading structure files - directly from the PDB:

Protein and nucleic acid structures are archived at the Brookhaven Protein Data Bank, or PDB. Getting a file from the PDB has generally been a pain in the neck. However, several new tools for looking through the database have appeared on the network, and I suggest that you try to use these. They are all available over the internet and can be accessed using Netscape, Mosaic, or another hypertext-based browser. Versions of these browsers are available for Macs, PCs, and many kinds of UNIX boxes, and both Netscape and Mosaic are available on Athena. Generally, Netscape has more advanced features.

On Athena, the easiest way onto the web is to use the DASH menu at the top of the screen. Choose "WorldWideWeb" from the "Communication" menu, and then choose "Mosaic - MIT home page", or "Netscape - MIT home page".

Once Mosaic is running, you will see the MIT home page or another home page. Home pages are starting points for internet exploration that may contain useful information or links to other sites. Highlighted entries can be accessed by clicking on them. Sometimes you will see highlighted pictures, or icons for sounds or movies. Most of the entries are highly cross-indexed to related entries, so finding your way around the network is fairly easy. Text, sounds, and graphics can be examined and downloaded for later use. If you find a place that you like, you can save it by adding it to your hotlist (in the Navigate Menu). You can always jump to a place in your hotlist, or you can type in its URL if you know it (in Open URL under the File Menu, or in the html box).

One way around the internet is to use an index or directory page and follow its links. Yahoo (www.yahoo.com) is the classic directory, but there are many others. Recently, several search engines have become available for finding sites based on their titles or content (www.altavista.com, www.excite.com, www.lycos.com). These are very useful. Your internet browser may have buttons linked to directories and search engines.

If you know the URL for a site you would like to visit, you can simply type its URL and go there directly.

The URL for the Brookhaven PDB is http://www.pdb.bnl.gov. You can save it on your hotlist by clicking "add current to hotlist" from the Navigate menu.

There are several ways to find your PDB file. A good starting point is "Searching and Browsing the PDB" using the text-based "PDB's WWW browser". The PDB site is pretty active, and the names of some of the links may change.

I searched for "compound name" trypsin and found a lot of entries. Some were for chymotrypsin, some for trypsinogen, some for trypsin inhibitors, but some seemed to be what I was looking for. I looked through all of the bona fide trypsin entries to find a good one. To examine an entry, "fetch" it to get the file header. The headers have useful information about the structure. If the header corresponds to a file that you wish to view, you may be able to view it directly with the Rasmol program, if Netscape is set up to do this. If not, save the coordinates to a file ("send entire file", "save as") and use an external viewer. We will use Quanta later, but for a quick view you can use Rasmol. In a UNIX window, type

  • add rasmol
  • rasmol

You can give the filename on the command line after rasmol. If you do not type the filename on the command line, you can enter it from one of the Rasmol menus. The mouse buttons move the molecule, and you can pick different representations from the menus. For later use with Quanta, you will need to download a copy of the coordinates to a file in your account.
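
The header records mentioned above are fixed-width text lines, so a short script can pull out the useful parts before you decide whether to view a file. The following is a minimal Python sketch, not a full parser. It assumes the classic PDB column layout (record name in columns 1-6, HEADER classification in columns 11-50, date in columns 51-59, ID code in columns 63-66, COMPND text in columns 11-70). The helper names and the sample records are invented for illustration and are not copied from a real entry.

```python
def make_record(name, text):
    """Build a fixed-width record: name in columns 1-6, text from column 11."""
    return name.ljust(10) + text


def read_header(lines):
    """Collect classification, date, ID code, and compound text from PDB lines."""
    info = {"classification": None, "date": None, "id": None, "compound": []}
    for line in lines:
        record = line[:6].strip()
        if record == "HEADER":
            info["classification"] = line[10:50].strip()  # columns 11-50
            info["date"] = line[50:59].strip()            # columns 51-59
            info["id"] = line[62:66].strip()              # columns 63-66
        elif record == "COMPND":
            info["compound"].append(line[10:70].strip())  # columns 11-70
    return info


# Invented sample records (hypothetical entry "0XXX"), built column by column.
sample = [
    make_record("HEADER", "HYDROLASE".ljust(40) + "01-JAN-90" + "   " + "0XXX"),
    make_record("COMPND", "EXAMPLE PROTEIN"),
    make_record("ATOM", "..."),
]

summary = read_header(sample)
print(summary)
```

To try it on a file you have saved, pass the open file instead of the sample list, for example read_header(open("myfile.pdb")), where myfile.pdb stands for whatever name you saved the coordinates under.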

    Downloading structure files - from other servers:

    The PDB server is useful only if you know the name of the macromolecule and can decipher the header. Many other sites have developed cross-indexes to the PDB, and have servers that help find protein structures. If you have trouble finding your protein using the search engine at the PDB gopher, you should try the more sophisticated searches that can be done using the utilities at the NIH Molecular Modeling Home Page, also known as http://molbio.info.nih.gov/cgi-bin/pdb. Here you can find several different indexes to the PDB (called "Organizational lists"), which should help you find a structure when you do not quite know what you are looking for. All of these use a viewing utility called MOLECULES ARE US to examine the coordinates. The viewing capabilities of MOLECULES ARE US are much better than those at the PDB: you can request different representations of the molecule (line drawing, ball-and-stick, space-filling, or ribbons), and you can rotate the molecule to get the view that you want. Read the instructions.

    NRL_3D (http://www.gdb.org/Dan/proteins/nrl3d.html) is a very handy sequence-structure database derived from the three-dimensional structures of proteins deposited with the Brookhaven National Laboratory's Protein Data Bank. It allows you to do complicated searches using boolean terms that relate characteristics of the structure. For example, a search for "kinase and helix" will select kinases that have a helix in them.
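
A boolean search of this kind amounts to set logic over the keywords attached to each entry. The following is a minimal Python sketch of the idea only. It does not describe how the server itself is implemented, and the entry labels and keywords are invented for illustration.

```python
# Hypothetical keyword index: labels and keywords are made up for this example.
index = {
    "entry_a": {"kinase", "helix", "sheet"},
    "entry_b": {"kinase", "sheet"},
    "entry_c": {"protease", "helix"},
}


def search_all(index, *terms):
    """'and' search: keep entries whose keywords include every term."""
    wanted = set(terms)
    return sorted(name for name, words in index.items() if wanted <= words)


def search_any(index, *terms):
    """'or' search: keep entries whose keywords include at least one term."""
    wanted = set(terms)
    return sorted(name for name, words in index.items() if wanted & words)


hits = search_all(index, "kinase", "helix")
print(hits)
```

Here "kinase and helix" keeps only the entry tagged with both words, while the "or" form would also return entries tagged with just one of them.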

    Another useful searcher is SCOP (Structural Classification Of Proteins) which classifies the PDB entries by "type of proteins", and by secondary and tertiary structural elements. You can find SCOP at http://www.bio.cam.ac.uk/scop/ or http://scop.mrc-lmb.cam.ac.uk/scop/.

    Once you have an entry, you can examine it or download it as described above for the PDB Gopher.

    Many of these sites change frequently, and you may want to consult general biochemistry sites for a current list of useful links. Everyone's favorite list-of-links is Pedro's Biomolecular Research Tools, at http://www.fmi.ch/biology/research_tools.html.

    Downloading structure files - last resort - anonymous ftp:

    If all else fails, you can also access the PDB directly by anonymous ftp instead of Mosaic. ftp (File Transfer Protocol) only deals with files, not images, but is a very common way to move information from one computer to another. From a UNIX shell, type
  • ftp pdb.pdb.bnl.gov
  • Name: anonymous
  • Password: (type your email address; I used stern@mit.edu)
  • ftp> ls (this gives a list of the directory)
  • ftp> cd index (move to the directory called index)
  • ftp> type ascii (set the type of file transfer to ascii (text), not binary)
  • ftp> get entries.idx (retrieve entries.idx - a list of all of the PDB entries)
  • You can look through this file on your local machine to find the PDB entry that you want. In a new UNIX window, try something like:
  • % more entries.idx
  • % grep TRYPSIN entries.idx | more
  • % grep TRYPSIN entries.idx | egrep -v 'CHYMO|OGEN|INHIB' | more
  • The first way uses a UNIX utility called more to look through the file entries.idx one page at a time. The second way uses a UNIX utility called grep to select those lines that contain TRYPSIN. The result of this is sent (piped) directly to more. The third way adds another filter called egrep. The -v means give us the lines that do not contain the target (this works with grep also), and the "|" inside the quoted pattern means "or" (outside the quotes, "|" is the pipe). Thus egrep gives us the lines that do not contain "CHYMO", "OGEN", or "INHIB". Both grep and egrep are case-sensitive, which is why the search terms are typed in capitals. This filters out all of the chymotrypsin, trypsinogen, and trypsin inhibitor entries.
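
The same two-stage filter (keep lines containing one term, then drop lines matching any of several unwanted terms) can be written in a few lines of Python if you would rather work in a script. This is a minimal sketch: the function name is made up, and the sample lines are invented stand-ins, not real lines from entries.idx.

```python
import re


def select_entries(lines, want="TRYPSIN", exclude=r"CHYMO|OGEN|INHIB"):
    """Keep lines containing `want` that do not match the `exclude` pattern.

    Mirrors: grep TRYPSIN ... | egrep -v 'CHYMO|OGEN|INHIB'
    The match is case-sensitive, like grep without the -i option.
    """
    unwanted = re.compile(exclude)
    return [line for line in lines if want in line and not unwanted.search(line)]


# Invented index lines for illustration only.
sample = [
    "AAAA  TRYPSIN",
    "BBBB  CHYMOTRYPSIN",
    "CCCC  TRYPSINOGEN",
    "DDDD  TRYPSIN INHIBITOR",
    "EEEE  LYSOZYME",
]

kept = select_entries(sample)
print(kept)
```

To run it on the real index, pass the open file instead of the sample list, for example select_entries(open("entries.idx")).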

    There is also a useful file that I found on the net called "An annotated guide to the PDB". This list has more useful information than the PDB's index, but it is not guaranteed to be up to date. You can find this file in your home directory.

    Eventually, you will find the entry that you want. Here, it was 1TPP. We still have to find where this file is located. Back to the ftp window:

  • ftp> cd .. (move up one directory)
  • ftp> ls (list the files and directories here)
  • ftp> cd all_entries (move to the all_entries directory)
  • ftp> ls (another list)
  • ftp> cd uncompressed_files (this one looks about right)
  • ftp> ls *tpp* (is our file here? The *'s are ftp wildcards that match anything. This will list any files with tpp in their name. You could list all of the files by ls as before, but there are a lot of them)
  • ftp> get pdb1tpp.ent (the file for entry 1TPP was in the list, so retrieve it)
  • ftp> quit (leave ftp)
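
The interactive session above can also be scripted. The following is a minimal sketch using Python's standard ftplib module. It assumes the host name and directory names shown in the session, which may have changed or may no longer answer, and it assumes that entry files are named "pdb" plus the lower-case ID plus ".ent", as in the listing above. The function names are made up for this example.

```python
from ftplib import FTP

HOST = "pdb.pdb.bnl.gov"                      # host from the session above (assumed still valid)
DIRECTORY = "all_entries/uncompressed_files"  # directories from the session above


def entry_filename(pdb_id):
    """Turn a PDB ID such as 1TPP into the assumed file name pdb1tpp.ent."""
    return "pdb" + pdb_id.lower() + ".ent"


def fetch_entry(pdb_id, email, host=HOST, directory=DIRECTORY):
    """Log in anonymously and retrieve one entry as text (ascii mode)."""
    name = entry_filename(pdb_id)
    with FTP(host) as ftp:
        ftp.login(user="anonymous", passwd=email)  # email address as the password
        ftp.cwd(directory)
        with open(name, "w") as out:
            # retrlines transfers in ascii mode and strips line endings
            ftp.retrlines("RETR " + name, lambda line: out.write(line + "\n"))
    return name


if __name__ == "__main__":
    # Only prints the file name; call fetch_entry yourself to contact the server.
    print(entry_filename("1TPP"))
```

Calling fetch_entry("1TPP", "you@your.site") would then do the login, cd, and get steps in one go, with your own email address in place of the placeholder.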

    Downloading sequence files - Genbank:

    Protein and nucleic acid sequences are archived at many places, including SWISSPROT and PIR for protein sequences and GENBANK and EMBL for DNA sequences. Note that the DNA databases generally include all of the information in protein databases - most protein sequences are actually determined and deposited at the DNA level. Traditionally, these databases had to be purchased and updated regularly. For example, one is maintained by Whittaker College at MIT.

    Recently, the GENBANK people have started an on-line DNA sequence server. You can run searches using their "Entrez browser" at http://www.ncbi.nlm.nih.gov/. This server links protein and DNA databases, and also Medline literature references (and abstracts!). From the Entrez browser, select "search the nucleotide database", enter your search term, and hit return. You can do subsearches combining several search results (with and, or, etc.), and you can limit your search to particular terms, for example to the organism human. If one or more matching records are found, you can retrieve all of them or just selected ones. You can view a selected report in one of several formats (for use with various DNA sequence manipulation programs) or follow links to the protein and Medline databases. You can save a file that you are reading using the "SAVE AS" option from the Mosaic FILE menu.

    Other on-line searches may also be available. If you find a good one let me know.

    An example DNA sequence record is attached. The sequence records, especially DNA sequences, are designed to be read by programs that can manipulate and display them in a form more conducive to human understanding. Such programs are available for all platforms, but none is running on Athena yet.

  • Sample DNA Sequence
  • Sample Protein Sequence