M.I.T. Theses Digital Library - history and development

lcs 3-Sep-98

Introduction

This is a record of the genesis of the experimental M.I.T. Theses Digital Library. The intent is to document how it came about and explain the logic behind early design decisions.

Note that there are actually two collections of M.I.T. theses in electronic form: this "retrospective" collection of scanned images, and a proposed future collection of theses submitted as digital objects. The rest of this text is about the former retrospective collection; the latter is only in the planning stages as of September 1998.

History and Motivation

The Theses digital library is a joint effort between the M.I.T. Libraries Document Services department and Information Systems' Team Athena.

It started as an online archive of digitized theses: When Document Services gets a request for a printed & bound copy of a thesis, they usually produce it by running the archived microform of the thesis through an automatic scanner and printing the digital images on a laser printer. Those images are now captured and preserved in an online archive and re-used if that thesis is requested again.

The digital library makes this information available to the public, in an easy-to-use form, over the World Wide Web. Users can view page images of Theses free of charge, at a quality that makes them quite legible on typical desktop workstation displays (100dpi). The intent is to encourage scholarly use of M.I.T. theses by all researchers with access to the Internet and an image-capable Web browser.

Some Technical Details

When a thesis is scanned from microform, it is stored as a set of image files in a directory named after the OCLC number of its library catalog entry. This number is used because it is short and has, for our purposes, a 1:1 mapping onto the set of theses, so if the same thesis is requested again the DS staff will get the same OCLC number and find the scanned images already archived.
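The lookup that this naming scheme makes possible can be sketched in a few lines. This is only an illustration: the function name find_archived_thesis and the layout of one directory per OCLC number under a single archive root are assumptions, not a description of the actual Document Services scripts.

```python
from pathlib import Path
from typing import Optional


def find_archived_thesis(archive_root: Path, oclc_number: str) -> Optional[Path]:
    """Return the image directory for an OCLC number, or None if not archived.

    Assumes (hypothetically) one directory per thesis, named by OCLC number,
    directly under archive_root.
    """
    number = oclc_number.strip()
    # Reject empty or non-numeric input so a typo never matches the root itself.
    if not number.isdigit():
        return None
    candidate = archive_root / number
    return candidate if candidate.is_dir() else None
```

Because the directory name is the only key, a second request for the same thesis resolves to the same path with no separate index to consult.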

The scanning process does not leave any metadata associated with the thesis -- i.e. the title, author, date, etc. -- when it is first stored, just the trace of the OCLC number implicit in the directory name. Since the digital library (currently, at least) needs metadata to index its collection, it must generate its own metadata at some point.

Eventually, we may be able to coordinate with the Libraries to use the library catalog (Barton) directly for searching and indexing the digitized theses, but this is currently not an option.

Implementation

To turn the online archive into a true digital library, we applied the Dienst software developed for the NCSTRL (Networked Computer Science Technical Research Library) project.

Dienst is a complex, full-featured, and fairly flexible digital library system, and more importantly, we already had a good deal of experience with it from developing MIT's NCSTRL node. One vestige of its CSTR-orientation is that Dienst requires metadata for each document in the form of an RFC 1807 "bib" record. Dienst requires that each document have at least an author, title, and date listed in the metadata, and it will also display (and search on) an abstract found there.

Since the theses did not come with any metadata, the "bib" file had to be synthesized for each one. We found the most accurate way to do this was to request a MARC record from Barton and extract the metadata values from it. The title, author, and date are obtained from the catalog entry. Any notes (MARC 500 fields) in the catalog record are preserved as NOTES fields in the bib record. Search keywords (MARC 730 fields) are also imported as bib KEYWORD fields.

Barton currently does not include abstracts for any theses, so the metadata files do not have abstracts. The Theses Dienst server was modified to match subject keywords by searching KEYWORD fields in the metadata instead of abstracts.
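The MARC-to-bib mapping described above can be sketched roughly as follows. This is not the actual docid_build script. It assumes the MARC record has already been reduced to simple (tag, text) pairs, and it assumes the usual MARC tags for title (245), author (100), and publication date (260), which this document does not specify. The function names and the record_id argument are hypothetical. The "FIELD:: value" layout and the BIB-VERSION, ID, ENTRY, and END fields follow RFC 1807.

```python
from typing import List, Optional, Tuple

# Simplified stand-in for a MARC record: (tag, text) pairs, not a real MARC parser.
MarcFields = List[Tuple[str, str]]


def first_field(fields: MarcFields, tag: str) -> Optional[str]:
    """Return the text of the first field with the given tag, if any."""
    for field_tag, text in fields:
        if field_tag == tag:
            return text
    return None


def marc_to_bib(fields: MarcFields, record_id: str, entry_date: str) -> str:
    """Build an RFC 1807-style bib record from simplified MARC fields."""
    # Assumed tags: 245 = title, 100 = author, 260 = publication date.
    title = first_field(fields, "245")
    author = first_field(fields, "100")
    date = first_field(fields, "260")
    if not (title and author and date):
        # Dienst needs at least author, title, and date.
        raise ValueError("catalog record lacks author, title, or date")

    lines = [
        "BIB-VERSION:: CS-TR-v2.1",
        f"ID:: {record_id}",
        f"ENTRY:: {entry_date}",
        f"TITLE:: {title}",
        f"AUTHOR:: {author}",
        f"DATE:: {date}",
    ]
    # Notes (500) and search keywords (730) are carried over one field each.
    lines += [f"NOTES:: {text}" for tag, text in fields if tag == "500"]
    lines += [f"KEYWORD:: {text}" for tag, text in fields if tag == "730"]
    lines.append(f"END:: {record_id}")
    return "\n".join(lines) + "\n"
```

No ABSTRACT field is emitted, which matches the situation described above: the catalog has no abstracts, so keyword search has to rely on the KEYWORD fields.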

Additional Metadata

We also add three kinds of metadata that did not have a place in the RFC 1807 record format (because of its CSTR-orientation). This required a local modification to the RFC 1807 format, which we judged permissible since these "bib" records will not be used directly to interoperate with other digital libraries (the only current networked theses project is much more loosely-coupled than the incestuous NCSTRL). The new metadata are stored in made-up extension fields which are, by field name:

X_DEPARTMENT: The M.I.T. department or laboratory which granted the degree. Used to automatically fill in the Document Services order form.
X_DEGREE: The abbreviation for the degree granted for the thesis, e.g. "Ph.D.", "Sc.D.", "M.S.". Used to fill in the Document Services order form.
X_SUPERVISOR: The name of the student's primary advisor or supervisor (usually in forename-first format, which is different from the author name). Used to fill in the Document Services order form. May also eventually be used as an index key in searching.
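As an illustration of how these extension fields might feed the order form, the sketch below parses a bib record and pulls out the three X_ values. It assumes the extension fields use the same "FIELD:: value" line syntax as the standard RFC 1807 fields. The function names and the order-form keys are hypothetical, and continuation lines are ignored for brevity.

```python
from typing import Dict, List


def parse_bib(text: str) -> Dict[str, List[str]]:
    """Parse 'FIELD:: value' lines into a dict of field name -> list of values."""
    fields: Dict[str, List[str]] = {}
    for line in text.splitlines():
        name, sep, value = line.partition("::")
        if not sep:
            continue  # sketch only: lines without '::' (continuations) are skipped
        fields.setdefault(name.strip(), []).append(value.strip())
    return fields


def order_form_defaults(fields: Dict[str, List[str]]) -> Dict[str, str]:
    """Pick the values used to pre-fill an order form (hypothetical key names)."""
    def first(name: str) -> str:
        values = fields.get(name, [])
        return values[0] if values else ""

    return {
        "department": first("X_DEPARTMENT"),
        "degree": first("X_DEGREE"),
        "supervisor": first("X_SUPERVISOR"),
    }
```

A record lacking one of the extension fields simply yields an empty string for that entry, so the form can still be filled in by hand.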

Current Retrospective Thesis Workflow

This section documents precisely how the data for a retrospectively-scanned thesis travels from its earliest inception to availability in the digital library. It is provided in detail to assist later maintainers of the system.

Life Cycle of a Retrospectively-scanned Thesis

A request for a reprint of a thesis arrives at Document Services.
The DS office staff look up the thesis (the order form includes entries for author, title, year, and department) in Barton to get the OCLC number of its entry. The number is added to the order for processing and is used to check whether the thesis is already in the online image archive.
For a thesis not already in the online image archive, the archived microform is obtained for scanning. It is run through an automatic scanner controlled by a PC, which writes the page images (via NFS) into a directory on a Sun workstation in Document Services. The directory is named with the OCLC number of the thesis. These images are also printed for the customer.
At this point, Document Services simply leaves the thesis images on the DS workstation, Intaglio. A daily daemon running on Intaglio looks for newly-scanned theses and uploads them to the I/S image archive server, Origen. The images are simply packaged into a ZIP file and uploaded to a drop-box via anonymous FTP, so the DS workstation does not need secure access to the server. See ds/ds-scripts/import-theses.pl in the cstrdev locker.
The daemon also checks that the OCLC number of each thesis really corresponds to a thesis entry. It looks up that number's record in Barton and checks for a 502 field, which MIT Libraries now uses exclusively on theses. This isn't a perfect check, but it should generate false negatives if anything, and is primarily intended to catch typos in the OCLC number or a catalog lookup which was not a thesis. Document Services staff are notified of failures by email.
On the image-archive server, a periodic daemon empties the drop box and stores the images in a document directory in the archive. See ds/cgi-bin/process-putimages.pl in the cstrdev locker. The actual images are moved to a subdirectory of the document directory (which is still indexed by OCLC number). Once this is done, the images can be retrieved by Document Services (using a private, secure, Web page interface) to print another copy.
Yet another daemon identifies newly-uploaded theses and adds them to the digital library. See dienst/theses/LibMgt/seek_new_theses in the cstrdev locker. It does several things for each new thesis:
1. It obtains metadata and puts them in an RFC1807-style "bib" record. It gets the MARC record from Barton using the OCLC number and extracts the metadata from that by invoking a specialized Perl script, dienst/theses/LibMgt/docid_build in the cstrdev locker.
2. This same script, docid_build, also creates a document-ID (docid) that uniquely identifies the thesis within the digital library. (We decided not to use OCLC numbers or other cataloging numbers for this purpose since some scanned theses may not be in any catalog, and the new electronically-submitted theses won't be cataloged when we get them.) The document-ID consists of the 4-digit year of publication of the thesis, followed by a dash and a unique serial number.
3. The script then runs a Dienst script to build other "formats" (e.g. sets of viewable images) such as the thumbnail views. See dienst/theses/LibMgt/db_build in the cstrdev locker.
4. Now there is a complete document for the digital library, but the script has to run another Dienst script to add the document to Dienst's index tables. Without that, it will not even display correctly. So it runs dienst/theses/Indexer/build-inverted-indexes.pl in the cstrdev locker.
5. Finally, when all documents have been processed, the script prods the running Dienst to reload its indexes, so the newly-added documents are found on searches. Dienst does not automatically reload its indexes from the files on disk.
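Two of the checks in this life cycle are simple enough to sketch: the test for a 502 field, and the construction of a document-ID from the year and a serial number. This is an illustration, not the logic of import-theses.pl or docid_build. The catalog record is again assumed to be reduced to (tag, text) pairs. How the real serial number is allocated is not stated here, so the sketch assumes the next unused number among existing docids for the same year, with no zero-padding.

```python
import re
from typing import Iterable, List, Tuple

# Simplified stand-in for a MARC record: (tag, text) pairs.
MarcFields = List[Tuple[str, str]]


def looks_like_thesis(fields: MarcFields) -> bool:
    """True if the record carries a 502 field, which marks a thesis."""
    return any(tag == "502" for tag, _ in fields)


def next_docid(year: int, existing: Iterable[str]) -> str:
    """Return '<year>-<serial>' using the next serial not yet used for that year.

    Assumption: serials count up per year, starting at 1.
    """
    pattern = re.compile(rf"^{year:04d}-(\d+)$")
    used = []
    for docid in existing:
        match = pattern.match(docid)
        if match:
            used.append(int(match.group(1)))
    return f"{year:04d}-{max(used, default=0) + 1}"
```

For example, with existing docids "1998-1" and "1998-7", a new 1998 thesis would get "1998-8" under these assumptions. A record that fails looks_like_thesis is the case that triggers the email to Document Services staff.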

Future Work and Goals

Link the Theses digital library more closely with Barton. The Dienst searching and indexing is not as strong as what is available through the OPAC. More significantly, our digital library only has a small subset of all MIT theses. It would be better to have the digital library search through and display all theses, even when the only option to "view" a document is to order a hardcopy.
This breaks down into:
1. Cataloging puts live links from relevant thesis catalog entries to documents available in the Theses digital library.
2. The Theses digital library defers searching functions to Barton, perhaps hidden through a Z39.50 interface or perhaps directly through the web interface to Barton.
There are a few hundred retrospectively-scanned theses that were not stored in OCLC-number directories. The OCLC numbers for these should be found and the directories renamed so they become part of the digital library.
The digital library needs to allow lookups or document references by OCLC number, so Document Services staff can use it to check for archived images.

Last modified on $Date: 1998/09/24 21:28:39 $ by $Author: lcs $