M.I.T. Theses Digital Library - history and development
lcs 3-Sep-98
Introduction
This is a record of the genesis of the experimental
M.I.T. Theses Digital Library.
The intent is to document how it came about and explain the
logic behind early design decisions.
Note that there are actually two collections of M.I.T. theses
in electronic form: this "retrospective" collection of scanned images,
and a proposed future collection of theses submitted as digital objects.
The rest of this text is about the former retrospective collection;
the latter is only in the planning stages as of September 1998.
History and Motivation
The Theses digital library is a joint effort between the
M.I.T. Libraries
Document Services department, and
Information Systems' Team Athena.
It started as an online archive of digitized theses: When Document
Services gets a request for a printed & bound copy of a thesis, they
usually produce it by running the archived microform of the thesis through
an automatic scanner and printing the digital images on a laser printer.
Those images are now captured and preserved in an online archive and
re-used if that thesis is requested again.
The digital library makes this information available to the public, in an
easy-to-use form, over the World Wide Web. Users can view page
images of theses free of charge, at a quality that makes them quite legible on
typical desktop workstation displays (100dpi). The intent is to encourage
scholarly use of M.I.T. theses by all researchers with access to the
Internet and an image-capable Web browser.
Some Technical Details
When a thesis is scanned from microform, it is stored as a set
of image files in a directory named after the OCLC number of its
library catalog entry. This number is used because it is short and has,
for our purposes, a 1:1 mapping onto the set of theses, so if the same
thesis is requested again the DS staff will get the same OCLC number and
find the scanned images already archived.
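The lookup that this naming scheme makes possible can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the idea of a single archive root directory, and the treatment of an OCLC number as a plain string of digits are all hypothetical, not details of the actual archive.

```python
from pathlib import Path
from typing import Optional


def find_archived_thesis(archive_root: Path, oclc_number: str) -> Optional[Path]:
    """Return the directory holding a thesis's page images, or None.

    Assumption (hypothetical layout): each thesis lives in a directory
    whose name is exactly its OCLC number, directly under archive_root.
    """
    oclc_number = oclc_number.strip()
    # Reject anything that is not purely digits, so a typo or a stray
    # path fragment can never be mistaken for a directory name.
    if not oclc_number.isdigit():
        raise ValueError(f"not an OCLC number: {oclc_number!r}")
    candidate = archive_root / oclc_number
    return candidate if candidate.is_dir() else None
```

Because the directory name is the only key, a repeat request needs nothing more than the OCLC number from the catalog to discover that the images are already on hand.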
The scanning process does not leave any metadata associated with
the thesis -- i.e. the title, author, date, etc. -- when it is
first stored, just the trace of the OCLC number implicit in the directory name.
Since the digital library (currently, at least) needs metadata to index its
collection, it must generate its own metadata at some point.
Eventually, we may be able to coordinate with the Libraries to
use the library catalog (Barton) directly for
searching and indexing the digitized theses, but this is currently
not an option.
Implementation
To turn the online archive into a true digital library, we applied
the Dienst software developed for the
NCSTRL (Networked Computer Science Technical Research Library) project.
Dienst is a complex, full-featured, and fairly flexible digital library
system, and more importantly, we already had a good deal of experience
with it from developing MIT's NCSTRL
node.
One vestige of its CSTR-orientation is that Dienst requires metadata for
each document in the form of an
RFC 1807 "bib" record. Dienst requires that each
document have at least an author, title, and date
listed in the metadata,
and it will also display (and search on) an abstract found there.
Since the theses did not come with any metadata, the "bib" file
had to be synthesized for each one. We found the most accurate way
to do this was to request a MARC record from Barton
and extract the metadata values from it. The title, author, and date are
obtained from the catalog entry. Any notes (MARC 500 fields) in the
catalog record are preserved
as NOTES fields in the bib record. Search keywords (MARC 730 fields)
are also imported as bib KEYWORD fields.
Barton currently does not include abstracts for any theses, so
the metadata files do not have abstracts. The Theses Dienst server
was modified to match subject keywords by
searching KEYWORD fields in the metadata instead of abstracts.
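As a rough illustration of the synthesis step described in this section, the Python sketch below renders a "bib" record from values that have already been pulled out of a catalog record. It is not the project's actual code (which is a Perl script); the CatalogEntry type, the function name, and the record_id and entry_date parameters are hypothetical. The "FIELD:: value" line syntax and the BIB-VERSION, ID, ENTRY, and END fields follow our reading of RFC 1807 and should be checked against the RFC itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    """Values already extracted from a catalog record (hypothetical shape)."""
    title: str
    author: str
    date: str
    notes: List[str] = field(default_factory=list)     # from MARC 500 fields
    keywords: List[str] = field(default_factory=list)  # from MARC 730 fields, per the text


def build_bib_record(record_id: str, entry_date: str, entry: CatalogEntry) -> str:
    """Render an RFC 1807-style "bib" record as "FIELD:: value" lines."""
    # Dienst needs at least an author, title, and date, so refuse to
    # write a record that lacks any of them.
    for name in ("title", "author", "date"):
        if not getattr(entry, name).strip():
            raise ValueError(f"missing required metadata: {name}")
    lines = [
        "BIB-VERSION:: CS-TR-v2.1",  # version string as we recall it from RFC 1807
        f"ID:: {record_id}",
        f"ENTRY:: {entry_date}",
        f"TITLE:: {entry.title}",
        f"AUTHOR:: {entry.author}",
        f"DATE:: {entry.date}",
    ]
    lines += [f"NOTES:: {note}" for note in entry.notes]
    lines += [f"KEYWORD:: {keyword}" for keyword in entry.keywords]
    # No ABSTRACT field is written: the catalog supplies none for theses.
    lines.append(f"END:: {record_id}")
    return "\n".join(lines) + "\n"
```

Each catalog note and keyword becomes its own repeated field, which is what lets a keyword search work in place of the abstract search that Dienst normally performs.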
Additional Metadata
We also add three kinds of metadata that did not have a place
in the RFC 1807 record format (because of its CSTR-orientation).
This required a local modification to the RFC 1807 format which we
judged permissible since these "bib" records will not be used directly to
interoperate with other digital libraries (the only current networked
theses project is much more loosely-coupled than the incestuous NCSTRL).
They are stored in made-up extension fields which are, by field name:
- X_DEPARTMENT: The M.I.T. department or laboratory which granted the degree.
Used to automatically fill in the Document Services order form.
- X_DEGREE: The abbreviation for the degree granted for the thesis, e.g.
"Ph.D.", "Sc.D.", "M.S.".
Used to fill in the Document Services order form.
- X_SUPERVISOR: The name of the student's primary advisor or supervisor
(usually in forename-first format, which is different from the author name).
Used to fill in the Document Services order form. May also eventually
be used as an index key in searching.
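To show how the extension fields listed above might be consumed, here is a small Python sketch that parses a "bib" record and collects default values for an order form. It assumes the extension fields use the same "FIELD:: value" line syntax as the standard fields; the function names and the order-form key names are hypothetical.

```python
import re
from typing import Dict, List

# A field line looks like "NAME:: value"; anything else is treated as
# a continuation of the previous field's value.
FIELD_LINE = re.compile(r"^([A-Z][A-Z0-9_-]*)::\s*(.*)$")

# Mapping from extension field name to a (hypothetical) order-form key.
ORDER_FORM_FIELDS = {
    "X_DEPARTMENT": "department",
    "X_DEGREE": "degree",
    "X_SUPERVISOR": "supervisor",
}


def parse_bib(text: str) -> Dict[str, List[str]]:
    """Parse a bib record into a mapping of field name to list of values."""
    fields: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        match = FIELD_LINE.match(line)
        if match:
            current = match.group(1)
            fields.setdefault(current, []).append(match.group(2).strip())
        elif current is not None and line.strip():
            # Continuation line: append to the most recent value.
            joined = fields[current][-1] + " " + line.strip()
            fields[current][-1] = joined.strip()
    return fields


def order_form_defaults(fields: Dict[str, List[str]]) -> Dict[str, str]:
    """Pick the first value of each extension field, skipping absent ones."""
    defaults: Dict[str, str] = {}
    for bib_name, form_name in ORDER_FORM_FIELDS.items():
        values = fields.get(bib_name)
        if values and values[0]:
            defaults[form_name] = values[0]
    return defaults
```

A record that lacks one of the extension fields simply yields no default for that entry, leaving it for the customer to fill in by hand.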
Current Retrospective Thesis Workflow
This section documents precisely how the data for a retrospectively-scanned
thesis travels from its earliest inception to availability in the
digital library. It is provided in detail to assist later maintainers
of the system.
Life Cycle of a Retrospectively-scanned Thesis
- A request for a reprint of a thesis arrives at Document Services.
- The DS office staff look up the thesis (the order form includes
entries for author, title, year, and department) in Barton to get
the OCLC number of its entry. This number is added to the order for processing
and is used to check whether the thesis is already in the online image archive.
- For a thesis not already in the online image archive, the archived
microform is obtained for scanning. It is run through an automatic
scanner controlled by a PC, which writes the page images (via NFS) into a
directory on a Sun workstation in Document Services. The directory
is named with the OCLC number of the thesis. These images are also printed
for the customer.
- At this point, Document Services simply
leaves the thesis images on Intaglio, the Document Services workstation.
A daily daemon running on Intaglio looks for newly-scanned theses
and uploads them to the I/S image archive server, Origen. The images
are simply packaged into a ZIP file and uploaded to a drop-box via
anonymous FTP, so the DS workstation does not need secure access
to the server.
See
ds/ds-scripts/import-theses.pl in the cstrdev locker.
- The daemon also checks that the OCLC number of each thesis really
corresponds to a thesis entry.
It looks up that number's record in Barton and checks for a 502
field, which MIT Libraries now uses exclusively on theses. This isn't
a perfect check, but when it errs it should produce false negatives, and
is primarily intended to catch typos in the OCLC number or a catalog
lookup which was not a thesis. Document Services staff are notified
of failures by email.
- On the image-archive server, a periodic daemon empties the drop
box and stores the images in a document directory in the archive.
See
ds/cgi-bin/process-putimages.pl in the cstrdev locker.
The actual images are moved to a subdirectory of the document directory
(which is still indexed by OCLC number). Once this is done, the
images can be retrieved by Document Services (using a private, secure,
Web page interface)
to print another copy.
- Yet another daemon identifies newly-uploaded theses and adds them
to the digital library. See
dienst/theses/LibMgt/seek_new_theses in the cstrdev locker.
It does several things for each new thesis:
- It obtains metadata and puts them in an RFC1807-style "bib" record.
It gets the MARC record from Barton using the OCLC number and
extracts the metadata from that by invoking a specialized Perl script,
dienst/theses/LibMgt/docid_build in the cstrdev locker.
- This same script, docid_build, also creates a document-ID (docid)
that uniquely identifies the thesis within the digital library. (We decided
not to use OCLC numbers or other cataloging numbers for this purpose since
some scanned theses may not be in any catalog, and the new
electronically-submitted theses won't be cataloged when we get them.)
The document-ID consists of the 4-digit year of publication of the
thesis, followed by a dash and a unique serial number.
- The script then runs a Dienst script to build other "formats"
(e.g. sets of viewable images) such as the thumbnail views.
See dienst/theses/LibMgt/db_build in the cstrdev locker.
- Now there is a complete document for the digital library, but
the script has to run another Dienst script to add the document to Dienst's
index tables. Without that, it will not even display correctly. So
it runs
dienst/theses/Indexer/build-inverted-indexes.pl
in the cstrdev locker.
- Finally, when all documents have been processed, the script prods the
running Dienst to reload its indexes, so the newly-added documents are
found on searches. Dienst does not automatically reload its indexes
from the files on disk.
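Two of the steps in the life cycle above are simple enough to sketch in Python: the check for a 502 field, and the construction of a document-ID. The sketch is illustrative only and is not the code in the cstrdev locker. In particular, the text says only that the serial number is unique; drawing it from one counter shared across all years, and writing it without zero padding, are assumptions made here, as are the function names.

```python
import re
from typing import Iterable

# A docid is a 4-digit year, a dash, and a serial number.
DOCID = re.compile(r"^(\d{4})-(\d+)$")


def looks_like_thesis(marc_tags: Iterable[str]) -> bool:
    """Return True if the catalog record carries a 502 field.

    marc_tags is the collection of MARC tags present in the record.
    A record without a 502 field is reported as "not a thesis" even
    if it really is one, so failures need a human to review them.
    """
    return "502" in set(marc_tags)


def next_docid(year: int, existing_docids: Iterable[str]) -> str:
    """Build a new docid of the form YYYY-serial.

    Assumption: the serial is one more than the highest serial already
    in use in any year, which keeps it unique across the collection.
    """
    if not 1000 <= year <= 9999:
        raise ValueError(f"year must have four digits: {year}")
    highest = 0
    for docid in existing_docids:
        match = DOCID.match(docid)
        if match:
            highest = max(highest, int(match.group(2)))
    return f"{year}-{highest + 1}"
```

Keeping the docid independent of the OCLC number means a thesis with no catalog entry can still be assigned an identifier in exactly the same way.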
Future Work and Goals
- Link the Theses digital library more closely with Barton. The
Dienst searching and indexing is not as strong as what is available
through the OPAC. More significantly, our digital library only has a
small subset of all MIT theses. It would be better to have the digital
library search through and display all theses, even when the only
option to "view" a document is to order a hardcopy.
This breaks down into:
- Cataloging puts live links from relevant thesis catalog entries to
documents available in the Theses digital library.
- The Theses digital library defers searching functions to Barton,
perhaps hidden through a Z39.50 interface or perhaps directly through
the web interface to Barton.
- There are a few hundred retrospectively-scanned theses that were not
stored in OCLC-number directories. The OCLC numbers for these should be found
and the directories renamed so they become part of the digital library.
- The digital library needs to allow lookups or document references
by OCLC number, so Document Services staff can use it to check
for archived images.
Last modified on $Date: 1998/09/24 21:28:39 $ by $Author: lcs $