Outline of E-Theses Uploading and Digital Library Workflow

lcs 23-Jul-98
Last modified on $Date: 1998/09/28 20:36:10 $ by $Author: lcs $

  1. Assemble: After successfully defending thesis, student assembles necessary materials:
    1. Metadata: author, title, department, degree, abstract, etc. (Note that metadata "abstract" must be stripped of any exotic characters and images such as equations, since it will be presented by Web browsers and embedded in catalog records.)
    2. Answer to question of whether to make thesis available worldwide, or to what degree to restrict access.
    3. An accurately-rendered PDF file of thesis document.
    See Note.
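The requirement that the metadata abstract be free of exotic characters could be enforced on the server as well as by the student. The following is a minimal sketch, assuming that "exotic" means anything outside printable ASCII; the function name `sanitize_abstract` and that policy are illustrative assumptions, not part of the workflow above.

```python
import re
import unicodedata


def sanitize_abstract(text: str) -> str:
    """Reduce an abstract to plain printable ASCII (an assumed policy)."""
    # Decompose accented characters so the base letter survives the ASCII filter.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace any remaining control characters, then collapse whitespace runs.
    printable = re.sub(r"[^\x20-\x7E\n]", " ", ascii_only)
    return re.sub(r"\s+", " ", printable).strip()
```

Symbols with no ASCII equivalent (Greek letters in an equation, for example) are simply dropped here, so the student would still need to reword such passages by hand.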

  2. Upload: Student brings up "E-Thesis Submission" Web form on a machine that has a copy of the PDF file. Student fills in fields for metadata, and indicates the pathname of the PDF file. Web browser transfers PDF file along with the form's answers to the E-Theses digital library server.

    See Note.

  3. Authorize: Need a procedure to AUTHORIZE this thesis upload before the next stage. How do we know the student was qualified to upload a thesis? Will each department supply a list of usernames (Kerberos principals, the identifiers in certificates)? If we don't have this, the system is wide open to fraud and abuse.

    See Note.

  4. Hold: Digital library server installs the thesis in a "holding-tank".
    1. Make a temporary unique identifier based on, e.g., the author's name or department name followed by a unique integer.
    2. Build a complete digital library document, including metadata and page images from the PDF.
    3. At this point, anyone who knows the temporary document ID can look at it (perhaps we can limit that to MIT community or a subset with certificates if necessary) but it will NOT be in the E-Theses server's own index.
    4. Notify the author by email that their submission has reached this stage, since it should happen within hours (or less) of the time they upload.

    See Note.
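The temporary identifier described in the Hold step (a name followed by a unique integer) might be generated along these lines. This is only a sketch: the slug format, the function name `temporary_id`, and the in-memory counter are assumptions, and a real server would need a counter that survives restarts.

```python
import itertools
import re

# Hypothetical in-memory counter; a real server would persist this value.
_counter = itertools.count(1)


def temporary_id(name: str, counter=_counter) -> str:
    """Build a holding-tank ID from an author or department name plus an integer."""
    # Lowercase the name and replace runs of non-alphanumerics with a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return f"{slug}-{next(counter)}"
```

Because the integer alone guarantees uniqueness, the name portion serves only to make the identifier recognizable to staff.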

  5. Print: [Maybe] Digital library automatically generates an email message to Document Services, who print the PDF file to paper to microfilm it. Make the microfilm [?], then send the paper to Archives.

  6. Check: Archives validates the [DS-printed?] paper copy before binding it for preservation. Check that all the pages are present in the right order, structure and formatting are correct, etc. Ideally, compare the page images in the online copy to printed pages.

    NOTE: This check is a bit late in the game for the purpose of DS' microfilming, but assuming DS can't proof the paper, it saves the workflow complication of having Archives send the paper back to DS for microfilming before Archives binds it.

  7. Notify: Digital library automatically generates an email message to Cataloging (via an archived mailing list) announcing the arrival of the new document and providing a URL. The mailing list is archived because this is the critical path for getting E-Theses into the catalog, so it must be possible to recover email that is lost by accident.

    NOTE: Should there be any interlock with the proofing of the PDF printout by Archives to make sure e.g. any changes to parts of the document from which cataloging data are derived get propagated? Changes are unlikely, so in the event of a change, will Archives just signal it to Cataloging?
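As an illustration of the Notify step, the message to Cataloging could be assembled with Python's standard `email` package. This is a sketch under stated assumptions: the sender address, the subject wording, and the function name `cataloging_notice` are hypothetical, and actually sending the message (e.g. via SMTP) is left out.

```python
from email.message import EmailMessage


def cataloging_notice(title: str, author: str, url: str, list_address: str) -> EmailMessage:
    """Compose the notice sent to the archived Cataloging mailing list."""
    msg = EmailMessage()
    msg["From"] = "e-theses-server@example.org"  # hypothetical sender address
    msg["To"] = list_address
    msg["Subject"] = f"New E-Thesis awaiting cataloging: {title}"
    # The body carries the URL of the temporary document summary page.
    msg.set_content(f"Author: {author}\nTitle: {title}\nSummary page: {url}\n")
    return msg
```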

  8. Catalog: Cataloging brings up the temporary document "summary" page in a web browser. From there, they can view images of any document pages needed to get the cataloging information. The abstract (or abstract summary) can be obtained directly from the digital library's metadata collected in the Upload step above. Cataloging generates a new catalog entry (and enters it into OCLC?).

  9. OCLC-Number: When Cataloging obtains an OCLC number for the document, they can once again bring up the web page for the temporary document. A link on that document summary page takes them to a Web form where they can enter the OCLC number and optionally correct or add to the metadata. At this point the digital library has everything it needs for a complete E-Thesis document.

    Cataloging can also add a reference to the Web version of the document since the URL can be predicted from the OCLC number, something like: URL:http://theses.mit.edu/hdl/mit.theses/OCLC-01234567
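The predictability of the URL can be shown with a short sketch. The base URL and the "OCLC-" prefix come from the example above; the eight-digit zero padding is inferred from that one example and is an assumption, as is the function name `thesis_url`.

```python
BASE_URL = "http://theses.mit.edu/hdl/mit.theses/"


def thesis_url(oclc_number: int) -> str:
    """Predict the Web address of a thesis from its OCLC number."""
    if oclc_number <= 0:
        raise ValueError("OCLC number must be a positive integer")
    # Zero-pad to eight digits, matching the sample URL (an assumed convention).
    return f"{BASE_URL}OCLC-{oclc_number:08d}"
```

For example, `thesis_url(1234567)` yields the sample address given above.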

  10. Finish: The digital library then re-files the document under a new ID based on its OCLC number and adds it to its own index (for searches). It also creates a handle or PURL, as needed, for external references (e.g. from Barton).

Additional Notes

Assemble:
Later, possibly extend uploading to allow inclusion of supporting data files such as source code and multimedia. In that case they must be bundled into a single archive file with 'tar' on Unix, or 'winzip' on NT, or 'Mac tar' on the Macintosh. Alternatively, the interface could just be extended to allow multiple files to be uploaded (one at a time) for the same document.

Upload:
The use of a certificate gives us positive identification of the submitter.

The metadata is then checked for validity. Since the file upload may take a long time and it would be unfortunate to have to repeat it if the metadata is rejected, it may be wiser to do this with two separate forms: get the metadata with the first form and, when it is accepted, proceed to the file-upload form.
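The check performed by the first form might look like the sketch below. The required field names are taken from the metadata listed in the Assemble step; the function name `metadata_problems` and the rule that a field is valid when it is non-blank are assumptions, since the actual validity rules are not specified here.

```python
REQUIRED_FIELDS = ("author", "title", "department", "degree", "abstract")


def metadata_problems(form: dict) -> list:
    """Return a list of problems; an empty list means the form is accepted."""
    problems = []
    for field in REQUIRED_FIELDS:
        # Treat a missing key and a whitespace-only value the same way.
        if not form.get(field, "").strip():
            problems.append(f"missing {field}")
    return problems
```

Only when this returns an empty list would the server present the second form, the one that carries the file upload.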

Authorize:
Do we need some kind of interlock with the student's committee to make sure the thesis actually passed the defense, or collect their approval somehow? This might require an extra workflow stage, a separate holding tank, or status marker on the document.

Hold:
We have to assign a unique identifier that indexes to this Thesis/Dissertation positively, to locate it in the digital library. We must also prevent the student from uploading a replacement of the document's contents.

The first version uploaded becomes the copy of record. (Is it enough to match on the username in the Web certificate and disallow any further upload of the same class of thesis (e.g. PhD) for that user?) It's conceivable that a single user may upload both an undergraduate paper and a master's thesis in a single session, so that should be allowed.

Maybe we need to check for an existing thesis of the same class, to prevent a user from attempting to upload a new copy of his PhD thesis!

This gets even more complicated if we consider some types of theses, such as "major papers", where a single user may author more than one document instance! Then we have to really test precisely (how?) for a duplicate or attempted revision of an existing document.
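One possible shape for the simple version of this check is sketched below. It assumes the server keeps a set of (username, thesis class) pairs for documents already accepted, and that classes permitting several documents per author are listed explicitly; both the data structure and the names are hypothetical. It does not answer the open question of how to detect a revision within a multi-document class.

```python
# Hypothetical list of classes in which one author may submit several documents.
MULTI_DOCUMENT_CLASSES = {"major paper"}


def may_upload(username: str, thesis_class: str, existing: set) -> bool:
    """Decide whether a new upload is allowed, given (username, class) pairs on file."""
    if thesis_class in MULTI_DOCUMENT_CLASSES:
        # Duplicates here need a finer test, which the notes leave open.
        return True
    # Otherwise allow at most one document per user per class.
    return (username, thesis_class) not in existing
```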