_______________________________________________________________________________

[Scan and OCR from a paper copy. Page 15 was missing in the available copy.
It has been replaced with a version dated four days earlier.]

_______________________________________________________________________________

                              EECS Colloquium

             Technology, Networks, and the Library of the Future

                              October 28, 1991
                                  4-5 p.m.
                             M.I.T. Room 34-301

_______________________________________________________________________________

(SLIDE: TITLE)

Background: I do research on the engineering of computer systems. My
approach is to look for real applications that stress systems, to get
guidance on what systems problems need work.

The application of interest here is automating the library. In raising the
question, we revisit an old idea:

    Vannevar Bush, 1945
    John Kemeny, 1960
    JCR Licklider, 1965

Until now, the idea was good but the technology was not up to it. Our
thesis: the technology has changed.

(NEXT SLIDE: OVERVIEW)

_______________________________________________________________________________

(SLIDE: OVERVIEW)

This talk consists of three sections:

    Remind you of the vision
    Examine the driving technology
    Look at the system engineering challenges

(NEXT SLIDE: WHAT HATH WORD PROCESSING WROUGHT)

_______________________________________________________________________________

First, we have to figure out what a library is, or will be in 1999...

(SLIDE: WHAT HATH WORD PROCESSING WROUGHT)

    On-line bulletin boards
    e-mail
    personal files
    library
    scientific data
    publishing
    government & business reports

    also related: paperless offices; corporate databases; collaborative work

The boundaries among these several areas will be the subject of battles over
the next decade, as people stake out revenue streams and novel ideas enter
the arena. It isn't plausible to innovate across the whole area and within
each area at the same time.

My approach: assume that the field will proceed by successive approximation,
with individual areas first working under the assumption that each will
maintain roughly the traditional interfaces to the others. Then, as adjacent
areas think through their new paradigms at least a little, they will be in a
position to explore pairwise negotiation of the boundaries that separate
them. There is no reason to believe that this is the BEST way to do it, just
that this is the way things will work in practice.

(NEXT SLIDE: LIBRARY)

_______________________________________________________________________________

So we take as the initial library boundary these characteristics:

(SLIDE: LIBRARY)

- Selective (publisher/editor/collector/curator -- someone chooses)

- Archival (expected to persist for decades)

- Shared (results of collection are used by many people)

(NEXT SLIDE: How computers might help 1)

_______________________________________________________________________________

(SLIDE: How computers might help 1)

The traditional view of how computers might help libraries concentrates on
two quite different concepts:

1. Discovery of relevant documents ("Search" or "Information Retrieval").
   Computer scientists have focused on it almost exclusively because it is
   sexy. It ranges from database queries to knowledge-based measures of
   document "relatedness" ("Find documents like this one."). It is hard,
   because "relevance" is an elusive concept. Because of 30 years of bad
   experience, librarians have learned to be wary of computer people.
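To make the mechanical end of that range concrete, here is a minimal sketch
(mine, not from the talk; the names are illustrative) of one classic
"relatedness" measure: cosine similarity between term-frequency vectors. It
also shows why relevance is elusive -- the measure sees shared vocabulary,
not shared meaning.

    # Minimal sketch of a mechanical "find documents like this one"
    # measure: cosine similarity between term-frequency vectors.
    import math
    from collections import Counter

    def term_vector(text):
        # Crude tokenization; real systems also strip stop words and stem.
        return Counter(text.lower().split())

    def cosine(a, b):
        shared = set(a) & set(b)
        dot = sum(a[t] * b[t] for t in shared)
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def find_like(this_one, collection):
        # Rank every document by similarity to the query document.
        q = term_vector(this_one)
        return sorted(collection,
                      key=lambda d: cosine(q, term_vector(d)),
                      reverse=True)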
("Find documents like this one.") It is hard, because "relevance" is an elusive concept. Because of 30 years of bad experience, librarians have learned to be wary of computer people. 2. Back office automation. Acquisition, cataloguing, circulation, overdue notices, serials control, inter-library loans. Librarians have traditionally embraced this one, with varying degrees of enthusiasm. Bringing the automated catalog out where patrons can get at it is a fairly recent innovation. There are now more than 100 university library catalogs available on the internet. (NEXT OVERLAY: COMPHELP 2) _______________________________________________________________________________ libtalk.txt Mon Oct 28 13:44:21 1991 7 (OVERLAY: COMPHELP 2) 3. Storage, Browsing, Navigation (Identification) My focus is different and simpler: look at the computer system primarily as a storage and browsing device, enhancing the speed of access to a very large body of material. Navigation is moving from one work to another via citations. Identification supports navigation; it comes into play when a citation is incomplete or ambiguous. We can think of Browsing, Navigation, and Identification as an extension of traditional cataloguing activities along two dimensions - to include the documents themselves - to catalog contents of journals (NEXT SLIDE: SPECTRUM) _______________________________________________________________________________ libtalk.txt Mon Oct 28 13:44:21 1991 8 (SLIDE: SPECTRUM) and discovery/search may be viewed as being at the other end of the identification spectrum. Market: every network-attached desktop workstation in the world is a potential client for this service; availabiLity of this service will increase the demand for desktop workstations and for communications service. (NEXT SLIDE: THE NEW TECHNOLOGIES) _______________________________________________________________________________ libtalk.txt Mon Oct 28 13:44:21 1991 9 Let's slip ahead into the next discussion for just a moment: What is driving the move to an online library? (SLIDE: THE NEW TECHNOLOGIES) Build on four technologies: - high-resolution desktop displays. The displays you see today are not quite adequate, but all you have to add is 4 bits of gray scale and the psychology of acceptance skyrockets. - Megabyte/second data communication rates from the office to the nearest library (LAN) plus Megabit/sec data communications from there to more distant major libraries (NFSNET, B-ISDN, etc.) - magnetic storage, observation: with optical storage media today, the physical space needed to hold a scanned-in book is now less than the original book. But today's optical storage is driven by requirements of a different world, and they don't match well enough in performance. But magnetic storage DOES have the necessary performance match and it will soon be cheap enough. This is so important that we will come back to it. - CLIENT/SERVER architecture looks like just the right modularity tool for dealing with a bunch of problems that are traditional barriers to progress in the library. We will come back to this, too. (NEXT SLIDE: A SCANNED BOOK PAGE) _______________________________________________________________________________ libtalk.txt Mon Oct 28 13:44:21 1991 10 Finally, the vision--in the form of two system goals: (SLIDE: SCANNED BOOK PAGE) 1. A person with a desktop workstation can look at any book, paper, newspaper, technical report, or manuscript without being required to visit the library. 
(NEXT SLIDE: A SCANNED BOOK PAGE)

_______________________________________________________________________________

Finally, the vision -- in the form of two system goals:

(SLIDE: SCANNED BOOK PAGE)

1. A person with a desktop workstation can look at any book, paper,
   newspaper, technical report, or manuscript without being required to
   visit the library.

   This is from an 1888 report of the Royal Society on the explosion at
   Krakatoa in 1883. It was scanned, and then printed, at 100 dpi with gray
   scale, and it looks slightly better than this on a 4-bit gray-scale
   display. (You can try to figure out how far back from the screen you
   should be standing to produce the same effect as sitting in front of a
   computer screen.)

   Primary implication: full text on line, all text, in image form.

2. While reading a document, if it contains a reference or citation, one
   should be able to mark that citation, request that that work appear on
   the screen, and expect it to appear immediately in an adjacent window.

   Primary implication: links, with a flavor of hypertext. (Check out the
   citation.)

N.B.: I'm not trying to replace books, I'm trying to augment them. I assume
that there will still be a way to get your hands on a copy of the book to
curl up with in the hammock. The goal here is to allow you to browse the
book first.

(NEXT SLIDE: SCANNING)

_______________________________________________________________________________

We now move to the driving technologies. First, some implications of image
storage.

(SLIDE: SCANNING)

Rough calculations of the storage space involved (a runnable version of this
arithmetic follows the summary):

- A 9 x 12 book has about 70 sq in/page of printed area.

- 400 pages -> 30,000 sq inches of images per (big) book.

- 300 pixels/inch -> 90,000 pixels/sq in -> 10,000 bytes/sq in (monochrome
  scanning with 1 bit/pixel)

- -> 300 Mbytes, or about half a CD-ROM.

- Other resolutions scale with the square; going to 450 pixels/inch doubles
  the space required, to 600 Mbytes/book.

- Compression: the usual experience is that reversible compression drops the
  300-pixel/inch space by a factor of 10, to 30 Mbytes/book, for ordinary
  printed text. Other resolutions should scale, after compression, with the
  square of the log, so doubling the resolution might add 20% to the bit
  count. N.B.: neither of these numbers may apply for things other than
  ordinary text. (Why to use reversible compression -- see below.)

- ASCII: 70 characters/line x 40 lines = 2800 characters/page; with form and
  content markup that would expand to 6000 bytes/page, or 3 Mbytes/book.

Summary:

    ASCII                          3 Mbytes/book
    Compressed 300 pixel/inch     30 Mbytes/book
    Raw 300 pixel/inch           300 Mbytes/book
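The same arithmetic as a short program, so the assumptions can be varied;
the inputs are the slide's numbers, and the slide's figures are rounded.

    # The SCANNING-slide arithmetic, spelled out.  Inputs are the
    # slide's assumptions; the slide rounds the results.

    SQ_IN_PER_PAGE = 70            # printed area of a 9 x 12 page
    PAGES = 400                    # a big book
    DPI = 300                      # scanning resolution, 1 bit/pixel

    sq_in = SQ_IN_PER_PAGE * PAGES        # 28,000 (slide: ~30,000)
    bytes_per_sq_in = DPI * DPI / 8       # 90,000 pixels -> 11,250 bytes
    raw = sq_in * bytes_per_sq_in         # raw scanned images
    compressed = raw / 10                 # reversible compression: factor of 10
    ascii_book = 6000 * PAGES             # 6000 bytes/page with markup

    print(f"raw {raw/1e6:.0f} MB, compressed {compressed/1e6:.0f} MB, "
          f"ASCII {ascii_book/1e6:.1f} MB")
    # raw 315 MB, compressed 32 MB, ASCII 2.4 MB  (slide: 300 / 30 / 3)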
(NEXT SLIDE: SIZES)

_______________________________________________________________________________

This leads to library sizes that look like this...

(SLIDE: SIZES)

[One of the secrets of keeping up with the computer business is that you
need to learn a new metric prefix each decade. Tera is beginning to show up
as a common term just this year, and Peta will be with us by the end of the
decade. If you want to keep one step ahead, Exabytes will be the unit of
discourse in 2010.]

(NEXT SLIDE: DISKS)

_______________________________________________________________________________

Supply vs. demand. That was the amount of space we will NEED. Now, let's see
how much space we can GET.

(SLIDE: DISKS)

Today's 1-Gbyte disk costs about $2500 and occupies about 1 cu ft. It can
hold about 25 books (compressed, 300 pixel/inch), at a cost of $100/book and
a storage volume of 70 cubic inches, about the size of the book. (Offline
CD-ROMs can put 25 books in a jewel box and thus take up much less space,
but they are offline, so they don't count. Current juke boxes of CD-ROMs
take up about 1/10 of the volume of the book, and cost 1/10 as much, but
they have something like 1/100 of the performance, so it is hard to see how
they fit in.)

This slope is amazing -- that is a log scale on the left. The REAL limit is
this red line, though.

(NEXT OVERLAY: RED LINE)

_______________________________________________________________________________

(OVERLAY: RED LINE)

There are two factors of ten behind us, and two more appear to be available
by 1999. The cumulative impact of four orders of magnitude is that all prior
assumptions about what is feasible must be revisited.

    *************************************************
    *                                               *
    * There are three insights hiding on this       *
    * slide:                                        *
    *                                               *
    * - scanned image on magnetic disk already      *
    *   takes up no more space than the book;       *
    *                                               *
    * - the next factor of 10 (1994) makes image    *
    *   cost in the same ballpark as paper, but     *
    *   the space is 10 times smaller;              *
    *                                               *
    * - the second factor of 10 (1999) makes image  *
    *   so cheap and space so small that you can't  *
    *   avoid it.                                   *
    *                                               *
    *************************************************

(NEXT SLIDE: SPACE)

_______________________________________________________________________________

(SLIDE: SPACE)

- Assume 4-foot stacking, 4 disks/square foot; building space (at
  $200/square foot) to house the disks will add about 2% to the cost.

- The Library of Congress holds 60M works (sometimes said to be 30M). They
  are probably somewhat smaller on average than that book, but I don't know.
  Assume they are all that size -> 1.8 million Gigabytes = 2 Petabytes.
  (Also need to learn the expansion rate/year.)

- If we assume the factor of 100, you get 2,500 books/disk, and need 24,000
  disks. Plus a building of 6,000 square feet.

PUTTING SCANNED IMAGES ON MAGNETIC DISK, THE 1999 LIBRARY OF CONGRESS FITS
IN ONE FLOOR OF A SMALL OFFICE BUILDING AND THE STORAGE EQUIPMENT COSTS
$60M.
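That capitalized claim is easy to check. Here is the SPACE-slide sizing as a
sketch; every input is one of the slide's stated assumptions.

    # The SPACE-slide sizing, spelled out.  "Factor of 100" is the
    # projected 1991->1999 gain in disk density at constant cost.

    WORKS = 60e6                   # Library of Congress holdings
    MB_PER_WORK = 30               # compressed 300 pixel/inch book
    BOOKS_PER_DISK_1991 = 25       # today's 1-Gbyte, $2500 disk
    DENSITY_GAIN = 100             # two more factors of ten by 1999
    DISK_COST = 2500               # dollars per disk, held constant
    DISKS_PER_SQ_FT = 4            # 4-foot stacking

    petabytes = WORKS * MB_PER_WORK / 1e9
    disks = WORKS / (BOOKS_PER_DISK_1991 * DENSITY_GAIN)
    print(f"{petabytes:.1f} PB, {disks:,.0f} disks, "
          f"{disks / DISKS_PER_SQ_FT:,.0f} sq ft, "
          f"${disks * DISK_COST / 1e6:.0f}M")
    # 1.8 PB, 24,000 disks, 6,000 sq ft, $60M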
_______________________________________________________________________________

So much for the driving technology. The next subject is the system
engineering challenge. The basic observation is that availability of the
technologies only enables the solution; creating a workable SYSTEM involves
solving a lot of interesting engineering problems.

Research problems:

(SLIDE: CHALLENGES)

Three, plus a list:

    applying client/server design
    how to represent links
    persistent storage

(NEXT SLIDE: CLIENTSERV)

_______________________________________________________________________________

(SLIDE: CLIENTSERV)

This one appears, perhaps, to be the easiest: the client/server architecture
is an off-the-shelf technology that solves several problems:

- Workstations can be owned by the customer, not necessarily by the library,
  with effects on capital budgets and ubiquity.

- Presentation management, customization, and inquiry state can be the
  responsibility of the customer's device and of third-party competition.

- Network protocols can assure stable interfaces in the face of evolution of
  user facilities.

- Can separate indexing systems, search engines, and storage devices.

- Can make multiple/parallel inquiries, e.g., to the local library, to the
  Library of Congress, and to the Books In Print server, so that one can
  give the answer, "found what you wanted, but we don't have it in the local
  library."

- Can separate the catalog server from the circulation system, which must
  give fast response, and the bulk storage system, which can be slow.

- Helps with the longevity requirement: can individually replace any
  obsolescent component without replacing the whole system.

(NEXT SLIDE: LINKS)

_______________________________________________________________________________

The second area of system interest is links.

(SLIDE: LINKS)

- Some books consist of nothing but links, together with opinions as to
  their value.

- In the traditional system, link authorship is varied: we have a spectrum,
  with intellectually generated links at one end, mechanically generated
  ones at the other. Insights: creating links is an act of authorship; links
  are a kind of published work. For older works, creditable research may be
  required to identify the target that the original author intended.
  Probably want to use normal "publication" filters to decide which links
  are worth archiving. Mechanically generated links are interesting, but
  authored links are important, too.

- The mechanisms for navigating around the library are fundamentally nothing
  but links, used in various ways:

    - plain links, as described above

    - "other things by this author" (a uid for the author should eliminate
      stuff by similarly-named authors; there is an implicit work consisting
      of a list of works by the author -- we are just using links in that
      implicit work)

    - "other things in this journal" (links from its table of contents)

    - "other things published by people in this organization" (links from an
      implicit "organization" work)

    - "other things catalogued as being on the same subject" (links provided
      by the librarian)

--------------------------

Underlying links is an interesting challenge of systems research:
representation. How do you knit data together when the pieces may be

    - on the same machine
    - on a machine elsewhere running the same program
    - on a machine elsewhere running the same program but administered by
      someone else
    - on a machine elsewhere running a different program that is alleged to
      meet the same specification
    - on a machine elsewhere running a different model of the universe?

Most research on distributed systems has been on a program-oriented model of
cooperation (RPC), in which one machine asks another to run a program. Links
call for a different model of cooperation, in which one machine needs to
maintain, over a long time, references to data stored by the other. This
requires a carefully engineered blend of direct reference (for performance)
and stand-offishness (for insulation against failures, change, and lack of
cooperation). Clark's information mesh research bears on this topic. Links
involve, but are not limited to, the rendezvous provided by naming services.

--------------------------
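One plausible shape for such a blend, sketched in code: a link carries a
fast direct reference, a long-lived uid for the naming-service rendezvous,
and enough bibliographic description to fall back on identification when
both fail. All names and interfaces below are assumed for illustration;
nothing here is a real protocol.

    # A link record blending direct reference (for performance) with
    # stand-offishness (for insulation).  Illustrative only.
    from dataclasses import dataclass, field

    @dataclass
    class Link:
        target_uid: str        # long-lived id, resolved via a naming service
        cached_location: str   # direct reference (server + object): fast path
        description: dict = field(default_factory=dict)  # author, title, year...
        link_author: str = ""  # links are authored, published works too

    def follow(link, fetch, resolve, search):
        # 1. Try the cached direct reference (performance).
        doc = fetch(link.cached_location)
        if doc is not None:
            return doc
        # 2. Fall back to the naming-service rendezvous (insulation
        #    against movement and reorganization).
        location = resolve(link.target_uid)
        if location is not None:
            link.cached_location = location    # repair the fast path
            return fetch(location)
        # 3. Last resort: identification from the bibliographic
        #    description, as a librarian would do it.
        return search(link.description)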
(NEXT SLIDE: PERSISTENCE)

_______________________________________________________________________________

(SLIDE: PERSISTENCE)

The third subject is persistence:

3. Media obsolescence. By the time the disks are filled, they will be
obsolete. But the proper goal is to preserve the information, not the disks.
So part of the system design is to have a component, like backup, that
automatically moves data to new technology without getting in anyone's way.
It will probably be running most of the time, and it is just as important as
backup.

A carefully designed ceremony is required to copy the data, to be sure that
it all gets copied and that the new copy hasn't been corrupted by the
copying process itself. Copying is a long-running job, which means it must
be coordinated with updates that go on at the same time.

Is it safe to compress data? Should the compression algorithm be stored with
it? Or at least the name and parameters of the compression algorithm? How do
you read data compressed 75 years ago? Answer: refresh the compression
algorithm whenever you replace the disks.

It seems inadvisable to use the non-reversible (lossy) compression
algorithms being suggested for moving video. Reason: over the course of 50
years, one may have to expand and recompress (with the newer standard) five
or ten times; losses from incompletely reversible algorithms would be
expected to accumulate in the form of increasing degradation.

Where should forward error correction be placed to ensure that data will be
readable despite occasional media errors? Perhaps only error detection is
needed, plus having copies at several sites, placed on different brands of
media. If forward error correction is used, the algorithm needs to be
updated periodically, just like the compression algorithm. But while the
data is being run through the upgrade, it is unprotected and thus
vulnerable; a very carefully designed ceremony is required (e.g., read the
old data, decode it, recode it with the new algorithm, write the new data,
then read the new and old data through decoders and compare. Then move the
old data to tape, where it is stored until the next such transformation,
just in case.)
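The ceremony itself is short to state in code, even though the surrounding
engineering (coordination with concurrent updates, retiring old media to
tape) is the hard part. A sketch under assumed store and codec interfaces;
none of these names are real.

    # Sketch of the copy-and-recode ceremony: read the old data, decode
    # it, recode it with the new algorithms, write it, then decode BOTH
    # copies and compare before the old medium is retired to tape.

    def refresh(old_store, new_store, old_decode, new_encode, new_decode):
        suspect = []
        for key in old_store.keys():
            plain = old_decode(old_store.read(key))   # recover the plain bits
            new_store.write(key, new_encode(plain))
            # Verification: both copies must decode to identical bits.
            if old_decode(old_store.read(key)) != new_decode(new_store.read(key)):
                suspect.append(key)    # leave the old copy authoritative
        return suspect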
(NEXT SLIDE: OTHERS)

_______________________________________________________________________________

(SLIDE: OTHERS)

Some other research problems (many of these are more general than the
library application):

1. What strategy for sprinkling extra copies of popular images and indexes?

2. How to connect/identify scanned images with ASCII representations?

3. Reliability: a looming threat is media loss because of disasters. (Over a
   long enough time period there will be a disaster.) Traditional backup is
   done by making a copy on tape. But keeping up with updates leads to an
   operational mess with lots of tapes. Are multiple copies at widely
   separated sites the answer? They don't have the advantage of an alternate
   physical organization to reduce the chance of a software mistake ruining
   both the original and the backup, but they do prevent loss in the case of
   an earthquake, riot, or fire. Combining backup with media refresh is
   interesting.

5. There is a cluster of political/administrative problems involved in
   startup:

    - continuity of ongoing production
    - high cost of some components at the outset, despite expectation of
      low cost later
    - dislocations of revenue flows
    - copyright, which is a stalking horse for the revenue flows
    - relation of public/non-profit libraries to commercial services
    - who stores what? (both authoritative bits and cached bits)

6. With very large data storage, the need for coordination of variant copies
   will get more pressing. It will be common that the local system has an
   old copy of something, while the remote system has an up-to-date copy but
   is out of touch. The storage system needs to provide semantics to deal
   with this situation gracefully. To the extent that the information is
   textual and will be examined by a person, it may be reasonable to go
   interactive and offer the user the choice.

(NEXT SLIDE: CONCLUDE)

_______________________________________________________________________________

(SLIDE: CONCLUDE)

A revolution is available; all we have to do is seize the opportunity.

_______________________________________________________________________________