This is the text version of a paper originally prepared with FrameMaker, so
details such as italics and footnotes are missing.  The full citation of the
paper is as follows:

  Saltzer, J. H., "Needed:  A Systematic Structuring Paradigm for
  Distributed Data," Operating Systems Review 27, 2 (April, 1993),
  pp. 77-81.  Originally distributed as paper #41 in 5th ACM
  SIGOPS Workshop on Models and Paradigms for Distributed Systems
  Structuring, September 21-23, 1992, Le Mont Saint-Michel, France,
  pp. 1-5.

This paper is also available in PostScript form.

==============================================================================

NEEDED: A SYSTEMATIC STRUCTURING PARADIGM FOR DISTRIBUTED DATA

by Jerome H. Saltzer
Library 2000
M.I.T. Laboratory for Computer Science
September 10, 1992



The purpose of this note is to alert the distributed and operating
systems communities to a research and design exercise that is going on
in an adjacent application-oriented community. This research and design
exercise involves inventing a paradigm for distributed system
structuring that is distinctly different from that of remote procedure
call and related paradigms that are the usual focus of the distributed
systems community.

The class of applications involved might be roughly characterized as
distributed, linked data. The remote procedure call paradigm, in which
a client asks a server to perform some named computation on a set of
supplied arguments and return a result, is not an especially congenial
fit to this application, although it may be a useful tool at a lower
level of abstraction in resolving links. Perhaps because the application
of linked data has not been widely considered, it is not clear yet what
is an appropriate form for the structuring paradigm, or even whether or
not the problem has been posed carefully enough to allow one to propose
specific structures. This note describes the problem as it has been posed
in the distributed data community, and points to a few early attempts to
suggest mechanisms; it does not claim to settle the issue.

The interesting requirement in distributed storage of related data
lies in those data relations that can be characterized as cross-
reference. There are many different applications in which one object
might need to contain a cross-reference to another, remote, object. As
examples, one might propose to build a distributed hypertext system, a
sales reporting system for a corporation that has many branch offices,
an office system that allows one worker to incorporate an element of a
remote spreadsheet into a local one by reference, a distributed legal
case database, with cross-references among cases, and also from state to
federal cases and back, or simply an electronic library.

For concrete scenarios, consider the specific application of an
on-line electronic library (comparable scenarios arise in most of the
other application examples). In the electronic library, the primary scenarios
of interest are the following: The maintainer of a document storage
service has installed a new document and wants to advertise the new
document to clients and clients of clients, and allow search services to
store and pass along cross-references (in this application a cross-
reference is usually called a “citation”) to the new document. In a
related scenario, the client of a search service has completed a search
and wants to be able to store a persistent cross-reference to one of the
items discovered. Storing the search query that was used to discover the
item is one common suggestion, but that method is not very satisfactory,
because that particular search may have returned several items in its
result set, or it may have been framed in terms of something unrelated
to what makes the document interesting. In addition, since collections
grow and shrink, there is no guarantee that the search service will
return the same result set for the same query at a future time—especially
if the document itself might change in such a way as to cause the query
not to find it. In yet another related scenario, the user of a document
discovers in it a cross-reference to another document, and wants to make
a copy of that cross-reference in persistent storage for future use.
And in a final scenario, a client includes a cross-reference in a
document which it then submits to be stored by some storage service,
possibly a different one.

In each scenario, some client intends to place the cross-reference in
persistent storage somewhere, for presentation to the storage service at
an unknown time in the future. When the client eventually presents the
cross-reference to the storage service, it hopes to retrieve the cited
item.

In an electronic library, such cross-references may be used:

•	by an author, in creating a new document, to cite previous, related
documents.

•	by the librarian of a storage service, in adding a document to a
collection, making concrete an author's traditional text citations.

•	by a user of a search service, to prepare a list on a personal note
pad of discovered documents.

•	by a search service as the internal connection between its index and
the documents held by a storage service (e.g., to connect a
bibliographic record with the corresponding document).

•	by a search service, as the method of telling the client how to obtain
the item from a storage service.

•	by a client, to pass along to another client.

•	by an author, to cite anchor (that is, identified internal) points
within another document.

The following simple scenario perhaps captures best of all the
required structuring behavior: At the application level, a user has
brought up a document in a display window, and upon browsing through it
noticed that it mentions another document. As a feature, the browser
lets the user point to the cross-reference with the mouse, click, and
expect to see the cited document appear on the screen in another
window.

There are a number of interesting constraints on the solution, at
least in the electronic library application. Some of the other
applications have analogous constraints:

•	The storage service to which the cross-reference is eventually
presented may be different from the one that provided the original
cross-reference—it may be a backup system, or even a competitive
provider.

•	The interval between creation and use of the cross-reference can be
quite long, perhaps decades, during which time the storage service may
have been upgraded, relocated, merged with other storage services, and
placed under different administrative control.

•	Cross-references will persist long enough that the system that is
intended to resolve them will become obsolete.

•	Between creation and use of the cross-reference, the cited document
may have been deleted, updated, or superseded. Something graceful
should happen if multiple versions are available. If the “document”
is actually a dynamic piece of data such as a current stock quote,
perhaps something different, yet graceful, should happen.

•	Between creation and use of the cross-reference, the organization of
the library may have been revised, and the target document may now be
classified differently.

•	Between creation and use of the cross-reference, the physical and low-
level logical configuration of the storage server may have been
revised, and the target document may be in a different directory or
on a different physical volume.

•	Between creation and use of the cross-reference, the storage
representation of the target document may have been discovered to be
defective, and restored from a backup copy.

•	The initial discovery mechanism may identify only the document; a
later discovery process may identify anchor points of interest.

•	The user who presents the cross-reference may or may not be authorized
to obtain the document.

•	A single document may be indexed by several different search services
that are under different administrations.

•	In response to presentation of a cross-reference, a storage server
may, rather than delivering the desired document, instead return
another cross-reference.

•	If a client makes inquiries of several different search services, it
would like to merge the several responses, which requires that it be
possible to figure out which of the returned cross-references are
duplicates.

As can be seen from this laundry list of constraints, the requirements
on cross-references among distributed data objects read more like the
requirements for a sophisticated name service than like those for an
RPC service.
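
Two of the constraints listed above, a storage server that answers
with another cross-reference and a client that must recognize
duplicates among the answers of several search services, can be
illustrated with a minimal sketch. It is not drawn from any of the
systems discussed in this note. It assumes, purely for illustration,
that a cross-reference can be reduced to a pair (service identifier,
object identifier), that service identifiers compare without regard
to case, and that a hypothetical lookup function returns either a
document or a further cross-reference.

```python
from typing import Callable, Iterable, List, Tuple, Union

# Hypothetical representation: (service identifier, object identifier).
Ref = Tuple[str, str]
# A lookup answers with ("doc", text) or with ("ref", another Ref).
Answer = Tuple[str, Union[str, Ref]]


def canonical(ref: Ref) -> Ref:
    """Normalize a reference so that equivalent spellings compare equal.

    Treating service identifiers as case-insensitive is an assumption.
    """
    service, obj = ref
    return (service.strip().lower(), obj.strip())


def merge_results(result_sets: Iterable[Iterable[Ref]]) -> List[Ref]:
    """Merge several result sets, dropping duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for results in result_sets:
        for ref in results:
            key = canonical(ref)
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


def follow(ref: Ref, lookup: Callable[[Ref], Answer], max_hops: int = 5) -> str:
    """Resolve ref, following forwarded cross-references at most max_hops times."""
    for _ in range(max_hops + 1):
        kind, value = lookup(canonical(ref))
        if kind == "doc":
            return value
        ref = value  # the server answered with another cross-reference
    raise LookupError("too many forwarding hops")
```

The hop limit is a design choice of the sketch: without it, two
servers that forward to each other would keep a client busy forever.
Duplicate detection here works only because both parts of the pair
are assumed to be stable; it does not address a document that is
known to different services under different identifiers.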

It would appear that, at a minimum, a cross-reference must internally
be composed of two components. The first would be an identifier,
perhaps to be presented to a name server, that allows the client to
discover an appropriate and current server name, port, and protocol to
use, and to verify upon connection that the service at the other end is
the intended one. The second would probably be a specific object
identifier that the server is expected to recognize. Beyond that, one
moves into the realm of speculation. There might be an expiration date
after which the server doesn't guarantee to honor the cross-reference,
and perhaps also a backup query, which might be useful in identifying
the object once the original cross-reference has expired (or has failed
for some other reason). Finally, one might want to include some
kind of check data to verify that the object retrieved is actually the
one that was previously cited. Another potential component, whose
rationale is much less clear, is the identity of an application that
knows how to interpret the stored object, and that should be launched in
conjunction with the arrival of a response from the server, or perhaps
spliced into the path between the client and the server.
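
To make the components just listed concrete, here is a minimal
sketch of a cross-reference record and of one way a client might
resolve it. Every name in it is hypothetical: the field names, the
name server modeled as a dictionary from service identifier to an
in-memory store, the search function that answers a backup query, and
the use of a SHA-256 digest as check data are all assumptions made
for illustration, not part of any proposal cited in this note. The
speculative "application identity" component is left out.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Optional


def digest(data: bytes) -> str:
    """Check data for an object; a SHA-256 hex digest is an assumption."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class CrossReference:
    service_id: str                      # presented to a name server
    object_id: str                       # recognized by the storage server
    expires: Optional[date] = None       # no guarantee after this date
    backup_query: Optional[str] = None   # used if direct lookup fails
    check_digest: Optional[str] = None   # verifies the retrieved object

    def is_guaranteed(self, today: date) -> bool:
        """True while the server still promises to honor the reference."""
        return self.expires is None or today <= self.expires


# Hypothetical stand-in for a storage service: object_id -> contents.
Store = Dict[str, bytes]


def resolve(ref: CrossReference,
            name_server: Dict[str, Store],
            search: Callable[[str], Optional[bytes]]) -> bytes:
    """Try the direct route first, then the backup query, then verify."""
    store = name_server.get(ref.service_id)
    data = store.get(ref.object_id) if store is not None else None
    if data is None and ref.backup_query is not None:
        data = search(ref.backup_query)
    if data is None:
        raise LookupError(f"cannot resolve {ref.service_id}/{ref.object_id}")
    if ref.check_digest is not None and digest(data) != ref.check_digest:
        raise ValueError("retrieved object does not match the cited one")
    return data
```

In this sketch an expired reference is still tried directly, since
expiration only withdraws the server's guarantee. A strict digest
comparison also rejects a legitimately updated document, which is one
reason the right form of check data remains an open question.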

The details of how such a cross-reference might be engineered, so that
it can be stored, passed from client to client, and in the end be
recognizable by the server, are an interesting design challenge. Several
projects have run up against the challenge, and have suggested various
strategies that solve parts of the problem. At least six somewhat
different ideas are extant:

•	Tim Berners-Lee has proposed the cross-reference scheme used in his
World-Wide Web. This proposal is moderately complete, but it
concentrates mostly on developing a syntax that can be parsed by a
computer and also read by a person. It takes the view that it should
be possible to create a document identifier that is both unique and
perpetually valid.

•	Clifford Lynch, to stimulate discussion of the topic within the
Coalition for Networked Information, developed a list of requirements
and, for each, some observations about mechanics that might address
that requirement.

•	Brewster Kahle has proposed the cross-reference scheme used in his
Wide Area Information Servers (WAIS) system.

•	F. H. Ayres has made a proposal for a universal standard book number.

•	Theodor Nelson, for the Xanadu® hypertext system, proposed a
universal, hierarchical document numbering scheme with provision for
versions and internal anchor points. It covers several of the
requirements mentioned earlier by assuming complete homogeneity among
the linked items.

•	Apple Computer, in the System 7 Alias Manager for the Macintosh, has
worked out a sophisticated system for linking files within a Macintosh
and across a network of cooperating machines. The Alias Manager uses
a combination of symbolic relative and absolute path names as well as
unique file, volume, and system identifiers, to maintain links in the
face of renaming, hierarchy restructuring, and restoration from
backup.

Each proposal addresses one or more parts of the problem, but none
covers the entire range of requirements. More important, the discussions
by Lynch and by Kahle are characterized by mentions of requirements not
met, and possible alternative approaches, together with questions about
whether or not the requirements are real. Interestingly, almost as if to
recursively emphasize the need for a solution to this problem, most of
the current literature on the subject is not found in traditional
journals, reports, or libraries, but rather is found only on-line in
various repositories within the internet.



Conclusion

As mentioned at the outset, this note has only described the problem,
not proposed any solutions; even this description is not likely to
resemble the one that will someday seem obvious.



Acknowledgements

Ideas and suggestions came from discussions with Mitchell Charity,
Jim O’Toole, Mark Day, Tim Berners-Lee, Andrew Birrell, Dave Redell, Paul
McJones, and Ron Weiss.



Bibliography

Tim Berners-Lee, Jean-Francois Groff, and Robert Cailliau. “Universal
Document Identifiers on the Network.” CERN, February 1992. On-line
location: info.cern.ch:/pub/www/doc/udi1.ps

Clifford Lynch. “Workshop on ID and Reference Structures for
Networked Information.” 24 October 1991. (Call for participation. On-
line location: CNI-ARCH listserv mailing list at uccvma.bitnet. Also
found in WAIS-discussion digest #33, 27 November, 1991.)

Brewster Kahle. “Document Identifiers, or International Standard Book
Numbers for the Electronic Age.” Version 2.2, September 1991. On-line
location: quake.think.com:/pub/wais/doc/doc-ids.txt

F. H. Ayres. “The Universal Standard Book Number (USBN): why, how and
a progress report.” Program: Automated Library and Information Systems
10, 2, pp. 75-80. (London: Association for Information Management: April,
1976)

Theodor H. Nelson. Literary Machines, Edition 87.1. (San Antonio,
Texas: Theodor H. Nelson: 1987)

Apple Computer.  “Alias Manager,” Inside Macintosh Volume VI, chapter
27. (New York: Addison-Wesley: 1991)