
6.033--Computer System Engineering

Suggestions for classroom discussion of:

Tim Berners-Lee, Robert Cailliau, Ari Luotonen, Henrik Frystyk Nielsen, and Arthur Secret. The world-wide web. Communications of the ACM 37, 8 (August, 1994) pages 76-82.

by J. H. Saltzer, March 17, 1996

1.  Who wrote this paper? What can we learn by studying the citations,
acknowledgement, and general environment surrounding this paper?

(authors' affiliations:  CERN, a high-energy physics outfit, not a
computer research or development organization.  citations:  none?
Actually, there are some citations buried in the glossary.
Acknowledgement:  none!)

2.  Given this background and environment, what might we expect when
compared with papers written, say, by authors from DEC SRC?

(a.  An unexpected viewpoint, orthogonal thinking, novel ideas that
might not have been thought of by people who have spent years learning
about computer systems from people who have spent years working with
computer systems.

b.  Very little contribution from the world of computer systems and
computer science; reinvention of well-known mistakes.)

3.  So where are the examples of novel ideas?
    (see item 5 below)
    (the integrating and encompassing strategy of having a protocol
    specified as part of the URL [http, ftp, telnet, gopher, etc.] is not
    only novel, but one of the things that gives the WWW its power.  One
    browser can integrate all of the previous mechanisms.)
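
A minimal sketch of that integrating idea, assuming a browser that keeps a
table of retrieval mechanisms keyed by the protocol part of the URL.  The
table contents and the function name are hypothetical, chosen only for
illustration.

```python
from urllib.parse import urlsplit

# Hypothetical handler table: one entry per access protocol named in the text.
HANDLERS = {
    "http": "fetch with HTTP",
    "ftp": "fetch with FTP",
    "telnet": "open a Telnet session",
    "gopher": "fetch with Gopher",
}

def choose_handler(url):
    """Pick a retrieval mechanism from the protocol part of the URL."""
    scheme = urlsplit(url).scheme.lower()
    return HANDLERS.get(scheme, "unsupported protocol")

# example.org is a placeholder host.
print(choose_handler("ftp://example.org/pub/readme.txt"))   # fetch with FTP
print(choose_handler("gopher://example.org/1/"))            # fetch with Gopher
```

Because the protocol travels inside the name, supporting one more mechanism
is one more table entry rather than one more browser.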

4.  And how about some examples of mistake reinvention?  (See 8, below.)

5.  There have been several web-like ideas developed over the last few
years.  For example, gopher.  How does WWW compare with gopher?

    -  Gopher has architected directories, like a file system.
    Information-storing objects can't contain links.  It looks like it
    was designed by someone who was too familiar with file systems and
    couldn't break out of that way of thinking.  Gopher is what you get
    if you ask a collection of Unix gurus to design a world-wide web.

    -  WWW allows any document to serve both as an information repository
    and as a directory with links.  This isn't the model of the usual
    file system, so a computer systems person probably wouldn't have
    thought of it.

6.  Two ideas, one mechanism:

     -  organize a single document as a set of linked pages.
     -  link independent documents.

7.  What is a *stateless* protocol?  Why does it matter?

8.  Closures in the World-Wide Web.  Review design project 1, Spring
1995 (handout 9), and the solutions to design project 1 (handout 22).
The following paragraphs are mildly revised versions of paragraphs from
the solution handout.

Technical description of the problem, in 6.033 terms:

The Web browser supplies an implicit closure for relative names (also
called "partial URL's") found in Web pages. The implicit closure it
supplies is simply the URL that the browser used to retrieve the page
that contained the relative name, truncated back to the last slash
character. This closure is the name of a directory at the server that
should be used to resolve the (first component of) the relative name.
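
The truncation rule just described can be sketched as follows.  This is an
illustration only: the host and path names are hypothetical, and the sketch
ignores details such as query strings.

```python
from urllib.parse import urljoin

def implicit_closure(page_url):
    """Truncate the URL of the retrieved page back to its last slash."""
    return page_url[: page_url.rfind("/") + 1]

def resolve(page_url, relative_name):
    """Resolve a relative name in the context of the implicit closure."""
    return urljoin(implicit_closure(page_url), relative_name)

# Hypothetical host and paths, for illustration only.
page = "http://www.example.org/docs/guide/intro.html"
print(implicit_closure(page))       # http://www.example.org/docs/guide/
print(resolve(page, "setup.html"))  # http://www.example.org/docs/guide/setup.html
```

Note that the closure is derived entirely from the name the browser happened
to use, not from anything the server says about where the page lives.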

Some servers provide a URL namespace by simply using the local (for
example, UNIX) file system namespace. When the local file system
namespace allows synonyms (symbolic links and NFS mounts are two
examples) for directory names, the mapping of local file system
namespace to URL namespace is not unique: There can thus be several
different URL's with different path names for the same object. Trouble
can arise when the object that has multiple URL's is a directory whose
name is used as a closure.

Example:  suppose that file B.html contains the web link {A HREF =
"C.html"}. Both B.html and C.html are stored in the directory /real/.
Suppose further that the browser obtained B.html by requesting the URL

where "pages" is a directory that doesn't actually contain B.html but
instead has a UNIX soft link named B.html that points to
/real/B.html. Since the file B.html does not arrive at the browser with
an accompanying closure, the browser provides an implicit closure by
truncating the original URL to obtain:

and then uses this as a context to retrieve the web link by

This URL will probably produce a "not found" response.  (Or worse,
return a different file that happens to be named C.html.  The confusion
may be compounded if the different file with the same name turns out to
be an out-of-date copy of the current C.html.)
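
The failure in this example can be reproduced on a UNIX-like system with a
small sketch.  It assumes a server that maps URL path names directly onto
files under a document root.  The host name www.example.org and the helper
names are hypothetical; the directory and file names follow the example
above.

```python
import os
import tempfile
from urllib.parse import urljoin, urlsplit

def build_tree(root):
    """Create real/B.html, real/C.html and a soft link pages/B.html."""
    os.makedirs(os.path.join(root, "real"))
    os.makedirs(os.path.join(root, "pages"))
    for name in ("B.html", "C.html"):
        with open(os.path.join(root, "real", name), "w") as f:
            f.write(name)
    os.symlink(os.path.join(root, "real", "B.html"),
               os.path.join(root, "pages", "B.html"))

def served(root, url):
    """True if a server mapping URL paths onto files under root finds the file."""
    return os.path.exists(os.path.join(root, urlsplit(url).path.lstrip("/")))

with tempfile.TemporaryDirectory() as root:
    build_tree(root)
    page_url = "http://www.example.org/pages/B.html"   # hypothetical URL
    link_url = urljoin(page_url, "C.html")             # uses the implicit closure
    page_found = served(root, page_url)                # found, through the soft link
    link_found = served(root, link_url)                # no C.html in pages/

print(link_url, page_found, link_found)
```

The page itself is retrieved without trouble, which is what makes the later
"not found" on its link surprising to both the reader and the page's author.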

Another problem can arise when interpreting the relative name "..". This
name is, conventionally, the name for the parent directory of the
current directory. UNIX provides a semantic interpretation: look up the
name ".." in the current directory, where it evaluates (in inode
namespace) to the parent directory. The Web, in contrast, specifies
that ".." is not a name to be looked up in some context, but rather a
syntactic signal to modify the implicit closure by discarding the least
significant component of the directory name. Despite these drastically
different interpretations of "..", the result is usually the same,
because the parent of an object is usually the thing named by the
next-earlier component of that object's path name. The exception (and
the problem) arises when the syntactic modification is applied to a URL
that contains a synonym for a directory name. If the path name of the
synonym does not come through the directory's parent, syntactic
interpretation provides an implicit closure different from the one that
would be supplied by semantic interpretation.
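
The two interpretations of ".." can be compared directly on a UNIX-like
system.  In this sketch the directory layout is hypothetical:
posixpath.normpath stands in for the Web's syntactic rule, and
os.path.realpath stands in for the UNIX lookup.

```python
import os
import posixpath
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = os.path.realpath(tmp)  # canonical form, so comparisons are exact
    # Hypothetical layout: real/docs is the directory, alias is a soft link to it.
    os.makedirs(os.path.join(root, "real", "docs"))
    os.symlink(os.path.join(root, "real", "docs"), os.path.join(root, "alias"))

    name = os.path.join(root, "alias", "..")

    # Web-style (syntactic): discard the least significant component.
    syntactic = posixpath.normpath(name)
    # UNIX-style (semantic): look up ".." in the directory actually reached.
    semantic = os.path.realpath(name)

    same_parent = (syntactic == semantic)
    syntactic_is_root = (syntactic == root)
    semantic_is_real = (semantic == os.path.join(root, "real"))

print(same_parent)   # False: the synonym does not come through the parent
```

Reaching the same directory by its real path name instead of the synonym
would make the two results agree, which is the usual case described above.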

The problem can be fixed in at least three fundamentally different ways:

1. Arrange things so that the current implicit closure always works.
    -  forbid use of UNIX links, or require use of complete link farms.
    -  forbid use of ".." in web links.

2. Do a better job of choosing an implicit closure.
    -  client sends the original URL plus the link to the server and
       lets the server work out how to resolve it.

3. Provide an explicit closure.
    -  Server fills in "location" field of header with an absolute URL.
    -  Client uses that URL as the closure.
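
A minimal sketch of the third fix, seen from the client side.  It assumes
the response headers are available as a dictionary and that the field is
named "location" as in the list above; the URLs and function names are
hypothetical.

```python
from urllib.parse import urljoin

def closure_for(request_url, headers):
    """Prefer a server-supplied absolute URL; fall back to the request URL."""
    # Assumption: the header field is called "location", as in fix 3.
    return headers.get("location", request_url)

def resolve_link(request_url, headers, relative_name):
    """Resolve a relative name against the explicit closure if there is one."""
    return urljoin(closure_for(request_url, headers), relative_name)

# Hypothetical URLs: the page was requested through a soft link in /pages/,
# and the server reports where the page actually lives.
requested = "http://www.example.org/pages/B.html"
headers = {"location": "http://www.example.org/real/B.html"}

with_header = resolve_link(requested, headers, "C.html")
without_header = resolve_link(requested, {}, "C.html")
print(with_header)     # http://www.example.org/real/C.html
print(without_header)  # http://www.example.org/pages/C.html
```

With the explicit closure the link resolves into /real/, where C.html is
stored; without it the client falls back to the implicit closure and the
original problem returns.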

A misleading characterization of the problem:

One might suggest that the implementor of the server (or the writer of
the pages containing the relative links) failed to take heed of the
warning in the Web URL specifications for path names that "The
similarity to unix and other disk operating system filename conventions
should be taken as purely coincidental, and should not be taken to
indicate that URIs should be interpreted as file names." (Tim
Berners-Lee, Universal Resource Identifiers: Recommendations.) That
suggestion, however, is misleading.

Unfortunately, the problem is built into the Web naming specifications.
Those specifications require that relative names be interpreted
syntactically, yet they do not require that every object have a unique
URL. Unambiguous syntactic interpretation of relative names requires
that the closure consist of a unique path name. Since the browser
derives the closure from the path name of the object that contained the
relative name, and that object's path name does not have to be unique,
it follows that syntactic interpretation of relative names will
intrinsically be ambiguous. When servers try to map URL path names to
UNIX path names, which are not unique, they are better characterized as
exposing, rather than causing, the problem.

That analysis suggests that one way to conquer the problem is to change
the way in which the browser acquires the closure. If the browser could
somehow obtain a canonical path name for the closure, the same
canonical path name that the UNIX system uses to reach the directory
from the root, the problem would vanish.
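
One way such a canonical path name might be computed is sketched below, on
the assumption that the server does the work and that URL path names map
directly onto files under a document root.  The function name and directory
layout are hypothetical, and the sketch does not say how the result would
reach the browser.

```python
import os
import tempfile

def canonical_url_path(doc_root, url_path):
    """Map a URL path to the canonical path the file system uses for it."""
    doc_root = os.path.realpath(doc_root)
    target = os.path.realpath(os.path.join(doc_root, url_path.lstrip("/")))
    # Assumption: the canonical file stays inside the document root.
    return "/" + os.path.relpath(target, doc_root).replace(os.sep, "/")

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "real"))
    os.makedirs(os.path.join(root, "pages"))
    open(os.path.join(root, "real", "B.html"), "w").close()
    os.symlink(os.path.join(root, "real", "B.html"),
               os.path.join(root, "pages", "B.html"))
    via_link = canonical_url_path(root, "/pages/B.html")
    direct = canonical_url_path(root, "/real/B.html")

print(via_link, direct)   # /real/B.html /real/B.html
```

Both names for the page collapse to the same canonical path, so a closure
derived from it is the same no matter which synonym the browser used.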

If this description seems mysterious, check the design project handouts
and the encounter page that demonstrates the problem.  The second
handout also has four pages of solutions collected from some 50 student
design projects, some quite novel.
