A partly read-only portable web-site
Edward Kogan
March 20, 1997
6.033 Recitation 11
Abstract
A design of a portable modifiable hypertext network is presented. Parts
of the network may be located on read-only media, yet documents can be
freely added to it, deleted or updated. The database can be easily
copied, and out-of-date copies updated from the master record. It is
assumed that the hypertext network is contained on a single host and
contains no links to outside documents. Several solutions are
considered, with different tradeoffs between space and time overheads,
and a preferred solution is chosen based on the assumed characteristics
of the computing environment.
1.0 Introduction
With the growth of the World Wide Web, hypertext media such as HTML have
become very widespread. But their utility need not be limited to Internet
applications. Even if a site is not connected to the Internet, a web
browser can be used to view documents on a local disk. Hypertext is a
convenient organization of knowledge for human consumption, since only
the currently relevant part of the information needs to be on display at
any time, and related pieces of data which do not fit well into the
linear presentation of plain text can be connected directly with
hyperlinks. In addition, if the data is represented in HTML, then any
browser can serve as a powerful graphical interface to it without any
extra effort on the part of the content provider. For these reasons, it
is convenient to use HTML technology to encode data even if it is not
intended, or not primarily intended, for Internet distribution.
The target use of the portability techniques discussed in this paper is
an Egyptology database containing HTML files, images and sound
files. The author of the database, when going on field expeditions to
digs in Egypt, would like to carry it with him. Since there are no
computer networks on archaeological digs, the database has to be
physically present on the author's computer, and, since the database is
quite large, the only practical way of doing so is to carry it on
read-only CD-ROM media. Although CD-ROMs are read-only, it would be
convenient if the archaeologist could add working notes, update old data
and delete obsolete data. He should also be able to give extra copies of
his database to any interested colleagues, with any of his field
updates attached. There should exist mechanisms to keep all of these
copies (the master copy on the home computer, the author's local copy
and several users' copies) reasonably up to date. The main constraint on
the design is that a standard web browser is to be used as the front end
to any copy of the database, regardless of any changes to it, presumably
because our users do not have the resources to implement their own
customized web client.
The situation described above can be restated as a special case of a
generic groupware application. There exists a shared database
with an author and a number of viewers, with the author
being the only one authorized to modify the database. As a possible
extension of this special case, we can have several authors for the
database, but that is beyond the scope of this paper. The database has
several copies: the master copy, the author's modifiable local
copy and the viewers' read-only local copies. Contact between these
copies may be intermittent, so that some copies may become outdated with
time and will need to be updated from the more recent ones. Since there
is only one author in our case, there will always be a single most
recent copy of the database, which simplifies matters considerably. Another
complication is that, although the author's copy is logically fully
modifiable, and viewers' copies are logically read-only, physically they
are composed of both read-only and writable storage media.
2.0 Design Considerations
The master copy is located on a stationary computer, which has a large
enough file system to keep it completely on disk and enough computing
power to act as the web server for it. It will be assumed to
have no hardware limitations in processor speed, storage size or
network connectivity, within reasonable bounds. The portable computer of
the author, as well as the equipment available to the viewers, is more
limited. All of them have a CD-ROM drive and enough computing power to
run a web browser. This leaves a large amount of uncertainty about the
speed and memory size of the computer, since a web browser may range
from something as simple as a text-mode interface to something as
complex and memory-hungry as Netscape Communicator. Still, it is
probably safe to assume that the client's computer has around half a
dozen megabytes free on its hard drive and the ability to run several
processes simultaneously; for example, a local copy of an http server
and a web browser. A greater limiting factor is network
connectivity. It is unavailable most of the time, and even if a computer
does manage to connect to the Internet, the throughput of the connection
will be very low. Minimizing network communication is therefore a
high-priority design goal.
2.2 Design goals
2.2.1 Simplicity
One of the prerequisites of the design is that a standard web browser
may serve as a front end to the database. There are two possible
reasons for this requirement. The first is that a customized front end
would be prohibitively expensive to develop. The second is that some
users of the database may be remote and so would not have access to a
customized front end. In the usage scenario given, the database's local
copy will not have any remote users, so the stumbling block must be the
complexity of implementing a custom web browser. In addition, the
problem specification implies that a single person should be able to do
the implementation. The conclusion is that simplicity of the resulting
system is an important design goal. Correspondingly, a solution should
not require rewriting or modifying a major application or part of the
operating system. A solution needing a custom editor or a special file
system is not acceptable. Ideally, only a small collection of utility
programs or scripts should be needed.
The system should be hidden away from the user as much as possible.
Ideally, the user should not even be aware that some of the database is
located on CD-ROM, and in any case, she shouldn't have to memorize
dozens of steps which must be performed exactly and in sequence just to
change anything in the database.
2.2.3 Performance and Efficiency
As noted in Section 2.1, the
network bandwidth should be conserved as much as possible. The time to
access a document from the local copy of the database should be at most
several times the time required to simply read a file from the CD-ROM.
In addition, the database retrieval system should not overload the
computer with running processes. For example, older computers, such as
those many Egyptian archaeologists might have, would be hard pressed to
run a modern web browser, and would not be able to run any other programs
concurrently. While disk space is assumed to be the least scarce of
the computing resources, it should still be used as efficiently as
possible.
2.2.4 Portability
The viewers' computers might run many different systems: MS-DOS, Mac
OS, different flavors of Windows or UNIX. It would be desirable if the
portable database system worked under all of these. There is a
standard CD format which is mountable on all of the above-mentioned
operating systems, so data incompatibility is not a problem. There
should be versions of the database system included in the distribution
for as many operating systems as practical, with utilities to copy
updated data between different platforms. Portability encourages
simplicity, since a simpler system is much easier to port than a complex
one. If implementing some design of the portable web site includes
modifying some aspect of the operating system, then copying the
installation to a viewer's computer would include reinstalling or
reconfiguring the operating system on it, which is sure to be
impractical. Thus portability precludes operating system or file system
hacks as parts of the database implementation.
2.2.5 Easy Modification
As already mentioned in Section 2.2.2, it should be relatively
effortless to modify the database. In addition, the system should be
able to handle a large amount of new material without seriously
degrading in performance.
2.2.6 Easy Update and Copying
It should be easy to copy the updated part of the database from one
computer to another. Several methods should be made available to do
this, and at least some method should work reasonably efficiently in any
situation. Whenever owners of two copies of the database meet, it should
be possible for the owner of the older copy to update the state of his
database to that of the newer copy. Ideally, it would not matter
whether the author is meeting with one of the viewers, the author or
a viewer has managed to connect to the master copy, or two viewers
have met with one having a later version of the files.
2.3 Operations supported
The following operations need to be performed efficiently by the
partially read-only portable web site:
- Add, Delete, Change, Get
- These are operations on documents. They respectively add a new document
to the hypertext database, delete or change an existing one, or deliver
the contents of a document.
- Copy
- Copy operates on the whole database. It is invoked when the author
gives a CD-ROM containing the read-only part of the data to a viewer.
It copies the update records and the necessary binaries from the hard
drive of the author's computer to the hard drive of the viewer's
computer, although the binaries do not change and could also be put on
the CD-ROM. Between identical platforms, copy could be
implemented by third-party file transfer programs using any one of
several methods: backing up the update files to floppy disks or a tape
backup drive and then restoring them to the viewer's computer, or
connecting the two computers with a serial cable and using it for file
transfer. The best option depends on the hardware available to the
viewer. Transferring between different systems would be trickier, since
some possible designs have to be implemented differently depending on
which features are supported by the operating systems.
- Update Remote
- The update remote operation is invoked when the author gets a network
connection to the master copy and transfers the updates to the master
copy. It consists of the following steps:
- (a) establish connection
- (b) authenticate author identity
- (c) get all updated objects
- (d) transmit updated objects
- (e) last_update_time = now
Since the network connection is assumed to be very slow, only patches
("diffs") describing the changes made since the last update need to be
transmitted, instead of entire files. Since the network
connection may crash before step (d) is finished, some updates may be
retransmitted, so the master copy must be prepared to receive spurious
updates. Note that the get updated objects operation must also return
some record of deletions that took place.
- Update Local
- This operation is in some sense the inverse of update remote. It is used
when a viewer obtains a network connection and would like to update its
local copy from the master copy of the database. The viewer
transmits the list of the files it has with their last modification
times, and the master copy replies with patches to the files that
have changed since then, together with notifications of file deletions
and additions to the database.
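The two update operations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names make_patch, apply_patch, MasterCopy and plan_update_local are hypothetical, documents are modelled as lists of lines, and the master copy recognizes a spurious retransmission by comparing modification times, which is one possible way of tolerating repeats and not something the design prescribes.

```python
from difflib import SequenceMatcher


def make_patch(old, new):
    """Describe new relative to old as (start, end, replacement) edits on old's lines."""
    ops = SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    return [(i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_patch(old, patch):
    """Rebuild the new line list from old plus the edits produced by make_patch."""
    out, pos = [], 0
    for start, end, replacement in patch:
        out.extend(old[pos:start])   # unchanged lines before this edit
        out.extend(replacement)      # lines that replace old[start:end]
        pos = end
    out.extend(old[pos:])
    return out


class MasterCopy:
    """Hypothetical receiving side of update remote."""

    def __init__(self, files):
        self.files = dict(files)  # path -> list of lines
        self.mtimes = {}          # path -> modification time of the last applied update

    def receive(self, path, mtime, patch):
        """Apply one update; patch=None is a deletion record. Returns False for a repeat."""
        if mtime <= self.mtimes.get(path, float("-inf")):
            return False  # spurious retransmission after a dropped connection
        if patch is None:
            self.files.pop(path, None)
        else:
            self.files[path] = apply_patch(self.files.get(path, []), patch)
        self.mtimes[path] = mtime
        return True


def plan_update_local(viewer_mtimes, master_mtimes):
    """Decide what the master copy sends to a viewer (both arguments: path -> mtime)."""
    return {
        "patch": sorted(p for p, t in master_mtimes.items()
                        if p in viewer_mtimes and t > viewer_mtimes[p]),
        "add": sorted(p for p in master_mtimes if p not in viewer_mtimes),
        "delete": sorted(p for p in viewer_mtimes if p not in master_mtimes),
    }


old = ["<h1>KV62</h1>", "<p>Sealed.</p>"]
new = ["<h1>KV62</h1>", "<p>Opened in 1922.</p>", "<p>See field notes.</p>"]
master = MasterCopy({"kv62.html": old, "old.html": ["<p>obsolete</p>"]})
updates = [("kv62.html", 100.0, make_patch(old, new)), ("old.html", 101.0, None)]
first = [master.receive(*u) for u in updates]   # the normal upload
retry = [master.receive(*u) for u in updates]   # the same upload sent again
plan = plan_update_local({"a.html": 5, "b.html": 5, "gone.html": 1},
                         {"a.html": 9, "b.html": 5, "c.html": 7})
```

Because the author advances last_update_time only after every object has been transmitted, a dropped connection simply causes the whole batch to be sent again, and the modification time check lets the master copy discard the repeats.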
3.0 Design analysis
3.1 The URL to Object Mapping
Every object in the Egyptology database can have links pointing to it
from other documents and from it to other documents. If that object is
changed, then the old copy will remain on the CD-ROM, and the new,
updated copy needs to be stored elsewhere. The database system must
somehow redirect links going to the old copy of the data to point to the
new copy and make sure that links from the new document still point to the
correct objects. Of course, the database files containing the other
links can't be modified, since they themselves are located on the
CD-ROM. The links need to be redirected in such a way that a standard web
browser would still be able to follow them correctly to and from new data. To
understand how these links could be modified, we first need to
understand how hyperlinks are usually resolved. This understanding will
show several alternative methods of attack on the problem.
FIGURE 1. Steps in translation of URLs to objects
A hyperlink is stored in an HTML page in the form of an anchor tag <a
href=URL>, where URL is either absolute or relative. A
relative URL is resolved by the web browser using one of two methods:
- if a <BASE HREF=BASE_URL> tag is present in the head
of the page, then BASE_URL is used to resolve relative URLs
- otherwise the URL of the current page is used to resolve relative
URLs
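The two resolution rules can be tried out with Python's standard urljoin function. This is a minimal sketch; the helper name resolve_link and the example URLs are assumptions made for illustration only.

```python
from urllib.parse import urljoin


def resolve_link(page_url, href, base_href=None):
    """Resolve href against BASE HREF when the page has one, else against the page URL."""
    return urljoin(base_href if base_href is not None else page_url, href)


page = "http://localhost/egypt/tombs/index.html"

# No BASE tag: the URL of the current page is the reference point.
without_base = resolve_link(page, "kv62.html")

# With a BASE tag: the same relative link lands somewhere else entirely.
with_base = resolve_link(page, "kv62.html", base_href="http://localhost/mirror/tombs/")

# Parent-directory references are resolved the same way.
up_one = resolve_link(page, "../images/mask.gif")
```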
The resulting absolute URL can specify one of several methods of
access. If the URL is in the file domain, then it contains the absolute
pathname of the file wanted, and the request for the file is sent by the
web browser directly to the file system.(1)
If the resulting URL is in the http domain then there is an extra level of
indirection.(2) The web
browser contacts the http server on the specified host and port and
gives it the URL. The server maps the URL to a filename on its host and
then passes it to its own file system.(3)
The file system then maps the filename to the file,
which is passed back up through the layers.
If all the URLs inside the database are relative, then the way the web
browser accesses the pages will have the following property: if the top
level page was accessed using the file domain, then all accesses will
use the file method, and if the top level page was accessed through the
http server, all later accesses will go through that http server.
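This property is easy to check mechanically. The sketch below follows one chain of relative links from two different starting points; the function name follow, the paths and the port number are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def follow(top_url, hrefs):
    """Follow a chain of relative links from top_url; return every URL visited."""
    visited = [top_url]
    for href in hrefs:
        visited.append(urljoin(visited[-1], href))
    return visited


links = ["tombs/index.html", "kv62.html", "../images/mask.gif"]

# Same links, two different entry points into the database.
via_file = follow("file://localhost/cdrom/egypt/index.html", links)
via_http = follow("http://localhost:8080/egypt/index.html", links)

# Every URL in a chain keeps the access method of its top-level page.
file_schemes = {urlsplit(u).scheme for u in via_file}
http_schemes = {urlsplit(u).scheme for u in via_http}
```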
There are two common misconceptions about http server operation
that are relevant to the portable database project. The first is that a
web client cannot communicate with an http server on the same host, or,
if it can, that it needs a working network connection to do so. In fact,
if the operating system allows several processes to execute
simultaneously, the client can connect to a server running on the same
machine without needing access to the network.(4) The
second is that a web server must map URLs into correspondingly named
files, perhaps changing the starting directory, e.g. that a URL
http://localhost/pub/file1.html must map to some file named
file1.html in the file system. In fact, the mapping can be completely
arbitrary.
Above, we saw that a locally-running http server can provide the extra
indirection in mapping URLs to objects needed for our application, even
in the absence of a network connection. In addition, if the top level
page of the database is accessed through this server, and all the links
are relative, then all the requests will go through the web server,
guaranteeing the needed level of indirection.
Another, less straightforward way of guaranteeing indirection is to
modify the file system's mapping of filenames to
files. Most modern file systems allow symlinks, a mechanism by which
a single file can be named by several different filenames. This
mechanism is called symbolic links in UNIX systems and shortcuts in
Windows NT/95 (see [3] for further detail). If all requests to a file on
CD-ROM go through a symlink stored on the hard drive, then by modifying
the symlink, we can effectively modify the read-only file. If the file
system does not support symlinks, as is the case for the MS-DOS file
system, then the HTML 3.2 standard provides an alternative which can
accomplish the same purpose. If an HTML page includes the following tag:
<META HTTP-EQUIV="refresh" CONTENT="n; URL=REAL_URL">
then the web browser will automatically load REAL_URL after n seconds.
If n is set to 0, then using this tag is equivalent to symlinking at
the HTML level.
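Generating such a stand-in page is a one-function job. In the sketch below the name redirect_stub is hypothetical, and the visible fallback link in the body is an addition for browsers that ignore the tag; it is not part of the mechanism described above.

```python
from html import escape


def redirect_stub(real_url, delay=0):
    """Return an HTML page that sends the browser on to real_url after delay seconds."""
    target = escape(real_url, quote=True)
    return (
        "<HTML><HEAD>\n"
        f'<META HTTP-EQUIV="refresh" CONTENT="{delay}; URL={target}">\n'
        "</HEAD><BODY>\n"
        # Assumed extra: a manual link in case the refresh is not honored.
        f'<A HREF="{target}">This document has moved.</A>\n'
        "</BODY></HTML>\n"
    )


stub = redirect_stub("file://localhost/cdrom/tombs/kv62.html")
```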
3.2 A Solution Modifying the Filename to Object Mapping
In this solution, a directory tree of symlinks is kept on the hard
drive, exactly mirroring the CD-ROM directory tree. Initially, all the
symlinks point to the corresponding files on the CD-ROM. If a file is
added, it is written in the appropriate mirror directory. If a
file is changed, the new contents are written in the mirror
directory in place of the symlink. If a file is deleted, its
entry in the mirror file system, be it a symlink or a regular file, is
deleted. To get a file, the web browser uses the file method to
request files in the mirror hierarchy from the file system. If the
document has not been changed, then its entry in the mirror file system
will be a symlink pointing to the correct file on the CD-ROM and it will
be followed automatically by the file system. If the document has been
changed or it is a newly added document, then the file system will
locate a regular file with the up-to-date data in the correct spot. If
the document has been deleted from the database, there will be nothing
in its spot in the mirror hierarchy and the file system will return a
`File not Found' error if asked to get it. To get all updated
files, which is needed for the update remote operation, a program would
have to walk the CD-ROM and disk mirror trees in parallel and list the
discrepancies. The copy operation can be done by third-party
software, unless one wishes to copy from a file system which supports
symlinks to one which does not, in which case it is necessary to convert
symlink files into HTML pages with the META refresh tag.
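The utility programs this solution needs are small. The following Python sketch assumes a file system with symlink support; the function names and the example file names are hypothetical, and temporary directories stand in for the CD-ROM and the hard drive.

```python
import os
import tempfile


def build_mirror(cdrom_root, mirror_root):
    """Create one symlink under mirror_root for every file under cdrom_root."""
    for dirpath, _dirnames, filenames in os.walk(cdrom_root):
        rel = os.path.relpath(dirpath, cdrom_root)
        target_dir = os.path.normpath(os.path.join(mirror_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            link = os.path.join(target_dir, name)
            if not os.path.lexists(link):
                os.symlink(os.path.abspath(os.path.join(dirpath, name)), link)


def write_file(mirror_root, rel_path, data):
    """Add or change a document: a regular file takes the place of any symlink."""
    path = os.path.join(mirror_root, rel_path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if os.path.islink(path):
        os.remove(path)  # drop the link first so the write never follows it to the CD-ROM
    with open(path, "wb") as f:
        f.write(data)


def delete_file(mirror_root, rel_path):
    """Delete a document: remove its entry, symlink or regular file, from the mirror."""
    os.remove(os.path.join(mirror_root, rel_path))


def list_updates(cdrom_root, mirror_root):
    """Compare the two trees; return (changed_or_added, deleted) relative paths."""
    def files(root, keep):
        found = set()
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                if keep(full):
                    found.add(os.path.relpath(full, root))
        return found

    on_cd = files(cdrom_root, lambda p: True)
    in_mirror = files(mirror_root, lambda p: True)
    regular = files(mirror_root, lambda p: not os.path.islink(p))
    return sorted(regular), sorted(on_cd - in_mirror)


with tempfile.TemporaryDirectory() as tmp:
    cd, mirror = os.path.join(tmp, "cdrom"), os.path.join(tmp, "mirror")
    os.makedirs(os.path.join(cd, "tombs"))
    for rel in ("index.html", "tombs/kv62.html", "tombs/old.html"):
        with open(os.path.join(cd, rel), "w") as f:
            f.write("original " + rel)
    build_mirror(cd, mirror)
    write_file(mirror, "tombs/kv62.html", b"revised")       # change
    write_file(mirror, "notes/day1.html", b"field notes")   # add
    delete_file(mirror, "tombs/old.html")                    # delete
    with open(os.path.join(mirror, "index.html")) as f:
        untouched = f.read()                                 # read through the symlink
    changed, deleted = list_updates(cd, mirror)
```

Regular files in the mirror are exactly the changed or added documents, and CD-ROM files with no mirror entry are the deleted ones, so the list needed by update remote falls out of a plain comparison of the two trees.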
3.3 A Solution Modifying the URL to Filename Mapping
An alternative solution is to use a local http server to achieve
indirection. The server would need to keep a table mapping URLs to
filenames. Each slot of the table would also record the last
modification time of the document and an archive bit, telling whether
it has been uploaded to the master copy. Initially, the table would be
empty, indicating a default one-to-one mapping. When a document is
added or changed, the new data is written to the
server's data directory and the mapping from the URL to the data is
recorded in the mapping table. If a document is deleted,
then a delete mark is recorded in the slot for the document's URL.
During copy, all that needs to be copied is the server mapping
table and the server data directory. During update remote, the
server mapping table is consulted: if some filled slot in the table has
its archive bit equal to zero, the file has been changed since
the last upload to the master copy and needs to be uploaded now. For
each file, the archive bit is set after a successful upload.
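The mapping table itself can be sketched independently of any particular http server. In the Python below, the class and method names, the directory names and the data file name 0001.html are all assumptions made for illustration; a real server would call resolve for each request and answer `File not Found' when it returns None.

```python
import posixpath
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Slot:
    filename: Optional[str]   # file in the server data directory; None is a delete mark
    mtime: float              # last modification time of the document
    archived: bool = False    # archive bit: True once uploaded to the master copy


@dataclass
class MappingTable:
    cdrom_root: str
    data_dir: str
    slots: Dict[str, Slot] = field(default_factory=dict)

    def put(self, url_path: str, filename: Optional[str]) -> None:
        """Record an add or change (filename given) or a delete mark (filename=None)."""
        self.slots[url_path] = Slot(filename, time.time())

    def resolve(self, url_path: str) -> Optional[str]:
        """Map a URL path to a file path, or None if the document was deleted."""
        slot = self.slots.get(url_path)
        if slot is None:
            # Empty slot: default one-to-one mapping onto the CD-ROM.
            return posixpath.join(self.cdrom_root, url_path.lstrip("/"))
        if slot.filename is None:
            return None
        return posixpath.join(self.data_dir, slot.filename)

    def pending(self) -> List[str]:
        """URLs whose archive bit is still zero, i.e. not yet uploaded."""
        return sorted(url for url, slot in self.slots.items() if not slot.archived)

    def mark_uploaded(self, url_path: str) -> None:
        self.slots[url_path].archived = True


table = MappingTable(cdrom_root="/cdrom/egypt", data_dir="/var/egypt/data")
table.put("/tombs/kv62.html", "0001.html")   # changed document, arbitrary file name
table.put("/tombs/old.html", None)           # deleted document

unchanged = table.resolve("/index.html")
changed = table.resolve("/tombs/kv62.html")
removed = table.resolve("/tombs/old.html")
before_upload = table.pending()
table.mark_uploaded("/tombs/kv62.html")
after_upload = table.pending()
```

Writing a document again creates a fresh slot with the archive bit cleared, so a file changed after an upload is picked up by the next update remote.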
3.4 Comparison of the two solutions
Both of the solutions presented above have similar degrees of user
transparency. Both of them provide for easy modification of documents
and easy updating and sharing of the database. The file system solution
is less complex than the web server solution, since the latter involves
modifying an http server, while the former uses only a couple of utility
programs and mainly relies on the services provided by the standard file
system software. On the other hand, the file system approach takes more
disk space, since it requires a full symlink mirror directory tree, while
the web server approach only uses space for data which is actually new
or changed. The web server approach is also more portable, since it
does not rely on the file system's support of symlinks to work. A big
disadvantage of the web server approach is that on older, weaker
computers, running a web server and a web client simultaneously would
overload the system. Also, even on more modern systems, the performance
of the file system method will be better, because it incurs the overhead
of following one symlink at worst, while the web server design involves a
full HTTP request and response exchange. This performance gap would become
more noticeable with large data files, such as high-resolution images or
movies.
4.0 Conclusion
Based on these considerations, I recommend the file system based
solution as the superior of the two. It provides for better
performance, is portable and exceedingly simple, since most of the
functionality it requires is already implemented by the symlink feature of
the OS's file system. This approach allows easy duplication and
updating of the changed material and hides the mechanisms of indirection
from the user. It is an extremely economical solution for the task at
hand.
5.0 References
- [1] Wilbur - HTML 3.2
- (http://www.htmlhelp.com/reference/wilbur/)
- [2] Jerome H. Saltzer. "Name Binding in Computer Systems",
- Section 5 of Engineering of Computer Systems, MIT EECS Dept.
- [3] Andrew S. Tanenbaum. Modern Operating Systems.
- Prentice-Hall, 1992.
- [4] Design Project #1: A Partly Read-Only Portable Web Site
- 6.033 Handout 16, Spring 97
Footnotes
- (1)
- That is, the file domain URL for file /foo/bar would be
file://localhost/foo/bar.
- (2)
- There are other URL domains, such as ftp, gopher, wais, etc., but
they are irrelevant to the current discussion.
- (3)
- The other possibilities, such as mapping a URL to output of a
program such as a CGI script, or mapping the URL to another URL on some
other host are not useful for the purposes of this discussion.
- (4)
- Some TCP/IP stack implementations do require that a working network
connection be present before they allow connections even to the local
host, but that is a defect of those implementations.