A partly read-only portable web-site

Edward Kogan

March 20, 1997
6.033 Recitation 11

Abstract

A design of a portable modifiable hypertext network is presented. Parts of the network may be located on read-only media, yet documents can be freely added to it, deleted or updated. The database can be easily copied, and out-of-date copies updated from the master record. It is assumed that the hypertext network is contained on a single host and contains no links to outside documents. Several solutions are considered, with different tradeoffs between space and time overheads, and a preferred solution is chosen based on the assumed characteristics of the computing environment.

1.0 Introduction

With the growth of the World Wide Web, hypertext media such as HTML have become very widespread. But their utility need not be limited to Internet applications. Even if a site is not connected to the Internet, a web browser can be used to view documents on a local disk. Hypertext is a convenient organization of knowledge for human consumption, since only the currently relevant part of the information need be on display at any time, and related pieces of data that do not fit well into the linear presentation of plain text can be connected directly with hyperlinks. In addition, if the data is represented in HTML, then any browser can serve as a powerful graphical interface to it without any extra effort on the part of the content provider. For these reasons, it is convenient to use HTML technology to encode data even if it is not intended, or not primarily intended, for Internet distribution.

The target use of the portability techniques discussed in this paper is an Egyptology database containing HTML files, images and sound files. The author, when going on field expeditions to digs in Egypt, would like to carry it with him. Since there are no computer networks on archaeological digs, the database has to be physically present on the author's computer, and, since the database is quite large, the only practical way of doing so is to carry it on read-only CD-ROM media. Despite the fact that CD-ROMs are read-only, it would be convenient if the archaeologist could add working notes, update old data and delete obsolete data. He should also be able to give extra copies of his database to any interested colleagues, with any of his field updates attached. There should exist mechanisms to keep all of these copies (the master copy on the home computer, the author's local copy and several users' copies) reasonably up to date. The main constraint on the design is that a standard web browser is to be used as the front end to any copy of the database, regardless of any changes to it, presumably since our users do not have the resources to implement their own customized web client.

The situation described above can be restated as a special case of a generic groupware application. There exists a shared database with an author and a number of viewers, with the author being the only one authorized to modify the database. As a possible extension of this special case, we can have several authors for the database, but that is beyond the scope of this paper. The database has several copies: the master copy, the author's modifiable local copy and the viewers' read-only local copies. Contact between these copies may be intermittent, so that some copies may become outdated with time and will need to be updated from the more recent ones. Since there is only one author in our case, there will always be a single most recent copy of the database, which simplifies matters considerably. Another complication is that, although the author's copy is logically fully modifiable and the viewers' copies are logically read-only, physically they are all composed of both read-only and writable storage media.

2.0 Design Considerations

2.1 Hardware availability

The master copy is located on a stationary computer, which has a large enough file system to keep it completely on disk and enough computing power to act as the web server for it. It will be assumed to have no hardware limitations in processor speed, storage size or network connectivity, within reasonable bounds. The portable computer of the author, as well as the equipment available to the viewers, is more limited. All of them have a CD-ROM drive and enough computing power to run a web browser. This leaves a large amount of uncertainty about the speed and memory size of the computer, since a web browser may range from something as simple as a text-mode interface to something as complex and memory-hogging as Netscape Communicator. Still, it is probably safe to assume that the client's computer has around half a dozen megabytes free on its hard drive and the ability to run several processes simultaneously; for example, a local copy of an http server and a web browser. A greater limiting factor is network connectivity. It is unavailable most of the time, and even if a computer does manage to connect to the Internet, the throughput of the connection will be very small. Minimizing network communication is therefore a high-priority design goal.

2.2 Design goals

2.2.1 Simplicity

One of the prerequisites of the design is that a standard web browser may serve as a front end to the database. There are two possible reasons for this requirement: either a customized front end would be prohibitively expensive to develop, or some users of the database may be remote and so would not have access to a customized front end. In the usage scenario given, the database's local copy will not have any remote users, so the stumbling block must be the complexity of implementing a custom web browser. In addition, the problem specification implies that a single person should be able to do the implementation. The conclusion is that simplicity of the resulting system is an important design goal. Correspondingly, a solution should not require rewriting or modifying a major application or part of the operating system. A solution needing a custom editor or a special file system is not acceptable. Ideally, only a small collection of utility programs or scripts should be needed.

2.2.2 Transparency

The system should be hidden away from the user as much as possible. Ideally, the user should not even be aware that some of the database is located on CD-ROM, and in any case, she shouldn't have to memorize dozens of steps that must be performed exactly and in the right order just to change anything in the database.

2.2.3 Performance and Efficiency

As noted in Section 2.1, network bandwidth should be conserved as much as possible. The time to access a document from the local copy of the database should be at most several times the time required to simply read a file from the CD-ROM. In addition, the database retrieval system should not overload the computer with running processes. For example, older computers, such as many Egyptian archaeologists might have, would be hard-pressed to run a modern web browser and would not be able to run any other programs concurrently. While disk space is assumed to be the least scarce of the computing resources, it should still be used as efficiently as possible.

2.2.4 Portability

The viewers' computers might run many different systems: MS-DOS, Mac OS, different flavors of Windows or UNIX. It would be desirable if the portable database system worked under all of these. There is a standard CD format which is mountable on all of the above-mentioned operating systems, so data incompatibility is not a problem. There should be versions of the database system included in the distribution for as many operating systems as practical, with utilities to copy updated data between different platforms. Portability encourages simplicity, since a simpler system is much easier to port than a complex one. If implementing some design of the portable web site involves modifying some aspect of the operating system, then copying the installation to a viewer's computer would involve reinstalling or reconfiguring the operating system on it, which is sure to be impractical. Thus portability precludes operating system or file system hacks as parts of the database implementation.

2.2.5 Easy Modification

As already mentioned in Section 2.2.2, it should be relatively effortless to modify the database. In addition, the system should be able to handle a large amount of new material without seriously degrading in performance.

2.2.6 Easy Update and Copying

It should be easy to copy the updated part of the database from one computer to another. Several methods should be made available to do this, and at least one method should work reasonably efficiently in any situation. Whenever owners of two copies of the database meet, it should be possible for the owner of the earlier copy to update the state of his database to that of the later copy. Ideally, it would not matter whether the author is meeting with one of the viewers, whether the author or a viewer has managed to connect to the master copy, or even whether two viewers have met, with one having a later version of the files.

2.3 Operations supported

The following operations need to be performed efficiently by the partially read-only portable web site:

• Add, Delete, Change, Get

These are operations on documents. They respectively add a new document to the hypertext database, delete or change an existing one, or deliver the contents of a document.

• Copy

Copy operates on the whole database. It is invoked when the author gives a CD-ROM containing the read-only part of the data to a viewer, and it copies the update records and the necessary binaries from the hard drive of the author's computer to the hard drive of the viewer's computer. Since the necessary binary files do not change, they can also be put on the CD-ROM. Between identical platforms, copy could be implemented by third-party file transfer programs using any one of several methods: backing up the update files to floppy disks or a tape backup drive and then restoring them to the viewer's computer, or connecting the two computers with a serial cable and using it for file transfer. The best option depends on the hardware available to the viewer. Transferring between different systems would be trickier, since some possible designs have to be implemented differently depending on which features are supported by the operating systems.

• Update Remote

The update remote operation is invoked when the author gets a network connection to the master copy and transfers his updates to it. It consists of the following steps:

(a) establish connection
(b) authenticate author identity
(c) get all updated objects
(d) transmit updated objects
(e) last_update_time = now

Since the network connection is assumed to be very slow, instead of transmitting entire files, only patches ("diffs") describing the changes made since last time can be transmitted. Since the network connection may crash before step (d) is finished, some updates may be retransmitted, so the master copy must be prepared to receive spurious updates. Note that the get updated objects operation must also return some record of deletions that took place.
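The update steps and the requirement that the master copy tolerate repeated updates can be sketched in Python as follows. This is only a minimal sketch under assumptions that are not part of the original design: each change is modeled as a hypothetical `Update` record, a content of `None` stands for a deletion, whole contents are sent instead of diffs for brevity, and the hypothetical `apply_updates` function stands in for the master copy.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Update(NamedTuple):
    """One change made on the author's machine (hypothetical record layout)."""
    path: str                # document name relative to the database root
    mtime: float             # modification time on the author's machine
    content: Optional[str]   # new contents, or None if the document was deleted


def get_updated_objects(log: List[Update], last_update_time: float) -> List[Update]:
    """'Get all updated objects': every change, deletions included, since the last upload."""
    return [u for u in log if u.mtime > last_update_time]


def apply_updates(master: Dict[str, Tuple[float, Optional[str]]],
                  updates: List[Update]) -> None:
    """Master-side receipt of transmitted objects.

    Applying an update is idempotent: a record that is not newer than what
    the master already holds is skipped, so retransmission after a dropped
    connection does no harm.
    """
    for u in updates:
        current = master.get(u.path)
        if current is not None and current[0] >= u.mtime:
            continue  # spurious or repeated update, already applied
        master[u.path] = (u.mtime, u.content)  # content None acts as a deletion record
```

In this sketch the author would advance `last_update_time` only after the whole batch has been acknowledged, which is why the master has to accept duplicates.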

• Update Local

This operation is in some sense the inverse of update remote. It is used when a viewer obtains a network connection and would like to update its local copy from the master copy of the database. The viewer transmits the list of the files it has, with their last modification times, and the master copy replies with patches to the files that have changed since then, together with notifications of file deletions and additions to the database.
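One way the master copy might compute its reply is sketched below in Python. The function names and the dictionary layout (document path mapped to last modification time) are assumptions made for illustration, and the patch format shown is simply what Python's standard `difflib` produces, not a format prescribed by the design.

```python
import difflib
from typing import Dict, List, Tuple


def plan_local_update(viewer: Dict[str, float],
                      master: Dict[str, float]) -> Tuple[List[str], List[str], List[str]]:
    """Compare the viewer's file listing with the master's.

    Both arguments map a document path to its last modification time.
    Returns (changed, added, deleted) lists of paths.
    """
    changed = sorted(p for p in viewer if p in master and master[p] > viewer[p])
    added = sorted(p for p in master if p not in viewer)
    deleted = sorted(p for p in viewer if p not in master)
    return changed, added, deleted


def make_patch(old_text: str, new_text: str, name: str) -> str:
    """Build a unified diff so that only the changed lines cross the slow link."""
    lines = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=name,
        tofile=name,
    )
    return "".join(lines)
```

Note that producing a patch requires the master to keep, or be able to reconstruct, the version the viewer currently holds; the sketch leaves that bookkeeping out.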

3.0 Design analysis

3.1 The URL to Object Mapping

Every object in the Egyptology database can have links pointing to it from other documents and links from it to other documents. If an object is changed, the old copy will remain on the CD-ROM, and the new, updated copy needs to be stored elsewhere. The database system must somehow redirect links going to the old copy of the data so that they point to the new copy, and must make sure that links from the new document still point to the correct objects. Of course, the database files containing the other links can't be modified, since they themselves are located on the CD-ROM. The links need to be redirected in such a way that a standard web browser is still able to follow them correctly to and from new data. To understand how this can be done, we first need to understand how hyperlinks are usually resolved. This understanding will reveal several alternative methods of attack on the problem.

FIGURE 1. Steps in translation of URLs to objects

A hyperlink is stored in an HTML page in the form of an anchor tag <a href=URL>, where URL is either absolute or relative. A relative URL is resolved by the web browser using one of two methods:

if a <BASE HREF=BASE_URL> tag is present in the head of the page, then BASE_URL is used to resolve relative URLs
otherwise the URL of the current page is used to resolve relative URLs
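The two resolution rules above can be illustrated with Python's standard `urllib.parse.urljoin`. The function name `resolve_link` and the example paths are hypothetical; a real browser performs this step internally.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_link(href: str, page_url: str, base_href: Optional[str] = None) -> str:
    """Resolve an anchor's href the way a browser would.

    If the page declares a BASE HREF, relative URLs are resolved against it;
    otherwise they are resolved against the URL of the page itself.
    An absolute href is returned unchanged in either case.
    """
    base = base_href if base_href is not None else page_url
    return urljoin(base, href)
```

For example, `resolve_link("tombs/kv62.html", "http://localhost/egypt/index.html")` resolves against the current page, while passing `base_href="http://localhost/mirror/"` redirects the same relative link into a different directory.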

The resulting absolute URL can specify one of several methods of access. If the URL is in the file domain, then it contains the absolute pathname of the file wanted, and the request for the file is sent by the web browser directly to the file system.¹

If the resulting URL is in http domain then there is an extra level of indirection.² The web browser contacts the http server on the specified host and port and gives it the URL. The server maps the URL to a filename on its host and then passes it to its own file system. ³ The file system then maps the filename to the file which is passed back up through the layers.

If all the URLs inside the database are relative, then the way the web browser accesses the pages will have the following property: if the top-level page was accessed using the file domain, then all accesses will use the file method, and if the top-level page was accessed through the http server, all later accesses will go through that http server.
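This property can be checked with a short Python sketch that follows a chain of relative links from a top-level page and records the access method used at each hop. The function name, the paths and the port number are made up for illustration.

```python
from typing import List, Set
from urllib.parse import urljoin, urlsplit


def access_methods(top_level_url: str, relative_links: List[str]) -> Set[str]:
    """Follow a chain of relative links and collect the URL schemes encountered.

    Each link is resolved against the page reached by the previous one,
    mimicking a user clicking through the database.
    """
    url = top_level_url
    methods = {urlsplit(url).scheme}
    for href in relative_links:
        url = urljoin(url, href)
        methods.add(urlsplit(url).scheme)
    return methods
```

Starting from `file:///cdrom/egypt/index.html`, every hop stays in the file domain; starting from `http://localhost:8080/egypt/index.html`, every hop goes back to the same server.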

There are two common misconceptions about http server operation that are relevant to the portable database project. The first is that a web client cannot communicate with an http server on the same host, or that, if it can, it needs a working network connection to do so. In fact, if the operating system allows several processes to execute simultaneously, the client can connect to a server running on the same machine without needing access to the network.⁴ The second is that a web server must map URLs into correspondingly named files, perhaps changing only the starting directory, e.g. that a URL http://localhost/pub/file1.html must map to some file named file1.html in the file system. In fact, the mapping can be completely arbitrary.

Above, we saw that a locally-running http server can provide the extra indirection in mapping URLs to objects needed for our application, even in the absence of a network connection. In addition, if the top level page of the database is accessed through this server, and all the links are relative, then all the requests will go through the web server, guaranteeing the needed level of indirection.
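Both points can be demonstrated with a small Python sketch built on the standard `http.server` module: the server listens only on the loopback address, and the file it returns for a URL is looked up in an arbitrary table. This is an illustration under stated assumptions, not the server the design would ship; `make_handler`, `start_local_server` and the table layout are hypothetical names.

```python
import http.server
import os
import threading
from typing import Dict


def make_handler(url_map: Dict[str, str]):
    """Create a request handler that serves files according to url_map.

    url_map maps a URL path to any filename at all; the two names need
    not resemble each other.
    """
    class MappedHandler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            filename = url_map.get(self.path)
            if filename is None or not os.path.isfile(filename):
                self.send_error(404, "File not found")
                return
            with open(filename, "rb") as f:
                body = f.read()
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    return MappedHandler


def start_local_server(url_map: Dict[str, str], port: int = 0) -> http.server.HTTPServer:
    """Start the server on the loopback interface in a background thread.

    Port 0 asks the operating system for any free port; the chosen port
    is available afterwards as server.server_address[1].
    """
    server = http.server.HTTPServer(("127.0.0.1", port), make_handler(url_map))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A browser on the same machine would then be pointed at `http://127.0.0.1:PORT/` for whatever port was chosen, with no outside network connection involved.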

Another, less straightforward, way of guaranteeing indirection is to modify the file system's mapping of filenames to files. Most modern file systems allow symlinks, a mechanism by which a single file can be named by several different filenames. This mechanism is called symbolic links in UNIX systems (see [3] for further detail); shortcuts in Windows NT/95 play a loosely similar role, though they are resolved by the Windows shell rather than by the file system itself. If all requests to a file on CD-ROM go through a symlink stored on the hard drive, then by modifying the symlink we can effectively modify the read-only file. If the file system does not support symlinks, as is the case for the MS-DOS file system, then the HTML 3.2 standard provides an alternative which can accomplish the same purpose. If an HTML page includes the following tag:

<META HTTP-EQUIV="refresh" CONTENT="n; URL=REAL_URL">

then the web browser will automatically load REAL_URL in n seconds. If n is set to 0, then using this tag is equivalent to symlinking on HTML level.
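A stub page of this kind could be generated by a small utility such as the Python sketch below. The function name and the fallback link in the body are additions for illustration; only the META tag itself comes from the text above.

```python
import html


def redirect_page(real_url: str, delay_seconds: int = 0) -> str:
    """Return an HTML page that sends the browser on to real_url.

    With delay_seconds set to 0 the page behaves like a symlink at the
    HTML level. The plain link in the body is a fallback for browsers
    that ignore the refresh directive.
    """
    target = html.escape(real_url, quote=True)
    return (
        "<HTML><HEAD>\n"
        f'<META HTTP-EQUIV="refresh" CONTENT="{delay_seconds}; URL={target}">\n'
        "</HEAD><BODY>\n"
        f'<A HREF="{target}">This document has moved.</A>\n'
        "</BODY></HTML>\n"
    )
```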

3.2 A Solution Modifying the Filename to Object Mapping

In this solution, a directory tree of symlinks is kept on the hard drive, exactly mirroring the CD-ROM directory tree. Initially, every symlink points to the corresponding file on the CD-ROM. If a file is added, it is written into the appropriate mirror directory. If a file is changed, the symlink in the mirror directory is replaced by a regular file holding the new contents. If a file is deleted, its entry in the mirror file system, be it a symlink or a regular file, is deleted. To get a file, the web browser uses the file method to request files in the mirror hierarchy from the file system. If the document has not been changed, its entry in the mirror hierarchy will be a symlink pointing to the correct file on the CD-ROM, and the file system will follow it automatically. If the document has been changed or newly added, the file system will find a regular file with the up-to-date data in the correct spot. If the document has been deleted from the database, there will be nothing in its spot in the mirror hierarchy, and the file system will return a `File not Found' error if asked to get it. To get all updated files, which is needed for the update remote operation, a program would have to walk the CD-ROM and disk mirror trees in parallel and list the discrepancies. The copy operation can be done by third-party software, unless one wishes to copy from a file system which supports symlinks to one which does not, in which case it is necessary to convert the symlink files into HTML pages with the META refresh tag.
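The utilities this solution needs are small. The Python sketch below shows one possible shape for them on a file system with symlink support; all function names are hypothetical, and the directory walk stands in for the parallel traversal described above. Note that a changed document must replace the symlink rather than be written through it, since writing through the link would target the read-only CD-ROM.

```python
import os
from typing import Dict, List, Set


def build_mirror(cdrom_root: str, mirror_root: str) -> None:
    """Create a tree of symlinks on the hard drive mirroring the CD-ROM tree."""
    for dirpath, _dirnames, filenames in os.walk(cdrom_root):
        rel = os.path.relpath(dirpath, cdrom_root)
        target_dir = os.path.normpath(os.path.join(mirror_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            source = os.path.abspath(os.path.join(dirpath, name))
            os.symlink(source, os.path.join(target_dir, name))


def write_document(mirror_root: str, rel_path: str, data: bytes) -> None:
    """Add a new document, or change one by replacing its symlink with a regular file."""
    path = os.path.join(mirror_root, rel_path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if os.path.islink(path):
        os.remove(path)  # remove the link itself; the CD-ROM file is untouched
    with open(path, "wb") as f:
        f.write(data)


def _files(root: str) -> Set[str]:
    """Return the relative paths of all files (symlinks included) under root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def list_discrepancies(cdrom_root: str, mirror_root: str) -> Dict[str, List[str]]:
    """Compare the two trees and classify every difference."""
    on_cd = _files(cdrom_root)
    in_mirror = _files(mirror_root)
    regular = {p for p in in_mirror
               if not os.path.islink(os.path.join(mirror_root, p))}
    return {
        "added": sorted(regular - on_cd),      # regular file with no CD-ROM counterpart
        "changed": sorted(regular & on_cd),    # regular file shadowing a CD-ROM file
        "deleted": sorted(on_cd - in_mirror),  # CD-ROM file with no mirror entry
    }
```

Deleting a document is just `os.remove` on its mirror entry, so no separate helper is shown.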

3.3 A Solution Modifying the URL to Filename Mapping

An alternative solution is to use a local http server to achieve indirection. The server would need to keep a table mapping URLs to filenames. Each slot of the table would also record the last modification time of the document and an archive bit, telling whether it has been uploaded to the master copy. Initially, the table would be empty, indicating a default one-to-one mapping. When a document is added or changed, the new data is written to the server's data directory, and the mapping from the URL to the data is recorded in the mapping table. If a document is deleted, a delete mark is recorded in the slot for the document's URL. During copy, all that needs to be copied is the server mapping table and the server data directory. During update remote, the server mapping table is consulted: if some filled slot in the table has its archive bit equal to zero, the file has been changed since the last upload to the master copy and needs to be uploaded now. For each file, the archive bit is set after a successful upload.
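The mapping table described here could take roughly the following shape. This Python sketch is an assumption about the layout, not a specification: the class and method names are hypothetical, a filename of `None` plays the role of the delete mark, and the default one-to-one mapping is modeled by prefixing the URL path with the CD-ROM root.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Slot:
    filename: Optional[str]  # file in the server's data directory; None is the delete mark
    mtime: float             # last modification time of the document
    archived: bool = False   # archive bit: True once uploaded to the master copy


class MappingTable:
    def __init__(self, cdrom_root: str) -> None:
        self.cdrom_root = cdrom_root
        self.slots: Dict[str, Slot] = {}  # empty table means default mapping everywhere

    def put(self, url: str, filename: str, mtime: float) -> None:
        """Record an added or changed document; its archive bit starts cleared."""
        self.slots[url] = Slot(filename, mtime)

    def delete(self, url: str, mtime: float) -> None:
        """Record a delete mark for the URL."""
        self.slots[url] = Slot(None, mtime)

    def resolve(self, url: str) -> Optional[str]:
        """Return the file to serve for url, or None if the document was deleted."""
        slot = self.slots.get(url)
        if slot is None:
            return self.cdrom_root + url  # default one-to-one mapping onto the CD-ROM
        return slot.filename

    def pending_uploads(self) -> List[str]:
        """URLs whose archive bit is still zero, i.e. not yet sent to the master copy."""
        return sorted(u for u, s in self.slots.items() if not s.archived)

    def mark_uploaded(self, url: str) -> None:
        """Set the archive bit after a successful upload."""
        self.slots[url].archived = True
```

A URL that is neither in the table nor on the CD-ROM resolves to a nonexistent path, which the server would report as not found. Because delete marks stay in the table with their archive bit cleared, deletions are picked up by update remote along with additions and changes.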

3.4 Comparison of two solutions

Both of the solutions presented above have similar degrees of user transparency. Both of them provide for easy modification of documents and easy updating and sharing of the database. The file system solution is less complex than the web server solution, since the latter involves modifying an http server, while the former uses only a couple of utility programs and relies mainly on the services provided by the standard file system software. On the other hand, the file system approach takes more disk space, since it requires a full symlink mirror directory tree, while the web server approach only uses space for data which is actually new or changed. The web server approach is also more portable, since it does not rely on the file system's support of symlinks to work. A big disadvantage of the web server approach is that on older, weaker computers, running a web server and a web client simultaneously would overload the system. Also, even on more modern systems, the performance of the file system method will be better, because it incurs the overhead of following one symlink at worst, while the web server design involves a full HTTP request and response exchange for every document. This performance gap would become more noticeable with large data files, such as high-resolution images or movies.

4.0 Conclusion

Based on these considerations, I recommend the file system based solution as the superior of the two. It provides better performance, is adequately portable (falling back on META refresh pages where symlinks are unavailable) and is exceedingly simple, since most of the functionality it requires is already implemented by the symlink feature of the operating system's file system. This approach allows easy duplication and updating of the changed material and hides the mechanisms of indirection from the user. It is an extremely economical solution for the task at hand.

5.0 References

[1] Wilbur - HTML 3.2. http://www.htmlhelp.com/reference/wilbur/
[2] Jerome H. Saltzer. "Name Binding in Computer Systems." Section 5 of Engineering of Computer Systems, MIT EECS Dept.
[3] Andrew S. Tanenbaum. Modern Operating Systems. Prentice-Hall, 1992.
[4] Design Project #1: A Partly Read-Only Portable Web Site. 6.033 Handout 16, Spring 97.

Footnotes

(1): That is, the file domain URL for the file /foo/bar would be file://localhost/foo/bar
(2): There are other URL domains, such as ftp, gopher, wais, etc., but they are irrelevant to the current discussion.
(3): The other possibilities, such as mapping a URL to the output of a program such as a CGI script, or mapping the URL to another URL on some other host, are not useful for the purposes of this discussion.
(4): Some TCP/IP stack implementations do require that a working network connection be present before they allow connections even to the local host, but that is a defect of those implementations.