Utilizing a Union File System to Implement a Mutable CD-ROM Based Information Archive

Venkatesh Satish

Abstract

I propose a method for remotely updating a World-Wide Web server with file modifications made by a user on a portable system (a laptop). The system makes use of a Union File System mount to allow a user to maintain a simple record of any changes he wishes to make to an archive that is contained on a CD-ROM. The Union File System allows the effective merging of a writable medium, like a hard drive, and a read-only file system. In this implementation, the hard drive would only contain the differences between the overall working copy and the CD-ROM, so one could simply package the local files and send them to a web server to keep it updated and facilitate the updating of files on arbitrary systems that only had a copy of the CD-ROM. The benefits of such a system become more evident when the system is compared to a major alternative: keeping a table of all file modifications (changes, creations, deletions) and transferring all necessary files to the web server. Although this latter method may allow the most efficient use of bandwidth, it introduces a great deal of complexity and a lack of flexibility in the software the laptop user can employ. The union mount is a simple, elegant solution that is efficient, functional, and relatively easy to implement. Several issues that relate to the implementation of both systems are discussed; examples include avoidance of naming conflicts, the handling of different file operations, and details of the setup of these methods.

Introduction

The scheme presented in this paper is designed to allow an Egyptologist to update a web site by making changes on his laptop in the field and then uploading the modifications to the server when he gets Internet access. Additionally, he makes a snapshot of his web files and places them on a CD-ROM every time he makes a trip to Egypt, presenting copies of the CD-ROM to his colleagues on each journey. There are 400 megabytes of files stored on the first snapshot, and all the web pages are heavily interlinked. The user wishes to accommodate two related needs that his colleagues have. The first is for colleagues to receive any content updates directly from the Egyptologist's laptop when they meet. The second is for colleagues to download updates from the web site, keeping in mind that there are several versions of CD-ROM snapshots that must be accommodated. A standard web browser must serve as the user interface for all parties wishing to read files on the web site.

There are a number of issues to address in order to satisfy the Egyptologist's demands. One of the major problems is dealing with the remote update, since the laptop and the web site have distinct file systems. The laptop must somehow store and categorize all the changes to the working copy and upload the proper files to the web site. The site must perform the same changes made to the working copy on the laptop, and this might require writing a software package solely for that purpose. Any system put in place must be able to handle the different modification types, which can generally be categorized into the following operations: creating files, changing existing files, and deleting files. Renaming or moving files and directories can be viewed as combinations of the above operations. Another problem embedded in the feature specifications is the manner in which the system will "change" the files on the CD-ROM. Clearly, one cannot actually alter the files held on the disc, since it is a read-only medium. Instead, one must have a way of simulating a deletion or modification of a file on the CD, so that the operation can be communicated to the home web site and executed there.

Allowing colleagues to obtain updates from the laptop means that all cumulative changes made to the snapshot stored on the CD must be archived somewhere. Allowing for multiple versions of CD-ROMs means that there needs to be a way for the colleagues to specify a version and receive only the changes that were made since their version was produced. This could get tricky since these people may use a wide variety of systems and would need software that could implement all the recorded changes on their respective platforms.

Design Criteria

While the overall objectives may seem straightforward, there are general design issues that must be considered. The system design must incorporate all of the necessary functionality, while also balancing run-time efficiency against the complexity of implementing an optimal system. For instance, the system has to record the state of the web site in an efficient way. A possible solution to the updating problem is to keep all of the files on the laptop's hard disk, modify them as one wishes, and then send every file to the web server. The obvious limitation of this implementation is bandwidth, since every update would require sending at least 400 MB of data even if there were only a handful of changes. While extremely simple to implement, the idea leaves a lot to be desired in terms of efficiency. One can begin to see the trade-off between bandwidth efficiency and the complexity of designing the system. The desire to keep computer systems simple is a general rule in the system design literature, and it is stressed repeatedly in Lampson's Hints for Computer System Design [1].

In addition, there has to be a relatively simple way for the system to keep track of changes to files. This falls under the run-time efficiency vs. complexity argument. An optimal system would send only changed information to the web server because this practice saves bandwidth. As the system becomes more complicated, however, more software must be written to handle all the information. For example, a system might keep track of every file block that is changed and send only the changed parts of a file in order to minimize the bandwidth used. However, maintaining such a large record and writing software that can deal with lower-level management might be too costly to implement.

The system must be tailored to address the specific problem at hand. A reasonable assumption might be that the vast majority of the data is not altered because it is unlikely that the researcher would modify graphics files, which happen to take up much of the space on the CD-ROM. The researcher may mainly alter text and HTML files, which make up a small amount of data in absolute terms. A system may very well transmit redundant data on a regular basis, but the added cost of network time might not be that significant. In that case, which seems a likely scenario for the Egyptologist, we would prefer a system that is easy to implement even though it does not present the optimal bandwidth solution. Since Internet connectivity appears to be a limiting factor, the total number of times the researcher gets to update his system may not warrant the implementation expense of creating a maximally efficient run-time system.

The solution has to try to minimize the amount of special software that has to be written for the design to work. Quite simply, the time needed to implement the design will directly translate into the Egyptologist's expenditures, as he will have to employ software engineers to create the system. If the solution can use existing software packages, it also reduces the chance for coding and programming errors. Also, the design must bear in mind that the software used should be standard across a variety of systems, since the researcher's colleagues will want to update their information expeditiously and correctly.

If such standard software can be found, the design might circumvent many platform-dependency issues. We also do not want to restrict the Egyptologist to project-specific software tools (e.g., editors), since it is generally best to allow some flexibility for future modifications. Using standard packages might give the user that flexibility.

Furthermore, there is some usefulness to keeping the file changes somewhat transparent. The researcher should be able to modify the files on his laptop without having to worry about the internal bookkeeping of the system that tracks changes to the working copy. To a degree, the user should be able to make file changes on the laptop in basically the same way he would do so on the web site at home. This concept allows for a user-friendly setting, so that the Egyptologist does not have to change his habits to accommodate all the features.

Implementation Analysis

Mounting a Union File System on the laptop may help solve the difficulty of maintaining a file system composed of writable and read-only media. The essence of the union mount is that the CD-ROM reader and the hard disk containing the modified files will be mounted to a single directory that can be examined using a standard web browser. At the application level, files on both the hard disk and CD-ROM will appear to be in the same directory, with the only difference being that files on the hard disk will supersede files with the same name and path on the CD-ROM. [Figure omitted: the original illustration of this arrangement could not be reproduced in this version.]

In this way, one could simply store all working copy modifications on the hard disk, which automatically delineates the files that should be sent to the web server to update it. These files can be archived on the laptop and extracted by the web server, effectively executing all operations. Moreover, the union mount would allow any editor, like Emacs, to modify a file on the CD-ROM transparently by saving the modified file on the hard drive.
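The lookup rule described above (a hard disk file supersedes a CD-ROM file with the same name and path) can be illustrated with a small user-level sketch. This is not how mount_union is implemented in the kernel; the function names `resolve` and `union_listing` and the two directory arguments are hypothetical, chosen only to show the precedence rule.

```python
import os

def resolve(relpath, upper, lower):
    """Return the path a union of `upper` (hard disk) over `lower` (CD-ROM)
    would expose for `relpath`, or None if neither layer has it.
    Hypothetical helper: it only illustrates the precedence rule."""
    for layer in (upper, lower):          # the writable layer is consulted first
        candidate = os.path.join(layer, relpath)
        # lexists also counts a symlink whose target is missing
        if os.path.lexists(candidate):
            return candidate
    return None

def union_listing(upper, lower, reldir=""):
    """Return the merged, de-duplicated names visible in one directory."""
    names = set()
    for layer in (upper, lower):
        directory = os.path.join(layer, reldir)
        if os.path.isdir(directory):
            names.update(os.listdir(directory))
    return sorted(names)
```

Under this rule, everything found in the upper directory is exactly the set of differences from the CD-ROM, which is why that directory alone describes the update.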

The general alternative to the union mount is a table-based solution that records all the changes that the Egyptologist makes to the file system on his laptop. The user stores all 400+ MB of web pages on the laptop's hard drive, and modifies the files using a special editor. This tool's purpose is to make table entries that record the operation type and any necessary file data for transfer to the web site. This method also needs software to receive/transmit the table and files, as well as execute the operations on the web site.

The basic file structure for both systems would be the same as the web site. In the union mount case, this is done so that the laptop's hard drive can be archived and extracted on to the web site. Having a different file structure than the web site would defeat the initial point of the union mount, which was constructing a local file hierarchy whose purpose was to represent all modifications to the last CD-ROM snapshot. Different file structures would require the maintenance of records of file locations on the laptop to sort out the eventual destination on the web site. That introduces complexity without adding features or improving the functionality, since the researcher does not gain much by just altering the file structure. The same file structure is used in the table method so that the design avoids the use of the BASE element to prevent naming conflicts. As long as the user ensures that the web server would function on the laptop, there should not be any naming conflicts on the web site. The goal of keeping a table of changes to the working copy is to duplicate all laptop mutations on the web server. In particular, the BASE element is necessary if one were going to implement the system using symbolic links. However, that method mandates a system to track all of the symlinks, and this requirement introduces significant complexity. So, the choice to keep the laptop organization the same as the web site stems from the desire to simply duplicate the laptop changes on the server.

File Modification Handling

The union mount system goes hand-in-hand with the tar (tape archive) utility [2]. It is possible to use this command, which is available on all versions of Unix, to copy all of the hard disk files (the cumulative working copy modifications since the last snapshot) on the laptop into a bundle. The archive file can then be sent to the web site, where the files are extracted. This tar file not only keeps the web site up to date, but provides a way for the Egyptologist to send all the updates to friends who only have a copy of the CD-ROM. If they want the update, all he needs to do is send them the archive file. Since tar is widely supported, the design provides a solution for colleagues who have differing platforms, which was one of the important criteria for selecting a solution. It is relatively clear how the union mount system handles new and altered files and directories. Since the archive utility stores the new files and directories in the archive file, corresponding directories and files are extracted on the web server's end.
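As a rough sketch of the bundling and extraction steps, the following uses Python's standard tarfile module in place of the tar command line. The function names and directory arguments are assumptions made for illustration; the paper itself only specifies that the hard disk files are archived and then extracted on the web site.

```python
import os
import tarfile

def bundle_changes(upper_dir, archive_path):
    """Archive everything under the laptop's writable directory.
    Member names are stored relative to upper_dir so that they line up
    with the web site's file structure (hypothetical helper)."""
    with tarfile.open(archive_path, "w") as tar:
        for name in sorted(os.listdir(upper_dir)):
            # tar.add recurses into subdirectories by default
            tar.add(os.path.join(upper_dir, name), arcname=name)

def apply_changes(archive_path, site_root):
    """Extract the archive over the site tree; files with the same path
    are overwritten, and new files and directories are created."""
    with tarfile.open(archive_path, "r") as tar:
        tar.extractall(site_root)
```

The same archive can be handed to a colleague, who would call the second function on a directory holding a copy of the CD-ROM contents.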

However, it might not be apparent how the deletion of files on the laptop can be paralleled on the web site using only an archive utility. A system of tagging deleted files is necessary so that the "delete" operation on the laptop can be effectively transferred to the web server. Simply deleting the file on the portable machine is not sufficient; in the union mount system, one needs a way of superseding the CD-ROM file with a null one in order to make sure the researcher does not observe a file on the CD-ROM that theoretically should have been deleted. The union mount system I envisioned is based on the mount_union command in many Berkeley Software Distribution (BSD) systems [3]. Incorporating a transparent delete is a logical extension of the mount, although it is unclear if this is actually included in different packages. In any case, a simple fix will allow deletions to occur. The delete command on the laptop should be aliased to delete the file and then create a symbolic link to a non-existent file with the deleted filename. For a standard convention, the new symlinks could point to "/deleted," a file name that is never used otherwise. The invalid symlink means that attempts to access those files will result in error messages being displayed, so the effective result is that the original file no longer "exists" on the laptop [4]. The move and rename commands on the laptop should also be aliased to create symlinks to "/deleted" for any files they modify. Tar handles symlinks, so these links will be copied to the home web site, overwriting the original files that were to be deleted. There should be a script on the web site that periodically checks for symlinks pointing to "/deleted" and removes them. This simply gets rid of entries that clutter the web site's directory. However, the invalid symlinks must remain on the union mounted system (laptop) until the next snapshot is written to a CD-ROM, to prevent naming conflicts between the hard disk and the CD-ROM. The same system applies to colleagues who download the tar file to update their systems. They would want to run scripts that check for symlinks pointing to "/deleted" and erase them. In summary, combining the tape archive utility with simple scripts and aliases covers all the file operations that are needed to update the web site.
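The aliased delete and the clean-up script might look roughly like the sketch below. It assumes a POSIX system where unprivileged symlinks are available; the names `mark_deleted` and `purge_tombstones` are hypothetical, and the sketch handles plain files only, not directories.

```python
import os

TOMBSTONE = "/deleted"   # the paper's convention: a target that never exists

def mark_deleted(upper, relpath):
    """Laptop side: remove the writable copy (if any) and leave a dangling
    symlink in its place, so the CD-ROM file underneath stays hidden."""
    path = os.path.join(upper, relpath)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if os.path.lexists(path):
        os.remove(path)
    os.symlink(TOMBSTONE, path)

def purge_tombstones(root):
    """Web site or colleague side: remove every symlink that points to
    the tombstone target. Returns the relative paths that were removed."""
    removed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) and os.readlink(path) == TOMBSTONE:
                os.remove(path)
                removed.append(os.path.relpath(path, root))
    return sorted(removed)
```

The purge would run only on the receiving side; on the laptop the links have to stay in place until the next snapshot, for the reason given above.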

While the concept of creating a file modification table to update the web site is straightforward, the mechanics of the system are somewhat cumbersome to implement. Not only is a special editor needed to track file modifications on the laptop and make the table, but specific software running on the web site must interpret table entries and execute those operations. This method requires a case for each operation category. However, writing the specific software does afford an opportunity to optimize certain areas. For instance, the union mount system sends the entire net difference between the last snapshot and the working copy, which may result in a number of duplicate files being sent. By keeping track of all the latest changes, the table method will only send the files that have actually changed. This is a point in the table method's favor, although the feature requires writing all the necessary software to support it.

The categories of file modifications on the web site include file creation, alteration, renaming, and deletion. Directory creation, renaming, and deletion, as well as symlink creation are examples of other common cases that must be handled. For each of these operations, the software needs certain information, including the pathname and/or file contents. Only the pathname is needed for the deletion of files and directories and the creation of directories. The symlink and renaming operations require two pathnames to be recorded in the table; the alteration operation and file creation both need a file to be sent to the web site. In the case of creation, all of the file content must be included. However, the file alteration operation offers an opportunity to make an interesting optimization. At the time of modification, the editor should run the diff command on the revised file and the old file, which creates a file that encodes the difference between the versions. The diff file will be sent to the web site along with the pathname of the original file. On the receiving side, the system will have to use the patch command on the original file and the diff file to obtain the revised file. That way, the researcher may make small changes in the web pages without wasting the bandwidth of sending an entire file when only a small one will do. Still, maintaining a sequential log does have some drawbacks. Whenever a delete operation is recorded, the system software must check to make sure preceding table entries that refer to the deleted file are removed. Otherwise, the laptop system might try to transfer a deleted file to the web site, which would result in errors. Furthermore, the renaming of a file or directory should also prompt the removal of preceding table entries that refer to that object, since system errors would occur otherwise. The table solution seems to introduce a number of new issues that are not problems with the union mount case.
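To make the bookkeeping burden concrete, here is a minimal sketch of a modification table with the delete rule described above, plus a helper that produces a diff for the alteration case. The `Entry` record, its field names, and both functions are assumptions for illustration; difflib stands in for the diff command, and applying the result on the server would still require patch or equivalent custom code.

```python
import difflib
from dataclasses import dataclass

@dataclass
class Entry:
    op: str            # "create", "alter", "delete", ... (hypothetical labels)
    path: str
    payload: str = ""  # file contents for create, diff text for alter

def make_diff(old_text, new_text, path):
    """Return a unified diff between two versions of one file."""
    lines = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=path, tofile=path)
    return "".join(lines)

def record(log, entry):
    """Append an entry; a delete first drops earlier entries for that path,
    so the laptop never tries to transfer a file that no longer exists."""
    if entry.op == "delete":
        log[:] = [e for e in log if e.path != entry.path]
    log.append(entry)
```

Even this small sketch leaves renames, directories, and symlinks unhandled, which hints at how much case-by-case logic the table method needs.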

There are a variety of problems to address with regard to the Egyptologist's colleagues trying to get updates from the web site using the table system. In order for a colleague to obtain an update, he would have to communicate when he received his most recent update, so that only the subsequent modifications are performed on his machine. The maintenance of a sequential log means the table includes all of the changes to the web site since the first CD-ROM snapshot! This system would be difficult to manage, especially when one considers all the necessary file content and names that need to be organized. Matters get more complicated when the system must accommodate different CD-ROM versions. It would be possible to get the right set of updates via a time stamp check, although this system would be somewhat complex. The union mount system only requires one tar file to be stored per CD-ROM version. Only one archive file is needed per version because the file contains all the cumulative changes that have been made on the web server since that CD-ROM snapshot. That makes it relatively simple for the Egyptologist's colleagues to update their information. Additionally, the table system requires custom software to run on colleagues' machines. That means that the software must be portable to all the different systems that the researcher's colleagues use. If this cannot be achieved, a new system is required to transmit the updates to colleagues, making the implementation even more cumbersome.

There is one minor limitation to the union mount system as it is currently described. As the system is set up, updates to the web site and to various colleagues' machines are easy to carry out: only a tar file of the laptop's hard disk is necessary to describe all of the file modifications to make the update possible. The web server maintains a list of all the tar files it receives, keeping only the most recent archive file and the tar files most recently received before each CD-ROM snapshot. The difficulty is that the web server maintains a record of the state of the current working copy, rather than a record of cumulative file modifications. As a result, making modifications to the web site using the server itself will not generate the tar files. That means all changes should be made using the laptop, even when the Egyptologist is home. This can be changed if the home web site is maintained as a union mount system, since it would be configured to record the differences between the working copy and the last CD-ROM version. However, a web server functioning as a union mount system means that the majority of the web pages are on CD. This, in turn, means that the server is slower than in the case where all the web files are on the hard drive. The slower performance is due in part to the overhead associated with the Union File System, which has to check both the hard drive and the CD-ROM drive on a continual basis. Considering that the Egyptologist would probably want his web site to be faster in order to serve all the people looking at his research, he is most likely better off changing the web pages using his laptop than trying to implement a web server using a union mount. While this limitation is a mild annoyance, it does not actually hinder the functionality or efficiency of his overall system.

In order to ensure that the data invariant of the web page system remains intact, the web server should disable any accesses while it is in the process of updating the pages. This holds for both systems, so that all updates are atomic. If people access files in the middle of, for instance, a tar file extraction, they may encounter invalid links even though the laptop system preserves the invariant. A related issue is whether a system has the ability to check the validity of links on the pages. There are standard packages available which can examine the links on a series of pages, and these could be used to perform the verification.

The union mount system can be expanded to employ smarter ways of sending updates to the web server and colleagues' machines. One way to accomplish this is to make a file-by-file comparison of the current hard disk and an extraction of the last tar file and then send only the files that are different. Directory comparisons are not very difficult to implement, and one could still save on bandwidth. This system still allows one to maintain a cumulative update from the last CD-ROM if that is a desirable goal. This type of update can be accomplished by extracting files from the last archive into an empty directory, copying the changes to that directory, and then re-bundling the files into an archive. More expansion can be done, although the chief selling point of the system is its amazing simplicity. If too many features are added, that would diminish the system's value.
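The file-by-file comparison could be sketched as follows, assuming the previous tar file has already been extracted into a scratch directory. The function name and arguments are hypothetical; the comparison reads file contents rather than trusting timestamps, and it treats symlinks (such as deletion markers) by comparing their targets.

```python
import filecmp
import os

def changed_since(last_dir, current_dir):
    """Return relative paths under current_dir that are new or differ from
    the copy in last_dir (an extraction of the previous archive).
    Symlinks that point to directories are not examined in this sketch."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(current_dir):
        for name in filenames:
            cur = os.path.join(dirpath, name)
            rel = os.path.relpath(cur, current_dir)
            old = os.path.join(last_dir, rel)
            if os.path.islink(cur):
                # compare link targets; the target itself may not exist
                same = os.path.islink(old) and os.readlink(old) == os.readlink(cur)
            else:
                same = (os.path.isfile(old) and not os.path.islink(old)
                        and filecmp.cmp(old, cur, shallow=False))
            if not same:
                changed.append(rel)
    return sorted(changed)
```

Only the returned paths would need to be bundled and sent, while the full hard disk directory remains available for building the cumulative archive.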

Recommendation

The major advantage of a union mount system over a table-based one is the elegance of the solution. While the run-time efficiency of a table-based system can be tailored to outperform the union mount system described in this paper, the simplicity outweighs the minor gains for a variety of reasons. Except for a few small scripts, the whole package is composed entirely of software that is readily available. The command described in this paper, mount_union, is supported by BSD, a relatively common operating system. The tar command is supported on all versions of Unix, which makes the overall solution an excellent choice in terms of trying to implement the idea. Other solutions might require the use of custom software to handle the transfer and interpretation of data. The use of specific software makes these solutions more costly to implement, more prone to error, and less portable. Sticking to software that is already used on a common basis ensures a greater reliability and gives the Egyptologist a chance to study how other groups may have tried to solve this problem. That resource might prove useful, because it means that many of the bugs in those software packages may already have been addressed.

Another advantage of the union mount system over the table method is that the researcher may use any editing software he wants to make changes to the web pages on the portable system. In the table method, he is basically restricted to using the special editor that makes table entries. This is not a very user-friendly solution, and it might reduce his ability to effectively maintain the web site. The union mounted file system limits the amount he has to learn. Scripts can be written to deal with the minor changes, like implementing a delete operation on the union mount without the user's knowledge. Maintaining transparency in the system ties in with user-friendliness, because it allows the user to deal with the system as if nothing had really changed. That might reduce the chance of human error on his part, for instance. So, while the union mount system does not incorporate some of the features that could be included in the table-based system, it addresses all the demands by the Egyptologist while maintaining an ease of implementation. For this reason, the union mount system is an ideal one for him, and the proposed package does provide a framework that can be expanded later.

Conclusions

A union mount system helps combine the CD-ROM drive and the hard disk on a portable system to effectively implement a writable CD-ROM. This works well as a means for updating a set of World-Wide Web pages based on a CD-ROM, since changes to the system are distinctly made on the hard drive. The system exploits the illusion of merging the two file systems to maintain user-friendliness. At the same time, it achieves the desired functions by keeping track of the hard drive files, which represent all of the modifications to the web pages. It is easy to transfer the set of changes to anyone else who is using the CD as a base for an information storage system. This easy way of updating information, especially when compared to the relatively laborious table system, makes the union mount solution an attractive one. Flexibility, ease of implementation, and solid performance are the idea's main strengths.

In particular, the table-based system is tough to implement because it must maintain records of all the individual file modifications. In addition, its complexity grows quickly as one attempts to include higher-level functionality. So, even though one could theoretically achieve a run-time efficiency improvement, mainly in bandwidth savings, the cost of implementation and introduction of customized software packages make the system less desirable.

References

1. Butler W. Lampson. Hints for computer system design. Proceedings of the Ninth ACM Symposium on Operating System Principles, Bretton Woods, New Hampshire (October 10-13, 1983), pages 33-48. Published as Operating System Review 17, 5 (1983).

2. On-line Manual Entry for tar command.

3. NetBSD Server (Union File System information), http://www.netbsd.org/

4. Andrew S. Tanenbaum. Modern Operating Systems. Prentice-Hall, 1992. Ch. 7. pp. 266-290.

Background (not explicitly cited)

Jerome H. Saltzer. Name binding in computer systems. Section 5 of The Engineering of Computer Systems, M.I.T. Department of Electrical Engineering and Computer Science (in preparation, December 6, 1983). (Naming conventions)

Andrew D. Birrell. An introduction to programming with threads. SRC Technical Report 35. Digital Equipment Corporation. Systems Research Center, Palo Alto, California, January 6, 1989. (Data Invariant)