Maintaining the Integrity of a Digital Library

Sarah Ahmed - Matthew Berman - Kyle Jamieson

Professor Hari Balakrishnan
6.033 Computer System Engineering
#012358


June 28, 1999














While increases in disk storage capacity allow entire libraries of information to be kept online, maintaining the integrity of large amounts of data is a challenging task. We describe the design of a distributed, web-accessible system responsible for maintaining the integrity of a digitized library archive. To achieve this goal, we replicate servers and establish at least three copies of submitted data before acknowledging receipt. Similar mechanisms and internal data structures manage the addition of a new replica to the system. Throughout the system, we exploit cryptographic hash codes to streamline the implementation of object identifiers and to verify the correct transmission and receipt of data over the network. Finally, we show how the above mechanisms are sufficient for the detection and repair of disk and replica failures when a maintenance process operates continuously in the background at each replica.

Introduction

With the exponential growth of the Internet and the increasing availability of cheap, high-capacity hard disks, on-line digital libraries are becoming a reality. Due to the huge quantity of information that can be stored in such libraries and the dynamic nature of the Internet, maintaining the integrity of the data becomes a major concern. The Library of Congress is planning to take advantage of these new technological possibilities and digitize its entire collection of 105,000,000 objects. Maintaining the integrity of this many objects is no easy task, and the Library desires a repository, procedures, and protocols that will address this problem. To ensure continuous availability of the digital library, the Library would like to establish replica sites in Australia, Russia, South Africa, and Ecuador, in addition to the Washington, D.C. site.

In this paper we design a repository for reliably storing a library's data. We provide the repository interfaces put and get for adding and retrieving items, a set of procedures for creating a new replica, and other protocols and procedures for updating and preserving data integrity. Our primary design goal is integrity: we provide distinct procedures for maintaining intra-replica integrity, inter-replica consistency, and ensuring reliability of disk reads and writes.

Section 2 of this paper outlines the desirable properties of our repository. Section 3 describes our design in detail. Section 4 presents alternative design choices that we considered and rejected together with the tradeoffs of the design we chose. A summary and our conclusions are given in Section 5.

   
Design Criteria

This section lists the major design criteria for the archival storage system of a digital library. The requirements are dominated by integrity concerns, but other considerations include performance and usability constraints.

User Functionality

The user should not be constrained to submit (put) or query (get) library data through any particular replica; instead, the client software should be able to direct requests to any of the replicas. This lets clients optimize their own accesses, distributes network load, and balances load across the replicas. Additionally, the system should not rely on users to avoid duplicate data; users should be able to enter the same information more than once without harming integrity.

Data Integrity

Guaranteeing that data submitted to the repository remain available, unaltered, for later retrieval is the most important requirement of the system.

Isolation

We require that different put operations on the repository do not interfere with each other, so that we obtain serializability.

Inter-Replica Consistency

Each replica in the repository must eventually store every piece of data successfully submitted to the system. To meet this requirement, replicas will be obligated to engage in protocols that ensure propagation of data throughout the entire system within a reasonable amount of time. Furthermore, due to the unreliability of the network, when a replica receives a new piece of data, it must also verify its validity.

Duplicate Data Elimination

If a librarian presents the same data to the repository more than once, the repository should not dedicate any additional system resources to that piece of data. This requirement is consistent with the semantics of the put operation [handout27]; an item successfully put to the library more than once need not be treated any differently than an item successfully put to the library only once.

``Bogus'' Data Elimination

The next requirement is a condition on the validity of the data seen by librarians when they perform get operations to retrieve information. The repository should not return any data as the result of a get operation that was not supplied to the system as the argument of a put operation. We term data that violates this condition ``bogus'' data.

Fault Tolerance

The underlying hardware and network that the repository system uses is prone to failure, and a failure in the hardware should not jeopardize data integrity. Failures can be divided into the following categories:

1.
Disk-Related Errors: Files stored in replica hard disks may be incompletely written or incorrectly read; the repository system must recover from these errors. Additionally, the directory structure may be corrupted: entries may silently disappear or be damaged.

2.
Network-Related Errors: Packets sent over the network may be corrupted or dropped. We assume the use of a TCP channel between replicas, which has a 32-bit checksum. However, with a number of operations at least on the order of $\left(1.0 \times 10^8\ \mbox{put operations}\right)\left(5\ \mbox{replicas}\right) = 5 \times 10^8$ transactions over the lifetime of the system, the probability of an error is non-negligible, so replicas must make the probability of network data corruption very low.

3.
Server Crashes: Natural disasters or unrecoverable hardware errors may cause an entire server to fail. When a server fails, all data on the server are lost. We require that server crashes not cause an inconsistency in the data stored at other servers.

Replica Creation

New replicas may be added to the system at any time; we require that the addition of a replica occur concurrently with ongoing additions and queries to the repository. Additionally, the new replica must reach a state consistent with the other replicas within a reasonable amount of time. The process that copies files to the new replica may be interrupted several times before completion, so it must be recoverable.

Naming

Object Identifiers (OIDs) name items in the system; we require that they be unique to each item and immutable. Furthermore, we require that different replicas in the system come to a consensus on the object ID assigned to a new object. No other requirements are placed on OIDs by users of the system.

   
Performance

We can estimate an upper bound on the allowable average time per request by assuming that librarians put data to the repository only during the work week, within an 8-hour window each day. These figures yield
\[
\left(\frac{1\ \mbox{year}}{1{,}500{,}000\ \mbox{requests}}\right)
\left(365 \cdot \frac{5}{7}\ \frac{\mbox{days}}{\mbox{year}}\right)
\left(\frac{8\ \mbox{hours}}{1\ \mbox{day}}\right)
\left(\frac{60\ \mbox{minutes}}{1\ \mbox{hour}}\right)
\left(\frac{60\ \mbox{seconds}}{1\ \mbox{minute}}\right)
\approx 5.0\ \mbox{seconds/request},
\]

so a successful system must complete requests in no more than 5.0 seconds each, on average. The system does not require performance above this bound, however.

   
Recommended Solution

We divide the system into two levels: the repository level and the replica level.


  
Figure: Structure of the proposed solution. One of the replicas is designated primary for each client operation.

Repository-Level Functionality

In the repository level, one replica is designated as the primary replica for each put request. The client repository software chooses a primary replica out of all replicas available, and sends the data to that replica. The primary is responsible for immediately storing the data locally and sending it to all other replicas. When the primary is satisfied that at least two other replicas have stored the data, it returns the object ID to the client.

To make a get request, clients are responsible for contacting as many replicas as necessary to fulfill the request. There is no constraint on which replicas the client contacts for a put or get request.

Naming

In our design, the name of each file is the SHA1 hash of its contents [schneier, pp. 442-445], which also serves as the OID. This ensures with high probability that each file has a unique OID and allows an OID to be trivially translated into a file name. Since the hash must be computed anyway to check data integrity, using the SHA1 hash as both the OID and the filename simplifies the system considerably. This design also makes duplicate data detection trivial, as explained in Section 3.1.3.
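To make the naming scheme concrete, the following sketch (Python is used here purely for illustration; the design itself is language-neutral, and compute_oid is a hypothetical helper) shows how an OID can be derived directly from an object's contents:

import hashlib

def compute_oid(data: bytes) -> str:
    # The OID is the SHA1 hash of the object's contents, hex-encoded
    # so that it can double as the file name.
    return hashlib.sha1(data).hexdigest()

# Identical contents always map to the same OID, so a duplicate
# submission is detected by a simple lookup on the OID.
assert compute_oid(b"the same bytes") == compute_oid(b"the same bytes")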

Although the OID of a given file is the same in each replica, the disk in which the file is stored at each replica may vary. Therefore, in order to map OIDs to the directory in which the file resides, we maintain a balanced tree data structure at each replica site. Whenever a new OID is stored at a replica, a new node is entered into the replica's tree. In order to guarantee reliability under the threat of disk failure, the tree is replicated across three disks at each replica site.

The amount of storage needed for the trees is on the order of several gigabytes. An estimate of the necessary storage, assuming 100 bits of bookkeeping information per node, 160 bits per node for the SHA1 hash, and 8 bits per node for the disk number, is
\[
\left(1.05 \times 10^8\ \mbox{nodes}\right)\left(160\ \mbox{bits} + 8\ \mbox{bits} + 100\ \mbox{bits}\right) \approx 3.5\ \mbox{GB}.
\]

When a get operation or the integrity checker (see Sections 3.1.4, 3.2.1) needs to look up an OID at a given replica, it queries all three OID trees at that replica and uses majority polling to resolve differences.
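A minimal sketch of this majority-polling lookup follows; the three trees are modeled here as in-memory dictionaries mapping an OID to its directory, whereas the real structures are balanced trees kept on separate disks:

from collections import Counter

def lookup_with_majority(oid, trees):
    # Query all three replicated OID trees and resolve disagreements
    # by majority vote; None models a missing entry.
    votes = [tree.get(oid) for tree in trees]
    value, count = Counter(votes).most_common(1)[0]
    if count >= 2:
        return value                                  # at least two trees agree
    raise RuntimeError("all three OID trees disagree on " + oid)

# Example: one tree holds a corrupted entry, the other two agree.
trees = [{"oid-1": "disk3"}, {"oid-1": "disk3"}, {"oid-1": "disk7"}]
print(lookup_with_majority("oid-1", trees))           # -> disk3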

Replica Selection

The client repository software decides the order in which it tries to use each replica as primary. Because librarians will submit put requests primarily from the Washington, D.C. area, and because the Washington, D.C. site has the highest-bandwidth Internet connection, nearly all put requests will use that site as their primary. Get requests will be more evenly distributed among the replicas because people from all over the world will issue such requests.

   
Put Protocol

The primary purpose of the put protocol is to guarantee that if the client receives an acknowledgment message, the system will store the data with very high probability.


  
Figure: Pseudocode for client-side put functionality.

The operation of the put protocol is shown in Figure 5. First, the client chooses a replica to send the initial put request to; that replica is designated the primary. On receipt of the put request, the primary first generates the appropriate object ID by computing the SHA1 hash of the data, and atomically raises a busy flag (stored on disk) that indicates a put operation is in progress.


  
Figure: Pseudocode for primary replica put functionality.

Next, the primary invokes a local store method (shown in Figure 3), which searches the object ID tree to check that the freshly submitted object ID is not a duplicate. The store method detects a duplicate by checking whether the OID of the new file is already in the OID tree and whether the hash of the file pointed to by the tree node matches the hash of the new file. Here the use of a search tree simplifies the detection of duplicate files, since the tree can be efficiently searched for a given OID.


  
Figure: Pseudocode for secondary replica put functionality.

Next, store receives the data and computes its SHA1 hash in order to verify it against the hash provided in the call. Then it finds a suitable disk, given the size of the data it is to write. It then calls create and careful-write-file to write the information to disk. The primary then adds all other replicas to a pending list for the present object ID, and waits for a background process to schedule those replicas on the pending list for updates via the store operation. When the primary receives acknowledgments from two other replicas, it atomically lowers the busy flag and returns an acknowledgment to the client.
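The following simplified, in-memory sketch illustrates the primary-side put sequence described above (busy flag, local store with duplicate detection, pending list, and acknowledgment after two remote copies). Class and method names such as store_remote are hypothetical, and the background sender is modeled here as a synchronous loop:

import hashlib

class Secondary:
    """Trivial stand-in for a remote replica."""
    def __init__(self):
        self.store = {}

    def store_remote(self, oid, data):
        self.store[oid] = data                    # acknowledges by returning normally

class PrimaryReplica:
    """Toy model of the primary's put handling; the real system keeps the
    busy flag, the OID tree, and the pending lists on disk."""
    def __init__(self, other_replicas):
        self.store = {}               # OID -> data (stands in for OID tree + files)
        self.pending = {}             # OID -> set of replicas still lacking the file
        self.busy = set()             # OIDs whose busy flag is raised
        self.others = other_replicas

    def put(self, data):
        oid = hashlib.sha1(data).hexdigest()
        self.busy.add(oid)                        # atomically raise the busy flag
        if oid not in self.store:                 # duplicate detection via OID lookup
            self.store[oid] = data                # stands in for careful-write-file
        self.pending[oid] = set(self.others)      # every other replica still needs a copy
        acks = 0
        for replica in list(self.pending[oid]):   # background sender, simplified
            replica.store_remote(oid, data)
            self.pending[oid].discard(replica)
            acks += 1
            if acks == 2:
                break                             # two extra copies suffice to acknowledge
        self.busy.discard(oid)                    # atomically lower the busy flag
        return oid                                # acknowledgment + OID for the client

primary = PrimaryReplica([Secondary() for _ in range(4)])
print(primary.put(b"scanned page 1"))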


  
Figure: The sequence of messages sent in the course of a put operation. The user receives an acknowledgment only after each replica has acknowledged a store message.

Discussion

Failures may occur anywhere in the sequence of operations shown in Figure 5. We designed the put protocol to ensure that if a client receives an acknowledgment of a put operation, then at least three replicas have the information contained within that put request. Put uses a flag that is atomically raised when the operation begins and atomically lowered right before the client receives the acknowledgment and OID. If the system crashes while the flag is raised, then on restart, the system will know that the client never received an OID. If the flag is lowered when the crash occurs, then on restart the system will know the client did receive an OID.

The put procedure avoids phantom files (files that are not present in the OID tree) but does not prevent missing files (entries in the OID tree for files that do not actually exist). Phantom files are avoided because the node for an OID is inserted into the OID tree before the file is actually written to disk; moreover, because the OID tree is replicated, it is difficult for disk failures to create phantom files. Thus, if a file is on disk, it will be in the OID tree. By the same reasoning, a missing file may occur if the system crashes after the node is inserted in the OID tree but before the file is written. This causes no problems because the integrity checker will detect the missing file and request it from the other replicas. All the other replicas will eventually report that they do not have the file, and the integrity checker at the primary site will then remove the OID.

We list the possible points of system failure during the put operation below; in each case, the system remains in a consistent state. A sketch of the restart procedure implied by these cases follows the list.

1.
If the primary crashes before the flag is raised then there clearly are no consistency problems because nothing has been written to disk and the client will never receive any OID.

2.
If the primary crashes after the flag is raised but before store is called, then when the system restarts it will find the flag raised and know that the client never received an OID. It will then check the OID tree, find no node with that OID, and tell all other replicas to delete that OID (even though none of them will actually have it).

3.
If the primary crashes after the store function has added a node to the OID tree but before create-file has been called, then when the system restarts it will find the flag raised and know that the client never received an OID. It will check the OID tree, find the node with the OID it is looking for, delete this node, and tell all other replicas to delete that OID (even though none of them will actually have it).

4.
If the primary crashes after the store function has called create-file but before careful-write-file has completed, then when the system restarts it will find the flag raised and know that the client never received an OID. It will check the OID tree, find the node with the OID it is looking for, delete the node, delete the partially (or not-at-all) written file, and tell all other replicas to delete that OID (even though none of them will actually have it).

5.
If the primary crashes after the careful-write-file called by store has completed but before add-to-pending has completed, then when the system restarts it will find the flag raised and know that the client never received an OID. It will check the OID tree, find the node with the OID it is looking for, delete the node, delete the file, and tell all other replicas to delete that OID (even though none of them will actually have it).

6.
If the primary crashes after add-to-pending has completed but before the flag has been lowered (which happens when a second acknowledgment is received), then when the system restarts it will find the flag raised and know that the client never received an OID. It will check the OID tree, find the node with the OID it is looking for, delete the node, delete the file, and tell all other replicas to delete that OID (and those that already have the file will delete it).

7.
If the primary crashes after the flag has been lowered but before the OID has actually been sent to the client, then when the system restarts it will think that it has sent the client an OID even though it has not. The primary will find the pending list and attempt to send the file to all replicas still in the list, as usual. Hence the end result is that the client thinks the file has not been stored when it actually has been. If the client attempts to put the file into the repository again, the replica will detect a duplicate and send an acknowledgment. Although disk space may be wasted if the client never retries the put, nothing worse will happen.
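The restart procedure implied by cases 2 through 6 can be sketched as follows; all of the arguments are in-memory stand-ins for on-disk state, and replica.delete is a hypothetical remote call:

def recover_after_crash(busy_flags, oid_tree, files, other_replicas):
    # Any OID whose busy flag is still raised was never acknowledged to
    # the client, so every local trace of it is removed and the other
    # replicas are told to discard it as well (a no-op if they never got it).
    for oid in list(busy_flags):
        oid_tree.pop(oid, None)           # remove the node if it was inserted
        files.pop(oid, None)              # remove a partial or complete file
        for replica in other_replicas:
            replica.delete(oid)           # hypothetical remote deletion request
        busy_flags.discard(oid)           # the interrupted put is now fully undone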

Inter-Replica Consistency

Inter-replica consistency is implicitly maintained by the repository's distribution protocols. Whenever a new file is put into the repository, the primary replica is responsible for ensuring that each replica in the system eventually receives and stores a copy of the file. For each new file, the primary creates a new pending list containing every other replica; the replicas on the list are those that still need to receive a copy of the new file. When the primary receives an acknowledgment that a replica has stored the file, the primary removes that replica from the appropriate pending list. An empty pending list indicates that all replicas have a copy of the file. The primary tries to empty each pending list by periodically attempting to send the file to the replicas still on the list. The frequency at which each list is flushed decreases over time using an exponential backoff similar to that used in Ethernet [ethernet, p. 400]. Thus, all replicas converge toward a consistent state in which every file is present at every replica.
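A sketch of this propagation process appears below. For simplicity it backs off between whole sweeps rather than per list, and send is an assumed callable that returns True only when the remote replica has verifiably stored the file:

import time

def propagation_daemon(pending_lists, files, send,
                       base_delay=1.0, max_delay=3600.0):
    # pending_lists: dict mapping OID -> set of replicas still lacking the file.
    # send(replica, oid, data) returns True once the remote store is verified.
    delay = base_delay
    while pending_lists:
        for oid in list(pending_lists):
            remaining = pending_lists[oid]
            for replica in list(remaining):
                if send(replica, oid, files[oid]):
                    remaining.discard(replica)    # that replica now has the file
            if not remaining:
                del pending_lists[oid]            # empty list: every replica has it
        time.sleep(delay)
        delay = min(delay * 2, max_delay)         # exponential backoff between sweeps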

   
Performance

We perform an average-case analysis for a put operation under the assumptions that the primary is single-threaded and that the first two replicas that the primary contacts respond immediately and at the same speed. Tracing each step of the protocol (see Footnote 1), we see that the system meets the performance specification.

   
Get Protocol

The purpose of the get protocol is to obtain the file with a specified OID. When the client issues a get command, the software tries to contact each replica until it finds one which contains the desired data. Each replica determines whether it has the data by searching its object ID trees for the submitted OID as described above.
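A minimal sketch of the client-side get loop follows, assuming each replica object exposes a hypothetical get(oid) method; note that the returned data can be verified against the OID itself, since the OID is the SHA1 hash of the contents:

import hashlib

def client_get(oid, replicas):
    # Try replicas one by one until some reachable replica returns data
    # whose SHA1 hash matches the requested OID.
    for replica in replicas:
        try:
            data = replica.get(oid)               # hypothetical replica interface
        except IOError:
            continue                              # replica unreachable: try the next one
        if data is not None and hashlib.sha1(data).hexdigest() == oid:
            return data                           # verified against the OID itself
    raise KeyError("no replica could supply object " + oid)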


  
Figure: Pseudocode for client get functionality.


  
Figure: Pseudocode for replica get functionality.

Robustness

Whenever any data is transferred over the Internet, the receiving replica recomputes the hash of the data and compares it with a hash that was sent along with the data. If they do not match, the receiving replica requests that the data be transferred again. When clients issue get requests, they attempt to communicate with as many replicas as necessary to find one which is available. Thus, only if none of the replicas can be contacted will the get be aborted because of network unavailability.
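The receiver-side check can be sketched as follows, where fetch is an assumed callable that performs one network transfer:

import hashlib

def receive_with_verification(fetch, expected_hash, max_attempts=5):
    # Re-fetch the data until its SHA1 hash matches the hash that was
    # sent along with it; fetch() performs one network transfer.
    for _ in range(max_attempts):
        data = fetch()
        if hashlib.sha1(data).hexdigest() == expected_hash:
            return data                           # transfer verified end to end
    raise IOError("data repeatedly failed hash verification")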

Integrity

There are in fact no consistency problems associated with the get protocol. A particular replica in the repository may write a file to disk at the same time that it services a get request to read the file. If the file is only partially written (and the OID of the file is in the OID tree), then when the file is read, the computed and stored hashes will not match, and the replica will return an error. In that case, the repository software at the client's machine will try another replica, and that replica will have the file. This is because if a get request arrives at the same time that a file is being copied to disk, then the replica is not the primary for that file (otherwise the client requesting the file would have no way of knowing the OID), so some other replica is the primary and will have the file.

   
Performance

The client get procedure iterates over replicas until it receives a positive response from one of them. While the worst-case time for this operation is proportional to the propagation delay and number of replicas, the common case is fast. We maintain a consistent state across replicas so that the first request usually returns a result to the client.

Replica Creation

A new replica is created in two distinct steps. First, a new set of hard disks is brought to a pre-existing replica, and its files are copied onto the new set of hard disks. The files are copied by an inorder traversal of the object ID tree at the pre-existing site: files are copied onto the new disk as they are read by the traversal. As each file is copied, we update a logical pointer on disk to refer to the most recently copied file. To make the update of this logical pointer atomic, it is implemented as two physical pointers with an auxiliary bit indicating which physical pointer is in use. Thus the copy operation will ensure that all files that were in the source replica at that time will be stored in the new replica.
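The two-physical-pointer technique can be sketched as follows; the in-memory class below stands in for two pointer slots and a selector bit that would actually reside on disk, each updated with a separate write:

class AtomicPointer:
    """The logical pointer is updated by writing the inactive slot first
    and then flipping a single selector bit, so a crash in mid-update
    leaves the old value intact."""
    def __init__(self, initial):
        self.slots = [initial, initial]   # two physical pointers (on disk)
        self.active = 0                   # the auxiliary bit

    def read(self):
        return self.slots[self.active]

    def update(self, new_value):
        inactive = 1 - self.active
        self.slots[inactive] = new_value  # step 1: write the spare slot
        self.active = inactive            # step 2: single atomic bit flip

# The copy loop records the most recently copied file:
progress = AtomicPointer(None)
for oid in ["oid-1", "oid-2", "oid-3"]:   # stands in for the inorder traversal
    # ... copy the file for this OID onto the new disks ...
    progress.update(oid)
print(progress.read())                    # -> oid-3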

We estimate a lower bound on the time spent in the first step by disregarding hard disk seek and rotational latency times:

\[
\frac{1.05 \times 10^{14}\ \mbox{bytes}}{160 \times 10^6\ \mbox{bytes/sec}} \approx 7.6\ \mbox{days}.
\]

Since the actual time spent in the first step (including seek and rotational latency) cannot be more than a factor of ten greater than this figure [handout33], we estimate an upper bound of roughly 76 days to create a new replica.

The second step ensures that all files stored in only a subset of the replicas at the time the new replica is created will be stored in the new replica. Before the start of the first step, each replica whose pending list for a given OID is not empty adds the new replica site to that pending list; this implies that the new site will receive every file which has not yet been distributed to all replicas. Thus the new replica will eventually receive every file which was put in the repository before and during the replica creation operation.


  
Figure: Pseudocode for replica creation functionality.

Replica-Level Functionality

At the replica level, each replica maintains a balanced tree data structure in which each tree node is keyed by object ID and also contains the directory name in which the corresponding data file resides. This tree is replicated on three hard disks in order to guard against failure of any one disk. Consistency is achieved through a majority polling of the three data structures. We maintain data integrity with a background process that runs at each replica and traverses the object ID tree. For each node, the background process computes the cryptographic hash of the corresponding file and compares this hash against the hash stored in the file metadata. If they are unequal, the replica requests a new copy of the file from another replica.

New replicas are created by locally copying all disks of a pre-existing replica onto a new set of disks and physically transporting this set of disks to the new replica site. Protocols shared by the put function and the replica creation function ensure that a new replica will eventually become consistent with all other replicas.

   
Integrity Checking

Intra-replica integrity is enforced by a background process that traverses the OID trees, checking the integrity of the file associated with each node and cross-checking the validity of the OIDs stored in each tree.

The process first chooses one OID tree, called the master, to walk over. We choose masters cyclically to avoid problems with missing files (see Figure 9, Section 3.1.3). Then, for each node in the master OID tree with OID x, we verify the directory entries for OID x in the other two trees. If there is a discrepancy, majority polling is used to correct the disagreeing tree. An intolerable error occurs if all three trees differ in their opinion of x's directory.

Next, for each OID-directory pair (x, d) successfully verified, the process looks for the file named x in directory d, computes the SHA1 hash of the file data, and compares this computed hash with the hash stored in the file metadata. If the two match, the file is presumed correct and the process continues to the next node in the inorder traversal of the master tree. If the hash values are unequal, the file data is corrupt with high probability, and the replica must request a correct copy of the file from another replica site.

The integrity checking process maintains a current pointer on disk which points to the most recently examined node in the master tree. To make the update of this pointer atomic, it is implemented as two physical pointers with a bit switch, as in [chapter8, p. 82]. After a node is checked, the process advances the current pointer and atomically switches the auxiliary bit. This prevents interruptions in the integrity checking process from influencing how often particular OID nodes are checked: all OIDs are checked equally often, which increases the reliability of the system.
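One sweep of the checker can be sketched as follows, reusing the AtomicPointer idea from the replica-creation sketch; read_file and request_copy are assumed interfaces standing in for careful-read-file and the inter-replica repair request:

import hashlib

def integrity_pass(master_oids, current, read_file, request_copy):
    # master_oids: the inorder sequence of OIDs from the master tree.
    # current: an AtomicPointer recording the most recently checked node.
    for oid in master_oids:
        data = read_file(oid)                        # careful-read-file stand-in
        if hashlib.sha1(data).hexdigest() != oid:    # the OID is itself the SHA1 hash
            request_copy(oid)                        # corrupt: fetch a fresh copy
        current.update(oid)                          # atomic progress record on disk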


  
Figure: Background integrity checking functionality.

The time to check the integrity of a single object can be estimated as
\[
25.25\ \mbox{ms} + 10\ \mbox{ms} + \left(6\ \mbox{ms} + 3\ \mbox{ms}\right) = 44.25\ \mbox{ms},
\]
where 25.25 ms, 10 ms, and 9 ms are the estimated times for careful-read-file (see Appendix A), computing a hash, and setting an atomic pointer on disk, respectively.

The total time to cycle through the entire object ID tree and check the integrity of each file can be estimated as

\[
\left(105{,}000{,}000\ \mbox{iterations}\right)
\left(\frac{44.25\ \mbox{ms}}{1\ \mbox{iteration}}\right)
\left(\frac{1\ \mbox{s}}{1000\ \mbox{ms}}\right)
\left(\frac{1\ \mbox{hour}}{3600\ \mbox{s}}\right)
\left(\frac{1\ \mbox{day}}{24\ \mbox{hours}}\right)
\approx 54\ \mbox{days}
\]

assuming there are no corrupt files and the tree contains all 105 million objects. The length of the integrity checking process thus justifies the inclusion of the current pointer.

Careful Reading and Writing

Our design functions in the presence of disk failures, including single-sector and whole-disk failures. All data is written to disk using careful-write-file, shown in Figure 10. This procedure repeatedly writes to disk, reads back the data it has just written, computes the SHA1 hash of the read data, and compares that hash with the hash stored in the file metadata. If the hashes do not match, the procedure writes and reads again; if they match, the data has been successfully written to disk. If the procedure encounters a transient error while writing a sector, it repeats the write until the hashes match. Data is read from disk using careful-read-file, shown in Figure 11. This procedure reads from disk, then computes and compares hashes until they match; thus transient failures while reading from disk are circumvented. Hard errors on disk sectors are flagged and avoided by the file system. If an entire disk fails, the file system marks every sector on that disk as having a hard error, and the inter-replica consistency mechanisms described previously allow all data on the failed disk eventually to be recovered from the other replicas.
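The careful-write-file and careful-read-file procedures can be sketched as follows; the dict-like disk argument is a stand-in for the raw device, so on real hardware the read-back in careful-write-file could genuinely differ from what was written:

import hashlib

def careful_write_file(disk, name, data, max_attempts=10):
    # Write, read back, and compare hashes until the data on disk verifies.
    expected = hashlib.sha1(data).hexdigest()
    for _ in range(max_attempts):
        disk[name] = data                                    # raw write
        if hashlib.sha1(disk[name]).hexdigest() == expected:
            return expected                                  # verified on disk
    raise IOError("persistent write failure for " + name)

def careful_read_file(disk, name, expected_hash, max_attempts=10):
    # Re-read until the computed hash matches the hash kept in the
    # file metadata (passed in here as expected_hash).
    for _ in range(max_attempts):
        data = disk[name]                                    # raw read
        if hashlib.sha1(data).hexdigest() == expected_hash:
            return data
    raise IOError("persistent read failure for " + name)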


  
Figure: Pseudocode for careful writing to disk.


  
Figure: Pseudocode for careful reading from disk.

Pending Lists

When a put request is directed to a primary replica, the primary creates a pending list for the file, places every other replica on the list, and writes the list to disk. The pending list data structure is an unordered list of the replicas that have not yet received the file; a non-empty pending list implies that not all replicas have the associated file. When the last replica is removed from a list, the list is deleted, so the absence of a pending list for a particular file means that all replicas have a copy of that file. These properties are useful in maintaining inter-replica consistency, as previously described. A daemon process at each replica continuously cycles through all the pending lists and, for each list, tries to send the associated file to the replicas on that list.

   
Design Alternatives

This section briefly describes alternate implementations of the design and compares them with the recommended solution.

Segmented OID Namespace

One alternative to using SHA1 as the OID was to segment the OID namespace into $n$ ranges, where $n$ is the number of replicas in the system. We use an OID of $\lceil \lg(1.05 \times 10^8) \rceil = 27$ bits. Then, if a client chooses replica $i \in [1, n]$ as the primary, replica $i$ arbitrarily assigns an OID only in the range $\left[ 2^{27} \left( \frac{i-1}{n} \right),\ 2^{27} \left( \frac{i}{n} \right) \right]$. This method is desirable because, like the recommended design, it obviates the need for the put and get protocols to keep OIDs from conflicting with each other. Additionally, a segmented OID namespace allows much shorter OIDs--27 bits instead of 160 bits--making OIDs human-readable. Finally, a segmented namespace causes conflicting OID assignments with probability zero, while the recommended design offers only a very low probability of OID conflicts.

We chose the recommended solution over the segmented namespace design because the segmented namespace places a limit on the number of OIDs that may be assigned by one replica. Since most requests come from one geographical area, we expect that one replica will be chosen as primary most of the time. This would exhaust that replica's portion of the namespace quickly, leaving the system unable to accept new requests without a more complicated mechanism to address the problem. With regard to conflicts, the probability of any two OIDs in the system colliding is sufficiently small (approximately $2^{-104}$ [handout33]) that the hash method is preferable.

Library of Congress Number

Another alternative for OID implementation was to use the Library of Congress call number, the Dewey Decimal call number, or the ISBN. This approach requires the user to supply the object identifier externally as an argument to the put request. One problem with this method is that it requires an additional mapping between the filename and the OID, since the two are no longer identical; the recommended solution does not require that this mapping be kept in any data structure, as it is implicit in the system. Also, since the correctness of the stored information would then depend on trusting the client, the reliability of the system is compromised. We therefore decided against this approach.

Data Structures

Initially, we chose to implement the OID-to-filename mapping as a table. A table is convenient for a system that uses sequential OIDs, because a table keyed by OID makes it easy to locate the entry corresponding to a given OID. Even though we are not using sequential OIDs, we could still maintain the OIDs in sorted order in a table; however, inserting new elements into a sorted table is inconvenient. A balanced tree is better suited to this purpose, since it supports efficient search and insertion of keys that arrive in no particular order.

Replica Creation

An alternative method that we considered for replica creation was to copy files directly over the Internet, or on CD-ROM. However, we ruled out these alternatives because of time constraints on the creation of a new replica. If we assume a U.S. domestic Internet connection, then since the network bandwidth will be the limiting factor in the transfer, we estimate that the copy replica operation will take
\[
\left(\frac{1.05 \times 10^{14}\ \mbox{bytes}}{1.5 \times 10^6\ \mbox{bits/sec}}\right)
\left(\frac{8\ \mbox{bits}}{1\ \mbox{byte}}\right)
\approx 17.8\ \mbox{years}
\]

at worst, and
\[
\left(\frac{1.05 \times 10^{14}\ \mbox{bytes}}{45 \times 10^6\ \mbox{bits/sec}}\right)
\left(\frac{8\ \mbox{bits}}{1\ \mbox{byte}}\right)
\approx 216\ \mbox{days}
\]

at best (using the estimates of Internet bandwidth given in [handout33]). If we copy files on CD-ROM, we must complete two separate copy operations, one at the source site and one at the target site. Each copy operation is limited by the bandwidth of the CD-ROM drive, so we estimate a lower bound on the time for this method of
\[
\left(\frac{1.05 \times 10^{14}\ \mbox{bytes}}{1.8 \times 10^6\ \mbox{bytes/sec}}\right)
\left(2\ \mbox{operations}\right)
\approx 3.7\ \mbox{years}.
\]

Each of these estimated times is prohibitively long; we thus chose the recommended solution instead, yielding a copy time of 7.6 days.

RAID

One possible way to achieve additional redundancy is to use some level of RAID [raid] within each replica site. This might enable replicas to correct some errors without having to request a file from another replica. The drawback is that even more disks would be necessary, and operations such as write-file and read-file might take longer to complete because the redundant disks might have to be written or read. We decided that the redundancy provided by multiple replica sites was adequate for our needs and that the additional redundancy of RAID disks did not justify the added complexity and overhead.

   
Conclusion

In this paper, we have reviewed the desirable criteria for a digital repository. Integrity of stored data is by far the most important property of the repository: once put, a file should never be lost. Our recommended solution relies heavily on redundancy to maintain data integrity. Multiple replica sites, the use of cryptographic hashes, mirrored copies of the OID tree at each replica site, and the storage of the hash in both the OID tree and the file metadata are all examples of redundancy guarding against failure. Finally, we compared our solution with several design alternatives for the assignment of OIDs and for OID-to-filename data structures.

As the price per megabyte of persistent storage continues to drop, applications such as digital libraries will likely continue to increase in importance. Effective designs for storing large amounts of data will be necessary when a very high degree of fault-tolerance is required. There will always be tradeoffs between size and reliability, and a good design addresses these issues to ensure the integrity of data.


   
Performance Estimates

This appendix gives performance estimates for some of the low-level file operations. Time estimates made in the previous sections are based on these figures.

Time spent in read-file for the successful completion of a 1 MB read operation:

\[
\frac{1\ \mbox{MB}}{160\ \mbox{MB/s}} + 6 \times 10^{-3}\ \mbox{s}
+ \left(\frac{1}{2}\ \mbox{rotation}\right)
\left(\frac{1\ \mbox{minute}}{10 \times 10^3\ \mbox{rotations}}\right)
\left(\frac{60\ \mbox{s}}{1\ \mbox{minute}}\right)
= 15.25\ \mbox{ms}
\]

Time spent in careful-read-file for the successful completion of a 1 MB read operation, where the first term is the time to compute the SHA1 hash of the data at 100 MB/s:

\[
\frac{1\ \mbox{MB}}{100\ \mbox{MB/s}} + t_{\mbox{\scriptsize read-file}} = 10\ \mbox{ms} + 15.25\ \mbox{ms} = 25.25\ \mbox{ms}
\]



Footnotes

1.
From Figure 5 we see that the time for a put operation is the sum of the following delays: transmission delay (client to primary), latency (client to primary), hash computation time (at primary), careful-write (at primary), add-pending (at primary), transmission delay (primary to replica), transmission delay (primary to replica), latency (primary to replica), hash computation (at replica), careful-write (at replica), transmission delay (replica to primary), latency (replica to primary), remove-pending (at primary), latency (primary to client). Note that only one primary-to-replica latency is included in this calculation, since only the primary-to-replica #2 latency is on the critical path of the protocol, given the assumptions stated above. We use the following numerical estimates: average data size = 1 MB, network bandwidth = 10 MB/s, network latency = 10 ms, hash computation rate = 100 MB/s. Substituting these numbers into the above expression gives 2.92 seconds per transaction, well under the required upper bound described in Section 2.5.

