All deliverables should be submitted via the online submission site.
As with real life system designs, 6.033 design projects are under-specified, and it is your job to complete the specification in a sensible way given the overall requirements of the project. As with designs in practice, the specifications often need some adjustment as the design is fleshed out. We recommend that you start early so that you can evolve your design over time. A good design is likely to take more than just a few days to put together.
Ben Bitdiddle would like his system to keep track of where the contents of a particular file came from, and which other files may have in turn been influenced by this file. For example, when Ben is editing a PowerPoint presentation, he sometimes copies individual slides from other presentations, and then modifies them slightly. Later, he may want to find where did a particular slide get copied from, or, when Ben discovers a mistake in some slide, he may want to know what other presentations the mistake may have been copied to (even if that slide may have changed in those other presentations). As another example, Ben may want to know where a particular executable file in his system came from: that is, where did the file get copied or downloaded from, or what source files did it get compiled from?
In this design project, your goal will be to help Ben design a file system to keep track of provenance for files and to allow users to query that provenance.
Your system should support files that are logically composed of many components, and where each component might have a different provenance. For example, a PowerPoint file contains many slides, each of which may have been copied from a different source. Another example is a zip or tar file, which in turn contains many files that have their own provenance.
Your system should support applications that have not been modified to work with your design (e.g., a simple cp program that copies one file to another, and unmodified mv and rm programs). A simple way to do this would be to track all files read by a process, and record this set of files as the provenance for any files written by that same process.
Your system should also allow applications to supply additional information about the provenance of files. For example, a single process running PowerPoint may access many different presentation files, and modify many presentation files as well. With some changes to the application by PowerPoint developers, it should be possible to precisely keep track of which individual slides of the modified presentations depend on which individual slides of the presentations accessed by that process.
Your design of a provenance file system should clearly describe the on-disk and in-memory data structures for tracking provenance, and how they are used to support the required operations. To help you start thinking about what such a system might look like, Ben has sketched out the following notes about this design problem:
Each file in the file system has associated with it a set of parts. Each part corresponds to an application-defined unit of the file, such as a slide in a PointPoint file, or an individual compressed file in a zip file. A null part indicates the entire file, for applications that do not otherwise define any parts. If your design adopts the idea of file parts, you would need to decide how to name these parts; one choice may be to use free-form strings specified by the application as part names.
The system keeps track of the provenance for each part of the file; that is, it maintains a data structure that maps a name to a provenance data. Your design probably will have an on-disk and an in-memory data structure, and they may be different. The provenance data includes the set of other files (and optional part names in each of those files) that this file part depends on. One design may involve storing provenance information only about the immediate ancestors: that is, if a user copied a PowerPoint slide from file A to B, and then from B to C, C's provenance data points back to B, and it's up to the application to then track down B's provenance. An alternative design is to store the entire tree of provenance information with each file.
Storing the entire tree of provenance makes the design simpler, but may incur significant storage costs to create deep copies of provenance data. Moreover, a user may want to know whether two files have any common ancestors, in terms of provenance. Answering this question with a design that stores the entire tree requires a unique way of naming files in the stored tree.
Storing pointers to only the immediate ancestors offers potential space savings, but requires that the application be able to access the ancestor files, to keep tracking down further provenance information. However, your design should address the question of what happens when files may be deleted or renamed, or their names reused for a new file. You may find it interesting to look at how the Unix network file system (NFS) deals with inode reuse, as described in the textbook section 4.5.1.
Furthermore, even if the name is unique, it may be important that the name refers to a specific version of a file in time. Consider a user that copies a PowerPoint slide from file B to C, and then replaces that slide in B with a different slide from file A. If the provenance for this slide in file C points to the corresponding slide in B, and the provenance for B's slide now points to A, it may now appear that C's slide depends on A's slide, even though this could not have happened.
For unmodified applications, the system call API should be the same as a standard Unix system. For applications that have to be aware of the provenance (e.g., to specify precise dependencies for individual slides in PowerPoint), the OS could provide a function read_prov(fd, part) that would return the provenance for the specified part of the file, and a function write_prov(fd, part, provinfo) that would set the provenance for the specified part of the file. To access the provenance information, the user would use the read_prov(fd, part) system call to find files on which fd/part depends, and a search_prov(fd, part) system call to find files which depend on fd/part.
In your design, you may find it useful to assume a versioning file system, which stores all versions of each file over time. That is, you don't have to describe the versioning file system, but many of the important design issues still arise, even if you use a versioning file system. For example, you will need to explain when it is appropriate to garbage-collect old versions in such a design.
You may also find it useful to make use of file system extended attributes; the Wikipedia article offers a good summary.
Your system must support the following use cases:
PowerPoint slide copying. Suppose a user Alice opens several PowerPoint presentations, copies several slides from each of them into a new PowerPoint presentation, and saves the new presentation file. Later on, she should be able to track down which presentation each slide in the new file came from, and to be able to open up that file. If Alice discovers a mistake in a PowerPoint slide, she should also be able to track down all copies of that slide in other files, without having to wade through unrelated files (e.g., files that contain a copy of a different slide from that presentation).
Compiling software. Suppose Alice compiles a large piece of software using make, and then installs the resulting binary onto her system. Alice should be able to later track down the sources she used to compile the binary. For extra credit, make it possible for Alice to track down the URL from which she downloaded the software in the first place.
Copying files. Alice should be able to copy files (such as PowerPoint files or compiled executables) with a program that has not been modified to know about provenance, and still be able to correctly track the provenance of the files afterwards. At least one design question you should address here is whether such a copy utility should propagate the provenance information from source to destination by value, or whether it should create a reference to the source file in the destination file's provenance information.
Handling tar/zip files. Alice should be able to pack several files into a single tar or zip file, extract such tar or zip files, and the provenance information should make sense. For example, suppose Alice creates a zip file containing two PowerPoint files, where the first file contains slides copied from the second file, and sends the zip file to Bob. When Bob extracts this zip file, he should be able to tell that one of the second file's slides came from the first file. It's OK to require some modifications to the zip and tar programs to support this. Separately, what should be the provenance on the zip file itself, after it is created? Similarly to the file copying application, should tar/zip propagate provenance by value or by reference?
You can limit your design to a single machine. That is, files imported from outside the machine start out with no provenance information.
Your design must be able to operate continuously. Because the system would continue to generate provenance information when files are modified, your design may need to periodically garbage-collect provenance information. In particular, your design should be able to handle a workload where the user continuously copies a PowerPoint file a.ppt to b.ppt, then copies b.ppt to c.ppt and deletes b.ppt, then copies c.ppt to d.ppt and deletes c.ppt, and so on.
Your design should store the provenance information persistently. That is, if the computer is turned off, and then turned on again, the provenance information must be available. You don't have to worry the computer crashing during normal operation.
Your design does not need to address buggy applications, malicious applications, or any security considerations. It's OK if your design allows a specially-designed application to erase the provenance of some files.
Your design should perform reasonably well, and should not require excessive disk or memory storage overheads, though it is up to you to decide and argue for what constitutes excessive overheads; the target of this design is a typical laptop or desktop computer system. You can assume that your disk requires a constant (but non-trivial) time to read or write any block, similar to an ideal solid-state disk.
Two particularly challenging performance workloads are search with a large file system, and continuous file copying. Your design should be able to sustain file copying at a rate of 10 file copies per second (including the garbage collection that must take place in long-term use). For a search workload, if the file system contains 1,000,000 files, it should take at most a few seconds to find which files were derived from some file A; you can assume that there are only a few such derived files.
Challenge: extend your design to track provenance information for files across machines. For example, given a file in the file system, a user should be able to track which email messages that file came from as an attachment. Similarly, to track down who received a copy of slides that the user sent as part of an attachment. Addressing this challenge is not necessary to get an A, but may be fun.
The design proposal should be a concise summary (800 words) of your overall system design.
The format of your proposal should be roughly as follows: it should begin with an overview, describing the overall approach. It should then summarize your proposed design and explain how it solves the problem and meets the requirements in this document.
The core of the proposal should be the design description. This should describe your design's interface (based on Ben's sketch), how you propose to implement that interface, and how you think that interface will be used by applications. You may make use of bullets and tables, but must introduce them with at least 1-2 sentences of context. This section must include at least one graphic, correctly formatted with a caption and brief description. For example, it may be a good idea to illustrate the design of your data structures used to track provenance, or the data structure used to garbage-collect provenance information.
The proposal should end with a summary of how the design meets the requirements (1-2 sentences). It should also include problems which remain to be resolved.
You do not have to present a detailed rationale or analysis in your proposal. However, if any of your design decisions are unusual (particularly creative, experimental, or risky) or if you deviate from the requirements, you should explain and justify those decisions in your proposal.
You will receive feedback on your proposal from your TA in time to adjust your final report. You will also receive a writing program grade for your proposal, as well as feedback from your writing instructor that will help you improve the writing of the final report.
Your report should explain your design. It should discuss the major design decisions and tradeoffs you made, and justify your choices. It should discuss any limitations of which you are aware. You should assume that your report is being read by someone who has read this assignment, but has not thought carefully about this particular design problem. Give enough detail that your project can be turned over successfully to an implementation team. Your report should convince the reader that your design satisfies the requirements in Section III.
Use this organization for your report:
Here are a few tips:
The format and appearance of your report should conform to the DP1 style guide.
The most important aspect of your design is that we can understand how it works and that you have clearly addressed the requirements and provided the elements listed in Sections III and IV. Complicated designs that we cannot understand will not be graded favorably. Some overall content considerations: