6.033 2012 Design Project 1: A provenance-tracking file system

I. Due Dates and Deliverables

There are three deliverables for Design Project 1:
  1. A design proposal not exceeding 800 words, due on March 1, 2012 (before recitation).
  2. A design report not exceeding 2,500 words, due on March 22, 2012 at 5pm.
  3. An (optional) revision of your design report to improve your writing grade; you must set up a meeting with your writing instructor within one week of receiving your graded DP1 to take advantage of this.

All deliverables should be submitted via the online submission site.

As with real life system designs, 6.033 design projects are under-specified, and it is your job to complete the specification in a sensible way given the overall requirements of the project. As with designs in practice, the specifications often need some adjustment as the design is fleshed out. We recommend that you start early so that you can evolve your design over time. A good design is likely to take more than just a few days to put together.

II. The Problem

Ben Bitdiddle would like his system to keep track of where the contents of a particular file came from, and which other files may have in turn been influenced by this file. For example, when Ben is editing a PowerPoint presentation, he sometimes copies individual slides from other presentations, and then modifies them slightly. Later, he may want to find where did a particular slide get copied from, or, when Ben discovers a mistake in some slide, he may want to know what other presentations the mistake may have been copied to (even if that slide may have changed in those other presentations). As another example, Ben may want to know where a particular executable file in his system came from: that is, where did the file get copied or downloaded from, or what source files did it get compiled from?

In this design project, your goal will be to help Ben design a file system to keep track of provenance for files and to allow users to query that provenance.

Your system should support files that are logically composed of many components, and where each component might have a different provenance. For example, a PowerPoint file contains many slides, each of which may have been copied from a different source. Another example is a zip or tar file, which in turn contains many files that have their own provenance.

Your system should support applications that have not been modified to work with your design (e.g., a simple cp program that copies one file to another, and unmodified mv and rm programs). A simple way to do this would be to track all files read by a process, and record this set of files as the provenance for any files written by that same process.

Your system should also allow applications to supply additional information about the provenance of files. For example, a single process running PowerPoint may access many different presentation files, and modify many presentation files as well. With some changes to the application by PowerPoint developers, it should be possible to precisely keep track of which individual slides of the modified presentations depend on which individual slides of the presentations accessed by that process.

Your design of a provenance file system should clearly describe the on-disk and in-memory data structures for tracking provenance, and how they are used to support the required operations. To help you start thinking about what such a system might look like, Ben has sketched out the following notes about this design problem:

Each file in the file system has associated with it a set of parts. Each part corresponds to an application-defined unit of the file, such as a slide in a PointPoint file, or an individual compressed file in a zip file. A null part indicates the entire file, for applications that do not otherwise define any parts. If your design adopts the idea of file parts, you would need to decide how to name these parts; one choice may be to use free-form strings specified by the application as part names.

The system keeps track of the provenance for each part of the file; that is, it maintains a data structure that maps a name to a provenance data. Your design probably will have an on-disk and an in-memory data structure, and they may be different. The provenance data includes the set of other files (and optional part names in each of those files) that this file part depends on. One design may involve storing provenance information only about the immediate ancestors: that is, if a user copied a PowerPoint slide from file A to B, and then from B to C, C's provenance data points back to B, and it's up to the application to then track down B's provenance. An alternative design is to store the entire tree of provenance information with each file.

Storing the entire tree of provenance makes the design simpler, but may incur significant storage costs to create deep copies of provenance data. Moreover, a user may want to know whether two files have any common ancestors, in terms of provenance. Answering this question with a design that stores the entire tree requires a unique way of naming files in the stored tree.

Storing pointers to only the immediate ancestors offers potential space savings, but requires that the application be able to access the ancestor files, to keep tracking down further provenance information. However, your design should address the question of what happens when files may be deleted or renamed, or their names reused for a new file. You may find it interesting to look at how the Unix network file system (NFS) deals with inode reuse, as described in the textbook section 4.5.1.

Furthermore, even if the name is unique, it may be important that the name refers to a specific version of a file in time. Consider a user that copies a PowerPoint slide from file B to C, and then replaces that slide in B with a different slide from file A. If the provenance for this slide in file C points to the corresponding slide in B, and the provenance for B's slide now points to A, it may now appear that C's slide depends on A's slide, even though this could not have happened.

For unmodified applications, the system call API should be the same as a standard Unix system. For applications that have to be aware of the provenance (e.g., to specify precise dependencies for individual slides in PowerPoint), the OS could provide a function read_prov(fd, part) that would return the provenance for the specified part of the file, and a function write_prov(fd, part, provinfo) that would set the provenance for the specified part of the file. To access the provenance information, the user would use the read_prov(fd, part) system call to find files on which fd/part depends, and a search_prov(fd, part) system call to find files which depend on fd/part.

In your design, you may find it useful to assume a versioning file system, which stores all versions of each file over time. That is, you don't have to describe the versioning file system, but many of the important design issues still arise, even if you use a versioning file system. For example, you will need to explain when it is appropriate to garbage-collect old versions in such a design.

You may also find it useful to make use of file system extended attributes; the Wikipedia article offers a good summary.

III. Requirements

Your system must support the following use cases:

You can limit your design to a single machine. That is, files imported from outside the machine start out with no provenance information.

Your design must be able to operate continuously. Because the system would continue to generate provenance information when files are modified, your design may need to periodically garbage-collect provenance information. In particular, your design should be able to handle a workload where the user continuously copies a PowerPoint file a.ppt to b.ppt, then copies b.ppt to c.ppt and deletes b.ppt, then copies c.ppt to d.ppt and deletes c.ppt, and so on.

Your design should store the provenance information persistently. That is, if the computer is turned off, and then turned on again, the provenance information must be available. You don't have to worry the computer crashing during normal operation.

Your design does not need to address buggy applications, malicious applications, or any security considerations. It's OK if your design allows a specially-designed application to erase the provenance of some files.

Your design should perform reasonably well, and should not require excessive disk or memory storage overheads, though it is up to you to decide and argue for what constitutes excessive overheads; the target of this design is a typical laptop or desktop computer system. You can assume that your disk requires a constant (but non-trivial) time to read or write any block, similar to an ideal solid-state disk.

Two particularly challenging performance workloads are search with a large file system, and continuous file copying. Your design should be able to sustain file copying at a rate of 10 file copies per second (including the garbage collection that must take place in long-term use). For a search workload, if the file system contains 1,000,000 files, it should take at most a few seconds to find which files were derived from some file A; you can assume that there are only a few such derived files.

Challenge: extend your design to track provenance information for files across machines. For example, given a file in the file system, a user should be able to track which email messages that file came from as an attachment. Similarly, to track down who received a copy of slides that the user sent as part of an attachment. Addressing this challenge is not necessary to get an A, but may be fun.

IV. Design Proposal

The design proposal should be a concise summary (800 words) of your overall system design.

The format of your proposal should be roughly as follows: it should begin with an overview, describing the overall approach. It should then summarize your proposed design and explain how it solves the problem and meets the requirements in this document.

The core of the proposal should be the design description. This should describe your design's interface (based on Ben's sketch), how you propose to implement that interface, and how you think that interface will be used by applications. You may make use of bullets and tables, but must introduce them with at least 1-2 sentences of context. This section must include at least one graphic, correctly formatted with a caption and brief description. For example, it may be a good idea to illustrate the design of your data structures used to track provenance, or the data structure used to garbage-collect provenance information.

The proposal should end with a summary of how the design meets the requirements (1-2 sentences). It should also include problems which remain to be resolved.

You do not have to present a detailed rationale or analysis in your proposal. However, if any of your design decisions are unusual (particularly creative, experimental, or risky) or if you deviate from the requirements, you should explain and justify those decisions in your proposal.

You will receive feedback on your proposal from your TA in time to adjust your final report. You will also receive a writing program grade for your proposal, as well as feedback from your writing instructor that will help you improve the writing of the final report.

V. Design Report

Your report should explain your design. It should discuss the major design decisions and tradeoffs you made, and justify your choices. It should discuss any limitations of which you are aware. You should assume that your report is being read by someone who has read this assignment, but has not thought carefully about this particular design problem. Give enough detail that your project can be turned over successfully to an implementation team. Your report should convince the reader that your design satisfies the requirements in Section III.

V.A. Report organization

Use this organization for your report:

Here are a few tips:

The format and appearance of your report should conform to the DP1 style guide.

VI. How we evaluate your work

Your recitation and writing instructors will assign your report a grade that reflects both the design itself and how well your report presents the design. These are the main grading criteria.

The most important aspect of your design is that we can understand how it works and that you have clearly addressed the requirements and provided the elements listed in Sections III and IV. Complicated designs that we cannot understand will not be graded favorably. Some overall content considerations:

Some writing considerations:

VII. Revising Your Report

If you received a writing grade of B+ or less for your original report, you can revise your report to improve the writing grade. The maximum writing grade you can receive through revision is B+. The technical grade cannot be improved through revision. If you wish to revise, make an appointment to speak with your writing program lecturer no later than one week after you receive your writing grade for DP1. The steps for revision are:
  1. You will agree on a section to revise; the grade for the revision of that section will replace your writing grade for DP1.
  2. You will agree on a due date for that revision. Our aim is to provide enough time so that you can have the revised grade before drop date. If you do not submit the revised report by the due date agreed upon with your writing instructor, your original DP1 grade will become your final DP1 grade, and the revision will not count, although we can still provide you with feedback on your revised writing.
  3. You will not receive much written feedback on the revision, but you are encouraged to discuss any questions you have with your writing program lecturer.
  4. Your recitation instructor must approve the completed revision (to make sure no technical content has been broken in the process). Only your writing grade is changed by the revision, however.

VIII. Collaboration

This project is an individual effort. You are welcome to discuss the problem and ideas for solutions with your friends, but if you include any of their ideas in your solution you should explicitly give them credit. You must be the sole author of your report.

IX. Clarifications