6.033 2012 DP1: A provenance-tracking file system

6.033 2012 Design Project 1: A provenance-tracking file system

I. Due Dates and Deliverables

There are three deliverables for Design Project 1:

A design proposal not exceeding 800 words, due on March 1, 2012 (before recitation).
A design report not exceeding 2,500 words, due on March 22, 2012 at 5pm.
An (optional) revision of your design report to improve your writing grade; you must set up a meeting with your writing instructor within one week of receiving your graded DP1 to take advantage of this.

All deliverables should be submitted via the online submission site.

As with real life system designs, 6.033 design projects are under-specified, and it is your job to complete the specification in a sensible way given the overall requirements of the project. As with designs in practice, the specifications often need some adjustment as the design is fleshed out. We recommend that you start early so that you can evolve your design over time. A good design is likely to take more than just a few days to put together.

II. The Problem

Ben Bitdiddle would like his system to keep track of where the contents of a particular file came from, and which other files may have in turn been influenced by this file. For example, when Ben is editing a PowerPoint presentation, he sometimes copies individual slides from other presentations, and then modifies them slightly. Later, he may want to find where did a particular slide get copied from, or, when Ben discovers a mistake in some slide, he may want to know what other presentations the mistake may have been copied to (even if that slide may have changed in those other presentations). As another example, Ben may want to know where a particular executable file in his system came from: that is, where did the file get copied or downloaded from, or what source files did it get compiled from?

In this design project, your goal will be to help Ben design a file system to keep track of provenance for files and to allow users to query that provenance.

Your system should support files that are logically composed of many components, and where each component might have a different provenance. For example, a PowerPoint file contains many slides, each of which may have been copied from a different source. Another example is a zip or tar file, which in turn contains many files that have their own provenance.

Your system should support applications that have not been modified to work with your design (e.g., a simple cp program that copies one file to another, and unmodified mv and rm programs). A simple way to do this would be to track all files read by a process, and record this set of files as the provenance for any files written by that same process.

Your system should also allow applications to supply additional information about the provenance of files. For example, a single process running PowerPoint may access many different presentation files, and modify many presentation files as well. With some changes to the application by PowerPoint developers, it should be possible to precisely keep track of which individual slides of the modified presentations depend on which individual slides of the presentations accessed by that process.

Your design of a provenance file system should clearly describe the on-disk and in-memory data structures for tracking provenance, and how they are used to support the required operations. To help you start thinking about what such a system might look like, Ben has sketched out the following notes about this design problem:

Each file in the file system has associated with it a set of parts. Each part corresponds to an application-defined unit of the file, such as a slide in a PointPoint file, or an individual compressed file in a zip file. A null part indicates the entire file, for applications that do not otherwise define any parts. If your design adopts the idea of file parts, you would need to decide how to name these parts; one choice may be to use free-form strings specified by the application as part names.

The system keeps track of the provenance for each part of the file; that is, it maintains a data structure that maps a name to a provenance data. Your design probably will have an on-disk and an in-memory data structure, and they may be different. The provenance data includes the set of other files (and optional part names in each of those files) that this file part depends on. One design may involve storing provenance information only about the immediate ancestors: that is, if a user copied a PowerPoint slide from file A to B, and then from B to C, C's provenance data points back to B, and it's up to the application to then track down B's provenance. An alternative design is to store the entire tree of provenance information with each file.

Storing the entire tree of provenance makes the design simpler, but may incur significant storage costs to create deep copies of provenance data. Moreover, a user may want to know whether two files have any common ancestors, in terms of provenance. Answering this question with a design that stores the entire tree requires a unique way of naming files in the stored tree.

Storing pointers to only the immediate ancestors offers potential space savings, but requires that the application be able to access the ancestor files, to keep tracking down further provenance information. However, your design should address the question of what happens when files may be deleted or renamed, or their names reused for a new file. You may find it interesting to look at how the Unix network file system (NFS) deals with inode reuse, as described in the textbook section 4.5.1.

Furthermore, even if the name is unique, it may be important that the name refers to a specific version of a file in time. Consider a user that copies a PowerPoint slide from file B to C, and then replaces that slide in B with a different slide from file A. If the provenance for this slide in file C points to the corresponding slide in B, and the provenance for B's slide now points to A, it may now appear that C's slide depends on A's slide, even though this could not have happened.

For unmodified applications, the system call API should be the same as a standard Unix system. For applications that have to be aware of the provenance (e.g., to specify precise dependencies for individual slides in PowerPoint), the OS could provide a function read_prov(fd, part) that would return the provenance for the specified part of the file, and a function write_prov(fd, part, provinfo) that would set the provenance for the specified part of the file. To access the provenance information, the user would use the read_prov(fd, part) system call to find files on which fd/part depends, and a search_prov(fd, part) system call to find files which depend on fd/part.

In your design, you may find it useful to assume a versioning file system, which stores all versions of each file over time. That is, you don't have to describe the versioning file system, but many of the important design issues still arise, even if you use a versioning file system. For example, you will need to explain when it is appropriate to garbage-collect old versions in such a design.

You may also find it useful to make use of file system extended attributes; the Wikipedia article offers a good summary.

III. Requirements

Your system must support the following use cases:

PowerPoint slide copying. Suppose a user Alice opens several PowerPoint presentations, copies several slides from each of them into a new PowerPoint presentation, and saves the new presentation file. Later on, she should be able to track down which presentation each slide in the new file came from, and to be able to open up that file. If Alice discovers a mistake in a PowerPoint slide, she should also be able to track down all copies of that slide in other files, without having to wade through unrelated files (e.g., files that contain a copy of a different slide from that presentation).
Compiling software. Suppose Alice compiles a large piece of software using make, and then installs the resulting binary onto her system. Alice should be able to later track down the sources she used to compile the binary. For extra credit, make it possible for Alice to track down the URL from which she downloaded the software in the first place.
Copying files. Alice should be able to copy files (such as PowerPoint files or compiled executables) with a program that has not been modified to know about provenance, and still be able to correctly track the provenance of the files afterwards. At least one design question you should address here is whether such a copy utility should propagate the provenance information from source to destination by value, or whether it should create a reference to the source file in the destination file's provenance information.
Handling tar/zip files. Alice should be able to pack several files into a single tar or zip file, extract such tar or zip files, and the provenance information should make sense. For example, suppose Alice creates a zip file containing two PowerPoint files, where the first file contains slides copied from the second file, and sends the zip file to Bob. When Bob extracts this zip file, he should be able to tell that one of the second file's slides came from the first file. It's OK to require some modifications to the zip and tar programs to support this. Separately, what should be the provenance on the zip file itself, after it is created? Similarly to the file copying application, should tar/zip propagate provenance by value or by reference?

You can limit your design to a single machine. That is, files imported from outside the machine start out with no provenance information.

Your design must be able to operate continuously. Because the system would continue to generate provenance information when files are modified, your design may need to periodically garbage-collect provenance information. In particular, your design should be able to handle a workload where the user continuously copies a PowerPoint file a.ppt to b.ppt, then copies b.ppt to c.ppt and deletes b.ppt, then copies c.ppt to d.ppt and deletes c.ppt, and so on.

Your design should store the provenance information persistently. That is, if the computer is turned off, and then turned on again, the provenance information must be available. You don't have to worry the computer crashing during normal operation.

Your design does not need to address buggy applications, malicious applications, or any security considerations. It's OK if your design allows a specially-designed application to erase the provenance of some files.

Your design should perform reasonably well, and should not require excessive disk or memory storage overheads, though it is up to you to decide and argue for what constitutes excessive overheads; the target of this design is a typical laptop or desktop computer system. You can assume that your disk requires a constant (but non-trivial) time to read or write any block, similar to an ideal solid-state disk.

Two particularly challenging performance workloads are search with a large file system, and continuous file copying. Your design should be able to sustain file copying at a rate of 10 file copies per second (including the garbage collection that must take place in long-term use). For a search workload, if the file system contains 1,000,000 files, it should take at most a few seconds to find which files were derived from some file A; you can assume that there are only a few such derived files.

Challenge: extend your design to track provenance information for files across machines. For example, given a file in the file system, a user should be able to track which email messages that file came from as an attachment. Similarly, to track down who received a copy of slides that the user sent as part of an attachment. Addressing this challenge is not necessary to get an A, but may be fun.

IV. Design Proposal

The design proposal should be a concise summary (800 words) of your overall system design.

The format of your proposal should be roughly as follows: it should begin with an overview, describing the overall approach. It should then summarize your proposed design and explain how it solves the problem and meets the requirements in this document.

The core of the proposal should be the design description. This should describe your design's interface (based on Ben's sketch), how you propose to implement that interface, and how you think that interface will be used by applications. You may make use of bullets and tables, but must introduce them with at least 1-2 sentences of context. This section must include at least one graphic, correctly formatted with a caption and brief description. For example, it may be a good idea to illustrate the design of your data structures used to track provenance, or the data structure used to garbage-collect provenance information.

The proposal should end with a summary of how the design meets the requirements (1-2 sentences). It should also include problems which remain to be resolved.

You do not have to present a detailed rationale or analysis in your proposal. However, if any of your design decisions are unusual (particularly creative, experimental, or risky) or if you deviate from the requirements, you should explain and justify those decisions in your proposal.

You will receive feedback on your proposal from your TA in time to adjust your final report. You will also receive a writing program grade for your proposal, as well as feedback from your writing instructor that will help you improve the writing of the final report.

V. Design Report

Your report should explain your design. It should discuss the major design decisions and tradeoffs you made, and justify your choices. It should discuss any limitations of which you are aware. You should assume that your report is being read by someone who has read this assignment, but has not thought carefully about this particular design problem. Give enough detail that your project can be turned over successfully to an implementation team. Your report should convince the reader that your design satisfies the requirements in Section III.

V.A. Report organization

Use this organization for your report:

Title page: Give your report a title that reflects the subject and scope of your project. Include your name, email address, recitation instructor, section time(s), and the date on the title page.
No table of contents is needed
Introduction: Summarize what your design is intended to achieve, outline the design, explain the major trade-offs and design decisions you have made, and justify those trade-offs and decisions.
Design: Explain your design. Identify your design's main components, state, and algorithms. You should sub-divide the design, with corresponding subsections in the text, so that the reader can focus on and understand one piece at a time. Explain why your design makes sense as well as explaining how it works. Use diagrams, pseudo-code, and worked examples as appropriate.
Analysis: Explain how you expect your design to behave in different use cases. What use cases might pose problems for throughput, latency, or even correctness? What do you expect to be the scalability limits of your design?
Conclusion: Briefly summarize your design and provide recommendations for further actions and a list of any problems that must be resolved before the design can be implemented.
Acknowledgments and references: Give credit to individuals whom you consulted in developing your design. Provide a list of references at the end using the IEEE citation-sequence system ("IEEE style") described in the Mayfield Handbook.
Word count. Please indicate the word count of your report at the end of the document. Captions of figures should be included in the total word count.
Footnotes. Please do not use footnotes in your report.

Here are a few tips:

Use ideas and terms from the textbook and papers when appropriate; this will save you space (you can refer the reader to the relevant section of the textbook) and will save the reader some effort.
Before you explain the solution to any given problem, say what the problem is.
Before presenting the details of any given design component, ensure that the purpose and requirements of that component are well described.
It's often valuable to illustrate an idea using an example, but an example is no substitute for a full explanation of the idea.
You may want to separate the explanation of a component's data structures from its algorithms to access or use those data structures.
Explain all figures, tables, and pseudo-code; explain what is being presented, and what conclusions the reader should draw.

The format and appearance of your report should conform to the DP1 style guide.

VI. How we evaluate your work

Your recitation and writing instructors will assign your report a grade that reflects both the design itself and how well your report presents the design. These are the main grading criteria.

The most important aspect of your design is that we can understand how it works and that you have clearly addressed the requirements and provided the elements listed in Sections III and IV. Complicated designs that we cannot understand will not be graded favorably. Some overall content considerations:

Design supports use cases and describes how they work.
Clean yet sufficient design, minimal mechanism.
Reasonable implementation detail.
Convincing performance evaluation.

Some writing considerations:

Is the report well-organized? Does it follow standard organizational conventions for technical reports? Are the grammar and language sound?
Do you use diagrams and/or figures appropriately? Are diagrams or figures appropriately labeled, referenced, and discussed in the text?
Does the report use the concepts, models, and terminology introduced in 6.033? If not, does it have a good reason for using different vocabulary?
Does the report address the intended audience?
Are references cited and used correctly?

VII. Revising Your Report

If you received a writing grade of B+ or less for your original report, you can revise your report to improve the writing grade. The maximum writing grade you can receive through revision is B+. The technical grade cannot be improved through revision. If you wish to revise, make an appointment to speak with your writing program lecturer no later than one week after you receive your writing grade for DP1. The steps for revision are:

You will agree on a section to revise; the grade for the revision of that section will replace your writing grade for DP1.
You will agree on a due date for that revision. Our aim is to provide enough time so that you can have the revised grade before drop date. If you do not submit the revised report by the due date agreed upon with your writing instructor, your original DP1 grade will become your final DP1 grade, and the revision will not count, although we can still provide you with feedback on your revised writing.
You will not receive much written feedback on the revision, but you are encouraged to discuss any questions you have with your writing program lecturer.
Your recitation instructor must approve the completed revision (to make sure no technical content has been broken in the process). Only your writing grade is changed by the revision, however.

VIII. Collaboration

This project is an individual effort. You are welcome to discuss the problem and ideas for solutions with your friends, but if you include any of their ideas in your solution you should explicitly give them credit. You must be the sole author of your report.

IX. Clarifications

Your design does not need to deal with failures (e.g., power outages, disk failures, etc.), or with security concerns (e.g., user changing the provenance information).
You can find previous examples of DP1 reports under the "Excellent Writing Examples" link on the left-hand side of the course home page.
Late submission grading policy: Your DP1 report is due before 5pm on March 22, 2011. If you submit your DP1 report late, we will penalize you one letter grade per 48 hours, starting from 5pm on the submission day. For example, if you submit the report anywhere from 1 minute to 48 hours late, and your report would have otherwise received a grade of "A", you will receive a "B"; if you submitted 49 hours late, you would receive a "C".
Design proposal grade: You will also receive a writing program grade for your proposal, as well as feedback from your writing instructor that will help you improve the writing of the final report.
Section VII. Revising Your Report has been updated with information about the revision process for this year.