6.033 2011 Design Project 1

I. Due Dates and Deliverables

There are three deliverables for Design Project 1:
  1. A design proposal not exceeding 800 words, due on February 24th, 2011 (before recitation).
  2. A design report not exceeding 2,500 words, due on March 17th, 2011 at 5pm.
  3. An (optional) revision of your design report to improve your writing grade; you must set up a meeting with your writing instructor within one week of receiving your graded DP1 to take advantage of this.

All deliverables should be submitted via the online submission site.

As with real life system designs, 6.033 design projects are under-specified, and it is your job to complete the specification in a sensible way given the overall requirements of the project. As with designs in practice, the specifications often need some adjustment as the design is fleshed out. We recommend that you start early so that you can evolve your design over time. A good design is likely to take more than just a few days to put together.

II. The Problem

Ben Bitdiddle is tired of file system hierarchies. He has noticed that most users use search to find the files they are looking for, and no single hierarchy is right for every occasion. For example, there are many ways in which Ben can organize his photos. One may be by date. Another may be by location. A third alternative may be to separate photos by the people in the photo. A traditional file system hierarchy, such as storing the photos from each day in a separate directory, does not fit Ben's needs, if he wants to access photos from a specific location or containing some set of people. Today, some applications provide this functionality internally, but that organization is not visible to other applications, such as a third-party photo editor, uploader, etc. To fix this problem, Ben wants to explore a different design which has no path names, directories, or hierarchies at all. Instead, users and programs assign free-form tags to files, such as "location:Boston", "person:Ben", "date:2011-02-01", or "filename:IMG_3459.JPG".

Ben's proposed system call interface involves three key concepts. Devices are named by a 64-bit number that's unique on each computer, which is a local name for that device (e.g., the USB device plugged into the 5th USB port). Files are named by file IDs, a 64-bit number that's unique on each storage device. Finally, scopes are used to represent groups of files, and each scope is represented by a 64-bit ID. Devices are one kind of scope (and, infact, the ID of a device is the scope ID for the set of files on that device). The results of a search for all files tagged "location:Boston" is yet another scope, representing all of the files that have that tag.

Scopes are useful in limiting the set of files that an application searches over. Ben is thinking that scopes could be ephemeral; that is, scopes would disappear after a reboot, and creating a new scope would be cheap (it might not involve any disk I/O at all). In this way, scopes can be used to pass sets of files between applications at runtime.

Ben's initial sketch of this file system design includes the following system calls:

III. Requirements

Your job in this design project is to come up with a coherent design for a system call API, based on Ben's initial sketch, and an overview of how you would implement your system calls (e.g., how is a scope represented internally?). Many aspects of Ben's sketch are not fully specified, and your design should address these issues. For example, does search copy the set of matching files into the destination scope, or does the destination scope now contain a dynamically-updating view of files matching the search criteria? Should tags be simple strings, or is it better to have tags be key-value pairs like "location": "Boston"? How should applications be notified when new files or devices become available? How do files get deleted? Feel free to change Ben's system calls as needed to improve your design (e.g., if you feel like scopes should be persistent across reboots, or if you don't think you need scopes at all, change it). As in Unix, keep in mind that you can choose to follow certain conventions in your applications or libraries that make your system work well, such as well-known tag names used by all applications.

You should describe precisely how the following use cases would work in your design (including how the application developer would have to use your system call interface, what tags the user may need to assign, etc):

  1. A photo viewer is running, displaying the user's own photos in random order. The user plugs in a USB drive, and wants to tell the photo viewer to add the photos from the USB drive, which have been tagged ahead of time, to the set of photos it's showing.
  2. The user wants to view all of the photos from Colorado taken in 2007, which may be stored on any device (as above), in the photo viewer. Make sure the performance is acceptable even if there many other files that don't match the search criteria.
  3. The user inserts a (read-only) CD, which has a set of photos, but the files have no pre-existing tags. What does the user do to view these files in their photo viewer above?
  4. The user wants to copy the photos from the CD to a new USB drive they've just plugged in.
  5. The user types a command like "python" at a shell prompt. How does the shell find the "python" binary? A stray USB drive shouldn't cause the user to run the wrong "python" command. On the other hand, if the user has two drives containing system files, the "python" command may be stored on either of the two drives.
  6. The user runs a Python interpreter, which loads Python modules. How does the Python interpreter locate its modules? What happens if there are two versions of the same module installed? Ideally, the latest version would be loaded.
  7. The user installs two versions of Python, each with its own set of modules. How should the user invoke each version of Python to ensure each Python version gets its own modules? Recall that Python creates pre-compiled versions of its modules (i.e., pyc files). How does your system keep the pyc files for the two versions of Python separate from each other?

The best way to get started is probably for you to figure out how each of the above applications can use Ben's API, and, in cases where there isn't a clear match, figure out whether the API needs to change, or whether your applications can follow some convention (e.g., tag names) to make things work out. Once you've figured out how the applications can work with your (possibly modified) API, you should start thinking about how it can be implemented. You can assume that you already have access to a library of on-disk data structures, such as a B+tree.

To get you started, we will partially work out how Ben's API can be used for scenario #1 above (although your solution may be different, depending on how you modify Ben's API). Initially, the photos on the user's USB drive would be tagged with "type:photo". The user's applications all share a single scope that represents the user's own files, created when the user logs in. The scope is populated at login time with the user's files from the computer's hard drive, perhaps by running a search for files tagged "owner:Bob" when Bob logs in. The photo viewer works by periodically searching the user's scope for files tagged "type:photo". When the USB drive is plugged in, Bob uses the search system call on the USB drive's device scope, with no tag arguments, to import all of the files from the USB drive into his user scope (not unlike how the user would mount a file system in Unix). At this point, the photo viewer would find the photos in the user's scope. Note that in our example, neither the user nor the applications rely on path names in any way, and simply search for files by tags like "owner:Bob" and "type:photo". Part of your task for this design project is to work out all of the scenarios, including this one, in at least this much detail.

Finally, you should make sure that your design can perform well enough to be usable. For example, you should ensure that operations that only need to touch a small number of files are not unnecessarily slowed down by large numbers of unrelated files that exist in the same system. Specifically, you should evaluate the performance of your design under workloads 2, 5, and 6 from above. You can assume that your disk requires a constant (but non-trivial) time to read or write any block, similar to an ideal solid-state disk.

Optional challenge problem: augment your design to support free-form text searches in the same namespace as tags (e.g., an application can search for "text:kaashoek" to find files containing Prof. Kaashoek's name). Explain how your design indexes the contents of files, and how the index is updated incrementally when files are modified.

IV. Design Proposal

The design proposal should be a concise summary (800 words) of your overall system design.

The format of your proposal should be roughly as follows: it should begin with an overview, describing the overall approach. It should then summarize your proposed design and explain how it solves the problem and meets the requirements in this document.

The core of the proposal should be the design description. This should describe your design's interface (based on Ben's sketch), how you propose to implement that interface, and how you think that interface will be used by applications. You may make use of bullets and tables, but must introduce them with at least 1-2 sentences of context. This section must include at least one graphic, correctly formatted with a caption and brief description.

The proposal should end with a summary of how the design meets the requirements (1-2 sentences). It should also include problems which remain to be resolved.

You do not have to present a detailed rationale or analysis in your proposal. However, if any of your design decisions are unusual (particularly creative, experimental, or risky) or if you deviate from the requirements, you should explain and justify those decisions in your proposal.

You will receive feedback on your proposal from your TA in time to adjust your final report.

V. Design Report

Your report should explain your design. It should discuss the major design decisions and tradeoffs you made, and justify your choices. It should discuss any limitations of which you are aware. You should assume that your report is being read by someone who has read this assignment, but has not thought carefully about this particular design problem. Give enough detail that your project can be turned over successfully to an implementation team. Your report should convince the reader that your design satisfies the requirements in Section III.

V.A. Report organization

Use this organization for your report:

Here are a few tips:

The format and appearance of your report should conform to the DP1 style guide.

VI. How we evaluate your work

Your recitation and writing instructors will assign your report a grade that reflects both the design itself and how well your report presents the design. These are the main grading criteria.

The most important aspect of your design is that we can understand how it works and that you have clearly addressed the requirements and provided the elements listed in Sections III and IV. Complicated designs that we cannot understand will not be graded favorably. Some overall content considerations:

Some writing considerations:

VII. Revising Your Report

If you received a writing grade of 90 (out of 100) or less for your original report, you can revise your report to improve the writing grade. The maximum writing grade you can receive through revision is 90. The technical grade cannot be improved through revision. Before you submit your revision, you must set up an appointment with your writing instructor to agree on a plan for your revision within one week of receiving your grade on DP1. You writing instructor will work out a time when your revision is due, typically one week after your meeting. If you do not submit the revised report by the due date agreed upon with your writing instructor, your original DP1 grade will become your final DP1 grade, and the revision will not count, although we can still provide you with feedback on your revised writing.

VIII. Collaboration

This project is an individual effort. You are welcome to discuss the problem and ideas for solutions with your friends, but if you include any of their ideas in your solution you should explicitly give them credit. You must be the sole author of your report.

IX. Clarifications