6.033 2011 Design Project 1

I. Due Dates and Deliverables

There are three deliverables for Design Project 1:

A design proposal not exceeding 800 words, due on February 24th, 2011 (before recitation).
A design report not exceeding 2,500 words, due on March 17th, 2011 at 5pm.
An (optional) revision of your design report to improve your writing grade; you must set up a meeting with your writing instructor within one week of receiving your graded DP1 to take advantage of this.

All deliverables should be submitted via the online submission site.

As with real life system designs, 6.033 design projects are under-specified, and it is your job to complete the specification in a sensible way given the overall requirements of the project. As with designs in practice, the specifications often need some adjustment as the design is fleshed out. We recommend that you start early so that you can evolve your design over time. A good design is likely to take more than just a few days to put together.

II. The Problem

Ben Bitdiddle is tired of file system hierarchies. He has noticed that most users use search to find the files they are looking for, and no single hierarchy is right for every occasion. For example, there are many ways in which Ben can organize his photos. One may be by date. Another may be by location. A third alternative may be to separate photos by the people in the photo. A traditional file system hierarchy, such as storing the photos from each day in a separate directory, does not fit Ben's needs, if he wants to access photos from a specific location or containing some set of people. Today, some applications provide this functionality internally, but that organization is not visible to other applications, such as a third-party photo editor, uploader, etc. To fix this problem, Ben wants to explore a different design which has no path names, directories, or hierarchies at all. Instead, users and programs assign free-form tags to files, such as "location:Boston", "person:Ben", "date:2011-02-01", or "filename:IMG_3459.JPG".

Ben's proposed system call interface involves three key concepts. Devices are named by a 64-bit number that's unique on each computer, which is a local name for that device (e.g., the USB device plugged into the 5th USB port). Files are named by file IDs, a 64-bit number that's unique on each storage device. Finally, scopes are used to represent groups of files, and each scope is represented by a 64-bit ID. Devices are one kind of scope (and, infact, the ID of a device is the scope ID for the set of files on that device). The results of a search for all files tagged "location:Boston" is yet another scope, representing all of the files that have that tag.

Scopes are useful in limiting the set of files that an application searches over. Ben is thinking that scopes could be ephemeral; that is, scopes would disappear after a reboot, and creating a new scope would be cheap (it might not involve any disk I/O at all). In this way, scopes can be used to pass sets of files between applications at runtime.

Ben's initial sketch of this file system design includes the following system calls:

search takes as arguments a source scope ID, a list of tags, and a destination scope ID. This system call adds files matching all of the specified tags in the source scope to the destination scope.
create takes a device ID as an argument, and returns a file ID for a new file on that device.
delete takes a (device ID, file ID) as an argument, and deletes that file.
read and write take a (device ID, file ID) tuple, and an offset, and read or write data at that offset.
list takes a scope ID and returns a list of (device ID, file ID) tuples for files in that scope.
mkscope creates an empty scope, which can be populated using search, and returns its scope ID.
tag_add and tag_remove take a (device ID, file ID) and a tag name (string) as an argument, and add or remove that tag from the file.
tag_get takes a (device ID, file ID) as an argument, and returns the list of tags for that file.
device_list returns the list of currently plugged-in devices (i.e., the 64-bit IDs for the scope of each device).

III. Requirements

Your job in this design project is to come up with a coherent design for a system call API, based on Ben's initial sketch, and an overview of how you would implement your system calls (e.g., how is a scope represented internally?). Many aspects of Ben's sketch are not fully specified, and your design should address these issues. For example, does search copy the set of matching files into the destination scope, or does the destination scope now contain a dynamically-updating view of files matching the search criteria? Should tags be simple strings, or is it better to have tags be key-value pairs like "location": "Boston"? How should applications be notified when new files or devices become available? How do files get deleted? Feel free to change Ben's system calls as needed to improve your design (e.g., if you feel like scopes should be persistent across reboots, or if you don't think you need scopes at all, change it). As in Unix, keep in mind that you can choose to follow certain conventions in your applications or libraries that make your system work well, such as well-known tag names used by all applications.

You should describe precisely how the following use cases would work in your design (including how the application developer would have to use your system call interface, what tags the user may need to assign, etc):

A photo viewer is running, displaying the user's own photos in random order. The user plugs in a USB drive, and wants to tell the photo viewer to add the photos from the USB drive, which have been tagged ahead of time, to the set of photos it's showing.
The user wants to view all of the photos from Colorado taken in 2007, which may be stored on any device (as above), in the photo viewer. Make sure the performance is acceptable even if there many other files that don't match the search criteria.
The user inserts a (read-only) CD, which has a set of photos, but the files have no pre-existing tags. What does the user do to view these files in their photo viewer above?
The user wants to copy the photos from the CD to a new USB drive they've just plugged in.
The user types a command like "python" at a shell prompt. How does the shell find the "python" binary? A stray USB drive shouldn't cause the user to run the wrong "python" command. On the other hand, if the user has two drives containing system files, the "python" command may be stored on either of the two drives.
The user runs a Python interpreter, which loads Python modules. How does the Python interpreter locate its modules? What happens if there are two versions of the same module installed? Ideally, the latest version would be loaded.
The user installs two versions of Python, each with its own set of modules. How should the user invoke each version of Python to ensure each Python version gets its own modules? Recall that Python creates pre-compiled versions of its modules (i.e., pyc files). How does your system keep the pyc files for the two versions of Python separate from each other?

The best way to get started is probably for you to figure out how each of the above applications can use Ben's API, and, in cases where there isn't a clear match, figure out whether the API needs to change, or whether your applications can follow some convention (e.g., tag names) to make things work out. Once you've figured out how the applications can work with your (possibly modified) API, you should start thinking about how it can be implemented. You can assume that you already have access to a library of on-disk data structures, such as a B+tree.

To get you started, we will partially work out how Ben's API can be used for scenario #1 above (although your solution may be different, depending on how you modify Ben's API). Initially, the photos on the user's USB drive would be tagged with "type:photo". The user's applications all share a single scope that represents the user's own files, created when the user logs in. The scope is populated at login time with the user's files from the computer's hard drive, perhaps by running a search for files tagged "owner:Bob" when Bob logs in. The photo viewer works by periodically searching the user's scope for files tagged "type:photo". When the USB drive is plugged in, Bob uses the search system call on the USB drive's device scope, with no tag arguments, to import all of the files from the USB drive into his user scope (not unlike how the user would mount a file system in Unix). At this point, the photo viewer would find the photos in the user's scope. Note that in our example, neither the user nor the applications rely on path names in any way, and simply search for files by tags like "owner:Bob" and "type:photo". Part of your task for this design project is to work out all of the scenarios, including this one, in at least this much detail.

Finally, you should make sure that your design can perform well enough to be usable. For example, you should ensure that operations that only need to touch a small number of files are not unnecessarily slowed down by large numbers of unrelated files that exist in the same system. Specifically, you should evaluate the performance of your design under workloads 2, 5, and 6 from above. You can assume that your disk requires a constant (but non-trivial) time to read or write any block, similar to an ideal solid-state disk.

Optional challenge problem: augment your design to support free-form text searches in the same namespace as tags (e.g., an application can search for "text:kaashoek" to find files containing Prof. Kaashoek's name). Explain how your design indexes the contents of files, and how the index is updated incrementally when files are modified.

IV. Design Proposal

The design proposal should be a concise summary (800 words) of your overall system design.

The format of your proposal should be roughly as follows: it should begin with an overview, describing the overall approach. It should then summarize your proposed design and explain how it solves the problem and meets the requirements in this document.

The core of the proposal should be the design description. This should describe your design's interface (based on Ben's sketch), how you propose to implement that interface, and how you think that interface will be used by applications. You may make use of bullets and tables, but must introduce them with at least 1-2 sentences of context. This section must include at least one graphic, correctly formatted with a caption and brief description.

The proposal should end with a summary of how the design meets the requirements (1-2 sentences). It should also include problems which remain to be resolved.

You do not have to present a detailed rationale or analysis in your proposal. However, if any of your design decisions are unusual (particularly creative, experimental, or risky) or if you deviate from the requirements, you should explain and justify those decisions in your proposal.

You will receive feedback on your proposal from your TA in time to adjust your final report.

V. Design Report

Your report should explain your design. It should discuss the major design decisions and tradeoffs you made, and justify your choices. It should discuss any limitations of which you are aware. You should assume that your report is being read by someone who has read this assignment, but has not thought carefully about this particular design problem. Give enough detail that your project can be turned over successfully to an implementation team. Your report should convince the reader that your design satisfies the requirements in Section III.

V.A. Report organization

Use this organization for your report:

Title page: Give your report a title that reflects the subject and scope of your project. Include your name, email address, recitation instructor, section time(s), and the date on the title page.
No table of contents is needed
Introduction: Summarize what your design is intended to achieve, outline the design, explain the major trade-offs and design decisions you have made, and justify those trade-offs and decisions.
Design: Explain your design. Identify your design's main components, state, and algorithms. You should sub-divide the design, with corresponding subsections in the text, so that the reader can focus on and understand one piece at a time. Explain why your design makes sense as well as explaining how it works. Use diagrams, pseudo-code, and worked examples as appropriate.
Analysis: Explain how you expect your design to behave in different scenarios. What scenarios might pose problems for throughput, latency, or even correctness? What do you expect to be the scalability limits of your design?
Conclusion: Briefly summarize your design and provide recommendations for further actions and a list of any problems that must be resolved before the design can be implemented.
Acknowledgments and references: Give credit to individuals whom you consulted in developing your design. Provide a list of references at the end using the IEEE citation-sequence system ("IEEE style") described in the Mayfield Handbook.
Word count. Please indicate the word count of your report at the end of the document. Captions of figures should be included in the total word count.
Footnotes. Please do not use footnotes in your report.

Here are a few tips:

Use ideas and terms from the textbook and papers when appropriate; this will save you space (you can refer the reader to the relevant section of the textbook) and will save the reader some effort.
Before you explain the solution to any given problem, say what the problem is.
Before presenting the details of any given design component, ensure that the purpose and requirements of that component are well described.
It's often valuable to illustrate an idea using an example, but an example is no substitute for a full explanation of the idea.
You may want to separate the explanation of a component's data structures from its algorithms to access or use those data structures.
Explain all figures, tables, and pseudo-code; explain what is being presented, and what conclusions the reader should draw.

The format and appearance of your report should conform to the DP1 style guide.

VI. How we evaluate your work

Your recitation and writing instructors will assign your report a grade that reflects both the design itself and how well your report presents the design. These are the main grading criteria.

The most important aspect of your design is that we can understand how it works and that you have clearly addressed the requirements and provided the elements listed in Sections III and IV. Complicated designs that we cannot understand will not be graded favorably. Some overall content considerations:

Design supports scenarios and describes how they work.
Clean yet sufficient design, minimal mechanism.
Reasonable implementation detail.
Convincing performance evaluation.

Some writing considerations:

Is the report well-organized? Does it follow standard organizational conventions for technical reports? Are the grammar and language sound?
Do you use diagrams and/or figures appropriately? Are diagrams or figures appropriately labeled, referenced, and discussed in the text?
Does the report use the concepts, models, and terminology introduced in 6.033? If not, does it have a good reason for using different vocabulary?
Does the report address the intended audience?
Are references cited and used correctly?

VII. Revising Your Report

If you received a writing grade of 90 (out of 100) or less for your original report, you can revise your report to improve the writing grade. The maximum writing grade you can receive through revision is 90. The technical grade cannot be improved through revision. Before you submit your revision, you must set up an appointment with your writing instructor to agree on a plan for your revision within one week of receiving your grade on DP1. You writing instructor will work out a time when your revision is due, typically one week after your meeting. If you do not submit the revised report by the due date agreed upon with your writing instructor, your original DP1 grade will become your final DP1 grade, and the revision will not count, although we can still provide you with feedback on your revised writing.

VIII. Collaboration

This project is an individual effort. You are welcome to discuss the problem and ideas for solutions with your friends, but if you include any of their ideas in your solution you should explicitly give them credit. You must be the sole author of your report.

IX. Clarifications

Your design does not need to deal with hardware failures, or with security concerns.
"Usable performance" means that an application running under your design is (roughly) at most twice as slow as a similar application under Unix. We are interested in a coherent design and a good performance evaluation more than we are interested in high performance.
You can find previous examples of DP1 reports under the "Excellent Writing Examples" link on the left-hand side of the course home page.
Some students asked about using a SQL database for tag storage. While it's OK to build your design around a SQL database, keep in mind two things. First, you will have to describe what relations/tables your SQL database would contain, and what queries you would perform on it. Second, for the analysis, will need to make some assumptions about how the SQL database stores your data and executes your queries. As a result, using a SQL database may not be as big of a simplification as you might initially think. However, thinking of a SQL database may still be a useful approach in working through different aspects of your design.
Late submission grading policy: Your DP1 report is due before 5pm on March 17th, 2011. If you submit your DP1 report late, we will penalize you one letter grade per 48 hours, starting from 5pm on the submission day. For example, if you submit the report anywhere from 1 minute to 48 hours late, and your report would have otherwise received a grade of "A", you will receive a "B"; if you submitted 49 hours late, you would receive a "C".