6.033 2011 Design Project 1
I. Due Dates and Deliverables
There are three deliverables for Design Project 1:
- A design proposal not exceeding 800 words,
due on February 24th, 2011 (before recitation).
- A design report not exceeding 2,500 words,
due on March 17th, 2011 at 5pm.
- An (optional) revision of your design report to
improve your writing grade; you must set up a
meeting with your writing instructor within one
week of receiving your graded DP1 to take advantage of this.
All deliverables should be submitted via the
online submission site.
As with real life system designs, 6.033 design projects are
under-specified, and it is your job to complete the specification in a
sensible way given the overall requirements of the project. As with
designs in practice, the specifications often need some adjustment as
the design is fleshed out. We recommend that you start
early so that you can evolve your design over time. A good design is likely to
take more than just a few days to put together.
II. The Problem
Ben Bitdiddle is tired of file system hierarchies. He has noticed
that most users use search to find the files they are looking for,
and no single hierarchy is right for every occasion. For example,
there are many ways in which Ben can organize his photos. One may
be by date. Another may be by location. A third alternative may
be to separate photos by the people in the photo. A traditional
file system hierarchy, such as storing the photos from each day in
a separate directory, does not fit Ben's needs, if he wants to access
photos from a specific location or containing some set of people.
Today, some applications provide this functionality internally, but
that organization is not visible to other applications, such as a
third-party photo editor, uploader, etc. To fix this problem, Ben
wants to explore a different design which has no path names,
directories, or hierarchies at all. Instead, users and programs
assign free-form tags to files, such as "location:Boston",
"person:Ben", "date:2011-02-01", or "filename:IMG_3459.JPG".
Ben's proposed system call interface involves three key concepts.
Devices are named by a 64-bit number that's unique on each computer,
which is a local name for that device (e.g., the USB device plugged
into the 5th USB port). Files are named by file IDs, a 64-bit number
that's unique on each storage device. Finally, scopes are used to
represent groups of files, and each scope is represented by a 64-bit
ID. Devices are one kind of scope (and, infact, the ID of a device
is the scope ID for the set of files on that device).
The results of a search for all
files tagged "location:Boston" is yet another scope, representing all
of the files that have that tag.
Scopes are useful in limiting the set of files that an application
searches over. Ben is thinking that scopes could be ephemeral; that
is, scopes would disappear after a reboot, and creating a new scope
would be cheap (it might not involve any disk I/O at all). In this
way, scopes can be used to pass sets of files between applications
at runtime.
Ben's initial sketch of this file system design includes the following
system calls:
- search takes as arguments a source scope ID, a list
of tags, and a destination scope ID. This system call adds
files matching all of the specified tags in the source scope
to the destination scope.
- create takes a device ID as an argument, and returns a
file ID for a new file on that device.
- delete takes a (device ID, file ID) as an argument,
and deletes that file.
- read and write take a (device ID, file ID)
tuple, and an offset, and read or write data at that offset.
- list takes a scope ID and returns a list of (device ID,
file ID) tuples for files in that scope.
- mkscope creates an empty scope, which can be populated
using search, and returns its scope ID.
- tag_add and tag_remove take a (device ID, file
ID) and a tag name (string) as an argument, and add or remove
that tag from the file.
- tag_get takes a (device ID, file ID) as an argument,
and returns the list of tags for that file.
- device_list returns the list of currently plugged-in
devices (i.e., the 64-bit IDs for the scope of each device).
III. Requirements
Your job in this design project is to come up with a coherent design
for a system call API, based on Ben's initial sketch, and an overview
of how you would implement your system calls (e.g., how is a scope
represented internally?). Many aspects of Ben's sketch are not fully
specified, and your design should address these issues.
For example, does search copy the set of matching files
into the destination scope, or does the destination scope now contain
a dynamically-updating view of files matching the search criteria?
Should tags be simple strings, or is it better to have tags be
key-value pairs like "location": "Boston"? How should applications
be notified when new files or devices become available? How do
files get deleted? Feel free to change Ben's
system calls as needed to improve your design (e.g., if you feel like
scopes should be persistent across reboots, or if you don't think you
need scopes at all, change it).
As in Unix, keep in mind that
you can choose to follow certain conventions in your applications or
libraries that make your system work well, such as well-known tag
names used by all applications.
You should describe precisely how
the following use cases would work in your design (including how the
application developer would have to use your system call interface,
what tags the user may need to assign, etc):
-
A photo viewer is running, displaying the user's own photos
in random order. The user plugs in a USB drive, and wants to tell
the photo viewer to add the photos from the USB drive, which
have been tagged ahead of time, to the set of photos it's showing.
-
The user wants to view all of the photos from Colorado taken
in 2007, which may be stored on any device (as above), in the
photo viewer. Make sure the performance is acceptable even if
there many other files that don't match the search criteria.
-
The user inserts a (read-only) CD, which has a set of photos,
but the files have no pre-existing tags. What does the user do
to view these files in their photo viewer above?
-
The user wants to copy the photos from the CD to a new USB
drive they've just plugged in.
-
The user types a command like "python" at a shell prompt. How does
the shell find the "python" binary? A stray USB drive shouldn't cause
the user to run the wrong "python" command. On the other hand, if the
user has two drives containing system files, the "python" command may
be stored on either of the two drives.
-
The user runs a Python interpreter, which loads Python modules.
How does the Python interpreter locate its modules? What happens
if there are two versions of the same module installed? Ideally,
the latest version would be loaded.
-
The user installs two versions of Python, each with its own set
of modules. How should the user invoke each version of Python to
ensure each Python version gets its own modules? Recall that Python
creates pre-compiled versions of its modules (i.e., pyc files).
How does your system keep the pyc files for the two versions of
Python separate from each other?
The best way to get started is probably for you to figure out how each
of the above applications can use Ben's API, and, in cases where there
isn't a clear match, figure out whether the API needs to change, or
whether your applications can follow some convention (e.g., tag names)
to make things work out. Once you've figured out how the applications
can work with your (possibly modified) API, you should start thinking
about how it can be implemented. You can assume that you already
have access to a library of on-disk data structures, such as a B+tree.
To get you started, we will partially work out how Ben's API can be
used for scenario #1 above (although your solution may be different,
depending on how you modify Ben's API). Initially, the photos
on the user's USB drive would be tagged with "type:photo". The user's
applications all share a single scope that represents the user's own
files, created when the user logs in. The scope is populated at login
time with the user's files from the computer's hard drive, perhaps by
running a search for files tagged "owner:Bob" when Bob logs in. The
photo viewer works by periodically searching the user's scope for files
tagged "type:photo". When the USB drive is plugged in, Bob uses the
search system call on the USB drive's device scope, with no tag arguments,
to import all of the files from the USB drive into his user scope
(not unlike how the user would mount a file system in Unix). At this
point, the photo viewer would find the photos in the user's scope.
Note that in our example, neither the user nor the applications rely on
path names in any way, and simply search for files by tags like "owner:Bob"
and "type:photo". Part of your task for this design project is to work out
all of the scenarios, including this one, in at least this much detail.
Finally, you should make sure that your design can perform well enough
to be usable. For example, you should ensure that operations that
only need to touch a small number of files are not unnecessarily
slowed down by large numbers of unrelated files that exist in the
same system. Specifically, you should evaluate the performance of
your design under workloads 2, 5, and 6 from above. You can assume
that your disk requires a constant (but non-trivial) time to read or
write any block, similar to an ideal solid-state disk.
Optional challenge problem: augment your design to support
free-form text searches in the same namespace as tags (e.g., an application
can search for "text:kaashoek" to find files containing Prof. Kaashoek's
name). Explain how your design indexes the contents of files, and how the
index is updated incrementally when files are modified.
IV. Design Proposal
The design proposal should be a concise summary (800 words) of your
overall system design.
The format of your proposal should be roughly as follows: it should
begin with an overview, describing the overall approach. It should
then summarize your proposed design and explain how it solves the
problem and meets the requirements in this document.
The core of the proposal should be the design description. This
should describe your design's interface (based on Ben's sketch),
how you propose to implement that interface, and how you think
that interface will be used by applications. You may make use
of bullets and tables, but must introduce them with at least 1-2
sentences of context. This section must include at least one
graphic, correctly formatted with a caption and brief description.
The proposal should end with a summary of how
the design meets the requirements (1-2 sentences). It should also
include problems which remain to be resolved.
You do not have to present a detailed rationale or analysis in your
proposal. However, if any of your design decisions are unusual
(particularly creative, experimental, or risky) or if you deviate from
the requirements, you should explain and justify those decisions in
your proposal.
You will receive feedback on your proposal from your TA in time to
adjust your final report.
V. Design Report
Your report should explain your design. It should discuss the major
design decisions and tradeoffs you made, and justify your choices.
It should discuss any limitations of which you are aware. You should
assume that your report is being read by someone who has read this
assignment, but has not thought carefully about this particular
design problem. Give enough detail that your project can be turned
over successfully to an implementation team. Your report should
convince the reader that your design satisfies the requirements
in Section III.
V.A. Report organization
Use this organization for your report:
- Title page: Give your report a title that reflects the subject
and scope of your project. Include your name, email address,
recitation instructor, section time(s), and the date on the
title page.
- No table of contents is needed
- Introduction: Summarize what your design is intended to achieve,
outline the design, explain the major trade-offs and design
decisions you have made, and justify those trade-offs and decisions.
- Design: Explain your design. Identify your design's main components,
state, and algorithms. You should sub-divide the design, with
corresponding subsections in the text, so that the reader can
focus on and understand one piece at a time. Explain why your
design makes sense as well as explaining how it works. Use
diagrams, pseudo-code, and worked examples as appropriate.
- Analysis: Explain how you expect your design to behave in different
scenarios. What scenarios might pose problems for throughput, latency,
or even correctness? What do you expect to be the scalability limits
of your design?
- Conclusion: Briefly summarize your design and provide recommendations
for further actions and a list of any problems that must be resolved
before the design can be implemented.
- Acknowledgments and references: Give credit to individuals whom
you consulted in developing your design. Provide a list of references
at the end using the IEEE citation-sequence system ("IEEE style")
described in the
Mayfield Handbook.
- Word count. Please indicate the word count of your report at the end
of the document. Captions of figures should be included in the total
word count.
- Footnotes. Please do not use footnotes in your report.
Here are a few tips:
- Use ideas and terms from the textbook and papers when appropriate;
this will save you space (you can refer the reader to the relevant
section of the textbook) and will save the reader some effort.
- Before you explain the solution to any given problem, say what the
problem is.
- Before presenting the details of any given design component, ensure
that the purpose and requirements of that component are well described.
- It's often valuable to illustrate an idea using an example, but an
example is no substitute for a full explanation of the idea.
- You may want to separate the explanation of a component's data
structures from its algorithms to access or use those data structures.
- Explain all figures, tables, and pseudo-code; explain what is being
presented, and what conclusions the reader should draw.
The format and appearance of your report should conform to the
DP1 style guide.
VI. How we evaluate your work
Your recitation and writing instructors will assign your report a grade
that reflects both the design itself and how well your report presents
the design. These are the main grading criteria.
The most important aspect of your design is that we can understand how
it works and that you have clearly addressed the requirements and
provided the elements listed in Sections III and IV. Complicated
designs that we cannot understand will not be graded favorably.
Some overall content considerations:
- Design supports scenarios and describes how they work.
- Clean yet sufficient design, minimal mechanism.
- Reasonable implementation detail.
- Convincing performance evaluation.
Some writing considerations:
- Is the report well-organized? Does it follow standard
organizational conventions for technical reports? Are
the grammar and language sound?
- Do you use diagrams and/or figures appropriately? Are
diagrams or figures appropriately labeled, referenced,
and discussed in the text?
- Does the report use the concepts, models, and terminology
introduced in 6.033? If not, does it have a good reason
for using different vocabulary?
- Does the report address the intended audience?
- Are references cited and used correctly?
VII. Revising Your Report
If you received a writing grade of 90 (out of 100) or less for your
original report, you can revise your report to improve the writing
grade. The maximum writing grade you can receive through revision
is 90. The technical grade cannot be improved through revision.
Before you submit your revision, you must set up an appointment with
your writing instructor to agree on a plan for your revision within
one week of receiving your grade on DP1. You writing instructor
will work out a time when your revision is due, typically one week
after your meeting. If you do not submit the revised report by the
due date agreed upon with your writing instructor, your original DP1
grade will become your final DP1 grade, and the revision will not
count, although we can still provide you with feedback on your
revised writing.
VIII. Collaboration
This project is an individual effort. You are welcome to discuss
the problem and ideas for solutions with your friends, but if you
include any of their ideas in your solution you should explicitly
give them credit. You must be the sole author of your report.
IX. Clarifications
- Your design does not need to deal with hardware failures, or with
security concerns.
- "Usable performance" means that an application running under your
design is (roughly) at most twice as slow as a similar application
under Unix. We are interested in a coherent design and a good
performance evaluation more than we are interested in high
performance.
- You can find previous examples of DP1 reports under the "Excellent
Writing Examples" link on the left-hand side of the course home
page.
- Some students asked about using a SQL database for tag storage.
While it's OK to build your design around a SQL database, keep
in mind two things. First, you will have to describe what
relations/tables your SQL database would contain, and what
queries you would perform on it. Second, for the analysis,
will need to make some assumptions about how the SQL database
stores your data and executes your queries. As a result, using
a SQL database may not be as big of a simplification as you might
initially think. However, thinking of a SQL database may still
be a useful approach in working through different aspects of your
design.
- Late submission grading policy: Your DP1 report is due before
5pm on March 17th, 2011. If you submit your DP1 report late,
we will penalize you one letter grade per 48 hours, starting from
5pm on the submission day. For example, if you submit the report
anywhere from 1 minute to 48 hours late, and your report would have
otherwise received a grade of "A", you will receive a "B"; if you
submitted 49 hours late, you would receive a "C".