6.033 Spring 2014 Design Project 1: Search Folders on UNIX

Revisions

2/4/2014 -- Fixed examples to that referred to non-existent "openDirectory" method.
2/20/2014 -- Added requirement that you support search by file name

Due Dates and Deliverables

There are three deliverables for Design Project 1:

A design memo, due on Feb 14, 2014 (11.59pm).
A design proposal not exceeding 800 words, due on February 28, 2014 at 5pm.
A design report not exceeding 2,500 words, due on March 21, 2014 at 5pm.

All deliverables should be submitted via the online submission site. As with real-life system designs, 6.033 design projects are under-specified, and it is your job to complete the specification in a sensible way given the overall requirements of the project. As with designs in practice, the specifications often need some adjustment as the design is fleshed out. We recommend that you start early so that you can evolve your design over time. A good design is likely to take more than just a few days to put together.

Late submission grading policy: If you submit your DP1 report late, we will penalize you one letter grade per 48 hours, starting from 5pm on the submission day. For example, if you submit the report anywhere from 1 minute to 48 hours late, and your report would have otherwise received a grade of "A", you will receive a "B"; if you submitted 49 hours late, you would receive a "C".

The Problem

Your job is to design a concept similar to Smart Folders on MacOS or Search Folders on Windows into the UNIX file system. The goal is for the user to be able to create special "virtual" directories that contain the results of a file search query as symbolic links to the original files (see Section 2.5/Page 104 of the textbook for a discussion of symbolic links). The main challenge is to do this in a manner that maintains performance of both the virtual directories that contain the search results and the directories that are being searched.

Note that symbolic links are different from the "hard" links described in the UNIX paper; unlike hard links, symbolic links simply encode the path of the file they point to. The file may or may not actually exist (so symbolic links are not removed if the file they point to is removed, and they don't modify the link count in a file's i-node). Symbolic links may also point to files that are on volumes that are not currently mounted.

Your search folders interface will take the form of a command-line tool with the following usage:


  searchmount mountpoint searchpath [expression]

searchmount -u mountpoint

The first command creates a new directory mountpoint and makes the new directory appear to contain symbolic links to files under searchpath matching the search query given in expression. The second command removes the mountpoint directory.

The expression given to the first command is in a format similar to the one accepted by the UNIX find command. Several variations of the find command exist depending on whether you are using Linux, BSD, MacOS etc., but your system must support at least the following kinds of expressions:

File names that are an exact match for some string, or that contain a particular substring.
File size is greater than, less than, or equal to a given value.
Time of file's creation, last modification, or last use is greater than, less than, or equal to a given time.
File's user or group ID is equal to or not equal to a given value.
File's permission bits match a specified value exactly, or at least one of the specified bits are set, or all of the specified bits are set.
Boolean combinations of the above (e.g. with the "operators" supported by find).

Note that the directory to be searched should be searched recursively, that is, the search should include files both in the searched directory and in its subdirectories, and so on.

As an example, consider a directory named /Users/jsmith/Research, containing many files in various subdirectories. The user can find files larger than 10 megabytes contained within this directory using the traditional find command:

$ find /Users/jsmith/Research -size +10M
/Users/jsmith/Research/References/In Spreadsheet/quan_haystackthesis.pdf
/Users/jsmith/Research/References/Other external/Example tailor-made/Artifax/eventconcerthallsmall.wmv
/Users/jsmith/Research/Work/130924 InfoVis/Slides V2.pptx
/Users/jsmith/Research/Work/130924 InfoVis/Slides V3.pptx
/Users/jsmith/Research/Work/130924 InfoVis/Slides.ppt

Using the searchmount command, the user can use a similar syntax to create a directory /Users/jsmith/BigResearchFiles that always appears to contain symbolic links to files in the /Users/jsmith/Research directory that are larger than 10 megabytes:

$ searchmount /Users/jsmith/BigResearchFiles /Users/jsmith/Research -size +10M
$ ls -l /Users/jsmith/BigResearchFiles
Slides V2.pptx -> /Users/jsmith/Research/Work/130924 InfoVis/Slides V2.pptx
Slides V3.pptx -> /Users/jsmith/Research/Work/130924 InfoVis/Slides V3.pptx
Slides.ppt -> /Users/jsmith/Research/Work/130924 InfoVis/Slides.ppt
eventconcerthallsmall.wmv -> /Users/jsmith/Research/References/Other external/Example tailor-made/Artifax/eventconcerthallsmall.wmv
quan_haystackthesis.pdf -> /Users/jsmith/Research/References/In Spreadsheet/quan_haystackthesis.pdf

From this point on, as files are modified, added, or removed in the original /Users/jsmith/Research directory, the apparent contents of /Users/jsmith/BigResearchFiles should be updated to reflect the results of the search query.

Note the search directories do not contain other directories, only files.

The search folders system will be implemented as a "removable file system" of the kind mentioned in the Section 3.4 of the UNIX paper. We will now give more details about the specific interfaces you will need to implement for the purposes of this exercise. After the user creates a search directory with searchmount, the operating system will invoke a collection of methods that you must design to create and read the contents of your directory. These methods are as follows:

DirEntryPointer getFirstDirectoryEntry(name): Get a pointer to the first entry in the directory with the given name, or null if there are no files in the directory. Called by the operating system when it wants to begin iterating over the files in your virtual file system. Note that there is only one top-level directory in your virtual search file system. So suppose your virtual file system is mounted on /mnt/bigfiles and this directory contains the files song1.mp3, song2.mp3, and song3.mp3, then getFirstDirectoryEntry("/mnt/bigfiles") will return a DirEntryPointer representing the directory entry for song1.mp3.
DirEntryPointer readNextDirectoryEntry(DirEntryPointer previousFile): Get a pointer to the next directory entry after previousFile in the directory containing said file, or null if there are no more files in the directory. To iterate over all the files in the directory, the operating system first calls getFirstDirectoryEntry to get the first directory entry and then calls readNextDirectoryEntry successively with the previous DirEntryPointer to retrieve additional entries. Continuing the example above, readNextDirectoryEntry(getFirstDirectoryEntry("/mnt/bigfiles")) would return a DirEntryPointer representing the directory entry for song2.mp3.
String fileName(DirEntryPointer entry): Return the name of the file represented by directory entry entry. Called when the operating system wants to get the name to display for a file listed with getFirstDirectoryEntry or readNextDirectoryEntry. Continuing our example, fileName(getFirstDirectoryEntry("/mnt/bigfiles")) would return the string "song1.mp3".
String readSymbolicLink(DirEntryPointer entry): Return the target of a symbolic link stored in the directory entry pointed to by entry. Called when the operating system wants to get the path of the file that directory entry entry on your virtual file system points to. So readSymbolicLink(getFirstDirectoryEntry("/mnt/bigfiles")) might return a string like "/home/john/Audio/song1.mp3".

The operating system will not look at the contents of the DirEntryPointer structure that you return from getFirstDirectoryEntry or readNextDirectoryEntry. However, you should make clear what you need to store in DirEntryPointer yourself in order to implement the methods above.

Note that the methods above are different than those described in Section 2.5 of the textbook in that they explicitly differentiate between directories and files, whereas the methods in the textbook specify that directories are nothing more than files with a particular structure. The advantage of the above design is that it allows you to design your file system without requiring the creation of inodes that contain the names of linked files.

Your design paper should clearly describe your design for these methods; this includes any on-disk data structures, in-memory data structures, and operations that are done at specific times in order to support the required operations.

In addition, you should describe what happens when the searchmount utility is initially run; specifically, describe the behavior of the createDirectory method, which is part of the searchmount utility:

createDirectory(name, searchPath, expression): Called when the user initially mounts the file system name for the specified search path with the specified search expression. You can assume that this method can call into the operating system to create a new, removable file system that exposes the four methods described above to search its contents.

In reality, implementing a virtual file system requires you to implement many other methods (e.g. to list permissions for directory entries, to attempt to create new files, etc.). On Linux, a virtual file system like this would be most easily implemented using a the FUSE interface. You are not required to specify a complete set of file system functions like those defined by FUSE, only the methods we have listed above. You also do not need to describe the implementation of the searchmount utility beyond the functioning of the createDirectorymethod.

Requirements

Your design should consider the following use cases:

Users interactively using searchmount, e.g., creating a search folder to search a never-before-searched target directory and immediately examining the resulting files, when the target directory contains a modest number of files (say, 300).
Users interactively using searchmount to search the system's root directory ("/"), which may contain a very large number of files (say, 100,000).
Users mounting search folders automatically at startup or login. This means that the createDirectory command should not take a long time to complete. It may also be desirable to avoid redoing a whole search at every reboot.

You should analyze your solution these use cases. Explain both the performance of access to the search folder itself and the performance impact on the rest of the system, especially the searched target directories. Keep in mind:

Many search folders may be defined to search the same target path.
Searched target directories may be very large, and contain many files. Doing a hard drive seek for every inode on the searched target file system may be slow.
There may be any number of results in the search folder, depending on the search query and the contents of the search target directory.
Searched target directories may be on any kind of file system; their exact implementation is beyond your control.
A search folder should not significantly impact the performance of the searched target directory.
Search results in the search folder should, as much as possible, be up to date.

Your design does not need to deal with failures (e.g. power outages or disk failures).

Some issues to consider include:

How do you avoid repeating the search operation every time the user makes a directory listing of the virtual directory containing the search results?
How do you ensure that displayed results fresh, i.e., up to date? (It is OK if sometimes results are not perfectly fresh, but you should be able to reason about when this is the case, and why it is a good tradeoff.)
What happens if the search results contain two files with the same name?
Do you need additional kinds of user interaction to support the requirements?
What happens if a virtual search directory is used as a searched target directory?

The list of issues is incomplete, but hopefully gets you going.

Some Helpful APIs

If you wish, you can assume the operating system provides a way to listen for changes to files in a directory, such as inotify.

Additionally, your implementation may make use of a simple persistent database with the following API:

DBPointer openDatabase(String filename): opens a database at the specified location on disk; database is created if it doesn’t exist.
put(DBPointer p, String key, String value): stores the specified key/value in the specified database. If the key already exists, the previous value is overwritten.
String get(DBPointer p, String key): returns the value of the specified key in the database.
remove(DBPointer p, String key): deletes the specified key from the database

You may create as many databases as you wish, and you can assume that the database is stored persistently on disk (e.g., survives across restarts and reboots, and does not become corrupted.) In addition, you can assume that the put and get commands are relatively efficient (e.g., the running time of a single get or put is small and does not grow significantly with the size of the database, even for databases with millions of records.)

Design Proposal

The design proposal should be a concise summary (800 words) of your overall system design.

The core of the proposal should be the design description. This section must include at least one graphic, correctly formatted with a caption and brief description.

You do not have to present a detailed rationale or analysis in your proposal. However, if any of your design decisions are unusual (particularly creative, experimental, or risky) or if you deviate from the requirements, you should explain and justify those decisions in your proposal.

You will receive feedback on your proposal from your TA in time to adjust your final report. You will also receive a writing program grade for your proposal, as well as feedback from your writing instructor that will help you improve the writing of the final report. Your writing instructor will evaluate the proposal according to the CI guidelines.

Some writing considerations for the proposal (and report):

Does the proposal/report summarize the proposed system as a whole?
Does the proposal/report make an argument in favor of the system (as opposed to a different design)?
Is content organized in accordance with the guidelines?
Do hierarchy and structure match guidelines and clearly signal the role of information?
Do figures clarify or illustrate an essential concept presented in the text? Are figure properly formatted, labeled, and referenced?
Does the document meet basic standards of professional grammar, syntax, spelling, etc?

Here are a few tips:

Use ideas and terms from the textbook and papers when appropriate; this will save you space (you can refer the reader to the relevant section of the textbook) and will save the reader some effort.
Before you explain the solution to any given problem, say what the problem is.
Before presenting the details of any given design component, ensure that the purpose and requirements of that component are well described.
It's often valuable to illustrate an idea using an example, but an example is no substitute for a full explanation of the idea.
You may want to separate the explanation of a component's data structures from its algorithms to access or use those data structures.
Explain all figures, tables, and pseudo-code; explain what is being presented, and what conclusions the reader should draw.

Design Report

Your report should explain your design. It should discuss the major design decisions and tradeoffs you made, and justify your choices. It should discuss any limitations of which you are aware. You should assume that your report is being read by someone who has read this assignment, but has not thought carefully about this particular design problem. Give enough detail that your project can be turned over successfully to an implementation team. Your report should convince the reader that your design satisfies the requirements in the "Requirements" section above.

Use this organization for your report:

Title page: Give your report a title that reflects the subject and scope of your project. Include your name, email address, recitation instructor, section time(s), and the date on the title page.
No table of contents is needed.
Introduction: Summarize what your design is intended to achieve, outline the design, explain the major trade-offs and design decisions you have made, and justify those trade-offs and decisions.
Design: Explain your design. Identify your design's main components, state, and algorithms. You should sub-divide the design, with corresponding subsections in the text, so that the reader can focus on and understand one piece at a time. Explain why your design makes sense as well as explaining how it works. Use diagrams, pseudo-code, and worked examples as appropriate.
Analysis: Explain how you expect your design to behave in different use cases. What use cases might pose problems for throughput, latency, or even correctness? What do you expect to be the scalability limits of your design?
Conclusion: Briefly summarize your design and provide recommendations for further actions and a list of any problems that must be resolved before the design can be implemented.
Acknowledgments and references: Give credit to individuals whom you consulted in developing your design. Provide a list of references at the end using the IEEE citation-sequence system ("IEEE style") described in the Mayfield Handbook.
Word count. Please indicate the word count of your report at the end of the document. Captions of figures should be included in the total word count.
Footnotes. Please do not use footnotes in your report.

The writing suggestions for the proposal also apply to the report. You can find previous examples of DP1 reports under the Excellent Writing Examples link on the left-hand side of the course home page.

How we evaluate your work

Your recitation and writing instructors will assign your report a grade that reflects both the design itself and how well your report presents the design. The most important aspect of your design is that we can understand how it works and that you have clearly addressed the requirements and provided the elements listed in the "Requirements" and "Design Report" sections above. Complicated designs that we cannot understand will not be graded favorably.

Some overall content considerations:

Design supports use cases and describes how they work.
Clean yet sufficient design, minimal mechanism.
Reasonable implementation detail.
Convincing space and performance evaluation.

The grading rubric for the final report is as follows:

Overall design When implementation details are presented are they justified with reference to the overall purpose of the system? Is there any consideration of possible design alternatives and why the resulting ones were chosen?	30
The degree to which the design addresses the requirements and use cases Does the paper discuss all the stipulated use cases and demonstrate the design meets the requirements?	20
Analysis of space and time requirements Is the analysis quantitative and justified in terms of reasonable performance metrics for the underlying components (disk, network, processing, etc. as appropriate)	20
User experience Is the system behavior described in terms of what users would experience?	15
Quality of the figures that illustrate the design	5
Overall presentation	10

The items in the grading rubric are not independent: a design that we cannot understand will likely result in a low score for several items.

85 and above is an A grade. Between 60 and 85 is a B grade. You will have to hand in a design project to pass 6.033.

Collaboration

This project is an individual effort. You are welcome to discuss the problem and ideas for solutions with your friends, but if you include any of their ideas in your solution you should explicitly give them credit. You must be the sole author of your report.

Clarifications

(None yet.)