6.033 - Computer System Engineering (Spring 2001) Handout 17 - April 12, 2001
Last modified: April 11 22:15

Design Project #2: Peer-to-peer File Sharing

This is a team project, so form a team of 3 people, where all of you have the same recitation instructor. Please send email to your recitation instructor as soon as possible with the names and email addresses of your team members. Please let us know by noon on Thursday, April 19th at the latest. And get to work! Your team's report is due on Thursday, May 10th, in recitation. Questions and answers about DP2 will also be posted here; check periodically for updates.

New: keep an eye on the DP2 FAQ for updates and clarifications.

Introduction

Your job is to design a peer-to-peer file sharing application. Your design should allow Internet users to cooperatively pool free disk space and network bandwidth for use in serving public-access data.

Many Internet users have useful and popular data, but do not have the server or network resources required to satisfy the demand for that data. For example, an open source developer might wish to publish a software distribution; the average demand for that distribution may be low, but the instantaneous demand after a new release may be too high for the developer's server or network connection to handle. Your design should allow such users to share disk space and bandwidth in a way that spreads the total load over all the participants. Your design should avoid the common requirement that each publisher of data own a network link with capacity equal to the peak demand for that data.

You should focus your design efforts on the availability and authenticity of the stored data; on avoiding centralized components; and on good performance with large numbers of clients and servers.

Your task

Your system should support the following model of use. Internet users, perhaps thousands or millions of them, will volunteer their machines for use as servers in your system. Publishers of data should be able to insert named documents into your system, causing them to be stored (in whatever manner you determine) on the servers. A publisher should be able to update a document without changing its name. A consumer of data should be able to retrieve a document from your system using its name, and verify the authenticity of the resulting data. You cannot assume that all the participating servers are trustworthy.

Your system should implement an API like the following for publishers and consumers:

You should also have some procedure for deleting or garbage collecting useless documents, or a good argument for why such a feature is not needed.

One of your design decisions is to figure out a reasonable naming scheme. You may find it useful to assume that names typically come from links embedded in documents, much like URLs; this means that you don't have to make it particularly convenient for users to remember and type document names. Consider using either content hashes (a cryptographic hash of the document's content) or public keys as document names. In the latter case, you could demand that publishers include a digital signature in each document; consumers could check the signature against the name/key, and your servers could reject inserts (including updates) for which the name/key did not match the signature. One place to look for naming ideas is the keys and searching section of the FreeNet paper.

The challenge in your project is to support the above API with no central control; your system should be truly cooperative, or in today's buzzwords peer-to-peer. Your system will be composed of the Internet hosts of a large number of volunteer users. None of these hosts will be particularly reliable or powerful. You cannot assume the existence of some special set of very reliable or high-capacity servers. You can't assume the existence of any formal organization that runs or maintains your system.

More precisely, your job is to design a cooperative system that:

A user of your systems retrieves documents through a Web browser, but you are allowed to design a program that runs on the user's computer that intercepts the browser's HTTP requests and feeds them into your system.

Chord Interface

To simplify the problem, we provide you with a distributed hash system, called Chord, that maps keys to nodes. Chord keys are arbitrary 160-bit values; you'll probably want to use SHA-1 to generate them in some way based on your document names. You might use Chord to map a document name to the server that stores it.

Chord is described on a separate page. You don't have to understand its inner workings in order to use it, so you can safely skip the description. However, you may find that understanding Chord will give you better opportunities for optimization.

Chord assigns every participating node a 160-bit ID. Chord maps key k to the node with the smallest ID that follows k; that node is called k's successor. A Chord node does not keep a complete list of all other Chord nodes in order to perform this mapping. Instead, each node knows about a few other nodes, and sends "find successor" messages to those other nodes to help it map keys to nodes.

Chord presents you with this API:

You can change the details of this interface if you need to, as long as the change is consistent with the Chord implementation. If you change the interface, your report should include an outline of how to implement your changes.

Other peer-to-peer systems

Peer-to-peer systems are a hot topic and quite a few have been proposed or implemented. They differ in the services they offer and the guarantees they provide.

Napster allows people to share music files across the Internet. It involves a central server, containing a database of everybody's music files. You ask the central server to perform a lookup for you, which returns the IP addresses of a few Napster users who have the music you want. Then you directly contact one of those users' computers and ask for the file. The main drawback of Napster is the central server, which is a potential performance and reliabilty problem. Napster retrieves documents using keyword search, which you don't need to implement; you should identify documents using unique IDs or names.

GNUtella provides distributed search, as well as direct peer to peer transfer of the resulting files. Each GNUtella node knows about a few other neighbor nodes; this results in a mesh of nodes. When a node wants to find a file, it floods a query for the file over the whole mesh, starting with the nodes it knows about. When a node receives a query, it tries to match that query against locally-stored files. If there's a match, it returns the file to the original querier. Otherwise the node forwards the queries to its neighbors. This approach eliminates the central component of Napster. However, every node is involved in every lookup, so GNUtella doesn't scale well to large numbers of nodes.

FreeNet provides an interface similar to the one we are asking for in this assignment. Each document has an identifier, clients ask for documents using identifiers, and FreeNet figures out which server node is likely to store a document based on the identifier. You may find FreeNet a useful source of ideas about naming and authenticity checking. FreeNet provides anonymity for both consumers and publishers of data, and makes malicious deletion of data difficult. However, FreeNet's focus on anonymity causes it to sometimes fail to find documents that exist. You are not required to implement anonymity.

The Eternity Service is a distributed storage service that guarantees that documents will never disappear. Its guarantees are quite strong; it can even defend against legal attacks by governments. You are not required to provide eternal storage.

Your report

Your report should include an explanation of how your system works, and a justification of why your design is good. You should include pseudo-code and diagrams as aids to (but not replacements for) your explanation.

Your paper must not exceed ten 11-point, single-spaced pages in length. Please use 1-inch margins.

Your report should describe the operation of your system. It should describe how your insert() and get() functions work. It should describe how to publish a document, how users name documents, and how users decide whether document content is authentic. You should characterize the extent to which server and network failures affect the availability of documents.

Make sure your report touches on the following points. You should not try to design a system that is perfect in all of these areas, but you should at least be able to describe its properties.

Writing Suggestions

This section provides some suggestions and guidelines on writing style and some of the things we look for in your report.

1. Suggestions on Writing Style

Your paper should be as long as is necessary to explain the problem, your solution, the alternatives you considered, and the reasons for your choices. It should be no longer than that. Again, please make sure your paper is no longer than ten pages.

A good paper begins with an abstract. The abstract is a very short summary of the entire paper. It is not an outline of the organization of the paper! It states the problem to be addressed (in one sentence). It states the essential points of your solution, without any detailed justification. And it announces any conclusions you have drawn. Good abstracts can fit in 100-150 words, at most.

The body of your paper should expand the points made in the abstract. Here you should:

  1. Explain the problem and the externally imposed constraints. You should explain to your intended audience the background of the problem in terms that the audience can understand.
  2. Explain and elaborate your solution. Be sure to explain the approach or architecture conceptually before delving into details, components, and mechanics. (Remember, you aren't writing a mystery novel!) Present any analysis clearly and concisely.
  3. Show how your solution satisfies the constraints and solves the problem (or how it fails to do so); explain how the properties of your solution that result from choices you have made in your design are reasonable or desirable in the context of the problem statement.
  4. Briefly describe the alternative approaches that you considered and rejected, and why you rejected them. Your paper will be more convincing if you say not just why your idea is good, but why it is better than the alternatives. (For example, if another approach would meet all of the objectives perfectly, but the cost would be 100 times higher, then you should mention that as a reason for choosing your less general but cheaper approach.)
  5. Document your sources, giving a list of all references (including personal communications). The style of your citations (references) and bibliography should be similar to the styles in the technical papers you're reading in this class. In particular, a bibliography at the end and no citations in the text of your paper is insufficient; we'd like to see what specific pieces of information you learned from where as we read your paper.

Write for an audience consisting of colleagues who took 6.033 five years ago. That is, they understand the underlying system and network concepts and have a fair amount of experience applying them in various situations, but they have not thought carefully about the particular problem you are dealing with. Give enough detail that you rdesign could be turned over to some programmers for implementation with some confidence that you won't surprised by the result.

2. How do we evaluate your report?

When evaluating your report, your instructor will look at both content and writing.

Some content considerations:

Some writing considerations: You can find other helpful suggestions on writing this kind of report in the M.I.T. Writing Program's on-line guide to writing Design and Feasibility Reports. You may also want to look at the Mayfield Handbook's explanation of IEEE documentation style. A very good book on writing style is: "The Elements of Style," by William Strunk Jr. and E. B. White, Third Ed., MacMillan Publishing Co., New York, NY, 1979.

VII. Schedule

Your report is due at the beginning of recitation Thursday, May 10th, 2001.

Go to 6.033 Home Page Questions or Comments: 6.033-tas@mit.edu