6.033 - Computer System Engineering (Spring 2001) | Handout 17 - April 12, 2001 |
New: keep an eye on the DP2 FAQ for updates and clarifications.
Your job is to design a peer-to-peer file sharing application. Your design should allow Internet users to cooperatively pool free disk space and network bandwidth for use in serving public-access data.
Many Internet users have useful and popular data, but do not have the server or network resources required to satisfy the demand for that data. For example, an open source developer might wish to publish a software distribution; the average demand for that distribution may be low, but the instantaneous demand after a new release may be too high for the developer's server or network connection to handle. Your design should allow such users to share disk space and bandwidth in a way that spreads the total load over all the participants. Your design should avoid the common requirement that each publisher of data own a network link with capacity equal to the peak demand for that data.
You should focus your design efforts on the availability and authenticity of the stored data; on avoiding centralized components; and on good performance with large numbers of clients and servers.
Your system should support the following model of use. Internet users, perhaps thousands or millions of them, will volunteer their machines for use as servers in your system. Publishers of data should be able to insert named documents into your system, causing them to be stored (in whatever manner you determine) on the servers. A publisher should be able to update a document without changing its name. A consumer of data should be able to retrieve a document from your system using its name, and verify the authenticity of the resulting data. You cannot assume that all the participating servers are trustworthy.
Your system should implement an API like the following for publishers and consumers:
You should also have some procedure for deleting or garbage collecting useless documents, or a good argument for why such a feature is not needed.
One of your design decisions is to figure out a reasonable naming scheme. You may find it useful to assume that names typically come from links embedded in documents, much like URLs; this means that you don't have to make it particularly convenient for users to remember and type document names. Consider using either content hashes (a cryptographic hash of the document's content) or public keys as document names. In the latter case, you could demand that publishers include a digital signature in each document; consumers could check the signature against the name/key, and your servers could reject inserts (including updates) for which the name/key did not match the signature. One place to look for naming ideas is the keys and searching section of the FreeNet paper.
The challenge in your project is to support the above API with no central control; your system should be truly cooperative, or in today's buzzwords peer-to-peer. Your system will be composed of the Internet hosts of a large number of volunteer users. None of these hosts will be particularly reliable or powerful. You cannot assume the existence of some special set of very reliable or high-capacity servers. You can't assume the existence of any formal organization that runs or maintains your system.
More precisely, your job is to design a cooperative system that:
A user of your systems retrieves documents through a Web browser, but you are allowed to design a program that runs on the user's computer that intercepts the browser's HTTP requests and feeds them into your system.
To simplify the problem, we provide you with a distributed hash system, called Chord, that maps keys to nodes. Chord keys are arbitrary 160-bit values; you'll probably want to use SHA-1 to generate them in some way based on your document names. You might use Chord to map a document name to the server that stores it.
Chord is described on a separate page. You don't have to understand its inner workings in order to use it, so you can safely skip the description. However, you may find that understanding Chord will give you better opportunities for optimization.
Chord assigns every participating node a 160-bit ID. Chord maps key k to the node with the smallest ID that follows k; that node is called k's successor. A Chord node does not keep a complete list of all other Chord nodes in order to perform this mapping. Instead, each node knows about a few other nodes, and sends "find successor" messages to those other nodes to help it map keys to nodes.
Chord presents you with this API:
You can change the details of this interface if you need to, as long as the change is consistent with the Chord implementation. If you change the interface, your report should include an outline of how to implement your changes.
Napster allows people to share music files across the Internet. It involves a central server, containing a database of everybody's music files. You ask the central server to perform a lookup for you, which returns the IP addresses of a few Napster users who have the music you want. Then you directly contact one of those users' computers and ask for the file. The main drawback of Napster is the central server, which is a potential performance and reliabilty problem. Napster retrieves documents using keyword search, which you don't need to implement; you should identify documents using unique IDs or names.
GNUtella provides distributed search, as well as direct peer to peer transfer of the resulting files. Each GNUtella node knows about a few other neighbor nodes; this results in a mesh of nodes. When a node wants to find a file, it floods a query for the file over the whole mesh, starting with the nodes it knows about. When a node receives a query, it tries to match that query against locally-stored files. If there's a match, it returns the file to the original querier. Otherwise the node forwards the queries to its neighbors. This approach eliminates the central component of Napster. However, every node is involved in every lookup, so GNUtella doesn't scale well to large numbers of nodes.
FreeNet provides an interface similar to the one we are asking for in this assignment. Each document has an identifier, clients ask for documents using identifiers, and FreeNet figures out which server node is likely to store a document based on the identifier. You may find FreeNet a useful source of ideas about naming and authenticity checking. FreeNet provides anonymity for both consumers and publishers of data, and makes malicious deletion of data difficult. However, FreeNet's focus on anonymity causes it to sometimes fail to find documents that exist. You are not required to implement anonymity.
The Eternity Service is a distributed storage service that guarantees that documents will never disappear. Its guarantees are quite strong; it can even defend against legal attacks by governments. You are not required to provide eternal storage.
Your paper must not exceed ten 11-point, single-spaced pages in length. Please use 1-inch margins.
Your report should describe the operation of your system. It should describe how your insert() and get() functions work. It should describe how to publish a document, how users name documents, and how users decide whether document content is authentic. You should characterize the extent to which server and network failures affect the availability of documents.
Make sure your report touches on the following points. You should not try to design a system that is perfect in all of these areas, but you should at least be able to describe its properties.
Your paper should be as long as is necessary to explain the problem, your solution, the alternatives you considered, and the reasons for your choices. It should be no longer than that. Again, please make sure your paper is no longer than ten pages.
A good paper begins with an abstract. The abstract is a very short summary of the entire paper. It is not an outline of the organization of the paper! It states the problem to be addressed (in one sentence). It states the essential points of your solution, without any detailed justification. And it announces any conclusions you have drawn. Good abstracts can fit in 100-150 words, at most.
The body of your paper should expand the points made in the abstract. Here you should:
Write for an audience consisting of colleagues who took 6.033 five years ago. That is, they understand the underlying system and network concepts and have a fair amount of experience applying them in various situations, but they have not thought carefully about the particular problem you are dealing with. Give enough detail that you rdesign could be turned over to some programmers for implementation with some confidence that you won't surprised by the result.
When evaluating your report, your instructor will look at both content and writing.
Some content considerations:
Go to 6.033 Home Page | Questions or Comments: 6.033-tas@mit.edu |