6.033 Handout 16: Assignment 4 (03/17/03

M.I.T. DEPARTMENT OF EECS

6.033 - Computer System Engineering

Handout 16 - March 7, 2002

Assignment 6: March 17 through April 4

For Lecture: Monday, March 18

Today's lecture is the last lecture on networking. Read Chapter 4, Section F (Shared resources and congestion control) of the 6.033 class notes.

For Recitation: Tuesday, March 19

Read the paper "Design and Implementation of the Sun Network Filesystem" (Reading #11). The paper is in your reading package and is also available online.

If you haven't started reading this paper yet, you can focus only on the first five pages, as indicated by the schedule.

There is no one-pager due today. Instead, keep working on Design Project #1.

For Lecture: Wednesday, March 20

Today's lecture will begin our discussion of Naming. Read Chapter 5, Section A (Reference in computer systems) and Section B (Considerations in the design of naming systems) of the 6.033 class notes.

For Recitation: Thursday, March 21

Read the RFC on The IP Network Address Translator (NAT). If you are curious about RFCs, you can read the RFC overview page. Also read "Things that NAT's break". The origins of this document are slightly unclear, but we believe that it was compiled by Keith Moore (U. Tennessee), who may also be its author. These papers are not available in your reading package.

For background on NAT, you may wish to read:

There is no hands-on assignment this week. Instead, work on Design Project #1, which is due today in recitation.

For March 24 through 28

Have a great Spring Break!

For Lecture: Monday, March 31

We continue our discussion of Naming in today's lecture. Read Chapter 5, Appendix A (Case study of UNIX persistent object naming), Appendix C (Case study of uniform resource locators, URLs), and Appendix D (Pathologies in the use of names) of the 6.033 class notes.

For Recitation: Tuesday, April 1

For recitation and your one-pager, read the paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (Reading #12) by Sergey Brin and Lawrence Page. The paper is in your reading package and is also available online. Address the following question in your one-pager:

Ranking is one of the most important aspects of a search engine. Traditional ranking techniques that work very well for searching research papers and encyclopedias turn out to work very badly for the web. Unlike research papers or encyclopedias, web pages are created by people who will do everything they can to make their own page one of the highest-ranked results (that is, to trick the search engine). Traditional Information Retrieval (IR) uses TF x IDF (term frequency times inverse document frequency) to rank documents. Traditional search engines for research papers use the citation count of each paper as its rank. How do you spam these two ranking techniques (i.e., how do you trick them into ranking a page highly)? Compare these two techniques with PageRank and explain why PageRank is more resilient to user tricks. (Optional: Can you come up with a scenario that tricks Google into ranking a page highly?)
For TF x IDF ranking, one calculates each document's rank with regard to a query as follows.
Terminology:

(TF) term frequency = the number of occurrences of a query word in a document.
(This captures the intuition that the more times a query word occurs in a document, the more relevant that document should be.)
(IDF) inverse document frequency = 1 / (the total number of occurrences of a word in the ENTIRE DOCUMENT CORPUS).
(IDF can be calculated in advance for each word in the corpus. It is independent of queries and of any individual document. Intuitively, IDF measures the importance of a particular word. Take the example query "professor frans kaashoek mit": the words "professor" and "mit" appear many more times in the entire document corpus than "frans" or "kaashoek", so their IDF is much lower than that of "frans" or "kaashoek". Hence a document that matches the words "frans kaashoek" should have a much higher rank than one that matches only the words "professor mit".)
Each document's rank/score is calculated by multiplying the TF of the query word in that document by that word's IDF. If a query contains multiple words, then the document's rank is the sum of the individual TF x IDF scores of each query word.
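The scoring rule above can be sketched in a few lines of Python. This is only an illustration of the simplified definitions given in this handout (IDF = 1 / total occurrences in the corpus), not the formula used by any particular search engine. The three-document corpus, the whitespace tokenizer, and the function names are all made up for the example.

```python
from collections import Counter

def tokenize(text):
    # Assumption: words are separated by whitespace and case is ignored.
    return text.lower().split()

def build_idf(corpus):
    # IDF as defined in this handout: 1 / (total occurrences in the corpus).
    counts = Counter()
    for doc in corpus:
        counts.update(tokenize(doc))
    return {word: 1.0 / n for word, n in counts.items()}

def score(query, doc, idf):
    # Sum TF x IDF over the query words; words unseen in the corpus add 0.
    tf = Counter(tokenize(doc))
    return sum(tf[word] * idf.get(word, 0.0) for word in tokenize(query))

# Hypothetical toy corpus, invented for illustration.
corpus = [
    "professor frans kaashoek mit",
    "professor mit professor mit",
    "mit professor news",
]
idf = build_idf(corpus)
scores = [score("professor frans kaashoek mit", doc, idf) for doc in corpus]
print(scores)  # [2.5, 1.0, 0.5]
```

In this toy corpus "professor" and "mit" each occur four times, so each has an IDF of 0.25, while "frans" and "kaashoek" each occur once and have an IDF of 1. The first document therefore scores highest, which matches the intuition given above.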
Citation count:
Each research paper cites other papers in its reference section. A research paper's rank is set to the number of citations it receives from other research papers. The intuition is that the more a paper is cited by other papers, the more important it is. On the web, citation count amounts to counting the back-links of each web page and using this count as the page's rank.
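Citation-count ranking can be sketched the same way. The reference lists below are invented for illustration, and the choices to count each citing paper at most once and to ignore self-citations are assumptions of this sketch (the description above only says citations come "from other research papers").

```python
def citation_counts(references):
    # references maps each paper to the list of papers it cites.
    counts = {paper: 0 for paper in references}
    for paper, cited in references.items():
        for target in set(cited):      # count each citing paper once
            if target != paper:        # assumption: ignore self-citations
                counts[target] = counts.get(target, 0) + 1
    return counts

# Hypothetical reference lists for four papers.
references = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": [],
    "D": ["C", "B"],
}
counts = citation_counts(references)
# Rank by citation count, highest first; ties broken by paper name.
ranking = sorted(counts, key=lambda paper: (-counts[paper], paper))
print(counts)   # {'A': 0, 'B': 2, 'C': 3, 'D': 0}
print(ranking)  # ['C', 'B', 'A', 'D']
```

For web pages, the same function applies if `references` maps each page to the pages it links to.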

For more search engine fun, also read the article "The Effects of September 11 on the Leading Search Engine" by Richard Wiggins. This article is not available in your reading package.

A more careful discussion of the PageRank algorithm is available here, and a more "authentic" description of PageRank is at Google's site.

For Lecture: Wednesday, April 2

This lecture is the first lecture covering security in computer systems. Read Chapter 6, Section A (Introduction to secure systems) and Section B (Cryptography as a building block for secure systems).

For Recitation: Thursday, April 3

Read Steven Lerman, James Bruce and Jerome Saltzer's article on "Teaching Students About Responsible Use of Computers" (Reading #13). The paper is in your reading package. Also read Chapter 6, Appendix A (War Stories: Protection System Failures) of the 6.033 class notes.

Also, do Hands-on #5 on DNS for today.

Go to 6.033 Home Page
Questions or Comments: 6.033-tas@mit.edu