M.I.T. DEPARTMENT OF EECS
6.033 - Computer System Engineering Recitation 14 - Tuesday, March 30, 2004

Read The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Lawrence Page (reading #8) and The Effects of September 11 on the Leading Search Engine by Richard W. Wiggins (available only on-line).

The first paper tries to push a number of themes simultaneously. First, the authors explain how their search engine gives better answers than others. Second, they describe the system architecture that actually lets them crawl, index, and search the Web quickly enough. They also take the time to voice a bit of idealism: that work on search engines ought to be shared openly.

For the purposes of 6.033, it is the second issue, the system architecture, that is most obviously relevant. Their discussion in Section 2 of the various techniques they apply to improve the quality of results is fascinating, but it can be skimmed rather lightly. The real meat of the paper for you is Section 4. The driving feature of their entire design is the sheer size of the Web. (You may find some of their numbers a bit small by today's standards, but in the interim machines and the Web have both scaled up, so their assertions about what can fit where are still pretty accurate.)

Some background on search: while the details differ, essentially all large Web search engines work similarly at the systems level. In order to resolve a query made up of a number of search terms, we want to inspect all the documents and find those that have the most, or the best, occurrences of those search terms. One could imagine doing this by brute force (e.g., using the UNIX command grep), but it is clear that a scheme that actually touches every document on the Web in order to resolve every query is going to have some scaling problems.
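To make the scaling problem concrete, here is a minimal sketch of brute-force search in Python. The tiny in-memory corpus and its document ids are invented for illustration; a real engine would face billions of documents, and this loop touches every one of them on every query.

    def brute_force_search(corpus, terms):
        """Return ids of documents that contain every query term (grep-style)."""
        hits = []
        for doc_id, text in corpus.items():
            words = set(text.lower().split())
            if all(term.lower() in words for term in terms):
                hits.append(doc_id)
        return hits

    # Toy corpus: document ids and contents are invented for illustration.
    corpus = {1: "anatomy of a search engine",
              2: "web search scales poorly by brute force"}
    print(brute_force_search(corpus, ["search", "engine"]))   # -> [1]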

Instead, any efficient search engine makes use of an inverted index. This is a data structure that lists, for any word, all of the documents that contain that word. When faced with a search term that appears in only a small fraction of all the documents on the Web, the search engine can limit its inspection to just those documents that actually contain the word --- a massive improvement in efficiency. The index is called "inverted" because it maps from each word to the documents that contain it; conversely, the (much less used) forward index maps from each document to the words in that document.
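Here is a correspondingly small sketch of an inverted index over the same kind of toy corpus (again with invented contents). A real engine stores its postings compressed on disk rather than in a Python dictionary, but the word-to-documents mapping is the essential idea.

    from collections import defaultdict

    # Toy corpus: document ids and contents are invented for illustration.
    corpus = {1: "anatomy of a search engine",
              2: "web search scales poorly by brute force"}

    def build_inverted_index(corpus):
        index = defaultdict(set)   # term -> set of ids of documents containing it
        for doc_id, text in corpus.items():
            for word in text.lower().split():
                index[word].add(doc_id)
        return index

    index = build_inverted_index(corpus)
    # A lookup now touches only the documents that actually contain the term:
    print(index["search"])         # -> {1, 2}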

A Web search engine's work is therefore built around three steps. Its first job is to crawl the Web, gathering copies of all of the documents that the engine will make searchable. The second job is to build the inverted index, which is basically a big sorting job: as the crawler visits documents, it can be thought of as emitting a sequence of pairs (called postings) of the form (document-id, term-id) which, since documents are visited in sequence, arrives sorted by the first, document-id coordinate. For the inverted index, we want the same pairs sorted by the second, term-id coordinate. The third job is resolving queries: once the inverted index is built, the engine handles a query involving multiple terms by looking up each term in the index, finding the documents that contain it, and combining the per-term document lists into a single list. Finally, using ranking heuristics (some of which are mentioned in the paper, others of which Google considers proprietary), it orders that list so that the things the client was probably looking for appear at the top.
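The following sketch, again in Python with invented postings, shows the indexing step as a re-sort of (document-id, term-id) pairs and query resolution as an intersection of per-term posting sets. The ranking heuristics that Google layers on top are deliberately omitted.

    from collections import defaultdict

    # Crawler output: postings arrive sorted by document id (invented data).
    forward_postings = [(1, "anatomy"), (1, "search"), (2, "search"), (2, "web")]

    # Building the inverted index is essentially re-sorting the postings by term.
    inverted = defaultdict(set)
    for doc_id, term in sorted(forward_postings, key=lambda p: p[1]):
        inverted[term].add(doc_id)

    def resolve(index, terms):
        """Intersect the posting sets of every query term."""
        sets = [index.get(t, set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    print(resolve(inverted, ["search", "web"]))   # -> {2}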

The performance of the search engine is generally evaluated in terms of recall (what fraction of the desired documents are actually found) and precision (what fraction of the documents shown to the user are actually desired).
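A tiny worked example, with invented sets of relevant and returned document ids, may make the two measures concrete.

    relevant = {1, 2, 3, 4}   # documents the user actually wanted (invented)
    returned = {2, 3, 5}      # documents the engine actually showed (invented)

    found = relevant & returned
    recall = len(found) / len(relevant)      # 2/4 = 0.50
    precision = len(found) / len(returned)   # 2/3 ~= 0.67
    print(recall, precision)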

Here are some specific questions to consider as you read about how Google tackled these problems:

Now let's turn to the 9/11 paper, a rather different piece that reflects a view from the outside. It is very light on the technical side and makes for an easy read from a technical perspective (though the memories may not be pleasant); a straight pass from beginning to end should be enough.

A number of questions arise regarding naming and finding things on the Web:

