M.I.T. DEPARTMENT OF EECS
6.033 - Computer System Engineering Recitation 14 - Tuesday, March 30, 2004

Read The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Lawrence Page (reading #8) and The Effects of September 11 on the Leading Search Engine by Richard W. Wiggins (available only on-line).

The first paper tries to push a number of themes simultaneously. First, the authors explain how their search engine gives better answers than others. Second, they describe the system architecture that actually lets them crawl, index, and search the Web quickly enough. They also take the time to voice a bit of idealism: that work on search engines ought to be shared openly.

For the purposes of 6.033, it is the second issue, the system architecture, that is most obviously relevant. Their discussion in Section 2 of the various techniques they apply to improve the quality of results is fascinating, but it can be skimmed rather lightly. The real meat of the paper for you is Section 4. The driving feature of their entire design is the sheer size of the Web. (You may find some of their numbers a bit small by today's standards, but in the interim machines and the Web have both scaled up, so their assertions about what can fit where are still pretty accurate.)

Some background on search: while the details differ, essentially all large Web search engines work similarly at the systems level. In order to resolve a query made up of a number of search terms, we want to inspect all the documents and find those that have the most, or the best, occurrences of those search terms. One could imagine doing this by brute force (e.g., using the UNIX command grep), but it is clear that a scheme that actually touches every document on the Web in order to resolve every query is going to have some scaling problems.
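To make the scaling problem concrete, here is a minimal sketch of brute-force search in Python. The tiny in-memory corpus and its document ids are invented for illustration; a real engine would face billions of documents, and this loop touches every one of them on every query.

    def brute_force_search(corpus, terms):
        """Return ids of documents that contain every query term (grep-style)."""
        hits = []
        for doc_id, text in corpus.items():
            words = set(text.lower().split())
            if all(term.lower() in words for term in terms):
                hits.append(doc_id)
        return hits

    # Toy corpus: document ids and contents are invented for illustration.
    corpus = {1: "anatomy of a search engine",
              2: "web search scales poorly by brute force"}
    print(brute_force_search(corpus, ["search", "engine"]))   # -> [1]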

Instead, any efficient search engine makes use of an inverted index. This is a data structure that lists, for any word, all of the documents that contain that word. When faced with a search term that appears in only a small fraction of all the documents on the Web, the search engine can limit its inspection to just those documents that actually contain the word --- a massive improvement in efficiency. The index is called "inverted" because it maps from each word to the documents that contain it; conversely, the (much less used) forward index maps from each document to the words in that document.
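Here is a correspondingly small sketch of an inverted index over the same kind of toy corpus (again with invented contents). A real engine stores its postings compressed on disk rather than in a Python dictionary, but the word-to-documents mapping is the essential idea.

    from collections import defaultdict

    # Toy corpus: document ids and contents are invented for illustration.
    corpus = {1: "anatomy of a search engine",
              2: "web search scales poorly by brute force"}

    def build_inverted_index(corpus):
        index = defaultdict(set)   # term -> set of ids of documents containing it
        for doc_id, text in corpus.items():
            for word in text.lower().split():
                index[word].add(doc_id)
        return index

    index = build_inverted_index(corpus)
    # A lookup now touches only the documents that actually contain the term:
    print(index["search"])         # -> {1, 2}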

A Web search engine's work is therefore built around three steps. Its first job is to crawl the Web, gathering copies of all of the documents that the engine will make searchable. The second job is to build the inverted index, which is basically a big sorting job: as the crawler visits documents, it can be thought of as emitting a sequence of pairs (called postings) of the form (document-id, term-id) which, since documents are visited in sequence, arrives sorted by the first, document-id coordinate. For the inverted index, we want the same pairs sorted by the second, term-id coordinate. The third job is resolving queries: once the inverted index is built, the engine handles a query involving multiple terms by looking up each term in the index, finding the documents that contain it, and combining the per-term document lists into a single list. Finally, using ranking heuristics (some of which are mentioned in the paper, others of which Google considers proprietary), it orders that list so that the things the client was probably looking for appear at the top.
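The following sketch, again in Python with invented postings, shows the indexing step as a re-sort of (document-id, term-id) pairs and query resolution as an intersection of per-term posting sets. The ranking heuristics that Google layers on top are deliberately omitted.

    from collections import defaultdict

    # Crawler output: postings arrive sorted by document id (invented data).
    forward_postings = [(1, "anatomy"), (1, "search"), (2, "search"), (2, "web")]

    # Building the inverted index is essentially re-sorting the postings by term.
    inverted = defaultdict(set)
    for doc_id, term in sorted(forward_postings, key=lambda p: p[1]):
        inverted[term].add(doc_id)

    def resolve(index, terms):
        """Intersect the posting sets of every query term."""
        sets = [index.get(t, set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    print(resolve(inverted, ["search", "web"]))   # -> {2}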

The performance of the search engine is generally evaluated in terms of recall (what fraction of the desired documents are actually found) and precision (what fraction of the documents shown to the user are actually desired).
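A tiny worked example, with invented sets of relevant and returned document ids, may make the two measures concrete.

    relevant = {1, 2, 3, 4}   # documents the user actually wanted (invented)
    returned = {2, 3, 5}      # documents the engine actually showed (invented)

    found = relevant & returned
    recall = len(found) / len(relevant)      # 2/4 = 0.50
    precision = len(found) / len(returned)   # 2/3 ~= 0.67
    print(recall, precision)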

Here are some specific questions to consider as you read about how Google tackled these problems:

Now let's turn to the 9/11 paper, a rather different piece that reflects a view from the outside. It is very light on the technical side and makes for an easy read from a technical perspective (though the memories may not be pleasant); a straight pass from beginning to end should be enough.

A number of questions arise regarding naming and finding things on the Web:

