Preparation for Recitation 14
Read The Anatomy
of a Large-Scale Hypertextual Web Search Engine by Sergey Brin
and Lawrence Page and The Effects of September 11 on the
Leading Search Engine by Richard W. Wiggins. Both of those
readings are available only online, and are not in your
packet.
Sergey Brin and Lawrence Page. The anatomy of a large-scale
hypertextual web search engine. Proc. of the 7th WWW
Conference, Brisbane, Australia (April 1998). Published in:
Computer Networks (formerly Computer Networks and ISDN Systems),
Vol. 30, 1998, pp. 107-117.
The first paper tries to push a number of themes simultaneously.
First, the authors try to explain how their search engine gives
better answers than others. Second, they talk about the system
architecture that actually lets them crawl, index, and search
the web quickly enough. Third, they take the time to present
their idealistic view that work on search engines ought to be shared.
For the purposes of 6.033, it is the second issue, the system
architecture, that is most obviously relevant. Their discussion
in Section 2 of the various techniques they apply to improve the
quality of results is fascinating, but it can be skimmed rather
lightly. The real meat of the paper for you is Section 4. The
driving feature of their entire design is the sheer size of the
Web. (You may find some of their numbers a bit small by today's
standards, but in the interim machines and the Web have both
scaled up such that their assertions about what can fit where are
still pretty accurate.)
Some background on search: while the details differ, essentially
all large Web search engines work similarly at the systems level.
In order to resolve a query made up of a number of search terms,
we want to inspect all the documents and find those that have the
most, or the best, occurrences of those search terms. One could
imagine doing this by brute force (e.g., using the UNIX command
grep), but it is clear that a scheme that actually touches every
document on the Web in order to resolve every query is going to
have some scaling problems.
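To make the scaling problem concrete, here is a minimal sketch of
the brute-force approach in Python; the tiny in-memory corpus is of
course hypothetical. The point is that every query touches every
document, so the cost grows with the size of the whole corpus rather
than with the number of matching documents.

    # Brute-force query resolution: scan every document for every query.
    # This corpus is a hypothetical toy; a real engine has billions of
    # documents, which is exactly why this approach does not scale.
    corpus = {
        "doc1": "mit systems engineering course notes",
        "doc2": "anatomy of a web search engine",
        "doc3": "naming and indirection on the web",
    }

    def brute_force_search(query_terms):
        """Return the ids of documents containing every query term."""
        hits = []
        for doc_id, text in corpus.items():          # touches EVERY document
            words = set(text.split())
            if all(term in words for term in query_terms):
                hits.append(doc_id)
        return hits

    print(brute_force_search(["web", "search"]))     # ['doc2']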
Instead, any efficient search engine makes use of an inverted
index. This is a data structure that lists, for any
word, all of the documents that contain that word. When
faced with a search term that appears only in a small fraction of
all the documents on the web, the search engine can limit its
inspection to only those documents that actually contain that word
--- a massive improvement in efficiency. The index is called
"inverted" because it maps from each word to documents
that have the word; conversely the (much less used) forward
index maps from each document to the words in that
document.
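Here is a minimal sketch, using the same kind of hypothetical toy
corpus, of the two shapes of index: the forward index maps each
document to its words, and inverting it yields the word-to-documents
map that lets the engine touch only the documents that actually
contain a query term.

    from collections import defaultdict

    corpus = {
        "doc1": "mit systems engineering course notes",
        "doc2": "anatomy of a web search engine",
        "doc3": "naming and indirection on the web",
    }

    # Forward index: document -> the set of words it contains.
    forward_index = {doc_id: set(text.split()) for doc_id, text in corpus.items()}

    # Inverted index: word -> the set of documents containing that word.
    inverted_index = defaultdict(set)
    for doc_id, words in forward_index.items():
        for word in words:
            inverted_index[word].add(doc_id)

    # A lookup now touches only the documents that contain the term.
    print(sorted(inverted_index["web"]))     # ['doc2', 'doc3']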
A Web search engine's work is therefore built around three steps.
Its first job is to crawl the web, gathering copies of
all of the documents that will be searchable by the engine. The
second job is to build the inverted index. This is basically a
big sorting job. As a crawler visits a document, it can be
thought of as outputting a sequence of pairs (called
postings) of the form (document-id,term-id) which, since we
visit documents in sequence, will be sorted according to the
first, document-id coordinate. For the inverted
index, we want the same sequence of pairs, sorted by the second,
term-id
coordinate. The third step is query resolution: once the inverted
index is built, the search engine resolves a query involving
multiple terms by looking up each term in the inverted index,
finding the documents that contain it, and somehow combining the
per-term document lists into a single list. Finally, using
heuristics (some of which are mentioned in the paper and others of
which Google considers proprietary), it orders that list so that
the things the client was probably looking for appear at the top.
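The build-and-query pipeline just described can be sketched as
follows. This is only a toy illustration of the postings idea, not
Google's actual barrel format; the term strings stand in for the
term ids the paper uses, and ranking of the combined list is omitted.

    from itertools import groupby
    from operator import itemgetter

    # Postings as a crawler might emit them: (document-id, term) pairs,
    # naturally ordered by document because documents are visited in order.
    postings = [
        (1, "web"), (1, "search"),
        (2, "search"), (2, "engine"),
        (3, "web"), (3, "engine"),
    ]

    # Building the inverted index is essentially one big sort: re-sort the
    # same pairs by term, then group the document ids under each term.
    by_term = sorted((term, doc) for doc, term in postings)
    inverted = {term: [doc for _, doc in group]
                for term, group in groupby(by_term, key=itemgetter(0))}

    def resolve(query_terms):
        """Combine per-term document lists by intersecting them."""
        doc_sets = [set(inverted.get(term, ())) for term in query_terms]
        return sorted(set.intersection(*doc_sets)) if doc_sets else []

    print(inverted)                     # {'engine': [2, 3], 'search': [1, 2], 'web': [1, 3]}
    print(resolve(["web", "engine"]))   # [3]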
The performance of the search engine is generally evaluated in
terms of recall (what fraction of the desired documents
are actually found) and precision (what fraction of the
documents shown to the user are actually desired).
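As a concrete (and entirely hypothetical) illustration of those two
metrics:

    def precision_recall(returned, relevant):
        """Precision and recall for one query.

        returned: set of document ids the engine showed the user
        relevant: set of document ids the user actually wanted
        """
        hits = returned & relevant
        precision = len(hits) / len(returned) if returned else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # Hypothetical numbers: the engine returns 4 documents, 3 of which are
    # among the 6 the user wanted -> precision 0.75, recall 0.5.
    print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7}))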
Here are some specific questions to consider as you read about how
Google tackled these problems:
- They spend quite some time compressing their data. Why do they
bother (disk is cheap) and what are some compression tricks they
use?
- They assert that, on the Web, recall is not important to
optimize, and instead one should worry about precision and speed.
How is this argument justified? How do they get improved speed
by sacrificing recall, and why doesn't the sacrifice of recall
matter?
- They don't seem to say whether their barrels are on one disk or
many. What do you think and why?
- How do people try to trick search engines into ranking their web
sites higher?
- The authors take a rather idealistic stance regarding "good
search", especially in the appendix. How well has that
stance stood up in the intervening years?
Now let's turn to the 9/11 paper, a rather different piece
reflecting a view from the outside. It is very light on the
technical side and makes an easy read from a technical perspective
(though the memories it evokes may not be pleasant); a straight
pass from beginning to end should suffice.
A number of questions arise regarding the use of naming and
finding on the Web:
- The paper points out the use of Google to "find" CNN.
Compare this to using DNS to find cnn.com. Are Google and DNS doing the
same job? What are the pros and cons of each system?
- Would it make sense to abolish DNS and use Google to resolve DNS
names?
- We hear lots about fights over DNS names. Why aren't there
similar fights about placement in search engines?