By J. H. Saltzer, 2 April 2002. Numbers updated 24 March 2004
Start with an update. On 24 March 2004, the Google home page reported that it is "Searching 4,285,199,774 web pages". The same week, Newsweek reported that Google was performing 200 million searches/day.
(This requires making some assumptions. Suppose that in the busiest second Google performs about five times as many searches as in the average second. A day has 86,400 seconds, so 200 million searches/day is an average of about 2,300 searches per second, and the busiest second would then see about 11,500 searches--call it 10,000--which would in turn require 10,000 x 20 = 200,000 disk accesses/second. Suppose a (cheap) disk can seek in 10 milliseconds, rotate into position in 8 milliseconds, and deliver the data 2 milliseconds later; one disk can thus perform about 50 operations/second. This suggests that there must be at least 4,000 disks, all running flat out. If the system is loaded to half capacity to keep queues short, you might expect to find 8,000 disks; this arithmetic is redone in the sketch below.
As of summer 2003, each Google site was actually implemented with 6,000 rack-mounted PCs, most of which have two 80-gigabyte disks. For reliability, there are at least half a dozen such sites scattered around the world--the exact number is a state secret. So Google owns about 6,000 terabytes [6 petabytes] of storage. This number helps confirm the folklore that keeping up with the computer business requires learning a new SI prefix every ten years. Storage quantities measured in yottabytes should start to show up around 2030.)
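The back-of-the-envelope arithmetic above is easy to redo with different assumptions. Here is a minimal sketch in Python; the constants (a 5x peak-to-average ratio, 20 disk accesses per search, 10 + 8 + 2 ms per disk operation, 6,000 PCs per site with two 80-GB disks, at least 6 sites) are just the figures assumed above, not measurements.

    # Back-of-envelope check of the disk and storage estimates above.
    searches_per_day = 200e6
    avg_per_sec = searches_per_day / 86_400             # about 2,300
    peak_per_sec = 5 * avg_per_sec                       # about 11,500; call it 10,000
    accesses_per_sec = 10_000 * 20                       # 20 disk accesses per search -> 200,000

    ms_per_disk_op = 10 + 8 + 2                          # seek + rotation + transfer
    ops_per_disk = 1000 / ms_per_disk_op                 # about 50 operations/second
    disks_flat_out = accesses_per_sec / ops_per_disk     # 4,000
    disks_half_loaded = 2 * disks_flat_out               # 8,000

    # Storage: 6 sites x 6,000 PCs x two 80-GB disks.
    terabytes = 6 * 6_000 * 2 * 80 / 1000                # about 5,800 TB, i.e. roughly 6 petabytes
    print(avg_per_sec, disks_flat_out, disks_half_loaded, terabytes)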
(A Venn diagram is probably the easiest visualization. A big cloud represents all web pages. A small interior cloud (labeled "wanted") represents the pages the client would really like to see. A second small interior cloud (labeled "found by query") represents the pages his query is going to discover. With luck, the two clouds will overlap, but some relevant pages will be missed and some irrelevant pages will be hit. (1) To improve recall, make the query broader, and it will catch more of the relevant pages. But it will also hit more irrelevant ones. (2) To improve precision, make the query more focused, and it will exclude more irrelevant pages. But it will also exclude some relevant ones. It is easy to improve one or the other. Getting good recall and good precision at the same time is the goal, but it is hard.)
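The two measures fall directly out of the Venn diagram: recall is the fraction of the wanted pages that the query found, and precision is the fraction of the found pages that were wanted. A small sketch (the page names are made up for illustration):

    # Recall and precision from a "wanted" set and a "found by query" set.
    wanted = {"p1", "p2", "p3", "p4"}             # pages the client really wants (hypothetical)
    found = {"p2", "p3", "p7", "p8", "p9"}        # pages the query discovered (hypothetical)
    overlap = wanted & found
    recall = len(overlap) / len(wanted)           # 2/4 = 0.50
    precision = len(overlap) / len(found)         # 2/5 = 0.40
    print(recall, precision)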
(The raw number of hits does measure the recall, and those pages probably include papers he has written, citations of his papers, acknowledgements in other people's papers, cover pages of theses his students wrote, reading lists of Tasmanian universities, traffic tickets [if there were any--there actually aren't], and genealogy pages that mention Frans Gruber and Piet Kaashoek. But we can judge precision by what appears on the first page--the first ten things Google chooses to display. Number 1 is Kaashoek's home page, number 2 is a page titled "Frans Kaashoek's papers", and number 3 is the LCS official biographical sketch.)
(Make a table with 2 billion rows and 2 billion columns. Label the rows "to" and the columns "from" [this is the reverse of the way that people usually think of arranging this kind of info, but it makes the next steps easier]. Whenever a page has a link to another page, put a one at the intersection. Put zeros everywhere else.)
(Add up the numbers in each column. That is the number of outbound links on that page. Then, divide the number in each cell of that column by the sum. The zeros remain zero, the ones become 1/sum.)
(Don't divide by zero. Remove all the all-zero columns from the matrix. The matrix is now much smaller--only about 100 million pages remain.)
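As a concrete sketch of the three steps just described (fill in the to/from matrix, divide each column by its sum, and drop the all-zero columns), here is how it might look in Python for the five-page example that appears below; a real system would of course use sparse data structures rather than a dense two-billion-square table.

    # Build and column-normalize a small link matrix.
    # Each link is recorded as a (to, from) pair: a 1 goes in row "to", column "from".
    links = [(2, 1), (3, 1), (4, 1), (5, 1),    # page 1 links to pages 2, 3, 4, and 5
             (1, 2), (1, 3), (1, 4), (1, 5),    # pages 2, 3, 4, and 5 link back to page 1
             (5, 2)]                            # and page 2 links to page 5
    n = 5
    matrix = [[0.0] * n for _ in range(n)]      # rows are "to", columns are "from"
    for to, frm in links:
        matrix[to - 1][frm - 1] = 1.0

    # Column sums = number of outbound links on each page.
    col_sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]

    # Normalize each nonzero column; an all-zero column (a page with no
    # outbound links) would simply be removed instead.
    for c in range(n):
        if col_sums[c] > 0:
            for r in range(n):
                matrix[r][c] /= col_sums[c]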
([Ignoring the damping factor] The PageRank of a page is the sum of the PageRanks of all the pages that link to that page, weighted by the numbers in the normalized link matrix.)
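In symbols (still ignoring the damping factor): writing R(p) for the PageRank of page p and L(q) for the number of outbound links on page q, the statement above is

    R(p) = \sum_{q \to p} \frac{R(q)}{L(q)}

since the normalized matrix has 1/L(q) in row p, column q exactly when q links to p.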
(Yes, but you can still make progress. First, solve the equation M · I = I, where M is the normalized link matrix and I is a vector of PageRanks--the eigenvector. But there are still an infinite number of solutions, all of which have the same ratios of the vector components. So impose the constraint that the sum of all PageRanks be one and you have a unique result that can be interpreted as a probability distribution over all web pages. [The paper skips a step of reasoning here, simply claiming that it is a probability distribution.])
Consider a five-page system in which page one has links to pages 2, 3, 4, and 5, pages 2, 3, 4, and 5 all link back to page 1, and there is a link from 2 to 5. The link matrix and normalized link matrix are
Raw link count (rows are "to", columns are "from"):

              from
               1   2   3   4   5
        to 1   0   1   1   1   1
           2   1   0   0   0   0
           3   1   0   0   0   0
           4   1   0   0   0   0
           5   1   1   0   0   0
        sums   4   2   1   1   1

Normalized (each column divided by its sum):

              from
                1    2    3    4    5
        to 1    0   .5    1    1    1
           2  .25    0    0    0    0
           3  .25    0    0    0    0
           4  .25    0    0    0    0
           5  .25   .5    0    0    0

You can solve the equation in two ways. Start with an arbitrary PageRank vector, such as the equal weighting {0.2, 0.2, 0.2, 0.2, 0.2}. Multiply the normalized matrix by that vector to produce the next approximation, {.7, .05, .05, .05, .15}. Multiply the matrix by that approximation to get another approximation; keep doing this till two successive approximations are so close together that you don't care to continue. To three significant figures you end up with {.471, .118, .118, .118, .176}.
Or, since the array is only 5 x 5, solve it in closed form. The answer is {8/17, 2/17, 2/17, 2/17, 3/17}. Unfortunately, that wouldn't be practical for the vector of 100 million pages, so Google is stuck with iteration.
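Here is a minimal sketch of the iterative method on this five-page example, using the normalized matrix shown above; it settles on the same answer, roughly {.471, .118, .118, .118, .176}, i.e. {8/17, 2/17, 2/17, 2/17, 3/17}.

    # Power iteration on the normalized link matrix of the five-page example.
    M = [[0.00, 0.50, 1.0, 1.0, 1.0],
         [0.25, 0.00, 0.0, 0.0, 0.0],
         [0.25, 0.00, 0.0, 0.0, 0.0],
         [0.25, 0.00, 0.0, 0.0, 0.0],
         [0.25, 0.50, 0.0, 0.0, 0.0]]
    rank = [0.2] * 5                            # start with the equal weighting
    for _ in range(200):
        new_rank = [sum(M[r][c] * rank[c] for c in range(5)) for r in range(5)]
        done = max(abs(a - b) for a, b in zip(new_rank, rank)) < 1e-6
        rank = new_rank
        if done:                                # two successive approximations agree
            break
    print([round(x, 3) for x in rank])          # -> [0.471, 0.118, 0.118, 0.118, 0.176]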
(If the search for Frans AND Kaashoek hits pages 1, 2, and 5, display page 1 first, page 5 second, and page 2 last. Chances are page 1 is his home page and page 5 is his list of publications.)
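In code this last step is just a sort of the hit list by the PageRanks computed above:

    # Order the pages that matched the query by PageRank, largest first.
    pagerank = {1: 8/17, 2: 2/17, 3: 2/17, 4: 2/17, 5: 3/17}     # from the example above
    hits = [1, 2, 5]                                             # pages matching the query
    print(sorted(hits, key=lambda page: pagerank[page], reverse=True))   # -> [1, 5, 2]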
(Web jargon: Page A links to page B by providing an anchor. An anchor has two parts: a URL, which names page B, and a string of text that the browser usually underlines when it displays page A. That string of text is the anchor text.)
(When the crawler stores a copy of page B for indexing, include with it the anchor text of all the links in other pages that point to page B. Then, when you create the index, pretend that you found the anchor text in page B. The reason is that anchor text often provides good characterizations of what is on page B. The author of page A chose those words specifically to identify page B to the readers of page A.)
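A minimal sketch of that bookkeeping: as the crawler sees each link, it files the anchor text under the URL of the target page, and the indexer later treats those words as if they had appeared on the target page itself. The URLs and strings here are invented for illustration.

    # Hypothetical sketch of anchor-text indexing.
    from collections import defaultdict

    anchor_text_for = defaultdict(list)          # target URL -> anchor strings pointing at it

    def record_link(target_url, anchor_text):
        # Called for every anchor the crawler finds on any page A.
        anchor_text_for[target_url].append(anchor_text)

    def words_to_index(page_url, page_text):
        # When indexing page B, pretend the collected anchor text was found in page B.
        return page_text.split() + " ".join(anchor_text_for[page_url]).split()

    record_link("http://example.edu/~kaashoek/", "Frans Kaashoek's home page")
    record_link("http://example.edu/~kaashoek/", "my thesis advisor")
    print(words_to_index("http://example.edu/~kaashoek/", "Welcome to my page"))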
Saltzer@mit.edu