By J. H. Saltzer, 2 April 2002. Numbers updated 24 March 2004
Start with an update. On 24 March 2004, the Google home page reported that it is "Searching 4,285,199,774 web pages". The same week, Newsweek reported that Google was performing 200 million searches/day.
(This requires making some assumptions. Suppose that in the busiest second Google performs about five times as many searches as in the average second. A day has 86,400 seconds, so 200 million searches/day is an average of about 2,300 searches per second, and the busiest second would then see about 11,500 searches--call it 10,000--which would in turn require 10,000 x 20 = 200,000 disk accesses/second. Suppose a (cheap) disk can seek in 10 milliseconds, rotate into position in 8 milliseconds, and deliver the data 2 milliseconds later; one disk can thus perform about 50 operations/second. This suggests that there must be at least 4,000 disks, all running flat out. If the system is loaded to half capacity to keep queues short, you might expect to find 8,000 disks; this arithmetic is redone in the sketch below.
As of summer 2003, each Google site was actually implemented with 6,000 rack-mounted PCs, most of which have two 80-gigabyte disks. For reliability, there are at least half a dozen such sites scattered around the world--the exact number is a state secret. So Google owns about 6,000 terabytes [6 petabytes] of storage. This number helps confirm the folklore that keeping up with the computer business requires learning a new SI prefix every ten years. Storage quantities measured in yottabytes should start to show up around 2030.)
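The back-of-the-envelope arithmetic above is easy to redo with different assumptions. Here is a minimal sketch in Python; the constants (a 5x peak-to-average ratio, 20 disk accesses per search, 10 + 8 + 2 ms per disk operation, 6,000 PCs per site with two 80-GB disks, at least 6 sites) are just the figures assumed above, not measurements.

    # Back-of-envelope check of the disk and storage estimates above.
    searches_per_day = 200e6
    avg_per_sec = searches_per_day / 86_400             # about 2,300
    peak_per_sec = 5 * avg_per_sec                       # about 11,500; call it 10,000
    accesses_per_sec = 10_000 * 20                       # 20 disk accesses per search -> 200,000

    ms_per_disk_op = 10 + 8 + 2                          # seek + rotation + transfer
    ops_per_disk = 1000 / ms_per_disk_op                 # about 50 operations/second
    disks_flat_out = accesses_per_sec / ops_per_disk     # 4,000
    disks_half_loaded = 2 * disks_flat_out               # 8,000

    # Storage: 6 sites x 6,000 PCs x two 80-GB disks.
    terabytes = 6 * 6_000 * 2 * 80 / 1000                # about 5,800 TB, i.e. roughly 6 petabytes
    print(avg_per_sec, disks_flat_out, disks_half_loaded, terabytes)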
(A Venn diagram is probably the easiest visualization. A big cloud represents all web pages. A small interior cloud (labeled "wanted") represents the pages the client would really like to see. A second small interior cloud (labeled "found by query") represents the pages his query is going to discover. With luck, the two clouds will overlap, but some relevant pages will be missed and some irrelevant pages will be hit. (1) To improve recall, make the query broader, and it will catch more of the relevant pages. But it will also hit more irrelevant ones. (2) To improve precision, make the query more focused, and it will exclude more irrelevant pages. But it will also exclude some relevant ones. It is easy to improve one or the other. Getting good recall and good precision at the same time is the goal, but it is hard.)
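The two measures fall directly out of the Venn diagram: recall is the fraction of the wanted pages that the query found, and precision is the fraction of the found pages that were wanted. A small sketch (the page names are made up for illustration):

    # Recall and precision from a "wanted" set and a "found by query" set.
    wanted = {"p1", "p2", "p3", "p4"}             # pages the client really wants (hypothetical)
    found = {"p2", "p3", "p7", "p8", "p9"}        # pages the query discovered (hypothetical)
    overlap = wanted & found
    recall = len(overlap) / len(wanted)           # 2/4 = 0.50
    precision = len(overlap) / len(found)         # 2/5 = 0.40
    print(recall, precision)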
(The raw number of hits does measure the recall, and those pages probably include papers he has written, citations of his papers, acknowledgements in other people's papers, cover pages of theses his students wrote, reading lists of Tasmanian universities, traffic tickets [if there were any--there actually aren't], and genealogy pages that mention Frans Gruber and Piet Kaashoek. But we can judge precision by what appears on the first page--the first ten things Google chooses to display. Number 1 is Kaashoek's home page, number 2 is a page titled "Frans Kaashoek's papers", and number 3 is the LCS official biographical sketch.)
(Make a table with 2 billion rows and 2 billion columns. Label the rows "to" and the columns "from" [this is the reverse of the way that people usually think of arranging this kind of info, but it makes the next steps easier]. Whenever a page has a link to another page, put a one at the intersection. Put zeros everywhere else.)
(Add up the numbers in each column. That is the number of outbound links on that page. Then, divide the number in each cell of that column by the sum. The zeros remain zero, the ones become 1/sum.)
(Don't divide by zero. Remove all the all-zero columns from the matrix. The matrix is now much smaller--only about 100 million pages remain.)
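As a concrete sketch of the three steps just described (fill in the to/from matrix, divide each column by its sum, and drop the all-zero columns), here is how it might look in Python for the five-page example that appears below; a real system would of course use sparse data structures rather than a dense two-billion-square table.

    # Build and column-normalize a small link matrix.
    # Each link is recorded as a (to, from) pair: a 1 goes in row "to", column "from".
    links = [(2, 1), (3, 1), (4, 1), (5, 1),    # page 1 links to pages 2, 3, 4, and 5
             (1, 2), (1, 3), (1, 4), (1, 5),    # pages 2, 3, 4, and 5 link back to page 1
             (5, 2)]                            # and page 2 links to page 5
    n = 5
    matrix = [[0.0] * n for _ in range(n)]      # rows are "to", columns are "from"
    for to, frm in links:
        matrix[to - 1][frm - 1] = 1.0

    # Column sums = number of outbound links on each page.
    col_sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]

    # Normalize each nonzero column; an all-zero column (a page with no
    # outbound links) would simply be removed instead.
    for c in range(n):
        if col_sums[c] > 0:
            for r in range(n):
                matrix[r][c] /= col_sums[c]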
([Ignoring the damping factor] The PageRank of a page is the sum of the PageRanks of all the pages that link to that page, weighted by the numbers in the normalized link matrix.)
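In symbols (still ignoring the damping factor): writing R(p) for the PageRank of page p and L(q) for the number of outbound links on page q, the statement above is

    R(p) = \sum_{q \to p} \frac{R(q)}{L(q)}

since the normalized matrix has 1/L(q) in row p, column q exactly when q links to p.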
(Yes, but you can still make progress. First, solve the equation M · I = I, where M is the normalized link matrix and I is a vector of PageRanks--the eigenvector. But there are still an infinite number of solutions, all of which have the same ratios of the vector components. So impose the constraint that the sum of all PageRanks be one and you have a unique result that can be interpreted as a probability distribution over all web pages. [The paper skips a step of reasoning here, simply claiming that it is a probability distribution.])
Consider a five-page system in which page one has links to pages 2, 3, 4, and 5, pages 2, 3, 4, and 5 all link back to page 1, and there is a link from 2 to 5. The link matrix and normalized link matrix are
Raw link count (rows are "to", columns are "from"):

              from
               1   2   3   4   5
        to 1   0   1   1   1   1
           2   1   0   0   0   0
           3   1   0   0   0   0
           4   1   0   0   0   0
           5   1   1   0   0   0
        sums   4   2   1   1   1

Normalized (each column divided by its sum):

              from
                1    2    3    4    5
        to 1    0   .5    1    1    1
           2  .25    0    0    0    0
           3  .25    0    0    0    0
           4  .25    0    0    0    0
           5  .25   .5    0    0    0

You can solve the equation in two ways. Start with an arbitrary PageRank vector, such as the equal weighting {0.2, 0.2, 0.2, 0.2, 0.2}. Multiply the normalized matrix by that vector to produce the next approximation, {.7, .05, .05, .05, .15}. Multiply the matrix by that approximation to get another approximation; keep doing this till two successive approximations are so close together that you don't care to continue. To three significant figures you end up with {.471, .118, .118, .118, .176}.
Or, since the array is only 5 x 5, solve it in closed form. The answer is {8/17, 2/17, 2/17, 2/17, 3/17}. Unfortunately, that wouldn't be practical for the vector of 100 million pages, so Google is stuck with iteration.
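Here is a minimal sketch of the iterative method on this five-page example, using the normalized matrix shown above; it settles on the same answer, roughly {.471, .118, .118, .118, .176}, i.e. {8/17, 2/17, 2/17, 2/17, 3/17}.

    # Power iteration on the normalized link matrix of the five-page example.
    M = [[0.00, 0.50, 1.0, 1.0, 1.0],
         [0.25, 0.00, 0.0, 0.0, 0.0],
         [0.25, 0.00, 0.0, 0.0, 0.0],
         [0.25, 0.00, 0.0, 0.0, 0.0],
         [0.25, 0.50, 0.0, 0.0, 0.0]]
    rank = [0.2] * 5                            # start with the equal weighting
    for _ in range(200):
        new_rank = [sum(M[r][c] * rank[c] for c in range(5)) for r in range(5)]
        done = max(abs(a - b) for a, b in zip(new_rank, rank)) < 1e-6
        rank = new_rank
        if done:                                # two successive approximations agree
            break
    print([round(x, 3) for x in rank])          # -> [0.471, 0.118, 0.118, 0.118, 0.176]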
(If the search for Frans AND Kaashoek hits pages 1, 2, and 5, display page 1 first, page 5 second, and page 2 last. Chances are page 1 is his home page and page 5 is his list of publications.)
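In code this last step is just a sort of the hit list by the PageRanks computed above:

    # Order the pages that matched the query by PageRank, largest first.
    pagerank = {1: 8/17, 2: 2/17, 3: 2/17, 4: 2/17, 5: 3/17}     # from the example above
    hits = [1, 2, 5]                                             # pages matching the query
    print(sorted(hits, key=lambda page: pagerank[page], reverse=True))   # -> [1, 5, 2]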
(Web jargon: Page A links to page B by providing an anchor. An anchor has two parts: a URL, which names page B, and a string of text that the browser usually underlines when it displays page A. That string of text is the anchor text.)
(When the crawler stores a copy of page B for indexing, include with it the anchor text of all the links in other pages that point to page B. Then, when you create the index, pretend that you found the anchor text in page B. The reason is that anchor text often provides good characterizations of what is on page B. The author of page A chose those words specifically to identify page B to the readers of page A.)
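A minimal sketch of that bookkeeping: as the crawler sees each link, it files the anchor text under the URL of the target page, and the indexer later treats those words as if they had appeared on the target page itself. The URLs and strings here are invented for illustration.

    # Hypothetical sketch of anchor-text indexing.
    from collections import defaultdict

    anchor_text_for = defaultdict(list)          # target URL -> anchor strings pointing at it

    def record_link(target_url, anchor_text):
        # Called for every anchor the crawler finds on any page A.
        anchor_text_for[target_url].append(anchor_text)

    def words_to_index(page_url, page_text):
        # When indexing page B, pretend the collected anchor text was found in page B.
        return page_text.split() + " ".join(anchor_text_for[page_url]).split()

    record_link("http://example.edu/~kaashoek/", "Frans Kaashoek's home page")
    record_link("http://example.edu/~kaashoek/", "my thesis advisor")
    print(words_to_index("http://example.edu/~kaashoek/", "Welcome to my page"))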
Saltzer@mit.edu