How Google Finds Your Needle in the Web's Haystack
David Austin
Grand Valley State University
david at merganser.math.gvsu.edu
Imagine a library containing 25 billion documents but
with no centralized organization and no librarians. In
addition, anyone may add a document at any time without
telling anyone. You may feel sure that one of the
documents contained in the collection has a piece of
information that is vitally important to you, and, being
impatient like most of us, you'd like to find it in a
matter of seconds. How would you go about doing it?
Posed in this way, the problem seems impossible. Yet
this description is not too different from the World
Wide Web, a huge, highly disorganized collection of
documents in many different formats. Of course, we're
all familiar with search engines (perhaps you found this
article using one) so we know that there is a solution.
This article will describe Google's PageRank algorithm
and how it returns pages from the web's collection of 25
billion documents that match search criteria so well
that "google" has become a widely used verb.
Most search engines, including
Google, continually run an army of computer programs
that retrieve pages from the web, index the words in
each document, and store this information in an
efficient format. Each time a user asks for a web search
using a search phrase, such as "search engine," the
search engine determines all the pages on the web that
contain the words in the search phrase. (Perhaps
additional information such as the distance between the
words "search" and "engine" will be noted as well.) Here
is the problem: Google now claims to index 25 billion
pages. Roughly 95% of the text in web pages is drawn
from a mere 10,000 words. This means that, for most
searches, there will be a huge number of pages
containing the words in the search phrase. What is
needed is a means of ranking the importance of the pages
that fit the search criteria so that the pages can be
sorted with the most important pages at the top of the
list.
One way to determine the importance of pages is to
use a human-generated ranking. For instance, you may
have seen pages that consist mainly of a large number of
links to other resources in a particular area of
interest. Assuming the person maintaining this page is
reliable, the pages referenced are likely to be useful.
Of course, the list may quickly fall out of date, and
the person maintaining the list may miss some important
pages, either unintentionally or as a result of an
unstated bias.
Google's PageRank algorithm assesses the importance
of web pages without human evaluation of the content. In
fact, Google feels that the value of its service is
largely in its ability to provide unbiased results to
search queries; Google claims, "the heart of our
software is PageRank." As we'll see, the trick is to ask
the web itself to rank the importance of pages.
How to tell who's important
If you've ever created a web page, you've probably
included links to other pages that contain valuable,
reliable information. By doing so, you are affirming the
importance of the pages you link to. Google's PageRank
algorithm stages a monthly popularity contest among all
pages on the web to decide which pages are most
important. The fundamental idea put forth by PageRank's
creators, Sergey Brin and Lawrence Page, is this: the
importance of a page is judged by the number of pages
linking to it as well as their importance.
We will assign to each web page $P$ a measure of its importance $I(P)$, called the page's PageRank. At various sites, you may find an approximation of a page's PageRank.
(For instance, the home page of The American
Mathematical Society currently has a PageRank of 8 on a
scale of 10. Can you find any pages with a PageRank of
10?) This reported value is only an approximation since
Google declines to publish actual PageRanks in an effort
to frustrate those who would manipulate the rankings.
Here's how the PageRank is determined. Suppose that page $P_j$ has $l_j$ links. If one of those links is to page $P_i$, then $P_j$ will pass on $1/l_j$ of its importance to $P_i$. The importance ranking of $P_i$ is then the sum of all the contributions made by pages linking to it. That is, if we denote the set of pages linking to $P_i$ by $B_i$, then

$$I(P_i) = \sum_{P_j \in B_i} \frac{I(P_j)}{l_j}.$$
This may remind you of the chicken and the egg: to
determine the importance of a page, we first need to
know the importance of all the pages linking to it.
However, we may recast the problem into one that is more
mathematically familiar.
Let's first create a matrix $H$, called the hyperlink matrix, in which the entry in the $i$th row and $j$th column is

$$H_{ij} = \begin{cases} 1/l_j & \text{if } P_j \in B_i \\ 0 & \text{otherwise.} \end{cases}$$
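To make the definition concrete, here is a minimal sketch in Python (with NumPy) that builds $H$ for a hypothetical three-page web; the link structure is invented for illustration and is not one of this article's examples.

```python
import numpy as np

# Hypothetical web: page 0 links to pages 1 and 2, page 1 links to
# page 2, and page 2 links to page 0.  links[j] lists the pages that
# page j links to.
links = {0: [1, 2], 1: [2], 2: [0]}
n = 3

# H[i, j] = 1/l_j if page j links to page i, and 0 otherwise.
H = np.zeros((n, n))
for j, targets in links.items():
    for i in targets:
        H[i, j] = 1.0 / len(targets)

print(H)
print(H.sum(axis=0))   # each column sums to 1: every page here has links
```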
Notice that H has some special
properties. First, its entries are all nonnegative.
Also, the sum of the entries in a column is one unless
the page corresponding to that column has no links.
Matrices in which all the entries are nonnegative and
the sum of the entries in every column is one are called
stochastic; they will play an important role in
our story.
We will also form a vector

$$I = \begin{pmatrix} I(P_1) \\ I(P_2) \\ \vdots \\ I(P_n) \end{pmatrix}$$

whose components are the PageRanks, that is, the importance rankings, of all the pages. The condition above defining the PageRank may be expressed as

$$I = HI.$$

In other words, the vector $I$ is an eigenvector of the matrix $H$ with eigenvalue 1. We also call this a stationary vector of $H$.
Let's look at an example. Shown below is a representation of a small collection of eight web pages, with links represented by arrows.
The corresponding hyperlink matrix $H$ is an $8 \times 8$ matrix (not reproduced here) whose stationary vector is

$$I \approx \left(0.06,\ 0.0675,\ 0.03,\ 0.0675,\ 0.0975,\ 0.2025,\ 0.18,\ 0.295\right)^T.$$
This shows that page 8 wins the popularity contest.
Here is the same figure with the web pages shaded in
such a way that the pages with higher PageRanks are
lighter.
Computing I
There are many ways to find the eigenvectors of a
square matrix. However, we are in for a special
challenge since the matrix H is a
square matrix with one column for each web page indexed
by Google. This means that H has about
n = 25 billion columns and rows. However, most
of the entries in H are zero; in fact,
studies show that web pages have an average of about 10
links, meaning that, on average, all but 10 entries in
every column are zero. We will choose a method known as
the power method for finding the stationary
vector I of the matrix H.
How does the power method work? We begin by choosing a vector $I^0$ as a candidate for $I$ and then producing a sequence of vectors $I^k$ by

$$I^{k+1} = HI^k.$$
The method is founded on the following general
principle that we will soon investigate.
General principle: The sequence $I^k$ will converge to the stationary vector $I$.
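As a minimal sketch of the iteration, again in Python and reusing the hypothetical three-page matrix $H$ built earlier, the power method is nothing more than repeated matrix-vector multiplication:

```python
import numpy as np

# The hypothetical three-page hyperlink matrix from the earlier sketch.
H = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])

I = np.array([1.0, 0.0, 0.0])   # initial candidate I^0
for k in range(100):
    I = H @ I                    # I^{k+1} = H I^k

print(I)   # approaches the stationary vector, which satisfies I = H I
```

For this small web the iterates settle down to about $(0.4, 0.2, 0.4)$; the sections below show why the method is not always so well behaved.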
We will illustrate with the example above.
| $I^0$ | $I^1$ | $I^2$ | $I^3$ | $I^4$ | ... | $I^{60}$ | $I^{61}$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 0 | 0.0278 | ... | 0.06 | 0.06 |
| 0 | 0.5 | 0.25 | 0.1667 | 0.0833 | ... | 0.0675 | 0.0675 |
| 0 | 0.5 | 0 | 0 | 0 | ... | 0.03 | 0.03 |
| 0 | 0 | 0.5 | 0.25 | 0.1667 | ... | 0.0675 | 0.0675 |
| 0 | 0 | 0.25 | 0.1667 | 0.1111 | ... | 0.0975 | 0.0975 |
| 0 | 0 | 0 | 0.25 | 0.1806 | ... | 0.2025 | 0.2025 |
| 0 | 0 | 0 | 0.0833 | 0.0972 | ... | 0.18 | 0.18 |
| 0 | 0 | 0 | 0.0833 | 0.3333 | ... | 0.295 | 0.295 |
It is natural to ask what these numbers mean. Of
course, there can be no absolute measure of a page's
importance, only relative measures for comparing the
importance of two pages through statements such as "Page
A is twice as important as Page B." For this reason, we
may multiply all the importance rankings by some fixed
quantity without affecting the information they convey.
In this way, we will always assume, for reasons to be
explained shortly, that the sum of all the popularities
is one.
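In code, this normalization is one line; the sketch below rescales a positive multiple of the stationary vector found above so that its entries sum to one.

```python
import numpy as np

# A positive multiple of the PageRank vector from the table above.
I = np.array([6.0, 6.75, 3.0, 6.75, 9.75, 20.25, 18.0, 29.5])

I = I / I.sum()    # fix the scale: the rankings now sum to one
print(I)           # [0.06  0.0675 0.03  0.0675 0.0975 0.2025 0.18  0.295]
```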
Three important questions
Three questions naturally come to mind:
- Does the sequence $I^k$ always converge?
- Is the vector to which it converges independent of the initial vector $I^0$?
- Do the importance rankings contain the information that we want?
Given the current method, the answer to all three
questions is "No!" However, we'll see how to modify our
method so that we can answer "yes" to all three.
Let's first look at a very simple example. Consider a small web consisting of two web pages, the first of which links to the second, with matrix

$$H = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}.$$
Here is one way in which our algorithm could proceed:
| $I^0$ | $I^1$ | $I^2$ | $I^3 = I$ |
| --- | --- | --- | --- |
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
In this case, the importance rating of both pages is
zero, which tells us nothing about the relative
importance of these pages. The problem is that $P_2$ has no links. Consequently, it takes some of the importance from page $P_1$ in each iterative step but does not pass it on to any other page. This has the effect of
draining all the importance from the web. Pages with no
links are called dangling nodes, and there are,
of course, many of them in the real web we want to
study. We'll see how to deal with them in a minute, but
first let's consider a new way of thinking about the
matrix H and stationary vector
I.
A probabilistic interpretation of H
Imagine that we surf the web at random; that is, when
we find ourselves on a web page, we randomly follow one
of its links to another page after one second. For
instance, if we are on page $P_j$ with $l_j$ links, one of which takes us to page $P_i$, the probability that we next end up on page $P_i$ is then $1/l_j$.

As we surf randomly, we will denote by $T_j$ the fraction of time that we spend on page $P_j$. Then the fraction of the time that we end up on page $P_i$ coming from $P_j$ is $T_j/l_j$. If we end up on $P_i$, we must have come from a page linking to it. This means that

$$T_i = \sum_{P_j \in B_i} \frac{T_j}{l_j}$$

where the sum is over all the pages $P_j$ linking to $P_i$. Notice that this is the same equation defining the PageRank rankings and so $I(P_i) = T_i$. This allows us to interpret a web page's PageRank as
the fraction of time that a random surfer spends on that
web page. This may make sense if you have ever surfed
around for information about a topic you were unfamiliar
with: if you follow links for a while, you find yourself
coming back to some pages more often than others. Just
as "All roads lead to Rome," these are typically more
important pages.
Notice that, given this interpretation, it is natural
to require that the sum of the entries in the PageRank
vector I be one.
Of course, there is a complication in this
description: If we surf randomly, at some point we will
surely get stuck at a dangling node, a page with no
links. To keep going, we will choose the next page at
random; that is, we pretend that a dangling node has a
link to every other page. This has the effect of
modifying the hyperlink matrix H by
replacing the column of zeroes corresponding to a
dangling node with a column in which each entry is
1/n. We call this new matrix
S.
In our previous example, we now have the matrix

$$S = \begin{pmatrix} 0 & 1/2 \\ 1 & 1/2 \end{pmatrix}$$

and eigenvector

$$I = \begin{pmatrix} 1/3 \\ 2/3 \end{pmatrix}.$$
In other words, page $P_2$ has twice the importance of page $P_1$, which may feel about right to you.
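We can sanity-check this with a short simulation, a sketch in which the surfer follows links when possible and jumps to a uniformly random page from the dangling node:

```python
import random

# Two pages: page 0 links to page 1; page 1 is a dangling node.
links = {0: [1], 1: []}
n = 2

random.seed(1)
visits = [0] * n
page = 0
steps = 100_000
for _ in range(steps):
    visits[page] += 1
    if links[page]:                   # follow a random outgoing link
        page = random.choice(links[page])
    else:                             # dangling node: jump anywhere
        page = random.randrange(n)

print([v / steps for v in visits])    # approximately [1/3, 2/3]
```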
The matrix S has the pleasant
property that the entries are nonnegative and the sum of
the entries in each column is one. In other words, it is
stochastic. Stochastic matrices have several properties
that will prove useful to us. For instance, stochastic
matrices always have stationary vectors.
For later purposes, we will note that
S is obtained from H
in a simple way. If A is the matrix
whose entries are all zero except for the columns
corresponding to dangling nodes, in which each entry is
1/n, then S =
H + A.
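In code, the adjustment from $H$ to $S$ is a small sketch: locate the zero columns and fill them with $1/n$.

```python
import numpy as np

def to_stochastic(H):
    """Return S = H + A: replace each zero column of H (a dangling
    node) with a column whose entries are all 1/n."""
    n = H.shape[0]
    S = H.copy()
    S[:, S.sum(axis=0) == 0] = 1.0 / n
    return S

# The two-page example: page 0 links to page 1, page 1 is dangling.
H = np.array([[0.0, 0.0],
              [1.0, 0.0]])
print(to_stochastic(H))   # [[0. 0.5], [1. 0.5]]
```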
How does the power method work?
In general, the power method is a technique for
finding an eigenvector of a square matrix corresponding
to the eigenvalue with the largest magnitude. In our
case, we are looking for an eigenvector of
S corresponding to the eigenvalue 1.
Under the best of circumstances, to be described soon, the other eigenvalues of $S$ will have a magnitude smaller than one; that is, $|\lambda| < 1$ if $\lambda$ is an eigenvalue of $S$ other than 1. We will assume that the eigenvalues of $S$ are $\lambda_1, \lambda_2, \ldots, \lambda_n$ with

$$\lambda_1 = 1 \quad \text{and} \quad |\lambda_j| < 1 \text{ for } j \geq 2.$$

We will also assume that there is a basis $v_1, v_2, \ldots, v_n$ of eigenvectors for $S$ with corresponding eigenvalues $\lambda_j$. This assumption is not necessarily true, but with it we may more easily illustrate how the power method works. We may write our initial vector $I^0$ as

$$I^0 = c_1 v_1 + c_2 v_2 + \cdots + c_n v_n.$$

Then

$$I^k = S^k I^0 = c_1 v_1 + c_2 \lambda_2^k v_2 + \cdots + c_n \lambda_n^k v_n.$$

Since the eigenvalues $\lambda_j$ with $j \geq 2$ have magnitude smaller than one, it follows that $\lambda_j^k \to 0$ if $j \geq 2$ and therefore $I^k \to c_1 v_1$, an eigenvector corresponding to the eigenvalue 1.

It is important to note here that the rate at which $I^k \to I$ is determined by $|\lambda_2|$. When $|\lambda_2|$ is relatively close to 0, then $I^k \to I$
relatively quickly. For instance, consider a matrix whose second eigenvalue $\lambda_2$ is close to 0. [Figure: the vectors $I^k$, shown in red, converge quickly to the stationary vector $I$, shown in green.]
Now consider a matrix whose second eigenvalue $\lambda_2$ has magnitude close to 1. [Figure: the vectors $I^k$ converge more slowly to the stationary vector $I$ when the second eigenvalue has a larger magnitude.]
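The effect of $|\lambda_2|$ on the speed of convergence is easy to observe numerically. The sketch below uses two hypothetical $2 \times 2$ stochastic matrices (not the matrices from the figures above); a symmetric matrix with entries $a$ and $1-a$ has eigenvalues 1 and $2a - 1$.

```python
import numpy as np

def steps_to_converge(S, tol=1e-10):
    """Count power-method steps until successive iterates agree to tol."""
    I = np.array([1.0, 0.0])
    for k in range(1, 10_000):
        I_next = S @ I
        if np.abs(I_next - I).sum() < tol:
            return k
        I = I_next

fast = np.array([[0.625, 0.375],
                 [0.375, 0.625]])    # lambda_2 = 0.25
slow = np.array([[0.875, 0.125],
                 [0.125, 0.875]])    # lambda_2 = 0.75

print(steps_to_converge(fast))   # converges in a handful of steps
print(steps_to_converge(slow))   # takes several times as many steps
```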
When things go wrong
In our discussion above, we assumed that the matrix $S$ had the property that $\lambda_1 = 1$ and $|\lambda_j| < 1$ for $j \geq 2$. This does not always happen, however, for the matrices $S$ that we might find.
Suppose that our web consists of five pages arranged in a ring, each page linking only to the next, so that page 1 leads to page 2, and so on, until page 5 links back to page 1. In this case, the matrix $S$ is

$$S = \begin{pmatrix} 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}.$$

Then we see
| $I^0$ | $I^1$ | $I^2$ | $I^3$ | $I^4$ | $I^5$ |
| --- | --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1 | 0 |
In this case, the sequence of vectors $I^k$ fails to converge: since $I^5 = I^0$, the iterates simply cycle through five states forever. Why is this? The second eigenvalue of the matrix $S$ satisfies $|\lambda_2| = 1$, and so the argument we gave to justify the power method no longer holds.
To guarantee that $|\lambda_2| < 1$, we need the matrix $S$ to be primitive. This means that, for some $m$, $S^m$ has all positive entries. In other words, if we are given two pages, it is possible to get from the first page to the second after following $m$ links. Clearly, our most recent example does not satisfy this property. In a moment, we will see how to modify our matrix $S$ to obtain a primitive, stochastic matrix, which therefore satisfies $|\lambda_2| < 1$.
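Primitivity can be tested directly, at least for small examples: compute powers of $S$ and look for one with all positive entries. Here is a sketch, applied to the five-page ring above, where every power is again a permutation matrix:

```python
import numpy as np

def is_primitive(S):
    """Check whether some power S^m has all positive entries.  For an
    n x n matrix, checking up to m = n^2 - 2n + 2 suffices (Wielandt)."""
    n = S.shape[0]
    P = S.copy()
    for m in range(1, n * n - 2 * n + 3):
        if (P > 0).all():
            return True
        P = P @ S
    return False

# The five-page ring: page 1 -> 2 -> 3 -> 4 -> 5 -> 1.
S = np.roll(np.eye(5), 1, axis=0)
print(is_primitive(S))   # False: no power is ever all positive
```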
Here's another example showing how our method can
fail. Consider the web shown below.
In this case, we may form the matrix $S$ and compute its stationary vector as before (neither is reproduced here).
Notice that the PageRanks assigned to the first four web pages are zero. However, this doesn't feel right: each of these pages has links coming to it from other pages. Clearly, somebody likes these pages! Generally
speaking, we want the importance rankings of all pages
to be positive. The problem with this example is that it
contains a smaller web within it, shown in the blue box
below.
Links come into this box, but none go out. Just as in
the example of the dangling node we discussed above,
these pages form an "importance sink" that drains the
importance out of the other four pages. This happens
when the matrix $S$ is reducible; that is, when $S$ can be written in block form as

$$S = \begin{pmatrix} S_{11} & 0 \\ S_{21} & S_{22} \end{pmatrix}.$$
Indeed, if the matrix S is
irreducible, we can guarantee that there is a stationary
vector with all positive entries.
A web is called strongly connected if, given
any two pages, there is a way to follow links from the
first page to the second. Clearly, our most recent
example is not strongly connected. However, strongly
connected webs provide irreducible matrices
S.
To summarize, the matrix $S$ is stochastic, which implies that it has a stationary vector. However, we need $S$ to also be (a) primitive, so that $|\lambda_2| < 1$, and (b) irreducible, so that the stationary vector has all positive entries.
A final modification
To find a new matrix that is both primitive and
irreducible, we will modify the way our random surfer
moves through the web. As it stands now, the movement of
our random surfer is determined by S:
either he will follow one of the links on his current
page or, if at a page with no links, randomly choose any
other page to move to. To make our modification, we will first choose a parameter $\alpha$ between 0 and 1. Now suppose that our random surfer moves in a slightly different way. With probability $\alpha$, he is guided by $S$. With probability $1 - \alpha$, he chooses the next page at random.

If we denote by $\mathbf{1}$ the matrix whose entries are all one, we obtain the Google matrix:

$$G = \alpha S + (1 - \alpha)\frac{1}{n}\mathbf{1}.$$
Notice now that G is stochastic as
it is a combination of stochastic matrices. Furthermore,
all the entries of G are positive,
which implies that G is both primitive
and irreducible. Therefore, G has a
unique stationary vector I that may be found
using the power method.
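For small examples, we can form $G$ explicitly and run the power method on it. Here is a sketch, with the caveat that real computations never build the dense matrix $G$ (that issue is taken up below):

```python
import numpy as np

def google_matrix(H, alpha=0.85):
    """G = alpha * S + (1 - alpha)/n * ones, where S patches the
    dangling-node columns of H."""
    n = H.shape[0]
    S = H.copy()
    S[:, S.sum(axis=0) == 0] = 1.0 / n
    return alpha * S + (1 - alpha) / n * np.ones((n, n))

# The two-page web again: page 0 links to page 1, page 1 is dangling.
H = np.array([[0.0, 0.0],
              [1.0, 0.0]])
G = google_matrix(H)

I = np.ones(2) / 2
for _ in range(100):
    I = G @ I
print(I)   # the unique stationary vector; both entries are positive
```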
The role of the parameter $\alpha$ is an important one. Notice that if $\alpha = 1$, then $G = S$. This means that we are working with the original hyperlink structure of the web. However, if $\alpha = 0$, then $G = \frac{1}{n}\mathbf{1}$. In other words, the web we are considering has a link between any two pages and we have lost the original hyperlink structure of the web. Clearly, we would like to take $\alpha$ close to 1 so that the hyperlink structure of the web is weighted heavily into the computation.
However, there is another consideration. Remember that the rate of convergence of the power method is governed by the magnitude of the second eigenvalue $|\lambda_2|$. For the Google matrix, it has been proven that the magnitude of the second eigenvalue satisfies $|\lambda_2| \leq \alpha$. This means that when $\alpha$ is close to 1, the convergence of the power method will be very slow. As a compromise between these two competing interests, Sergey Brin and Larry Page, the creators of PageRank, chose $\alpha = 0.85$.
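This bound is easy to check numerically on a small example. The sketch below forms the Google matrix for the five-page ring from earlier, whose matrix $S$ has $|\lambda_2| = 1$, and finds that the second eigenvalue of $G$ has magnitude $\alpha$:

```python
import numpy as np

alpha = 0.85
S = np.roll(np.eye(5), 1, axis=0)           # the five-page ring
G = alpha * S + (1 - alpha) / 5 * np.ones((5, 5))

lams = sorted(np.linalg.eigvals(G), key=abs, reverse=True)
print(abs(lams[0]))   # approximately 1.0
print(abs(lams[1]))   # approximately 0.85, consistent with |lambda_2| <= alpha
```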
Computing I
What we've described so far looks like a good theory, but remember that we need to apply it to $n \times n$ matrices where $n$ is about 25 billion! In fact,
the power method is especially well-suited to this
situation.
Remember that the stochastic matrix $S$ may be written as $S = H + A$ and therefore the Google matrix has the form

$$G = \alpha(H + A) + \frac{1 - \alpha}{n}\mathbf{1}.$$

Therefore,

$$GI^k = \alpha HI^k + \alpha AI^k + \frac{1 - \alpha}{n}\mathbf{1}I^k.$$
Now recall that most of the entries in $H$ are zero; on average, only ten entries per column are nonzero. Therefore, evaluating $HI^k$ requires only about ten nonzero terms for each entry in the resulting vector. Also, the rows of $A$ are all identical, as are the rows of $\mathbf{1}$. Therefore, evaluating $AI^k$ and $\mathbf{1}I^k$ amounts to adding the current importance rankings of the dangling nodes or of all web pages, and each of these sums only needs to be computed once per step.
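Here is a sketch of a single power-method step organized this way, using SciPy's sparse matrices; the function name and structure are illustrative, not Google's actual implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank_step(H, I, dangling, alpha=0.85):
    """One step G I = alpha*H*I + alpha*A*I + (1-alpha)/n * 1*I,
    without forming the dense matrices A, 1, or G."""
    n = I.size
    v = alpha * (H @ I)                     # sparse matrix-vector product
    v += alpha * I[dangling].sum() / n      # A I: summed rank of dangling nodes
    v += (1 - alpha) * I.sum() / n          # 1 I: summed rank of all pages
    return v

# Tiny demonstration: page 0 links to page 1, page 1 is dangling.
H = csr_matrix(np.array([[0.0, 0.0],
                         [1.0, 0.0]]))
dangling = np.array([1])
I = np.ones(2) / 2
for _ in range(100):
    I = pagerank_step(H, I, dangling)
print(I)   # matches the dense computation above
```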
With the value of $\alpha$ chosen to be near 0.85, Brin and Page report that 50 to 100 iterations are required to obtain a sufficiently good approximation to $I$. The calculation is reported to take a few days to complete.
Of course, the web is continually changing. First,
the content of web pages, especially for news
organizations, may change frequently. In addition, the
underlying hyperlink structure of the web changes as
pages are added or removed and links are added or
removed. It is rumored that Google recomputes the
PageRank vector I roughly every month. Since
the PageRank of pages can be observed to fluctuate
considerably during this time, it is known to some as
the Google Dance. (In 2002, Google held a Google Dance!)
Summary
Brin and Page introduced Google in 1998, a time when the pace at which the web was growing began to outstrip the ability of existing search engines to yield usable results. At that time, most search engines had been developed by businesses that were not interested in publishing the details of how their products worked. In developing Google, Brin and Page wanted to "push more development and understanding into the academic realm."
That is, they hoped, first of all, to improve the design
of search engines by moving it into a more open,
academic environment. In addition, they felt that the
usage statistics for their search engine would provide
an interesting data set for research. It appears that
the federal government, which recently tried to gain
some of Google's statistics, feels the same way.
There are other algorithms that use the hyperlink
structure of the web to rank the importance of web
pages. One notable example is the HITS algorithm,
devised by Jon Kleinberg, which forms the basis of the
Teoma search
engine. In fact, it is interesting to compare the
results of searches sent to different search engines as
a way to understand why some complain of a Googleopoly.
References
- Michael Berry, Murray Browne,
Understanding Search Engines: Mathematical
Modeling and Text Retrieval. Second Edition,
SIAM, Philadelphia. 2005.
- Sergey Brin, Lawrence Page, The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems, 33: 107-17, 1998. Also available online at http://infolab.stanford.edu/pub/papers/google.pdf
- Kurt Bryan, Tanya Leise, The
$25,000,000,000 eigenvector. The linear algebra behind
Google. SIAM Review, 48 (3), 569-81. 2006.
Also available at http://www.rose-hulman.edu/~bryan/google.html
- Google Corporate Information:
Technology.
- Taher Haveliwala, Sepandar Kamvar, The second eigenvalue of the Google
matrix.
- Amy Langville, Carl Meyer, Google's
PageRank and Beyond: The Science of Search Engine
Rankings. Princeton University Press, 2006.
This is an informative, accessible book, written
in an engaging style. Besides providing the relevant
mathematical background and details of PageRank and
its implementation (as well as Kleinberg's HITS
algorithm), this book contains many interesting
"Asides" that give trivia illuminating the context of
search engine design.
David Austin
Grand Valley State University
david at merganser.math.gvsu.edu
NOTE: Those who can access JSTOR can find some of the papers mentioned above there. For those with access, the American Mathematical Society's MathSciNet can be used to get additional bibliographic information and reviews of some of these materials. Some of the items above can be accessed via the ACM Portal, which also provides bibliographic services.