Using your laptop to compute PageRank for millions of webpages

最新推荐文章于 2022-12-07 11:12:05 发布

xingzhixi

最新推荐文章于 2022-12-07 11:12:05 发布

阅读量588

点赞数

文章标签： matrix vector dictionary algorithm structure random

本文链接：https://blog.csdn.net/xingzhixi/article/details/6913481

版权

The PageRank algorithm is a great way of using collective intelligence to determine the importance of a webpage. There’s a big problem, though, which is that PageRank is difficult to apply to the web as a whole, simply because the web contains so many webpages. While just a few lines of code can be used to implement PageRank on collections of a few thousand webpages, it’s trickier to compute PageRank for larger sets of pages.

The underlying problem is that the most direct way to compute the PageRank of webpages involves inverting an $n \times n$ matrix. Even when is just a few thousand, this means inverting a matrix containing millions or tens of millions of floating point numbers. This is possible on a typical personal computer, but it’s hard to go much further.

In this post, I describe how to compute PageRank for collections containing millions of webpages. My little laptop easily coped with two million pages, using about 650 megabytes of RAM and a few hours of computation – at a guess it would max out at about 4 to 5 million pages. More powerful machines could cope with upwards of 10 million pages. And that’s with a program written in Python, an interpreted language not noted for its efficiency! Of course, this number still falls a long way short of the size of the entire web, but it’s a lot better than the naive approach. In a future post I’ll describe a more scalable parallel implementation that can be used to compute PageRank for hundreds of millions of pages.

Let’s quickly run through the basic facts we need to know about PageRank. The idea is that we number webpages $1,\ldots,n$ . For webpage number there is an associated PageRank q_j which measures the importance of page . The PageRank is a number between and ; the bigger the PageRank, the more important the page. Furthermore, the vector $q = (q_1,\ldots,q_n)$ of all the PageRanks is a probability distribution, that is, the PageRanks sum to one.

How is the PageRank vector computed? I’ll just describe the mathematical upshot here; the full motivation in terms of a crazy websurfer who randomly surfs the web is described in an earlier post. The upshot is that the PageRank vector can be defined by the equation (explanation below):

$q \leftarrow M^j P.$

What this equation represents is a starting distribution for the crazy websurfer, and then steps of “surfing”, with each action of representing how the distribution changes in a single step. is an -dimensional vector, and is an $n \times n$ matrix whose entries reflect the link structure of the web in a way I’ll describe below. The PageRank is defined in the limit of large – in our examples, convergence occurs for in the range to . You might wonder how is chosen, but part of the magic of PageRank is that it doesn’t matter how is chosen, provided it’s a probability distribution. The intuition here is that the starting distribution for the websurfer doesn’t matter to their long-run behaviour. We’ll therefore choose the uniform probability distribution, $P = (1/n,1/n,\ldots)$ .

How is the matrix defined? It can be broken up into three pieces

M = s A + s D + t E,

which correspond to, respectively, a contribution representing the crazy websurfer randomly picking links to follow, a contribution due to the fact that the websurfer can’t randomly pick a link when they hit a dangling page (i.e., one with no outbound links), and so something else needs to be done in that case, and finally a contribution representing the websurfer getting bored and “teleporting” to a random webpage.

We’ll set s = 0.85 and t = 1-s = 0.15 as the respective probabilities for randomly selecting a link and teleporting. See here for discussion of the reasons for this choice.

The matrix describes the link structure of the web. In particular, suppose we define $\#(j)$ to be the number of links outbound from page . Then $A_{kj}$ is defined to be if page does not link to , and $1/\#(j)$ if page does link to page . Stated another way, the entries of the th column of are zero, except at locations corresponding to outgoing links, where they are $1/\#(j)$ . The intuition is that describes the action of a websurfer at page randomly choosing an outgoing link.

The matrix is included to deal with the problem of dangling pages, i.e., pages with no outgoing links, where it is ambigous what it means to choose an outgoing link at random. The conventional resolution is to choose another page uniformly at random from the entire set of pages. What this means is that if is a dangling page, then theth column of should have all its entries 1/n , otherwise all the entries should be zero. A compact way of writing this is

D = e d^T/ n,

where is the vector of dangling pages, i.e., the th entry of is if page is dangling, and otherwise is zero. is the vector whose entries are all s.

The final piece of is the matrix , describing the bored websurfer teleporting somewhere else at random. This matrix has entries 1/n everywhere, representing a uniform probability of going to another webpage.

Exercises

This exercise is for people who’ve read my earlier post introducing PageRank. In that post you’ll see that I’ve broken up in a slightly different way. You should verify that the breakup in the current post and the breakup in the old post result in the same matrix .

Now, as we argued above, you quickly run into problems in computing the PageRank vector if you naively try to store all $n \times n = n^2$ elements in the matrix or M^j . Fortunately, it’s possible to avoid doing this. The reason is that the matrices A, D and all have a special structure. is a very sparse matrix, since most webpages only link to a tiny fraction of the web as a whole, and so most entries are zero. is not ordinarily sparse, but it does have a very simple structure, determined by the vector (not matrix!) of dangling pages, , and this makes it easy to do vector multiplications involving . Finally, the matrix is even more simple, and it’s straightforward to compute the action of on a vector.

The idea used in our algorithm is therefore to compute M^jP by repeatedly multiplying the vector by , and with each multiplication using the special form M = sA+sD+tE to avoid holding anything more than -dimensional vectors in memory. This is the critical step that lets us compute PageRank for a web containing millions of webpages.

The crux of the algorithm, therefore, is to determine (Mv)_j for each index , where is an arbitrary vector. We do this by computing (Av)_j, (Dv)_j and (Ev)_j separately, and then summing the results, with the appropriate prefactors, or .

To compute (Av)_j , just observe that $(Av)_j = \sum_k A_{jk} v_k$ . This can be computed by summing over all the pages which have links to page , with $A_{jk} = 1/\#(k)$ in each case. We see that the total number of operations involved in computing all the entries in is $\Theta(l)$ , where is the total number of links in the web graph, a number that scales far more slowly than n^2 . Furthermore, the memory consumed is $\Theta(n)$ .

I’m being imprecise in analysing the number of operations and the memory consumption, not worrying about small constant factors. If we were really trying to optimize our algorithm, we’d be careful to keep track of exactly what resources are consumed, rather than ignoring those details. However, as we’ll see below, the advantages of the approach we’re now considering over the approach based on matrix inversion are enormous, and so ignoring a few small factors is okay for the purposes of comparison.

To compute , observe that $Dv = e (d \cdot v) / n$ , and so it suffices to compute the inner product $d \cdot v$ , which in the worst case requires $\Theta(n)$ operations, and $\Theta(n)$ memory, to store .

Finally, to compute (Ev)_j is easy, since it is just $(e \cdot v)/n$ , which requires virtually no additional computation or memory.

It follows that we can compute using a total number of operations $\Theta(n+l)$ , and total memory $\Theta(n)$ , and thus approximate the PageRank vector

$q \approx M^j P$

with a total number of operations $\Theta(j (n+l))$ , and with memory consumption $\Theta(n)$ .

These are imposingly large numbers. Suppose we have n = 10^6 webpages, l = 10^7 links, and use j = 10^2 iterations. Then the total amount of memory involved is on the order of 10^6 floating point numbers, and the total number of operations is on the order of 10^9 .

On the other hand, if we’d used the method that required us to store the full matrix , we’d have needed to store on the order of $10^{12}$ floating pointing numbers. Furthermore using Gaussian elimination the matrix inversion requires on the order of $10^{18}$ operations, give or take a few (but only a few!) orders of magnitude. More efficient methods for matrix inverstion are known, but the numbers remain daunting.

The code

Let’s think about how we’d code this algorithm up in Python. If you’re not already familiar with Python, then a great way to learn is “Dive into Python”, an excellent free online introductory book about Python by Mark Pilgrim. You can read the first five chapters in just a few hours, and that will provide you with enough basic understanding of Python to follow most code, and to do a lot of coding yourself. Certainly, it’s more than enough for all the examples in this post series.

The most important part of our implementation of PageRank is to choose our representation of the web in a compact way, and so that the information we need to compute M^jP can be retrieved quickly. This is achieved by the following definition for a class web, whose instances represent possible link structures for the web:

class web:
  def __init__(self,n):
    self.size = n
    self.in_links = {}
    self.number_out_links = {}
    self.dangling_pages = {}
    for j in xrange(n):
      self.in_links[j] = []
      self.number_out_links[j] = 0
      self.dangling_pages[j] = True

Suppose g is an object of type web, created using, e.g., g = web(3), to create a web of 3 pages. g.size = 3 for this web. g.in_links is a dictionary whose keys are numbers for the webpages, 0,1,2, and with corresponding values being lists of the inbound webpages. These lists are initially set to contain no webpages:

g.in_links = {0 : [], 1 : [], 2 : []}

Note that in Python lists are indexed $0,1,2,\ldots$ , and this makes it convenient to number pages $0,\ldots,n-1$ , rather than $1,\ldots,n$ , as we have been doing up to now.

The advantages of storing the in-links as a dictionary of lists are two-fold. First, by using this representation, we store no more information than we absolutely need. E.g., we are not storing information about the absence of links, as we would be if we stored the entire adjacency matrix. Second, Python dictionaries are a convenient way to store the information, because they don’t require any ordering information, and so are in many ways faster to manipulate than lists.

You might be wondering why there is no attribute g.out_links in our class definition. The glib answer is that we don’t need it – after all, it can, in principle, be reconstructed from g.in_links. A better answer is that in real applications it would quite sensible to also have g.out_links, but storing that extra dictionary would consume a lot of extra memory, and we avoid it since we can compute PageRank without it. An even better answer would be to make use of a data structure that gives us the speed and convenience provided by both g.in_links and g.out_links, but only requires links to be stored once. This can certainly be done, but requires more work, and so I’ve avoided it.

Let’s finish our description of the attributes of g. g.number_out_links is a dictionary such that g.number_out_links[j] is the number of out-links from page j. Initially these values are all set to zero, but they will be incremented when links are added. We need to store the number of outgoing links in order to compute the probabilities $1/\#(k)$ used in the PageRank algorithm, as described earlier. While this could in principle be computed from the dictionary g.in_links, in practice that would be extremely slow.

Finally, g.dangling_pages is a dictionary whose key set is the set of dangling webpages. The values of the keys are placeholders which are never used, and so we use True as a placeholder value. Initially, all webpages are dangling, but we remove keys when links are added.

With these data structures in place, the rest of the program is a pretty straightforward implementation of the algorithm described earlier. Here’s the code:

# Generates a random web link structure, and finds the
# corresponding PageRank vector.  The number of inbound
# links for each page is controlled by a power law
# distribution.
#
# This code should work for up to a few million pages on a
# modest machine.
#
# Written and tested using Python 2.5.

import numpy
import random

class web:
  def __init__(self,n):
    self.size = n
    self.in_links = {}
    self.number_out_links = {}
    self.dangling_pages = {}
    for j in xrange(n):
      self.in_links[j] = []
      self.number_out_links[j] = 0
      self.dangling_pages[j] = True

def paretosample(n,power=2.0):
  '''Returns a sample from a truncated Pareto distribution
  with probability mass function p(l) proportional to
  1/l^power.  The distribution is truncated at l = n.'''
  m = n+1
  while m > n: m = numpy.random.zipf(power)
  return m

def random_web(n=1000,power=2.0):
  '''Returns a web object with n pages, and where each
  page k is linked to by L_k random other pages.  The L_k
  are independent and identically distributed random
  variables with a shifted and truncated Pareto
  probability mass function p(l) proportional to
  1/(l+1)^power.'''
  g = web(n)
  for k in xrange(n):
    lk = paretosample(n+1,power)-1
    values = random.sample(xrange(n),lk)
    g.in_links[k] = values
    for j in values:
      if g.number_out_links[j] == 0: g.dangling_pages.pop(j)
      g.number_out_links[j] += 1
  return g

def step(g,p,s=0.85):
  '''Performs a single step in the PageRank computation,
  with web g and parameter s.  Applies the corresponding M
  matrix to the vector p, and returns the resulting
  vector.'''
  n = g.size
  v = numpy.matrix(numpy.zeros((n,1)))
  inner_product = sum([p[j] for j in g.dangling_pages.keys()])
  for j in xrange(n):
    v[j] = s*sum([p[k]/g.number_out_links[k]
    for k in g.in_links[j]])+s*inner_product/n+(1-s)/n
  # We rescale the return vector, so it remains a
  # probability distribution even with floating point
  # roundoff.
  return v/numpy.sum(v)  

def pagerank(g,s=0.85,tolerance=0.00001):
  '''Returns the PageRank vector for the web g and
  parameter s, where the criterion for convergence is that
  we stop when M^(j+1)P-M^jP has length less than
  tolerance, in l1 norm.'''
  n = g.size
  p = numpy.matrix(numpy.ones((n,1)))/n
  iteration = 1
  change = 2
  while change > tolerance:
    print "Iteration: %s" % iteration
    new_p = step(g,p,s)
    change = numpy.sum(numpy.abs(p-new_p))
    print "Change in l1 norm: %s" % change
    p = new_p
    iteration += 1
  return p

print '''Now computing the PageRank vector for a random
web containing 10000 pages, with the number of inbound
links to each page controlled by a Pareto power law
distribution with parameter 2.0, and with the criterion
for convergence being a change of less than 0.0001 in l1
norm over a matrix multiplication step.'''

g = random_web(10000,2.0) # works up to several million
                          # pages.
pr = pagerank(g,0.85,0.0001)

As written, the code computes PageRank for a 10,000 page web, and I recommend you start with that number of pages, to check that the code runs. Once that’s done, you can change the parameter in the second last line to change the number of pages. If you do that, you may wish to change the tolerance 0.0001 in the final line, decreasing it to ensure a tolerance more appropriate to the number of pages, say $10^{-7}$ .

What determines the tolerance that should be used? I have used the l_1 norm to quantify the rate of convergence in the code, but that is a rather arbitrary measure. In particular, an alternate measure might be to look at the magnitude of the differences between individual elements, computing $\max_j |(Mv-v)_j|$ , and requiring that it be substantially smaller than the minimal possible value for PageRank, t/n . This is less conservative, and arguably a more relevant measure; the code is easily changed.

Exercises

The code above is complex enough that bugs are likely. As a test, use an alternate method to compute PageRank for small web sizes, and compare your results against the output from the code above. Do you believe the code is correct?
Modify the program described in this lecture so that it works with a non-uniform teleportation vector, as described in an earlier post.
Python is a great language, but it’s not designed with speed in mind. Write a program in C to compute the PageRank for a random web. Can you compute the PageRank for ten million pages? What’s the largest number you’re able to compute on a single machine?

About this post: This is one in a series of posts about the Google Technology Stack – PageRank, MapReduce, and so on. The posts are summarized here, and there is FriendFeed room for the series here. You can subscribe to my blog here.