Practice Demo: PageRank Web Page Ranking Based on MapReduce

Latest recommended article published on 2023-08-01 08:44:57

飘逸慕嫣然

Categories: Big Data, Cloud Computing, mapreduce, PageRank practice

Article link: https://blog.csdn.net/CaiXinQiWorld/article/details/73928137

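[Disclaimer: I am a beginner and these are entry-level posts. If an expert happens to pass by, I welcome your advice; if anything is wrong, please point it out. Any resemblance to other posts means I was lazy and reposted material. If this post gives you something useful, the blog has done its job.]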

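The previous post introduced MapReduce-based partitioning of large graphs; one of its most representative practical applications is PageRank web page ranking. Running on a partitioned graph, PageRank can process large-scale graph data efficiently on a distributed system. It was first used to rank web pages (which have a hyperlink structure), but because it relies only on the topology of the graph, it can rank the nodes of any graph, and it lays the groundwork for many important graph analysis algorithms. The rest of this post explains the principle of PageRank and walks through a MapReduce implementation written in Java for a Hadoop distributed environment.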

1 PageRank

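The PageRank algorithm computes a PageRank value for every web page and then ranks pages by importance according to that value. The idea is to simulate the behavior of a web surfer and estimate the probability that the surfer is on each page, which serves as the measure of that page's importance. Detailed introductions are easy to find online (for example, the Baidu Baike entry).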

2 The simplest model of PageRank

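The web pages on the Internet can be viewed as a directed graph: pages are nodes, and if page $A$ links to page $B$ there is a directed edge $A \to B$. A simple example:
[Figure: example web graph with four pages A, B, C and D]
This small example has four pages. A surfer currently on page A jumps to B, C or D with probability $\frac{1}{3}$ each, where 3 is the number of out-links of A. In general, a page with $k$ out-links sends the surfer along any one of them with probability $\frac{1}{k}$; likewise D leads to B and C with probability $\frac{1}{2}$ each, while the probability of going from B to C is 0. The jump probabilities are usually written as a transition matrix. With $n$ pages, the transition matrix $M$ is an $n \times n$ square matrix: if page $j$ has $k$ out-links, then $M[i][j]=\frac{1}{k}$ for every page $i$ that one of those links points to, and $M[i][j]=0$ for every other page $i$. The transition matrix for the example graph is: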

$M=\left[ \begin{array}{cccc} 0&\frac{1}{2}&1&0\\ \frac{1}{3}&0&0&\frac{1}{2}\\ \frac{1}{3}&0&0&\frac{1}{2}\\ \frac{1}{3}&\frac{1}{2}&0&0 \end{array} \right]$
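Suppose the surfer starts on every page with equal probability $\frac{1}{n}$, so the initial probability distribution is an $n$-dimensional column vector $V_0$ whose entries are all $\frac{1}{n}$. Multiplying the transition matrix $M$ on the right by $V_0$ gives the surfer's probability distribution after the first step, $MV_0$; the product of an $n \times n$ matrix and an $n \times 1$ vector is again $n \times 1$. The computation of $V_1$ is: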

$V_1=MV_0=\left[ \begin{array}{cccc} 0&\frac{1}{2}&1&0\\ \frac{1}{3}&0&0&\frac{1}{2}\\ \frac{1}{3}&0&0&\frac{1}{2}\\ \frac{1}{3}&\frac{1}{2}&0&0 \end{array} \right]\left[ \begin{array}{c} \frac{1}{4}\\ \frac{1}{4}\\ \frac{1}{4}\\ \frac{1}{4} \end{array} \right]=\left[ \begin{array}{c} \frac{9}{24}\\ \frac{5}{24}\\ \frac{5}{24}\\ \frac{5}{24} \end{array} \right]$
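Note that a nonzero $M[i][j]$ means there is a link from $j$ to $i$. Multiplying the first row of $M$ by $V_0$ therefore adds up the probability of arriving at page A from every page, which gives $\frac{9}{24}$. Once $V_1$ is known, multiplying $M$ on the right by $V_1$ gives $V_2$, and so on with $V_n=MV_{n-1}$ until $V$ converges. For the example graph, repeated iteration ends at $V=[3/9,2/9,2/9,2/9]^T$: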

$\left[ \begin{array}{c} 1/4\\1/4\\1/4\\1/4 \end{array} \right] \to \left[ \begin{array}{c} 9/24\\5/24\\5/24\\5/24 \end{array} \right] \to \left[ \begin{array}{c} 15/48\\11/48\\11/48\\11/48 \end{array} \right] \to \left[ \begin{array}{c} 11/32\\7/32\\7/32\\7/32 \end{array} \right] \to \dots \to \left[ \begin{array}{c} 3/9\\2/9\\2/9\\2/9 \end{array} \right]$
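
The iteration above can be reproduced with a few lines of plain Java. This is only a minimal sketch: the class and method names (`PowerIterationSketch`, `step`, `iterate`) are made up for illustration, the matrix is the one from the four-page example, and a fixed number of steps stands in for a real convergence test.

```java
import java.util.Arrays;

/** Minimal sketch: repeated multiplication V_n = M V_{n-1} on the four-page example. */
final class PowerIterationSketch {

    /** Transition matrix of the example; column j holds the out-link probabilities of page j. */
    static final double[][] M = {
        {0.0,     0.5, 1.0, 0.0},
        {1.0 / 3, 0.0, 0.0, 0.5},
        {1.0 / 3, 0.0, 0.0, 0.5},
        {1.0 / 3, 0.5, 0.0, 0.0},
    };

    /** One step: returns the matrix-vector product m * v. */
    static double[] step(double[][] m, double[] v) {
        int n = v.length;
        double[] next = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                next[i] += m[i][j] * v[j];
            }
        }
        return next;
    }

    /** Applies step() a fixed number of times, starting from the uniform vector. */
    static double[] iterate(double[][] m, int steps) {
        double[] v = new double[m.length];
        Arrays.fill(v, 1.0 / m.length);
        for (int s = 0; s < steps; s++) {
            v = step(m, v);
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(iterate(M, 1)));   // V_1
        System.out.println(Arrays.toString(iterate(M, 100))); // close to the limit vector
    }
}
```

The first call corresponds to $V_1$ (9/24 for page A, 5/24 for the others), and after many steps the entries approach 3/9 and 2/9.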

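3 The dead-end problem

The surfer's behavior described above is an instance of a Markov process. For it to converge, one condition must hold:
the graph is strongly connected, that is, every page can be reached from every other page.
The pages on the Internet are not strongly connected, because some pages link to no other page. With the computation above, a surfer who reaches such a page simply stops, the transition probability accumulated so far is wiped out, and in the end nearly every entry of the probability distribution vector is 0. Suppose we drop the link from C to A in the graph above, so that C becomes a dead end:
[Figure: the example graph with the link from C to A removed]
The corresponding transition matrix is: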

$M=\left[ \begin{array}{cccc} 0&\frac{1}{2}&0&0\\ \frac{1}{3}&0&0&\frac{1}{2}\\ \frac{1}{3}&0&0&\frac{1}{2}\\ \frac{1}{3}&\frac{1}{2}&0&0 \end{array} \right]$
Iterating repeatedly, all entries eventually become 0:

$\left[ \begin{array}{c} 1/4\\1/4\\1/4\\1/4 \end{array} \right] \to \left[ \begin{array}{c} 3/24\\5/24\\5/24\\5/24 \end{array} \right] \to \left[ \begin{array}{c} 5/48\\7/48\\7/48\\7/48 \end{array} \right] \to \left[ \begin{array}{c} 21/288\\31/288\\31/288\\31/288 \end{array} \right] \to \dots \to \left[ \begin{array}{c} 0\\0\\0\\0 \end{array} \right]$
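
One way to see why every entry shrinks is to track the total probability. Every column of $M$ sums to 1 except column C, which is all zeros, so each multiplication discards whatever probability currently sits on C. The check below uses the vectors listed above; $(V_k)_C$, the entry of $V_k$ for page C, is notation introduced only for this note.

```latex
\begin{aligned}
\sum_i (V_{k+1})_i &= \sum_j \Big(\sum_i M[i][j]\Big)(V_k)_j = \sum_i (V_k)_i - (V_k)_C \\
\sum_i (V_1)_i &= 1 - \tfrac{1}{4} = \tfrac{18}{24} \\
\sum_i (V_2)_i &= \tfrac{18}{24} - \tfrac{5}{24} = \tfrac{26}{48} \\
\sum_i (V_3)_i &= \tfrac{26}{48} - \tfrac{7}{48} = \tfrac{114}{288}
\end{aligned}
```

The totals agree with the sums of the entries of $V_1$, $V_2$ and $V_3$, and they keep decreasing for as long as any probability reaches C.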

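4 The spider-trap problem

Another problem is the trap: some pages have no links to other pages but do link to themselves, as in this graph:
[Figure: the example graph in which C links only to itself]
Once the surfer lands on C it is like falling into a trap: there is no way out of C. Eventually all of the probability moves onto C, every other page ends up with probability 0, and the ranking becomes meaningless. The transition matrix for this graph is: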

$M=\left[ \begin{array}{cccc} 0&\frac{1}{2}&0&0\\ \frac{1}{3}&0&0&\frac{1}{2}\\ \frac{1}{3}&0&1&\frac{1}{2}\\ \frac{1}{3}&\frac{1}{2}&0&0 \end{array} \right]$
Iterating repeatedly gives:

$\left[ \begin{array}{c} 1/4\\1/4\\1/4\\1/4 \end{array} \right] \to \left[ \begin{array}{c} 3/24\\5/24\\11/24\\5/24 \end{array} \right] \to \left[ \begin{array}{c} 5/48\\7/48\\29/48\\7/48 \end{array} \right] \to \left[ \begin{array}{c} 21/288\\31/288\\205/288\\31/288 \end{array} \right] \to \dots \to \left[ \begin{array}{c} 0\\0\\1\\0 \end{array} \right]$

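5 Solving the dead-end and spider-trap problems

The process above ignores the randomness of real browsing. A surfer picks pages at random, and on reaching a dead-end page or a trap page (C in the two examples) may type a random address into the browser's address bar. That address may happen to be the same page again, but it gives a chance of escaping the trap. The algorithm is adjusted to follow this behavior. Assume that at every step the surfer follows a link on the current page with probability $\alpha$ and jumps through the address bar with probability $(1-\alpha)$, landing on each of the $n$ pages with probability $\frac{1}{n}$. With $e$ denoting the $n$-dimensional column vector whose entries are all $\frac{1}{n}$, the iteration becomes: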

$V^{'}= \alpha MV+(1-\alpha)e$
Now compute the probability distribution for the graph with the trap, taking $\alpha=0.8$:

$V_1=\alpha MV_0+(1-\alpha)e=0.8\times\left[ \begin{array}{cccc} 0&\frac{1}{2}&0&0\\ \frac{1}{3}&0&0&\frac{1}{2}\\ \frac{1}{3}&0&1&\frac{1}{2}\\ \frac{1}{3}&\frac{1}{2}&0&0 \end{array} \right]\left[ \begin{array}{c} \frac{1}{4}\\ \frac{1}{4}\\ \frac{1}{4}\\ \frac{1}{4} \end{array} \right]+0.2\times\left[ \begin{array}{c} \frac{1}{4}\\ \frac{1}{4}\\ \frac{1}{4}\\ \frac{1}{4} \end{array} \right]=\left[ \begin{array}{c} \frac{9}{60}\\ \frac{13}{60}\\ \frac{25}{60}\\ \frac{13}{60} \end{array} \right]$
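Repeating the iteration gives:

$\left[ \begin{array}{c} 1/4\\1/4\\1/4\\1/4 \end{array} \right] \to \left[ \begin{array}{c} 9/60\\13/60\\25/60\\13/60 \end{array} \right] \to \left[ \begin{array}{c} 41/300\\53/300\\153/300\\53/300 \end{array} \right] \to \left[ \begin{array}{c} 543/4500\\707/4500\\2543/4500\\707/4500 \end{array} \right] \to \dots \to \left[ \begin{array}{c} 15/148\\19/148\\95/148\\19/148 \end{array} \right]$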
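
The damped update can be tried out directly on the trap graph. The following is a minimal sketch in plain Java under two assumptions: $e$ is the vector whose entries are all $\frac{1}{n}$ (as in the computation of $V_1$), and iteration stops when the summed absolute change falls below a tolerance. The names `DampedPageRankSketch`, `step` and `run` are hypothetical.

```java
import java.util.Arrays;

/** Minimal sketch of V' = alpha * M * V + (1 - alpha) * e on the spider-trap graph. */
final class DampedPageRankSketch {

    /** Transition matrix of the trap example: C links only to itself, so M[C][C] = 1. */
    static final double[][] TRAP = {
        {0.0,     0.5, 0.0, 0.0},
        {1.0 / 3, 0.0, 0.0, 0.5},
        {1.0 / 3, 0.0, 1.0, 0.5},
        {1.0 / 3, 0.5, 0.0, 0.0},
    };

    /** One damped step; e is taken to be the vector whose entries are all 1/n (assumption). */
    static double[] step(double[][] m, double[] v, double alpha) {
        int n = v.length;
        double[] next = new double[n];
        for (int i = 0; i < n; i++) {
            double linkMass = 0.0;
            for (int j = 0; j < n; j++) {
                linkMass += m[i][j] * v[j];
            }
            next[i] = alpha * linkMass + (1.0 - alpha) / n;
        }
        return next;
    }

    /** Iterates from the uniform vector until the L1 change drops below tol or maxSteps is reached. */
    static double[] run(double[][] m, double alpha, double tol, int maxSteps) {
        double[] v = new double[m.length];
        Arrays.fill(v, 1.0 / m.length);
        for (int s = 0; s < maxSteps; s++) {
            double[] next = step(m, v, alpha);
            double change = 0.0;
            for (int i = 0; i < v.length; i++) {
                change += Math.abs(next[i] - v[i]);
            }
            v = next;
            if (change < tol) {
                break;
            }
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(TRAP, 0.8, 1e-12, 1000)));
    }
}
```

With $\alpha=0.8$ a single call to `step` on the uniform vector reproduces $V_1$ (9/60, 13/60, 25/60, 13/60). Page C still ends up with the largest share, but the other pages keep a nonzero rank.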

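6 MapReduce-based PageRank implementation

The computation above multiplies by the matrix again and again until the probability distribution vector barely changes between iterations; it usually converges after 30 or so iterations. The transition matrix of the real web is enormous: with more than 10 billion pages, it is a 10 billion × 10 billion matrix, so the computation is handed to MapReduce's distributed processing.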
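
To picture how one iteration splits into MapReduce-style steps, here is an in-memory sketch in plain Java (Java 8 or later, no Hadoop API). It loosely follows the two-phase structure used by RunPageRankBasic in section 6.2: first PageRank mass is sent along out-links and summed per destination, then the mass lost at dead ends is spread evenly and the random jump is mixed in. The class name, the method `iterate` and the parameter `jump` (standing for $1-\alpha$) are assumptions made for illustration, and plain probabilities are used instead of log probabilities.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory sketch of one PageRank iteration organized as map, reduce, then a fix-up pass. */
final class MapReducePageRankSketch {

    /**
     * graph maps each page id to its out-links (every destination must also be a key);
     * rank maps each page id to its current PageRank;
     * jump is the probability of a random jump (1 - alpha in the notation of section 5).
     */
    static Map<Integer, Double> iterate(Map<Integer, int[]> graph,
                                        Map<Integer, Double> rank,
                                        double jump) {
        int n = graph.size();

        // "Map" and "reduce": every page sends an equal share of its rank along each
        // out-link, and the shares arriving at the same destination are summed.
        Map<Integer, Double> received = new HashMap<>();
        for (Map.Entry<Integer, int[]> page : graph.entrySet()) {
            int[] outLinks = page.getValue();
            for (int dest : outLinks) {
                double share = rank.get(page.getKey()) / outLinks.length;
                received.merge(dest, share, Double::sum);
            }
        }

        // Pages without out-links sent their mass nowhere; measure what is missing.
        double total = 0.0;
        for (double mass : received.values()) {
            total += mass;
        }
        double missing = 1.0 - total;

        // Second pass: spread the missing mass evenly and mix in the random jump.
        Map<Integer, Double> next = new HashMap<>();
        for (Integer id : graph.keySet()) {
            double p = received.getOrDefault(id, 0.0);
            next.put(id, jump / n + (1.0 - jump) * (p + missing / n));
        }
        return next;
    }

    public static void main(String[] args) {
        // Dead-end example of section 3 with A=0, B=1, C=2, D=3; C has no out-links.
        Map<Integer, int[]> graph = new HashMap<>();
        graph.put(0, new int[] {1, 2, 3});
        graph.put(1, new int[] {0, 3});
        graph.put(2, new int[] {});
        graph.put(3, new int[] {1, 2});

        Map<Integer, Double> rank = new HashMap<>();
        for (Integer id : graph.keySet()) {
            rank.put(id, 1.0 / graph.size());
        }
        for (int i = 0; i < 30; i++) {
            rank = iterate(graph, rank, 0.2);
        }
        System.out.println(rank);
    }
}
```

Because the missing mass is handed back in the second pass, the ranks keep summing to 1 even on the dead-end graph.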

6.1 The crawled graph data

Each page and the pages it links to are written as one line, so the web graph of section 4 is represented as follows:

$\begin{array}{c|llll} & 0 & 1 & 2 & 3 \\ \hline 0 & A & B & C & D \\ 1 & B & A & D & \\ 2 & C & C & & \\ 3 & D & B & C & \end{array}$
The raw graph data crawled from the web has the same structure:

$\begin{array}{c|lllllllllll} & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline 1 \\ 2 \\ 3 & 703 & 826 & 1097 & 1287 & 1591 & 1895 & 1896 & 1897 & 1898 & 1899 \\ 4 & 144 & 258 & 491 & 1021 & 1418 & 1669 & 1900 & 1901 & 1902 & 1903 \\ 5 & 121 & 127 & 128 & 179 & 247 & 249 & 264 & 353 & 424 & 426 \\ 6 \\ 7 & 145 & 176 & 177 & 353 & 753 & 754 & 762 & 2064 & 3002 \\ 8 & 520 & 665 & 852 & 1394 & 1786 & 1842 & 1904 & 1905 & 1906 & 1907 \\ 9 & 124 & 147 & 177 & 246 & 247 & 248 & 249 & 250 & 251 & 252 \\ \dots & \dots \end{array}$
Empty cells are null, meaning there is no address. The complete source data is p2p-Gnutella08-adj.txt (uploaded to a network drive: http://pan.baidu.com/s/1dFIF31Z).
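
As a small illustration of this line format, the sketch below splits one line into a node id followed by its out-links. It assumes the fields are separated by whitespace, which is what the mapper in section 6.2 does with `split("\\s+")`; the class name and the `parse` helper are hypothetical and do not use any Hadoop type.

```java
import java.util.Arrays;

/** Sketch: split one adjacency-list line into a node id and its out-links. */
final class AdjacencyLineSketch {

    /** Returns {nodeId, neighbor1, neighbor2, ...}; a node without out-links yields one element. */
    static int[] parse(String line) {
        String[] fields = line.trim().split("\\s+");
        int[] ids = new int[fields.length];
        for (int i = 0; i < fields.length; i++) {
            ids[i] = Integer.parseInt(fields[i]);
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] ids = parse("7 145 176 177");
        System.out.println("node " + ids[0] + " -> "
            + Arrays.toString(Arrays.copyOfRange(ids, 1, ids.length)));
    }
}
```

A line that holds only a node id, such as the rows for nodes 1, 2 and 6 above, parses to a single-element array, which matches the empty adjacency list case in the mapper.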

6.2 MapReduce code

The code is written in Java and has four core parts:
(1) BuildPageRankRecords.java converts the graph data in the text file into Hadoop-writable records

public class BuildPageRankRecords extends Configured implements Tool {
  private static final Logger LOG = Logger.getLogger(BuildPageRankRecords.class);
  private static final String NODE_CNT_FIELD = "node.cnt";
  private static class MyMapper extends Mapper<LongWritable, Text, IntWritable, PageRankNode> {
    private static final IntWritable nid = new IntWritable();
    private static final PageRankNode node = new PageRankNode();

    public void setup(Mapper<LongWritable, Text, IntWritable, PageRankNode>.Context context) {
      int n = context.getConfiguration().getInt(NODE_CNT_FIELD, 0);
      if (n == 0) {
        throw new RuntimeException(NODE_CNT_FIELD + " cannot be 0!");
      }
      node.setType(PageRankNode.Type.Complete);
      node.setPageRank((float) -StrictMath.log(n));
    }
    public void map(LongWritable key, Text t, Context context) throws IOException,
        InterruptedException {
      String[] arr = t.toString().trim().split("\\s+");
      nid.set(Integer.parseInt(arr[0]));
      if (arr.length == 1) {
        node.setNodeId(Integer.parseInt(arr[0]));
        node.setAdjacencyList(new ArrayListOfIntsWritable());
      } else {
        node.setNodeId(Integer.parseInt(arr[0]));
        int[] neighbors = new int[arr.length - 1];
        for (int i = 1; i < arr.length; i++) {
          neighbors[i - 1] = Integer.parseInt(arr[i]);
        }
        node.setAdjacencyList(new ArrayListOfIntsWritable(neighbors));
      }

      context.getCounter("graph", "numNodes").increment(1);
      context.getCounter("graph", "numEdges").increment(arr.length - 1);
      if (arr.length > 1) {
        context.getCounter("graph", "numActiveNodes").increment(1);
      }
      context.write(nid, node);
    }
  }

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new BuildPageRankRecords(), args);
  }
}

(2) PartitionGraph.java partitions the graph

public class PartitionGraph extends Configured implements Tool {
  private static final Logger LOG = Logger.getLogger(PartitionGraph.class);
  public static void main(String[] args) throws Exception {
    ToolRunner.run(new PartitionGraph(), args);
  }
  public PartitionGraph() {}
  private static final String INPUT = "input";
  private static final String OUTPUT = "output";
  private static final String NUM_NODES = "numNodes";
  private static final String NUM_PARTITIONS = "numPartitions";
  private static final String RANGE = "range";

    if (useRange) {
      job.setPartitionerClass(RangePartitioner.class);
    }
    FileSystem.get(conf).delete(new Path(outPath), true);
    job.waitForCompletion(true);
    return 0;
  }
}

(3) RunPageRankBasic.java

public class RunPageRankBasic extends Configured implements Tool {

  private static class MapClass extends
      Mapper<IntWritable, PageRankNode, IntWritable, PageRankNode> {
    private static final IntWritable neighbor = new IntWritable();
    private static final PageRankNode intermediateMass = new PageRankNode();
    private static final PageRankNode intermediateStructure = new PageRankNode();
    public void map(IntWritable nid, PageRankNode node, Context context)
        throws IOException, InterruptedException {

  // Mapper variant that uses in-mapper combining
  private static class MapWithInMapperCombiningClass extends
      Mapper<IntWritable, PageRankNode, IntWritable, PageRankNode> {
    // PageRank mass buffered per destination node (the node id is the key)
    private static final HMapIF map = new HMapIF();
    private static final PageRankNode intermediateStructure = new PageRankNode();
    public void setup(Context context) throws IOException {
      map.clear();
    }
    public void map(IntWritable nid, PageRankNode node, Context context)
        throws IOException, InterruptedException {
      context.write(nid, intermediateStructure);
      int massMessages = 0;
      int massMessagesSaved = 0;
      // Distribute PageRank mass to neighboring nodes along the outgoing edges
      if (node.getAdjacenyList().size() > 0) {
        // Each neighbor gets an equal share of PageRank mass.
        ArrayListOfIntsWritable list = node.getAdjacenyList();
        float mass = node.getPageRank() - (float) StrictMath.log(list.size());
        context.getCounter(PageRank.edges).increment(list.size());

  private static class CombineClass extends
      Reducer<IntWritable, PageRankNode, IntWritable, PageRankNode> {
    private static final PageRankNode intermediateMass = new PageRankNode();
    public void reduce(IntWritable nid, Iterable<PageRankNode> values, Context context)
        throws IOException, InterruptedException {
      int massMessages = 0;

  // Reduce phase
  private static class ReduceClass extends
      Reducer<IntWritable, PageRankNode, IntWritable, PageRankNode> {
    private float totalMass = Float.NEGATIVE_INFINITY;
    public void reduce(IntWritable nid, Iterable<PageRankNode> iterable, Context context)
        throws IOException, InterruptedException {
      Iterator<PageRankNode> values = iterable.iterator();
      // Update the node with the accumulated PageRank mass
      node.setPageRank(mass);
      context.getCounter(PageRank.massMessagesReceived).increment(massMessagesReceived);
    public void cleanup(Context context) throws IOException {
      Configuration conf = context.getConfiguration();
      String taskId = conf.get("mapred.task.id");
      String path = conf.get("PageRankMassPath");
      Preconditions.checkNotNull(taskId);
      Preconditions.checkNotNull(path);
    }
  }

  // Mapper phase: distributes the missing PageRank mass and applies the random jump factor
  private static class MapPageRankMassDistributionClass extends
      Mapper<IntWritable, PageRankNode, IntWritable, PageRankNode> {
    private float missingMass = 0.0f;
    private int nodeCnt = 0;
    public void setup(Context context) throws IOException {
      Configuration conf = context.getConfiguration();
      missingMass = conf.getFloat("MissingMass", 0.0f);
      nodeCnt = conf.getInt("NodeCount", 0);
    }
    public void map(IntWritable nid, PageRankNode node, Context context)
        throws IOException, InterruptedException {
      // All values are log probabilities. ALPHA weights the random jump here,
      // so it plays the role of (1 - α) in the formula of section 5.
      float p = node.getPageRank();
      float jump = (float) (Math.log(ALPHA) - Math.log(nodeCnt));
      float link = (float) Math.log(1.0f - ALPHA)
          + sumLogProbs(p, (float) (Math.log(missingMass) - Math.log(nodeCnt)));
      p = sumLogProbs(jump, link);
      node.setPageRank(p);
      context.write(nid, node);
    }
  }
  // The PageRank iterations
    for (int i = s; i < e; i++) {
      iteratePageRank(i, i + 1, basePath, n, useCombiner, useInmapCombiner);
    }
    return 0;
  }
  // Runs one iteration
  private void iteratePageRank(int i, int j, String basePath, int numNodes,
      boolean useCombiner, boolean useInMapperCombiner) throws Exception {
    // Each iteration consists of two phases (two MapReduce jobs)
    // Job 1: distribute PageRank mass along the outgoing edges
    float mass = phase1(i, j, basePath, numNodes, useCombiner, useInMapperCombiner);
    float missing = 1.0f - (float) StrictMath.exp(mass);
    // Job 2: distribute the missing mass and take care of the random jump factor
    phase2(i, j, missing, basePath, numNodes);
  }
  private void phase2(int i, int j, float missing, String basePath, int numNodes) throws Exception {
    Job job = Job.getInstance(getConf());
    job.setJobName("PageRank:Basic:iteration" + j + ":Phase2");
    job.setJarByClass(RunPageRankBasic.class);
    System.out.println("Job Finished in " + (System.currentTimeMillis() - startTime) / 1000.0 + " seconds");
  }
  // Adds two log probabilities: returns log(exp(a) + exp(b))
  private static float sumLogProbs(float a, float b) {
    if (a == Float.NEGATIVE_INFINITY)  return b;
    if (b == Float.NEGATIVE_INFINITY)  return a;
    if (a < b) {  return (float) (b + StrictMath.log1p(StrictMath.exp(a - b)));  }
    return (float) (a + StrictMath.log1p(StrictMath.exp(b - a)));
  }
}
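
The excerpt stores PageRank values as natural logarithms, which is why the initial value is set to `-log(n)` and why values are combined with `sumLogProbs` rather than `+`. The standalone sketch below shows the same idea with `double` values; the class name `LogSumSketch` is made up, and the method is a rewrite for illustration rather than the code used in the job.

```java
/** Sketch: adding two probabilities that are stored as natural logarithms. */
final class LogSumSketch {

    /** Returns log(exp(a) + exp(b)), factoring out the larger term to avoid underflow. */
    static double sumLogProbs(double a, double b) {
        if (a == Double.NEGATIVE_INFINITY) return b; // log(0) + anything
        if (b == Double.NEGATIVE_INFINITY) return a;
        double hi = Math.max(a, b);
        double lo = Math.min(a, b);
        return hi + Math.log1p(Math.exp(lo - hi));
    }

    public static void main(String[] args) {
        double logSum = sumLogProbs(Math.log(0.25), Math.log(0.5));
        System.out.println(Math.exp(logSum)); // approximately 0.75
    }
}
```

Working in log space keeps very small probabilities representable: two inputs of -1000 still give a finite result, whereas `Math.exp(-1000)` alone underflows to 0.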

(4) FindMaxPageRankNodes.java sorts the nodes to produce the final ranking by weight

public class FindMaxPageRankNodes extends Configured implements Tool {
  private static final Logger LOG = Logger.getLogger(FindMaxPageRankNodes.class);
  private static class MyMapper extends
      Mapper<IntWritable, PageRankNode, IntWritable, FloatWritable> {
    private TopScoredObjects<Integer> queue;
    public void setup(Context context) throws IOException {
      int k = context.getConfiguration().getInt("n", 100);
      queue = new TopScoredObjects<>(k);
    }
    public void map(IntWritable nid, PageRankNode node, Context context) throws IOException,
        InterruptedException {
      queue.add(node.getNodeId(), node.getPageRank());
    }
  }
  private static class MyReducer extends
      Reducer<IntWritable, FloatWritable, IntWritable, Text> {
    private static TopScoredObjects<Integer> queue;
    public void setup(Context context) throws IOException {
      int k = context.getConfiguration().getInt("n", 100);
      queue = new TopScoredObjects<Integer>(k);
    }
    public void reduce(IntWritable nid, Iterable<FloatWritable> iterable, Context context)
        throws IOException {
      Iterator<FloatWritable> iter = iterable.iterator();
      queue.add(nid.get(), iter.next().get());
      if (iter.hasNext()) {
        throw new RuntimeException();
      }
    }
  }
  public FindMaxPageRankNodes() {
  }
  private static final String INPUT = "input";
  private static final String OUTPUT = "output";
  private static final String TOP = "top";
  @SuppressWarnings({ "static-access" })
  public int run(String[] args) throws Exception {
    Options options = new Options();
    if (!cmdline.hasOption(INPUT) || !cmdline.hasOption(OUTPUT) || !cmdline.hasOption(TOP)) {
      System.out.println("args: " + Arrays.toString(args));
      HelpFormatter formatter = new HelpFormatter();
      formatter.setWidth(120);
      formatter.printHelp(this.getClass().getName(), options);
      ToolRunner.printGenericCommandUsage(System.out);
      return -1;
    }
  }
  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new FindMaxPageRankNodes(), args);
    System.exit(res);
  }
}
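
The excerpt relies on a `TopScoredObjects` helper that is not shown in this post. To illustrate the general idea of keeping only the k best nodes, the sketch below uses `java.util.PriorityQueue` from the standard library instead; it is not that helper's implementation, and the names `TopKSketch` and `topK` are hypothetical (Java 8 or later is assumed).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Sketch: keep the k highest-scoring node ids with a bounded min-heap. */
final class TopKSketch {

    /** Returns the ids of the k highest scores, best first. */
    static List<Integer> topK(Map<Integer, Double> scores, int k) {
        // Min-heap ordered by score: the weakest of the current top k sits at the head.
        PriorityQueue<Map.Entry<Integer, Double>> heap =
            new PriorityQueue<>((a, b) -> Double.compare(a.getValue(), b.getValue()));
        for (Map.Entry<Integer, Double> entry : scores.entrySet()) {
            heap.add(entry);
            if (heap.size() > k) {
                heap.poll(); // drop the current minimum
            }
        }
        // The heap drains in ascending order, so insert at the front to get descending order.
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(0, heap.poll().getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Double> scores = new HashMap<>();
        scores.put(1, 0.1);
        scores.put(2, 0.5);
        scores.put(3, 0.3);
        scores.put(4, 0.2);
        System.out.println(topK(scores, 2)); // [2, 3]
    }
}
```

A bounded heap keeps memory proportional to k rather than to the number of nodes, which is the reason each mapper and the reducer can hold their own small queue.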

7 Runtime environment

VMware Workstation 12.0 (64-bit)
CentOS 6.4 (64-bit)
JDK 1.7.0 (64-bit)
Hadoop 2.6.0 (64-bit, pseudo-distributed configuration)
Eclipse 3.8 (64-bit)

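8 PageRank results

The four main Java class files used by the algorithm were created, compiled and run on Hadoop. The result of running the PageRank MapReduce code in Eclipse (the top 20 are shown here) is captured in the following screenshot of the Eclipse window:
[Figure: Eclipse screenshot of the run output listing the top 20 nodes]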
