The PageRank Algorithm

The links among web pages can be viewed as a directed graph. A page's importance is decided by the votes of the pages that link to it: a page with many in-links earns a high rank, while a page with few or no in-links earns a low one. The higher a page's PR value, the more important the page.

Suppose we have a set of four pages A, B, C, and D. If B, C, and D all link to A, then A's PR value is the sum of the PR values of B, C, and D:

PR(A) = PR(B) + PR(C) + PR(D)

Continuing the example, suppose that besides A, B also links to C and D, that besides A, C also links to B, and that D links only to A. When computing A's PR value, B can cast only 1/3 of its PR and C only 1/2, while D, linking only to A, casts its full value. A's PR value is therefore:

PR(A) = PR(B)/3 + PR(C)/2 + PR(D)

This generalizes to the following formula for the PR value of a page u:

PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}

where B_u is the set of pages that link to page u, v is a page belonging to B_u, and L(v) is the number of out-links of v (its out-degree).

 

Figure 1-1

 

Table 1-2: PR values computed from Figure 1-1

 

             PR(A)     PR(B)     PR(C)     PR(D)
Initial      0.25      0.25      0.25      0.25
Iteration 1  0.125     0.333     0.083     0.458
Iteration 2  0.1665    0.4997    0.0417    0.2912
…            …         …         …         …
Iteration n  0.1999    0.3999    0.0666    0.3333

 

As Table 1-2 shows, the PR values converge to stable values after a few iterations.
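This iteration is easy to reproduce outside Hadoop. Below is a minimal single-machine sketch (the class name SimplePageRank is ours, for illustration); the link structure A→B,C,D; B→A,D; C→D; D→B is inferred from the table values and matches the dataset used later in Listing 1-7:

public class SimplePageRank {

	public static void main(String[] args) {
		// Link structure inferred from Table 1-2: A -> B,C,D; B -> A,D; C -> D; D -> B
		double[] pr = { 0.25, 0.25, 0.25, 0.25 }; // PR(A), PR(B), PR(C), PR(D)
		for (int iter = 1; iter <= 20; iter++) {
			double a = pr[1] / 2;                     // A is linked only from B (out-degree 2)
			double b = pr[0] / 3 + pr[3];             // B is linked from A (out-degree 3) and D (out-degree 1)
			double c = pr[0] / 3;                     // C is linked only from A
			double d = pr[0] / 3 + pr[1] / 2 + pr[2]; // D is linked from A, B and C
			pr = new double[] { a, b, c, d };
			System.out.printf("iteration %2d: %.4f  %.4f  %.4f  %.4f%n", iter, pr[0], pr[1], pr[2], pr[3]);
		}
	}

}

The printed values match Table 1-2 row by row and settle at 0.2, 0.4, 0.0667 and 0.3333.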

In a real web environment, however, hyperlinks are not so well behaved; the basic PageRank model runs into the following two problems:

1. Rank leak

As shown in Figure 1-3, if a page has no out-links, like node A, rank leaks out of the graph: after repeated iterations, the PR values of all pages tend to 0.

Figure 1-3

 

 

             PR(A)     PR(B)     PR(C)     PR(D)
Initial      0.25      0.25      0.25      0.25
Iteration 1  0.125     0.125     0.25      0.25
Iteration 2  0.125     0.125     0.125     0.25
Iteration 3  0.125     0.125     0.125     0.125
…            …         …         …         …
Iteration n  0         0         0         0

 

Table 1-4: iteration results for the graph in Figure 1-3

2. Rank sink

As shown in Figure 1-5, if a page has no in-links, like node A, its PR value tends to 0 after repeated iterations.

  

Figure 1-5

 

 

             PR(A)     PR(B)     PR(C)     PR(D)
Initial      0.25      0.25      0.25      0.25
Iteration 1  0         0.375     0.25      0.375
Iteration 2  0         0.375     0.375     0.25
Iteration 3  0         0.25      0.375     0.375
…            …         …         …         …
Iteration n  0         …         …         …

Table 1-5: iteration results for the graph in Figure 1-5

 

 

Now assume a web surfer starts from a random page and keeps clicking the links on the current page, until either reaching a page with no out-links or getting bored, at which point the surfer jumps to another random page and starts a new round of browsing. This model is clearly closer to real user behavior. To handle pages with no out-links, a damping factor d is introduced, representing the probability that a user who has reached a page keeps clicking forward; it is usually set to 0.85. Accordingly, 1 − d is the probability that the user stops clicking and jumps to a random page to start browsing anew. With this random-surfer model, the PageRank formula becomes:

PR(u) = (1 - d) + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}

(This is the form used by the code below; some formulations divide the (1 - d) term by the total number of pages N.)
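For example, with d = 0.85, a page whose only in-link comes from a page v with PR(v) = 1.0 and L(v) = 2 out-links receives PR = (1 − 0.85) + 0.85 × (1.0 / 2) = 0.575; this exact value reappears in the first-iteration trace after Listing 1-9.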

 

Figure 1-6

Listing 1-7 builds the dataset corresponding to Figure 1-6:

root@lejian:/data# cat links 
A       B,C,D
B       A,D
C       D
D       B
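Before moving to MapReduce, the damped iteration can be checked on a single machine. The following minimal sketch (the class name SimpleDampedPageRank is ours, for illustration only) iterates the formula over the Listing 1-7 graph; after ten iterations it reproduces the values of Listing 1-12:

public class SimpleDampedPageRank {

	private static final double DAMPING = 0.85;

	public static void main(String[] args) {
		// Graph from Listing 1-7: A -> B,C,D; B -> A,D; C -> D; D -> B
		// GraphBuilder assigns every page an initial PR value of 1.0
		double[] pr = { 1.0, 1.0, 1.0, 1.0 }; // PR(A), PR(B), PR(C), PR(D)
		for (int i = 0; i < 10; i++) {
			double a = (1 - DAMPING) + DAMPING * (pr[1] / 2);
			double b = (1 - DAMPING) + DAMPING * (pr[0] / 3 + pr[3]);
			double c = (1 - DAMPING) + DAMPING * (pr[0] / 3);
			double d = (1 - DAMPING) + DAMPING * (pr[0] / 3 + pr[1] / 2 + pr[2]);
			pr = new double[] { a, b, c, d };
		}
		// Prints A=0.7840 B=1.5150 C=0.3760 D=1.3250, matching Listing 1-12
		System.out.printf("A=%.4f B=%.4f C=%.4f D=%.4f%n", pr[0], pr[1], pr[2], pr[3]);
	}

}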

 

The GraphBuilder step builds the link graph between pages and assigns an initial PR value to every page. Since our dataset already encodes the link graph, the code in Listing 1-8 only needs to assign each page the same initial PR value.

Listing 1-8

package com.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GraphBuilder {

	public static class GraphBuilderMapper extends Mapper<LongWritable, Text, Text, Text> {

		private Text url = new Text();
		private Text linkUrl = new Text();

		protected void map(LongWritable key, Text value, Context context) throws java.io.IOException, InterruptedException {
			String[] tuple = value.toString().split("\t");
			url.set(tuple[0]);
			if (tuple.length == 2) {
				// the page has out-links
				linkUrl.set(tuple[1] + "\t1.0");
			} else {
				// the page has no out-links
				linkUrl.set("\t1.0");
			}
			context.write(url, linkUrl);
		}
	}

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		job.setJobName("Graph Builder");
		job.setJarByClass(GraphBuilder.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		job.setMapperClass(GraphBuilderMapper.class);
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		job.waitForCompletion(true);
	}

}
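For the dataset in Listing 1-7, this job simply appends the initial PR value of 1.0 to each adjacency list, so its output (collected from the part files) should look like this:

A	B,C,D	1.0
B	A,D	1.0
C	D	1.0
D	B	1.0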

 

The PageRankIter step iterates the PageRank computation until a termination condition is satisfied, such as convergence or a preset number of iterations. Here we preset the iteration count and run this step that many times.

Listing 1-9

package com.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRankIter {

	private static final double DAMPING = 0.85;

	public static class PRIterMapper extends Mapper<LongWritable, Text, Text, Text> {

		private Text prKey = new Text();
		private Text prValue = new Text();

		protected void map(LongWritable key, Text value, Context context) throws java.io.IOException, InterruptedException {
			String[] tuple = value.toString().split("\t");
			if (tuple.length <= 2) {
				return;
			}
			String[] linkPages = tuple[1].split(",");
			double pr = Double.parseDouble(tuple[2]);
			// Distribute this page's PR value evenly across its out-links
			for (String page : linkPages) {
				if (page.isEmpty()) {
					continue;
				}
				prKey.set(page);
				prValue.set(tuple[0] + "\t" + pr / linkPages.length);
				context.write(prKey, prValue);
			}
			// Emit a marker record (prefixed with "|") carrying the adjacency
			// list, so the reducer can rebuild the graph for the next iteration
			prKey.set(tuple[0]);
			prValue.set("|" + tuple[1]);
			context.write(prKey, prValue);
		}
	}

	public static class PRIterReducer extends Reducer<Text, Text, Text, Text> {

		private Text prValue = new Text();

		protected void reduce(Text key, Iterable<Text> values, Context context) throws java.io.IOException, InterruptedException {
			String links = "";
			double pageRank = 0;
			for (Text val : values) {
				String tmp = val.toString();
				// A value starting with "|" is the marker record carrying the link list
				if (tmp.startsWith("|")) {
					links = tmp.substring(tmp.indexOf("|") + 1);
					continue;
				}
				// Otherwise the value is a "sourcePage \t contribution" pair
				String[] tuple = tmp.split("\t");
				if (tuple.length > 1) {
					pageRank += Double.parseDouble(tuple[1]);
				}
			}
			// Random-surfer formula: PR = (1 - d) + d * sum of contributions
			pageRank = (1 - DAMPING) + DAMPING * pageRank;
			prValue.set(links + "\t" + pageRank);
			context.write(key, prValue);
		}
	}

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		job.setJobName("PageRankIter");
		job.setJarByClass(PageRankIter.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		job.setMapperClass(PRIterMapper.class);
		job.setReducerClass(PRIterReducer.class);
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		job.waitForCompletion(true);
	}

}
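To make the mapper/reducer contract concrete, here is a trace of the first iteration for page A over the GraphBuilder output above (contributions truncated to four decimals):

map("A\tB,C,D\t1.0")  emits  (B, "A\t0.3333"), (C, "A\t0.3333"), (D, "A\t0.3333"), (A, "|B,C,D")
map("B\tA,D\t1.0")    emits  (A, "B\t0.5"), (D, "B\t0.5"), (B, "|A,D")
reduce(A, { "B\t0.5", "|B,C,D" })  writes  "A\tB,C,D\t0.575"    since (1 - 0.85) + 0.85 * 0.5 = 0.575

Only B links to A, so A's new PR value is (1 − d) + d × PR(B)/2 = 0.575, and the marker record restores A's adjacency list for the next iteration.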

 

 

The PageRankViewer step outputs the final ranking in descending order of PageRank value, as <PageRank, URL> key-value pairs. No custom reducer is needed: the mapper emits the PR value as the key, and the descending comparator installed via setSortComparatorClass orders the keys during the shuffle.

Listing 1-10

package com.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRankViewer {

	public static class PageRankViewerMapper extends Mapper<LongWritable, Text, FloatWritable, Text> {

		private Text outPage = new Text();
		private FloatWritable outPr = new FloatWritable();

		protected void map(LongWritable key, Text value, Context context) throws java.io.IOException, InterruptedException {
			String[] line = value.toString().split("\t");
			outPage.set(line[0]);
			outPr.set(Float.parseFloat(line[2]));
			context.write(outPr, outPage);
		}

	}

	public static class DescFloatComparator extends FloatWritable.Comparator {

		// Negate both comparison paths so that keys sort in descending order;
		// the byte-level overload is the one the shuffle actually uses
		@Override
		@SuppressWarnings("rawtypes")
		public int compare(WritableComparable a, WritableComparable b) {
			return -super.compare(a, b);
		}

		@Override
		public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
			return -super.compare(b1, s1, l1, b2, s2, l2);
		}
	}

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		job.setJobName("PageRankViewer");
		job.setJarByClass(PageRankViewer.class);
		job.setOutputKeyClass(FloatWritable.class);
		job.setSortComparatorClass(DescFloatComparator.class);
		job.setOutputValueClass(Text.class);
		job.setMapperClass(PageRankViewerMapper.class);
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		job.waitForCompletion(true);
	}

}

 

 

Finally, PageRankDriver (Listing 1-11) chains the three steps together: it builds the initial graph, runs PageRankIter the requested number of times over a temporary working directory, and formats the result with PageRankViewer.

Listing 1-11

package com.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PageRankDriver {

	public static void main(String[] args) throws Exception {
		if (args == null || args.length != 3) {
			throw new RuntimeException("请输入输入路径、输出路径和迭代次数");
		}
		int times = Integer.parseInt(args[2]);
		if (times <= 0) {
			throw new RuntimeException("迭代次数必须大于零");
		}
		// Use a random temporary directory for the intermediate iteration results
		String tmpDir = "/" + java.util.UUID.randomUUID().toString();
		String[] forGB = { args[0], tmpDir + "/Data0" };
		GraphBuilder.main(forGB);
		String[] forItr = { "", "" };
		for (int i = 0; i < times; i++) {
			forItr[0] = tmpDir + "/Data" + i;
			forItr[1] = tmpDir + "/Data" + (i + 1);
			PageRankIter.main(forItr);
		}
		String[] forRV = { tmpDir + "/Data" + times, args[1] };
		PageRankViewer.main(forRV);
		// Delete the temporary directory when the JVM exits
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		Path path = new Path(tmpDir);
		fs.deleteOnExit(path);
	}

}

 

Running Listing 1-11 produces the result shown in Listing 1-12.

Listing 1-12

root@lejian:/data# hadoop jar pageRank.jar com.hadoop.mapreduce.PageRankDriver /data /output 10
……
root@lejian:/data# hadoop fs -ls -R /output
-rw-r--r--   1 root supergroup          0 2017-02-10 17:55 /output/_SUCCESS
-rw-r--r--   1 root supergroup         50 2017-02-10 17:55 /output/part-r-00000
root@lejian:/data# hadoop fs -cat /output/part-r-00000
1.5149547       B
1.3249696       D
0.78404236      A
0.37603337      C

 

Reprinted from: https://www.cnblogs.com/baoliyan/p/6380676.html
