1. Read the arguments and build the SparkContext

import org.apache.spark.{SparkConf, SparkContext}

if (args.length < 1) {
  System.err.println("Usage: SparkPageRank <file> <iter>")
  System.exit(1)
}
val sparkConf = new SparkConf().setAppName("PageRank")
// Fall back to 10 iterations when the second argument is omitted
val iters = if (args.length > 1) args(1).toInt else 10
val ctx = new SparkContext(sparkConf)
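As a rough sketch of how this might be launched (the jar name and input path are hypothetical; the class name assumes the object is called SparkPageRank, as in the usage string):

spark-submit --class SparkPageRank pagerank.jar links.txt 10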
2. Parse the log: each line becomes a (srcUrl, neighborUrl) pair, duplicate pairs are dropped, and the pairs are grouped by source URL

val lines = ctx.textFile(args(0), 1)
val links = lines.map { s =>
  // Each input line is "srcUrl neighborUrl", separated by whitespace
  val parts = s.split("\\s+")
  (parts(0), parts(1))
}.distinct().groupByKey().cache()
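To make the shape of links concrete, suppose the input file held these four hypothetical lines (the URLs are invented for illustration):

url_1 url_2
url_1 url_3
url_2 url_3
url_3 url_1

After map, distinct and groupByKey, links holds one entry per source URL:

(url_1, [url_2, url_3])
(url_2, [url_3])
(url_3, [url_1])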
3. Initialize ranks: every URL starts with a rank of 1.0
var ranks = links.mapValues(v => 1.0)
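Continuing the hypothetical three-URL graph above, ranks starts out as (url_1, 1.0), (url_2, 1.0), (url_3, 1.0). Because mapValues preserves the partitioning of links, the join in the next step does not have to reshuffle the cached links RDD.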
4. Iterate iters times. Each iteration joins links (urlKey, neighborUrls) with ranks (urlKey, rank); for every neighborUrl it emits a contribution (neighborUrl, rank / size); reduceByKey then sums the contributions per URL, and each sum is damped with 0.15 + 0.85 * _. A worked example follows the loop below.
for (i <- 1 to iters) {
  // Each URL splits its current rank evenly across its out-links
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    val size = urls.size
    urls.map(url => (url, rank / size))
  }
  // Sum the incoming contributions per URL and apply the damping factor
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
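Tracing one iteration on the hypothetical three-URL graph, starting from all ranks at 1.0: url_1 has two out-links and contributes 0.5 to each of url_2 and url_3; url_2 contributes 1.0 to url_3; url_3 contributes 1.0 to url_1. reduceByKey yields url_1: 1.0, url_2: 0.5, url_3: 1.5, and damping gives the new ranks url_1: 0.15 + 0.85 * 1.0 = 1.0, url_2: 0.15 + 0.85 * 0.5 = 0.575, url_3: 0.15 + 0.85 * 1.5 = 1.425. In effect each pass computes rank(u) = 0.15 + 0.85 * Σ rank(v) / outDegree(v) over all pages v linking to u.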
5. Print the final ranks

val output = ranks.collect()
output.foreach(tup => println(s"${tup._1} has rank: ${tup._2}."))
ctx.stop()
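collect() pulls every (url, rank) pair back to the driver, which is fine for a small demo but can exhaust driver memory on a large graph. As an alternative sketch (the output path is hypothetical), the ranks could be written out in parallel before ctx.stop():

ranks.saveAsTextFile("pagerank_output")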