The code below is all from Spark's PageRank example; only some of the comments were added by me.
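For context, this excerpt assumes a SparkContext named ctx and an iteration count iters are already defined, as in Spark's SparkPageRank example. A minimal sketch of that setup (the exact names are assumptions on my part, not part of the excerpt):

import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("PageRank")
val iters = if (args.length > 1) args(1).toInt else 10 //-number of PageRank iterations
val ctx = new SparkContext(sparkConf)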
val lines = ctx.textFile(args(0), 1)
//-1 generate links as <src, target> pairs
var links = lines.map{ s =>
  val parts = s.split("\\s+")
  (parts(0), parts(1)) //-pair of <src, target>
}.distinct() //-needless if the input is already deduplicated
 .groupByKey().cache() //-raw: use groupByKey to build the table we join against, simulating real table data; links is iterated over many times, so cache it to improve performance #B
//-leib: if the partitionBy line below is enabled, the groupByKey line above must stay enabled as well, otherwise reduceByKey() fails, because the flatMap() in the for() would produce (Char, Double) pairs (without groupByKey, urls is a plain String, so urls.map iterates over its characters)
// .partitionBy(new org.apache.spark.HashPartitioner(2)).cache()
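//-added example (mine, not from the original): for input lines "a b", "a c", "b a", the grouped
//-links RDD holds ("a", Iterable("b", "c")) and ("b", Iterable("a")), i.e. each source url mapped
//-to all of its outgoing target urls; this is the shape the join in step 3.1 expects.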
//-2 generate ranks with a default value, i.e. <source-url, default-rank>
//-use val for links above if the #A block stays commented out (#A reassigns links, which is why it is declared var)
var ranks = links.mapValues(v => 1.0) // ie <raw-links-key,1.0>
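//-added note: mapValues (unlike map) preserves the parent RDD's partitioner, so ranks inherits
//-whatever partitioning links has; that is what makes the "no shuffle" remark in step 3.1 hold
//-once the partitionBy line above is enabled.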
//-3
for (i <- 1 to iters) {
//-3.1 flip from the source urls to the target urls: inner join; since links is the full set of urls, the performance impact may be large
//- would swapping links and ranks improve performance? no, this is not a leftJoin but an inner join
// ? when links is very large it weighs heavily on the later, deeper iterations; we could first use contribs to compute a new, narrower links (mapValues()) before doing the next join
//-note: the links and ranks rdds share the same partitioner, so no shuffle is necessary for the join op
val contribs = links.join(ranks).values.flatMap{ case (urls, rank) => //-why use a 'case' clause? pattern matching on the tuple requires it
  val size = urls.size //-number of target (outgoing) urls
  urls.map(url => (url, rank / size)) //-each target url gets an equal share of this source's rank
}
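//-added worked example: with links ("a", Iterable("b", "c")) and ranks ("a", 1.0), the join yields
//-("a", (Iterable("b", "c"), 1.0)) and the flatMap emits ("b", 0.5) and ("c", 0.5); each target url
//-receives an equal share of its source's current rank.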
//-3.2 merge the contributed ranks per target url; note: this ranks set keeps narrowing (drifting further from the starting urls), so there is less and less data to compute, see #A
//-why not restore ranks to the original number of nodes? if we restored it, the statistics would just revert to the second round's data
ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _) //-weighted sum; retains the same partitioner as the join
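//-added note: this is the damped PageRank update, newRank = 0.15 + 0.85 * sum(contribs); e.g. if the
//-contributions arriving at a url sum to 0.5, its new rank is 0.15 + 0.85 * 0.5 = 0.575.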
//-#A: if the data volume really is huge, we could use the approach below to narrow and cache the result roughly every few rounds; runs of 10+ iterations should then be much faster? cmp #B
// val oldlinks = links
// links = links.join(ranks).map{ case (k,(urls, rank)) => (k,urls)} //-added by leib
// oldlinks.unpersist(false)
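//-a rough sketch of the "cache the result every few rounds" idea from #A (my addition; the interval
//-of 3 and the use of checkpoint are assumptions): truncating the lineage periodically keeps later
//-iterations from re-evaluating the whole chain; requires ctx.setCheckpointDir(...) beforehand.
// if (i % 3 == 0) {
//   ranks.cache()
//   ranks.checkpoint() //-mark ranks for checkpointing; written on the next action
//   ranks.count()      //-force an action so the checkpoint materializes now
// }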
println("step------------------------------"+i+"---------------------------------")
ranks.foreach(s => println("-result:" + s._1 + " - " + s._2))
}
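A caveat on the output above: ranks.foreach(println) runs on the executors, so those lines are only visible when running in local mode. Spark's own example instead collects the final ranks back to the driver after the loop, roughly like this:

val output = ranks.collect()
output.foreach(tup => println(tup._1 + " has rank: " + tup._2 + "."))
ctx.stop()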