背景
测试了一个case,用GraphX 1.6跑标准的LPA算法,使用的是内置的LabelPropagation算法包。数据集是Google web graph,(忽略可能这个数据集不是很合适),资源情况是standalone模式,18个worker,每个worker起一个executor,50g内存,32核,数据加载成18个分区。
case里执行200轮迭代,代码:
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._
// load the graph
val google = GraphLoader.edgeListFile(sc, "/home/admin/benchmark/data/google/web-Google.txt", false, 18)
LabelPropagation.run(google, 200)
GraphX的执行方式
graphx的LPA是使用自己封装的Pregel跑的,先说优点,问题在后面暴露后分析:
1. 包掉了使用VertexRDD和EdgeRDD做BSP的过程,api简单,泛型清晰
2. 某轮迭代完成后,本轮没有msg流动的话,判定早停,任务结束
3. 迭代开始前,graph自动cache,结束后,某些中间结果rdd自动uncache
代码如下:
def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
(graph: Graph[VD, ED],
initialMsg: A,
maxIterations: Int = Int.MaxValue,
activeDirection: EdgeDirection = EdgeDirection.Either)
(vprog: (VertexId, VD, A) => VD,
sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
mergeMsg: (A, A) => A)
: Graph[VD, ED] =
{
var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
// compute the messages
var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
var activeMessages = messages.count()
// Loop
var prevG: Graph[VD, ED] = null
var i = 0
while (activeMessages > 0 && i < maxIterations) {
// Receive the messages and update the vertices.
prevG = g
g = g.joinVertices(messages)(vprog).cache()
val oldMessages = messages
// Send new messages, skipping edges where neither side received a message. We must cache
// messages so it can be materialized on the next line, allowing us to uncache the previous
// iteration.
messages = g.mapReduceTriplets(
sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache()
// The call to count() materializes `messages` and the vertices of `g`. This hides oldMessages
// (depended on by the vertices of g) and the vertices of prevG (depended on by oldMessages
// and the vertices of g).
activeMessages = messages.count()
logInfo("Pregel finished iteration " + i)
// Unpersist the RDDs hidden by newly-materialized RDDs
oldMessages.unpersist(blocking = false)
prevG.unpersistVertices(blocking = false)
prevG.edges.unpersist(blocking = false)
// count the iteration
i +=