I. Graph Structure
How is a graph defined?
Example A:
val userGraph: Graph[(String, String), String]
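userGraph is a graph variable (short for "a variable that defines a graph structure"), where (String, String) is the vertex attribute type and String is the edge attribute type. The Vertex, Edge, and Triplet structures are introduced below.
1) Structure of a vertex (Vertex)
Example B: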
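import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
// Assume the SparkContext has already been constructed
val sc: SparkContext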
// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
(5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
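Here, VertexRDD[VD] extends RDD[(VertexId, VD)], so a VertexRDD[VD] can be used wherever an RDD[(VertexId, VD)] is expected. VD is the vertex attribute type; in Example B it is (String, String). VertexId is the type of a vertex id; in the Spark GraphX source it is an alias for Long, i.e. type VertexId = Long. As you have probably worked out by now, a vertex carries two types: the vertex id type (hard-coded as Long in the source) and the vertex attribute type (user-defined, i.e. whatever information the vertex carries).
2) Structure of an edge (Edge)
Example C: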
// Assume the SparkContext has already been constructed
val sc: SparkContext
// Create an RDD for edges
val relationships: RDD[Edge[String]] =
sc.parallelize(Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
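Here, EdgeRDD[ED] extends RDD[Edge[ED]]. ED is the edge attribute type; in Example C it is String. In Edge(srcId, dstId, edgeAttr), srcId is the source vertex id, dstId is the destination vertex id, and edgeAttr is the edge attribute value. So an edge involves two types and three fields: the source vertex id (same type as the destination vertex id, also hard-coded as Long in the source), the destination vertex id, and the attribute of the edge from source to destination (user-defined, i.e. whatever information the edge carries).
3) A structure that holds both vertex attributes and edge attributes
Example D: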
val graph: Graph[(String, String), String] // Constructed from above
// Use the triplets view to create an RDD of facts.
val facts: RDD[String] =
graph.triplets.map(triplet =>
triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1)
facts.collect.foreach(println(_))
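// For this graph, VD = (String, String) and ED = String
val triplets: RDD[EdgeTriplet[(String, String), String]] = graph.triplets
In general the type is RDD[EdgeTriplet[VD, ED]], where VD is the vertex attribute type and ED is the edge attribute type. As Example D shows, the difference from Edge[ED] is that an EdgeTriplet[VD, ED] holds not only the edge attribute but also the attributes of the vertices at both ends of the edge. The relevant source: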
class EdgeTriplet[VD, ED] extends Edge[ED] {
/**
* The source vertex attribute
*/
var srcAttr: VD = _ // nullValue[VD]
/**
* The destination vertex attribute
*/
var dstAttr: VD = _ // nullValue[VD]
……
}
case class Edge[@specialized(Char, Int, Boolean, Byte, Long, Float, Double) ED] (
var srcId: VertexId = 0,
var dstId: VertexId = 0,
var attr: ED = null.asInstanceOf[ED])
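To make the relationship between Edge and EdgeTriplet concrete, here is a minimal plain-Scala sketch (no Spark involved) that attaches the source and destination attributes to each edge by looking them up in a vertex map. SimpleEdge, SimpleTriplet, and toTriplets are hypothetical names made up for this illustration, not GraphX classes, and the real triplets view is computed in a distributed way rather than with a Map lookup.

```scala
object TripletSketch {
  type VertexId = Long

  // Hypothetical stand-ins for GraphX's Edge and EdgeTriplet.
  final case class SimpleEdge[ED](srcId: VertexId, dstId: VertexId, attr: ED)
  final case class SimpleTriplet[VD, ED](
      srcId: VertexId, srcAttr: VD, dstId: VertexId, dstAttr: VD, attr: ED)

  // Attach both endpoint attributes to every edge. Endpoints missing from
  // the vertex map get `default` (an assumption of this sketch).
  def toTriplets[VD, ED](
      vertices: Map[VertexId, VD],
      edges: Seq[SimpleEdge[ED]],
      default: VD): Seq[SimpleTriplet[VD, ED]] =
    edges.map { e =>
      SimpleTriplet(
        e.srcId, vertices.getOrElse(e.srcId, default),
        e.dstId, vertices.getOrElse(e.dstId, default),
        e.attr)
    }

  def main(args: Array[String]): Unit = {
    val users = Map(
      3L -> (("rxin", "student")), 7L -> (("jgonzal", "postdoc")),
      5L -> (("franklin", "prof")), 2L -> (("istoica", "prof")))
    val relationships = Seq(
      SimpleEdge(3L, 7L, "collab"), SimpleEdge(5L, 3L, "advisor"),
      SimpleEdge(2L, 5L, "colleague"), SimpleEdge(5L, 7L, "pi"))
    // Same sentence format as Example D.
    toTriplets(users, relationships, ("unknown", "unknown"))
      .map(t => t.srcAttr._1 + " is the " + t.attr + " of " + t.dstAttr._1)
      .foreach(println)
  }
}
```

An edge alone only knows the two ids; the triplet is what you get once each id has been resolved to its vertex attribute.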
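4) The figure below is copied from the official Spark GraphX site; a picture conveys the meaning more clearly than text.
II. Ways to Build a Graph
1) Build a graph from vertices and edges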
/**
* Construct a graph from a collection of vertices and
* edges with attributes. Duplicate vertices are picked arbitrarily and
* vertices found in the edge collection but not in the input
* vertices are assigned the default attribute.
*
* @tparam VD the vertex attribute type
* @tparam ED the edge attribute type
* @param vertices the "set" of vertices and their attributes
* @param edges the collection of edges in the graph
* @param defaultVertexAttr the default vertex attribute to use for vertices that are
* mentioned in edges but not in vertices
* @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
* @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
*/
def apply[VD: ClassTag, ED: ClassTag](
vertices: RDD[(VertexId, VD)],
edges: RDD[Edge[ED]],
defaultVertexAttr: VD = null.asInstanceOf[VD],
edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {
GraphImpl(vertices, edges, defaultVertexAttr, edgeStorageLevel, vertexStorageLevel)
}
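As a quick illustration of the defaultVertexAttr parameter, here is a small sketch that reuses the users and relationships from Examples B and C and adds one extra edge pointing at a vertex (0L) that has no entry in the vertex RDD. The object name GraphApplyExample, the extra edge, the defaultUser value, and the local[*] master are my own assumptions for the demo.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object GraphApplyExample {
  // Builds the property graph; vertex 0L appears only in an edge.
  def build(sc: SparkContext): Graph[(String, String), String] = {
    val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
      (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
      (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
    val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
      Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
      Edge(5L, 0L, "colleague"))) // 0L has no entry in `users`
    // Attribute given to vertices that are only mentioned in edges.
    val defaultUser = ("John Doe", "Missing")
    Graph(users, relationships, defaultUser)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graph-apply").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      build(sc).vertices.collect().sortBy(_._1).foreach(println)
    } finally sc.stop()
  }
}
```

After build, graph.vertices should contain five entries, with vertex 0L carrying the default attribute.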
2) Build a graph from edges
/**
* Construct a graph from a collection of edges.
*
* @param edges the RDD containing the set of edges in the graph
* @param defaultValue the default vertex attribute to use for each vertex
* @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
* @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
*
* @return a graph with edge attributes described by `edges` and vertices
* given by all vertices in `edges` with value `defaultValue`
*/
def fromEdges[VD: ClassTag, ED: ClassTag](
edges: RDD[Edge[ED]],
defaultValue: VD,
edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {
GraphImpl(edges, defaultValue, edgeStorageLevel, vertexStorageLevel)
}
Here, the vertices are derived from the edge information. The relevant lines in the implementation are:
val edgesCached = edges.withTargetStorageLevel(edgeStorageLevel).cache()
val vertices =
VertexRDD.fromEdges(edgesCached, edgesCached.partitions.length, defaultVertexAttr)
.withTargetStorageLevel(vertexStorageLevel)
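A small usage sketch of Graph.fromEdges, assuming a local SparkContext: only edges are supplied, and every vertex id mentioned in an edge shows up in graph.vertices with the default value. The object name FromEdgesExample and the sample edges are made up for the demo.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

object FromEdgesExample {
  // No vertex RDD is passed in; vertices 1, 2, 3 are derived from the edges.
  def build(sc: SparkContext): Graph[Int, String] = {
    val edges: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(1L, 2L, "a"), Edge(2L, 3L, "b")))
    Graph.fromEdges(edges, defaultValue = 0)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("from-edges").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      build(sc).vertices.collect().sortBy(_._1).foreach(println)
    } finally sc.stop()
  }
}
```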
3) Build a graph from edge tuples, i.e. (source id, destination id) pairs
/**
* Construct a graph from a collection of edges encoded as vertex id pairs.
*
* @param rawEdges a collection of edges in (src, dst) form
* @param defaultValue the vertex attributes with which to create vertices referenced by the edges
* @param uniqueEdges if multiple identical edges are found they are combined and the edge
* attribute is set to the sum. Otherwise duplicate edges are treated as separate. To enable
* `uniqueEdges`, a [[PartitionStrategy]] must be provided.
* @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
* @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
*
* @return a graph with edge attributes containing either the count of duplicate edges or 1
* (if `uniqueEdges` is `None`) and vertex attributes containing the total degree of each vertex.
*/
def fromEdgeTuples[VD: ClassTag](
rawEdges: RDD[(VertexId, VertexId)],
defaultValue: VD,
uniqueEdges: Option[PartitionStrategy] = None,
edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, Int] =
{
val edges = rawEdges.map(p => Edge(p._1, p._2, 1))
val graph = GraphImpl(edges, defaultValue, edgeStorageLevel, vertexStorageLevel)
uniqueEdges match {
case Some(p) => graph.partitionBy(p).groupEdges((a, b) => a + b)
case None => graph
}
}
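To see what uniqueEdges does, here is a small sketch with a duplicated (1, 2) pair. Every raw pair starts with edge attribute 1; when a PartitionStrategy is passed, identical edges are grouped and their attributes summed, as the body above shows. The object name FromEdgeTuplesExample, the sample pairs, and the choice of RandomVertexCut are assumptions for the demo.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, PartitionStrategy, VertexId}
import org.apache.spark.rdd.RDD

object FromEdgeTuplesExample {
  def build(sc: SparkContext): Graph[Int, Int] = {
    // The pair (1, 2) appears twice on purpose.
    val rawEdges: RDD[(VertexId, VertexId)] =
      sc.parallelize(Seq((1L, 2L), (1L, 2L), (2L, 3L)))
    // Passing a PartitionStrategy turns on duplicate merging.
    Graph.fromEdgeTuples(rawEdges, defaultValue = 0,
      uniqueEdges = Some(PartitionStrategy.RandomVertexCut))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("from-edge-tuples").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      build(sc).edges.collect().foreach(println)
    } finally sc.stop()
  }
}
```

With uniqueEdges left as None, the same input would keep two separate 1 -> 2 edges, each with attribute 1.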
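III. The Graph Computation Model (Key Topic): the Pregel Programming Model
1) Definition
Many graph algorithms are implemented on top of Pregel.
At its core are three functions. You write the function bodies yourself, but the input and output types are fixed:
(The "Official description" entries below come from the source-code comments.)
vprog function:
Official description: the user-defined vertex program. It runs on each vertex, receives the inbound message, and computes a new vertex value. On the first iteration the vertex program is invoked on all vertices and is passed the default message. On subsequent iterations it is invoked only on vertices that receive messages.
In plain terms: you implement the body. On the first iteration the function is applied to every vertex of the graph; on later iterations it is applied only to vertices that received a message. Its inputs are the vertex id, the current vertex attribute, and the message the vertex received (the message type is user-defined), and it returns the updated vertex attribute. The message a vertex receives is the result of sendMsg followed by mergeMsg.
sendMsg function:
Official description: a user-supplied function that is applied to the out edges of vertices that received messages in the current iteration.
In plain terms: you implement the body. The function is applied to edges: every edge when the first batch of messages is computed, and afterwards only edges next to vertices that received a message in the previous round (which side counts is controlled by activeDirection). Its input is an edge triplet (source vertex and its attribute, destination vertex and its attribute, and the attribute of the edge between them), and it returns an iterator of (target vertex id, message) pairs, which may be empty. The message type is user-defined.
mergeMsg function:
Official description: a user-supplied function that takes two incoming messages of type A and merges them into a single message of type A. This function must be commutative and associative, and ideally the size of A should not increase.
In plain terms: you implement the body. When sendMsg delivers more than one message to the same vertex, mergeMsg combines them two at a time. With message type A, the function takes two messages of type A and returns one merged message of type A.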
/**
* Execute a Pregel-like iterative vertex-parallel abstraction. The
* user-defined vertex-program `vprog` is executed in parallel on
* each vertex receiving any inbound messages and computing a new
* value for the vertex. The `sendMsg` function is then invoked on
* all out-edges and is used to compute an optional message to the
* destination vertex. The `mergeMsg` function is a commutative
* associative function used to combine messages destined to the
* same vertex.
*
* On the first iteration all vertices receive the `initialMsg` and
* on subsequent iterations if a vertex does not receive a message
* then the vertex-program is not invoked.
*
* This function iterates until there are no remaining messages, or
* for `maxIterations` iterations.
*
* @tparam VD the vertex data type
* @tparam ED the edge data type
* @tparam A the Pregel message type
*
* @param graph the input graph.
*
* @param initialMsg the message each vertex will receive at the first
* iteration
*
* @param maxIterations the maximum number of iterations to run for
*
* @param activeDirection the direction of edges incident to a vertex that received a message in
* the previous round on which to run `sendMsg`. For example, if this is `EdgeDirection.Out`, only
* out-edges of vertices that received a message in the previous round will run. The default is
* `EdgeDirection.Either`, which will run `sendMsg` on edges where either side received a message
* in the previous round. If this is `EdgeDirection.Both`, `sendMsg` will only run on edges where
* *both* vertices received a message.
*
* @param vprog the user-defined vertex program which runs on each
* vertex and receives the inbound message and computes a new vertex
* value. On the first iteration the vertex program is invoked on
* all vertices and is passed the default message. On subsequent
* iterations the vertex program is only invoked on those vertices
* that receive messages.
*
* @param sendMsg a user supplied function that is applied to out
* edges of vertices that received messages in the current
* iteration
*
* @param mergeMsg a user supplied function that takes two incoming
* messages of type A and merges them into a single message of type
* A. ''This function must be commutative and associative and
* ideally the size of A should not increase.''
*
* @return the resulting graph at the end of the computation
*
*/
def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
(graph: Graph[VD, ED],
initialMsg: A,
maxIterations: Int = Int.MaxValue,
activeDirection: EdgeDirection = EdgeDirection.Either)
(vprog: (VertexId, VD, A) => VD,
sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
mergeMsg: (A, A) => A)
: Graph[VD, ED]
A figure would make this clearer, but time is short; this is a placeholder to be filled in later.
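Until that figure exists, a concrete call may help. The sketch below computes single-source shortest paths, the usual Pregel example, by calling Pregel with the signature shown above. The object name PregelSsspExample, the four sample edges, and the local[*] master are assumptions for the demo; the message type A is Double (a candidate distance).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, Pregel, VertexId}
import org.apache.spark.rdd.RDD

object PregelSsspExample {
  def shortestPaths(sc: SparkContext, sourceId: VertexId): Map[VertexId, Double] = {
    val edges: RDD[Edge[Double]] = sc.parallelize(Seq(
      Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0), Edge(3L, 4L, 1.0)))
    // Vertex attribute = best known distance from the source.
    val initialGraph: Graph[Double, Double] =
      Graph.fromEdges(edges, Double.PositiveInfinity)
        .mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)

    val sssp = Pregel(initialGraph, Double.PositiveInfinity)(
      // vprog: keep the smaller of the current distance and the incoming one.
      (id, dist, newDist) => math.min(dist, newDist),
      // sendMsg: offer a shorter distance to the destination, if there is one.
      triplet =>
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else Iterator.empty,
      // mergeMsg: of two candidate distances, keep the smaller.
      (a, b) => math.min(a, b))
    sssp.vertices.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("pregel-sssp").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      shortestPaths(sc, 1L).toSeq.sortBy(_._1).foreach(println)
    } finally sc.stop()
  }
}
```

Note that min is commutative and associative, which is what mergeMsg requires, and that the initial message (positive infinity) leaves every vertex unchanged in the first vprog call.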
2) The body of the Pregel function is as follows:
def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
(graph: Graph[VD, ED],
initialMsg: A,
maxIterations: Int = Int.MaxValue,
activeDirection: EdgeDirection = EdgeDirection.Either)
(vprog: (VertexId, VD, A) => VD,
sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
mergeMsg: (A, A) => A)
: Graph[VD, ED] =
{
require(maxIterations > 0, s"Maximum number of iterations must be greater than 0," +
s" but got ${maxIterations}")
var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
// compute the messages
var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
var activeMessages = messages.count()
// Loop
var prevG: Graph[VD, ED] = null
var i = 0
while (activeMessages > 0 && i < maxIterations) {
// Receive the messages and update the vertices.
prevG = g
g = g.joinVertices(messages)(vprog).cache()
val oldMessages = messages
// Send new messages, skipping edges where neither side received a message. We must cache
// messages so it can be materialized on the next line, allowing us to uncache the previous
// iteration.
messages = GraphXUtils.mapReduceTriplets(
g, sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache()
// The call to count() materializes `messages` and the vertices of `g`. This hides oldMessages
// (depended on by the vertices of g) and the vertices of prevG (depended on by oldMessages
// and the vertices of g).
activeMessages = messages.count()
logInfo("Pregel finished iteration " + i)
// Unpersist the RDDs hidden by newly-materialized RDDs
oldMessages.unpersist(blocking = false)
prevG.unpersistVertices(blocking = false)
prevG.edges.unpersist(blocking = false)
// count the iteration
i += 1
}
messages.unpersist(blocking = false)
g
} // end of apply
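The function body is actually quite easy to follow.
Step 1: vprog updates the attribute of every vertex in the graph using the initial (default) message.
Step 2: sendMsg and mergeMsg pass messages along the edges and merge them per vertex, producing an RDD[(vertex id, message)]. Of course, not every vertex necessarily receives a message; that depends on the logic in the user's sendMsg.
Step 3: first check how many vertices received a message (the active vertices). If at least one vertex received a message and the current iteration count is below the maximum, call vprog on the vertices that received messages to update their attributes.
Step 4: loop: recompute the messages with sendMsg and mergeMsg and repeat Step 3 until no messages remain or maxIterations is reached.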
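The four steps above can be sketched on plain Scala collections, without Spark, to make the control flow easier to see. This is only a model of the loop under simplifying assumptions: PregelLoopSketch, Triplet, run, and exchange are hypothetical names, every edge endpoint must be present in the vertex map, activity follows the default EdgeDirection.Either, and none of the caching and unpersisting from the real implementation is reproduced.

```scala
object PregelLoopSketch {
  type VertexId = Long

  // Hypothetical stand-in for EdgeTriplet.
  final case class Triplet[VD, ED](
      srcId: VertexId, srcAttr: VD, dstId: VertexId, dstAttr: VD, attr: ED)

  def run[VD, ED, A](
      vertices: Map[VertexId, VD],
      edges: Seq[(VertexId, VertexId, ED)],
      initialMsg: A,
      maxIterations: Int = Int.MaxValue)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: Triplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): Map[VertexId, VD] = {

    // Run sendMsg on edges touching an active vertex, then merge per target.
    def exchange(vs: Map[VertexId, VD], active: Set[VertexId]): Map[VertexId, A] =
      edges.iterator
        .filter { case (s, d, _) => active(s) || active(d) }
        .flatMap { case (s, d, a) => sendMsg(Triplet(s, vs(s), d, vs(d), a)) }
        .toList
        .groupBy(_._1)
        .map { case (id, msgs) => id -> msgs.map(_._2).reduce(mergeMsg) }

    // Step 1: every vertex processes the initial message.
    var g = vertices.map { case (id, attr) => id -> vprog(id, attr, initialMsg) }
    // Step 2: first round of messages, computed over all edges.
    var messages = exchange(g, g.keySet)
    var i = 0
    while (messages.nonEmpty && i < maxIterations) {
      // Step 3: only vertices that received a message run vprog.
      g = g.map { case (id, attr) =>
        id -> messages.get(id).map(m => vprog(id, attr, m)).getOrElse(attr)
      }
      // Step 4: send again, restricted to edges next to those vertices.
      messages = exchange(g, messages.keySet)
      i += 1
    }
    g
  }

  def main(args: Array[String]): Unit = {
    // Shortest distances from vertex 1 on a small weighted graph.
    val inf = Double.PositiveInfinity
    val dist = run[Double, Double, Double](
      Map(1L -> 0.0, 2L -> inf, 3L -> inf, 4L -> inf),
      Seq((1L, 2L, 1.0), (2L, 3L, 2.0), (1L, 3L, 5.0), (3L, 4L, 1.0)),
      inf)(
      (_, d, m) => math.min(d, m),
      t => if (t.srcAttr + t.attr < t.dstAttr) Iterator((t.dstId, t.srcAttr + t.attr))
           else Iterator.empty,
      (a, b) => math.min(a, b))
    dist.toSeq.sortBy(_._1).foreach(println)
  }
}
```

The loop stops as soon as a round produces no messages, which is why a sendMsg that returns Iterator.empty once nothing improves is what makes the computation terminate.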
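IV. Commonly Used Graph APIs
(This only covers the graph API functions I used while implementing the Louvain algorithm; I will add more as I run into them. Haha, my code should be clear enough that it needs no long-winded explanation~~)
1) triplets function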
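// Each element is ((smaller community id, larger community id), weight), so the element type is a keyed pair, not Edge[Double]
val edges: RDD[((VertexId, VertexId), Double)] = graph.triplets.filter(edgeTriplet => edgeTriplet.srcAttr._cId != edgeTriplet.dstAttr._cId).map {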
case (edgeTriplet: EdgeTriplet[VertexData, Double]) =>
val srcCId: VertexId = edgeTriplet.srcAttr._cId
val dstCId: VertexId = edgeTriplet.dstAttr._cId
val weight: Double = edgeTriplet.attr
val minVertexId: VertexId = math.min(srcCId, dstCId)
val maxVertexId: VertexId = math.max(srcCId, dstCId)
((minVertexId, maxVertexId), weight)
}
2) zip function
val changeCount: Long = graph.vertices.zip(maxChangeInfo).filter {
case ((vId1: VertexId, vertex: VertexData), (vId2: VertexId, cId: VertexId, maxModularityChange: Double)) =>
vertex._cId != cId
}.count()
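One thing to watch with zip: RDD.zip pairs elements by position and assumes both RDDs have the same number of partitions and the same number of elements in each partition. When that is not guaranteed, a key-based join is a safer way to express the same count. The sketch below swaps zip for join; ChangeCountExample and countChanged are hypothetical names, and the inputs are simplified to (vertex id, community id) pairs rather than the VertexData used above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

object ChangeCountExample {
  // Counts vertices whose proposed community differs from the current one,
  // matching the two RDDs by vertex id instead of by position.
  def countChanged(
      current: RDD[(VertexId, VertexId)],
      proposed: RDD[(VertexId, VertexId)]): Long =
    current.join(proposed)
      .filter { case (_, (oldCId, newCId)) => oldCId != newCId }
      .count()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("change-count").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val current = sc.parallelize(Seq((1L, 1L), (2L, 2L), (3L, 3L)))
      val proposed = sc.parallelize(Seq((1L, 1L), (2L, 1L), (3L, 2L)))
      println(countChanged(current, proposed))
    } finally sc.stop()
  }
}
```

The join costs a shuffle that zip avoids, so zip is still the cheaper choice when both RDDs are known to be derived from the same parent with an identical layout.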
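3) connectedComponents function (let me explain this one: it first finds the connected components of the graph, then labels each component with the smallest vertex id it contains; vertices with the same label belong to the same connected component)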
val newMaxChangeInfo: VertexRDD[VertexId] = Graph.fromEdgeTuples(maxChangeInfo.map {
case (vId: VertexId, cId: VertexId, maxModularityChange: Double) => (vId, cId)
}, 0)
.connectedComponents()
.vertices
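A tiny example of that labelling, assuming a local SparkContext: vertices 1, 2, 3 form one component and 4, 5 form another, so each vertex should end up labelled with 1 or 4 respectively. The object name ConnectedComponentsExample and the sample pairs are made up for the demo.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD

object ConnectedComponentsExample {
  // Returns vertex id -> smallest vertex id in its connected component.
  def labels(sc: SparkContext): Map[VertexId, VertexId] = {
    val pairs: RDD[(VertexId, VertexId)] =
      sc.parallelize(Seq((1L, 2L), (2L, 3L), (5L, 4L)))
    Graph.fromEdgeTuples(pairs, 0)
      .connectedComponents()
      .vertices
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cc").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      labels(sc).toSeq.sortBy(_._1).foreach(println)
    } finally sc.stop()
  }
}
```

Edge direction does not matter here: the pair (5, 4) still puts both vertices in the component labelled 4.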
4) joinVertices function
val updateInfoByCId: Graph[VertexData, Double] = graph.joinVertices(maxChangeInfo)((vId: VertexId, vertexData: VertexData, cId: VertexId) => {
val newVertexData: VertexData = new VertexData(vId, cId)
newVertexData._degree = vertexData._degree
newVertexData._innerDegree = vertexData._innerDegree
newVertexData._innerVertices = vertexData._innerVertices
newVertexData
})
5) outerJoinVertices function
val louvainG = initG.outerJoinVertices(vertexAttr)((vId: VertexId, oldVertexAttr: None.type, newVertexAttr: Option[Double]) => {
val vertexData: VertexData = new VertexData(vId, vId)
val weights: Double = newVertexAttr.getOrElse(0.0)
vertexData._degree = weights
vertexData._innerVertices += vId
vertexData._commVertices += vId
vertexData
})
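Since the two join flavours are easy to mix up, here is a small side-by-side sketch on a three-vertex graph. With joinVertices, vertices without a matching entry keep their old attribute and the attribute type cannot change; with outerJoinVertices, the function receives an Option and may return a different type. JoinVerticesExample and the sample data are assumptions for the demo.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD

object JoinVerticesExample {
  def run(sc: SparkContext): (Map[VertexId, Int], Map[VertexId, String]) = {
    // Vertices 1, 2, 3, all with attribute 0.
    val graph: Graph[Int, Int] =
      Graph.fromEdgeTuples(sc.parallelize(Seq((1L, 2L), (2L, 3L))), 0)
    // Updates exist for vertices 1 and 2 only.
    val updates: RDD[(VertexId, Int)] = sc.parallelize(Seq((1L, 10), (2L, 20)))

    // joinVertices: vertex 3 has no update and keeps its old value.
    val joined = graph.joinVertices(updates)((_, old, u) => old + u)
    // outerJoinVertices: vertex 3 sees None; the attribute type becomes String.
    val outer = graph.outerJoinVertices(updates)((_, _, u) =>
      u.map(_.toString).getOrElse("none"))

    (joined.vertices.collect().toMap, outer.vertices.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("join-vertices").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (joined, outer) = run(sc)
      println(joined.toSeq.sortBy(_._1))
      println(outer.toSeq.sortBy(_._1))
    } finally sc.stop()
  }
}
```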