graphx pagerank 源码解析

最新推荐文章于 2022-08-12 09:00:57 发布

Kallyn

最新推荐文章于 2022-08-12 09:00:57 发布

阅读量1.3k

点赞数

分类专栏：技术

本文链接：https://blog.csdn.net/klliu_control/article/details/79387525

版权

技术专栏收录该内容

11 篇文章 0 订阅

订阅专栏

1.在spark官网http://spark.apache.org/downloads.html下载source code。

参考网址：http://blog.csdn.net/lsshlsw/article/details/41176093

2.找到pagerank文件的位置为:

\graphx\src\main\scala\org\apache\spark\graphx\lib\PageRank.scala

3.PageRank代码简介

/**
 * PageRank algorithm implementation. There are two implementations of PageRank implemented.
 *
 * The first implementation uses the standalone `Graph` interface and runs PageRank
 * for a fixed number of iterations:
 * {{{
 * var PR = Array.fill(n)( 1.0 )
 * val oldPR = Array.fill(n)( 1.0 )
 * for( iter <- 0 until numIter ) {
 *   swap(oldPR, PR)
 *   for( i <- 0 until n ) {
 *     PR[i] = alpha + (1 - alpha) * inNbrs[i].map(j => oldPR[j] / outDeg[j]).sum
 *   }
 * }
 * }}}
 *
 * The second implementation uses the `Pregel` interface and runs PageRank until
 * convergence:
 *
 * {{{
 * var PR = Array.fill(n)( 1.0 )
 * val oldPR = Array.fill(n)( 0.0 )
 * while( max(abs(PR - oldPr)) > tol ) {
 *   swap(oldPR, PR)
 *   for( i <- 0 until n if abs(PR[i] - oldPR[i]) > tol ) {
 *     PR[i] = alpha + (1 - \alpha) * inNbrs[i].map(j => oldPR[j] / outDeg[j]).sum
 *   }
 * }
 * }}}
 *
 * `alpha` is the random reset probability (typically 0.15), `inNbrs[i]` is the set of
 * neighbors which link to `i` and `outDeg[j]` is the out degree of vertex `j`.
 *
 * @note This is not the "normalized" PageRank and as a consequence pages that have no
 * inlinks will have a PageRank of alpha.
 */

PageRank模型分为静态和动态两大类:

第一种：（静态）在调用时提供一个参数number，用于指定迭代次数，即无论结果如何，该算法在迭代number次后停止计算，返回图结果。

第二种：（动态）在调用时提供一个参数tol，用于指定前后两次迭代的结果差值应小于tol，以达到最终收敛的效果时才停止计算，返回图结果。

共提供了5种调用方式

第1种(静态，非并行化,不指定源顶点)

def staticPageRank(numIter: Int, resetProb: Double = 0.15): Graph[Double, Double]

Run PageRank for a fixed number of iterations returning a graph with vertex attributes containing the PageRank and edge attributes the normalized edge weight.

第2种(静态，非并行化,指定1个源顶点)

def staticPersonalizedPageRank(src: VertexId, numIter: Int, resetProb: Double = 0.15): Graph[Double, Double]

Run Personalized PageRank for a fixed number of iterations with with all iterations originating at the source node returning a graph with vertex attributes containing the PageRank and edge attributes the normalized edge weight.
第3种(静态，并行化,指定源顶点集)

def staticParallelPersonalizedPageRank(sources: Array[VertexId], numIter: Int, resetProb: Double = 0.15): Graph[Vector, Double]
Run parallel personalized PageRank for a given array of source vertices, such that all random walks are started relative to the source vertices

第4种(动态，不指定源顶点)

pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]
Run a dynamic version of PageRank returning a graph with vertex attributes containing the PageRank and edge attributes containing the normalized edge weight.

第5种(静态, 指定源顶点)

def personalizedPageRank(src: VertexId, tol: Double, resetProb: Double = 0.15): Graph[Double, Double]
Run personalized PageRank for a given vertex, such that all random walks are started relative to the source node.

第1、2种静态非并行化调用方式的源代码:

/**
   * Run PageRank for a fixed number of iterations returning a graph
   * with vertex attributes containing the PageRank and edge
   * attributes the normalized edge weight.
   *
   * @tparam VD the original vertex attribute (not used)
   * @tparam ED the original edge attribute (not used)
   *
   * @param graph the graph on which to compute PageRank
   * @param numIter the number of iterations of PageRank to run
   * @param resetProb the random reset probability (alpha)
   *
   * @return the graph containing with each vertex containing the PageRank and each edge
   *         containing the normalized weight.
   */
  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED], numIter: Int,
    resetProb: Double = 0.15): Graph[Double, Double] =
  {
    runWithOptions(graph, numIter, resetProb)
  }

  /**
   * Run PageRank for a fixed number of iterations returning a graph
   * with vertex attributes containing the PageRank and edge
   * attributes the normalized edge weight.
   *
   * @tparam VD the original vertex attribute (not used)
   * @tparam ED the original edge attribute (not used)
   *
   * @param graph the graph on which to compute PageRank
   * @param numIter the number of iterations of PageRank to run
   * @param resetProb the random reset probability (alpha)
   * @param srcId the source vertex for a Personalized Page Rank (optional)
   *
   * @return the graph containing with each vertex containing the PageRank and each edge
   *         containing the normalized weight.
   *
   */
  def runWithOptions[VD: ClassTag, ED: ClassTag](
      graph: Graph[VD, ED], numIter: Int, resetProb: Double = 0.15,
      srcId: Option[VertexId] = None): Graph[Double, Double] =
  {
    require(numIter > 0, s"Number of iterations must be greater than 0," +
      s" but got ${numIter}")
    require(resetProb >= 0 && resetProb <= 1, s"Random reset probability must belong" +
      s" to [0, 1], but got ${resetProb}")

    val personalized = srcId.isDefined
    val src: VertexId = srcId.getOrElse(-1L)

    // Initialize the PageRank graph with each edge attribute having
    // weight 1/outDegree and each vertex with attribute 1.0.
    // When running personalized pagerank, only the source vertex
    // has an attribute 1.0. All others are set to 0.
    var rankGraph: Graph[Double, Double] = graph
      // Associate the degree with each vertex 将每个顶点进行连接（度的传递）得到顶点属性值为出度数
      .outerJoinVertices(graph.outDegrees) { (vid, vdata, deg) => deg.getOrElse(0) }
      // Set the weight on the edges based on the degree 通过顶点的出度数为每条边设置权重值 TripletFields.Src: Expose the source and edge fields but not the destination field
      .mapTriplets( e => 1.0 / e.srcAttr, TripletFields.Src )
      // Set the vertex attributes to the initial pagerank values 设置每个顶点的初始属性值为1.0 或者 0, 如果没有指定初始源顶点,则设置所有的顶点属性值为1.0;如果指定了初始源顶点,则将指定的源顶点属性值设为1.0,其他非源顶点属性值为0.0
      .mapVertices { (id, attr) =>
        if (!(id != src && personalized)) 1.0 else 0.0
      }

    def delta(u: VertexId, v: VertexId): Double = { if (u == v) 1.0 else 0.0 }

    var iteration = 0
    var prevRankGraph: Graph[Double, Double] = null
    while (iteration < numIter) {
      rankGraph.cache()

      // Compute the outgoing rank contributions of each vertex, perform local preaggregation, and
      // do the final aggregation at the receiving vertices. Requires a shuffle for aggregation.
      val rankUpdates = rankGraph.aggregateMessages[Double](
        ctx => ctx.sendToDst(ctx.srcAttr * ctx.attr), _ + _, TripletFields.Src)

      // Apply the final rank updates to get the new ranks, using join to preserve ranks of vertices
      // that didn't receive a message. Requires a shuffle for broadcasting updated ranks to the
      // edge partitions.
      prevRankGraph = rankGraph
      val rPrb = if (personalized) {
        (src: VertexId, id: VertexId) => resetProb * delta(src, id)
      } else {
        (src: VertexId, id: VertexId) => resetProb
      }
      //如果指定了源顶点,则其只能利用接收的信息值来对自身顶点属性值进行更新，而没有shuffle操作。
      rankGraph = rankGraph.outerJoinVertices(rankUpdates) {
        (id, oldRank, msgSumOpt) => rPrb(src, id) + (1.0 - resetProb) * msgSumOpt.getOrElse(0.0)
      }.cache()

      rankGraph.edges.foreachPartition(x => {}) // also materializes rankGraph.vertices
      logInfo(s"PageRank finished iteration $iteration.")
      prevRankGraph.vertices.unpersist(false)
      prevRankGraph.edges.unpersist(false)

      iteration += 1
    }

    // SPARK-18847 If the graph has sinks (vertices with no outgoing edges) correct the sum of ranks
    normalizeRankSum(rankGraph, personalized)
  }

第3种静态并行化调用方式的源代码:

/**
 * Run Personalized PageRank for a fixed number of iterations, for a
 * set of starting nodes in parallel. Returns a graph with vertex attributes
 * containing the pagerank relative to all starting nodes (as a sparse vector) and
 * edge attributes the normalized edge weight
 *
 * @tparam VD The original vertex attribute (not used)
 * @tparam ED The original edge attribute (not used)
 *
 * @param graph The graph on which to compute personalized pagerank
 * @param numIter The number of iterations to run
 * @param resetProb The random reset probability
 * @param sources The list of sources to compute personalized pagerank from
 * @return the graph with vertex attributes
 *         containing the pagerank relative to all starting nodes (as a sparse vector
 *         indexed by the position of nodes in the sources list) and
 *         edge attributes the normalized edge weight
 */
def runParallelPersonalizedPageRank[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED],
  numIter: Int, resetProb: Double = 0.15,
  sources: Array[VertexId]): Graph[Vector, Double] = {
  require(numIter > 0, s"Number of iterations must be greater than 0," +
    s" but got ${numIter}")
  require(resetProb >= 0 && resetProb <= 1, s"Random reset probability must belong" +
    s" to [0, 1], but got ${resetProb}")
  require(sources.nonEmpty, s"The list of sources must be non-empty," +
    s" but got ${sources.mkString("[", ",", "]")}")

  // TODO if one sources vertex id is outside of the int range
  // we won't be able to store its activations in a sparse vector
  require(sources.max <= Int.MaxValue.toLong,
    s"This implementation currently only works for source vertex ids at most ${Int.MaxValue}")
  //Creates a sparse vector using unordered (index, value) pairs and Converts the instance to a breeze vector(基向量).

val zero = Vectors.sparse(sources.size, List()).asBreeze
  val sourcesInitMap = sources.zipWithIndex.map { case (vid, i) =>
    val v = Vectors.sparse(sources.size, Array(i), Array(1.0)).asBreeze
    (vid, v)
  }.toMap
  val sc = graph.vertices.sparkContext
  val sourcesInitMapBC = sc.broadcast(sourcesInitMap)
  // Initialize the PageRank graph with each edge attribute having
  // weight 1/outDegree and each source vertex with attribute 1.0.
  var rankGraph = graph
    // Associate the degree with each vertex
    .outerJoinVertices(graph.outDegrees) { (vid, vdata, deg) => deg.getOrElse(0) }
    // Set the weight on the edges based on the degree
    .mapTriplets(e => 1.0 / e.srcAttr, TripletFields.Src)
    .mapVertices { (vid, attr) =>
      if (sourcesInitMapBC.value contains vid) {
        sourcesInitMapBC.value(vid)
      } else {
        zero
      }
    }

  var i = 0
  while (i < numIter) {
    val prevRankGraph = rankGraph
    // Propagates the message along outbound edges
    // and adding start nodes back in with activation resetProb
    val rankUpdates = rankGraph.aggregateMessages[BV[Double]](
      ctx => ctx.sendToDst(ctx.srcAttr *:* ctx.attr),
      (a : BV[Double], b : BV[Double]) => a +:+ b, TripletFields.Src)

    rankGraph = rankGraph.outerJoinVertices(rankUpdates) {
      (vid, oldRank, msgSumOpt) =>
        val popActivations: BV[Double] = msgSumOpt.getOrElse(zero) *:* (1.0 - resetProb)
        val resetActivations = if (sourcesInitMapBC.value contains vid) {
          sourcesInitMapBC.value(vid) *:* resetProb
        } else {
          zero
        }
        popActivations +:+ resetActivations
      }.cache()

    rankGraph.edges.foreachPartition(x => {}) // also materializes rankGraph.vertices
    prevRankGraph.vertices.unpersist(false)
    prevRankGraph.edges.unpersist(false)

    logInfo(s"Parallel Personalized PageRank finished iteration $i.")

    i += 1
  }

  // SPARK-18847 If the graph has sinks (vertices with no outgoing edges) correct the sum of ranks
  val rankSums = rankGraph.vertices.values.fold(zero)(_ +:+ _)
  rankGraph.mapVertices { (vid, attr) =>
    Vectors.fromBreeze(attr /:/ rankSums)
  }
}

第4、5种动态pagerank源代码:

/**
   * Run a dynamic version of PageRank returning a graph with vertex attributes containing the
   * PageRank and edge attributes containing the normalized edge weight.
   *
   * @tparam VD the original vertex attribute (not used)
   * @tparam ED the original edge attribute (not used)
   *
   * @param graph the graph on which to compute PageRank
   * @param tol the tolerance allowed at convergence (smaller => more accurate).
   * @param resetProb the random reset probability (alpha)
   *
   * @return the graph containing with each vertex containing the PageRank and each edge
   *         containing the normalized weight.
   */
  def runUntilConvergence[VD: ClassTag, ED: ClassTag](
    graph: Graph[VD, ED], tol: Double, resetProb: Double = 0.15): Graph[Double, Double] =
  {
      runUntilConvergenceWithOptions(graph, tol, resetProb)
  }

  /**
   * Run a dynamic version of PageRank returning a graph with vertex attributes containing the
   * PageRank and edge attributes containing the normalized edge weight.
   *
   * @tparam VD the original vertex attribute (not used)
   * @tparam ED the original edge attribute (not used)
   *
   * @param graph the graph on which to compute PageRank
   * @param tol the tolerance allowed at convergence (smaller => more accurate).
   * @param resetProb the random reset probability (alpha)
   * @param srcId the source vertex for a Personalized Page Rank (optional)
   *
   * @return the graph containing with each vertex containing the PageRank and each edge
   *         containing the normalized weight.
   */
  def runUntilConvergenceWithOptions[VD: ClassTag, ED: ClassTag](
      graph: Graph[VD, ED], tol: Double, resetProb: Double = 0.15,
      srcId: Option[VertexId] = None): Graph[Double, Double] =
  {
    require(tol >= 0, s"Tolerance must be no less than 0, but got ${tol}")
    require(resetProb >= 0 && resetProb <= 1, s"Random reset probability must belong" +
      s" to [0, 1], but got ${resetProb}")

    val personalized = srcId.isDefined
    val src: VertexId = srcId.getOrElse(-1L)

    // Initialize the pagerankGraph with each edge attribute
    // having weight 1/outDegree and each vertex with attribute 1.0.
    val pagerankGraph: Graph[(Double, Double), Double] = graph
      // Associate the degree with each vertex
      .outerJoinVertices(graph.outDegrees) {
        (vid, vdata, deg) => deg.getOrElse(0)
      }
      // Set the weight on the edges based on the degree
      .mapTriplets( e => 1.0 / e.srcAttr )
      // Set the vertex attributes to (initialPR, delta = 0)
      .mapVertices { (id, attr) =>
        if (id == src) (0.0, Double.NegativeInfinity) else (0.0, 0.0)
      }
      .cache()

    // Define the three functions needed to implement PageRank in the GraphX
    // version of Pregel
    def vertexProgram(id: VertexId, attr: (Double, Double), msgSum: Double): (Double, Double) = {
      val (oldPR, lastDelta) = attr
      val newPR = oldPR + (1.0 - resetProb) * msgSum
      (newPR, newPR - oldPR)
    }

    def personalizedVertexProgram(id: VertexId, attr: (Double, Double),
      msgSum: Double): (Double, Double) = {
      val (oldPR, lastDelta) = attr
      val newPR = if (lastDelta == Double.NegativeInfinity) {
        1.0
      } else {
        oldPR + (1.0 - resetProb) * msgSum
      }
      (newPR, newPR - oldPR)
    }

    def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
      if (edge.srcAttr._2 > tol) {
        Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
      } else {
        Iterator.empty
      }
    }

    def messageCombiner(a: Double, b: Double): Double = a + b

    // The initial message received by all vertices in PageRank
    val initialMessage = if (personalized) 0.0 else resetProb / (1.0 - resetProb)

    // Execute a dynamic version of Pregel.
    val vp = if (personalized) {
      (id: VertexId, attr: (Double, Double), msgSum: Double) =>
        personalizedVertexProgram(id, attr, msgSum)
    } else {
      (id: VertexId, attr: (Double, Double), msgSum: Double) =>
        vertexProgram(id, attr, msgSum)
    }

    val rankGraph = Pregel(pagerankGraph, initialMessage, activeDirection = EdgeDirection.Out)(
      vp, sendMessage, messageCombiner)
      .mapVertices((vid, attr) => attr._1)

    // SPARK-18847 If the graph has sinks (vertices with no outgoing edges) correct the sum of ranks
    normalizeRankSum(rankGraph, personalized)
  }

  // Normalizes the sum of ranks to n (or 1 if personalized)
  private def normalizeRankSum(rankGraph: Graph[Double, Double], personalized: Boolean) = {
    val rankSum = rankGraph.vertices.values.sum()
    if (personalized) {
      rankGraph.mapVertices((id, rank) => rank / rankSum)
    } else {
      val numVertices = rankGraph.numVertices
      val correctionFactor = numVertices.toDouble / rankSum
      rankGraph.mapVertices((id, rank) => rank * correctionFactor)
    }
  }

参考网址:http://spark.apache.org/docs/2.2.1/api/scala/index.html#org.apache.spark.graphx.lib.PageRank$