Depth-First Search over Spark RDD Dependencies

Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. Please include the original link and this notice when reposting.
Original link: https://blog.csdn.net/oscarun/article/details/89085611

1 Overview

I have been working through some algorithm problems recently and ran into the classic tree-search algorithms. I remembered that Spark RDD has a spot where DFS is used to determine RDD dependency relationships, so it seemed worth pulling that code out and analyzing it.

2 Code

/**
* Return the ancestors of the given RDD that are related to it only through a sequence of
* narrow dependencies. This traverses the given RDD's dependency tree using DFS, but maintains
* no ordering on the RDDs returned.
*/
private[spark] def getNarrowAncestors: Seq[RDD[_]] = {
    val ancestors = new mutable.HashSet[RDD[_]]
    
    def visit(rdd: RDD[_]): Unit = {
      val narrowDependencies = rdd.dependencies.filter(_.isInstanceOf[NarrowDependency[_]])
      val narrowParents = narrowDependencies.map(_.rdd)
      val narrowParentsNotVisited = narrowParents.filterNot(ancestors.contains)
      narrowParentsNotVisited.foreach { parent =>
        ancestors.add(parent)
        visit(parent)
      }
    }
    
    visit(this)
    
    // In case there is a cycle, do not include the root itself
    ancestors.filterNot(_ == this).toSeq
}

3 Analysis

The code is straightforward: it recursively searches for the narrow ancestors of an RDD.

val ancestors = new mutable.HashSet[RDD[_]]

ancestors is a Set used to record the parent RDDs that have already been visited.

The three variables narrowDependencies, narrowParents, and narrowParentsNotVisited are easy to understand from their names: the RDD's narrow dependencies, the parent RDDs of those narrow dependencies, and the narrow parents that have not been visited yet.
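
To make the filtering step concrete, here is a minimal sketch that can be pasted into spark-shell (it assumes an existing SparkContext named sc), showing how the same isInstanceOf[NarrowDependency[_]] test separates narrow dependencies from shuffle dependencies:

import org.apache.spark.NarrowDependency

val pairs = sc.parallelize(1 to 10).map(i => (i, i))   // map => OneToOneDependency (narrow)
val reduced = pairs.reduceByKey(_ + _)                  // reduceByKey => ShuffleDependency

// All of pairs' dependencies are narrow, so the filter keeps them
println(pairs.dependencies.count(_.isInstanceOf[NarrowDependency[_]]))   // 1
// reduced's only dependency is a shuffle, so the filter drops it
println(reduced.dependencies.count(_.isInstanceOf[NarrowDependency[_]])) // 0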

The final block takes each narrow parent that has not been visited yet, adds it to ancestors to mark it as visited, and then recurses into it.

narrowParentsNotVisited.foreach { parent =>
    ancestors.add(parent)
    visit(parent)
}

Attentive readers will notice the comment just above the last line.

In case there is a cycle, do not include the root itself

The visited-set check (filterNot(ancestors.contains)) already guarantees that the recursion terminates even if the dependency graph contains a cycle. What the comment points out is a different issue: if such a cycle leads back to the root RDD, the root itself ends up in ancestors, so the final filterNot(_ == this) removes it to avoid returning the root as one of its own ancestors.
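
To see the pattern in isolation, here is a small self-contained sketch, in plain Scala rather than Spark code, of the same DFS on a toy graph whose last node points back to the root:

val parents = Map("a" -> Seq("b"), "b" -> Seq("c"), "c" -> Seq("a"))  // c -> a closes the cycle
val visited = scala.collection.mutable.HashSet[String]()

def visit(node: String): Unit = {
  parents.getOrElse(node, Seq.empty).filterNot(visited.contains).foreach { p =>
    visited.add(p)   // mark as visited before recursing, so the cycle terminates
    visit(p)
  }
}

visit("a")
// The cycle pulled the root "a" into the set, so it is filtered out at the end,
// exactly like ancestors.filterNot(_ == this)
println(visited.filterNot(_ == "a"))   // Set(b, c)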

4 Test Case

// org/apache/spark/rdd/RDDSuite.scala
test("getNarrowAncestors") {
    val rdd1 = sc.parallelize(1 to 100, 4)
    val rdd2 = rdd1.filter(_ % 2 == 0).map(_ + 1)
    val rdd3 = rdd2.map(_ - 1).filter(_ < 50).map(i => (i, i))
    val rdd4 = rdd3.reduceByKey(_ + _)
    val rdd5 = rdd4.mapValues(_ + 1).mapValues(_ + 2).mapValues(_ + 3)
    val ancestors1 = rdd1.getNarrowAncestors
    val ancestors2 = rdd2.getNarrowAncestors
    val ancestors3 = rdd3.getNarrowAncestors
    val ancestors4 = rdd4.getNarrowAncestors
    val ancestors5 = rdd5.getNarrowAncestors
    
    // Simple dependency tree with a single branch
    assert(ancestors1.size === 0)
    assert(ancestors2.size === 2)
    assert(ancestors2.count(_ === rdd1) === 1)
    assert(ancestors2.count(_.isInstanceOf[MapPartitionsRDD[_, _]]) === 1)
    assert(ancestors3.size === 5)
    assert(ancestors3.count(_.isInstanceOf[MapPartitionsRDD[_, _]]) === 4)
    
    // Any ancestors before the shuffle are not considered
    assert(ancestors4.size === 0)
    assert(ancestors4.count(_.isInstanceOf[ShuffledRDD[_, _, _]]) === 0)
    assert(ancestors5.size === 3)
    assert(ancestors5.count(_.isInstanceOf[ShuffledRDD[_, _, _]]) === 1)
    assert(ancestors5.count(_ === rdd3) === 0)
    assert(ancestors5.count(_.isInstanceOf[MapPartitionsRDD[_, _]]) === 2)
}

It is worth running the getNarrowAncestors test in the RDDSuite.scala test class yourself. As the second group of assertions shows, narrow ancestors are only traced back to the shuffle boundary: once an RDD's lineage hits a shuffle operation, the chain of narrow dependencies starts over from that point.
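
Since getNarrowAncestors is private[spark] and cannot be called from user code, a rough way to observe the same shuffle boundary from spark-shell (a sketch assuming a SparkContext named sc) is to build a similar lineage and print toDebugString:

val rdd4 = sc.parallelize(1 to 100, 4)
  .map(i => (i % 10, i))
  .reduceByKey(_ + _)                       // introduces a ShuffleDependency
val rdd5 = rdd4.mapValues(_ + 1).mapValues(_ + 2).mapValues(_ + 3)

// toDebugString marks the shuffle boundary with an extra indentation level:
// the narrow ancestors of rdd5 are only the two intermediate MapPartitionsRDDs
// and the ShuffledRDD itself, matching the assertions on ancestors5 above.
println(rdd5.toDebugString)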


