Spark RDD dependency types

Copyright notice: this is an original post by the author, licensed under CC 4.0 BY-SA. Please include the original link and this notice when reposting.
Original link: https://blog.csdn.net/qq_19006739/article/details/79301989

Spark RDD dependencies

One of the most important properties of an RDD is its lineage, which describes how an RDD is computed from its parent RDDs. The rdd method of Dependency returns an RDD, namely the RDD being depended upon.

abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}
Dependency comes in two kinds: narrow and shuffle (wide).

NarrowDependency

Let's start with the simpler one, narrow.
Definition: each partition of the parent RDD is used by at most one partition of the child RDD, so no shuffle is needed.
Put more plainly, a narrow dependency is map-like: the partition boundaries themselves do not change. One partition is still one partition after the transform, even though its contents change, so the work can be done locally.
A wide dependency, by contrast, requires the data to be scattered and repartitioned: a shuffle takes place, and both the number and the ranges of the partitions change.

The only interface method is getParents: given any partition id of the child, it returns the seq of ids of all the parent partitions it depends on.

@DeveloperApi
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd
}
NarrowDependency itself comes in two forms:

OneToOneDependency
The simplest dependency: the partitions of the parent and the child correspond one to one. Typical operations are map, filter, etc.

The partitionId is just the partition's index within its RDD, so for a one-to-one mapping the parent and child partitions share the same index, and getParents simply returns the child's own partition id.
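As a minimal standalone sketch (without the real Spark RDD and NarrowDependency classes, so the class name here is illustrative), the one-to-one case reduces to echoing back the same index:

```scala
// Standalone sketch of OneToOneDependency's behavior; the real class
// extends NarrowDependency[T](_rdd) and is tied to an actual RDD.
class OneToOneLike {
  // A child partition depends on exactly the parent partition with the same index.
  def getParents(partitionId: Int): Seq[Int] = List(partitionId)
}

val dep = new OneToOneLike
println(dep.getParents(3)) // List(3)
```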

RangeDependency

Still one to one, but here a range of partitions in the parent RDD maps onto a range of partitions in the child RDD.
The typical operation is union: several parent RDDs are merged into one child RDD, so each parent RDD maps onto one contiguous range of the child RDD.
Note that this union does not merge multiple partitions into one; it simply places the partitions of several RDDs into a single RDD, and the partitions themselves are unchanged.

Because the mapping is a range, recording the start offsets and the length is enough; there is no need to store a per-partition mapping, so RangeDependency is space-efficient.

/**
  * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
  * @param rdd the parent RDD
  * @param inStart the start of the range in the parent RDD
  * @param outStart the start of the range in the child RDD
  * @param length the length of the range
  */
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {

  override def getParents(partitionId: Int): List[Int] = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      // the child partition id must fall inside this range of the child RDD
      List(partitionId - outStart + inStart) // shift to the corresponding partition id in the parent RDD
    } else {
      Nil
    }
  }
}
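The index arithmetic can be checked in isolation. In this standalone sketch (no Spark classes; rangeParents, rddA and rddB are illustrative names), a union of a 3-partition RDD and a 2-partition RDD yields two RangeDependency-style mappings:

```scala
// Standalone re-implementation of RangeDependency.getParents' arithmetic.
def rangeParents(partitionId: Int, inStart: Int, outStart: Int, length: Int): List[Int] =
  if (partitionId >= outStart && partitionId < outStart + length)
    List(partitionId - outStart + inStart) // shift from child index space to parent index space
  else
    Nil

// union(rddA /* 3 partitions */, rddB /* 2 partitions */):
// child partitions 0..2 come from rddA (inStart = 0, outStart = 0, length = 3),
// child partitions 3..4 come from rddB (inStart = 0, outStart = 3, length = 2).
println(rangeParents(1, 0, 0, 3)) // List(1): child partition 1 is rddA's partition 1
println(rangeParents(4, 0, 3, 2)) // List(1): child partition 4 is rddB's partition 1
println(rangeParents(4, 0, 0, 3)) // List(): outside rddA's range in the child
```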

WideDependency

WideDependency is also known as ShuffleDependency.
First, it requires a pair RDD, because the shuffle is usually keyed, so the data must be structured as key-value pairs:
that is, the RDD's elements are kv pairs, RDD[_ <: Product2[K, V]].

trait Product2[+T1, +T2] extends Product  // Product2 is a cartesian product of 2 components

Product2 is a trait with two components, _1 and _2; since Scala's Tuple2[K, V] extends Product2[K, V], an ordinary (k, v) pair already qualifies as a key-value record, which is why implementing Product2 is enough to represent a kv pair.
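A quick check in plain Scala (no Spark needed): a literal pair typechecks as Product2 because Tuple2 extends it.

```scala
// Tuple2[K, V] extends Product2[K, V], so an ordinary pair is already
// a valid key-value record for a shuffle.
val kv: Product2[String, Int] = ("spark", 1)
println(kv._1) // spark
println(kv._2) // 1
```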

Second, because a shuffle is needed, a partitioner must of course be supplied to define how the shuffle redistributes the data.

Next, unlike map, a shuffle cannot run purely locally; it usually involves network transfer or storage, so a serializer is needed.

Finally, each shuffle is assigned a globally unique id; context.newShuffleId() is implemented by incrementing a global counter.
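A sketch of such a counter, matching the "increment a global id" description (the object and field names here are illustrative, not Spark's own):

```scala
import java.util.concurrent.atomic.AtomicInteger

// Sketch of a global, thread-safe shuffle-id counter: each call hands out
// the next integer id, starting from 0.
object ShuffleIds {
  private val nextShuffleId = new AtomicInteger(0)
  def newShuffleId(): Int = nextShuffleId.getAndIncrement()
}

println(ShuffleIds.newShuffleId()) // 0
println(ShuffleIds.newShuffleId()) // 1
```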


/**
 * :: DeveloperApi ::
 * Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle,
 * the RDD is transient since we don't need it on the executor side.
 *
 * @param _rdd the parent RDD
 * @param partitioner partitioner used to partition the shuffle output
 * @param serializer [[org.apache.spark.serializer.Serializer Serializer]] to use. If not set
 *                   explicitly then the default serializer, as specified by `spark.serializer`
 *                   config option, will be used.
 * @param keyOrdering key ordering for RDD's shuffles
 * @param aggregator map/reduce-side aggregator for RDD's shuffle
 * @param mapSideCombine whether to perform partial aggregation (also known as map-side combine)
 */
@DeveloperApi
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,  // aggregator
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {

  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]

  private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
  private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
  // Note: It's possible that the combiner class tag is null, if the combineByKey
  // methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
  private[spark] val combinerClassName: Option[String] =
    Option(reflect.classTag[C]).map(_.runtimeClass.getName)

  val shuffleId: Int = _rdd.context.newShuffleId()

  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)

  _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}


