Spark transform series: cogroup

Cogroup

The implementation of the cogroup function:

This implementation takes the two RDDs to be combined and builds a CoGroupedRDD instance. For each key, the values of the two RDDs are grouped separately, so the value of the returned RDD is a Pair containing two Iterables: the first holds the values of that key from RDD1, and the second holds the values of that key from RDD2.

Because cogroup repartitions the data through a partitioner, executing this flow normally requires a shuffle (no shuffle is needed if both input RDDs are already shuffled RDDs and they use the same partitioner as the one passed in).
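Before looking at the source, a quick usage sketch (assuming an existing SparkContext named sc; the example data is made up for illustration) shows the result shape described above:

import org.apache.spark.rdd.RDD

// Minimal sketch, assuming a SparkContext `sc` is already available.
val rdd1 = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val rdd2 = sc.parallelize(Seq(("a", "x"), ("c", "y")))

// Each key maps to a pair of Iterables: values from rdd1 first, values from rdd2 second.
val grouped: RDD[(String, (Iterable[Int], Iterable[String]))] = rdd1.cogroup(rdd2)

grouped.collect().foreach(println)
// Prints (order may vary):
// (a,(CompactBuffer(1, 2),CompactBuffer(x)))
// (b,(CompactBuffer(3),CompactBuffer()))
// (c,(CompactBuffer(),CompactBuffer(y)))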

def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("Default partitioner cannot partition array keys.")
  }

A CoGroupedRDD instance is created, and mapValues is applied to it. The result for each key is a Pair:

Pair._1 is the result set of that key from the left RDD, and Pair._2 is the result set of that key from the right RDD.
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}

 

Here are some key pieces of code in CoGroupedRDD:

The overridden getDependencies function:

rdds.map { rdd: RDD[_] =>
  if (rdd.partitioner == Some(part)) {
    logDebug("Adding one-to-one dependency with " + rdd)
    new OneToOneDependency(rdd)
  } else {

For each of the two parent RDDs this CoGroupedRDD depends on: if the parent is not already a shuffled RDD, or the partitioner of its shuffle differs from the partitioner of the current RDD, a shuffle still has to be performed when this RDD executes.

The CoGroupCombiner type used when building this dependency:

 

private type CoGroup = CompactBuffer[Any]
private type CoGroupValue = (Any, Int)  // Int is dependency number
private type CoGroupCombiner = Array[CoGroup]


    logDebug("Adding shuffle dependency with " + rdd)
    new ShuffleDependency[K, Any, CoGroupCombiner](
      rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
  }
}
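
The shuffle-avoidance case described above can be seen in a small sketch (assuming an existing SparkContext sc; the data is made up): when both parents are already partitioned by the same partitioner that is passed to cogroup, rdd.partitioner == Some(part) holds and each dependency becomes a OneToOneDependency instead of a ShuffleDependency.

import org.apache.spark.HashPartitioner

// Sketch, assuming a SparkContext `sc` is available.
val part = new HashPartitioner(4)

// Both parents are pre-shuffled with the same partitioner ...
val left  = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(part)
val right = sc.parallelize(Seq(("a", "x"), ("b", "y"))).partitionBy(part)

// ... so the cogroup below adds no further shuffle: getDependencies sees
// rdd.partitioner == Some(part) for both parents and builds OneToOneDependency.
val cg = left.cogroup(right, part)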

 

Key parts of the compute function of CoGroupedRDD:

This function first builds an rddIterators array from the parent RDDs the CoGroupedRDD depends on. The Iterator over each parent RDD's results is stored at the corresponding index of the array.

val rddIterators = new ArrayBuffer[(Iterator[Product2[K, Any]], Int)]

 

Here the RDD's parent dependencies are iterated over to obtain the Iterator over each parent RDD's data for this partition, and each Iterator is stored at the corresponding position in the array.
for ((dep, depNum) <- dependencies.zipWithIndex) dep match {
  case oneToOneDependency: OneToOneDependency[Product2[K, Any]] @unchecked =>

When the parent dependency is a OneToOneDependency, the parent RDD was shuffled with the same partitioner as this RDD, so no shuffle is needed: the parent RDD's iterator is read directly and stored at the corresponding array index.
    val dependencyPartition = split.narrowDeps(depNum).get.split
    // Read them from the parent
    val it = oneToOneDependency.rdd.iterator(dependencyPartition, context)
    rddIterators += ((it, depNum))

  case shuffleDependency: ShuffleDependency[_, _, _] =>

In this case, a shuffle runs before this RDD executes and repartitions the parent RDD; the shuffleManager's reader is used to fetch the parent's data that now belongs to this partition, and the iterator is stored at the corresponding array index.
    // Read map outputs of shuffle
    val it = SparkEnv.get.shuffleManager
      .getReader(shuffleDependency.shuffleHandle, split.index, split.index + 1, context)
      .read()
    rddIterators += ((it, depNum))
}

 

The next piece of code creates an ExternalAppendOnlyMap[K, CoGroupValue, CoGroupCombiner] instance, where K is the key type of this RDD.

CoGroupValue is (Any, Int): the Any is a value taken from a parent RDD, and the Int identifies which parent (the dependency number) it came from.

CoGroupCombiner is Array[CoGroup]; the array has length 2, one slot per parent RDD.

CoGroup is CompactBuffer[Any]: the Any values stored here are the concrete values of the corresponding parent RDD, i.e. the collection of values that share the same key.

val map = createExternalMap(numRdds)
for ((it, depNum) <- rddIterators) {

The iterator returned by each RDD is traversed, and on each iteration the parent RDD's value is wrapped into a CoGroupValue instance.

The records of each RDD are then merged by key through the ExternalAppendOnlyMap. This follows exactly the same three-function flow used by the combiners of groupByKey and reduceByKey during a shuffle.
  map.insertAll(it.map(pair => (pair._1, new CoGroupValue(pair._2, depNum))))
}

 

Next, look at the definitions of the three functions the ExternalAppendOnlyMap uses to merge result sets during the compute of CoGroupedRDD:

 

The createCombiner function: when a key is seen for the first time, it creates a CoGroupCombiner instance and appends the value to the slot of the array corresponding to its parent RDD.

val createCombiner: (CoGroupValue => CoGroupCombiner) = value => {
  val newCombiner = Array.fill(numRdds)(new CoGroup)
  newCombiner(value._2) += value._1
  newCombiner
}

 

The mergeValue function: it appends a value with the same key into the buffer at the slot of the corresponding parent RDD. This mainly handles merging results within a single partition.

val mergeValue: (CoGroupCombiner, CoGroupValue) => CoGroupCombiner =
  (combiner, value) => {
    combiner(value._2) += value._1
    combiner
  }

 

The mergeCombiners function: it merges two combiners into a single one, and is used mainly when merging across multiple partitions.

val mergeCombiners: (CoGroupCombiner, CoGroupCombiner) => CoGroupCombiner =
  (combiner1, combiner2) => {
    var depNum = 0
    while (depNum < numRdds) {
      combiner1(depNum) ++= combiner2(depNum)
      depNum += 1
    }
    combiner1
  }
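
To make the interplay of the three functions concrete, here is a standalone sketch that runs them by hand for a single key arriving from two parent RDDs. It is illustrative only: it lives outside Spark, ArrayBuffer stands in for Spark's private CompactBuffer, and the data is made up.

import scala.collection.mutable.ArrayBuffer

object CoGroupCombineSketch {
  type CoGroup = ArrayBuffer[Any]
  type CoGroupValue = (Any, Int)        // (parent value, dependency number)
  type CoGroupCombiner = Array[CoGroup] // one slot per parent RDD

  val numRdds = 2

  // Same logic as the three functions above, just outside Spark.
  def createCombiner(value: CoGroupValue): CoGroupCombiner = {
    val newCombiner = Array.fill(numRdds)(new CoGroup)
    newCombiner(value._2) += value._1
    newCombiner
  }

  def mergeValue(combiner: CoGroupCombiner, value: CoGroupValue): CoGroupCombiner = {
    combiner(value._2) += value._1
    combiner
  }

  def mergeCombiners(c1: CoGroupCombiner, c2: CoGroupCombiner): CoGroupCombiner = {
    var depNum = 0
    while (depNum < numRdds) {
      c1(depNum) ++= c2(depNum)
      depNum += 1
    }
    c1
  }

  def main(args: Array[String]): Unit = {
    // Key "a": values 1 and "x" arrive in one partition, value 2 in another.
    // The Int marks which parent RDD the value came from (0 or 1).
    val partition1 = Seq[CoGroupValue]((1, 0), ("x", 1))
    val partition2 = Seq[CoGroupValue]((2, 0))

    // Within a partition: createCombiner for the first value, mergeValue for the rest.
    val c1 = partition1.tail.foldLeft(createCombiner(partition1.head))(mergeValue)
    val c2 = partition2.tail.foldLeft(createCombiner(partition2.head))(mergeValue)

    // Across partitions: mergeCombiners merges the per-parent buffers slot by slot.
    val merged = mergeCombiners(c1, c2)
    println(merged.map(_.mkString("[", ", ", "]")).mkString(" | "))
    // prints: [1, 2] | [x]
  }
}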
