Usage of cogroup in Spark

cogroup: given two RDDs of key-value pairs, for each key the values from each RDD are gathered into their own collection, yielding (K, (Iterable[V], Iterable[W])). Unlike reduceByKey, which combines values within a single RDD, cogroup aligns the values for matching keys across two RDDs.
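
For reference, the simplest overload on a Scala pair RDD (Spark 2.x, PairRDDFunctions) has the signature below; further overloads accept an explicit Partitioner or up to three other RDDs:

  def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]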

Example 1

[root@node111 ~]# spark-shell
28 Jan 10:20:56 WARN [util.NativeCodeLoader] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://node111:4040
Spark context available as 'sc' (master = spark://mycluster:7077, app id = app-20190128102110-0031).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_172)
Type in expressions to have them evaluated.
Type :help for more information.
// Define the source arrays
scala> val DBName=Array(Tuple2(1,"Spark"),Tuple2(2,"Hadoop"),Tuple2(3,"Kylin"),Tuple2(4,"Flink"))
DBName: Array[(Int, String)] = Array((1,Spark), (2,Hadoop), (3,Kylin), (4,Flink))

scala> val numType=Array(Tuple2(1,"String"),Tuple2(2,"int"),Tuple2(3,"byte"),Tuple2(4,"boolean"),Tuple2(5,"float"),Tuple2(1,"34"),Tuple2(2,"45"),Tuple2(3,"75"))
numType: Array[(Int, String)] = Array((1,String), (2,int), (3,byte), (4,boolean), (5,float), (1,34), (2,45), (3,75))
// Parallelize the arrays into RDDs
scala> val names=sc.parallelize(DBName)
names: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:26

scala> val types=sc.parallelize(numType)
types: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[1] at parallelize at <console>:26
// Show the contents of the two source RDDs
scala> names.collect.foreach(println)
(1,Spark)
(2,Hadoop)
(3,Kylin)
(4,Flink)

scala> types.collect.foreach(println)
(1,String)
(2,int)
(3,byte)
(4,boolean)
(5,float)
(1,34)
(2,45)
(3,75)

// Apply cogroup
scala> val nameAndType=names.cogroup(types)
nameAndType: org.apache.spark.rdd.RDD[(Int, (Iterable[String], Iterable[String]))] = MapPartitionsRDD[3] at cogroup at <console>:27
// Show the cogrouped result
scala> nameAndType.collect.foreach(println)
(1,(CompactBuffer(Spark),CompactBuffer(34, String)))                            
(2,(CompactBuffer(Hadoop),CompactBuffer(45, int)))
(3,(CompactBuffer(Kylin),CompactBuffer(75, byte)))
(4,(CompactBuffer(Flink),CompactBuffer(boolean)))
(5,(CompactBuffer(),CompactBuffer(float)))

// Filter: keep only keys that have values on both sides (key 5 above has an empty left buffer, so it is dropped)
scala> val commonRdd=nameAndType.filter(t=>t._2._1.iterator.hasNext && t._2._2.iterator.hasNext);
commonRdd: org.apache.spark.rdd.RDD[(Int, (Iterable[String], Iterable[String]))] = MapPartitionsRDD[4] at filter at <console>:25
// Show the filtered result
scala> commonRdd.collect.foreach(println)
(1,(CompactBuffer(Spark),CompactBuffer(String, 34)))
(2,(CompactBuffer(Hadoop),CompactBuffer(int, 45)))
(3,(CompactBuffer(Kylin),CompactBuffer(byte, 75)))
(4,(CompactBuffer(Flink),CompactBuffer(boolean)))

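To go from the cogrouped result to ordinary joined pairs, flatten the two Iterables with flatMapValues. A minimal sketch (not part of the shell session above; the val joined is introduced here for illustration):

val joined = commonRdd.flatMapValues { case (ns, ts) =>
  // Cross product of the two value collections for each key
  for (n <- ns; t <- ts) yield (n, t)
}
joined.collect.foreach(println)
// key 1 yields (1,(Spark,String)) and (1,(Spark,34)), and so on

This is essentially how RDD.join is implemented internally: a cogroup followed by flatMapValues over the cross product of the two sides.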

Example 2

The definition of cogroup in Spark's Java API (JavaPairRDD):

/**
   * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
   * list of values for that key in `this` as well as `other`.
   */
  def cogroup[W](other: JavaPairRDD[K, W], partitioner: Partitioner)
  : JavaPairRDD[K, (JIterable[V], JIterable[W])] =
    fromRDD(cogroupResultToJava(rdd.cogroup(other, partitioner)))
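
Note that this Java wrapper simply delegates to the underlying Scala RDD's cogroup (rdd.cogroup(other, partitioner)), so its semantics are the same as in Example 1: each key is paired with a tuple of two Iterables.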

An example of its use (a tiered spatial join over GeoWave data):

private JavaPairRDD<GeoWaveInputKey, ByteArrayId> joinAndCompareTiers(
			JavaPairRDD<ByteArrayId, Tuple2<GeoWaveInputKey, Geometry>> leftTier,
			JavaPairRDD<ByteArrayId, Tuple2<GeoWaveInputKey, Geometry>> rightTier,
			Broadcast<GeomFunction> geomPredicate,
			int highestPartitionCount, 
			HashPartitioner partitioner ) {
		// Cogroup groups on same tier ByteArrayId and pairs them into Iterable
		// sets.
		JavaPairRDD<
		     ByteArrayId, 
		     Tuple2<
		             Iterable<Tuple2<GeoWaveInputKey, Geometry>>, 
		             Iterable<Tuple2<GeoWaveInputKey, Geometry>>
		      >
		> joinedTiers = leftTier
				.cogroup(
						rightTier,
						partitioner);
		
		
		// Filter only the pairs that have data on both sides, bucket strategy
		// should have been accounted for by this point.
		// We need to go through the pairs and test each feature against each
		// other
		// End with a combined RDD for that tier.
		joinedTiers = joinedTiers.filter(t -> 
				t._2._1.iterator().hasNext() &&
				t._2._2.iterator().hasNext()
		);

		
		JavaPairRDD<GeoWaveInputKey, ByteArrayId> finalMatches = joinedTiers.flatMapValues(
		(Function<
		      Tuple2<
		            Iterable<Tuple2<GeoWaveInputKey, Geometry>>, 
		            Iterable<Tuple2<GeoWaveInputKey, Geometry>>
		      >, Iterable<GeoWaveInputKey>
		>) t -> {
            GeomFunction predicate = geomPredicate.value();

            HashSet<GeoWaveInputKey> results = Sets.newHashSet();
            for (Tuple2<GeoWaveInputKey, Geometry> leftTuple : t._1) {
                for (Tuple2<GeoWaveInputKey, Geometry> rightTuple : t._2) {
                    if (predicate.call(
                            leftTuple._2,
                            rightTuple._2)) {
                        results.add(leftTuple._1);
                        results.add(rightTuple._1);
                    }
                }
            }
            return results;
        })
        .mapToPair(Tuple2::swap)
        .reduceByKey(partitioner, (id1, id2) -> id1)
        .persist(StorageLevel.MEMORY_ONLY_SER());
		
		return finalMatches;
	}
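
Both examples follow the same three-step pattern: cogroup to align the values for matching keys from the two RDDs, filter to keep only keys with data on both sides, and flatMapValues to expand the grouped Iterables into the final output pairs.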