Spark RDD Operators

Overview

Operations on an RDD fall into two categories: Transformations and Actions.

Transformations are lazy operators: calling one does not actually trigger any computation on the RDD.

Transformations share two traits: (1) they do not trigger computation immediately; (2) each call to a transformation produces a new RDD.

Only Actions actually trigger computation.
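A minimal sketch of this laziness in spark-shell (the variable names are illustrative): the map call only builds a new RDD, and nothing runs until the action at the end.

val nums = sc.makeRDD(List(1, 3, 5, 7, 9))
val doubled = nums.map(_ * 2)   // transformation: returns a new RDD, no job runs yet
doubled.collect()               // action: this is what actually triggers the Spark job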

Transformations


map(func)

Return a new distributed dataset formed by passing each element of the source through a function func.

Takes a function as its argument; the function is applied to every element of the RDD, and the result is a new RDD.

Example:

map applies the function to each element of the rdd:

val rdd = sc.makeRDD(List(1,3,5,7,9))

rdd.map(_*10) // multiply each element by 10: 10, 30, 50, 70, 90

 

flatMap(func)

Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

A flattened map: transforms each element of the RDD, then flattens the results.

Example:

flatMap flattens the results of the map:

val rdd = sc.makeRDD(List("hello world","hello count","world spark"),2)

 

rdd.map(_.split{" "})//Array(Array(hello, world), Array(hello, count), Array(world, spark))

 

rdd.flatMap(_.split{" "})//Array[String] = Array(hello, world, hello, count, world, spark)


 

Note: how do map and flatMap differ?

map: transforms each element of the RDD.

flatMap: transforms each element of the RDD and then flattens the result (i.e., unwraps the nested collections).

So, after reading from a data source, the first operation we usually apply is flatMap.

 

filter(func)

Return a new dataset formed by selecting those elements of the source on which func returns true.

Takes a function as its argument; elements that do not satisfy the function are filtered out, and the result is a new RDD.

Example:

filter removes the elements of the rdd that do not satisfy the predicate:

val rdd = sc.makeRDD(List(1,3,5,7,9));

rdd.filter(_<5); // keeps only the elements less than 5, i.e. 1 and 3

 

mapPartitions(func)

Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

This function is similar to map, except that the mapping function receives an iterator over each partition of the RDD instead of each individual element.

Example:

val rdd1 = sc.makeRDD(1 to 5, 2)   // sample input with 2 partitions (same data as the next example)
val rdd3 = rdd1.mapPartitions{ x => {
  // sum every element of this partition and return the sum as a one-element iterator
  val result = List[Int]()
  var i = 0
  while(x.hasNext){
    i += x.next()
  }
  result.::(i).iterator
}}

scala> rdd3.collect

 

 

Note: mapPartitions can be used to tune certain workloads, such as writing data to a database.

If map is used for the write, a connection is opened and closed for every single record, which performs poorly, so mapPartitions can be used instead of map.
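A minimal sketch of this pattern, assuming a JDBC database; the connection URL, table name, and credentials below are placeholders, not part of the original example:

import java.sql.DriverManager

val written = rdd1.mapPartitions { iter =>
  // one connection per partition, instead of one per record as with map
  val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password")  // placeholder URL and credentials
  val stmt = conn.prepareStatement("INSERT INTO nums(value) VALUES (?)")                          // hypothetical table
  var n = 0
  while (iter.hasNext) {
    stmt.setInt(1, iter.next())
    stmt.executeUpdate()
    n += 1
  }
  stmt.close()
  conn.close()
  Iterator(n)        // number of rows written from this partition
}
written.collect()    // mapPartitions is lazy, so an action is still needed to trigger the writes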

 

mapPartitionsWithIndex(func)

Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

 

Works the same way as mapPartitions, but the function takes two parameters; the first parameter is the index of the partition.

Example:

var rdd1 = sc.makeRDD(1 to 5,2)

var rdd2 = rdd1.mapPartitionsWithIndex{
  (index,iter) => {
    // sum this partition's elements and prefix the sum with the partition index
    var result = List[String]()
    var i = 0
    while(iter.hasNext){
      i += iter.next()
    }
    result.::(index + "|" + i).iterator
  }
}

 

 

Example:

val rdd = sc.makeRDD(List(1,2,3,4,5),2);
rdd.mapPartitionsWithIndex((index,iter)=>{
  // append "a" to the elements of partition 0 and "b" to the elements of the other partition
  var list = List[String]()
  while(iter.hasNext){
    if(index==0)
      list = list :+ (iter.next + "a")
    else {
      list = list :+ (iter.next + "b")
    }
  }
  list.iterator
});

 

 

union(otherDataset)

Return a new dataset that contains the union of the elements in the source dataset and the argument.

Example:

union: set union -- can also be written with ++

val rdd1 = sc.makeRDD(List(1,3,5));

val rdd2 = sc.makeRDD(List(2,4,6,8));

val rdd = rdd1.union(rdd2);

val rdd = rdd1 ++ rdd2;

 

 

intersection(otherDataset)

Return a new RDD that contains the intersection of elements in the source dataset and the argument.

Example:

intersection: set intersection

val rdd1 = sc.makeRDD(List(1,3,5,7));

val rdd2 = sc.makeRDD(List(5,7,9,11));

val rdd = rdd1.intersection(rdd2);

 

 

subtract

Example:

subtract: set difference (the elements of rdd1 that do not appear in rdd2)

val rdd1 = sc.makeRDD(List(1,3,5,7,9));

val rdd2 = sc.makeRDD(List(5,7,9,11,13));

val rdd =  rdd1.subtract(rdd2);

distinct([numTasks])

Return a new dataset that contains the distinct elements of the source dataset.

Takes no arguments; removes duplicate elements from the RDD.

Example:

val rdd = sc.makeRDD(List(1,3,5,7,9,3,7,10,23,7));

rdd.distinct

 

groupByKey([numTasks])

When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. 

Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. 

Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.

 

Example:

scala>val rdd = sc.parallelize(List(("cat",2), ("dog",5),("cat",4),("dog",3),("cat",6),("dog",3),("cat",9),("dog",1)),2);

scala>rdd.groupByKey()

 

Note: groupByKey has a requirement on the data format: each element must be a two-element tuple, where

tuple._1 is the key and tuple._2 is the value.

Data like the following does not qualify:

sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)

Nor does this:

sc.parallelize(List(("cat",2,1), ("dog",5,1),("cat",4,1),("dog",3,2),("cat",6,2),("dog",3,4),("cat",9,4),("dog",1,4)),2);

 

reduceByKey(func, [numTasks])

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

Example:

scala>var rdd = sc.makeRDD( List( ("hello",1),("spark",1),("hello",1),("world",1) ) )

rdd.reduceByKey(_+_);

 

Note: like groupByKey, reduceByKey requires each element to be a two-element tuple.

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])

When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

 

Usage and example:

aggregateByKey(zeroValue)(func1,func2)

 

scala> val rdd = sc.parallelize(List(("cat",2),("dog",5),("cat",4),("dog",3),("cat",6),("dog",3),("cat",9),("dog",1)),2);

Partition contents:

partition:[0]
(cat,2)
(dog,5)
(cat,4)
(dog,3)

partition:[1]
(cat,6)
(dog,3)
(cat,9)
(dog,1)

scala> rdd.aggregateByKey(0)(_+_, _*_);

 

  • zeroValue is the initial value; it takes part in the func1 computation.
  • Within each partition, the values are grouped by key and each group is folded with func1.
  • The per-partition results for each key are then combined with func2 (see the worked computation below).
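With the sample data above, a minimal walk-through of aggregateByKey(0)(_+_, _*_):

// seqOp (_+_) inside each partition, starting from the zeroValue 0:
//   partition 0: cat -> 0+2+4 = 6,  dog -> 0+5+3 = 8
//   partition 1: cat -> 0+6+9 = 15, dog -> 0+3+1 = 4
// combOp (_*_) across partitions:
//   cat -> 6 * 15 = 90,  dog -> 8 * 4 = 32
// so the result contains (cat,90) and (dog,32)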

sortByKey([ascending], [numTasks])

When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

Example:

val d2 = sc.parallelize(Array(("cc",32),("bb",32),("cc",22),("aa",18),("bb",6),("dd",16),("ee",104),("cc",1),("ff",13),("gg",68),("bb",44))) 

 

d2.sortByKey(true).collect

 

 

join(otherDataset, [numTasks])

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

Example:

val rdd1 = sc.makeRDD(List(("cat",1),("dog",2)))

val rdd2 = sc.makeRDD(List(("cat",3),("dog",4),("tiger",9)))

rdd1.join(rdd2); // joins on the key; "tiger" is dropped because it has no matching key in rdd1

 

cartesian(otherDataset)

When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

Takes another RDD as its argument and returns the Cartesian product of the two RDDs.

Example:

cartesian: Cartesian product

val rdd1 = sc.makeRDD(List(1,2,3))

val rdd2 = sc.makeRDD(List("a","b"))

rdd1.cartesian(rdd2);

 

coalesce(numPartitions)

Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

 

coalesce(n, true/false) increases or decreases the number of partitions.

Example:

val rdd = sc.makeRDD(List(1,2,3,4,5),2)

rdd.coalesce(3,true);  // to increase the number of partitions, pass true to force a reshuffle

rdd.coalesce(2);  // to decrease the number of partitions, the default false is enough, no need to pass it explicitly

 

repartition(numPartitions)

Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

repartition(n) is equivalent to coalesce(n, true) above.
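A one-line sketch of the equivalence (the rdd below is illustrative):

val rdd = sc.makeRDD(List(1,2,3,4,5),2)
rdd.repartition(3)   // same effect as rdd.coalesce(3, true): the data is always reshuffled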

partitionBy

Requires the RDD elements to be in (k, v) form.

Normally, specifying a partitioning rule when the RDD is created makes the data get partitioned automatically.

We can also partition explicitly by passing a partitioner to the partitionBy method.

Common partitioners include:

  • HashPartitioner
  • RangePartitioner

Example:

import org.apache.spark._

val r1 = sc.makeRDD(List((2,"aaa"),(9,"bbb"),(7,"ccc"),(9,"ddd"),(3,"eee"),(2,"fff")),2);

val r2 = r1.partitionBy(new HashPartitioner(2)) // each record goes to the partition numbered hash(key) % numPartitions, so records with the same key end up in the same partition

 

 

Actions


reduce(func)

Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

Aggregates all the data in the RDD in parallel, for example to compute a sum.
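A minimal example (illustrative data):

val rdd = sc.makeRDD(List(1,2,3,4,5))
rdd.reduce(_+_)   // 15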

collect()

Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

Returns all elements of the RDD: the data stored across the cluster's different partitions is gathered into a single array and returned.

Note that this method collects all the data onto one machine and can easily cause an out-of-memory error; use it with great care in production.

 

count()

Return the number of elements in the dataset.

Counts the number of elements in the RDD.

Example:

val rdd = sc.makeRDD(List(1,2,3,4,5),2)

rdd.count

 

first()

Return the first element of the dataset (similar to take(1)).
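A minimal example (illustrative data):

val rdd = sc.makeRDD(List(52,31,22,43,14,35))
rdd.first   // 52, the first element of the dataset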

take(n)

Return an array with the first n elements of the dataset.

Example:

take returns the first n elements:

val rdd = sc.makeRDD(List(52,31,22,43,14,35))

rdd.take(2)

takeOrdered(n, [ordering])

Return the first n elements of the RDD using either their natural order or a custom comparator.

Example:

takeOrdered(n) sorts the data in the rdd in ascending order and then takes the first n elements:

val rdd = sc.makeRDD(List(52,31,22,43,14,35))

rdd.takeOrdered(3)

 

top(n)

top(n) sorts the data in the rdd in descending order and then takes the first n elements:

val rdd = sc.makeRDD(List(52,31,22,43,14,35))

rdd.top(3)

saveAsTextFile(path)

Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

Example:

saveAsTextFile saves each partition's data as text:

val rdd = sc.makeRDD(List(1,2,3,4,5),2);

rdd.saveAsTextFile("/root/work/aaa")

 

countByKey()

Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
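A minimal example (illustrative data):

val rdd = sc.makeRDD(List(("cat",2), ("dog",5), ("cat",4), ("dog",3)))
rdd.countByKey()   // Map(cat -> 2, dog -> 2): counts how many times each key occurs, not the values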

foreach(func)

Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator (https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators) or interacting with external storage systems.

Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures (https://spark.apache.org/docs/latest/rdd-programming-guide.html#understanding-closures-a-nameclosureslinka) for more details.
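A minimal sketch of foreach updating an accumulator, assuming the Spark 2.x longAccumulator API:

val rdd = sc.makeRDD(List(1,2,3,4,5))
val acc = sc.longAccumulator("sum")   // accumulator registered with the SparkContext
rdd.foreach(x => acc.add(x))          // side effect per element; safe because only the accumulator is updated
acc.value                             // 15, read back on the driver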

 

Example: counting the words in a file with RDD operators

sc.textFile("/root/work/words.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("/root/work/wcresult")
