Spark RDD Operators

Overview

Operations on an RDD fall into two categories: Transformations and Actions.

Transformations are lazy operators: calling one does not actually trigger any computation on the RDD.

Transformations share two traits: (1) they do not trigger computation immediately; (2) each call to a transformation produces a new RDD.

Only Actions actually trigger computation.
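A minimal sketch of this laziness in spark-shell (the variable names are illustrative): the map call only builds a new RDD, and nothing runs until the action at the end.

val nums = sc.makeRDD(List(1, 3, 5, 7, 9))
val doubled = nums.map(_ * 2)   // transformation: returns a new RDD, no job runs yet
doubled.collect()               // action: this is what actually triggers the Spark job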

Transformations


map(func)

Return a new distributed dataset formed by passing each element of the source through a function func.

Takes a function as its argument; the function is applied to every element of the RDD, and the result is a new RDD.

Example:

map applies the function to each element of the rdd:

val rdd = sc.makeRDD(List(1,3,5,7,9))

rdd.map(_*10) // multiply each element by 10: 10, 30, 50, 70, 90

 

flatMap(func)

Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

A flattened map: transforms each element of the RDD, then flattens the results.

Example:

flatMap flattens the results of the map:

val rdd = sc.makeRDD(List("hello world","hello count","world spark"),2)

 

rdd.map(_.split{" "})//Array(Array(hello, world), Array(hello, count), Array(world, spark))

 

rdd.flatMap(_.split{" "})//Array[String] = Array(hello, world, hello, count, world, spark)


 

Note: how do map and flatMap differ?

map: transforms each element of the RDD.

flatMap: transforms each element of the RDD and then flattens the result (i.e., unwraps the nested collections).

So, after reading from a data source, the first operation we usually apply is flatMap.

 

filter(func)

Return a new dataset formed by selecting those elements of the source on which func returns true.

Takes a function as its argument; elements that do not satisfy the function are filtered out, and the result is a new RDD.

Example:

filter removes the elements of the rdd that do not satisfy the predicate:

val rdd = sc.makeRDD(List(1,3,5,7,9));

rdd.filter(_<5); // keeps only the elements less than 5, i.e. 1 and 3

 

mapPartitions(func)

Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

This function is similar to map, except that the mapping function receives an iterator over each partition of the RDD instead of each individual element.

Example:

val rdd1 = sc.makeRDD(1 to 5, 2)   // sample input with 2 partitions (same data as the next example)
val rdd3 = rdd1.mapPartitions{ x => {
  // sum every element of this partition and return the sum as a one-element iterator
  val result = List[Int]()
  var i = 0
  while(x.hasNext){
    i += x.next()
  }
  result.::(i).iterator
}}

scala> rdd3.collect

 

 

Note: mapPartitions can be used to tune certain workloads, such as writing data to a database.

If map is used for the write, a connection is opened and closed for every single record, which performs poorly, so mapPartitions can be used instead of map.
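A minimal sketch of this pattern, assuming a JDBC database; the connection URL, table name, and credentials below are placeholders, not part of the original example:

import java.sql.DriverManager

val written = rdd1.mapPartitions { iter =>
  // one connection per partition, instead of one per record as with map
  val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password")  // placeholder URL and credentials
  val stmt = conn.prepareStatement("INSERT INTO nums(value) VALUES (?)")                          // hypothetical table
  var n = 0
  while (iter.hasNext) {
    stmt.setInt(1, iter.next())
    stmt.executeUpdate()
    n += 1
  }
  stmt.close()
  conn.close()
  Iterator(n)        // number of rows written from this partition
}
written.collect()    // mapPartitions is lazy, so an action is still needed to trigger the writes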

 

mapPartitionsWithIndex(func)

Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

 

Works the same way as mapPartitions, but the function takes two parameters; the first parameter is the index of the partition.

Example:

var rdd1 = sc.makeRDD(1 to 5,2)

var rdd2 = rdd1.mapPartitionsWithIndex{
  (index,iter) => {
    // sum this partition's elements and prefix the sum with the partition index
    var result = List[String]()
    var i = 0
    while(iter.hasNext){
      i += iter.next()
    }
    result.::(index + "|" + i).iterator
  }
}

 

 

Example:

val rdd = sc.makeRDD(List(1,2,3,4,5),2);
rdd.mapPartitionsWithIndex((index,iter)=>{
  // append "a" to the elements of partition 0 and "b" to the elements of the other partition
  var list = List[String]()
  while(iter.hasNext){
    if(index==0)
      list = list :+ (iter.next + "a")
    else {
      list = list :+ (iter.next + "b")
    }
  }
  list.iterator
});

 

 

union(otherDataset)

Return a new dataset that contains the union of the elements in the source dataset and the argument.

Example:

union: set union -- can also be written with ++

val rdd1 = sc.makeRDD(List(1,3,5));

val rdd2 = sc.makeRDD(List(2,4,6,8));

val rdd = rdd1.union(rdd2);

val rdd = rdd1 ++ rdd2;

 

 

intersection(otherDataset)

Return a new RDD that contains the intersection of elements in the source dataset and the argument.

Example:

intersection: set intersection

val rdd1 = sc.makeRDD(List(1,3,5,7));

val rdd2 = sc.makeRDD(List(5,7,9,11));

val rdd = rdd1.intersection(rdd2);

 

 

subtract

Example:

subtract: set difference (the elements of rdd1 that do not appear in rdd2)

val rdd1 = sc.makeRDD(List(1,3,5,7,9));

val rdd2 = sc.makeRDD(List(5,7,9,11,13));

val rdd =  rdd1.subtract(rdd2);

distinct([numTasks])

Return a new dataset that contains the distinct elements of the source dataset.

Takes no arguments; removes duplicate elements from the RDD.

Example:

val rdd = sc.makeRDD(List(1,3,5,7,9,3,7,10,23,7));

rdd.distinct

 

groupByKey([numTasks])

When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. 

Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. 

Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.

 

Example:

scala>val rdd = sc.parallelize(List(("cat",2), ("dog",5),("cat",4),("dog",3),("cat",6),("dog",3),("cat",9),("dog",1)),2);

scala>rdd.groupByKey()

 

Note: groupByKey has a requirement on the data format: each element must be a two-element tuple, where

tuple._1 is the key and tuple._2 is the value.

Data like the following does not qualify:

sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)

Nor does this:

sc.parallelize(List(("cat",2,1), ("dog",5,1),("cat",4,1),("dog",3,2),("cat",6,2),("dog",3,4),("cat",9,4),("dog",1,4)),2);

 

reduceByKey(func, [numTasks])

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

Example:

scala>var rdd = sc.makeRDD( List( ("hello",1),("spark",1),("hello",1),("world",1) ) )

rdd.reduceByKey(_+_);

 

Note: like groupByKey, reduceByKey requires each element to be a two-element tuple.

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])

When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

 

Usage and example:

aggregateByKey(zeroValue)(func1,func2)

 

scala> val rdd = sc.parallelize(List(("cat",2),("dog",5),("cat",4),("dog",3),("cat",6),("dog",3),("cat",9),("dog",1)),2);

Partition contents:

partition:[0]
(cat,2)
(dog,5)
(cat,4)
(dog,3)

partition:[1]
(cat,6)
(dog,3)
(cat,9)
(dog,1)

scala> rdd.aggregateByKey(0)(_+_, _*_);

 

  • zeroValue is the initial value; it takes part in the func1 computation.
  • Within each partition, the values are grouped by key and each group is folded with func1.
  • The per-partition results for each key are then combined with func2 (see the worked computation below).
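With the sample data above, a minimal walk-through of aggregateByKey(0)(_+_, _*_):

// seqOp (_+_) inside each partition, starting from the zeroValue 0:
//   partition 0: cat -> 0+2+4 = 6,  dog -> 0+5+3 = 8
//   partition 1: cat -> 0+6+9 = 15, dog -> 0+3+1 = 4
// combOp (_*_) across partitions:
//   cat -> 6 * 15 = 90,  dog -> 8 * 4 = 32
// so the result contains (cat,90) and (dog,32)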

sortByKey([ascending], [numTasks])

When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

Example:

val d2 = sc.parallelize(Array(("cc",32),("bb",32),("cc",22),("aa",18),("bb",6),("dd",16),("ee",104),("cc",1),("ff",13),("gg",68),("bb",44))) 

 

d2.sortByKey(true).collect

 

 

join(otherDataset, [numTasks])

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

Example:

val rdd1 = sc.makeRDD(List(("cat",1),("dog",2)))

val rdd2 = sc.makeRDD(List(("cat",3),("dog",4),("tiger",9)))

rdd1.join(rdd2); // joins on the key; "tiger" is dropped because it has no matching key in rdd1

 

cartesian(otherDataset)

When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

Takes another RDD as its argument and returns the Cartesian product of the two RDDs.

Example:

cartesian: Cartesian product

val rdd1 = sc.makeRDD(List(1,2,3))

val rdd2 = sc.makeRDD(List("a","b"))

rdd1.cartesian(rdd2);

 

coalesce(numPartitions)

Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

 

coalesce(n, true/false) increases or decreases the number of partitions.

Example:

val rdd = sc.makeRDD(List(1,2,3,4,5),2)

rdd.coalesce(3,true);  // to increase the number of partitions, pass true to force a reshuffle

rdd.coalesce(2);  // to decrease the number of partitions, the default false is enough, no need to pass it explicitly

 

repartition(numPartitions)

Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

repartition(n) is equivalent to coalesce(n, true) above.
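A one-line sketch of the equivalence (the rdd below is illustrative):

val rdd = sc.makeRDD(List(1,2,3,4,5),2)
rdd.repartition(3)   // same effect as rdd.coalesce(3, true): the data is always reshuffled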

partitionBy

Requires the RDD elements to be in (k, v) form.

Normally, specifying a partitioning rule when the RDD is created makes the data get partitioned automatically.

We can also partition explicitly by passing a partitioner to the partitionBy method.

Common partitioners include:

  • HashPartitioner
  • RangePartitioner

Example:

import org.apache.spark._

val r1 = sc.makeRDD(List((2,"aaa"),(9,"bbb"),(7,"ccc"),(9,"ddd"),(3,"eee"),(2,"fff")),2);

val r2 = r1.partitionBy(new HashPartitioner(2)) // each record goes to the partition numbered hash(key) % numPartitions, so records with the same key end up in the same partition

 

 

Actions


reduce(func)

Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

Aggregates all the data in the RDD in parallel, for example to compute a sum.
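A minimal example (illustrative data):

val rdd = sc.makeRDD(List(1,2,3,4,5))
rdd.reduce(_+_)   // 15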

collect()

Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

Returns all elements of the RDD: the data stored across the cluster's different partitions is gathered into a single array and returned.

Note that this method collects all the data onto one machine and can easily cause an out-of-memory error; use it with great care in production.

 

count()

Return the number of elements in the dataset.

Counts the number of elements in the RDD.

Example:

val rdd = sc.makeRDD(List(1,2,3,4,5),2)

rdd.count

 

first()

Return the first element of the dataset (similar to take(1)).
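A minimal example (illustrative data):

val rdd = sc.makeRDD(List(52,31,22,43,14,35))
rdd.first   // 52, the first element of the dataset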

take(n)

Return an array with the first n elements of the dataset.

Example:

take returns the first n elements:

val rdd = sc.makeRDD(List(52,31,22,43,14,35))

rdd.take(2)

takeOrdered(n, [ordering])

Return the first n elements of the RDD using either their natural order or a custom comparator.

Example:

takeOrdered(n) sorts the data in the rdd in ascending order and then takes the first n elements:

val rdd = sc.makeRDD(List(52,31,22,43,14,35))

rdd.takeOrdered(3)

 

top(n)

top(n) sorts the data in the rdd in descending order and then takes the first n elements:

val rdd = sc.makeRDD(List(52,31,22,43,14,35))

rdd.top(3)

saveAsTextFile(path)

Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

Example:

saveAsTextFile saves each partition's data as text:

val rdd = sc.makeRDD(List(1,2,3,4,5),2);

rdd.saveAsTextFile("/root/work/aaa")

 

countByKey()

Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
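A minimal example (illustrative data):

val rdd = sc.makeRDD(List(("cat",2), ("dog",5), ("cat",4), ("dog",3)))
rdd.countByKey()   // Map(cat -> 2, dog -> 2): counts how many times each key occurs, not the values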

foreach(func)

Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator (https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators) or interacting with external storage systems.

Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures (https://spark.apache.org/docs/latest/rdd-programming-guide.html#understanding-closures-a-nameclosureslinka) for more details.
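A minimal sketch of foreach updating an accumulator, assuming the Spark 2.x longAccumulator API:

val rdd = sc.makeRDD(List(1,2,3,4,5))
val acc = sc.longAccumulator("sum")   // accumulator registered with the SparkContext
rdd.foreach(x => acc.add(x))          // side effect per element; safe because only the accumulator is updated
acc.value                             // 15, read back on the driver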

 

Example: counting the words in a file with RDD operators

sc.textFile("/root/work/words.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("/root/work/wcresult")
