Spark RDD Operations: Transformations and Actions

RDDs support two types of operations: transformations, which turn an existing RDD into a new RDD, and actions, which return a result to the Driver once the computation finishes. In Spark all transformations
are lazy: they do not run immediately, but merely record the transformation logic to apply to the current RDD. The actual computation only starts when an action requires a result to be returned to the Driver program. This design lets Spark run more efficiently.

By default, each transformed RDD may be recomputed every time an action is run on it. However, you can also keep an RDD in memory using the persist (or cache) method, in which case Spark keeps the elements on the cluster so that the next query can access them much faster.

scala> var rdd1=sc.textFile("hdfs:///demo/words/t_word",1).map(line=>line.split(" ").length)
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[117] at map at <console>:24

scala> rdd1.cache
res54: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[117] at map at <console>:24

scala> rdd1.reduce(_+_)
res55: Int = 15

scala> rdd1.reduce(_+_)
res56: Int = 15

Spark can also persist RDDs on disk, or replicate them across multiple nodes. For example, calling persist(StorageLevel.DISK_ONLY_2) stores the RDD on disk with 2 replicas.
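A minimal sketch of choosing an explicit storage level, assuming the same sc and HDFS path as the shell session above:

    import org.apache.spark.storage.StorageLevel

    val lineLengths = sc.textFile("hdfs:///demo/words/t_word").map(_.length)

    // keep the data on disk only, replicated on 2 nodes
    lineLengths.persist(StorageLevel.DISK_ONLY_2)

    lineLengths.reduce(_ + _)   // first action: computes the RDD and persists it
    lineLengths.reduce(_ + _)   // second action: reads the persisted copies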

Transformations (转换算子)

1. map(func)

Return a new distributed dataset formed by passing each element of the
source through a function func

Converts an RDD[U] into an RDD[T]. The user must supply a function func: U => T for the conversion.

    scala> var rdd:RDD[String]=sc.makeRDD(List("a","b","c","a"))
    rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[120] at makeRDD at
    <console>:25
    
    scala> val mapRDD:RDD[(String,Int)] = rdd.map(w => (w, 1))
    mapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[121] at map at
    <console>:26
2. filter(func)

Return a new dataset formed by selecting those elements of the source
on which func returns true.

Filters the elements of an RDD[U], producing a new RDD[U]. The user supplies func: U => Boolean, and only the elements for which it returns true are kept.


    scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5))
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[122] at makeRDD at
    <console>:25
    
    scala> val mapRDD:RDD[Int]=rdd.filter(num=> num %2 == 0)
    mapRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[123] at filter at
    <console>:26
    
    scala> mapRDD.collect
    res63: Array[Int] = Array(2, 4)
3. flatMap(func)

Similar to map, but each input item can be mapped to 0 or more output
items (so func should return a Seq rather than a single item).

Similar to map, it also converts an RDD[U] into an RDD[T], but the user must supply a function func: U => Seq[T].

scala> var rdd:RDD[String]=sc.makeRDD(List("this is","good good"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[124] at makeRDD at
<console>:25

scala> var flatMapRDD:RDD[(String,Int)]=rdd.flatMap(line=> for(i<- line.split("\\s+"))
yield (i,1))
flatMapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[125] at flatMap
at <console>:26

scala> var flatMapRDD:RDD[(String,Int)]=rdd.flatMap( line=>
line.split("\\s+").map((_,1)))
flatMapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[126] at flatMap
at <console>:26

scala> flatMapRDD.collect
res64: Array[(String, Int)] = Array((this,1), (is,1), (good,1), (good,1))
4. mapPartitions(func)

Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

Similar to map, but the input to this method is all of the data in one partition, so the user must provide a per-partition transformation func: Iterator[U] => Iterator[T].

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[128] at makeRDD at
<console>:25

scala> var mapPartitionsRDD=rdd.mapPartitions(values => values.map(n=>(n,n%2==0)))
mapPartitionsRDD: org.apache.spark.rdd.RDD[(Int, Boolean)] = MapPartitionsRDD[129] at
mapPartitions at <console>:26

scala> mapPartitionsRDD.collect
res70: Array[(Int, Boolean)] = Array((1,false), (2,true), (3,false), (4,true),
(5,false))
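The example above could equally be written with map; mapPartitions pays off when some setup should happen once per partition rather than once per element. A hedged sketch (the date formatter stands in for any per-partition resource such as a database connection):

    import java.text.SimpleDateFormat

    val tsRDD = sc.makeRDD(List(0L, 60000L, 120000L), 2)

    val formatted = tsRDD.mapPartitions { values =>
      val fmt = new SimpleDateFormat("HH:mm:ss")            // created once per partition, not once per element
      values.map(ts => fmt.format(new java.util.Date(ts)))
    }

    formatted.collect()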
5. mapPartitionsWithIndex(func)

Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

Similar to mapPartitions, but this method also provides the index of the partition that the elements belong to, so func: (Int, Iterator[U]) => Iterator[T].

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6),2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[139] at makeRDD at
<console>:25

scala> var mapPartitionsWithIndexRDD=rdd.mapPartitionsWithIndex((p,values) =>
values.map(n=>(n,p)))
mapPartitionsWithIndexRDD: org.apache.spark.rdd.RDD[(Int, Int)] =
MapPartitionsRDD[140] at mapPartitionsWithIndex at <console>:26

scala> mapPartitionsWithIndexRDD.collect
res77: Array[(Int, Int)] = Array((1,0), (2,0), (3,0), (4,1), (5,1), (6,1))
6. sample(withReplacement, fraction, seed): random sampling

Sample a fraction fraction of the data, with or without replacement,
using a given random number generator seed.

Samples elements from the RDD. The parameters are:
withReplacement: whether sampling with replacement is allowed.
fraction: the approximate fraction of the data to sample.
seed: the seed for the random number generator used during sampling.

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[150] at makeRDD at
<console>:25

scala> var simpleRDD:RDD[Int]=rdd.sample(false,0.5d,1L)
simpleRDD: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[151] at sample at
<console>:26

scala> simpleRDD.collect
res91: Array[Int] = Array(1, 5, 6)

A different seed produces a different sampling result!

7. union(otherDataset)

Return a new dataset that contains the union of the elements in the
source dataset and the argument.

Merges the elements of two RDDs of the same element type.

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at
<console>:25
scala> var rdd2:RDD[Int]=sc.makeRDD(List(6,7))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[155] at makeRDD at
<console>:25
scala> rdd.union(rdd2).collect
res95: Array[Int] = Array(1, 2, 3, 4, 5, 6, 6, 7)
8. intersection(otherDataset)

Return a new RDD that contains the intersection of elements in the
source dataset and the argument.

Computes the intersection of two RDDs of the same element type.

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at
<console>:25
scala> var rdd2:RDD[Int]=sc.makeRDD(List(6,7))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[155] at makeRDD at
<console>:25
scala> rdd.intersection(rdd2).collect
res100: Array[Int] = Array(6)
9. distinct([numPartitions]): deduplicate

Return a new dataset that contains the distinct elements of the source
dataset.

Removes duplicate elements from the RDD. The optional numPartitions parameter controls whether to change the RDD's partition count: when deduplication drastically shrinks the dataset, you can pass numPartitions to reduce the number of partitions.

scala> var rdd:RDD[Int]=sc.makeRDD(List(1,2,3,4,5,6,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[154] at makeRDD at
<console>:25
scala> rdd.distinct(3).collect
res106: Array[Int] = Array(6, 3, 4, 1, 5, 2)
10. join(otherDataset, [numPartitions])

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

When called on an RDD[(K,V)] and an RDD[(K,W)], returns a new RDD[(K,(V,W))] (an inner join by default); leftOuterJoin, rightOuterJoin, and fullOuterJoin are also supported. For example, joining RDD[(1,2)] with RDD[(1,3)] gives RDD[(1,(2,3))].

scala> var userRDD:RDD[(Int,String)]=sc.makeRDD(List((1,"zhangsan"),(2,"lisi")))
userRDD: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[204] at
makeRDD at <console>:25
scala> case class OrderItem(name:String,price:Double,count:Int)
defined class OrderItem
scala> var
orderItemRDD:RDD[(Int,OrderItem)]=sc.makeRDD(List((1,OrderItem("apple",4.5,2))))
orderItemRDD: org.apache.spark.rdd.RDD[(Int, OrderItem)] = ParallelCollectionRDD[206]
at makeRDD at <console>:27
scala> userRDD.join(orderItemRDD).collect
res107: Array[(Int, (String, OrderItem))] = Array((1,
(zhangsan,OrderItem(apple,4.5,2))))
scala> userRDD.leftOuterJoin(orderItemRDD).collect
res108: Array[(Int, (String, Option[OrderItem]))] = Array((1,
(zhangsan,Some(OrderItem(apple,4.5,2)))), (2,(lisi,None)))
11. cogroup(otherDataset, [numPartitions]) (for reference)

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.

scala> var userRDD:RDD[(Int,String)]=sc.makeRDD(List((1,"zhangsan"),(2,"lisi")))
userRDD: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[204] at
makeRDD at <console>:25
scala> var
orderItemRDD:RDD[(Int,OrderItem)]=sc.makeRDD(List((1,OrderItem("apple",4.5,2)),
(1,OrderItem("pear",1.5,2))))
orderItemRDD: org.apache.spark.rdd.RDD[(Int, OrderItem)] = ParallelCollectionRDD[215]
at makeRDD at <console>:27
scala> userRDD.cogroup(orderItemRDD).collect
res110: Array[(Int, (Iterable[String], Iterable[OrderItem]))] = Array((1,
(CompactBuffer(zhangsan),CompactBuffer(OrderItem(apple,4.5,2),
OrderItem(pear,1.5,2)))), (2,(CompactBuffer(lisi),CompactBuffer())))
scala> userRDD.groupWith(orderItemRDD).collect
res119: Array[(Int, (Iterable[String], Iterable[OrderItem]))] = Array((1,
(CompactBuffer(zhangsan),CompactBuffer(OrderItem(apple,4.5,2),
OrderItem(pear,1.5,2)))), (2,(CompactBuffer(lisi),CompactBuffer())))
12. cartesian(otherDataset) (for reference): Cartesian product

When called on datasets of types T and U, returns a dataset of (T, U)
pairs (all pairs of elements).

Computes the Cartesian product of the two datasets.

scala> var rdd1:RDD[Int]=sc.makeRDD(List(1,2,4))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[238] at makeRDD at
<console>:25
scala> var rdd2:RDD[String]=sc.makeRDD(List("a","b","c"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[239] at makeRDD at
<console>:25
scala> rdd1.cartesian(rdd2).collect
res120: Array[(Int, String)] = Array((1,a), (1,b), (1,c), (2,a), (2,b), (2,c), (4,a),
(4,b), (4,c))
13. coalesce(numPartitions): shrink the number of partitions (cannot increase it)

Decrease the number of partitions in the RDD to numPartitions. Useful
for running operations more efficiently after filtering down a large
dataset.

After large-scale filtering, coalesce can be used to shrink an RDD's partitions (it can only reduce the partition count, not increase it).

scala> var rdd1:RDD[Int]=sc.makeRDD(0 to 100)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[252] at makeRDD at
<console>:25
scala> rdd1.getNumPartitions
res129: Int = 6
scala> rdd1.filter(n=> n%2 == 0).coalesce(3).getNumPartitions
res127: Int = 3
scala> rdd1.filter(n=> n%2 == 0).coalesce(12).getNumPartitions
res128: Int = 6
14. repartition(numPartitions): resize the number of partitions (can increase it)

Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

Similar to coalesce, but this operator can both increase and decrease the number of partitions of the RDD.

scala> var rdd1:RDD[Int]=sc.makeRDD(0 to 100)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[252] at makeRDD at
<console>:25
scala> rdd1.getNumPartitions
res129: Int = 6
scala> rdd1.filter(n=> n%2 == 0).repartition(12).getNumPartitions
res130: Int = 12
scala> rdd1.filter(n=> n%2 == 0).repartition(3).getNumPartitions
res131: Int = 3
15. repartitionAndSortWithinPartitions(partitioner) (for reference)

Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.

This operator repartitions the RDD using a user-provided partitioner and then sorts the data inside each partition by key.

scala> case class User(name:String,deptNo:Int)
defined class User

scala> import org.apache.spark.Partitioner

scala> var empRDD:RDD[User]= sc.parallelize(List(User("张三",1),User("lisi",2),User("wangwu",1)))

scala> empRDD.map(t => (t.deptNo, t.name)).repartitionAndSortWithinPartitions(new Partitioner {
  override def numPartitions: Int = 4
  override def getPartition(key: Any): Int = {
    // the mask must be applied before the modulo (% binds tighter than &)
    (key.hashCode() & Integer.MAX_VALUE) % numPartitions
  }
}).mapPartitionsWithIndex((p,values)=> {
  println(p+"\t"+values.mkString("|"))
  values
}).collect()
Food for thought

1. If two very large files need to be joined, what optimization strategies can be applied? One possible approach is sketched below.
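Since neither input is small enough to broadcast, one common strategy is to pre-partition both RDDs with the same partitioner and persist the partitioned results, so the subsequent join reuses the co-partitioning instead of reshuffling both sides again. A hedged sketch with assumed file paths and key extraction:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.storage.StorageLevel

    // hypothetical key-value RDDs built from the two large files
    val left  = sc.textFile("hdfs:///demo/big_a").map(line => (line.split(",")(0), line))
    val right = sc.textFile("hdfs:///demo/big_b").map(line => (line.split(",")(0), line))

    val partitioner = new HashPartitioner(200)

    // shuffle each side once with the same partitioner and cache the result;
    // the join then reuses the co-partitioning instead of reshuffling both inputs
    val leftPart  = left.partitionBy(partitioner).persist(StorageLevel.MEMORY_AND_DISK)
    val rightPart = right.partitionBy(partitioner).persist(StorageLevel.MEMORY_AND_DISK)

    val joined = leftPart.join(rightPart)   // an action (e.g. saveAsTextFile) triggers the job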

xxxByKey operators (important)

For datasets of type RDD[(K,V)], Spark provides dedicated xxxByKey operators that implement computations specifically for key-value RDDs.

1. groupByKey([numPartitions]): the value for each key becomes a collection of values

When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.

Similar to the MapReduce computation model: converts RDD[(K, V)] into RDD[(K, Iterable[V])].

scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at
<console>:24
scala> lines.flatMap(_.split("\\s+")).map((_,1)).groupByKey.collect
res3: Array[(String, Iterable[Int])] = Array((this,CompactBuffer(1)),
(is,CompactBuffer(1)), (good,CompactBuffer(1, 1)))
  • groupBy(f:(k,v)=> T)

    scala> var lines=sc.parallelize(List("this is good good"))
    lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at
    <console>:24
    scala> lines.flatMap(_.split("\\s+")).map((_,1)).groupBy(t=>t._1)
    res5: org.apache.spark.rdd.RDD[(String, Iterable[(String, Int)])] = ShuffledRDD[18] at
    groupBy at <console>:26
    scala> lines.flatMap(_.split("\\s+")).map((_,1)).groupBy(t=>t._1).map(t=>
    (t._1,t._2.size)).collect
    res6: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
    

2. reduceByKey(func, [numPartitions])

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

When called on an RDD of (k, v) pairs, returns an RDD of (k, v) pairs in which the values of identical keys are aggregated; a combining function of type (v, v) => v must be supplied.

scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at
<console>:24

scala> lines.flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).collect
res8: Array[(String, Int)] = Array((this,1), (is,1), (good,2))

3. aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions]): curried; zeroValue is the initial value, seqOp aggregates within each partition, combOp merges the partial results

When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at
<console>:24
scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)(_+_,_+_).collect
res9: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
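The point of the curried form is that seqOp and combOp may differ and the aggregated type U may differ from the value type. A small sketch (not from the original shell session) that computes a per-key (sum, count) pair in one pass:

    val pairs = sc.makeRDD(List(("good", 1), ("good", 4), ("is", 2)), 2)

    // zeroValue (0, 0) is the per-key starting (sum, count)
    val sumCount = pairs.aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),      // seqOp: fold one value into the partition-local accumulator
      (a, b)   => (a._1 + b._1, a._2 + b._2)     // combOp: merge accumulators from different partitions
    )

    sumCount.collect()   // e.g. Array((good,(5,2)), (is,(2,1)))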

4. sortByKey([ascending], [numPartitions])

When ascending is true the result is sorted by key in ascending order; when false, in descending order.

When called on a dataset of (K, V) pairs where K implements Ordered,
returns a dataset of (K, V) pairs sorted by keys in ascending or
descending order, as specified in the boolean ascending argument.

Sorts the (k, v) pairs computed for each key by the key.

scala> var lines=sc.parallelize(List("this is good good"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at
<console>:24 

scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)
(_+_,_+_).sortByKey(true).collect
res13: Array[(String, Int)] = Array((good,2), (is,1), (this,1))

scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)
(_+_,_+_).sortByKey(false).collect
res14: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
  • sortBy(T=>U,ascending,[ numPartitions ])

    scala> var lines=sc.parallelize(List("this is good good"))
    lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at
    <console>:24
    
    scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)
    (_+_,_+_).sortBy(_._2,false).collect
    res18: Array[(String, Int)] = Array((good,2), (this,1), (is,1))
    
    scala> lines.flatMap(_.split("\\s+")).map((_,1)).aggregateByKey(0)
    (_+_,_+_).sortBy(t=>t,false).collect
    res19: Array[(String, Int)] = Array((this,1), (is,1), (good,2))
    

Actions (动作算子)

Every Spark computation job has exactly one action operator, which triggers the execution of the job and either writes the RDD's data to an external system or returns it to the Driver program.
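A minimal sketch of this behaviour: the map below only records lineage, and the side effect runs only once the action fires.

    val nums = sc.makeRDD(List(1, 2, 3, 4))

    // nothing executes here: map just records the transformation
    val doubled = nums.map { n =>
      println(s"processing $n")   // runs inside the tasks once a job is triggered
      n * 2
    }

    doubled.count()   // the action triggers the job; only now do the println calls run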

1. reduce(func)

Aggregate the elements of the dataset using a function func (which
takes two arguments and returns one). The function should be
commutative and associative so that it can be computed correctly in
parallel.

This operator aggregates the remote results and returns the final value to the Driver. Here it counts the characters in a file.

scala> sc.textFile("file:///root/t_word").map(_.length).reduce(_+_)
res3: Int = 64

2. collect()

Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

Transfers the data of the remote RDD to the Driver. collect should normally be used only in test environments or when the RDD's data is very small; otherwise the Driver may run out of memory because the data is too large.

scala> sc.textFile("file:///root/t_word").collect
res4: Array[String] = Array(this is a demo, hello spark, "good good study ", "day day up ", come on baby)

3. foreach(func)

Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.

Runs the function func on each element of the dataset. This is usually done for its side effects, such as updating an accumulator or interacting with an external storage system.

scala> sc.textFile("file:///root/t_word").foreach(line=>println(line))

4. count()

Return the number of elements in the dataset.

Returns the number of elements in the RDD.

scala> sc.textFile("file:///root/t_word").count()
res7: Long = 5

5. first(): return the first element | 6. take(n): return the first n elements

Return the first element of the dataset (similar to take(1)). take(n)
Return an array with the first n elements of the dataset.

scala> sc.textFile("file:///root/t_word").first
res9: String = this is a demo

scala> sc.textFile("file:///root/t_word").take(1)
res10: Array[String] = Array(this is a demo)

scala> sc.textFile("file:///root/t_word").take(2)
res11: Array[String] = Array(this is a demo, hello spark)

7. takeSample(withReplacement, num, [seed]): random sampling

Return an array with a random sample of num elements of the dataset,
with or without replacement, optionally pre-specifying a random number
generator seed.

Randomly samples num elements from the RDD and returns them to the Driver program, which is the key difference from the sample transformation.

scala> sc.textFile("file:///root/t_word").takeSample(false,2)
res20: Array[String] = Array("good good study ", hello spark)

8. takeOrdered(n, [ordering])

Return the first n elements of the RDD using either their natural
order or a custom comparator.

Returns the first n elements of the RDD; a custom ordering can be supplied.

scala> case class User(name:String,deptNo:Int,salary:Double)
defined class User

scala> var
userRDD=sc.parallelize(List(User("zs",1,1000.0),User("ls",2,1500.0),User("ww",2,1000.0
)))
userRDD: org.apache.spark.rdd.RDD[User] = ParallelCollectionRDD[51] at parallelize at
<console>:26

scala> userRDD.takeOrdered
	def takeOrdered(num: Int)(implicit ord: Ordering[User]): Array[User]

scala> userRDD.takeOrdered(3)
<console>:26: error: No implicit Ordering defined for User.
userRDD.takeOrdered(3)

scala> implicit var userOrder=new Ordering[User]{
| override def compare(x: User, y: User): Int = {
| if(x.deptNo!=y.deptNo){
| x.deptNo.compareTo(y.deptNo)
| }else{
| x.salary.compareTo(y.salary) * -1
| }
| }
| }
userOrder: Ordering[User] = $anon$1@7066f4bc

scala> userRDD.takeOrdered(3)
res23: Array[User] = Array(User(zs,1,1000.0), User(ls,2,1500.0), User(ww,2,1000.0))

9. saveAsTextFile(path): save as text

Write the elements of the dataset as a text file (or set of text
files) in a given directory in the local filesystem, HDFS or any other
Hadoop-supported file system. Spark will call toString on each element
to convert it to a line of text in the file.

Spark calls toString on each RDD element and writes each element as a line of text to the file.

scala> sc.textFile("file:///root/t_word").flatMap(_.split("")).map((_,1)).reduceByKey(_+_).sortBy(_._1,true,1).map(t=>t._1+"\t"+t._2).saveAsTextFile("hdfs:///demo/results02")

10. saveAsSequenceFile(path): save as a Hadoop SequenceFile

Write the elements of the dataset as a Hadoop SequenceFile in a given
path in the local filesystem, HDFS or any other Hadoop-supported file
system. This is available on RDDs of key-value pairs that implement
Hadoop’s Writable interface. In Scala, it is also available on types
that are implicitly convertible to Writable (Spark includes
conversions for basic types like Int, Double, String, etc).

This method can only be used on RDD[(K,V)], and both K and V must implement the Writable interface. Since we program in Scala, Spark provides implicit conversions that automatically turn basic types such as Int, Double, and String into Writable.

scala> sc.textFile("file:///root/t_word").flatMap(_.split("
")).map((_,1)).reduceByKey(_+_).sortBy(_._1,true,1).saveAsSequenceFile("hdfs:///demo/r
esults03")
scala> sc.sequenceFile[String,Int]("hdfs:///demo/results03").collect
res29: Array[(String, Int)] = Array((a,1), (baby,1), (come,1), (day,2), (demo,1),
(good,2), (hello,1), (is,1), (on,1), (spark,1), (study,1), (this,1), (up,1))