My Big Data Journey - Spark RDD Operations

 

Spark RDD operations fall into two types:

  • transformations

In Spark, all RDD transformations are lazy: they do not compute their results right away. Spark only records the transformations to be applied to the base dataset.

The computation is executed only when an action requires a result to be returned to the driver.
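A minimal sketch of this laziness, assuming a spark-shell session with the usual SparkContext sc (the variable names below are illustrative, not from the original transcript):

val nums = sc.parallelize(1 to 10)    // source RDD: nothing is computed yet
val tripled = nums.map(_ * 3)         // map is only recorded in the lineage, no job runs here
// Up to this point Spark has only remembered the transformations.
val result = tripled.collect()        // collect is an action: only now is a job actually executed
// result: Array(3, 6, ..., 30), computed when the action was called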

 

Common transformations

map(func): returns a new RDD formed by passing each element of the source through the function func.
filter(func): returns a new RDD formed by the elements of the source on which func returns true.
flatMap(func): similar to map(func), but each input element can be mapped to zero or more output elements (so func should return a sequence rather than a single element).
sample(withReplacement, fraction, seed): samples a fraction of the data, with or without replacement, using the given random seed.
union(otherDataset): returns a new dataset containing the union of the source dataset A and the argument dataset B (duplicates are not removed).
intersection(otherDataset): returns a new dataset containing the intersection of the source dataset A and the argument dataset B (a sketch is shown after the union example below).

map(func):

scala> val source = sc.parallelize(1 to 10)
source: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> source.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> val mrdd = source.map(_*3)
mrdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:25

scala> mrdd.collect
res1: Array[Int] = Array(3, 6, 9, 12, 15, 18, 21, 24, 27, 30)

filter(func):

scala> val sourceFilter = sc.parallelize(Array("GuangDong" , "GuangXi","XiZang","HuNan","HuBei"))
sourceFilter: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at parallelize at <console>:24                    

scala> val filtered = sourceFilter.filter(_.contains("Guang"))
filtered: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at filter at <console>:25

scala> filtered.collect
res2: Array[String] = Array(GuangDong, GuangXi)

flatMap(func):

scala> val flatSource = sc.parallelize(1 to 5)
flatSource: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> flatSource.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val flatMap = flatSource.flatMap(x => (1 to x))
flatMap: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at flatMap at <console>:25

scala> flatMap.collect
res1: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5)
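Another common use of flatMap, shown as a minimal sketch (the lines RDD below is illustrative, not from the original transcript), is splitting strings into their parts:

val lines = sc.parallelize(Seq("hello spark", "hello rdd"))
lines.flatMap(_.split(" ")).collect()
// expected result: Array(hello, spark, hello, rdd)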

sample(withReplacement, fraction, seed):

scala> val sampleData = sc.parallelize(1 to 100)
sampleData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> sampleData.sample(false, 0.1, 5)
res1: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[1] at sample at <console>:26

scala> res1.collect
res2: Array[Int] = Array(20, 21, 48, 62, 64, 89, 90, 99, 100)

scala> sampleData.sample(true,0.1,5)
res3: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[2] at sample at <console>:26

scala> res3.collect
res4: Array[Int] = Array(21, 49, 49, 63, 74, 90, 91, 93, 96)
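Note that fraction is only the expected proportion of the data, not an exact count, so the number of sampled elements varies from run to run; and with withReplacement = true the same element can be drawn more than once, as with 49 in the output above.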

union(otherDataset):

scala> val srcData = sc.parallelize(1 to 5)
srcData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> srcData.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val targetData = sc.parallelize(4 to 8)
targetData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> targetData.collect
res1: Array[Int] = Array(4, 5, 6, 7, 8)

scala> val unionResult = srcData.union(targetData)
unionResult: org.apache.spark.rdd.RDD[Int] = UnionRDD[2] at union at <console>:27

scala> unionResult.collect
res2: Array[Int] = Array(1, 2, 3, 4, 5, 4, 5, 6, 7, 8)
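intersection(otherDataset) from the table above is not shown in the original transcript; a minimal sketch reusing the srcData and targetData RDDs defined above:

val intersected = srcData.intersection(targetData)
intersected.collect()
// expected result: Array(4, 5) (element order may vary, since intersection shuffles the data)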

 

  • actions

Common actions

reduce(func): aggregates all elements of the RDD using the function func, which must be commutative and associative so it can be computed in parallel.
collect(): returns all elements of the dataset to the driver program as an array.
count(): returns the number of elements in the dataset.
first(): returns the first element of the RDD (similar to take(1)).
take(n): returns the first n elements of the dataset as an array.
takeSample(withReplacement, num, [seed]): returns an array of num elements sampled at random from the dataset.
takeOrdered(n, [ordering]): returns the first n elements of the RDD in sorted order.
saveAsTextFile(path): writes the elements of the dataset as a text file to the local file system, HDFS, or any other supported file system.
countByKey(): for an RDD of (K, V) pairs, returns a Map of (K, Long) giving the count of each key.


reduce(func):

scala> val data = sc.makeRDD(1 to 10)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:24

scala> data.reduce(_+_)
res0: Int = 55

collect():

scala> val data = sc.makeRDD(1 to 10)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:24

scala> data.collect
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

count():

scala> val data = sc.makeRDD(1 to 10)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:24

scala> data.count()
res2: Long = 10

first()、take(n):

scala> data.first()
res3: Int = 1

scala> data.take(5)
res4: Array[Int] = Array(1, 2, 3, 4, 5)

takeSample(withReplacement, num, [seed]):

scala> val rdd = sc.parallelize(1 to 100)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> rdd.takeSample(true,4,100)
res7: Array[Int] = Array(23, 68, 44, 98)

scala> rdd.takeSample(false,4,100)
res8: Array[Int] = Array(87, 77, 80, 7)

takeOrdered(n, [ordering]):

scala> val randomOrd = sc.makeRDD(Seq(10,1,3,5,95,18,12))
randomOrd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at makeRDD at <console>:24

scala> randomOrd.takeOrdered(5)
res16: Array[Int] = Array(1, 3, 5, 10, 12)
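The optional ordering argument is an implicit Ordering; a minimal sketch of passing one explicitly (not from the original transcript) to take the largest elements instead:

randomOrd.takeOrdered(3)(Ordering[Int].reverse)
// expected result: Array(95, 18, 12)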

saveAsTextFile(path):

scala> val testData = sc.makeRDD(1 to 10,2)
testData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at makeRDD at <console>:24

scala> testData.saveAsTextFile("hdfs://hadoop129:9000/fengling/testData")
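A quick way to verify the write is to read the path back, as a minimal sketch (assuming the same HDFS path is readable from this shell):

sc.textFile("hdfs://hadoop129:9000/fengling/testData").count()
// expected result: 10; since testData has 2 partitions, the output path is a directory with one part file per partition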

countByKey():

scala> val rdd = sc.parallelize(List(("a",1),("b",10),("a",5),("b",9),("c",95)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[16] at parallelize at <console>:24

scala> rdd.countByKey
res20: scala.collection.Map[String,Long] = Map(b -> 2, a -> 2, c -> 1)

 
