Spark RDD API Reference Examples (Part 3)

Latest recommended article published on 2020-04-04 09:25:11

xumuteqiu

Category: Spark. Tags: spark, api, rdd, functions

Article link: https://blog.csdn.net/u010472512/article/details/72923278

This article is based on material by Zhen He.

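28. getCheckpointFile

Prototype
def getCheckpointFile: Option[String]

Description
getCheckpointFile returns the path of the directory to which the RDD was checkpointed, or None if the RDD has not been checkpointed yet. Checkpointing is mainly useful in long computations, where recovery can restart from the saved data instead of recomputing the whole lineage.

Example

// Set the checkpoint directory. Spark creates it if it does not exist; on a cluster it must be an HDFS (or other shared) path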
sc.setCheckpointDir("hdfs://192.168.10.71:9000/wc")
val a = sc.parallelize(1 to 500, 5)
val b = a++a++a++a++a

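// Query b's checkpoint file path
b.getCheckpointFile
// No checkpoint file exists yet
res5: Option[String] = None

// Mark b for checkpointing; nothing is written yet because RDDs are evaluated lazily
b.checkpoint
b.getCheckpointFile
res10: Option[String] = None

// The checkpoint is actually written only when an action runs
b.collect

// Get the path of the checkpoint file written above
b.getCheckpointFile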
res15: Option[String] = Some(hdfs://192.168.10.71:9000/wc/e7f2340a-b37b-4d97-8b48-58253e6e4464/rdd-133)

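29. getStorageLevel

Prototype
def getStorageLevel: StorageLevel

Description
getStorageLevel returns the RDD's current storage level. Once a storage level has been assigned, assigning a different one throws an UnsupportedOperationException; the RDD has to be unpersisted first.

Example

val a = sc.parallelize(1 to 100000, 2)
// cache() assigns MEMORY_ONLY; an RDD that has never been persisted reports StorageLevel.NONE instead
a.cache
// The level below means: kept in memory, deserialized, 1 replica
a.getStorageLevel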
res1: org.apache.spark.storage.StorageLevel = StorageLevel(memory, deserialized, 1 replicas)

// A storage level can also be chosen explicitly up front
val a = sc.parallelize(1 to 100000, 2)
a.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
a.getStorageLevel
// The level below means: stored on disk, 1 replica
res2: org.apache.spark.storage.StorageLevel = StorageLevel(disk, 1 replicas)

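30. glom

Prototype
def glom(): RDD[Array[T]]

Description
glom gathers all elements of each partition into one array and returns an RDD whose elements are those per-partition arrays.

Example

val a = sc.parallelize(1 to 10, 3)
a.glom.collect
// Each inner array holds the contents of one partition; collect wraps them in an outer array
res1: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))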
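
The partition sizes 3, 3, and 4 shown above can be reproduced without a cluster. The following is a minimal plain-Scala sketch, assuming that parallelize cuts a sequence into contiguous slices by integer index arithmetic. GlomModel and slice are hypothetical names used only for illustration, and the code is a model that agrees with the output above, not Spark's own implementation.

```scala
// Plain-Scala model, no Spark required. GlomModel and slice are illustrative names.
object GlomModel {
  // Cut `data` into `numSlices` contiguous slices; slice i covers the index
  // range [i * n / numSlices, (i + 1) * n / numSlices), using integer division.
  def slice[T](data: Seq[T], numSlices: Int): Vector[Vector[T]] = {
    val n = data.length.toLong
    (0 until numSlices).toVector.map { i =>
      val start = ((i * n) / numSlices).toInt
      val end = (((i + 1) * n) / numSlices).toInt
      data.slice(start, end).toVector
    }
  }
}

// GlomModel.slice(1 to 10, 3)
// Vector(Vector(1, 2, 3), Vector(4, 5, 6), Vector(7, 8, 9, 10))
```

Because glom exposes whole partitions, it is handy for per-partition work; for instance, a.glom.map(_.max) would yield one maximum per partition.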

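31. groupBy

Prototype
def groupBy[K: ClassTag](f: T => K): RDD[(K, Iterable[T])]
def groupBy[K: ClassTag](f: T => K, numPartitions: Int): RDD[(K, Iterable[T])]

Description
groupBy groups the elements of the RDD by the key that the given function returns for each element. The numPartitions argument, when given, sets the number of partitions of the result.

Example

val a = sc.parallelize(1 to 9, 3)
// The first argument of groupBy is a function that defines the grouping; its return value becomes the group's label
// Here the function returns "even" or "odd"
a.groupBy(x => { if (x % 2 == 0) "even" else "odd" }).collect
res1: Array((even,CompactBuffer(2, 8, 4, 6)), (odd,CompactBuffer(5, 1, 3, 7, 9)))

// Here the labels are 0, 1, and 2
a.groupBy(x =>(x % 3)).collect
res2: Array((0,CompactBuffer(3, 9, 6)), (1,CompactBuffer(4, 1, 7)), (2,CompactBuffer(2, 8, 5)))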

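// Group with a separately defined function
val a = sc.parallelize(1 to 9, 3)
def myfunc(a: Int): Int = a % 2
// The second argument of groupBy sets how many partitions the grouped result is stored in; if it is omitted, Spark picks a default, normally the parent RDD's partition count unless spark.default.parallelism is set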
a.groupBy(x => myfunc(x), 3).collect
res2: Array((0,ArrayBuffer(2, 4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7, 9)))
a.groupBy(x => myfunc(x), 3).partitions.length
res4: Int = 3

// Request a single partition for the result
a.groupBy(myfunc(_), 1).collect
res3: Array((0,ArrayBuffer(2, 4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7, 9)))
a.groupBy(myfunc(_), 1).partitions.length
res5: Int = 1

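32. groupByKey [Pair]

Prototype
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]

Description
groupByKey is very similar to groupBy, but it takes no function: it groups the values of a pair RDD by their existing key, so all values that share a key end up in one group. This makes it simpler to use than groupBy.

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
// Build tuples with the word length as key and the word as value
val b = a.keyBy(_.length)
// groupByKey takes no function; it groups directly by the existing key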
b.groupByKey.collect
res1: Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle)))
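
The relationship between groupBy and groupByKey can be sketched on ordinary Scala collections: grouping by a function amounts to keying each element by that function and then grouping by key. The snippet below is a plain-Scala model under that assumption; GroupModel and its two methods are hypothetical names, and the real operations additionally shuffle data between partitions, which this model ignores.

```scala
// Plain-Scala model, no Spark required. GroupModel is an illustrative name.
object GroupModel {
  // Model of groupByKey: collect the values of all pairs that share a key.
  def groupByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Model of groupBy(f): key every element by f(x), then group by that key.
  def groupBy[K, T](data: Seq[T])(f: T => K): Map[K, Seq[T]] =
    groupByKey(data.map(x => (f(x), x)))
}

// GroupModel.groupBy(List("dog", "tiger", "lion", "cat", "spider", "eagle"))(_.length)
// contains 3 -> List(dog, cat), 5 -> List(tiger, eagle), 4 -> List(lion), 6 -> List(spider)
```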

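33. histogram [Double]

Prototype
def histogram(bucketCount: Int): (Array[Double], Array[Long])
def histogram(buckets: Array[Double], evenBuckets: Boolean = false): Array[Long]

Description
histogram computes a histogram of the values in an RDD of Doubles. The first form takes the number of buckets, spaces them evenly between the RDD's minimum and maximum, and returns both the bucket boundaries and the count in each bucket. The second form takes the bucket boundaries from the caller and returns only the counts. Every bucket is half-open, [lower, upper), except the last, which also includes its upper bound.

Example

// Let Spark derive the boundaries from the requested number of buckets
val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.histogram(6)
// 6 buckets need 7 boundary points; the first array holds the boundaries, the second the counts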
res1: (Array[Double], Array[Long]) = (Array(1.0, 2.5, 4.0, 5.5, 7.0, 8.5, 10.0),Array(6, 0, 1, 1, 3, 4))

// Use boundaries supplied by the caller; values outside [0.0, 8.0] (here 8.8 and 9.0) are not counted
val a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 9.0), 3)
a.histogram(Array(0.0, 3.0, 8.0))
res2: Array[Long] = Array(5, 3)
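
Both results above follow from a simple bucketing rule, which can be checked by hand with the sketch below. It is a plain-Scala model, assuming half-open buckets with a closed last bucket and evenly spaced boundaries between the minimum and maximum. HistogramModel, counts, and evenBuckets are hypothetical names, and Spark's actual implementation may differ in its details.

```scala
// Plain-Scala model, no Spark required. HistogramModel is an illustrative name.
object HistogramModel {
  // Count values per bucket. Bucket i is [buckets(i), buckets(i + 1)),
  // except the last bucket, which also includes its upper bound.
  // Values outside [buckets.head, buckets.last] are skipped.
  def counts(data: Seq[Double], buckets: Seq[Double]): Vector[Long] = {
    val n = buckets.length - 1
    val result = Array.fill(n)(0L)
    for (x <- data if x >= buckets.head && x <= buckets.last) {
      // Index of the last boundary <= x, clamped so the maximum lands in the last bucket.
      val i = math.min(buckets.lastIndexWhere(_ <= x), n - 1)
      result(i) += 1
    }
    result.toVector
  }

  // Evenly spaced boundaries between the minimum and maximum of the data.
  def evenBuckets(data: Seq[Double], bucketCount: Int): Vector[Double] = {
    val (lo, hi) = (data.min, data.max)
    val step = (hi - lo) / bucketCount
    (0 until bucketCount).toVector.map(i => lo + i * step) :+ hi
  }
}

// val data = Seq(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5)
// HistogramModel.evenBuckets(data, 6)  // Vector(1.0, 2.5, 4.0, 5.5, 7.0, 8.5, 10.0)
// HistogramModel.counts(data, HistogramModel.evenBuckets(data, 6))  // Vector(6, 0, 1, 1, 3, 4)
```

Note that 5.5 falls into the bucket [5.5, 7.0) rather than [4.0, 5.5), and 10.0 is counted only because the last bucket is closed.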

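34. id

Prototype
val id: Int

Description
id is the number that the system assigns to the RDD. It is unique within the SparkContext, so it can be used to identify a particular RDD.

Example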

val y = sc.parallelize(1 to 10, 10)
y.id
res1: Int = 19

35. intersection

Prototype
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
def intersection(other: RDD[T]): RDD[T]

Description
intersection returns the elements that occur in both RDDs, that is, their set intersection. The result contains no duplicates, even if the inputs do.

Example

// Intersection of plain elements
val x = sc.parallelize(1 to 20)
val y = sc.parallelize(10 to 30)
val z = x.intersection(y)
// Elements common to both RDDs
z.collect
res1: Array[Int] = Array(16, 12, 20, 13, 17, 14, 18, 10, 19, 15, 11)

// Intersection of two RDDs of tuples
val x = sc.parallelize(List(("cat",2),("wolf",1),("gnu",1)))
val y = sc.parallelize(List(("cat",1),("wolf",1),("mouse",1)))
val z = x.intersection(y)
z.collect
// Two tuples count as the same element only if they are equal in every field
res2: Array[(String, Int)] = Array((wolf,1))
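
The set semantics of intersection, including the removal of duplicates, can be modeled with ordinary Scala sets. This is a minimal sketch only; IntersectionModel is a hypothetical name, and the model says nothing about how Spark distributes the work or orders the result.

```scala
// Plain-Scala model, no Spark required. IntersectionModel is an illustrative name.
object IntersectionModel {
  // Elements present in both inputs; each one is reported once,
  // no matter how often it occurs in either input.
  def intersection[T](x: Seq[T], y: Seq[T]): Set[T] =
    x.toSet intersect y.toSet
}

// IntersectionModel.intersection(Seq(1, 1, 2, 3), Seq(1, 1, 3, 4))  // Set(1, 3)
```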

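36. isCheckpointed

Prototype
def isCheckpointed: Boolean

Description
isCheckpointed reports whether the RDD has already been checkpointed, that is, whether its checkpoint data has actually been written.

Example

// Create an RDD and set the checkpoint directory
val c = sc.parallelize(1 to 10)
sc.setCheckpointDir("hdfs://192.168.10.71:9000/wc")
c.isCheckpointed
res1: Boolean = false

// checkpoint is lazy: it only marks the RDD, and the data is written when an action runs
c.checkpoint
c.isCheckpointed
res2: Boolean = false

// Run an action, which writes the checkpoint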
c.collect
c.isCheckpointed
res3: Boolean = true

37. join [Pair]

Prototype
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

Description
join performs an inner join of two key-value RDDs, like an inner join in a database. A key appears in the result only if it is present in both RDDs, and every matching combination of a left value and a right value produces one output element.

Example

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
// Elements whose keys are equal are joined
b.join(d).collect

res0: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))
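
The 14 rows above are the product of the matches per key: 2 left values times 3 right values for key 6, plus 2 times 4 for key 3. The sketch below models this with a for-comprehension on ordinary Scala sequences. JoinModel is a hypothetical name, and the nested loop only illustrates the result; Spark joins by shuffling and co-grouping the data instead.

```scala
// Plain-Scala model, no Spark required. JoinModel is an illustrative name.
object JoinModel {
  // Inner join: one output element per matching (left, right) combination.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v) <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))
}

// val b = Seq("dog", "salmon", "salmon", "rat", "elephant").map(w => (w.length, w))
// val d = Seq("dog", "cat", "gnu", "salmon", "rabbit", "turkey", "wolf", "bear", "bee").map(w => (w.length, w))
// JoinModel.join(b, d).size  // 14; keys 8 and 4 have no partner and are dropped
```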

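38. keyBy

Prototype
def keyBy[K](f: T => K): RDD[(K, T)]

Description
keyBy applies a user-supplied function to each element to produce its key, turning every element x into the tuple (f(x), x).

Example

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
// Use the length of each word as the key of the element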
val b = a.keyBy(_.length)
b.collect
res1: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant))

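39. keys [Pair]

Prototype
def keys: RDD[K]

Description
keys returns an RDD containing the key of every tuple in a pair RDD. Duplicate keys are kept.

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.keys.collect
// Keys may appear more than once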
res2: Array[Int] = Array(3, 5, 4, 3, 7, 5)

40. leftOuterJoin [Pair]

Prototype
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
def leftOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, Option[W]))]
def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))]

Description
leftOuterJoin works like a left outer join in a database: every element of the left RDD appears in the result. Where the right RDD has no matching key, the right-hand value is None, and keys that occur only in the right RDD are dropped.

Example

val a = sc.parallelize(List(("dog",2),("salmon",2),("rat",1),("elephant",10)),3)
val b = sc.parallelize(List(("dog",2),("salmon",2),("rabbit",1),("cat",7)), 3)
a.leftOuterJoin(b).collect

// Every key of the left RDD is in the result; keys found only in the right RDD are dropped
res1: Array((rat,(1,None)), (salmon,(2,Some(2))), (elephant,(10,None)), (dog,(2,Some(2))))
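
The Option in the result type is what separates matched from unmatched left elements. The following plain-Scala sketch models that behavior on ordinary sequences; LeftOuterJoinModel is a hypothetical name, and the model keeps the left input order, whereas the order of a collected Spark result depends on partitioning.

```scala
// Plain-Scala model, no Spark required. LeftOuterJoinModel is an illustrative name.
object LeftOuterJoinModel {
  // Keep every left element; pair it with Some(w) for each matching right
  // value, or with None when the right side has no such key.
  def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    left.flatMap { case (k, v) =>
      val matches = right.collect { case (`k`, w) => w }
      if (matches.isEmpty) Seq((k, (v, Option.empty[W])))
      else matches.map(w => (k, (v, Some(w): Option[W])))
    }
}

// val a = Seq(("dog", 2), ("salmon", 2), ("rat", 1), ("elephant", 10))
// val b = Seq(("dog", 2), ("salmon", 2), ("rabbit", 1), ("cat", 7))
// LeftOuterJoinModel.leftOuterJoin(a, b)
// List((dog,(2,Some(2))), (salmon,(2,Some(2))), (rat,(1,None)), (elephant,(10,None)))
```

A common follow-up is to replace the missing side with a default, for example with w.getOrElse(0) on the Option value.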

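41. lookup

Prototype
def lookup(key: K): Seq[V]

Description
lookup returns all values stored under the given key. If the RDD has a known partitioner, only the partition that the key maps to is searched; otherwise every partition is scanned.

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.lookup(5)
// b has no partitioner here, so all partitions are scanned for key = 5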
res1: Seq[String] = WrappedArray(tiger, eagle)
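
Why a known partitioner helps can be sketched in plain Scala: if pairs are placed by the hash of their key, a lookup only needs to search the one partition that the key hashes to. The model below assumes hash partitioning with a non-negative modulo. LookupModel and its methods are hypothetical names, and the code illustrates the idea, not Spark's implementation.

```scala
// Plain-Scala model, no Spark required. LookupModel is an illustrative name.
object LookupModel {
  // Non-negative modulo, so negative hash codes still map to a valid partition index.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Place each pair in the partition chosen by the hash of its key.
  def partitionByHash[K, V](pairs: Seq[(K, V)], n: Int): Map[Int, Seq[(K, V)]] =
    pairs.groupBy { case (k, _) => partitionOf(k, n) }

  // With a known partitioner, only the partition the key maps to is searched.
  def lookup[K, V](parts: Map[Int, Seq[(K, V)]], n: Int, key: K): Seq[V] =
    parts.getOrElse(partitionOf(key, n), Seq.empty).collect { case (`key`, v) => v }
}

// val pairs = Seq("dog", "tiger", "lion", "cat", "panther", "eagle").map(w => (w.length, w))
// val parts = LookupModel.partitionByHash(pairs, 2)
// LookupModel.lookup(parts, 2, 5)  // List(tiger, eagle); only partition 1 is searched
```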