Spark Series (3): Spark RDD Programming and Operator Overview

程序员劝退师丶

Last modified 2022-08-22 11:40:26

First published 2022-07-24 11:23:05

Article link: https://blog.csdn.net/qq_38130094/article/details/125881425

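1. What is a Spark RDD

An RDD (Resilient Distributed Dataset) is the most basic data processing model in Spark. In code it is an abstract class that represents a resilient, immutable, partitionable collection whose elements can be computed in parallel:

Resilient:
- Resilient storage: data switches automatically between memory and disk;
- Resilient fault tolerance: lost data can be recovered automatically;
- Resilient computation: failed computations are retried;
- Resilient partitioning: data can be repartitioned as needed.
Distributed: the data is stored on different nodes of the cluster.
Dataset: an RDD encapsulates computation logic and does not store the data itself.
Data abstraction: RDD is an abstract class that concrete subclasses must implement.
Immutable: the computation logic an RDD encapsulates cannot be changed. To change it, you create a new RDD that encapsulates the new logic.
Partitionable, computed in parallel.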

The class comment of org.apache.spark.rdd.RDD lists five core properties:

Internally, each RDD is characterized by five main properties:
    A list of partitions
    A function for computing each split
    A list of dependencies on other RDDs
    Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
    Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

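Partition list: every RDD holds a list of partitions, which is what allows tasks to run in parallel. It is the key property behind distributed computation.
Compute function: Spark applies the compute function to each partition.
Dependencies between RDDs: an RDD encapsulates a computation model. When several computation models need to be combined, dependencies are established between the corresponding RDDs.
Partitioner (optional): for key-value data, a partitioner can be set to customize how the data is partitioned.
Preferred locations (optional): when computing a partition, Spark can choose among nodes according to their state, for example preferring a node that already holds the data.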
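Most of these properties can be inspected on a live RDD. The following is a minimal sketch, assuming a local Spark installation; the object name `RddPropertiesSketch` and the sample data are made up for illustration. The compute function (property 2) is invoked by Spark inside tasks, so it is only mentioned in a comment.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object RddPropertiesSketch {
  // Returns (partition count, dependency count, partitioner defined?)
  def run(): (Int, Int, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-properties")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
      // partitionBy attaches an explicit Partitioner (property 4)
      val partitioned = pairs.partitionBy(new HashPartitioner(3))

      // Property 1: the list of partitions
      val numPartitions = partitioned.partitions.length
      // Property 2: the compute function is called by Spark per partition inside a task
      // Property 3: dependencies on parent RDDs (here a single shuffle dependency)
      val numDependencies = partitioned.dependencies.length
      // Property 5: preferred locations; an in-memory collection usually has none
      val locations = pairs.preferredLocations(pairs.partitions(0))
      println(s"preferred locations of partition 0: $locations")

      (numPartitions, numDependencies, partitioned.partitioner.isDefined)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The partition count follows the `HashPartitioner(3)` argument rather than the 2 slices passed to `parallelize`, because `partitionBy` produces a new RDD with its own partition list.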

Spark provides three ways to create an RDD from data:

A collection
A local file
An HDFS file

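1.1 Creating an RDD from a local collection

To create an RDD from a collection, call SparkContext's parallelize() method on a collection in your program. Spark copies the data in the collection to the cluster to form a distributed dataset, which is an RDD. In other words, part of the collection goes to one node and other parts go to other nodes, and the distributed dataset can then be processed in parallel. An important parameter of parallelize() is the number of partitions the collection is split into. Spark runs one task per partition. By default Spark chooses the number of partitions from the cluster configuration, but you can also pass a second argument to parallelize() to set it explicitly, for example: parallelize(arr, 5)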


    import org.apache.spark.{SparkConf, SparkContext}

    // 1: set up the environment
    val sparkConf =
      new SparkConf().setMaster("local[*]").setAppName("spark")

    val sc = new SparkContext(sparkConf)
    // Create RDDs from a collection
    val rdd1 = sc.parallelize(List(1, 2, 3, 4))
    // makeRDD delegates to parallelize
    val rdd2 = sc.makeRDD(List(1, 2, 3, 4))
    rdd1.collect().foreach(println)
    rdd2.collect().foreach(println)

    // Shut down the environment
    sc.stop()

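Note: ordinary statements such as val arr = Array(1,2,3,4,5) and println(sum) run in the driver process and are not executed in parallel. The computation on the RDD created by parallelize, for example a reduce, runs in tasks on the worker nodes.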
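The statements named in the note do not appear in the listing above, so here is a small sketch that puts them together. The object name `DriverVsExecutorSketch` is an assumption for illustration, and in local mode the "workers" are simply threads inside the same JVM.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverVsExecutorSketch {
  def run(): Int = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("driver-vs-executor")
    val sc = new SparkContext(conf)
    try {
      // Driver side: an ordinary local array
      val arr = Array(1, 2, 3, 4, 5)
      // Split the array into 5 partitions; Spark runs one task per partition
      val rdd = sc.parallelize(arr, 5)
      // The additions run inside tasks; only the final value comes back to the driver
      val sum = rdd.reduce(_ + _)
      sum
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    // Driver side: printing the result
    println(run())
  }
}
```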

1.2 Creating an RDD from a local file or an HDFS file

var path = "D:\\hello.txt"
path = "hdfs://bigdata01:9000/test/hello.txt"
// Read the file; the second argument of textFile sets the minimum number of partitions of the resulting RDD
val rdd = sc.textFile(path, 2)

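SparkContext's textFile() method creates an RDD from a local file or an HDFS file. Each element of the RDD is one line of text from the file. textFile() also accepts directories, compressed files and wildcards. By default Spark creates one partition for each block of an HDFS file. The second argument of textFile() (minPartitions) can set the number of partitions manually, but only to a value larger than the number of blocks; a value smaller than the number of blocks has no effect.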
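To try textFile() without an HDFS cluster, the sketch below writes a small temporary file and reads it back. The file contents and the object name `TextFileSketch` are invented for the example, and it assumes the local file system is readable through Hadoop's local file support (on Windows this may need extra Hadoop setup).

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object TextFileSketch {
  // Returns (number of lines, number of partitions)
  def run(): (Long, Int) = {
    // Hypothetical sample file with three lines
    val tmp = Files.createTempFile("hello", ".txt")
    Files.write(tmp, "hello spark\nhello rdd\nhello world\n".getBytes(StandardCharsets.UTF_8))

    val conf = new SparkConf().setMaster("local[*]").setAppName("text-file")
    val sc = new SparkContext(conf)
    try {
      // The second argument is a minimum; Spark may create more partitions
      val rdd = sc.textFile(tmp.toUri.toString, 2)
      (rdd.count(), rdd.getNumPartitions)
    } finally {
      sc.stop()
      Files.deleteIfExists(tmp)
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```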

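1.3 Creating an RDD from another RDD

A new RDD is produced as the result of an operation on an existing RDD. See the later sections for details.

1.4 Creating an RDD directly (new)

Constructing an RDD directly with new is generally done only by the Spark framework itself.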

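2. Spark RDD operators

By the kind of data they process, operators fall into three groups:

Value type: operators on a single RDD of plain values
Double-value type: operators that combine two RDDs
Key-Value type: operators on an RDD of key-value pairs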

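Transformation and Action

Spark's operations on RDDs fall into two broad categories: Transformation and Action.

Transformation: converts the data in an RDD, mainly by creating a new RDD from an existing one. Common examples are map, flatMap and filter.

Action: triggers job execution and performs a final operation on the RDD, such as iterating over it, reducing it, or saving it to a file. An action can also return its result to the Driver program.

Transformations are lazy

If a Spark job defines only transformation operators, none of them runs even when you launch the job.
In other words, a transformation does not trigger job execution. It only records the operation to be applied to the RDD. The recorded transformations run only when an action is executed after them.
Spark uses this laziness to optimize the underlying job execution and to avoid producing too many intermediate results.

Actions trigger execution

Only an action triggers a Spark job, which in turn executes all the transformations that precede that action.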
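Laziness can be observed with a counter that is bumped inside a transformation. This is a sketch under the assumption of a local run without task retries; `LazyEvaluationSketch` and the accumulator name `touched` are illustrative names only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns the counter value before and after the action
  def run(): (Long, Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-evaluation")
    val sc = new SparkContext(conf)
    try {
      val touched = sc.longAccumulator("touched")
      // Transformation only: nothing is executed yet
      val doubled = sc.parallelize(1 to 4, 2).map { n =>
        touched.add(1)
        n * 2
      }
      val before = touched.sum
      // Action: triggers the job, which now runs the map function
      doubled.count()
      val after = touched.sum
      (before, after)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The counter stays at 0 until count() runs, after which it reflects one call per element.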

2.1 Overview of Transformation operators

https://spark.apache.org/docs/2.3.4/rdd-programming-guide.html#transformations

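Transformation	Meaning
map(func)	Return a new distributed dataset formed by passing each element of the source through the function func; every input element yields exactly one output element
filter(func)	Evaluate func on each element of the RDD and keep the elements for which it returns true
flatMap(func)	Similar to map, but each input element can be mapped to zero or more output elements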
mapPartitions(func)	Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func)	Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed)	Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset)	Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset)	Return a new RDD that contains the intersection of elements in the source dataset and the argument.
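distinct([numPartitions])	Remove duplicate elements across the whole RDD (global deduplication)
groupByKey([numPartitions])	Group by key; each key maps to an Iterable of its values
reduceByKey(func, [numPartitions])	Aggregate the values of each key using the reduce function func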
aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])	When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in `groupByKey`, the number of reduce tasks is configurable through an optional second argument.
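sortByKey([ascending], [numPartitions])	Sort a dataset of (K, V) pairs by key (a global sort), in ascending or descending order as specified by ascending
join(otherDataset, [numPartitions])	Join two RDDs of key-value pairs: called on datasets of type (K, V) and (K, W), returns (K, (V, W)) pairs for each matching key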
cogroup(otherDataset, [numPartitions])	When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called `groupWith`.
cartesian(otherDataset)	When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
pipe(command, [envVars])	Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.
coalesce(numPartitions)	Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions)	Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
repartitionAndSortWithinPartitions(partitioner)	Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling `repartition` and then sorting within each partition because it can push the sorting down into the shuffle machinery.
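A few of the key-value transformations from the table can be combined in a short word-count style sketch. The data, the label table and the object name `KeyValueSketch` are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object KeyValueSketch {
  // Returns (counts per word, words sorted descending, counts joined with labels)
  def run(): (Map[String, Int], List[String], Map[String, (Int, String)]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("key-value")
    val sc = new SparkContext(conf)
    try {
      val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 2)
      // map: one (word, 1) pair per input word
      val pairs = words.map(w => (w, 1))
      // reduceByKey: sum the 1s of each word
      val counts = pairs.reduceByKey(_ + _)
      // sortByKey: global sort by key, here in descending order
      val keysDescending = counts.sortByKey(ascending = false).keys.collect().toList
      // join: keep only keys present in both RDDs
      val labels = sc.parallelize(Seq(("a", "alpha"), ("b", "beta")))
      val joined = counts.join(labels).collect().toMap

      (counts.collect().toMap, keysDescending, joined)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Nothing runs until the collect() calls at the end, since every operator before them is a transformation.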

2.2 Overview of Action operators

Action	Meaning
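reduce(func)	Aggregate all elements of the RDD using func
collect()	Return all elements of the RDD to the local client (the Driver)
count()	Return the number of elements in the RDD
first()	Return the first element of the RDD (similar to take(1))
take(n)	Return the first n elements of the RDD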
takeSample(withReplacement, num, [seed])	Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering])	Return the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path)	Write the elements of the RDD to text files, calling toString on each element
saveAsSequenceFile(path) (Java and Scala)	Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
saveAsObjectFile(path) (Java and Scala)	Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using `SparkContext.objectFile()`.
countByKey()	Count the number of elements for each key
foreach(func)	Run func on each element of the RDD
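The most common actions from the table can be tried on one small RDD. This is only a sketch: the numbers and the object name `ActionsSketch` are chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionsSketch {
  def run(): (Int, Long, Int, List[Int], List[Int], Map[Int, Long]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("actions")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(3, 1, 4, 1, 5), 2)
      val total = rdd.reduce(_ + _)            // aggregate all elements
      val size = rdd.count()                   // number of elements
      val head = rdd.first()                   // first element
      val firstTwo = rdd.take(2).toList        // first two elements in RDD order
      val smallest = rdd.takeOrdered(2).toList // two smallest by natural order
      // countByKey needs key-value pairs; here the key is the remainder modulo 2
      val perKey = rdd.map(x => (x % 2, x)).countByKey().toMap
      (total, size, head, firstTwo, smallest, perKey)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Every one of these calls returns a plain Scala value to the driver, which is what distinguishes an action from a transformation.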

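3. Operator examples

    // 1: set up the environment
    val sparkConf =
      new SparkConf().setMaster("local").setAppName("spark")
    val sc = new SparkContext(sparkConf)

    // The operator examples in the following sections go here

    // Shut down the environment
    sc.stop()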

3.1 map

The map operator transforms each element of the data source. The values or their type may change, but the number of elements does not.

    val rdd1 = sc.makeRDD(List(1, 2, 3, 4))
    //    val value: RDD[Int] = rdd1.map(num =>{
    //      num+10
    //    })
    def sum1(num: Int): Int = {
      num + 10
    }

    val value: RDD[Int] = rdd1.map(sum1)
    value.collect().foreach(println)

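3.2 mapPartitions

The mapPartitions operator takes a function that receives an iterator and returns an iterator. The number of elements is not required to stay the same, so the function can add or drop data.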
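The original gives no listing for this operator, so here is a minimal sketch that returns one value per partition (its maximum), which shows that the output size may differ from the input size. The object name `MapPartitionsSketch` is an assumption, and the code relies on every partition being non-empty.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsSketch {
  def run(): List[Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("map-partitions")
    val sc = new SparkContext(conf)
    try {
      // Two partitions: (1, 2) and (3, 4)
      val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
      // The function sees a whole partition as an iterator and returns a new iterator
      val maxPerPartition = rdd.mapPartitions(iter => Iterator(iter.max))
      maxPerPartition.collect().toList
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```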

3.3 mapPartitionsWithIndex

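Sends the data to the compute nodes one partition at a time for processing. Any processing is allowed here, including filtering, and the index of the current partition is available during processing.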

    val seq = Seq[Int](1,2,3,4,5,6,7,8,9,10)
    // makeRDD parallelizes the local collection
    val value: RDD[Int] = sc.makeRDD(seq)

    val value1: RDD[(Int, Int)] = value.mapPartitionsWithIndex(
      (index, iter) => {
        iter.map(
          num => {
            (index, num)
          }
        )
      }
    )

    value1.foreach(println)

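3.4 flatMap (flattening)

Applies the mapping function to each element and then flattens the results, which is why this operator is also called flat mapping.

    // makeRDD parallelizes the local collection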
    val value: RDD[List[Int]] = sc.makeRDD(List(List(1, 2), List(3, 4)))
    val value1: RDD[Int] = value.flatMap(
      list => {
        list
      }
    )

    value1.foreach(println)

3.5 glom

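Converts the data of each partition into an in-memory array of the same element type. The partitioning is unchanged.

    // makeRDD parallelizes the local collection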
    val value: RDD[Any] = sc.makeRDD(List(List(1, 2), 8, List(3, 4)))

    val glomRDD = value.glom()
    //    glomRDD.collect().foreach(println)
    glomRDD.collect().foreach(data => println(data.mkString(",")))

3.6 groupBy

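Groups data by the given rule. The number of partitions stays the same by default, but the data is redistributed and regrouped. This kind of operation is called a shuffle. In the extreme case all data may end up in the same partition.

All data of one group lies in one partition, but a partition does not necessarily hold only one group.

val dataRDD = sc.makeRDD(List(1, 2, 3, 4), 1)
val dataRDD1 = dataRDD.groupBy(_ % 2)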
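To see what groupBy produces, the groups can be collected back to the driver. A sketch, with the object name `GroupBySketch` assumed for illustration; the values inside each group are sorted before comparison because a shuffle does not guarantee their order in general.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GroupBySketch {
  def run(): Map[Int, List[Int]] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("group-by")
    val sc = new SparkContext(conf)
    try {
      val dataRDD = sc.makeRDD(List(1, 2, 3, 4), 1)
      // The key of each group is the remainder modulo 2
      val grouped = dataRDD.groupBy(_ % 2)
      grouped.mapValues(_.toList.sorted).collect().toMap
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```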

3.7 filter

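Filters data by the given rule: elements that satisfy the rule are kept and the rest are dropped. After filtering, the partitions are unchanged, but the data inside them may become unbalanced. In production this can lead to data skew.

val dataRDD = sc.makeRDD(List(1, 2, 3, 4), 1)
val dataRDD1 = dataRDD.filter(_ % 2 == 0)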

3.8 sample

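Samples data from the dataset according to the given rule.

val dataRDD = sc.makeRDD(List(1, 2, 3, 4), 1)
// Sampling without replacement (Bernoulli sampling)
// Bernoulli: also called the 0-1 distribution. Like a coin toss, each outcome is either heads or tails.
// Implementation: a random number is generated per element from the seed and compared with the second argument; the element is kept if the number is smaller and dropped otherwise
// First argument: whether sampled elements are put back; false: without replacement
// Second argument: the probability of selecting each element, in [0, 1]; 0: select none; 1: select all
// Third argument: the random seed
val dataRDD1 = dataRDD.sample(false, 0.5)
// Sampling with replacement (Poisson sampling)
// First argument: whether sampled elements are put back; true: with replacement; false: without replacement
// Second argument: the expected number of times each element is selected; must be >= 0
// Third argument: the random seed
val dataRDD2 = dataRDD.sample(true, 2)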

3.9 distinct (deduplication)

val dataRDD = sc.makeRDD(List(1, 2, 3, 4, 1, 2), 1)
val dataRDD1 = dataRDD.distinct()

3.10 coalesce

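Shrinks the number of partitions according to the data volume. It is used after filtering a large dataset, to make processing the resulting small dataset more efficient.

When a Spark program has too many small tasks, coalesce can merge partitions to reduce their number and lower the task scheduling cost.

By default (shuffle = false) it can only reduce the number of partitions and does not redistribute the data.

val dataRDD = sc.makeRDD(List(1, 2, 3, 4, 1, 2), 6)
val dataRDD1 = dataRDD.coalesce(2)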

3.11 repartition

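Internally this operation calls coalesce with the shuffle argument set to true. Whether converting an RDD with more partitions into one with fewer, or one with fewer into one with more, repartition can do it, because the data always goes through a shuffle.

val dataRDD = sc.makeRDD(List(1, 2, 3, 4, 1, 2), 2)
val dataRDD1 = dataRDD.repartition(4)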
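The difference between coalesce and repartition shows up in the resulting partition counts. The following sketch compares them; the object name `CoalesceRepartitionSketch` and the target counts are arbitrary choices for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceRepartitionSketch {
  // Returns the partition counts after each operation
  def run(): (Int, Int, Int, Int) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("coalesce-repartition")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List(1, 2, 3, 4, 1, 2), 6)
      // Without a shuffle, coalesce can shrink 6 partitions to 2
      val shrunk = rdd.coalesce(2)
      // Without a shuffle, asking for more partitions leaves the count at 6
      val notGrown = rdd.coalesce(12)
      // With shuffle = true the count can grow
      val grown = rdd.coalesce(12, shuffle = true)
      // repartition always shuffles, so it works in both directions
      val repartitioned = rdd.repartition(3)
      (shrunk.getNumPartitions, notGrown.getNumPartitions,
        grown.getNumPartitions, repartitioned.getNumPartitions)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

As a rule of thumb, coalesce is the cheaper choice when only reducing partitions, while repartition is the one to use when the count must grow or the data must be rebalanced.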

3.12 sortBy

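This operation sorts the data. Before sorting, each element is passed through the function f, and the elements are then ordered by the result of f, ascending by default. By default the new RDD has the same number of partitions as the original one. The operation involves a shuffle.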
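The effect of the function f is easiest to see by sorting the same strings two ways. A small sketch, with the sample strings and the object name `SortBySketch` chosen only for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortBySketch {
  // Returns (sorted as strings ascending, sorted as numbers descending)
  def run(): (List[String], List[String]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("sort-by")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.makeRDD(List("3", "11", "2", "1"), 2)
      // f is the identity: lexicographic string order, ascending by default
      val asStrings = rdd.sortBy(s => s).collect().toList
      // f converts to Int: numeric order, here descending
      val asNumbersDescending = rdd.sortBy(_.toInt, ascending = false).collect().toList
      (asStrings, asNumbersDescending)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The elements themselves are never changed by f; it only supplies the sort key.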