Spark Series: Operators in Detail (Part 2)

Article link: https://blog.csdn.net/q322625/article/details/97984449

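Preface

This article continues from the previous one, Spark Series: Operators of All Kinds in Detail (Part 1).
This part covers Action operators and Cache operators.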

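Action operators

Actions are Spark's execution operators: each action triggers the generation of a job.
One thing to keep in mind:
an action either returns nothing,
or, if it does return a value, that value is pulled back to the driver.
If the data is too large, you need to consider whether your driver has enough memory to hold it.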

reduce(func)
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
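Aggregates the data in the RDD and returns the aggregated value.
The execution logic is similar to reduceByKey, except that reduce collapses the whole dataset into a single value instead of one value per key, and it is an action rather than a transformation.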
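To illustrate why func should be commutative and associative, here is a minimal sketch that reduces the same numbers with addition and with subtraction under different partition counts. The local master, the app name and the sample data are assumptions made for this example only.

```scala
import org.apache.spark.sql.SparkSession

object ReduceExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("reduce-example")
      .getOrCreate()
    val sc = spark.sparkContext

    val nums = sc.parallelize(1 to 10, numSlices = 4)

    // Addition is commutative and associative, so the result does not
    // depend on how the data is split across partitions.
    val sum = nums.reduce(_ + _)
    println(s"sum = $sum") // sum = 55

    // Subtraction is neither, so the result can change with the partitioning.
    val diff2 = sc.parallelize(1 to 10, 2).reduce(_ - _)
    val diff4 = sc.parallelize(1 to 10, 4).reduce(_ - _)
    println(s"diff with 2 partitions = $diff2, with 4 partitions = $diff4")

    spark.stop()
  }
}
```

The reduced value is a single result on the driver, so unlike collect() it stays small no matter how large the RDD is.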

collect()
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
Returns all the data in the RDD; in other words, every element of the RDD is pulled back to the driver unchanged.
count()
Return the number of elements in the dataset.
Returns the number of elements in the RDD.
first()
Return the first element of the dataset (similar to take(1)).
Returns the first element of the RDD.
take(n)
Return an array with the first n elements of the dataset.
Returns the first n elements of the RDD.
n: the number of elements to fetch
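The four actions above (collect, count, first and take) can be tried together in a short sketch like the following. The session settings and the word list are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object BasicActionsExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("basic-actions-example")
      .getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("spark", "hadoop", "flink", "hive", "kafka"), 2)

    // Filter first so that only a small subset is pulled back to the driver.
    val shortWords: Array[String] = words.filter(_.length <= 4).collect()
    println(shortWords.mkString(", "))    // hive

    println(words.count())                // 5
    println(words.first())                // spark
    println(words.take(2).mkString(", ")) // spark, hadoop

    spark.stop()
  }
}
```

On a large RDD, take(n) with a small n is usually a safer way to peek at the data than collect(), because only n elements reach the driver.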
takeSample(withReplacement, num, [seed])
Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
Returns num randomly chosen elements.
withReplacement: whether to sample with replacement
num: the number of elements to sample
seed: the random seed; the same seed yields the same sample
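A minimal sketch of the three parameters follows; the session settings, the data and the seed value 42 are assumptions chosen for the example.

```scala
import org.apache.spark.sql.SparkSession

object TakeSampleExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("take-sample-example")
      .getOrCreate()
    val sc = spark.sparkContext

    val nums = sc.parallelize(1 to 100, 4)

    // Same RDD and same seed: the two samples are identical.
    val a = nums.takeSample(withReplacement = false, num = 5, seed = 42L)
    val b = nums.takeSample(withReplacement = false, num = 5, seed = 42L)
    println(a.sameElements(b)) // true

    // Without replacement, no element is picked twice.
    println(a.distinct.length) // 5

    // With replacement, the sample may even be larger than the dataset.
    val small = sc.parallelize(1 to 3)
    val c = small.takeSample(withReplacement = true, num = 10)
    println(c.length) // 10

    spark.stop()
  }
}
```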
takeOrdered(n, [ordering])
Return the first n elements of the RDD using either their natural order or a custom comparator.
Sorts by ordering and returns the first n elements.
n: the number of elements to return
ordering: the ordering (comparator) used for sorting
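As a sketch of both the natural order and a custom ordering, assuming a local session and made-up sample data:

```scala
import org.apache.spark.sql.SparkSession

object TakeOrderedExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("take-ordered-example")
      .getOrCreate()
    val sc = spark.sparkContext

    val nums = sc.parallelize(Seq(5, 3, 9, 1, 7), 2)

    // Natural ordering: the three smallest values.
    println(nums.takeOrdered(3).mkString(", "))                       // 1, 3, 5

    // Reversed ordering: the three largest values.
    println(nums.takeOrdered(3)(Ordering[Int].reverse).mkString(", ")) // 9, 7, 5

    // Custom ordering: shortest strings first.
    val words = sc.parallelize(Seq("hadoop", "hive", "zookeeper", "spark"))
    val byLength = Ordering.by[String, Int](_.length)
    println(words.takeOrdered(2)(byLength).mkString(", "))            // hive, spark

    spark.stop()
  }
}
```

The result is an array on the driver, so n should stay small.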
saveAsTextFile(path)
Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
Saves the RDD's data as text under the given directory path.
path: the output path, which can be on the local filesystem, HDFS, or any other Hadoop-supported file system
saveAsSequenceFile(path)
(Java and Scala) Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
Similar to saveAsTextFile, except that the output format is SequenceFile.
SequenceFile is Hadoop's binary key-value file format; it supports compression but is not itself a compression format.
saveAsObjectFile(path)
(Java and Scala)
Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
Same idea as above, but generally used when the RDD holds objects,
so that on loading the data can be turned straight back into the corresponding objects.
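The three save operators can be compared side by side in a sketch like the one below. The temporary output directory, the session settings and the sample pairs are assumptions for illustration; note that each output path must not exist before the job runs.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object SaveExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("save-example")
      .getOrCreate()
    val sc = spark.sparkContext

    // A fresh temp directory on the local filesystem (an assumption for this demo).
    val base = Files.createTempDirectory("spark-save-example").toString

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), 2)

    pairs.saveAsTextFile(s"$base/text")    // toString of each element, one part file per partition
    pairs.saveAsSequenceFile(s"$base/seq") // key-value pairs convertible to Writable
    pairs.saveAsObjectFile(s"$base/obj")   // Java serialization

    // Read each output back with the matching SparkContext method.
    println(sc.textFile(s"$base/text").collect().sorted.mkString(", ")) // (a,1), (b,2), (c,3)
    println(sc.sequenceFile[String, Int](s"$base/seq").collect().sorted.mkString(", "))
    println(sc.objectFile[(String, Int)](s"$base/obj").collect().sorted.mkString(", "))

    spark.stop()
  }
}
```

The text output comes back as plain strings, while the SequenceFile and object file outputs come back as typed (String, Int) pairs.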
countByKey()
Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
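This is essentially a word count: it counts how many times each key appears and ignores the values. In the Scala API the result is a Map[K, Long] returned to the driver, so it suits datasets with a modest number of distinct keys.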
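A minimal sketch of countByKey, including the word-count use, assuming a local session and made-up input:

```scala
import org.apache.spark.sql.SparkSession

object CountByKeyExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("count-by-key-example")
      .getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 5), ("c", 2), ("a", 0)))

    // Counts occurrences of each key; the values themselves are ignored.
    val counts: scala.collection.Map[String, Long] = pairs.countByKey()
    println(counts.toSeq.sorted.mkString(", ")) // (a,3), (b,1), (c,1)

    // Word count: map every word to a pair, then count by key.
    val wordCounts = sc.parallelize(Seq("a b a", "c a"))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .countByKey()
    println(wordCounts.toSeq.sorted.mkString(", ")) // (a,3), (b,1), (c,1)

    spark.stop()
  }
}
```

When the number of distinct keys is large, the reduceByKey transformation keeps the counts distributed instead of collecting them on the driver.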
foreach(func)
Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
Iterates over every element of the RDD.
func: the logic applied to each element.
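The note about Accumulators can be sketched as follows. The accumulator name, the session settings and the data are assumptions for illustration; the second half shows the pattern the note warns against.

```scala
import org.apache.spark.sql.SparkSession

object ForeachExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("foreach-example")
      .getOrCreate()
    val sc = spark.sparkContext

    val nums = sc.parallelize(1 to 100, 4)

    // Safe: an accumulator is updated on the executors and merged on the driver.
    val evenCount = sc.longAccumulator("evenCount")
    nums.foreach { n => if (n % 2 == 0) evenCount.add(1) }
    println(evenCount.value) // 50

    // Unsafe: a plain variable is captured by the closure, and each task
    // works on its own copy. Do not rely on the value seen by the driver.
    var counter = 0
    nums.foreach(n => counter += n)
    println(s"counter on the driver: $counter")

    spark.stop()
  }
}
```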

Control operators

Control operators are what we usually call caching operators.

persist(StorageLevel)
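A caching operator. It is lazy: calling it only marks the RDD with a storage level and returns that same RDD.
Once an action has run on the marked RDD,
its partitions are stored,
and when another job needs the RDD again,
the data can be read straight from the cache instead of being recomputed.
StorageLevel: the storage level
- About storage levels: look at the constructor,
  which breaks a level down into:
  _useDisk: whether to use disk
  _useMemory: whether to use memory
  _useOffHeap: whether to use off-heap memory
  _deserialized: whether the data is kept in deserialized form (as objects rather than serialized bytes)
  _replication: number of replicas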
```
class StorageLevel private(
    private var _useDisk: Boolean,
    private var _useMemory: Boolean,
    private var _useOffHeap: Boolean,
    private var _deserialized: Boolean,
    private var _replication: Int = 1)
```
- Built-in storage levels
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(false, false, true, false)
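  The constructor parameters explain what each built-in level does.
  Note: the levels we use most often are DISK_ONLY and MEMORY_ONLY_SER.
  Replication usually brings little benefit, unless you have plenty of memory to spare.
  Storing on disk has a noticeable impact on performance,
  although it can still be worth it when the lineage is long and the data is large.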
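To read the built-in levels against the constructor parameters, a small sketch like the following prints the flags of a few of them through StorageLevel's public accessors. The helper name describe and the choice of levels are assumptions for illustration.

```scala
import org.apache.spark.storage.StorageLevel

object StorageLevelFlags {
  // Hypothetical helper: renders the five constructor flags of a level.
  def describe(name: String, level: StorageLevel): String =
    s"$name: disk=${level.useDisk}, memory=${level.useMemory}, " +
      s"offHeap=${level.useOffHeap}, deserialized=${level.deserialized}, " +
      s"replication=${level.replication}"

  def main(args: Array[String]): Unit = {
    Seq(
      "MEMORY_ONLY"           -> StorageLevel.MEMORY_ONLY,
      "MEMORY_ONLY_SER"       -> StorageLevel.MEMORY_ONLY_SER,
      "DISK_ONLY"             -> StorageLevel.DISK_ONLY,
      "MEMORY_AND_DISK_SER_2" -> StorageLevel.MEMORY_AND_DISK_SER_2
    ).foreach { case (name, level) => println(describe(name, level)) }
    // First line printed:
    // MEMORY_ONLY: disk=false, memory=true, offHeap=false, deserialized=true, replication=1
  }
}
```

The _SER suffix corresponds to deserialized=false, and the _2 suffix to replication=2.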
cache
For RDDs this is simply persist(MEMORY_ONLY).
There is not much more to say; for ordinary caching it is usually enough.
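A minimal sketch of persist and cache in use is shown below. The session settings and the data are assumptions for illustration, and the unpersist() call is an extra step, not covered above, that releases the cached data once it is no longer needed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("persist-example")
      .getOrCreate()
    val sc = spark.sparkContext

    val squares = sc.parallelize(1 to 1000000, 4).map(x => x.toLong * x)

    // persist is lazy: this only marks the RDD, nothing is computed or stored yet.
    squares.persist(StorageLevel.MEMORY_ONLY_SER)
    println(squares.getStorageLevel.useMemory) // true

    // The first action computes the RDD and fills the cache.
    println(squares.count()) // 1000000

    // A later job on the same RDD can read the cached partitions.
    println(squares.reduce(_ + _))

    // Release the cached data when it is no longer needed.
    squares.unpersist()

    // On an RDD, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
    val cached = sc.parallelize(1 to 10).cache()
    println(cached.getStorageLevel == StorageLevel.MEMORY_ONLY) // true

    spark.stop()
  }
}
```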