Summary of Spark Operators

goldlone

Article link: https://blog.csdn.net/goldlone/article/details/83868822

Spark Operators

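RDDs support two types of operators: transformations (which create a new dataset from an existing one) and actions (which run a computation on the dataset and return a value to the driver program).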

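Transformations are not computed immediately. Spark only records which dataset each one depends on, and performs the computation when a result has to be returned to the driver program (that is, when an action is encountered). This lazy design lets Spark run more efficiently.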

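By default, each transformed RDD may be recomputed every time you run an action on it. You can also persist an RDD in memory with the persist (or cache) method, in which case Spark keeps the elements on the cluster for faster access the next time they are queried. Persisting RDDs on disk, or replicating them across multiple nodes, is also supported.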
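The lazy evaluation and caching behaviour described above can be observed with a small local-mode sketch. The object name, the `local[2]` master and the accumulator name are assumptions made for illustration; the accumulator is only used to count how often the map function has run before any action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns (map calls seen before any action, result of first action, result of second action).
  def run(): (Long, Int, Int) = {
    // Assumption: a throwaway local SparkContext is acceptable for the demo.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("lazy-sketch"))
    try {
      val calls = sc.longAccumulator("mapCalls")
      // Transformation only: Spark records the lineage, nothing runs yet.
      val doubled = sc.parallelize(1 to 4, 2).map { x => calls.add(1); x * 2 }
      val callsBeforeAction = calls.sum
      doubled.cache()                    // ask Spark to keep computed partitions in memory
      val first = doubled.reduce(_ + _)  // action: triggers the computation
      val second = doubled.reduce(_ + _) // action: may be served from the cached partitions
      (callsBeforeAction, first, second)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Before the first `reduce` the accumulator is still zero, which shows that `map` alone did no work.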

Type 1: Transformations

map

def map[U: ClassTag](f: T => U): RDD[U]

Returns a new RDD formed by passing each element of the source RDD through the function f.

filter

def filter(f: T => Boolean): RDD[T]

Returns a new RDD containing only the elements of the source RDD for which f returns true.

flatMap

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]

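Similar to map, but each input element can be mapped to zero or more output items (so f should return a sequence rather than a single item).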
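A minimal sketch contrasting map, filter and flatMap on the same input follows. The object name and sample strings are hypothetical, and a local master is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapFilterFlatMapSketch {
  def run(): (Seq[Int], Seq[String], Seq[String]) = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("map-sketch"))
    try {
      val lines = sc.parallelize(Seq("a b", "", "c"), 2)
      // map: exactly one output element per input element.
      val lengths = lines.map(_.length).collect().toSeq
      // filter: keeps only the elements for which the predicate is true.
      val nonEmpty = lines.filter(_.nonEmpty).collect().toSeq
      // flatMap: zero or more output elements per input element.
      val tokens = lines.flatMap(_.split(" ").filter(_.nonEmpty)).collect().toSeq
      (lengths, nonEmpty, tokens)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```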

mapPartitions

def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U]

Similar to map, but runs separately on each partition of the RDD, so f must be of type Iterator[T] => Iterator[U].

mapPartitionsWithIndex

def mapPartitionsWithIndex[U: ClassTag](
    f: (Int, Iterator[T]) => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U]

Similar to mapPartitions, but also provides the index of the partition, so f must be of type (Int, Iterator[T]) => Iterator[U].
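As a sketch of the per-partition variants, the example below computes one sum per partition, first without and then with the partition index. The object name and the choice of three partitions are assumptions; how the elements are distributed across partitions is decided by Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsSketch {
  def run(): (Seq[Int], Seq[(Int, Int)]) = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("partitions-sketch"))
    try {
      val numbers = sc.parallelize(1 to 6, 3)
      // f receives an iterator over a whole partition and returns an iterator.
      val sums = numbers.mapPartitions(it => Iterator(it.sum)).collect().toSeq
      // Same idea, but f also receives the partition index.
      val indexedSums = numbers
        .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.sum)))
        .collect()
        .toSeq
      (sums, indexedSums)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Processing a whole partition at once is typically useful when per-element setup (such as opening a connection) would be too costly.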

sample

def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T]

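Samples a fraction of the data, with or without replacement, using the given random number generator seed.

Sampling modes:

Without replacement: fraction is the probability that each element is selected; it must be in [0, 1].
With replacement: fraction is the expected number of times each element is selected; it must be >= 0.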
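The two sampling modes can be sketched as follows. The object name, the fractions and the seed are illustrative assumptions; the sample size is random, so the example only relies on properties that hold for any draw.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SampleSketch {
  def run(): (Seq[Int], Seq[Int], Seq[Int]) = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sample-sketch"))
    try {
      val data = sc.parallelize(1 to 100, 4)
      // Without replacement: each element is kept with probability 0.2.
      val a = data.sample(withReplacement = false, fraction = 0.2, seed = 42L).collect().toSeq
      // Same RDD and same seed: the same sample is expected again.
      val b = data.sample(withReplacement = false, fraction = 0.2, seed = 42L).collect().toSeq
      // With replacement: each element is expected to appear 1.5 times on average.
      val c = data.sample(withReplacement = true, fraction = 1.5, seed = 42L).collect().toSeq
      (a, b, c)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```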

union(otherDataset)

def union(other: RDD[T]): RDD[T]

Merges two RDDs into a new RDD without removing duplicates, so the result may contain repeated elements.

++

def ++(other: RDD[T]): RDD[T] = withScope {
    this.union(other)
}

Same as union; it simply calls the union method internally.

intersection(otherDataset)

def intersection(other: RDD[T]): RDD[T]

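Returns the intersection of the two RDDs; the result contains no duplicate elements. Note that this method performs a shuffle internally.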

distinct([numTasks])

def distinct(numPartitions: Int)

Removes duplicate elements; implemented internally with reduceByKey.
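A short sketch of the set-like operators above, with results sorted on the driver so that they are easy to compare. The object name and input values are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SetOperatorsSketch {
  def run(): (Seq[Int], Seq[Int], Seq[Int]) = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("set-sketch"))
    try {
      val left = sc.parallelize(Seq(1, 2, 2, 3))
      val right = sc.parallelize(Seq(2, 3, 3, 4))
      // union (or ++) keeps duplicates.
      val unioned = (left ++ right).collect().sorted.toSeq
      // intersection removes duplicates and needs a shuffle.
      val common = left.intersection(right).collect().sorted.toSeq
      // distinct removes duplicates from a single RDD.
      val unique = left.union(right).distinct().collect().sorted.toSeq
      (unioned, common, unique)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```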

groupByKey

def groupByKey(): RDD[(K, Iterable[V])]

def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]

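Groups the values of the RDD by key. When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable[V]) pairs.
Notes:

If you are grouping in order to perform an aggregation (such as a sum or average) over each key, reduceByKey or aggregateByKey will yield much better performance.
By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass the optional numPartitions argument to set a different number of partitions.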

reduceByKey

def reduceByKey(func: (V, V) => V): RDD[(K, V)]

def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

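When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs in which the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. As with groupByKey, the number of reduce tasks is configurable through an optional second argument.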

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])

def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)]

def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)]

def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)]

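When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs in which the values for each key are aggregated using the given combine functions and a neutral "zero" value. The elements within each partition are first aggregated with seqOp, and the per-partition results are then merged with combOp. The aggregated value type U may differ from the input value type V, while unnecessary allocations are avoided. As with groupByKey, the number of reduce tasks is configurable through the optional numPartitions argument.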
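To make the difference between the three by-key operators concrete, here is a sketch that computes per-key sums with groupByKey and reduceByKey, and per-key averages with aggregateByKey. The object name and sample pairs are assumptions; the (sum, count) accumulator is just one possible choice for U.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ByKeyAggregationSketch {
  def run(): (Map[String, Int], Map[String, Int], Map[String, Double]) = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("bykey-sketch"))
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 6), ("a", 5)), 2)
      // groupByKey: ships every value across the shuffle, then sums on the reduce side.
      val grouped = pairs.groupByKey().mapValues(_.sum).collect().toMap
      // reduceByKey: combines values inside each partition before the shuffle.
      val reduced = pairs.reduceByKey(_ + _).collect().toMap
      // aggregateByKey: result type (sum, count) differs from the value type Int.
      val averages = pairs
        .aggregateByKey((0, 0))(
          (acc, v) => (acc._1 + v, acc._2 + 1), // seqOp: fold a value into the accumulator
          (x, y) => (x._1 + y._1, x._2 + y._2)  // combOp: merge two accumulators
        )
        .mapValues { case (sum, n) => sum.toDouble / n }
        .collect()
        .toMap
      (grouped, reduced, averages)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```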

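sortByKey([ascending], [numTasks])

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)]

When called on a dataset of (K, V) pairs where K has an implicit Ordering, returns a dataset of (K, V) pairs sorted by key in ascending or descending order.

The related sortBy operator is available on any RDD:

def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)

Returns this RDD sorted by the given key function.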

join

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

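When called on datasets of (K, V) and (K, W) pairs, returns a dataset of (K, (V, W)) pairs containing all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin and fullOuterJoin.
Notes:

Only keys present in both RDDs are kept.
Keys are not deduplicated: joining several records that share a key produces multiple records.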
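The two notes above can be checked with a small sketch: key "a" occurs twice on the left and therefore yields two joined records, while keys that occur on one side only are dropped by join. The object name and data are hypothetical; leftOuterJoin is included for contrast.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JoinSketch {
  type Inner = Seq[(String, (Int, String))]
  type LeftOuter = Seq[(String, (Int, Option[String]))]

  def run(): (Inner, LeftOuter) = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("join-sketch"))
    try {
      val left = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
      val right = sc.parallelize(Seq(("a", "x"), ("c", "y")))
      // Inner join: only key "a" exists on both sides.
      val inner = left.join(right).collect().sortBy { case (k, (v, _)) => (k, v) }.toSeq
      // Left outer join: every left record is kept; missing matches become None.
      val outer = left.leftOuterJoin(right).collect().sortBy { case (k, (v, _)) => (k, v) }.toSeq
      (inner, outer)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```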

cogroup(otherDataset, [numTasks])

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]

def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))]

def cogroup[W](other: RDD[(K, W)], numPartitions: Int)
    : RDD[(K, (Iterable[V], Iterable[W]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)])
    : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int)
    : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]

def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)])
    : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

def cogroup[W1, W2, W3](other1: RDD[(K, W1)],
    other2: RDD[(K, W2)],
    other3: RDD[(K, W3)],
    partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

def cogroup[W1, W2, W3](other1: RDD[(K, W1)],
    other2: RDD[(K, W2)],
    other3: RDD[(K, W3)],
    numPartitions: Int)
    : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]

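Merges (K, V) and (K, W) datasets by key, returning a dataset of (K, (Iterable[V], Iterable[W])) pairs.
Notes:

The order of the components in the result tuple follows the order of the call.
When one RDD has no values for a key, the corresponding Iterable is empty.
The values within each RDD are first grouped by key.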
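A sketch of cogroup on two small pair RDDs shows how keys missing on one side end up with an empty collection. The object name and inputs are hypothetical; the Iterables are converted to sorted lists because their internal order is not guaranteed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CogroupSketch {
  def run(): Map[String, (List[Int], List[String])] = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("cogroup-sketch"))
    try {
      val left = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
      val right = sc.parallelize(Seq(("a", "x"), ("c", "y")))
      left
        .cogroup(right) // RDD[(String, (Iterable[Int], Iterable[String]))]
        .mapValues { case (vs, ws) => (vs.toList.sorted, ws.toList.sorted) }
        .collect()
        .toMap
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```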

cartesian(otherDataset)

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

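Returns the Cartesian product of the two RDDs, that is, all pairs (a, b) where a is in this RDD and b is in other.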

pipe

def pipe(command: String): RDD[String]

def pipe(command: String, env: Map[String, String]): RDD[String]

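Pipes each partition of the RDD through a shell command, for example a Perl or bash script. RDD elements are written to the process's stdin, and the lines it writes to stdout are returned as an RDD of strings.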

coalesce

def coalesce(numPartitions: Int, shuffle: Boolean = false)

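Decreases the number of partitions in the RDD to numPartitions. This is useful for running operations more efficiently after filtering down a large dataset. No shuffle is performed by default; to increase the number of partitions, shuffle must be enabled (shuffle = true).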

repartition(numPartitions)

def repartition(numPartitions: Int)

Repartitions the RDD; implemented by calling coalesce with shuffle enabled.

repartitionAndSortWithinPartitions(partitioner)

def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)]

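Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts the records by key. This is more efficient than calling repartition and then sorting within each partition, because the sorting can be pushed down into the shuffle machinery.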

glom

def glom(): RDD[Array[T]]

Returns an RDD in which all elements of each partition have been merged into a single array.
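The partition-count behaviour of coalesce and repartition, and the per-partition view that glom gives, can be sketched as follows. The object name and partition counts are assumptions chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitioningSketch {
  // Returns partition counts for four variants plus the glom view of the original RDD.
  def run(): (Int, Int, Int, Int, List[List[Int]]) = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("partition-sketch"))
    try {
      val rdd = sc.parallelize(1 to 8, 4)
      val shrunk = rdd.coalesce(2).getNumPartitions                  // fewer partitions, no shuffle
      val notGrown = rdd.coalesce(8).getNumPartitions                // cannot grow without shuffle
      val grown = rdd.coalesce(8, shuffle = true).getNumPartitions   // grows because shuffle is on
      val repartitioned = rdd.repartition(8).getNumPartitions        // same as coalesce with shuffle
      // glom: one array per partition, converted to lists for readability.
      val perPartition = rdd.glom().collect().map(_.toList).toList
      (shrunk, notGrown, grown, repartitioned, perPartition)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```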

groupBy

def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]

def groupBy[K](
    f: T => K,
    numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]

def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
    : RDD[(K, Iterable[T])]

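Returns an RDD of grouped elements, where the key of each element is computed by f. Each group consists of a key and a sequence of the elements that map to that key. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.

Note:
This operator can be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average), PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will be more efficient.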

zip

def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)]

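Zips this RDD with another one, pairing the first element of each, the second element of each, and so on. Both RDDs must have the same number of partitions and the same number of elements in each partition; otherwise an exception is thrown at runtime.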
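A minimal sketch of zip on two RDDs built with the same length and the same number of partitions follows. The object name and values are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZipSketch {
  def run(): Seq[(Int, String)] = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("zip-sketch"))
    try {
      // Both RDDs have four elements spread over two partitions.
      val numbers = sc.parallelize(1 to 4, 2)
      val letters = sc.parallelize(Seq("a", "b", "c", "d"), 2)
      numbers.zip(letters).collect().toSeq
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

If the element counts differ, the error only surfaces when an action runs, since zip itself is a lazy transformation.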

Type 2: Actions

reduce(func)

def reduce(f: (T, T) => T): T

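Aggregates the elements of the dataset using the function f (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.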

collect()

def collect(): Array[T]

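Collects the elements of all partitions onto the driver machine. With a large dataset this can cause the driver to run out of memory, so use it with care.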

count

def count(): Long

Returns the number of elements in the RDD.

first

def first(): T

Returns the first element of the RDD.

take

def take(num: Int): Array[T]

Returns the first num elements of the RDD.

takeSample(withReplacement, num, [seed])

def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T]

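Returns an array containing a random sample of num elements of the dataset, drawn with or without replacement, optionally with a pre-specified random number generator seed.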

takeOrdered(n, [ordering])

def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]

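Returns the first num elements of the RDD after sorting by the implicit Ordering[T], that is, the num smallest elements under that ordering.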
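The difference between first, take and takeOrdered can be sketched on a small unsorted RDD. The object name and the input values are assumptions for illustration; passing a reversed Ordering is one way to obtain the largest elements instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TakeSketch {
  def run(): (Int, Seq[Int], Seq[Int], Seq[Int]) = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("take-sketch"))
    try {
      val rdd = sc.parallelize(Seq(5, 3, 8, 1, 9, 2), 3)
      val head = rdd.first()                 // first element in partition order
      val firstTwo = rdd.take(2).toSeq       // first two elements in partition order
      val smallest = rdd.takeOrdered(3).toSeq // three smallest, using the implicit Ordering[Int]
      val largest = rdd.takeOrdered(3)(Ordering[Int].reverse).toSeq // explicit reversed ordering
      (head, firstTwo, smallest, largest)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```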

saveAsTextFile(path)

def saveAsTextFile(path: String): Unit

Saves the RDD as a text file, using the string representation of each element.

saveAsSequenceFile

def saveAsSequenceFile(
    path: String,
    codec: Option[Class[_ <: CompressionCodec]] = None): Unit

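Writes the RDD as a Hadoop SequenceFile, using Writable types inferred from the RDD's key and value types. If the key or value is a subclass of Writable, its own class is used directly; otherwise primitive types such as Int and Double are mapped to IntWritable, DoubleWritable and so on, byte arrays to BytesWritable, and strings to Text.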

saveAsObjectFile

def saveAsObjectFile(path: String): Unit

Saves the RDD as a SequenceFile of serialized objects.

countByKey

def countByKey(): Map[K, Long]

Counts the number of elements for each key and collects the result into a Map on the driver.

foreach

def foreach(f: T => Unit): Unit

Applies the function f to every element of the RDD.

randomSplit

def randomSplit(
    weights: Array[Double],
    seed: Long = Utils.random.nextLong): Array[RDD[T]]

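Randomly splits the RDD according to the given array of weights; if the weights do not sum to 1, they are normalized. Note that randomSplit returns an array of RDDs and is evaluated lazily, so it behaves like a transformation rather than an action.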

aggregate

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

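Aggregates the elements of each partition, and then the results of all partitions, using the given combine functions and a neutral "zero value".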
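As a sketch of the two-level aggregation described above, the example below computes a (sum, count) pair with aggregate and compares the sum with a plain reduce. The object name and the (sum, count) accumulator are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AggregateSketch {
  def run(): (Int, (Int, Int)) = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("aggregate-sketch"))
    try {
      val rdd = sc.parallelize(1 to 10, 3)
      // reduce: result type must equal the element type.
      val total = rdd.reduce(_ + _)
      // aggregate: result type (sum, count) differs from the element type Int.
      val sumAndCount = rdd.aggregate((0, 0))(
        (acc, x) => (acc._1 + x, acc._2 + 1), // seqOp: fold one element into a partition result
        (a, b) => (a._1 + b._1, a._2 + b._2)  // combOp: merge two partition results
      )
      (total, sumAndCount)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Dividing the sum by the count on the driver then gives the average in a single pass over the data.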