reduce(func)
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect()
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data; collecting a large dataset can cause an OOM on the driver.
count()
Return the number of elements in the dataset.
first()
Return the first element of the dataset (similar to take(1)).
take(n)
Return an array with the first n elements of the dataset.
foreach(func)
Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
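The Accumulator pattern mentioned above can be sketched as follows. This is a minimal example, assuming `sc` is an existing SparkContext; it sums the elements through a LongAccumulator instead of mutating a local variable in the closure:

```scala
// Sketch: side-effecting foreach that updates a driver-visible accumulator.
// Assumes `sc` is an existing SparkContext.
val acc = sc.longAccumulator("sum")
sc.parallelize(Array(1, 2, 3, 4, 5)).foreach(x => acc.add(x))
println(acc.value) // 15
```

A plain local `var sum` updated inside foreach would not work: each executor mutates its own serialized copy of the closure, and the driver's variable stays unchanged.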
takeSample(withReplacement, num, [seed])
Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
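Supplying the optional seed makes the sample reproducible. A minimal sketch, assuming `sc` is an existing SparkContext:

```scala
// Assumes `sc` is an existing SparkContext.
val rdd = sc.parallelize(1 to 100)
// Same seed => same sample across calls; sampling without replacement here.
val s1 = rdd.takeSample(withReplacement = false, 3, seed = 42L)
val s2 = rdd.takeSample(withReplacement = false, 3, seed = 42L)
println(s1.sameElements(s2)) // true
```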
takeOrdered(n, [ordering])
Return the first n elements of the RDD using either their natural order or a custom comparator.
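The custom-comparator variant can be sketched like this (assuming `sc` is an existing SparkContext). Passing a reversed Ordering turns takeOrdered into a "largest n" query:

```scala
// Assumes `sc` is an existing SparkContext.
val rdd = sc.parallelize(Array(5, 1, 4, 2, 3))
println(rdd.takeOrdered(2).mkString(","))                       // 1,2 (natural order)
println(rdd.takeOrdered(2)(Ordering.Int.reverse).mkString(",")) // 5,4 (largest first)
```

Spark also provides top(n), which is equivalent to takeOrdered(n) with the reversed ordering.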
DEMO
import org.apache.spark.{SparkConf, SparkContext}

object ActionApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("ActionApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    val data1 = sc.parallelize(Array(1, 2, 3, 4, 5))

    // reduce with an associative, commutative function gives a deterministic result
    val reduceData = data1.reduce(_ + _)
    println(reduceData) // 15

    // Subtraction is neither associative nor commutative, so the result depends
    // on partitioning. With 2 partitions (1,2)(3,4,5): per-partition results are
    // -1 and -6, then -1 - (-6) = 5 or -6 - (-1) = -5. Use a single partition
    // if a deterministic order is required.
    println(data1.reduce(_ - _)) // 5 or -5

    // collect() brings all the data to the driver; only use it when the
    // resulting array is small, otherwise the driver may run out of memory.
    data1.collect().foreach(println)

    // Number of elements in the dataset
    println(data1.count()) // 5

    // First element
    println(data1.first()) // 1

    // First 2 elements; like collect(), only safe for small results
    data1.take(2).foreach(println) // 1 2

    // Random sample of 2 elements, with replacement; small results only
    data1.takeSample(withReplacement = true, 2).foreach(println)

    // Smallest 2 elements in natural order; small results only
    data1.takeOrdered(2).foreach(println) // 1 2

    sc.stop()
  }
}