makeRDD and parallelize: the two methods for creating an RDD from a local collection are exactly the same.
val conf = new SparkConf().setAppName("MapPartitionsDemo").setMaster("local[*]")
val sc = new SparkContext(conf)
//create an RDD
val rdd1 = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
//makeRDD source code
/** Distribute a local Scala collection to form an RDD.
*
* This method is identical to `parallelize`.
* @param seq Scala collection to distribute
* @param numSlices number of partitions to divide the collection into
* @return RDD representing distributed collection
*/
def makeRDD[T: ClassTag](
seq: Seq[T],
numSlices: Int = defaultParallelism): RDD[T] = withScope {
parallelize(seq, numSlices)
}
//The two local-collection creation methods are completely equivalent: makeRDD simply calls parallelize under the hood.
//If the number of partitions is not specified, defaultParallelism is used (all available cores under local[*]).
val rdd2 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
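A quick check (a minimal sketch, assuming the local[*] setup above): both RDDs report the same partition count, and when numSlices is omitted the default is sc.defaultParallelism.
println(rdd1.getNumPartitions)   // 3
println(rdd2.getNumPartitions)   // 3
//without numSlices, defaultParallelism decides the partition count
println(sc.makeRDD(List(1, 2, 3)).getNumPartitions == sc.defaultParallelism)   // true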
textFile
sc.textFile("hdfs://linux01:9000/data",2)
When this method pulls data from Hadoop, suppose the directory contains four blocks with sizes (99, 99, 99, 799). If the second argument is 2, the larger file is split into two partitions; if it is 3, it is split into three, and so on. If the argument is omitted, defaultMinPartitions (at most 2) is used.
sc.textFile("hdfs://linux01:9000/data",2)
//source code
/**
* Read a text file from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI, and return it as an RDD of Strings.
* The text files must be encoded as UTF-8.
*
* @param path path to the text file on a supported file system
* @param minPartitions suggested minimum number of partitions for the resulting RDD
* @return RDD of lines of the text file
*/
def textFile(
path: String,
minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
assertNotStopped()
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions).map(pair => pair._2.toString).setName(path)
}
The number of partitions this method produces actually follows a predictable rule: textFile uses TextInputFormat, which extends FileInputFormat, and the actual splitting is done by the getSplits method of that parent class.
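A simplified sketch of that rule (illustrative only, not the exact Hadoop code): getSplits derives a target split size from the total input size and the requested number of splits, clamps it by the block size, and then cuts each file against that size.
def splitSize(totalSize: Long, numSplits: Int, blockSize: Long, minSize: Long = 1L): Long = {
  val goalSize = totalSize / math.max(numSplits, 1)    // target size per split
  math.max(minSize, math.min(goalSize, blockSize))     // clamp between minSize and blockSize
}
//with the blocks from the example above (99 + 99 + 99 + 799 = 1096 bytes in total):
println(splitSize(totalSize = 1096, numSplits = 2, blockSize = 128 * 1024 * 1024))   // 548
//548 < 799, so the 799-byte file is cut into two splits while the three small files stay whole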
filter
In essence filter operates on each partition; under the hood it creates a MapPartitionsRDD.
/**
* Return a new RDD containing only the elements that satisfy a predicate.
*/
def filter(f: T => Boolean): RDD[T] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[T, T](
this,
(_, _, iter) => iter.filter(cleanF),
preservesPartitioning = true)
}
/** Returns an iterator over all the elements of this iterator that satisfy the predicate `p`.
* The order of the elements is preserved.
*
* @param p the predicate used to test values.
* @return an iterator which produces those values of this iterator which satisfy the predicate `p`.
* @note Reuse: $consumesAndProducesIterator
*/
def filter(p: A => Boolean): Iterator[A] = new AbstractIterator[A] {
// TODO 2.12 - Make a full-fledged FilterImpl that will reverse sense of p
private var hd: A = _
private var hdDefined: Boolean = false
def hasNext: Boolean = hdDefined || {
do {
if (!self.hasNext) return false
hd = self.next()
} while (!p(hd))
hdDefined = true
true
}
def next() = if (hasNext) { hdDefined = false; hd } else empty.next()
}
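A short usage sketch (assuming the rdd1 created above): filter keeps only the elements that satisfy the predicate, and the partition count is unchanged because MapPartitionsRDD reuses the parent's partitions.
val evens = rdd1.filter(_ % 2 == 0)
println(evens.getNumPartitions)     // 3, same as rdd1
println(evens.collect().toBuffer)   // ArrayBuffer(2, 4, 6, 8, 10)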
When filter is called, the parent RDD, the partitions to operate on, and the function are passed into a MapPartitionsRDD; its compute method then applies the supplied function to every partition of the parent RDD.
/**
* An RDD that applies the provided function to every partition of the parent RDD.
*
* @param prev the parent RDD.
* @param f The function used to map a tuple of (TaskContext, partition index, input iterator) to
* an output iterator.
* @param preservesPartitioning Whether the input function preserves the partitioner, which should
* be `false` unless `prev` is a pair RDD and the input function
* doesn't modify the keys.
* @param isFromBarrier Indicates whether this RDD is transformed from an RDDBarrier, a stage
* containing at least one RDDBarrier shall be turned into a barrier stage.
* @param isOrderSensitive whether or not the function is order-sensitive. If it's order
* sensitive, it may return totally different result when the input order
* is changed. Mostly stateful functions are order-sensitive.
*/
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
var prev: RDD[T],
f: (TaskContext, Int, Iterator[T]) => Iterator[U], // (TaskContext, partition index, iterator)
preservesPartitioning: Boolean = false,
isFromBarrier: Boolean = false,
isOrderSensitive: Boolean = false)
extends RDD[U](prev) {
override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None
override def getPartitions: Array[Partition] = firstParent[T].partitions
override def compute(split: Partition, context: TaskContext): Iterator[U] =
f(context, split.index, firstParent[T].iterator(split, context))
override def clearDependencies(): Unit = {
super.clearDependencies()
prev = null
}
@transient protected lazy override val isBarrier_ : Boolean =
isFromBarrier || dependencies.exists(_.rdd.isBarrier())
override protected def getOutputDeterministicLevel = {
if (isOrderSensitive && prev.outputDeterministicLevel == DeterministicLevel.UNORDERED) {
DeterministicLevel.INDETERMINATE
} else {
super.getOutputDeterministicLevel
}
}
}
map
The map method is essentially the same as filter: both construct a MapPartitionsRDD under the hood. The processing happens on the map side, so both are transformation operations.
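For comparison, the map implementation (paraphrased from the Spark source; it may differ slightly across versions) mirrors filter and only swaps iter.filter for iter.map:
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.map(cleanF))
}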
mapPartitions
The function passed to mapPartitions receives an iterator: data is handed over one partition at a time, and each partition is exposed as an iterator. If the job has three partitions, the mapPartitions function is invoked only three times, whereas the function passed to map is invoked once per record.
package com.doit.spark.restart
import org.apache.spark.{SparkConf, SparkContext, TaskContext}
object MapPartitionsDemo {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("MapPartitionsDemo").setMaster("local[*]")
val sc = new SparkContext(conf)
//create an RDD
val nums = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
//call map: the function inside map is invoked once for every record processed
val result = nums.map(x=>{
val index = TaskContext.getPartitionId()
(index,x*100)
})
//call mapPartitions: data is taken out one partition at a time; a partition is an iterator
//unlike map, the function passed into mapPartitions takes an iterator and also returns an iterator
val result2 = nums.mapPartitions(it=>{
val index = TaskContext.getPartitionId()
val nIt: Iterator[(Int, Int)] = it.map(x => {
(index, x * 100)
})
nIt
})
println(result2.collect().toBuffer)
result.saveAsTextFile("zzz")
}
}
/**
* Return a new RDD by applying a function to each partition of this RDD.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(_: TaskContext, _: Int, iter: Iterator[T]) => cleanedF(iter),
preservesPartitioning)
}
mapPartitionsWithIndex
Under the hood this method takes a function plus a preservesPartitioning flag.
The first argument is a function whose parameter list is an Int (the partition index) and an iterator.
The second argument, preservesPartitioning, defaults to false. It tells Spark whether the function preserves the partitioner; it should only be set to true when this is a pair RDD and the function does not modify the keys, so that later key-based operations can reuse the existing partitioning instead of shuffling again (see the sketch after the source below).
val res1 = nums.mapPartitionsWithIndex((index: Int, it) => {
it.map(e => {
s"index:$index,element:$e"
})
})
/**
* Return a new RDD by applying a function to each partition of this RDD, while tracking the index
* of the original partition.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitionsWithIndex[U: ClassTag](
f: (Int, Iterator[T]) => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(_: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
preservesPartitioning)
}
flatMap
This method is very similar to the previous transformation methods: an iterator goes in and an iterator comes out, and flatMap is applied to every element of the iterator (each element is mapped to a collection and the results are flattened).
package com.doit.spark.restart
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object FlatMap {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("MapPartitionsDemo").setMaster("local[*]")
val sc = new SparkContext(conf)
//create an RDD
val arr = Array("spark hadoop flink spark", "hadoop flink spark", "spark hadoop flink")
val lines: RDD[String] = sc.makeRDD(arr)
val value = lines.flatMap(x => x)
println(value.collect().toBuffer)
//ArrayBuffer(s, p, a, r, k, , h, a, d, o, o, p, , f, l, i, ....)
sc.stop()
}
}
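To flatten into words instead of single characters, split each line first. A minimal sketch, assuming the same sc and arr as in the example above (placed before sc.stop()):
//split each line into words, then flatten
val words = sc.makeRDD(arr).flatMap(line => line.split(" "))
println(words.collect().toBuffer)
//ArrayBuffer(spark, hadoop, flink, spark, hadoop, flink, spark, spark, hadoop, flink)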
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (_, _, iter) => iter.flatMap(cleanF))
}