What is an RDD
Simply put, an RDD is Spark's input, i.e. the data that gets fed into a computation.
Its full name is Resilient Distributed Dataset, meaning a fault-tolerant distributed dataset. Every RDD has five characteristics:
1. A list of partitions (splits). The data can be split up, just as in Hadoop, and only data that can be split can be computed in parallel.
2. A function for computing each partition; this is the compute function discussed below.
3. A list of dependencies on other RDDs. Dependencies are further divided into wide and narrow ones, and not every RDD has dependencies.
4. Optional: key-value RDDs are partitioned by hash, similar to the Partitioner interface in MapReduce, which controls which reducer a key is sent to.
5. Optional: a list of preferred locations for computing each partition; for example, the locations of an HDFS block are the preferred places to compute on that block.
Corresponding to these points, we can find the following four methods and one property in RDD:
/**
* Implemented by subclasses to return the set of partitions in this RDD. This method will only
* be called once, so it is safe to implement a time-consuming computation in it.
*/
protected def getPartitions: Array[Partition]
/**
* :: DeveloperApi ::
* Implemented by subclasses to compute a given partition.
*/
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]
/**
* Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
* be called once, so it is safe to implement a time-consuming computation in it.
*/
protected def getDependencies: Seq[Dependency[_]] = deps
/**
* Optionally overridden by subclasses to specify how they are partitioned.
*/
@transient val partitioner: Option[Partitioner] = None
/**
* Optionally overridden by subclasses to specify placement preferences.
* The input is a split (partition); the output is the list of preferred node locations for it.
*/
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
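Putting these pieces together, a minimal custom RDD only has to implement getPartitions and compute; dependencies, partitioner, and preferred locations can keep their defaults. The sketch below is a hypothetical class for illustration only (real Spark provides ParallelCollectionRDD for this job): it exposes a local array as an RDD.

import scala.reflect.ClassTag
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical example RDD: serves a local array, split into numSlices partitions.
// Note: the whole array is captured by the RDD and shipped with its tasks, which is
// fine for a sketch but not how ParallelCollectionRDD does it.
class LocalArrayRDD[T: ClassTag](sc: SparkContext, data: Array[T], numSlices: Int)
  extends RDD[T](sc, Nil) {   // Nil: this RDD has no parent dependencies (feature 3)

  private case class Slice(index: Int) extends Partition

  // Feature 1: the list of partitions
  override def getPartitions: Array[Partition] =
    (0 until numSlices).map(i => Slice(i): Partition).toArray

  // Feature 2: how to compute one partition
  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val i = split.index
    data.slice(i * data.length / numSlices, (i + 1) * data.length / numSlices).iterator
  }
  // Features 4 and 5 (partitioner, preferred locations) keep the default None / Nil.
}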
Transformations between the various RDD types
Example:
val hdfsFile = sc.textFile(args(1))
val flatMapRdd = hdfsFile.flatMap(s => s.split(" "))
val filterRdd = flatMapRdd.filter(_.length == 2)
val mapRdd = filterRdd.map(word => (word, 1))
val reduce = mapRdd.reduceByKey(_ + _)
textFile gives back a MapPartitionsRDD, i.e. a HadoopRDD with a map applied on top; in the Spark version shown here, flatMap and filter each produce another MapPartitionsRDD (older releases used dedicated FlatMappedRDD and FilteredRDD classes), and finally reduceByKey produces a ShuffledRDD.
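A quick way to check this chain is from the spark-shell, reusing the variable names above (the exact class names depend on the Spark version):

println(reduce.toDebugString)              // prints the whole lineage: ShuffledRDD at the top, HadoopRDD at the bottom
println(hdfsFile.getClass.getSimpleName)   // MapPartitionsRDD: the HadoopRDD is hidden behind textFile's map
println(reduce.getClass.getSimpleName)     // ShuffledRDD: produced by reduceByKey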
textFile(…)
[Source] SparkContext.scala
/**
* Read a text file from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI, and return it as an RDD of Strings.
*/
def textFile(
path: String,
minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
assertNotStopped()
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions).map(pair => pair._2.toString)
}
/** Get an RDD for a Hadoop file with an arbitrary InputFormat
*
* '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
* record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
* operation will create many references to the same object.
* If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
* copy them using a `map` function.
*/
def hadoopFile[K, V](
path: String,
inputFormatClass: Class[_ <: InputFormat[K, V]],
keyClass: Class[K],
valueClass: Class[V],
minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
assertNotStopped()
// A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
new HadoopRDD(
this,
confBroadcast,
Some(setInputPathsFunc),
inputFormatClass,
keyClass,
valueClass,
minPartitions).setName(path)
}
Parameters:
1. the HDFS path
2. the InputFormat class
3. the key class (the Mapper's input key type)
4. the value class (the Mapper's input value type)
What it does:
1. Saves the Hadoop configuration into a broadcast variable (it can be around 10 KB, which is fairly large).
2. Defines a function that sets the input paths on the JobConf.
3. Constructs and returns a new HadoopRDD.
Tracing the code shows that defaultMinPartitions is math.min(defaultParallelism, 2), i.e. at most 2, so in practice you usually want to set a larger value yourself.
If what you are reading is not plain text but some other Hadoop file type, use the corresponding method instead (e.g. sequenceFile, hadoopFile, or newAPIHadoopFile), as the sketch below shows.
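For example, the sketch below (the path is a placeholder) calls hadoopFile directly with the same types textFile uses; textFile is just this call plus a map that keeps only the Text value:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val raw = sc.hadoopFile(
  "hdfs:///data/input.txt",       // placeholder path
  classOf[TextInputFormat],
  classOf[LongWritable],          // key: byte offset of each line
  classOf[Text],                  // value: the line itself
  minPartitions = 8)
val lines = raw.map(pair => pair._2.toString)   // exactly what textFile does internally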
HadoopRDD
[Source] HadoopRDD.scala
override def getPartitions: Array[Partition] = {
val jobConf = getJobConf()
// add the credentials here as this can be called before SparkContext initialized
SparkHadoopUtil.get.addCredentials(jobConf)
val inputFormat = getInputFormat(jobConf)
// Compute the input splits
val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
val array = new Array[Partition](inputSplits.size)
for (i <- 0 until inputSplits.size) {
// Wrap each InputSplit in a HadoopPartition and put it into the array to return
array(i) = new HadoopPartition(id, i, inputSplits(i))
}
array
}
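Note that minPartitions is only a hint handed to InputFormat.getSplits; the actual number of partitions is whatever getSplits returns. A quick spark-shell check (placeholder path):

val rdd = sc.textFile("hdfs:///data/big.log", minPartitions = 8)
println(rdd.partitions.length)   // decided by InputFormat.getSplits, so it can exceed the hint for a large file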
override def compute(theSplit: Partition, context: TaskContext): InterruptibleIterator[(K, V)] = {
val iter = new NextIterator[(K, V)] {
// Cast the Partition to HadoopPartition
val split = theSplit.asInstanceOf[HadoopPartition]
logInfo("Input split: " + split.inputSplit)
val jobConf = getJobConf()
val inputMetrics = context.taskMetrics.getInputMetricsForReadMethod(DataReadMethod.Hadoop)
// Sets the thread local variable for the file's name
split.inputSplit.value match {
case fs: FileSplit => SqlNewHadoopRDDState.setInputFileName(fs.getPath.toString)
case _ => SqlNewHadoopRDDState.unsetInputFileName()
}
// Find a function that will return the FileSystem bytes read by this thread. Do this before
// creating RecordReader, because RecordReader's constructor might read some bytes
val bytesReadCallback = inputMetrics.bytesReadCallback.orElse {
split.inputSplit.value match {
case _: FileSplit | _: CombineFileSplit =>
SparkHadoopUtil.get.getFSBytesReadOnThreadCallback()
case _ => None
}
}
inputMetrics.setBytesReadCallback(bytesReadCallback)
var reader: RecordReader[K, V] = null
val inputFormat = getInputFormat(jobConf)
HadoopRDD.addLocalConfiguration(new SimpleDateFormat("yyyyMMddHHmm").format(createTime),
context.stageId, theSplit.index, context.attemptNumber, jobConf)
// Create the RecordReader for this InputSplit via InputFormat.getRecordReader
reader = inputFormat.getRecordReader(split.inputSplit.value, jobConf, Reporter.NULL)
// Register an on-task-completion callback to close the input stream.
context.addTaskCompletionListener{ context => closeIfNeeded() }
val key: K = reader.createKey()
val value: V = reader.createValue()
// Override NextIterator's getNext: use the reader created above to read the next record
override def getNext(): (K, V) = {
try {
finished = !reader.next(key, value)
} catch {
case eof: EOFException =>
finished = true
}
if (!finished) {
inputMetrics.incRecordsRead(1)
}
(key, value)
}
...
}
new InterruptibleIterator[(K, V)](context, iter)
}
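Notice that getNext keeps calling reader.next(key, value) against the same two Writable objects, which is exactly why hadoopFile's scaladoc warns against caching the raw records. A sketch of the recommended fix (placeholder path): copy each record to immutable values with a map before caching.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val raw = sc.hadoopFile("hdfs:///data/input.txt",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
val safe = raw.map { case (offset, line) => (offset.get, line.toString) }   // real copies, no shared Writables
safe.cache()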
override def getPreferredLocations(split: Partition): Seq[String] = {
val hsplit = split.asInstanceOf[HadoopPartition].inputSplit.value
val locs: Option[Seq[String]] = HadoopRDD.SPLIT_INFO_REFLECTIONS match {
case Some(c) =>
try {
val lsplit = c.inputSplitWithLocationInfo.cast(hsplit)
val infos = c.getLocationInfo.invoke(lsplit).asInstanceOf[Array[AnyRef]]
Some(HadoopRDD.convertSplitLocationInfo(infos))
} catch {
case e: Exception =>
logDebug("Failed to use InputSplitWithLocations.", e)
None
}
case None => None
}
// Otherwise fall back to InputSplit.getLocations to get the hosts where the split resides
locs.getOrElse(hsplit.getLocations.filter(_ != "localhost"))
}
As you can see, the compute method takes a partition (split) and returns an Iterator for traversing that split's data.
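From the driver you can also ask any RDD for its preferred locations; for a HadoopRDD they are the hosts holding the corresponding HDFS blocks. A small spark-shell sketch (placeholder path):

val rdd = sc.textFile("hdfs:///data/input.txt")
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}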
map(…)
[Source] RDD.scala
/**
* Return a new RDD by applying a function to all elements of this RDD.
*/
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
[Source] MapPartitionsRDD.scala
/**
* An RDD that applies the provided function to every partition of the parent RDD.
*/
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
prev: RDD[T],
f: (TaskContext, Int, Iterator[T]) => Iterator[U], // (TaskContext, partition index, iterator)
preservesPartitioning: Boolean = false)
extends RDD[U](prev) {
override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None
override def getPartitions: Array[Partition] = firstParent[T].partitions
override def compute(split: Partition, context: TaskContext): Iterator[U] =
f(context, split.index, firstParent[T].iterator(split, context))
}
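The partitioner line is worth noting: preservesPartitioning defaults to false, so a plain map drops the parent's partitioner (the keys may have changed), while mapValues passes preservesPartitioning = true and keeps it. A quick spark-shell check:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
  .partitionBy(new HashPartitioner(4))
println(pairs.partitioner)                   // Some(HashPartitioner)
println(pairs.map(identity).partitioner)     // None: map may change the keys
println(pairs.mapValues(_ + 1).partitioner)  // Some(HashPartitioner): keys are untouched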
Both getPartitions and compute are overridden, and both use firstParent[T]; firstParent[T] is fixed by the parent-class constructor that is invoked through inheritance:
[Source] RDD.scala
abstract class RDD[T: ClassTag](
@transient private var _sc: SparkContext,
@transient private var deps: Seq[Dependency[_]]
) extends Serializable with Logging {
/** Construct an RDD with just a one-to-one dependency on one parent */
def this(@transient oneParent: RDD[_]) =
this(oneParent.context , List(new OneToOneDependency(oneParent)))
/** Returns the first parent RDD */
protected[spark] def firstParent[U: ClassTag]: RDD[U] = {
dependencies.head.rdd.asInstanceOf[RDD[U]]
}
...
}
It assigns the parent RDD, wrapped in a OneToOneDependency, to deps, so the HadoopRDD becomes the parent dependency of the MapPartitionsRDD. OneToOneDependency is a narrow dependency: the child RDD depends directly on its parent RDD (see the sketch below).
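You can see this dependency chain from the spark-shell, reusing the names from the word-count example above:

import org.apache.spark.OneToOneDependency

println(mapRdd.dependencies.head.isInstanceOf[OneToOneDependency[_]])   // true: a narrow dependency
println(mapRdd.dependencies.head.rdd eq filterRdd)                      // true: the child points straight at its parent
println(reduce.dependencies.head)                                       // a ShuffleDependency: reduceByKey is a wide dependency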
Conclusions:
1. getPartitions simply reuses the parent RDD's partition information.
2. compute wraps the parent RDD's iterator and applies the anonymous function f to each record as it is traversed.
So the compute function plays two distinct roles:
1. When the RDD has no dependencies, it builds an Iterator over the data directly from the partition (split) information.
2. When the RDD has a parent dependency, it wraps the parent RDD's iterator so that the anonymous function f is applied to each element as it is produced (see the sketch below).
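This composition of per-partition functions can be seen with plain Scala iterators, no cluster needed; each step in the chain only wraps its parent's iterator, and nothing runs until the data is pulled:

// Pretend this iterator came from one split of the HadoopRDD.
val partitionData: Iterator[String] = Iterator("spark rdd", "spark core")
val afterFlatMap = partitionData.flatMap(_.split(" "))    // what flatMap's compute does
val afterMap     = afterFlatMap.map(word => (word, 1))    // what map's compute does
println(afterMap.toList)   // List((spark,1), (rdd,1), (spark,1), (core,1))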
Now look at the map(cleanF) inside new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF)); it is Scala's own Iterator.map:
[Source] Iterator.scala
/** Creates a new iterator that maps all produced values of this iterator
* to new values using a transformation function.
*
* @param f the transformation function
* @return a new iterator which transforms every value produced by this
* iterator by applying the function `f` to it.
* @note Reuse: $consumesAndProducesIterator
*/
def map[B](f: A => B): Iterator[B] = new AbstractIterator[B] {
def hasNext = self.hasNext
def next() = f(self.next())
}
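Iterator.map is lazy: the function runs only when an element is actually pulled, which is what lets Spark pipeline a chain of narrow transformations inside a single task. A tiny demonstration:

val it = Iterator(1, 2, 3).map { x => println(s"computing $x"); x * 2 }
println("nothing computed yet")
println(it.toList)   // the "computing ..." lines appear only now, followed by List(2, 4, 6)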
... to be continued ...