1. Reading files
Read from a local file: sparkContext.textFile("abc.txt")
Read from an HDFS file: sparkContext.textFile("hdfs://s1:8020/user/hdfs/input")
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path,
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
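The trailing map(pair => pair._2.toString) drops the key produced by TextInputFormat (the byte offset of each line within the file) and keeps only the line text. A minimal sketch that calls hadoopFile directly (assuming a SparkContext named sc and a local file abc.txt) makes the hidden keys visible:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// hadoopFile yields (byte offset, line) pairs; textFile keeps only the values
val byOffset = sc.hadoopFile("abc.txt", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], 2)
byOffset.map { case (offset, line) => s"$offset -> $line" }.collect().foreach(println)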
The second parameter of textFile is the minimum number of partitions. It defaults to defaultMinPartitions, which is math.min(defaultParallelism, 2), so normally 2; since it is only a lower bound, the actual partition count can be larger. A custom minimum can also be passed in.
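A small usage sketch (assuming a running SparkContext sc and an input file large enough to be split):

// ask for at least 8 input splits; the actual number depends on file size and block layout
val lines = sc.textFile("hdfs://s1:8020/user/hdfs/input", minPartitions = 8)
println(lines.getNumPartitions)  // >= 8 for a sufficiently large, splittable file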
Next, let's look at the implementation of hadoopFile:
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()

  // This is a hack to enforce loading hdfs-site.xml.
  // See SPARK-11227 for details.
  FileSystem.getLocal(hadoopConfiguration)

  // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}
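Note that the path string is handed straight to FileInputFormat.setInputPaths, so a comma-separated list of paths (and globs, which FileInputFormat expands) also works. A quick sketch, assuming the paths exist on the cluster:

// one RDD covering every file matched by both path expressions
val logs = sc.textFile("hdfs://s1:8020/logs/2020-*,hdfs://s1:8020/user/hdfs/input")
println(logs.name)               // the original path string, set by setName(path)
println(logs.partitions.length)  // one partition per input split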
What is returned here is a HadoopRDD. The two most important things an RDD defines are how it is partitioned and the preferred location (PreferredLocation) of each partition.
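To make these two responsibilities concrete, here is a toy RDD, not taken from the Spark source, that overrides both methods with hard-coded answers (illustrative only; the name TinyRDD and the fake host names are made up):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class TinyRDD(sc: SparkContext) extends RDD[Int](sc, Nil) {
  // "how to partition": two fixed partitions
  override def getPartitions: Array[Partition] =
    Array.tabulate[Partition](2)(i => new Partition { override def index: Int = i })
  // "where each partition prefers to run": a fake host per partition
  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(s"host${split.index}")
  // what a task actually computes for a partition
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)
}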
For the first question, partitioning, let's look at getPartitions:
override def getPartitions: Array[Partition] = {
  val jobConf = getJobConf()
  // add the credentials here as this can be called before SparkContext