Requirement: Spark reads files into an RDD, and each record in the RDD needs an extra column containing the name of the file it came from.
Approach 1: wholeTextFiles
The difference between sc.textFile() and sc.wholeTextFiles():
sc.textFile(path) reads the contents of all the files under path, treating each line of every file as one record, so the originating file name is lost.
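A minimal sketch of the difference (using the same ./data/cloud/ path as the example further down):

// sc.textFile: one record per LINE; the file name is not retained
val lines = sc.textFile("./data/cloud/")          // RDD[String]

// sc.wholeTextFiles: one record per FILE, as (filePath, fileContent)
val files = sc.wholeTextFiles("./data/cloud/")    // RDD[(String, String)]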
The Scaladoc for wholeTextFiles(path):
* Read a directory of text files from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI. Each file is read as a single record and returned in a
* key-value pair, where the key is the path of each file, the value is the content of each file.
* <p> For example, if you have the following files:
* {{{
* hdfs://a-hdfs-path/part-00000
* hdfs://a-hdfs-path/part-00001
* ...
* hdfs://a-hdfs-path/part-nnnnn
* }}}
*
* Do `val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")`,
*
* <p> then `rdd` contains
* {{{
*   (a-hdfs-path/part-00000, its content)
*   (a-hdfs-path/part-00001, its content)
*   ...
*   (a-hdfs-path/part-nnnnn, its content)
* }}}
Example:

val rdd = sc.wholeTextFiles("./data/cloud/")

// Each record in rdd is (filePath, fileContent): split the content into lines
// and pair every line with its file name. Note that prepending with :: reverses
// the line order within each file.
def myFun(iter: Iterator[(String, String)]): Iterator[(String, String)] = {
  var res = List[(String, String)]()
  while (iter.hasNext) {
    val file = iter.next()
    val fileName = file._1
    val lines = file._2.split("\n").toList
    for (line <- lines) {
      res = res.::((fileName, line))
    }
  }
  res.iterator // return only after every file in the partition has been processed
}

val rdd2 = rdd.mapPartitions(myFun)
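For reference, the same file-name/line pairing can be written without the mutable list by calling flatMap directly on the (path, content) pairs; this is an equivalent sketch, not part of the original example:

val rdd2 = rdd.flatMap { case (fileName, content) =>
  content.split("\n").map(line => (fileName, line))
}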
Approach 2: mapPartitionsWithInputSplit
val input = "C:\\Users\\dell\\Desktop\\data"
val fileRDD = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](input)
val hadoopRDD = fileRDD.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
val fileAdnLine = hadoopRDD.mapPartitionsWithInputSplit((inputSplit: InputSplit, iterator: Iterator[(LongWritable, Text)]) => {
val file = inputSplit.asInstanceOf[FileSplit]
iterator.map(x => {
//file.getPath.toString 文件的全路径
//file.getPath.getName 文件名
file.getPath.getName().split("_")(0) + "," + x._2.toString()
})
})
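A quick sanity check of the result (a sketch; the exact output depends on your data files):

fileAndLine.take(5).foreach(println)

Unlike wholeTextFiles, which returns each file as a single record and therefore holds an entire file's content in memory at once, this approach reads files line by line and takes only the file name from the InputSplit, so it also works for files too large to fit in memory.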