Requirement: Spark reads files into an RDD, and each record in the RDD needs an extra column containing the name of the file it came from.
Approach 1: wholeTextFiles
The difference between sc.textFile() and sc.wholeTextFiles():
sc.textFile(path) reads the contents of all the files under path, treating each line of every file as one record, so the originating file name is lost.
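A minimal sketch of the difference (using the same ./data/cloud/ path as the example further down):

// sc.textFile: one record per LINE; the file name is not retained
val lines = sc.textFile("./data/cloud/")          // RDD[String]

// sc.wholeTextFiles: one record per FILE, as (filePath, fileContent)
val files = sc.wholeTextFiles("./data/cloud/")    // RDD[(String, String)]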
The Scaladoc for wholeTextFiles(path):
* Read a directory of text files from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI. Each file is read as a single record and returned in a
* key-value pair, where the key is the path of each file, the value is the content of each file.
* <p> For example, if you have the following files:
* {{{
* hdfs://a-hdfs-path/part-00000
* hdfs://a-hdfs-path/part-00001
* ...
* hdfs://a-hdfs-path/part-nnnnn
* }}}
*
* Do `val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")`,
*
* <p> then `rdd` contains
* {{{
*   (a-hdfs-path/part-00000, its content)
*   (a-hdfs-path/part-00001, its content)
*   ...
*   (a-hdfs-path/part-nnnnn, its content)
* }}}
Example:

val rdd = sc.wholeTextFiles("./data/cloud/")

// Each record in rdd is (filePath, fileContent): split the content into lines
// and pair every line with its file name. Note that prepending with :: reverses
// the line order within each file.
def myFun(iter: Iterator[(String, String)]): Iterator[(String, String)] = {
  var res = List[(String, String)]()
  while (iter.hasNext) {
    val file = iter.next()
    val fileName = file._1
    val lines = file._2.split("\n").toList
    for (line <- lines) {
      res = res.::((fileName, line))
    }
  }
  res.iterator // return only after every file in the partition has been processed
}

val rdd2 = rdd.mapPartitions(myFun)
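For reference, the same file-name/line pairing can be written without the mutable list by calling flatMap directly on the (path, content) pairs; this is an equivalent sketch, not part of the original example:

val rdd2 = rdd.flatMap { case (fileName, content) =>
  content.split("\n").map(line => (fileName, line))
}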
Approach 2: mapPartitionsWithInputSplit
val input = "C:\\Users\\dell\\Desktop\\data"
val fileRDD = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](input)
val hadoopRDD = fileRDD.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
val fileAdnLine = hadoopRDD.mapPartitionsWithInputSplit((inputSplit: InputSplit, iterator: Iterator[(LongWritable, Text)]) => {
val file = inputSplit.asInstanceOf[FileSplit]
iterator.map(x => {
//file.getPath.toString 文件的全路径
//file.getPath.getName 文件名
file.getPath.getName().split("_")(0) + "," + x._2.toString()
})
})
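A quick sanity check of the result (a sketch; the exact output depends on your data files):

fileAndLine.take(5).foreach(println)

Unlike wholeTextFiles, which returns each file as a single record and therefore holds an entire file's content in memory at once, this approach reads files line by line and takes only the file name from the InputSplit, so it also works for files too large to fit in memory.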