Reading
Spark reads files with a fixed \n as the record delimiter, but in day-to-day use we sometimes need some other string to separate records. How do we change the delimiter?
1. Start from the source. SparkContext has many read methods; look at newAPIHadoopFile. From the last line of that method you can see it ultimately builds a NewHadoopRDD, a subclass of RDD with all five of the standard RDD properties (see the source for details). For now we only care about fClass, kClass and vClass. fClass must be a subclass of InputFormat; anyone who has worked with MapReduce knows InputFormat is the main class MR uses to read files, and Hive and Spark implement many InputFormats of their own, such as the ORC and Parquet readers. Since we are reading plain text files, TextInputFormat will do.
assertNotStopped()
// This is a hack to enforce loading hdfs-site.xml.
// See SPARK-11227 for details.
FileSystem.getLocal(hadoopConfiguration)
// The call to NewHadoopJob automatically adds security credentials to conf,
// so we don't need to explicitly add them ourselves
val job = NewHadoopJob.getInstance(conf)
// Use setInputPaths so that newAPIHadoopFile aligns with hadoopFile/textFile in taking
// comma separated files as input. (see SPARK-7155)
NewFileInputFormat.setInputPaths(job, path)
val updatedConf = job.getConfiguration
new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path)
The TextInputFormat class is short, but one parameter stands out: textinputformat.record.delimiter. That is the record-delimiter setting. (The snippet below is from the older mapred-API TextInputFormat; the new-API class reads the same key in createRecordReader.) If you are curious, read through the InputFormat and LineRecordReader implementations as well.
public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit, JobConf job,
                                                        Reporter reporter) throws IOException {
    reporter.setStatus(genericSplit.toString());
    String delimiter = job.get("textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter) {
        recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
    }
    return new LineRecordReader(job, (FileSplit) genericSplit, recordDelimiterBytes);
}
With that, reading a file with a custom record delimiter is simple:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat // the new-API class, as newAPIHadoopFile requires

spark.sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter", "|-|")
val rdd = spark.sparkContext.newAPIHadoopFile(path, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], spark.sparkContext.hadoopConfiguration)
val rdd1 = rdd.map { case (_, text) => text.toString }
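As a quick sanity check (the file path and contents here are made up for illustration), a file whose records are separated by |-| now comes back one record per element:

// Suppose path pointed at a hypothetical file containing the single line: a|-|b|-|c
rdd1.collect() // Array("a", "b", "c")

// and from here it is easy to go back to a DataFrame if needed:
import spark.implicits._
rdd1.toDF("value").show()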
Writing
1. Writing is more involved, because the default TextOutputFormat has no record-delimiter parameter; the only way out is a custom OutputFormat. Looking at the TextOutputFormat class, its inner record-writer class has a static initializer that hardcodes \n as the newline bytes, so this newline value is what has to change — and since the field is private static final, that means supplying our own OutputFormat (a sketch follows the snippet below).
static {
    try {
        newline = "\n".getBytes("UTF-8");
    } catch (UnsupportedEncodingException var1) {
        throw new IllegalArgumentException("can't find UTF-8 encoding");
    }
}
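Here is a minimal sketch of such a class, modeled on TextOutputFormat. The class name DelimitedTextOutputFormat and the hardcoded |-| terminator are our own choices, not anything shipped with Hadoop; it uses the old mapred API so it plugs straight into saveAsHadoopFile, and compression support is omitted for brevity.

import java.io.DataOutputStream
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, RecordWriter, Reporter}
import org.apache.hadoop.util.Progressable

// Hypothetical OutputFormat that ends each record with "|-|" instead of '\n'.
class DelimitedTextOutputFormat[K, V] extends FileOutputFormat[K, V] {
  override def getRecordWriter(ignored: FileSystem, job: JobConf, name: String,
                               progress: Progressable): RecordWriter[K, V] = {
    val delimiter = "|-|".getBytes("UTF-8")
    val file = FileOutputFormat.getTaskOutputPath(job, name)
    val out: DataOutputStream = file.getFileSystem(job).create(file, progress)
    new RecordWriter[K, V] {
      // Simplified LineRecordWriter.write: key, tab, value, then our delimiter.
      override def write(key: K, value: V): Unit = {
        if (key != null && !key.isInstanceOf[NullWritable]) {
          out.write(key.toString.getBytes("UTF-8"))
          out.write('\t')
        }
        if (value != null && !value.isInstanceOf[NullWritable]) {
          out.write(value.toString.getBytes("UTF-8"))
        }
        out.write(delimiter)
      }
      override def close(reporter: Reporter): Unit = out.close()
    }
  }
}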
Now look at how an RDD writes files — this is the body of RDD.saveAsTextFile. The only change needed is to swap TextOutputFormat for the custom OutputFormat; a usage sketch follows the snippet.
val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
val textClassTag = implicitly[ClassTag[Text]]
val r = this.mapPartitions { iter =>
  val text = new Text()
  iter.map { x =>
    text.set(x.toString)
    (NullWritable.get(), text)
  }
}
RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
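With a class like DelimitedTextOutputFormat on the classpath, only the last line above changes; equivalently, you can call saveAsHadoopFile directly from user code (the output path here is illustrative):

import org.apache.hadoop.io.{NullWritable, Text}

val pairs = rdd1.map(s => (NullWritable.get(), new Text(s)))
pairs.saveAsHadoopFile[DelimitedTextOutputFormat[NullWritable, Text]]("/tmp/out")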