Reading
Spark reads files with a fixed \n as the record delimiter, but in day-to-day use we sometimes need some other string to separate records. How do we change the delimiter?
1. Start from the source. SparkContext has many read methods; look at newAPIHadoopFile. From the last line of that method you can see it ultimately builds a NewHadoopRDD, a subclass of RDD with all five of the standard RDD properties (see the source for details). For now we only care about fClass, kClass and vClass. fClass must be a subclass of InputFormat; anyone who has worked with MapReduce knows InputFormat is the main class MR uses to read files, and Hive and Spark implement many InputFormats of their own, such as the ORC and Parquet readers. Since we are reading plain text files, TextInputFormat will do.
assertNotStopped()
// This is a hack to enforce loading hdfs-site.xml.
// See SPARK-11227 for details.
FileSystem.getLocal(hadoopConfiguration)
// The call to NewHadoopJob automatically adds security credentials to conf,
// so we don't need to explicitly add them ourselves
val job = NewHadoopJob.getInstance(conf)
// Use setInputPaths so that newAPIHadoopFile aligns with hadoopFile/textFile in taking
// comma separated files as input. (see SPARK-7155)
NewFileInputFormat.setInputPaths(job, path)
val updatedConf = job.getConfiguration
new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf).setName(path)
The TextInputFormat class is short, but one parameter stands out: textinputformat.record.delimiter. That is the record-delimiter setting. (The snippet below is from the older mapred-API TextInputFormat; the new-API class reads the same key in createRecordReader.) If you are curious, read through the InputFormat and LineRecordReader implementations as well.
public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit, JobConf job,
                                                        Reporter reporter) throws IOException {
    reporter.setStatus(genericSplit.toString());
    String delimiter = job.get("textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter) {
        recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
    }
    return new LineRecordReader(job, (FileSplit) genericSplit, recordDelimiterBytes);
}
With that, reading a file with a custom record delimiter is simple:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat // the new-API class, as newAPIHadoopFile requires

spark.sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter", "|-|")
val rdd = spark.sparkContext.newAPIHadoopFile(path, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], spark.sparkContext.hadoopConfiguration)
val rdd1 = rdd.map { case (_, text) => text.toString }
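As a quick sanity check (the file path and contents here are made up for illustration), a file whose records are separated by |-| now comes back one record per element:

// Suppose path pointed at a hypothetical file containing the single line: a|-|b|-|c
rdd1.collect() // Array("a", "b", "c")

// and from here it is easy to go back to a DataFrame if needed:
import spark.implicits._
rdd1.toDF("value").show()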
Writing
1. Writing is more involved, because the default TextOutputFormat has no record-delimiter parameter; the only way out is a custom OutputFormat. Looking at the TextOutputFormat class, its inner record-writer class has a static initializer that hardcodes \n as the newline bytes, so this newline value is what has to change — and since the field is private static final, that means supplying our own OutputFormat (a sketch follows the snippet below).
static {
    try {
        newline = "\n".getBytes("UTF-8");
    } catch (UnsupportedEncodingException var1) {
        throw new IllegalArgumentException("can't find UTF-8 encoding");
    }
}
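Here is a minimal sketch of such a class, modeled on TextOutputFormat. The class name DelimitedTextOutputFormat and the hardcoded |-| terminator are our own choices, not anything shipped with Hadoop; it uses the old mapred API so it plugs straight into saveAsHadoopFile, and compression support is omitted for brevity.

import java.io.DataOutputStream
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, RecordWriter, Reporter}
import org.apache.hadoop.util.Progressable

// Hypothetical OutputFormat that ends each record with "|-|" instead of '\n'.
class DelimitedTextOutputFormat[K, V] extends FileOutputFormat[K, V] {
  override def getRecordWriter(ignored: FileSystem, job: JobConf, name: String,
                               progress: Progressable): RecordWriter[K, V] = {
    val delimiter = "|-|".getBytes("UTF-8")
    val file = FileOutputFormat.getTaskOutputPath(job, name)
    val out: DataOutputStream = file.getFileSystem(job).create(file, progress)
    new RecordWriter[K, V] {
      // Simplified LineRecordWriter.write: key, tab, value, then our delimiter.
      override def write(key: K, value: V): Unit = {
        if (key != null && !key.isInstanceOf[NullWritable]) {
          out.write(key.toString.getBytes("UTF-8"))
          out.write('\t')
        }
        if (value != null && !value.isInstanceOf[NullWritable]) {
          out.write(value.toString.getBytes("UTF-8"))
        }
        out.write(delimiter)
      }
      override def close(reporter: Reporter): Unit = out.close()
    }
  }
}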
Now look at how an RDD writes files — this is the body of RDD.saveAsTextFile. The only change needed is to swap TextOutputFormat for the custom OutputFormat; a usage sketch follows the snippet.
val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
val textClassTag = implicitly[ClassTag[Text]]
val r = this.mapPartitions { iter =>
  val text = new Text()
  iter.map { x =>
    text.set(x.toString)
    (NullWritable.get(), text)
  }
}
RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
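With a class like DelimitedTextOutputFormat on the classpath, only the last line above changes; equivalently, you can call saveAsHadoopFile directly from user code (the output path here is illustrative):

import org.apache.hadoop.io.{NullWritable, Text}

val pairs = rdd1.map(s => (NullWritable.get(), new Text(s)))
pairs.saveAsHadoopFile[DelimitedTextOutputFormat[NullWritable, Text]]("/tmp/out")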