First, some background:

We consume RMQ data via Spring and write it to HDFS. From the start we anticipated that writing plain textfiles directly would suffer from broken and interleaved lines, so the initial plan was to write Parquet. Validation showed, however, that writing Parquet produces many small files (once a Parquet file lands it can be neither modified nor appended to), which puts extra pressure on the NameNode. So the final compromise was textfile plus a custom record delimiter.
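The embedded-newline problem, and why a custom delimiter avoids it, can be sketched in plain Java. This is a minimal illustration with no Hadoop dependency; the delimiter "\u0001\n" and the record contents are hypothetical, chosen only for the example:

```java
import java.util.Arrays;
import java.util.List;

public class DelimiterRoundTrip {
    // Hypothetical delimiter: a byte sequence very unlikely to occur in the
    // payload, instead of a bare "\n" which message bodies may contain.
    static final String DELIM = "\u0001\n";

    // Writer side: join records with the custom delimiter before flushing to HDFS.
    static String encode(List<String> records) {
        return String.join(DELIM, records);
    }

    // Reader side: split on the same sequence (conceptually what setting
    // textinputformat.record.delimiter achieves in MapReduce).
    static List<String> decode(String data) {
        return Arrays.asList(data.split(DELIM, -1));
    }

    public static void main(String[] args) {
        // This record contains an embedded newline; with "\n" as the record
        // delimiter it would be read back as two broken lines.
        List<String> records = Arrays.asList("id=1\tbody=hello\nworld", "id=2\tbody=ok");
        System.out.println(decode(encode(records)).equals(records)); // prints "true"
    }
}
```

With "\n" as the delimiter the first record would shatter into two lines; with the rarer sequence the round trip is lossless.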
The default TextInputFormat lives in the hadoop-mapreduce-client-core package; the key code:
public RecordReader<LongWritable, Text> getRecordReader(
    InputSplit genericSplit, JobConf job,
    Reporter reporter)
    throws IOException {

  reporter.setStatus(genericSplit.toString());
  String delimiter = job.get("textinputformat.record.delimiter");
  byte[] recordDelimiterBytes = null;
  if (null != delimiter) {
    recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
  }
  return new LineRecordReader(job, (FileSplit) genericSplit,
      recordDelimiterBytes);
}
From the source we can see that the record delimiter can be specified via the textinputformat.record.delimiter parameter, and testing confirmed that this works. (Why we still needed a custom InputFormat is a question we will come back to later.)
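To see what the reader does with recordDelimiterBytes, here is a minimal sketch of delimiter matching in plain Java, with no Hadoop dependency. The delimiter and the input bytes are made up for illustration, and the sketch assumes the delimiter has no self-overlapping prefix; it is not Hadoop's actual LineReader implementation:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CustomDelimiterScan {
    // Scan the data byte by byte, tracking how many bytes of the
    // delimiter have matched so far; a full match ends the current record.
    static List<String> split(byte[] data, byte[] delim) {
        List<String> records = new ArrayList<>();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int matched = 0; // delimiter bytes matched so far
        for (byte b : data) {
            if (b == delim[matched]) {
                if (++matched == delim.length) { // full delimiter: record ends
                    records.add(new String(buf.toByteArray(), StandardCharsets.UTF_8));
                    buf.reset();
                    matched = 0;
                }
            } else {
                // the partially matched prefix was really record data
                buf.write(delim, 0, matched);
                matched = 0;
                if (b == delim[0]) matched = 1;
                else buf.write(b);
            }
        }
        buf.write(delim, 0, matched); // a trailing partial match is data too
        if (buf.size() > 0) records.add(new String(buf.toByteArray(), StandardCharsets.UTF_8));
        return records;
    }

    public static void main(String[] args) {
        byte[] delim = "\u0001\n".getBytes(StandardCharsets.UTF_8);
        byte[] data = "line one\nstill record 1\u0001\nrecord 2".getBytes(StandardCharsets.UTF_8);
        System.out.println(split(data, delim).size()); // prints "2": the bare "\n" stays inside record 1
    }
}
```

In a real job the delimiter is simply set on the Configuration, e.g. conf.set("textinputformat.record.delimiter", "\u0001\n"), and LineRecordReader performs this kind of matching against the split's byte stream.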
Continuing down into LineRecordReader, the key code:
public LineRecordReader(Configuration job, FileSplit split,
    byte[] recordDelimiter) throws IOException {
  this.maxLineLength = job.getInt(org.apache.hadoop.mapreduce.lib.input.
      LineRecordReader.MAX_LINE_LENGTH, Integer.MAX_VALUE);
  start = split.getStart();
  end = start + split.getLength();
  final Path file = split.getPath();
  compressionCodecs = new CompressionCodecFactory(job);
  codec = compressionCodecs.getCodec(file);