Hive 2.1.1 reading ORC written by Spark: ORC split generation failed with exception: ArrayIndexOutOfBoundsException: 6

beetle_lzk

Article link: https://blog.csdn.net/lixiaoksi/article/details/106855509

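Problem description: a Spark job reads data from Kafka and writes it to a Hive table stored as ORC. The data is written correctly, but querying the table from the Hive client fails with this error: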

Failed with exception java.io.IOException:java.lang.RuntimeException: ORC split generation failed with exception: java.lang.ArrayIndexOutOfBoundsException: 6

CDH version: 6.1.1

Spark version: 2.4.0+cdh6.1.1

Kafka version: 2.0.0+cdh6.1.1

Hive version: 2.1.1+cdh6.1.1

HDFS version: 3.0.0+cdh6.1.1

Hive table DDL:

create table test_orc(
name string,
age int,
sex string,
birth string) stored as orc;

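After starting the Spark job to write data, the generated files are visible on HDFS.

Querying the table from the Hive client fails with the error shown above.

Inspecting the ORC file with hive --orcfiledump fails as well.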

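I searched Baidu for a solution and tried adjusting various parameters, none of which helped. Then I happened to come across a bug fix submitted to the ORC project (linked in the original post).

The code that fix addresses returns FUTURE only when the writer version number read from the file is exactly equal to FUTURE.id; every other value is used directly as an array index.

The WriterVersion source is as follows: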

  /**
   * Records the version of the writer in terms of which bugs have been fixed.
   * For bugs in the writer, but the old readers already read the new data
   * correctly, bump this version instead of the Version.
   */
  public enum WriterVersion {
    ORIGINAL(0),
    HIVE_8732(1), // corrupted stripe/file maximum column statistics
    HIVE_4243(2), // use real column names from Hive tables
    HIVE_12055(3), // vectorized writer
    HIVE_13083(4), // decimal writer updating present stream wrongly

    // Don't use any magic numbers here except for the below:
    FUTURE(Integer.MAX_VALUE); // a version from a future writer

    private final int id;

    public int getId() {
      return id;
    }

    WriterVersion(int id) {
      this.id = id;
    }

    private static final WriterVersion[] values;
    static {
      // Assumes few non-negative values close to zero.
      int max = Integer.MIN_VALUE;
      for (WriterVersion v : WriterVersion.values()) {
        if (v.id < 0) throw new AssertionError();
        if (v.id > max && FUTURE.id != v.id) {
          max = v.id;
        }
      }
      values = new WriterVersion[max + 1];
      for (WriterVersion v : WriterVersion.values()) {
        if (v.id < values.length) {
          values[v.id] = v;
        }
      }
    }

    public static WriterVersion from(int val) {
      if (val == FUTURE.id) return FUTURE; // Special handling for the magic value.
      return values[val];
    }
  }
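To make the failure concrete, here is a minimal sketch that copies the ids and lookup logic of the enum above into a hypothetical stand-in (DemoWriterVersion and WriterVersionRepro are names invented for this illustration, not Hive classes). It assumes only what the source above shows: the lookup array has max id + 1 = 5 slots, so any id from 5 up to Integer.MAX_VALUE - 1 indexes past its end.

```java
class WriterVersionRepro {
  // Hypothetical stand-in for OrcFile.WriterVersion: same ids, same lookup logic.
  enum DemoWriterVersion {
    ORIGINAL(0), HIVE_8732(1), HIVE_4243(2), HIVE_12055(3), HIVE_13083(4),
    FUTURE(Integer.MAX_VALUE);

    private final int id;

    DemoWriterVersion(int id) {
      this.id = id;
    }

    private static final DemoWriterVersion[] values;
    static {
      int max = Integer.MIN_VALUE;
      for (DemoWriterVersion v : DemoWriterVersion.values()) {
        if (v.id > max && FUTURE.id != v.id) {
          max = v.id;
        }
      }
      values = new DemoWriterVersion[max + 1]; // 5 slots: ids 0..4
      for (DemoWriterVersion v : DemoWriterVersion.values()) {
        if (v.id < values.length) {
          values[v.id] = v;
        }
      }
    }

    // Unpatched lookup: only the exact magic value maps to FUTURE.
    static DemoWriterVersion from(int val) {
      if (val == FUTURE.id) return FUTURE;
      return values[val];
    }
  }

  // Returns the resolved constant name, or the exception class name on failure.
  static String lookup(int val) {
    try {
      return DemoWriterVersion.from(val).name();
    } catch (ArrayIndexOutOfBoundsException e) {
      return e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    System.out.println(lookup(4)); // HIVE_13083
    System.out.println(lookup(6)); // ArrayIndexOutOfBoundsException
  }
}
```

The exception class name is returned rather than the message because the message text differs between JDK versions.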

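From the error message, the exception is thrown in the WriterVersion.from method in OrcFile.

I opened the source and started debugging. To set up the source environment, see the article "hive2.1.1源码编译及调试" (compiling and debugging the Hive 2.1.1 source).

After tracing through the call chain, I found the actual call in OrcTail.java: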

public OrcFile.WriterVersion getWriterVersion() {
  OrcProto.PostScript ps = fileTail.getPostscript();
  return (ps.hasWriterVersion()
      ? OrcFile.WriterVersion.from(ps.getWriterVersion()) : OrcFile.WriterVersion.ORIGINAL);
}

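Next, look at getWriterVersion() on OrcProto.PostScript, that is, the getWriterVersion() method of the generated PostScript class. The field it returns is assigned while the message is parsed, by writerVersion_ = input.readUInt32(); and readUInt32() in turn delegates to readRawVarint32():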

/**
   * Read a raw Varint from the stream.  If larger than 32 bits, discard the
   * upper bits.
   */
  public int readRawVarint32() throws IOException {
    byte tmp = readRawByte();
    if (tmp >= 0) {
      return tmp;
    }
    int result = tmp & 0x7f;
    if ((tmp = readRawByte()) >= 0) {
      result |= tmp << 7;
    } else {
      result |= (tmp & 0x7f) << 7;
      if ((tmp = readRawByte()) >= 0) {
        result |= tmp << 14;
      } else {
        result |= (tmp & 0x7f) << 14;
        if ((tmp = readRawByte()) >= 0) {
          result |= tmp << 21;
        } else {
          result |= (tmp & 0x7f) << 21;
          result |= (tmp = readRawByte()) << 28;
          if (tmp < 0) {
            // Discard upper 32 bits.
            for (int i = 0; i < 5; i++) {
              if (readRawByte() >= 0) {
                return result;
              }
            }
            throw InvalidProtocolBufferException.malformedVarint();
          }
        }
      }
    }
    return result;
  }

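This method returns 6. I did not work through the decoding in detail, but the first branch gives the most likely explanation: a first byte below 0x80 (tmp >= 0) is returned unchanged, so the file's PostScript simply stores writer version 6. Either way, this is where the 6 comes from, and all that remains is to handle the exception caused by a value that does not exist in the WriterVersion enum.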
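As a hedged illustration of how a varint yields a value such as 6, the sketch below decodes the encoding with a plain loop. It is a simplified stand-in, not the protobuf implementation: VarintDemo and its readVarint32 are hypothetical names for this example, and unlike the unrolled method above it rejects encodings longer than five bytes instead of discarding the upper bits. Each byte carries 7 payload bits, low-order group first, and the high bit marks that another byte follows.

```java
class VarintDemo {
  // Decodes one unsigned 32-bit varint from the start of buf.
  static int readVarint32(byte[] buf) {
    int result = 0;
    int pos = 0;
    for (int shift = 0; shift < 35; shift += 7) {
      if (pos >= buf.length) {
        throw new IllegalArgumentException("truncated varint");
      }
      byte b = buf[pos++];
      result |= (b & 0x7f) << shift; // append the next 7 payload bits
      if (b >= 0) {                  // high bit clear: this was the last byte
        return result;
      }
    }
    throw new IllegalArgumentException("varint longer than 5 bytes");
  }

  public static void main(String[] args) {
    // A single byte below 0x80 decodes to itself.
    System.out.println(readVarint32(new byte[] {0x06}));              // 6
    // Two bytes: 0x2C | (0x02 << 7) = 44 + 256.
    System.out.println(readVarint32(new byte[] {(byte) 0xAC, 0x02})); // 300
  }
}
```

Under this reading, a writer version of 6 occupies a single byte 0x06 in the PostScript, which is why the first branch of the real method returns it directly.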

The change is the same as the code submitted in the ORC bug fix:

Before: if (val == FUTURE.id) return FUTURE;
After:  if (val >= values.length) { return FUTURE; }

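With this change, any version value greater than or equal to the length of the WriterVersion values array is treated as an unknown (future) version, rather than only the value that exactly matches FUTURE.id. FUTURE.id itself (Integer.MAX_VALUE) still satisfies the new condition, so existing behavior is preserved, and the out-of-bounds array access no longer occurs.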
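The effect of the patched check can be sketched in isolation. The class below is a hypothetical stand-in (PatchedWriterVersionDemo and its nested WriterVersion are illustration names, not the Hive classes) that keeps the ids from the source above and applies the one-line change to from().

```java
class PatchedWriterVersionDemo {
  // Hypothetical copy of the enum with the patched from() method.
  enum WriterVersion {
    ORIGINAL(0), HIVE_8732(1), HIVE_4243(2), HIVE_12055(3), HIVE_13083(4),
    FUTURE(Integer.MAX_VALUE);

    private final int id;

    WriterVersion(int id) {
      this.id = id;
    }

    private static final WriterVersion[] values;
    static {
      int max = Integer.MIN_VALUE;
      for (WriterVersion v : WriterVersion.values()) {
        if (v.id > max && FUTURE.id != v.id) {
          max = v.id;
        }
      }
      values = new WriterVersion[max + 1];
      for (WriterVersion v : WriterVersion.values()) {
        if (v.id < values.length) {
          values[v.id] = v;
        }
      }
    }

    // Patched lookup: every id beyond the known range maps to FUTURE.
    static WriterVersion from(int val) {
      if (val >= values.length) {
        return FUTURE;
      }
      return values[val];
    }
  }

  public static void main(String[] args) {
    System.out.println(WriterVersion.from(4));                 // HIVE_13083
    System.out.println(WriterVersion.from(6));                 // FUTURE
    System.out.println(WriterVersion.from(Integer.MAX_VALUE)); // FUTURE
  }
}
```

Known ids resolve exactly as before; only ids the reader has never heard of change behavior, from an exception to FUTURE.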

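That completes the code change.

Next, compile the modified classes and repackage them. The affected modules are hive-exec and hive-orc, so both projects must be built and packaged.

Run mvn clean install -DskipTests in each module (this skips running the tests), then replace the corresponding jar files under hive/lib on the cluster with the newly built jars.

Querying the table again now succeeds. Problem solved.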
