Problem description:
Non-partitioned tables written by Spark 2.4 cannot be read by the Hive 2.1.1 engine; the query fails with: `Failed with exception java.io.IOException:java.lang.ArrayIndexOutOfBoundsException: 6`
Root cause analysis:
Part 1: non-partitioned table test
--1. Create a test table
create table tmp.orc(id int, name string) stored as orc;
--2. Write via Spark SQL
insert into table tmp.orc values (1,'aaa');
insert into table tmp.orc values (2,'bbb');
--3. Query from Hive
hive> select * from tmp.orc;
OK
Failed with exception java.io.IOException:java.lang.ArrayIndexOutOfBoundsException: 6
--4. Inspect the table's files
hive> dfs -ls -h /user/hive/warehouse/tmp.db/orc;
Found 3 items
-rw-r--r-- 3 zm_app_prd hive 0 2020-10-31 11:27 /user/hive/warehouse/tmp.db/orc/_SUCCESS
-rw-r--r-- 3 zm_app_prd hive 314 2020-10-31 11:27 /user/hive/warehouse/tmp.db/orc/part-00000-60a13e5b-0f78-411a-8ba4-16394775224d-c000.snappy.orc
-rw-r--r-- 3 zm_app_prd hive 314 2020-10-31 11:27 /user/hive/warehouse/tmp.db/orc/part-00000-a54e5a56-e858-4d5a-88e6-08a7b8401078-c000.snappy.orc
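A way to check what Spark actually wrote into these files is Hive's built-in ORC dump utility, which prints a file's type description, stripe layout, and compression. This is a sketch, not something run in the original test; the path is the first data file from step 4:

```shell
# Dump ORC metadata for one of the Spark-written files (path from step 4).
hive --orcfiledump /user/hive/warehouse/tmp.db/orc/part-00000-60a13e5b-0f78-411a-8ba4-16394775224d-c000.snappy.orc
```

Comparing the dumped schema against the Hive table definition is a quick way to see whether the reader and the file disagree about the columns.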
--5. Delete the _SUCCESS file
The problem persists.
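For reference, the deletion in step 5 can be done from the Hive CLI with the same dfs syntax used in step 4 (a sketch; the listing above shows the actual path):

```sql
hive> dfs -rm /user/hive/warehouse/tmp.db/orc/_SUCCESS;
```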
Part 2: partitioned table test
-- 2.1 Create the table
create table tmp.orc_pt(id int, name string)
partitioned by (pt string)
stored as orc;
-- 2.2 Write via Spark SQL
insert into table tmp.orc_pt partition (pt = '2020-10-31') select 1,'aaa';
insert into table tmp.orc_pt partition (pt = '2020-10-31') select 2,'bbb';
-- 2.3 Query from Hive
hive> select * from tmp.orc_pt;
OK
1 aaa 2020-10-31
2 bbb 2020-10-31
-- 2.4 Inspect the table's file types
hive> dfs -ls -h /user/hive/warehouse/tmp.db/orc_pt/pt=2020-10-31;
Found 2 items
-rwxrwxrwx+ 3 zm_app_prd hive 318 2020-10-31 11:33 /user/hive/warehouse/tmp.db/orc_pt/pt=2020-10-31/part-00000-8435f05f-f03c-4247-98a5-bd9a6b9bf089-c000
-rwxrwxrwx+ 3 zm_app_prd hive 318 2020-10-31 11:33 /user/hive/warehouse/tmp.db/orc_pt/pt=2020-10-31/part-00000-acc0ed78-1471-4a1a-9339-43ae176df0ed-c000
Solution:
The tests above show that the failure is specific to how the engine reads non-partitioned tables (note also that the partitioned files carry no .snappy.orc suffix, which suggests Spark took a different write path for them). Pinning down the exact cause means digging into the Hive source code; fortunately, others have already done that work. Two workarounds:
1. Have Spark write non-partitioned tables in Text format instead of ORC.
2. Patch and recompile the Hive source; for details see: https://blog.csdn.net/lixiaoksi/article/details/106855509
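A minimal sketch of workaround 1, repeating the Part 1 test with a TEXTFILE table; the table name tmp.orc_txt is hypothetical, chosen here only for illustration:

```sql
-- Declare the table as TEXTFILE instead of ORC (workaround 1).
create table tmp.orc_txt(id int, name string) stored as textfile;

-- Write from Spark SQL, exactly as in the Part 1 test.
insert into table tmp.orc_txt values (1, 'aaa');

-- Query from Hive 2.1.1; the Text reader is not affected by this issue.
select * from tmp.orc_txt;
```

The trade-off is storage and scan cost: Text rows are larger and slower to read than ORC, so this is only a stopgap until the engine mismatch itself is fixed.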