Incorrect query results for array-type data after changing the table format to ORC: troubleshooting and solution

Description:

Create a partitioned text-format table and load one partition. After the table's file format is changed to ORC, queries on an array-type column may return incorrect results.

Steps to reproduce:

  1. First, create a text-format table with an array-type field in Hive.
  create table test_text_orc (
    col_int bigint,
    col_text string,
    col_array array<string>,
    col_map map<string, string>
  )
  PARTITIONED BY (
    day string
  )
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY ']'
    MAP KEYS TERMINATED BY ':'
  ;
  2. Create a new text file named hive-orc-text-file-array-error-test.txt with the following content.
1,text_value1,array_value1]array_value2]array_value3, map_key1:map_value1,map_key2:map_value2
2,text_value2,array_value4, map_key1:map_value3
,text_value3,, map_key1:]map_key3:map_value3
  3. Load the data into one partition.
  LOAD DATA LOCAL INPATH './hive-orc-text-file-array-error-test.txt' OVERWRITE INTO TABLE test_text_orc PARTITION (day=20170329);
  4. Query the data to verify the result.
hive> select * from test.test_text_orc;
OK
1   text_value1 ["array_value1","array_value2","array_value3"]  {" map_key1":"map_value1","map_key2":"map_value2"}  20170329
2   text_value2 ["array_value4"]    {"map_key1":"map_value3"}   20170329
NULL    text_value3 []  {" map_key1":"","map_key3":"map_value3"}    20170329
  5. Change the file format of the table to ORC.
 alter table test_text_orc set fileformat orc;
  6. Query again. The array column now returns wrong values for the second and third rows.
hive> select * from test.test_text_orc;
OK
1   text_value1 ["array_value1","array_value2","array_value3"]  {" map_key1":"map_value1","map_key2":"map_value2"}  20170329
2   text_value2 ["array_value4","array_value2","array_value3"]  {"map_key1":"map_value3"}   20170329
NULL    text_value3 ["array_value4","array_value2","array_value3"]  {"map_key3":"map_value3"," map_key1":""}    20170329

Reason Analysis

The ObjectInspectorConverters$ListConverter instance does not clear the data left over from the previous record.
When the array in the current row is shorter than the array in the previous row, the reused list is only partially overwritten,
and the elements that were not overwritten appear in the output.

Code Analysis

In FetchOperator.nextRow, the value is first deserialized with currSerDe, which is the SerDe of the partition.
The result is then passed to ObjectConverter, which is an instance of ObjectInspectorConverters$StructConverter.

          Object deserialized = currSerDe.deserialize(value);
          if (ObjectConverter != null) {
            deserialized = ObjectConverter.convert(deserialized);
          }

StructConverter.convert reads each field value in turn and converts it with the corresponding field converter.
After the table format is changed to ORC, the converter for an array-type field is ObjectInspectorConverters$ListConverter.

    @Override
    public Object convert(Object input) {
      if (input == null) {
        return null;
      }
      int minFields = Math.min(inputFields.size(), outputFields.size());
      // Convert the fields
      for (int f = 0; f < minFields; f++) {
        Object inputFieldValue = inputOI.getStructFieldData(input, inputFields.get(f));
        Object outputFieldValue = fieldConverters.get(f).convert(inputFieldValue);
        outputOI.setStructFieldData(output, outputFields.get(f), outputFieldValue);
      }
      // set the extra fields to null
      for (int f = minFields; f < outputFields.size(); f++) {
        outputOI.setStructFieldData(output, outputFields.get(f), null);
      }
      return output;
    }
  }

ObjectInspectorConverters$ListConverter.convert first makes sure there is a separate element converter for each element.
It then calls outputOI.resize(output, size).
Finally, it stores every converted element in the output list through outputOI.set.

    @Override
    public Object convert(Object input) {
      if (input == null) {
        return null;
      }
      // Create enough elementConverters
      // NOTE: we have to have a separate elementConverter for each element,
      // because the elementConverters can reuse the internal object.
      // So it's not safe to use the same elementConverter to convert multiple
      // elements.
      int size = inputOI.getListLength(input);
      while (elementConverters.size() < size) {
        elementConverters.add(getConverter(inputElementOI, outputElementOI));
      }
      // Convert the elements
      outputOI.resize(output, size);
      for (int index = 0; index < size; index++) {
        Object inputElement = inputOI.getListElement(input, index);
        Object outputElement = elementConverters.get(index).convert(
            inputElement);
        outputOI.set(output, index, outputElement);
      }
      return output;
    }
  }

The problem is in the resize method: it does not remove the previous row's data and simply calls ensureCapacity, which only reserves internal capacity and leaves the size and contents of the list unchanged.
When the array in the current row is shorter than the array in the previous row, only the leading elements are overwritten, and the remaining elements from the previous row are output.


  public Object resize(Object list, int newSize) {
    ((ArrayList) list).ensureCapacity(newSize);
    return list;
  }
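
The effect can be reproduced outside Hive with a plain ArrayList. The following is a minimal sketch, not Hive code: the class StaleListDemo and its convert method are hypothetical stand-ins for the converter loop, and only resize mirrors the method shown above. It reuses one output list across three rows, as the converter does, and yields the same wrong arrays as the query result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; it only imitates the list handling described above.
class StaleListDemo {
    // Same body as the original resize: reserves capacity, but size() and contents stay as they were.
    static Object resize(Object list, int newSize) {
        ((ArrayList<?>) list).ensureCapacity(newSize);
        return list;
    }

    // Stand-in for the converter loop: overwrite (or append) the first input.size() slots.
    static List<String> convert(ArrayList<String> output, List<String> input) {
        resize(output, input.size());
        for (int i = 0; i < input.size(); i++) {
            if (i < output.size()) {
                output.set(i, input.get(i));
            } else {
                output.add(input.get(i));
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // One output list reused for every row, like the converter's output object.
        ArrayList<String> output = new ArrayList<String>();
        // prints [array_value1, array_value2, array_value3]
        System.out.println(convert(output, Arrays.asList("array_value1", "array_value2", "array_value3")));
        // prints [array_value4, array_value2, array_value3]: two stale elements survive
        System.out.println(convert(output, Arrays.asList("array_value4")));
        // prints [array_value4, array_value2, array_value3]: the empty row overwrites nothing
        System.out.println(convert(output, new ArrayList<String>()));
    }
}
```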

## Fix
Replace the resize method above with the following code, which clears the list so that no elements from the previous row remain before the new elements are set.

  public Object resize(Object list, int newSize) {
    ((ArrayList) list).clear();
    return list;
  }
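
Because clear() leaves the list empty, the later set calls have to grow it again. A bare ArrayList.set on an empty list throws IndexOutOfBoundsException, so the sketch below assumes that the inspector's set pads the list with nulls up to the requested index. That padding behaviour is an assumption, not something shown in the code above, and the class ClearingResizeDemo with its set and convert helpers is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; only resize matches the amended method above.
class ClearingResizeDemo {
    // Amended resize: discard everything left over from the previous row.
    static Object resize(Object list, int newSize) {
        ((ArrayList<?>) list).clear();
        return list;
    }

    // Assumed behaviour of the inspector's set(): pad with nulls up to index, then overwrite.
    static void set(ArrayList<String> list, int index, String element) {
        while (list.size() <= index) {
            list.add(null);
        }
        list.set(index, element);
    }

    // Stand-in for the converter loop, using the reused output list.
    static List<String> convert(ArrayList<String> output, List<String> input) {
        resize(output, input.size());
        for (int i = 0; i < input.size(); i++) {
            set(output, i, input.get(i));
        }
        return output;
    }

    public static void main(String[] args) {
        ArrayList<String> output = new ArrayList<String>();
        // prints [array_value1, array_value2, array_value3]
        System.out.println(convert(output, Arrays.asList("array_value1", "array_value2", "array_value3")));
        // prints [array_value4]
        System.out.println(convert(output, Arrays.asList("array_value4")));
        // prints []
        System.out.println(convert(output, new ArrayList<String>()));
    }
}
```

With the list cleared on every row, each output array contains only the elements of the current row, which matches the result seen before the format change.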