Code Analysis of HiveHBaseTableInputFormat

vah101

Article link: https://blog.csdn.net/vah101/article/details/22854541

In the Hive source tree, the code that integrates Hive with HBase lives under src/hbase-handler/src/java/org/apache/hadoop/hive/hbase. The main functionality is implemented in HBaseStorageHandler.java, HiveHBaseTableInputFormat.java, HiveHBaseTableOutputFormat.java, LazyHBaseRow.java, LazyHBaseCellMap.java, and HiveHFileOutputFormat.java.

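Of these, HBaseStorageHandler.java is where Hive calls the HBase API to operate on HBase; HBaseSerDe.java handles serialization and deserialization, mapping between HBase's data representation and Hive's; HiveHBaseTableInputFormat.java implements reads of HBase data from Hive; HiveHBaseTableOutputFormat.java implements writes from Hive to HBase; and HiveHFileOutputFormat.java can write Hive query results out as HFiles.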

The rest of this post focuses on the HiveHBaseTableInputFormat class.

public class HiveHBaseTableInputFormat extends TableInputFormatBase

Its parent class, TableInputFormatBase, is org.apache.hadoop.hbase.mapreduce.TableInputFormatBase from the HBase codebase. It is mainly used during the map phase to fetch data through a scan.

The main functions in this class are:

- getRecordReader(InputSplit, JobConf, Reporter)
- convertFilter(JobConf, Scan, TableSplit, int, boolean)
- getConstantVal(Object, PrimitiveObjectInspector, boolean)
- getNextBA(byte[])
- newIndexPredicateAnalyzer(String, String, boolean)
- getSplits(JobConf, int)
- getStorageFormatOfKey(String, String)

public RecordReader<ImmutableBytesWritable, Result> getRecordReader() constructs an inner class that implements RecordReader and returns an instance of it to the caller. During the map phase, this instance is used to carry out the scan against HBase.
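The pattern described above, returning an instance of an inner class that hides the underlying scanner, can be illustrated with a minimal sketch. The `PullReader` interface and the in-memory list of rows are hypothetical stand-ins introduced only for illustration; they are not Hadoop or HBase types, and the real method works with `ImmutableBytesWritable` and `Result`.

```java
import java.util.Iterator;
import java.util.List;

final class RecordReaderSketch {
    /** Hypothetical pull-style reader; not a Hadoop interface. */
    interface PullReader {
        /** Returns the next row, or null once the scan is exhausted. */
        String next();

        void close();
    }

    /** Builds a reader over an in-memory "scan"; callers only see the interface. */
    static PullReader getRecordReader(List<String> rows) {
        // Stands in for an HBase scanner (assumption made for this sketch).
        final Iterator<String> scanner = rows.iterator();
        // Anonymous inner class: the caller never learns the concrete type.
        return new PullReader() {
            private boolean closed = false;

            @Override
            public String next() {
                if (closed || !scanner.hasNext()) {
                    return null;
                }
                return scanner.next();
            }

            @Override
            public void close() {
                closed = true;
            }
        };
    }
}
```

The caller simply loops on `next()` until it returns null, which is roughly how a map task would drain the reader it is handed.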

private TableSplit convertFilter(
    JobConf jobConf,
    Scan scan,
    TableSplit tableSplit,
    int iKey, boolean isKeyBinary)

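This function extracts the constraints on the row key column from the WHERE clause of the HQL statement, converts them into a filter, and translates them into an HBase start key and end key to speed up the scan. The technical term for this is predicate pushdown. At this stage (Hive 0.12), only simple comparisons on the row key can be pushed down: =, <, <=, >, and >=. A != condition cannot be expressed as a single start/end key range, so it is not pushed down.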
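To make the key-range idea concrete, here is a minimal sketch of turning one comparison on the row key into a [startRow, stopRow) pair. It assumes HBase's usual convention that the start row is inclusive and the stop row is exclusive, and it assumes `getNextBA` produces the next byte array by appending a zero byte. The class name `RowKeyRangeSketch` and the method `toRange` are hypothetical; this is not the body of `convertFilter`.

```java
import java.util.Arrays;

final class RowKeyRangeSketch {
    /** Smallest byte array that sorts strictly after `current`: append a 0x00 byte. */
    static byte[] getNextBA(byte[] current) {
        // Arrays.copyOf pads the extra position with zero.
        return Arrays.copyOf(current, current.length + 1);
    }

    /**
     * Returns {startRow, stopRow} for "rowkey op constant".
     * An empty array means "unbounded" on that side.
     */
    static byte[][] toRange(String op, byte[] constant) {
        byte[] start = new byte[0];
        byte[] stop = new byte[0];
        switch (op) {
            case "=":
                start = constant;
                stop = getNextBA(constant); // exclusive stop just past the key
                break;
            case ">":
                start = getNextBA(constant); // skip the key itself
                break;
            case ">=":
                start = constant;
                break;
            case "<":
                stop = constant;
                break;
            case "<=":
                stop = getNextBA(constant); // include the key itself
                break;
            default:
                // Anything that is not one contiguous range is rejected here.
                throw new IllegalArgumentException(op + " cannot be turned into a key range");
        }
        return new byte[][] {start, stop};
    }
}
```

For example, `toRange("=", key)` yields a range that starts at `key` and stops at `key` followed by a zero byte, so the scan touches exactly one row.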

static IndexPredicateAnalyzer newIndexPredicateAnalyzer(
    String keyColumnName, String keyColType, boolean isKeyBinary)

Creates an IndexPredicateAnalyzer. The IndexPredicateAnalyzer handles Hive's predicate pushdown: it converts Hive filter conditions into the startRow/endRow form that HBase understands.
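The role of such an analyzer can be sketched as follows: it is configured with the key column and a set of allowed operators, then splits a list of conditions into those that can be pushed down and a residual that Hive must still evaluate itself. Everything here is a simplified model, not Hive's `IndexPredicateAnalyzer` API: `PredicateAnalyzerSketch`, `Condition`, and `analyze` are hypothetical names. The rule that range operators are enabled only for binary or string keys is an assumption of this sketch, motivated by the fact that byte-wise ordering must agree with value ordering for a range scan to be correct.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PredicateAnalyzerSketch {
    /** One "column op constant" term of a WHERE clause (simplified, hypothetical). */
    static final class Condition {
        final String column;
        final String op;
        final String constant;

        Condition(String column, String op, String constant) {
            this.column = column;
            this.op = op;
            this.constant = constant;
        }
    }

    private final Set<String> allowedOps = new HashSet<>();
    private final String keyColumn;

    PredicateAnalyzerSketch(String keyColumn, String keyColType, boolean isKeyBinary) {
        this.keyColumn = keyColumn;
        // Equality only needs the exact bytes of the constant.
        allowedOps.add("=");
        // Assumption: range comparisons are enabled only when byte order matches value order.
        if (isKeyBinary || "string".equalsIgnoreCase(keyColType)) {
            allowedOps.addAll(Arrays.asList("<", "<=", ">", ">="));
        }
    }

    /** Adds pushable conditions to `pushed` and returns the residual ones. */
    List<Condition> analyze(List<Condition> where, List<Condition> pushed) {
        List<Condition> residual = new ArrayList<>();
        for (Condition c : where) {
            if (c.column.equals(keyColumn) && allowedOps.contains(c.op)) {
                pushed.add(c);
            } else {
                residual.add(c);
            }
        }
        return residual;
    }
}
```

With a non-binary `int` key, only the equality condition on the key column ends up in `pushed`; a range condition on the key, or any condition on another column, stays in the residual.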

public InputSplit[] getSplits(JobConf jobConf, int numSplits) throws IOException

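getSplits is a key function. It runs before the map phase and partitions the HBase scan into splits. The startRow and endRow of each split are assigned mainly from the row key constraints in the query; if the table is divided into multiple regions, the startRow/endRow range is distributed across the corresponding regions.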
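The way a scan range is spread over regions can be sketched as an interval intersection: each region covers a half-open key range, and the split for that region is the overlap between the region and the scan. This is a simplified model under stated assumptions, not the actual `getSplits` code. Row keys are modeled as `String` values compared lexicographically (real HBase keys are byte arrays), an empty string means "unbounded", and `SplitSketch` and `Range` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {
    /** Half-open key range [start, stop); "" means unbounded on that side. */
    static final class Range {
        final String start;
        final String stop;

        Range(String start, String stop) {
            this.start = start;
            this.stop = stop;
        }
    }

    /** Returns one clipped range per region that overlaps the scan range. */
    static List<Range> getSplits(List<Range> regions, Range scan) {
        List<Range> splits = new ArrayList<>();
        for (Range region : regions) {
            // Larger of the two starts; "" sorts before every other key.
            String start = region.start.compareTo(scan.start) >= 0 ? region.start : scan.start;
            // Smaller of the two stops, treating "" as plus infinity.
            String stop;
            if (region.stop.isEmpty()) {
                stop = scan.stop;
            } else if (scan.stop.isEmpty()) {
                stop = region.stop;
            } else {
                stop = region.stop.compareTo(scan.stop) <= 0 ? region.stop : scan.stop;
            }
            // Keep only non-empty intersections.
            if (stop.isEmpty() || start.compareTo(stop) < 0) {
                splits.add(new Range(start, stop));
            }
        }
        return splits;
    }
}
```

With three regions ["", "g"), ["g", "p"), ["p", "") and a scan over ["c", "k"), the sketch yields two splits, ["c", "g") and ["g", "k"), and skips the third region entirely.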

private boolean getStorageFormatOfKey(String spec, String defaultFormat) throws IOException

Determines the storage format of the column: string or binary.
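A minimal sketch of this kind of check is shown below. It assumes the column mapping spec uses the `#` suffix convention from Hive's HBase column mapping (for example `:key#b` for binary, `:key#s` for string, and `:key#-` or no suffix for the table-level default) and that a prefix of "string" or "binary" is accepted. The class and method names are hypothetical, and the sketch throws `IllegalArgumentException` instead of `IOException` to stay self-contained.

```java
final class StorageFormatSketch {
    /** Returns true if the key is stored as binary, false if stored as a string. */
    static boolean isBinaryKey(String spec, String defaultFormat) {
        boolean tableDefault = "binary".equalsIgnoreCase(defaultFormat);
        String[] parts = spec.split("#");
        if (parts.length == 1) {
            // No storage suffix: fall back to the table-level default.
            return tableDefault;
        }
        if (parts.length == 2) {
            String type = parts[1];
            if (type.equals("-")) {
                return tableDefault;
            }
            // Accept any non-empty prefix, e.g. "s" or "str" for string.
            if (!type.isEmpty() && "string".startsWith(type)) {
                return false;
            }
            if (!type.isEmpty() && "binary".startsWith(type)) {
                return true;
            }
        }
        throw new IllegalArgumentException("Malformed mapping spec: " + spec);
    }
}
```

The boolean result is what decides whether constants in a pushed-down predicate are converted to raw bytes or to their string representation before being compared with row keys.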
