Hive Indexes

Hive does support indexes, but they are rarely used in practice; I have only experimented with them.
If nobody uses them, they must have drawbacks.

How Hive's index mechanism works:

Creating an index on a column produces an index table (a physical Hive table) whose columns are: the value of the indexed column, the HDFS file path corresponding to that value, and the offsets of that value within the file;

When a query filters on the indexed column, Hive first runs an extra MR job that applies the filter condition to the index table, extracts the HDFS file paths and offsets matching the indexed values, and writes them to a file on HDFS. Based on those paths and offsets it then filters the original input files and generates new splits, which serve as the splits of the whole job. This avoids a full table scan.
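The two-phase idea above can be sketched as a toy model (pure illustration, not Hive source; the index rows, paths, and offsets below are made up):

```python
# Toy stand-in for the index table: (key, _bucketname, _offsets) rows.
INDEX_TABLE = [
    ("k1", "hdfs://nn/warehouse/lxw1234/part-00000", [0, 4096]),
    ("k2", "hdfs://nn/warehouse/lxw1234/part-00001", [8192]),
]

def prune_splits(index_rows, predicate):
    """Phase 1: filter the index table with the query's predicate.
    Phase 2: the surviving (file, offset) pairs become the only
    splits the main job scans -- no full table scan."""
    splits = []
    for key, bucketname, offsets in index_rows:
        if predicate(key):
            splits.extend((bucketname, off) for off in offsets)
    return splits

print(prune_splits(INDEX_TABLE, lambda k: k == "k2"))
# -> [('hdfs://nn/warehouse/lxw1234/part-00001', 8192)]
```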

 

Building a Hive index:

Create the index:

 

 
  create index lxw1234_index on table lxw1234(key)
  as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
  with deferred rebuild;

Hive then creates an index table, which is itself a physical table:

(screenshot: the generated index table, with columns key, _bucketname, _offsets)

 

In the index table, the key column holds the values of key in the original table; _bucketname is the HDFS path of the data file that contains the value; and _offsets records the offsets of the value within that file. Since a value may occur at multiple offsets, this column is an array.

In effect, the index table is a summary table over the indexed column of the original table.

Populate the index

 
  alter index lxw1234_index on lxw1234 rebuild;

This runs an MR job that takes table lxw1234 as input and writes, for every value of the indexed column key, the corresponding HDFS file and offsets into the index table.
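Logically (a sketch of the idea, not the exact statement Hive generates internally), the rebuild amounts to an aggregation over Hive's virtual columns INPUT__FILE__NAME and BLOCK__OFFSET__INSIDE__FILE:

```sql
insert overwrite table default__lxw1234_lxw1234_index__
select key, INPUT__FILE__NAME, collect_set(BLOCK__OFFSET__INSIDE__FILE)
from lxw1234
group by key, INPUT__FILE__NAME;
```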

Using the index automatically

 
  SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
  SET hive.optimize.index.filter=true;
  SET hive.optimize.index.filter.compact.minsize=0;

How the index takes effect at query time:

 
  select * from lxw1234 where key = '13400000144_1387531071_460606566970889';
  • First, a job filters the index table for key = '13400000144_1387531071_460606566970889' and writes the matching HDFS file paths and offsets to a temporary file on HDFS
  • The next job takes that temporary file as input and, from the HDFS paths and offsets it contains, generates new splits, which become the map-task input of the query job
  • Without the index, as shown below:
    (diagram: one map task per split, scanning the whole table)

    • Every split of table lxw1234 is scanned by its own map task, yet only split2 actually contains the rows we want; map task1 and map task3 waste resources.
  • With the index, as shown below:
    (diagram: an index lookup followed by a single map task)

    • After the query is submitted, an MR job first scans the index table, finds the records with key='xx', and obtains the HDFS file names and offsets;
    • The query then seeks directly to those offsets in the file, and a single map task completes the scan. The whole point is to reduce the query's input size.

Using the index manually

    • This simply performs the index-table filtering step by hand: the filtered data is loaded into a temporary HDFS file, which the query job then uses
 
  SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

  Insert overwrite directory "/tmp/lxw1234_index_data"
  select `_bucketname`, `_offsets`
  from default__lxw1234_lxw1234_index__
  where key = '13400000144_1387531071_460606566970889';

  ## specify the index data file
  SET hive.index.compact.file=/tmp/lxw1234_index_data;
  SET hive.optimize.index.filter=false;
  SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

  select * from lxw1234
  where key = '13400000144_1387531071_460606566970889';

As the steps above show, using a Hive index is fairly cumbersome:

  • Every query first requires an extra job to scan the index table; and if the values of the indexed column are very sparse (highly distinct), the index table itself becomes very large;

  • The index table is not rebuilt automatically: whenever rows are added to or deleted from the base table, the index must be rebuilt by hand;


= Bitmap Indexing =

Introduction

This document explains the proposed design for adding a bitmap index handler (https://issues.apache.org/jira/browse/HIVE-1803).
Bitmap indexing (http://en.wikipedia.org/wiki/Bitmap_index) is a standard technique for indexing columns with few distinct 
values, such as gender.

Approach

We want to develop a bitmap index that can reuse as much of the existing Compact Index code as possible.

Proposal

First implementation

This implementation confers some of the benefits of bitmap indexing and should be easy to implement given the already existing compact index, but it does few of the optimizations such as compression that a really good bitmap index should do.

Like the compact index, this implementation uses an index table. The index table on a column "key" has four or more columns: first, the columns being indexed, then _bucketname, _offset, and _bitmaps. _bucketname is a string pointing to the Hadoop file that stores this block of the table, _offset is the file offset of the block, and _bitmaps is an uncompressed bitmap encoding (an array of bytes) for this combination of column value, bucketname, and block offset. Each bit in the bitmap corresponds to one row in the block. The bit is 1 if that row holds the values in the columns being indexed, and 0 if not. If a key value does not appear in a block at all, no entry is stored for it.
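A toy model of one such entry (an illustration of the description above, not Hive code; the path and block contents are hypothetical):

```python
def build_bitmap(block_rows, key, rows_per_block):
    """One bit per row in the block: 1 where the row carries `key`."""
    bits = [0] * rows_per_block
    for row, value in enumerate(block_rows):
        if value == key:
            bits[row] = 1
    return bits

block = ["a", "b", "a", "c", "c", "c", "c", "c"]   # an 8-row block
entry = {
    "key": "a",
    "_bucketname": "hdfs://nn/warehouse/t/part-00000",  # hypothetical path
    "_offset": 0,
    "_bitmaps": build_bitmap(block, "a", len(block)),
}
print(entry["_bitmaps"])   # -> [1, 0, 1, 0, 0, 0, 0, 0]
```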

When querying this index, if there are boolean AND or OR operations done on the predicates with bitmap indexes, we can use bitwise operations to try to eliminate blocks as well. We can then eliminate blocks that do not contain the value combinations we are interested in. We can use this data to generate the filename, array of block offsets format that the compact index handler uses and reuse that in the bitmap index query.
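The block-elimination step can be illustrated like this (a sketch of the idea only, not the Hive implementation; bitmaps are shown as 0/1 lists for a single block):

```python
def combine(bm_a, bm_b, op):
    """Bitwise-combine two per-block bitmaps for an AND/OR predicate."""
    if op == "AND":
        return [x & y for x, y in zip(bm_a, bm_b)]
    return [x | y for x, y in zip(bm_a, bm_b)]

def block_survives(bitmap):
    """A block whose combined bitmap is all zeros cannot match; skip it."""
    return any(bitmap)

bm_key1 = [1, 0, 1, 0]   # rows matching the first predicate in this block
bm_key2 = [0, 1, 0, 0]   # rows matching the second predicate in this block
print(block_survives(combine(bm_key1, bm_key2, "AND")))  # -> False (block skipped)
print(block_survives(combine(bm_key1, bm_key2, "OR")))   # -> True (block kept)
```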

Second iteration

The basic implementation's only compression is eliminating blocks where all rows are 0s. This is unlikely to happen for larger blocks, so we need a better compression format. We can use byte-aligned bitmap compression: the bitmap is an array of bytes, and a byte of all 1s or all 0s stands for one or more consecutive bytes of that value. We would then just need to add another column to the bitmap index table, an array of Ints describing how long the runs are, plus logic to expand the compression.
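A sketch of such a byte-aligned run-length scheme (an illustrative encoding consistent with the description above, not the actual on-disk format):

```python
def compress(bitmap_bytes):
    """Collapse runs of 0x00/0xFF to one byte; record run lengths in `gaps`."""
    out, gaps = [], []
    i = 0
    while i < len(bitmap_bytes):
        b = bitmap_bytes[i]
        if b in (0x00, 0xFF):
            run = 1
            while i + run < len(bitmap_bytes) and bitmap_bytes[i + run] == b:
                run += 1
            out.append(b)
            gaps.append(run)
            i += run
        else:
            out.append(b)
            i += 1
    return bytes(out), gaps

def expand(compressed, gaps):
    """Inverse of compress: re-expand each 0x00/0xFF byte to its run."""
    out, g = [], iter(gaps)
    for b in compressed:
        out.extend([b] * (next(g) if b in (0x00, 0xFF) else 1))
    return bytes(out)

raw = bytes([0x00, 0x00, 0x00, 0x24, 0xFF, 0xFF])
comp, gaps = compress(raw)
print(comp.hex(), gaps)   # -> 0024ff [3, 2]
```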

Example

Suppose we have a bitmap index on a key where, on the first block, value "a" appears in rows 5, 12, and 64, and value "b" appears in rows 7, 8, and 9. Then, for the preliminary implementation, the first entry in the index table will be:

https://issues.apache.org/jira/secure/attachment/12460083/bitmap_index_1.png

The values in the array represent the bitmap for each block, where each 32-bit BigInt value stores 32 rows.
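The example can be reproduced numerically (the LSB-first bit order inside each 32-bit word is an assumption made for illustration):

```python
def rows_to_words(rows, num_rows, word_bits=32):
    """Pack row numbers into fixed-width bitmap words, LSB first."""
    words = [0] * ((num_rows + word_bits - 1) // word_bits)
    for r in rows:
        words[r // word_bits] |= 1 << (r % word_bits)
    return words

# value "a" in rows 5, 12, and 64 of a 128-row block:
print([hex(w) for w in rows_to_words([5, 12, 64], 128)])
# -> ['0x1020', '0x0', '0x1', '0x0']
# value "b" in rows 7, 8, and 9:
print(hex(rows_to_words([7, 8, 9], 128)[0]))   # -> 0x380
```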

For the second iteration, the first entry will be:

https://issues.apache.org/jira/secure/attachment/12460124/bitmap_index_2.png

This one uses 1-byte array entries, so each value in the array stores 8 rows. If an entry is 0x00 or 0xFF, it represents one or more consecutive bytes of all zeros or all ones (in this case 5 and 4, respectively).

