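When logical partitions would be too numerous and too fine-grained to be practical, building an index becomes an alternative to partitioning. An index can help prune some of a table's data blocks, which reduces the amount of input data a MapReduce job has to read.
Creating an Index
First, create an employees table: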
hive> create table employees(
name string,
salary float,
subordinates array<string>,
address struct<street:string,city:string,state:string,zip:int>
)
partitioned by (country string,state string);
Next, we build an index on the partition column country only:
hive> create index employees_index
on table employees(country)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
with deferred rebuild
idxproperties ('creator'='me','created_at'='2017-2-13')
in table employees_index_table
partitioned by (country,name)
comment 'Employees indexed by country and name.';
The as ... clause specifies the index handler, that is, a Java class that implements the index interface.
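As a minimal sketch of the same idea, Hive also accepts the short name 'COMPACT' in place of the full handler class name. The index name employees_country_idx below is hypothetical and chosen only so it does not clash with the index created above; this assumes a Hive release before 3.0, where index support is still present.

```sql
-- Hypothetical index name; 'COMPACT' is shorthand for
-- org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler.
create index employees_country_idx
on table employees (country)
as 'COMPACT'
with deferred rebuild;
```

Because of with deferred rebuild, the index starts out empty and holds no data until it is rebuilt.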
Bitmap Indexes
Bitmap indexes are commonly used for columns with few distinct values:
hive> create index employees_index
on table employees (country)
as 'Bitmap'
with deferred rebuild
idxproperties ('creator'='me','created_at'='2017-02-13')
in table employees_index_table
partitioned by (country,name)
comment 'Employees indexed by country and name.';
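One rough way to judge whether a column is a candidate for a bitmap index is to count its distinct values first. The query below is only a sketch of that check against the employees table defined earlier; what counts as "few" is a judgment call and depends on the data.

```sql
-- Count the distinct values of the candidate column.
-- A small result relative to the row count suggests low cardinality.
select count(distinct country) as distinct_countries,
       count(*) as total_rows
from employees;
```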
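Rebuilding an Index
Use alter index to rebuild an index. The rebuild is atomic: if it fails, the index is left in the state it was in before the rebuild started: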
hive> alter index employees_index
on employees
partition (country='US')
rebuild;
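The following sketch shows two related steps under stated assumptions: rebuilding the index for all partitions by leaving out the partition clause, and then asking the optimizer to consider indexes when filtering. It is written in the `alter index ... on table_name ... rebuild` form from the Hive language manual. The two `set` lines use Hive configuration properties; whether the index is actually used for a given query depends on the Hive version and the data size, so treat this as an illustration and not a guaranteed plan.

```sql
-- Without a partition clause, the index is rebuilt for every partition.
alter index employees_index
on employees
rebuild;

-- Allow the optimizer to use indexes automatically (off by default).
set hive.optimize.index.filter=true;
-- Lower the minimum input size for compact indexes so small tables qualify.
set hive.optimize.index.filter.compact.minsize=0;

-- A filter on the indexed column is the kind of query that can benefit.
select name, salary
from employees
where country = 'US';
```

Hive does not rebuild an index automatically when the underlying table or partition changes, so the rebuild has to be rerun after new data is loaded.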
Showing an Index
hive> show formatted index on employees;
Dropping an Index
hive> drop index if exists employees_index on employees;
My original post on OSChina:
https://my.oschina.net/lonelycode/blog/837420