Hive只有有限的索引功能,没有普通关系型数据库中键的概念,但是可以对一些字段来建立索引,加速某些操作,一张表的索引数据存储在另外一张表中。维护索引需要额外的存储空间,同时创建索引也需要消耗计算量。
1、创建索引:
CREATE TABLE employees(
name string,
salary float,
subordinates ARRAY<string>,
deductions MAP<string,float>,
address struct<street:STRING, city:STRING, state:STRING, zip:INT>
)
partition by(country string,state string);
#对上表的分区字段country建立索引
CREATE TABLE employees_index
on table employees(country)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH deferred rebuild
idxproperties('create ='me','create_at'='some_time')
in table employees_index_table
partition by(country,name)
comment 'employees indexed by country and name.';
上面这种情况,我们所创建的索引表没有使用同原表一样的分区划分。如果省略partition by语句的话,那么索引将会包含原始表所有分区。
AS…语句指定了索引处理器,这里的CompactIndexHandler是其中的一个实现。
WITH deferred rebuild
2、Bitmap索引
bitmap索引普遍用于重排后值较小的列,下面是对前面例子使用bitmap索引处理器重写后的语句:
Create index employees_index
on table employees(country)
AS 'BITMAP'
with deferred rebuild
idxproperties('create ='me','create_at'='some_time')
in table employees_index_table
partition by(country,name)
comment 'employees indexed by country and name.';
3、重建索引
ALTER INDEX employees_index
on TABLE employees
Partition(country='US')
rebuild;
4、显示索引
show formatted index on employees;
5、删除索引
DROP INDEX IF EXISTS employees_index on TABLE employees;