Big Data learning - HIVE - bucket, partition, index

最新推荐文章于 2021-06-10 10:04:08 发布

黄驼君

最新推荐文章于 2021-06-10 10:04:08 发布

阅读量190

点赞数

本文链接：https://blog.csdn.net/vivasugar/article/details/96130488

版权

Partition

Partition stored in table layout as char. It can be queried with command describe table, It only store the partition information instead of any pysical data.

1. static partition table

create table if not exists hive1.test1
(id int,name string,tel string) 
partitioned by(age int) 
row format delimited fields 
terminated by ',' 
stored as textfile;

insert (overwrite) into table hive1.test1 partition(age='25');

select id,name,tel from hive1.test1;

2. dynamic partition table

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table hive1.test1 partition(age);
select id,name,tel,age from hive1.test1;

Bucket

CREATE TABLE hive1.test2 (id INT, name STRING) CLUSTERED BY (id) INTO 4 BUCKETS;

For each table or partition table, hive can orgnize it into bucket further. hash key dividing bucket amounts decide which bucket the data stored in.

Reason for orgnizing table into bucket

It can provide more efficient query. Hive can handle the query with the specially structure provided by bucket. Hive can use Map-side-join to achieve high effciency, for example, it can be joined with same hash key between two tables while we have buckets on them.
It brings better performance while we fetch a sample data

Index
index can bring a faster query speed while using group by

create index employees_index on table employees(country)
as  'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
with deferred rebuild
in table employees_index_table;

Different between bucket, partition and index

1. Index and partition

Index won’t split files, partition will split files.
Index is one thing that using extra disk space exchanged query cost.
Partition will split one big table into different files.

2. Partition and bucket
Bucket will split table randomly but can spread the data on average.
Partition will split table with specific keys so it could cause the data skew。

Bucket will split table into different files but in only one folder. Partition will split table into different folders.