Partition
Partition stored in table layout as char. It can be queried with command describe table, It only store the partition information instead of any pysical data.
1. static partition table
create table if not exists hive1.test1
(id int,name string,tel string)
partitioned by(age int)
row format delimited fields
terminated by ','
stored as textfile;
insert (overwrite) into table hive1.test1 partition(age='25');
select id,name,tel from hive1.test1;
2. dynamic partition table
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table hive1.test1 partition(age);
select id,name,tel,age from hive1.test1;
Bucket
CREATE TABLE hive1.test2 (id INT, name STRING) CLUSTERED BY (id) INTO 4 BUCKETS;
For each table or partition table, hive can orgnize it into bucket further. hash key dividing bucket amounts decide which bucket the data stored in.
Reason for orgnizing table into bucket
- It can provide more efficient query. Hive can handle the query with the specially structure provided by bucket. Hive can use Map-side-join to achieve high effciency, for example, it can be joined with same hash key between two tables while we have buckets on them.
- It brings better performance while we fetch a sample data
Index
index can bring a faster query speed while using group by
create index employees_index on table employees(country)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
with deferred rebuild
in table employees_index_table;
Different between bucket, partition and index
1. Index and partition
Index won’t split files, partition will split files.
Index is one thing that using extra disk space exchanged query cost.
Partition will split one big table into different files.
2. Partition and bucket
Bucket will split table randomly but can spread the data on average.
Partition will split table with specific keys so it could cause the data skew。
Bucket will split table into different files but in only one folder. Partition will split table into different folders.