CREATE [EXTERNAL] TABLE <table_name>
(<col_name> <data_type> [, <col_name> <data_type> ...])
[PARTITIONED BY ...]
CLUSTERED BY (<col_name>)
[SORTED BY (<col_name> [ASC|DESC] [, <col_name> [ASC|DESC]...])]
INTO <num_buckets> BUCKETS
[ROW FORMAT <row_format>]
[STORED AS TEXTFILE|ORC|CSVFILE]
[LOCATION '<file_path>']
[TBLPROPERTIES ('<property_name>'='<property_value>', ...)];
Concrete example
The bucketing clause is CLUSTERED BY (<col_name>).
Only one column name can be specified; rows are assigned to buckets according to the hash of that column's value.
create table users(id int, name string) clustered by (id) into 3 buckets
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;

A bucketed table in Hive is declared with the CLUSTERED BY keyword and distributes rows into buckets by the hash of the specified column. When writing data, loading files directly with LOAD DATA LOCAL does not bucket them automatically. To keep the data consistent with the table's bucketing definition, either enable enforce bucketing or set the number of reducers equal to the number of buckets. If the bucketed table also declares a sort key, the data must be both bucketed and sorted. In Inceptor's ORC transactional tables, bucketing is mandatory, and each bucket file is recommended to be between 100 MB and 200 MB.
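Since LOAD DATA does not bucket the rows, a bucketed table is usually populated with INSERT ... SELECT from a staging table. Below is a minimal sketch for the users table above, assuming a staging table named users_staging with the same id and name columns (the staging table name is hypothetical): either enable enforce bucketing and let Hive plan the shuffle, or set the reducer count to the bucket count and distribute the rows yourself with CLUSTER BY.

-- Option 1: let Hive bucket the output automatically.
-- (In Hive 2.x and later this setting is always on and has been removed.)
set hive.enforce.bucketing = true;
insert overwrite table users
select id, name from users_staging;   -- users_staging is an assumed staging table

-- Option 2: set the number of reducers to the number of buckets
-- and distribute the rows by the bucketing column explicitly.
set mapred.reduce.tasks = 3;
insert overwrite table users
select id, name from users_staging
cluster by id;

If the table also carries a SORTED BY clause, hive.enforce.sorting (likewise always on in Hive 2.x and later) plays the same role for the sort order within each bucket.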