Hive Dynamic Partitions

浮生若梦1379

Article link: https://blog.csdn.net/weixin_37796929/article/details/90236745

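Dynamic partitioning: the partition values are not fixed in advance; they are determined by the input data.

With the static partitions described earlier, you must know in advance which partition values exist and write one load data statement per partition, which is tedious. Dynamic partitioning solves this: the rows returned by a query are assigned to partitions according to their values. In other words, you do not name the partition directory yourself; Hive chooses it from the data.

Suppose Hive has a table person_par with the following contents: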

hive> select * from person_par;
OK
lily china man 2013-03-28
nancy china woman 2013-03-28
hanmei america man 2013-03-28
jan china woman 2013-03-29
mary america man 2013-03-29
lilei china man 2013-03-29
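The dynamic partition columns must come last in the select list.
Set hive.exec.dynamic.partition=true to enable dynamic partitioning (the default is false before Hive 0.9.0 and true from 0.9.0 on).
[Optional] Set hive.exec.dynamic.partition.mode=nonstrict. The default, strict, requires at least one static partition column; nonstrict allows every partition column to be dynamic. The example below keeps sex static, so strict mode is sufficient.
Enable dynamic partitioning: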
hive> set hive.exec.dynamic.partition=true;
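
To make the strict versus nonstrict rule concrete, here is a minimal Python sketch of the check. It is only a model of the rule stated above, not Hive's implementation. The function name check_partition_spec and the (column, value) representation are assumptions made for illustration; a value of None stands for a dynamic partition column.

```python
def check_partition_spec(spec, mode="strict"):
    """Validate a partition clause given as (column, value) pairs.

    A value of None marks a dynamic partition column (hypothetical
    representation chosen for this sketch).
    """
    if mode not in ("strict", "nonstrict"):
        raise ValueError(f"unknown mode: {mode!r}")
    has_static = any(value is not None for _column, value in spec)
    if mode == "strict" and not has_static:
        # strict mode: at least one partition column must be static
        raise ValueError("strict mode requires at least one static partition column")
    return True


# partition(sex='man', dt): one static column, so strict mode accepts it.
check_partition_spec([("sex", "man"), ("dt", None)], mode="strict")
# partition(sex, dt): fully dynamic, accepted only in nonstrict mode.
check_partition_spec([("sex", None), ("dt", None)], mode="nonstrict")
```

Under this model, a clause with only dynamic columns raises an error in strict mode, which is why the nonstrict setting is listed as optional.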

Create the new table person_par_dnm:
hive> create table person_par_dnm (name string, nation string) partitioned by (sex string, dt string)
> row format delimited fields terminated by ',';
OK
Time taken: 0.334 seconds

Listing the partitions at this point shows none:
hive> show partitions person_par_dnm;
OK
Time taken: 0.073 seconds

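Import the data from the old table person_par into the new table person_par_dnm. The sex partition is fixed to 'man', and the dt partition is taken from the last column of the select list. Because the query has no where clause, every row of person_par, including the woman rows, is written under sex=man; filter on the sex column if only matching rows should be copied.
hive> insert overwrite table person_par_dnm partition(sex='man', dt) select name, nation, dt from person_par;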

Listing the partitions again shows sex=man/dt=2013-03-28 and sex=man/dt=2013-03-29:
hive> show partitions person_par_dnm;
OK
sex=man/dt=2013-03-28
sex=man/dt=2013-03-29
On HDFS, the directory /user/hive/warehouse/person_par_dnm/sex=man now exists, which shows that Hive partitioned the data by date automatically.
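
The routing performed by a dynamic-partition insert can be pictured with a small Python simulation. This is only a model of the behavior described above, not Hive code. The names ROWS and route_rows are hypothetical, and the rows are copied from the person_par listing.

```python
from collections import defaultdict

# Rows of person_par as listed above: (name, nation, sex, dt).
ROWS = [
    ("lily", "china", "man", "2013-03-28"),
    ("nancy", "china", "woman", "2013-03-28"),
    ("hanmei", "america", "man", "2013-03-28"),
    ("jan", "china", "woman", "2013-03-29"),
    ("mary", "america", "man", "2013-03-29"),
    ("lilei", "china", "man", "2013-03-29"),
]


def route_rows(rows, table_dir, static_sex):
    """Group rows by the partition directory they would be written to.

    sex is fixed by the partition clause; dt is read from each row,
    which is what makes the dt partition dynamic.
    """
    partitions = defaultdict(list)
    for name, nation, _sex, dt in rows:
        path = f"{table_dir}/sex={static_sex}/dt={dt}"
        partitions[path].append((name, nation))
    return dict(partitions)


result = route_rows(ROWS, "/user/hive/warehouse/person_par_dnm", "man")
for path in sorted(result):
    print(path, [name for name, _nation in result[path]])
```

One directory is produced per distinct dt value, without the insert statement ever naming a date.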

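Bucketed Tables
Hive can organize data by hashing the value of a column, which is called bucketing; it is well suited to sampling and map-side joins. For example, when the user ID is the bucketing column, Hive hashes the value and takes the remainder after dividing by the number of buckets. Each bucket therefore holds an effectively random set of users (the assignment is deterministic, but it behaves like a random split).

Bucketing splits the data by the value of one column, hashing one large file into several smaller files.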
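
The hash-then-remainder rule can be sketched in Python as follows. This is an illustration under stated assumptions, not Hive's source code: it assumes an integer bucketing column whose hash is the value itself, with the sign bit masked off before the remainder is taken. The names bucket_for and bucket_file_name are hypothetical.

```python
def bucket_for(key: int, num_buckets: int) -> int:
    """Return the zero-based bucket index for an integer key.

    Assumption: the hash of an int column is the int itself, made
    non-negative by masking the sign bit.
    """
    return (key & 0x7FFFFFFF) % num_buckets


def bucket_file_name(bucket: int) -> str:
    """Name of the file holding a bucket, in the 000000_0 style."""
    return f"{bucket:06d}_0"


# Assign the ids 1..6 to 2 buckets.
for key in range(1, 7):
    bucket = bucket_for(key, 2)
    print(key, "->", bucket_file_name(bucket))
```

With 2 buckets, even ids land in the first file and odd ids in the second.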

Create the source table person_srt:
hive> create table person_srt (srtid int, name string, nation string, sex string, dt string)
> row format delimited fields terminated by ',';

Load the data:
hive> load data local inpath '/home/hadoop/Data/person_srt.txt' overwrite into table person_srt;

View the data:
hive> select * from person_srt;
OK
1 lily china man 2013-03-28
2 nancy china woman 2013-03-28
3 hanmei america man 2013-03-28
4 jan china woman 2013-03-29
5 mary america man 2013-03-29
6 lilei china man 2013-03-29

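Create a new bucketed table person_srt2 with these requirements:
1. Choose the column to bucket by: clustered by (srtid)
2. Sort by srtid in descending order: sorted by (srtid desc)
3. Set the number of buckets: into 2 buckets
distribute by is similar to the partition step in MapReduce: it decides which reducer each row goes to, and is normally combined with sort by.
cluster by can replace distribute by plus sort by when both use the same column (cluster by sorts in ascending order only).

hive> create table person_srt2 (srtid int, name string, nation string, sex string, dt string)
> clustered by (srtid) sorted by (srtid desc) into 2 buckets
> row format delimited fields terminated by ',';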

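Set the relevant parameters (from Hive 2.0 on, bucketing is always enforced and hive.enforce.bucketing is no longer needed):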
hive> set hive.enforce.bucketing=true;
hive> set mapreduce.job.reduces=2;

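Load the data from the old table person_srt into the bucketed table person_srt2, sorting in descending order to match the table definition:
hive> insert into table person_srt2 select srtid, name, nation, sex, dt from person_srt
> distribute by srtid sort by srtid desc;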
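
The combined effect of distribute by and sort by can be modeled in a few lines of Python. This is a sketch of the semantics only: distribute by picks the reducer for each row, and sort by orders rows within each reducer rather than globally. The function distribute_and_sort and the list PERSON_SRT are hypothetical names, and the plain remainder stands in for Hive's hash on an integer column.

```python
# (srtid, name) pairs taken from the person_srt listing above.
PERSON_SRT = [
    (1, "lily"), (2, "nancy"), (3, "hanmei"),
    (4, "jan"), (5, "mary"), (6, "lilei"),
]


def distribute_and_sort(rows, num_reducers, key, descending=False):
    """Send each row to reducer key(row) % num_reducers, then sort per reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[key(row) % num_reducers].append(row)
    # Each reducer sorts only its own rows; there is no global order.
    return [sorted(part, key=key, reverse=descending) for part in reducers]


parts = distribute_and_sort(PERSON_SRT, 2, key=lambda row: row[0], descending=True)
for index, part in enumerate(parts):
    print(f"reducer {index}:", part)
```

With 2 reducers and a descending sort, each reducer's output is ordered on its own, which is why a query over the whole table shows two descending runs rather than one.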

Query the bucketed table; the rows are in descending order within each bucket:
hive> select * from person_srt2;
OK
6 lilei china man 2013-03-29
4 jan china woman 2013-03-29
2 nancy china woman 2013-03-28
5 mary america man 2013-03-29
3 hanmei america man 2013-03-28
1 lily china man 2013-03-28

Sample the data by bucket
Fetch all users from the first of the 2 buckets:
hive> select * from person_srt2 tablesample(bucket 1 out of 2);
OK
6 lilei china man 2013-03-29
4 jan china woman 2013-03-29
2 nancy china woman 2013-03-28
Fetch all users from the second of the 2 buckets:
hive> select * from person_srt2 tablesample(bucket 2 out of 2);
OK
5 mary america man 2013-03-29
3 hanmei america man 2013-03-28
1 lily china man 2013-03-28
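
The selection made by tablesample(bucket x out of y) can be approximated with the Python sketch below. It assumes that a row is kept when the hash of its bucketing column, taken modulo y, equals x - 1, and that the hash of an integer is the integer itself. The function name tablesample_bucket and the list STORED_IDS are hypothetical, and this is a model rather than Hive's implementation.

```python
# srtid values in the order person_srt2 returns them.
STORED_IDS = [6, 4, 2, 5, 3, 1]


def tablesample_bucket(ids, x, y):
    """Return the ids kept by `bucket x out of y` (x is counted from 1)."""
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    return [value for value in ids if value % y == x - 1]


print(tablesample_bucket(STORED_IDS, 1, 2))
print(tablesample_bucket(STORED_IDS, 2, 2))
```

Under these assumptions, a larger y selects a smaller share of the rows, since only one remainder class out of y is kept.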
The bucketed table was created successfully. HDFS contains the following directory: /user/hive/warehouse/srt.db/person_srt2

View the data in the two files on HDFS:

hadoop@Master:~/Data$ hadoop fs -cat /user/hive/warehouse/srt.db/person_srt2/000000_0
6,lilei,china,man,2013-03-29
4,jan,china,woman,2013-03-29
2,nancy,china,woman,2013-03-28
hadoop@Master:~/Data$ hadoop fs -cat /user/hive/warehouse/srt.db/person_srt2/000001_0
5,mary,america,man,2013-03-29
3,hanmei,america,man,2013-03-28
1,lily,china,man,2013-03-28
