Using Partitioned Tables and Buckets in Hive

zhangbaoming815

Category: hadoop. Tags: hive partitioned tables, hive buckets, using hive, hive

Article link: https://blog.csdn.net/zhangbaoming815/article/details/84248807

Using partitioned tables in Hive:

1. Create a partitioned table with ds as the partition column:

create table invites (id int, name string) partitioned by (ds string) row format delimited fields terminated by '\t' stored as textfile;

2. Load data into the partition for the date 2012-10-12:

load data local inpath '/home/hadoop/Desktop/data.txt' overwrite into table invites partition (ds='2012-10-12');

3. Load data into the partition for the date 2012-10-20:

load data local inpath '/home/hadoop/Desktop/data.txt' overwrite into table invites partition (ds='2012-10-20');
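
To confirm that both loads created their partitions, one option is to list them. This is a small sketch that assumes the two load statements above succeeded; it is not part of the original steps.

```sql
-- List every partition currently registered for the invites table.
-- Each partition is reported as a ds=<value> entry.
show partitions invites;
```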

4. Query data from a single partition:

select * from invites where ds ='2012-10-12';
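
Because ds is the partition column, a filter on it lets Hive read only the matching partition directories instead of the whole table. As an illustrative sketch (not from the original post), the query below counts the rows in each of the partitions loaded earlier:

```sql
-- Count rows per partition between the two dates loaded above.
-- ds is a string, so the comparison is lexicographic,
-- which orders yyyy-MM-dd values correctly.
select ds, count(*) as row_count
from invites
where ds >= '2012-10-12' and ds <= '2012-10-20'
group by ds;
```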

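5. Insert query results into a specific partition of a partitioned table (insert overwrite replaces whatever the partition already holds):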

insert overwrite table invites partition (ds='2012-10-12') select id,max(name) from test group by id;
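
The statement above names the target partition explicitly. If the source rows carry their own date, Hive's dynamic partitioning can route each row to its partition instead. The sketch below assumes a hypothetical source table test_with_ds with columns id, name and ds; that table is an illustration only and does not appear in the original example.

```sql
-- Allow Hive to create partitions from the query output.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- test_with_ds is a hypothetical table (id int, name string, ds string).
-- The partition column must come last in the select list.
insert overwrite table invites partition (ds)
select id, name, ds from test_with_ds;
```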

To inspect the partitions on disk, use the command:

hadoop fs -ls /home/hadoop/hive/warehouse/invites
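
Each partition is stored as its own subdirectory named after the partition column and its value. As a sketch, assuming the warehouse path /home/hadoop/hive/warehouse that the bucket examples in this article use, a single partition can also be listed from inside the Hive CLI:

```sql
-- Run inside the Hive CLI; lists the files of one partition of invites.
-- The warehouse path is an assumption and depends on
-- hive.metastore.warehouse.dir in your installation.
dfs -ls /home/hadoop/hive/warehouse/invites/ds=2012-10-12;
```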

To view the results from Eclipse, Hadoop must also be running; start it with start-all.sh.

Using buckets in Hive:

1. Create a bucketed table:

create table bucketed_user(id int,name string) clustered by (id) sorted by(name) into 4 buckets row format delimited fields terminated by '\t' stored as textfile;
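
To check that the bucketing definition was recorded, the table metadata can be inspected. A minimal sketch; the exact output layout varies by Hive version.

```sql
-- Show detailed table metadata, including the number of buckets,
-- the bucket columns and the sort columns.
describe formatted bucketed_user;
```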

2. Enforce bucketing so that the insert runs with multiple reducers, one per bucket:

set hive.enforce.bucketing=true;

3. Insert data into the table:

insert overwrite table bucketed_user select * from test;
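
When hive.enforce.bucketing is not used, a commonly documented alternative is to set the reducer count to the bucket count and distribute the rows explicitly. The sketch below assumes that test has exactly the columns id and name, as the select * above implies.

```sql
-- One reducer per bucket (the table was declared with 4 buckets).
set mapred.reduce.tasks=4;

-- distribute by sends rows with the same id hash to the same reducer;
-- sort by orders the rows by name within each reducer's output file.
insert overwrite table bucketed_user
select id, name from test
distribute by id sort by name;
```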

4. List the table's directory; it now contains four files, one per bucket:

dfs -ls /home/hadoop/hive/warehouse/bucketed_user;

5. Read the data to see what each file contains:

dfs -cat /home/hadoop/hive/warehouse/bucketed_user/000000_0;

Rows are assigned to buckets by hashing the bucketing column, so the files may not hold equal numbers of rows.
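
To see how unevenly the rows might be spread, the bucket assignment can be approximated in a query. This is only a sketch: it assumes that Hive's built-in hash() and pmod() functions mirror the bucket assignment for an int column with 4 buckets, which may not hold on Hive versions that use a different bucketing hash.

```sql
-- Approximate the bucket each source row would land in,
-- then count the rows per bucket.
select pmod(hash(id), 4) as bucket, count(*) as row_count
from test
group by pmod(hash(id), 4);
```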

6. Sample data from the buckets:

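select * from bucketed_user tablesample(bucket 1 out of 4 on id);

Bucket numbers start at 1, so the query above reads the first of the four buckets, roughly one quarter of the data. Sampling on id, the column the table is clustered by, is what allows Hive to read only that bucket's file; sampling on another column such as name would force a scan of the whole table.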

7. Sample half of the buckets:

select * from bucketed_user tablesample(bucket 1 out of 2 on id);
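
Sampling does not have to follow the bucketing column. As a sketch, Hive also accepts rand() in the on clause, which samples rows at random across the whole table rather than bucket by bucket:

```sql
-- Random sample of roughly one quarter of the rows.
-- Unlike sampling on the clustering column, this scans the entire table.
select * from bucketed_user tablesample(bucket 1 out of 4 on rand());
```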
