Bucketed Tables in Hive

北京小峻

Article link: https://blog.csdn.net/weixin_45896475/article/details/105837774


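Partitioning acts on the storage path: each partition is a separate directory.
Bucketing acts on the data files: the rows of a table are split into a fixed number of files by hashing a column.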

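Steps
Create a regular table;
Enable bucketing;
Create a bucketed table;
Load the data from the regular table into the bucketed table with insert ... select;
Purpose
Speed up queries that sample or join on the bucketing column, and save underlying resources
Example
Create a regular table and load data into it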

create table stu(
name  string,
course  string,
grade  int
)
row format delimited fields terminated by " ";
load data local inpath "/root/student.txt" into table stu;

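Enable bucketing

-- Enforce bucketing on insert (needed on Hive 0.x and 1.x; Hive 2.x always enforces it)
set hive.enforce.bucketing=true;
-- Let Hive decide the number of reducers (-1 is the default)
set mapreduce.job.reduces=-1;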

Create a bucketed table

create table bu(
name  string,
course  string,
grade  int
)
clustered by (name)
into 4 buckets
row format delimited fields terminated by " ";
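
To make `clustered by (name) into 4 buckets` more concrete, here is a minimal Python sketch of how rows could be assigned to buckets. It assumes the usual scheme of hashing the bucketing column and taking the result modulo the bucket count. The hash used below (`zlib.crc32`) is only a stand-in chosen for this illustration, not Hive's own hash function, so the bucket numbers will not match what Hive produces.

```python
import zlib

NUM_BUCKETS = 4  # matches "into 4 buckets" in the table definition


def bucket_for(name: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Return the 0-based bucket index for a value of the bucketing column.

    Assumption: zlib.crc32 is a placeholder hash for illustration only.
    """
    h = zlib.crc32(name.encode("utf-8"))  # non-negative 32-bit integer
    return h % num_buckets


# The same rows that are loaded into stu in this article.
rows = [
    ("张三", "数学", 75),
    ("张三", "语文", 81),
    ("王五", "英语", 90),
    ("王五", "数学", 100),
    ("王五", "语文", 81),
    ("李四", "数学", 90),
    ("李四", "语文", 76),
]

# Each bucket corresponds to one data file under the table directory.
buckets = {i: [] for i in range(NUM_BUCKETS)}
for row in rows:
    buckets[bucket_for(row[0])].append(row)

for index, contents in buckets.items():
    print(index, contents)
```

The point to take away is that all rows with the same `name` always land in the same bucket, whichever hash is used.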

Command to check the number of buckets from the client (see the Num Buckets field)

desc formatted bu;

Load data into the bucketed table

insert into table bu
select * from stu;

Querying the table from the client shows no visible difference

0: jdbc:hive2://doit01:10000> select * from bu;
+----------+------------+-----------+
| bu.name  | bu.course  | bu.grade  |
+----------+------------+-----------+
| 张三       | 数学         | 75        |
| 张三       | 语文         | 81        |
| 王五       | 英语         | 90        |
| 王五       | 数学         | 100       |
| 王五       | 语文         | 81        |
| 李四       | 数学         | 90        |
| 李四       | 语文         | 76        |
+----------+------------+-----------+

The real change is visible in the HDFS web UI on port 50070
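The data has been split into 4 buckets, one file per bucket. When we later join on the name column with another table bucketed the same way, Hive only needs to match corresponding buckets against each other, which saves resources.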
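
The following Python sketch illustrates why bucketing on the join column can save work: if both sides are bucketed by `name` with the same bucket count, bucket i of one table only has to be compared with bucket i of the other. This is a simplified model, not Hive's implementation. The hash (`zlib.crc32`) is a placeholder, and the `classes` table is hypothetical, invented only for this example.

```python
import zlib
from collections import defaultdict


def bucket_for(key: str, num_buckets: int) -> int:
    # Placeholder hash for illustration; Hive uses its own hash function.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets):
    """Group rows by the bucket of their first column (the join key)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[0], num_buckets)].append(row)
    return buckets


def bucket_join(left, right, num_buckets=4):
    """Inner join on the first column, processing one bucket pair at a time."""
    left_buckets = bucketize(left, num_buckets)
    right_buckets = bucketize(right, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Only this bucket of the right side is held in the lookup table.
        lookup = defaultdict(list)
        for r in right_buckets.get(b, []):
            lookup[r[0]].append(r)
        for l in left_buckets.get(b, []):
            for r in lookup.get(l[0], []):
                joined.append(l + r[1:])
    return joined


grades = [
    ("张三", "数学", 75),
    ("张三", "语文", 81),
    ("王五", "英语", 90),
    ("李四", "数学", 90),
]
# Hypothetical second table: (name, class)
classes = [("张三", "class-1"), ("李四", "class-2")]

result = bucket_join(grades, classes)
for row in result:
    print(row)
```

Because matching keys are guaranteed to share a bucket index, each step only needs a fraction of the smaller table in memory instead of all of it.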

Sampling (this is not limited to bucketed tables; regular tables can be sampled too)

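-- Hash name into 8 buckets and return only the rows in the 3rd one (about 1/8 of the data when keys are evenly distributed)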
select * from bu tablesample(bucket 3 out of 8 on name);
0: jdbc:hive2://doit01:10000> select * from bu tablesample(bucket 1 out of 4 on name);
+----------+------------+-----------+
| bu.name  | bu.course  | bu.grade  |
+----------+------------+-----------+
| 张三       | 数学         | 75        |
| 张三       | 语文         | 81        |
+----------+------------+-----------+
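
As a rough model of what `tablesample(bucket x out of y on name)` selects, the Python sketch below keeps the rows whose hashed `name` falls into bucket number x of y (1-based). It is an approximation for illustration only: `zlib.crc32` replaces Hive's own hash, so the names that end up in a given bucket here will differ from the output above.

```python
import zlib


def tablesample(rows, x, y, key=lambda row: row[0]):
    """Return the rows whose key hashes into bucket x (1-based) out of y."""
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    # Placeholder hash for illustration; Hive uses its own hash function.
    return [
        row for row in rows
        if zlib.crc32(key(row).encode("utf-8")) % y == x - 1
    ]


rows = [
    ("张三", "数学", 75),
    ("张三", "语文", 81),
    ("王五", "英语", 90),
    ("王五", "数学", 100),
    ("王五", "语文", 81),
    ("李四", "数学", 90),
    ("李四", "语文", 76),
]

# Analogous to: select * from bu tablesample(bucket 1 out of 4 on name);
for row in tablesample(rows, 1, 4):
    print(row)
```

Sampling on `name` returns either all rows of a given name or none of them, which is why the query above returns both rows for one student rather than scattered rows.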