Getting Started with Hive Bucketed Tables

鸭梨山大哎

Article link: https://blog.csdn.net/u010711495/article/details/111869366

Overview of Bucketing

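Why Bucket?

Partitioning can leave data unevenly distributed: some partitions hold far too much data while others hold very little. Bucketing is another technique for splitting a data set into several parts (data files).
Both partitioning and bucketing manage data at a finer granularity. When a single partition or table keeps growing and partitioning can no longer divide the data finely enough, bucketing splits and manages it at a finer level. In a CREATE TABLE statement, bucketing is declared with the following clause:
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]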

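How Bucketing Works

Bucketing follows the same principle as HashPartitioner in MapReduce.

MapReduce: takes the hash of the key modulo the number of reducers.
Hive: takes the hash of the bucketing column modulo the number of buckets. A table is bucketed on a chosen column; for every record, the hash of that column's value modulo the bucket count determines which bucket the record is written to.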
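
The hash-modulo rule above can be sketched in a few lines of Python. This is only an illustration, not Hive's actual code: it assumes an integer bucketing column whose hash is the value itself (the classic scheme), and the names `bucket_for` and `INT_MAX` are made up for this example. Newer Hive versions may use a different hash function, so the concrete bucket numbers can differ.

```python
# Sketch of hash-modulo bucket assignment.
# Assumption: an INT bucketing column whose hash is the value itself.
INT_MAX = 0x7FFFFFFF  # mask that keeps the hash non-negative


def bucket_for(sno: int, num_buckets: int) -> int:
    """Return the 0-based bucket index for an integer key."""
    return (sno & INT_MAX) % num_buckets


# The seven student numbers from the test data, spread over 4 buckets.
sample_snos = [1, 2, 3, 4, 5, 6, 7]
assignment = {sno: bucket_for(sno, 4) for sno in sample_snos}
print(assignment)
```

Rows with the same key always land in the same bucket, which is what later makes sampling and bucket-wise joins possible.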

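Why Bucketing Matters

It preserves the bucket structure for later queries, because the data has already been hash-distributed on the bucketing column.
Bucketed tables are well suited to data sampling.
Sampling becomes more efficient. With large data sets, being able to test a query on just part of the data is very convenient.
It can make join queries on MapReduce more efficient.
When two tables are bucketed on the same column, a join on that column can be performed efficiently on the map side. Hive only has to join each bucket of one table with the bucket of the other table that holds the same column values, instead of matching every row against the whole table.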
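
To make the join benefit concrete, here is a rough Python sketch of a bucket-wise join. It assumes both inputs are bucketed on the join key with the same bucket count and non-negative integer keys; `bucketize`, `bucket_join` and the `scores` data are hypothetical names and values used only for illustration, not Hive internals.

```python
from collections import defaultdict


def bucketize(rows, num_buckets):
    """Group rows by (key % num_buckets); the key is the first field."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[0] % num_buckets].append(row)
    return buckets


def bucket_join(left, right, num_buckets):
    """Join two row lists bucket by bucket instead of all-against-all."""
    left_b = bucketize(left, num_buckets)
    right_b = bucketize(right, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Build a small lookup table from the matching right-hand bucket only.
        lookup = defaultdict(list)
        for right_row in right_b.get(b, []):
            lookup[right_row[0]].append(right_row)
        for left_row in left_b.get(b, []):
            for right_row in lookup.get(left_row[0], []):
                joined.append(left_row + right_row[1:])
    return joined


students = [(1, "liming"), (2, "wangwu"), (5, "dongmei")]
scores = [(1, 90), (5, 75), (6, 60)]  # hypothetical second table
result = sorted(bucket_join(students, scores, 4))
print(result)
```

Each bucket pair is small enough to be joined in memory, which is the idea behind doing the work on the map side.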

Creating a Bucketed Table

Test data

0001	liming	male	23	history
0002	wangwu	female	21	chinese
0003	dufu	male	20	art
0004	liudehua	male	23	math
0005	dongmei	female	20	art
0006	ziri	male	21	history
0007	tianming	female	23	art

Example

Step 1: create the bucketed table

drop table if exists student;
create table if not exists student
(
    sno     int,
    name    string,
    sex     string,
    age     int,
    academy string
)
    clustered by (sno) sorted by (age desc) into 4 buckets
    row format delimited
        fields terminated by ','
;

-- The bucketing column and the sort column do not have to be the same.

Step 2: prepare the data (create a staging table)

-- The staging table is a plain, non-bucketed table, so load data can fill it directly.
drop table if exists temp_student;
create table temp_student
(
    sno     int,
    name    string,
    sex     string,
    age     int,
    academy string
)
    row format delimited
        fields terminated by '\t'
;

load data local inpath '/data/ss.txt' into table temp_student;
select * from temp_student;

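Step 3: query the staging table and insert the rows into the bucketed table

insert into table student
select *
from temp_student
    distribute by sno sort by age desc;

Or:
insert overwrite table student
select * from temp_student
distribute by(sno) 
sort by (age desc)
;

Note: never populate a bucketed table with load data or by uploading files directly. Those methods only place files in the table directory without hashing any rows, so the data does not end up bucketed.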
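
What the insert with distribute by and sort by does to the seven test rows can be simulated roughly as follows. This is a sketch, assuming the bucket is simply sno modulo 4 (see the hashing principle earlier); `distribute_and_sort` is an illustrative name, and the real file layout depends on the Hive version.

```python
# The seven rows from the test data: (sno, name, sex, age, academy).
ROWS = [
    (1, "liming", "male", 23, "history"),
    (2, "wangwu", "female", 21, "chinese"),
    (3, "dufu", "male", 20, "art"),
    (4, "liudehua", "male", 23, "math"),
    (5, "dongmei", "female", 20, "art"),
    (6, "ziri", "male", 21, "history"),
    (7, "tianming", "female", 23, "art"),
]


def distribute_and_sort(rows, num_buckets):
    """Mimic 'distribute by sno sort by age desc' into num_buckets files."""
    files = [[] for _ in range(num_buckets)]
    for row in rows:
        files[row[0] % num_buckets].append(row)    # distribute by sno
    for bucket_file in files:
        bucket_file.sort(key=lambda r: r[3], reverse=True)  # sort by age desc
    return files


bucket_files = distribute_and_sort(ROWS, 4)
snos_per_file = [[row[0] for row in f] for f in bucket_files]
print(snos_per_file)
```

A plain file copy skips both steps, which is why loaded files carry no bucket structure.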

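Notes

In version 2.1.1 bucketing is always enforced, so manually changing the number of reducers does not affect the final number of files (the file count is determined by the bucket count).

On older versions, such as 1.2.1, the following properties can be set:

1. Set the number of reducers equal to the number of buckets:
set mapreduce.job.reduces=4;
2. If the data set is small, MapReduce local mode can be used:
set hive.exec.mode.local.auto=true;
3. Enforce bucketing (the default is false): set hive.enforce.bucketing=true;
4. Enforce sorting: set hive.enforce.sorting=true;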

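Querying a Bucketed Table

Syntax:

tablesample(bucket x out of y on sno)
x: the bucket to start reading from, counted from 1; x cannot be greater than y.

y in version 2.1.1: the total number of buckets the data is divided into for sampling; any value may be chosen.
y in older versions such as 1.2.1: must be a factor or a multiple of the table's bucket count.

Query everything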

select * from student;
select * from student tablesample ( bucket  1 out of 1);
+---+--------+------+---+-------+
|sno|name    |sex   |age|academy|
+---+--------+------+---+-------+
|6  |ziri    |male  |21 |history|
|3  |dufu    |male  |20 |art    |
|7  |tianming|female|23 |art    |
|4  |liudehua|male  |23 |math   |
|1  |liming  |male  |23 |history|
|2  |wangwu  |female|21 |chinese|
|5  |dongmei |female|20 |art    |
+---+--------+------+---+-------+

Query a specific bucket

-- Bucket 1
select * from student tablesample ( bucket  1 out of 3);
+---+----+----+---+-------+
|sno|name|sex |age|academy|
+---+----+----+---+-------+
|6  |ziri|male|21 |history|
|3  |dufu|male|20 |art    |
+---+----+----+---+-------+

-- Bucket 2
select * from student tablesample ( bucket  2 out of 3 );
+---+--------+------+---+-------+
|sno|name    |sex   |age|academy|
+---+--------+------+---+-------+
|7  |tianming|female|23 |art    |
|4  |liudehua|male  |23 |math   |
|1  |liming  |male  |23 |history|
+---+--------+------+---+-------+

-- Bucket 3
select * from student tablesample ( bucket  3 out of 3 );
+---+-------+------+---+-------+
|sno|name   |sex   |age|academy|
+---+-------+------+---+-------+
|2  |wangwu |female|21 |chinese|
|5  |dongmei|female|20 |art    |
+---+-------+------+---+-------+
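
The three results above are consistent with a simple rule: a row belongs to bucket x out of y when the hash of sno modulo y equals x - 1. The sketch below checks this against the sample data. It assumes, as before, that the hash of an integer column is the value itself, and `tablesample_bucket` is a hypothetical helper, not a Hive function.

```python
def tablesample_bucket(snos, x, y):
    """Keep the keys that fall into the x-th (1-based) of y hash buckets."""
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    return [s for s in snos if (s & 0x7FFFFFFF) % y == x - 1]


# Student numbers in the order returned by 'select * from student'.
snos = [6, 3, 7, 4, 1, 2, 5]
samples = {x: tablesample_bucket(snos, x, 3) for x in (1, 2, 3)}
print(samples)
```

With y = 1 every key satisfies the condition, which matches the "bucket 1 out of 1" query returning the whole table.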

Other queries

-- Return three rows; by default both statements return the first three rows
select * from student limit 3;
select * from student tablesample(3 rows);

+---+--------+------+---+-------+
|sno|name    |sex   |age|academy|
+---+--------+------+---+-------+
|6  |ziri    |male  |21 |history|
|3  |dufu    |male  |20 |art    |
|7  |tianming|female|23 |art    |
+---+--------+------+---+-------+

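-- Sample a percentage of the data. The percentage refers to data size, so the rows making up roughly that share of the data are returned.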
select * from student tablesample ( 20 percent );

+---+----+----+---+-------+
|sno|name|sex |age|academy|
+---+----+----+---+-------+
|6  |ziri|male|21 |history|
|3  |dufu|male|20 |art    |
+---+----+----+---+-------+


    
-- Sample a fixed amount of data
-- The size takes a unit suffix (B, K, M, G)
select * from student tablesample ( 68B );
+---+--------+------+---+-------+
|sno|name    |sex   |age|academy|
+---+--------+------+---+-------+
|6  |ziri    |male  |21 |history|
|3  |dufu    |male  |20 |art    |
|7  |tianming|female|23 |art    |
|4  |liudehua|male  |23 |math   |
+---+--------+------+---+-------+

    
-- Pick three rows at random
select * from student order by rand() limit 3;

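Comparison with ntile

As the query below shows, ntile achieves an effect similar to bucketing without creating a bucketed table. It is handy for requirements such as averaging the top 30% of the data.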

with a as (select *, ntile(3) over () rn from student)
select *
from a
where rn = 2;

+---+--------+------+---+-------+--+
|sno|name    |sex   |age|academy|rn|
+---+--------+------+---+-------+--+
|4  |liudehua|male  |23 |math   |2 |
|7  |tianming|female|23 |art    |2 |
+---+--------+------+---+-------+--+
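
Why rn = 2 returns exactly two of the seven rows follows from how ntile sizes its groups: rows are split as evenly as possible, and the earliest groups take the leftover rows. A small Python sketch of that rule (the function name `ntile` here is just an illustrative helper, and which rows land in which group depends on the row order, which `over ()` leaves unspecified):

```python
def ntile(n_rows, k):
    """Return the 1-based group number for each of n_rows ordered rows."""
    base, extra = divmod(n_rows, k)
    groups = []
    for g in range(1, k + 1):
        # The first 'extra' groups each receive one additional row.
        size = base + (1 if g <= extra else 0)
        groups.extend([g] * size)
    return groups


tiles = ntile(7, 3)
print(tiles)
```

Unlike bucketing, the group depends only on a row's position, not on a hash of any column.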

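Summary:

Definition

clustered by (id)          -- bucket the table on the given table column
sorted by (id asc|desc)    -- declare the sort order the data in each bucket is expected to follow

Loading data

cluster by (id)   
-- Chooses the column that getPartition hashes on; the same column is also the sort column, always in ascending order
-- Equivalent to distribute by (id) sort by (id)

distribute by (id)         -- chooses the column that getPartition hashes on
sort by (name asc | desc)  -- chooses the sort column

-- Difference: with distribute by, the getPartition column and the sort column can be specified separately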

-- When loading data:
insert overwrite table buc3
select id,name,age from temp_buc1
distribute by (id) sort by (id asc)
;
-- This has the same effect as the following statement
insert overwrite table buc4
select id,name,age from temp_buc1
cluster by (id)
;

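Notes

Partitioning uses a column that is not among the table's data columns; bucketing uses a column that is.
Bucketing manages data at a finer granularity and is mainly used for sampling and joins.

Conclusion

Data in a partitioned table is not necessarily evenly distributed; a bucketed table can spread the data more evenly.
ntile can also be used to split data into buckets.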
