Hive 分区分桶为什么分区分桶怎么分区分桶

最新推荐文章于 2023-12-21 14:23:50 发布

jialun0116

最新推荐文章于 2023-12-21 14:23:50 发布

阅读量459

点赞数

分类专栏：面经数据仓库 Hive 文章标签： hive 数据库大数据

本文链接：https://blog.csdn.net/qq_30031221/article/details/114419273

版权

面经同时被 3 个专栏收录

21 篇文章 3 订阅

订阅专栏

Hive

11 篇文章 1 订阅

订阅专栏

数据仓库

10 篇文章 2 订阅

订阅专栏

Hive 分区分桶

为什么有分区
怎么分区
如何使用分区
分区注意事项
为什么要分桶？
分桶的意义
如何使用分桶
分区分桶表

为什么有分区

随着系统运行时间增长，表的数据量越来越大，使用分区技术可以指定条件，缩小数据扫描的范围，避免hive全表扫描，提升查询效率
可以将用户的整个表的数据划分到多个子目录，

怎么分区

根据业务，通常按照年月日、地区等

如何使用分区

PARTITION BY(col_name data_type)
hive的分区字段使用的是表外字段。而mysql使用的是表内字段。
hive的分区名区分大小写
hive的分区本质是在表目录下面创建目录，但是该分区字段是一个伪列，不真实存在于数据中
一张表可以有一个或者多个分区，分区下面也可以有一个或者多个分区
双分区partitioned by (date_time string,type string），在文件系统中的表现为date_time为一个文件夹，type为date_time的子文件夹。
动态分区列必须在 SELECT 语句中的最后一个列中指定，且顺序与它们在 PARTITION() 子句中出现的顺序相同。
动态分区需要开启 set hive.exec.dynamic.partition = true; hive.exec.dynamic.parition.mode=nonstrict;

-- 创建静态分区 数据加载到指定的分区
create table if not exists part1(
	uid int,
  uname string,
  uage int
)PARTITION BY (country string)
row format delimiterd fileds terminated by ',';
(stored as ORC| SequenceFile) ORC、 SequenceFile都是存储方式
(loacation 地址)
-- 导入数据 需要指定分区 数据未知，根据分区值确定创建分区
load data local inpath '/usr/loacl/xxx'
into table part1 partition(country='China');

-- 开启动态分区 默认为false，不开启
set hive.exec.dynamic.partition=true;
-- 创建动态双分区
create table if not exists dt_part1(
  uid int,
  uname string,
  uage int
)
PARTITIONED BY (year string,month string)
row format delimited fields terminated by ',';
-- 在文件系统中的表现为date_time为一个文件夹，type为date_time的子文件夹。

-- 追加写入数据
insert into dy_part1 partition(year,month)
select * from part_tmp;
-- 覆盖写入数据
insert overwrite dy_part1 partition(year,month)
select * from part_tmp;

-- 混合分区
create table if not exists dy_part2(
	uid int,
	uname string,
	uage int
)
PARTITIONED BY (year string,month string)
row format delimited fields terminated by ',';
-- 插入数据
insert into dy_part2 partition(year='2018',month)
select uid,uname,uage,month from part_tmp；

-- 多个范围分区键
create table test_demo (value int)
partitioned by range (id1 INT, id2 INT, id3 INT)
(
-- id1在(--∞,5]之间，id2在(-∞,105]之间，id3在(-∞,205]之间
  partition p5_105_205 VALUES LESS THAN (5, 105, 205),
  -- id1在(--∞,5]之间，id2在(-∞,115]之间，id3在(-∞,+∞]之间
  partition p5_115_max VALUES LESS THAN (5, 115, MAXVALUE)
)

-- 查看分区数据
select * from part1 where country = 'China';

-- 显示分区
show partitions part1;
-- 增加分区
alter table part1 add partition(country = 'india') partition(country = 'America');
-- 增加分区并设置数据
alter table part1 add partition(country = 'xxx')
location 'user/hive/warehouse/xxx'
-- 修改分区的存储路径 hdfs路径必须是全路径
alter table part1 partition(country='Vietnam') 
set location 'hdfs://hadoop01:9000/user/hive/warehouse/brz.db/part1/country=Vietnam'
-- 删除分区
alter table part1 drop partition(country = 'india')


-- 手动向hdfs中创建分区目录，添加数据，创建好hive的外表之后，无法加载数据，
-- 元数据中没有相应的记录
msck repair table tablename

分区注意事项

hive的分区使用的表外字段，分区字段是一个伪列但是可以查询过滤。
分区使用的是表外字段，分桶使用的是表内字段
分区字段不建议使用中文
不太建议使用动态分区。因为动态分区将会使用mapreduce来查询数据，如果分区数量过多将导致namenode和yarn的资源瓶颈。所以建议动态分区前也尽可能之前预知分区数量。
分区属性的修改均可以使用手动元数据和hdfs的数据内容
在hive中的数据是存储在hdfs中的**，我们知道hdfs中的数据是不允许修改只能追加的**，那么在hive中执行数据修改的命令时，就只能先找到对应的文件，读取后执行修改操作，然后重新写一份文件。如果文件比较大，就需要大量的IO读写。在hive中采用了分桶的策略，只需要找到文件存放对应的桶，然后读取再修改写入即可。

为什么要分桶？

单个分区或者表中的数据量越来越大，当分区不能更细粒的划分数据时，所以会采用分桶技术将数据更细粒度的划分和管理。

分桶的意义

分桶是更细粒度的划分、管理数据，更多用来做数据抽样、JOIN操作
大表在JOIN的时候，效率低下。如果对两个表先分别按id分桶，那么相同id都会归入一个桶。那么此时再进行JOIN的时候是按照桶来JOIN的，那么大大减少了JOIN的数量。
对数据抽样的时候，也不需要扫描整个文件。只需要对每个分区按照相同规则抽取一部分数据即可。
原始数据中加入一些额外的结构，这些结构可以用于高效的查询，例如，基于ID的分桶可以使得用户的查询非常的块。

如何使用分桶

定义：
- clustered by (uid) – 指定分桶的字段
- sorted by (uid desc) – 指定数据的排序规则，表示预期的数据就是以这里设置的字段以及排序规则来进行存储
- into x buckets 放进几个桶里
- 分区使用的是表外字段，分桶使用的是表内字段
- 分桶数和reduce数对应一个文件对应一个分桶
- hash 值为 20 的 HDFS 目录为：/ warehouse /app/dt =20100801/ctry=US/part-00020
导入数据
- 导入数据有两种，一种是通过文件导入，但是并不会真正的分桶,load data只是把文件上传到表所在的HDFS目录下。并没有做其他操作；一种是通过从其他表插入的方式导入数据，这种方式才能真正的分桶；
- cluster by (uid) – 指定getPartition以哪个字段来进行hash散列，并且排序字段也是指定的字段，默认以正序进行排序
- distribute by(uid) – 指定getPartition以哪个字段来进行hash散列
- sort by(uid asc) – 指定排序字段，以及排序规则，更灵活的方式，这种数据获取方式可以分别指定getPartition的字段和sort的字段
- 方式1
  - 打开enforce bucketing开关,设置强制分桶属性 set hive.enforce.bucketing=true 2.x版本不需要
  - 设置reduces数为-1： hive.enforce.bucketing为true时，reduce要设为-1；
  - insert overwrite table buc1 select uid,uname,uage from buc_temp;
  - 得到的分桶对应的文件，数据是无序的，也就是 sorted by 或 sort by无效）
- 方式2
  - 关闭强制分桶 set hive.enforce.bucketing = false
  - 将reducer个数设置为目标表的桶数，并在 SELECT 语句中用 DISTRIBUTE BY <bucket_key>
    – 对查询结果按目标表的分桶键分进reducer中。
  - set mapred.reduce.tasks = num_buckets
  - insert into table buc1 select uid,uname,uage from buc_temp distribute by (uid) sort by (uage desc);
- cluster by (uid)与distribute by(uid) sort by (uid asc)结果是一样的
抽样语句：tablesample（bucket x out of y）
- y必须是table总共bucket数的倍数或者因子。
- 例如：table总共分了64份，当y=32时，抽取2(64/32)个bucket的数据，当y=128时，抽取1/2(64/128)个bucket的数据。x表示从哪个bucket开始抽取。
- 例如：table总共bucket数为32，tablesample(bucket 3 out of 16)表示总共抽取2(32/16)个bucket的数据，分别为第三个bucket和第19(3+16)个bucket的数据。
- select * from table_name tablesample(n percent) 抽出n%的数据全表扫描
- 如果在 TABLESAMPLE 子句中指定的列与 CLUSTERED BY 子句中的列相匹配，则 TABLESAMPLE 只扫描表中要求的哈希分区【就是具体的桶】

--创建一个分桶表 并且指定排序字段及排序规则
create table if not exists buc1(
  uid int,
  uname string,
  uage int
)
clustered by (uid) 
sorted by(uid desc) into 4 buckets
row format delimited fields terminated by ',';

-- cluster by (uid)指定getPartition以哪个字段来进行hash散列，并且排序字段也是指定的字段，默认以正序进行排序
-- distribute by(uid) – 指定getPartition以哪个字段来进行hash散列

-- 加载数据 方式1
-- 打开enforce bucketing开关,设置强制分桶属性
set hive.enforce.bucketing=true
set mapred.reduce.tasks = -1
insert overwrite table buc1
select uid,uname,uage from buc_temp
sort by (uid);

-- 加载数据 方式2
-- 将reducer个数设置为目标表的桶数，并在 SELECT 语句中用 DISTRIBUTE BY <bucket_key>
-- 对查询结果按目标表的分桶键分进reducer中。
set hive.enforce.bucketing = false
set mapred.reduce.tasks = num_buckets
insert into table buc1
select uid,uname,uage from buc_temp
distribute by (uid) sort by (uage desc);

-- 查看表结构
desc formatted tablename;

-- 分桶查询结果
select * from buc1 cluster by (uid);
'''
	采样 TABLESAMPLE(BUCKET x OUT OF y) 
	x：表示从哪个 bucket 开始抽取数据 y：必须为该表总 bucket 数的倍数或因子
'''
-- 查询第几桶 取出 uid % 4 == 0的数据
select * from buc1 tablesample(bucket 1 out of 4 on uid); 
-- 查询uid 为奇数
select * from buc1 tablesample(bucket 2 out of 2 on uid)
-- 随机查询三条数据
select * from part_tmp order by rand() limit 3;

select * from part_tmp tablesample(0.1 percent) ;

分区分桶表

例子也可参考 https://www.studytime.xin/article/hive-partition-and-bucket.html

-- 按照性别进行分区（1男2女），在分区中按照uid的奇偶进行分桶：
-- 分区使用的是表外字段，分桶使用的是表内字段
1 gyy1 1
2 gyy2 2
3 gyy3 2
4 gyy4 1
5 gyy5 2
6 gyy6 1
7 gyy7 1
8 gyy8 2
9 gyy9 1
10 gyy10 1
11 gyy11 2
12 gyy12 1

-- 创建带有分区的分桶表
create table if not exists stus(
	uid int,
	uname string
)
partitioned by(sex int)
clustered by(uid) into 2 buckets
row format delimited filed terminated by ' ';
-- 创建临时表
create table if not exists stu_temp(
	uid int,
	uname string,
	usex int
)
row format delimited fields terminated by ' ';

-- 临时表中添加数据
load data local inpath '/usr/local/hivedata/stu.dat' into table stu_temp
-- 分桶表中加数据
insert into table stus partition(sex)
select uid,uname,usex from stu_temp
cluster by (uid);

-- 查询性别为女性的、并且学号为奇数的学生：
select * from stus tablesample(bucket 2 out of 2 on uid)
where sex=2;

jialun0116

关注

0
点赞
踩
5

收藏

觉得还不错? 一键收藏
0
评论
Hive 分区分桶为什么分区分桶怎么分区分桶

Hive 分区分桶为什么有分区怎么分区如何使用分区分区注意事项为什么要分桶？分桶的意义如何使用分桶分区分桶表为什么有分区随着系统运行时间增长，表的数据量越来越大，使用分区技术可以指定条件，缩小数据扫描的范围，避免hive全表扫描，提升查询效率可以将用户的整个表的数据划分到多个子目录，怎么分区根据业务，通常按照年月日、地区等如何使用分区PARTITION BY(col_name data_type)hive的分区字段使用的是表外字段。而mysql使用的是表内字段。hive的分区名
复制链接

扫一扫