Hive Bucketing and Partitioning

刘李�

Article link: https://blog.csdn.net/weixin_43740162/article/details/84991268

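Bucketing
Concept: bucketing is a further, finer-grained division of the data inside a table or partition. Rows are assigned to a fixed number of buckets according to the hash of the bucketing column. The idea is similar to partitioning.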

Create a bucketed table
-- columns id and name, bucketed by id into 4 buckets, fields separated by \t
create table stu_buck(id int, name string)
clustered by (id)
into 4 buckets
row format delimited fields terminated by '\t';
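Insert data into the bucketed table
Do not use the load statement to import data into a bucketed table! load only moves files into the table directory and does not hash rows into buckets, so if it appears to work, bucketing is most likely not being enforced.
set hive.enforce.bucketing=true;
set mapreduce.job.reduces=-1;
(hive.enforce.bucketing is needed on Hive 1.x; from Hive 2.0 the property was removed and bucketing is always enforced.)
First create a plain staging table: create table stu(id int, name string) row format delimited fields terminated by '\t';
Load local data into that table: load data local inpath '/path/' into table stu;
Then use insert ... select to populate the bucketed table: insert into table stu_buck select * from stu;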
Query the data
select * from stu_buck;
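
One way to check that the insert really produced buckets is to inspect the table metadata and list its directory from the Hive CLI. This is only a sketch: the path assumes the table lives in the default database under the default warehouse location /user/hive/warehouse, which may differ on your cluster.

```sql
-- The output should include the bucket count and the bucketing column.
desc formatted stu_buck;

-- List the data files backing the table (the path is an assumption).
dfs -ls /user/hive/warehouse/stu_buck;
```

With 4 buckets and a single insert, one would normally expect four data files, one per bucket.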
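Bucket sampling query
select * from stu_buck TABLESAMPLE(bucket x out of y on id);
tablesample is the sampling clause. x is the bucket to start from (numbered from 1), and y must be a multiple or a factor of the table's total number of buckets. Hive uses y to decide the sampling ratio. Note that x must not be greater than y!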
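
As a rough illustration of how x and y interact, here are three sampling queries against the 4-bucket stu_buck table defined earlier. The comments describe the commonly documented behavior (rows whose hashed id falls into the selected slice); actual row counts depend on how the ids are distributed.

```sql
-- y = 4 equals the bucket count: 4/4 = 1 bucket is read, the 1st one.
select * from stu_buck tablesample(bucket 1 out of 4 on id);

-- y = 2 is a factor of 4: 4/2 = 2 buckets are read, the 1st and the 3rd.
select * from stu_buck tablesample(bucket 1 out of 2 on id);

-- y = 8 is a multiple of 4: roughly half of the 1st bucket is read.
select * from stu_buck tablesample(bucket 1 out of 8 on id);
```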

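Partitioning
A Hive table is a directory on HDFS.
The data in a Hive table is simply the data stored under that HDFS directory.
Concept: partitioning manages a Hive table's data by splitting it into partitions, each stored in its own subdirectory.
In what follows, partition1 is the table name.
DDL operations
Create a partitioned table
create table partition1 (col1 type1, col2 type2, ...) partitioned by (part_col1 type1, part_col2 type2, ...) row format delimited fields terminated by '\t';
(The partition columns must not repeat the names of the regular columns.)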

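Load data into a partition
load data local inpath 'local path' into table partition1 partition(part_col1='', part_col2='');

Query by partition column
select * from partition1 where part_col1='' and part_col2='';
(with two partition columns, join the conditions with and)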
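
To make the templates above concrete, here is a minimal sketch with two partition columns. The regular columns (id, name), the partition columns (month, day), and the local file /tmp/stu.txt are all assumptions chosen for illustration.

```sql
-- A table partitioned by month and day (column names are hypothetical).
create table partition1 (id int, name string)
partitioned by (month string, day string)
row format delimited fields terminated by '\t';

-- Load a local tab-separated file into one specific partition.
load data local inpath '/tmp/stu.txt'
into table partition1 partition(month='201812', day='12');

-- Filter on both partition columns so only that directory is scanned.
select * from partition1 where month='201812' and day='12';
```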

Union query
select * from partition1 where part_col='1'
union
select * from partition1 where part_col='2'
union
…
This union query is executed as a MapReduce job.

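Add a partition (single)
alter table partition1 add partition(part_col='3');

Add partitions (multiple, separated by spaces)
alter table partition1 add partition(part_col='4') partition(part_col='5');

Drop a partition (single)
alter table partition1 drop partition(part_col='5');

Drop partitions (multiple, separated by commas)
alter table partition1 drop partition(part_col='4'), partition(part_col='3');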

List partitions
show partitions partition1;

Show the structure of the partitioned table
desc formatted partition1;

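Uploading data to HDFS from Hive
First create the directory.
Create it under the partitioned table's directory:
dfs -mkdir -p /partitioned-table-path/part_col=''/…;
dfs -put /local-path/ /target-hdfs-path/;

If you query at this point, for example select * from partition1 where month='201812' and day='12';
no data is returned!
The uploaded directory still has to be linked to the table's metadata.
Option 1: repair the table
msck repair table partition1; (partition1 is the partitioned table created earlier!)
Option 2: add the partition explicitly
alter table partition1 add partition(the partition spec matching the directory created above);
Option 3: load data local inpath 'local path' into table partition1 partition(part_col1='', part_col2='');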
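
The sequence above could look roughly like this when run inside the Hive CLI. It is a sketch under assumptions: the table is taken to be partitioned by month and day, to live under the default warehouse location /user/hive/warehouse, and the local file /tmp/stu.txt is hypothetical.

```sql
-- Create the partition directory by hand and upload a file into it.
dfs -mkdir -p /user/hive/warehouse/partition1/month=201812/day=12;
dfs -put /tmp/stu.txt /user/hive/warehouse/partition1/month=201812/day=12;

-- Option 1: let Hive discover directories that are missing from the metastore.
msck repair table partition1;

-- Option 2: register the single partition explicitly instead.
alter table partition1 add if not exists partition(month='201812', day='12');

-- The partition should now be listed and queryable.
show partitions partition1;
select * from partition1 where month='201812' and day='12';
```

Either option on its own is enough. msck repair is convenient when many directories were added at once, while add partition touches only the one you name.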

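Rename a table
alter table partition1 rename to partition2;

Add columns
alter table partition2 add columns (new_col_name new_type);
(newly added columns are placed at the end)

Change a column
alter table partition2 change column old_col_name new_col_name new_type;
This sometimes fails with: The following columns have types incompatible with the existing columns in their respective positions：…
Cause
I ran into this while executing ALTER on a column in Hive. After some experimenting, it turned out to be caused by Hive's implicit type conversion rules. My guess is that it relates to how Hive stores table properties internally: a column type can only be changed in a direction the conversion rules allow (for example, int to string works, but the reverse does not). Changing a column's position with ALTER must satisfy the same rule. I suspect that moving a column is internally just a rename, so changing column positions is not very reliable.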
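
A small sketch of the rule described above, assuming partition2 has the hypothetical columns id int and name string. The last statement is left commented out because it is the kind of change expected to trigger the error.

```sql
-- Rename a column and keep its type: always compatible.
alter table partition2 change column name stu_name string;

-- int to string is an allowed implicit conversion, so this is accepted.
alter table partition2 change column id id string;

-- string back to int is not an implicit conversion. When the metastore
-- setting hive.metastore.disallow.incompatible.col.type.changes is true,
-- this is expected to fail with the "types incompatible" error:
-- alter table partition2 change column id id int;
```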

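Replace columns (all columns)
alter table partition2 replace columns (the full list of column names and types);
The data files on HDFS are left untouched!

Drop a table
drop table partition2;

Overwrite table data
Use the overwrite keyword (as in insert overwrite, or load data ... overwrite into table).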

DML operations
Importing data with insert
1. Create a partitioned table
create table student(id int, name string) partitioned by (month string) row format delimited fields terminated by '\t';
2. Insert some basic data
(1) insert into table student partition(month='201812') values(1,'ll');
(values must not come before partition! The partition is located first, then the data is inserted.)
(2) insert overwrite table student partition(month='201811') select id,name from student where month='201812';
(queries the rows with month=201812, then creates the month=201811 partition if needed and overwrites its contents with them)
(3) from student
insert overwrite table student partition(month='201810')
select id,name where month='201811'
insert overwrite table student partition(month='201809')
select id,name where month='201811';
(multi-insert mode: queries the rows with month=201811 once, then creates and overwrites both the month=201810 and month=201809 partitions with them)

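Exporting data to the local file system with insert
insert overwrite local directory 'an export path of your choice' select * from student;
Exporting formatted data to the local file system
insert overwrite local directory '/path/' row format delimited fields terminated by '\t' select * from student;
(to write to HDFS instead, drop local and use an HDFS path)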
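
For instance, the HDFS variant of the formatted export might look like this. The target directory /tmp/export/student is a hypothetical HDFS path.

```sql
-- Without "local", the target is a directory on HDFS.
insert overwrite directory '/tmp/export/student'
row format delimited fields terminated by '\t'
select * from student;
```

Because this is insert overwrite, any existing contents of the target directory are replaced, so it is safer to point it at a dedicated export directory.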
Exporting to HDFS with export
export table default.student to '/path/';
Or export with Sqoop!

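Clearing a table's data (Truncate)
(only managed tables can be truncated, not external tables)
truncate table student;

Query statements: select ... from, where, group by, having, join, aliases, inner joins, outer joins, left outer joins, and right outer joins all work broadly the same as in MySQL.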