Hive: Sorting, Partitioning, Bucketing, Functions, and Row-to-Column Conversion

Article link: https://blog.csdn.net/qq_43206800/article/details/107737699

Sorting
1.1 Order By: global sort
Key point: there is only one reducer, that is, only one partition, so the whole result set is sorted together.
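
A minimal sketch of a global sort, assuming the same `emp` table (with an `empno` column) that the later examples in this post use. Because a single reducer handles every row, this can be slow on large tables, so pairing it with `limit` is a common precaution:

```sql
-- Global sort: all rows pass through a single reducer.
select * from emp order by empno desc;

-- With a limit, only the first rows of the sorted result are returned.
select * from emp order by empno desc limit 10;
```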

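1.2 Sort By: sorting within each reducer (per-partition sort)
Key point: there are multiple reducers, that is, multiple partitions.
Note: when sort by is used on its own with multiple reducers, rows are distributed to the reducers at random, and each reducer then sorts only the rows it received.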

   insert overwrite local directory '/opt/module/hive/datas/sort-result/'
   select * from emp sort by deptno desc ;
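
For sort by to show its per-reducer effect, the job needs more than one reducer. A minimal sketch, assuming the MapReduce execution engine, where `mapreduce.job.reduces` controls the reducer count (the value 3 is arbitrary):

```sql
-- Ask for 3 reducers; the default of -1 lets Hive decide.
set mapreduce.job.reduces=3;

-- Print the current value to confirm the setting.
set mapreduce.job.reduces;

-- One output file per reducer; each file is sorted by deptno on its own.
insert overwrite local directory '/opt/module/hive/datas/sort-result/'
select * from emp sort by deptno desc;
```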

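1.3 Distribute By: partitioning
Key point: specifies the column by which rows are partitioned across reducers; rows with the same value of that column go to the same reducer.

   insert overwrite local directory '/opt/module/hive/datas/distribute-result/'
   select * from emp distribute by deptno sort by empno desc ;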

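1.4 Cluster By: partitioning and sorting
Key point: equivalent to using distribute by and sort by together, provided both use the same column and the sort order is ascending. The two statements below are equivalent.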

   select * from emp distribute by deptno sort by deptno asc ; 
   select * from emp cluster by deptno ;

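Partitioned tables

2.1 Problem: Hive generally does not use indexes the way a traditional database does, so by default a query scans the entire data set by brute force.
2.2 Essence: a partitioned table in Hive is really a set of directories. The data as a whole is maintained across multiple directories, one per partition, so a query that filters on the partition column reads only the matching directories.

2.3 Creating a partitioned table (using dept data to simulate log data)
dept_20200401.log
dept_20200402.log
dept_20200403.log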
```
create table dept_partition (
   deptno int, dname string, loc string
)
partitioned by (day string)  -- the partition column is day, of type string
row format delimited fields terminated by '\t' ;

load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition partition(day='20200401');
load data local inpath '/opt/module/hive/datas/dept_20200402.log' into table dept_partition partition(day='20200402');
load data local inpath '/opt/module/hive/datas/dept_20200403.log' into table dept_partition partition(day='20200403');


-- Query the data of a single partition
select * from dept_partition where day = '20200401' ;
```
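
Since each partition is a directory, the layout can be inspected from the Hive CLI. A sketch, assuming the table lives in a database named `mydb` under the default warehouse path `/user/hive/warehouse` (the path this post uses in section 2.6; adjust it to your environment):

```sql
-- List the partition directories; expect one subdirectory per partition
-- (day=20200401, day=20200402, day=20200403).
dfs -ls /user/hive/warehouse/mydb.db/dept_partition;

-- Show the metadata, including the storage location, of one partition.
desc formatted dept_partition partition(day='20200401');
```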
2.4 Operations on partitions:
1. List the partitions of a partitioned table
  show partitions dept_partition;
2. Add partitions
  Add a single partition:
  alter table dept_partition add partition(day='20200404');
  Add multiple partitions (separated by spaces):
  alter table dept_partition add partition(day='20200405') partition(day='20200406');
3. Drop partitions
  Drop a single partition:
  alter table dept_partition drop partition(day='20200404');
  Drop multiple partitions (separated by commas):
  alter table dept_partition drop partition(day='20200405'), partition(day='20200406');
2.5 Two-level partitions
```
create table dept_partition2 (
   deptno int, dname string, loc string
)
partitioned by (day string,hour string)  
row format delimited fields terminated by '\t' ;

load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200402',hour='02');
load data local inpath '/opt/module/hive/datas/dept_20200402.log' into table dept_partition2 partition(day='20200402',hour='03');
load data local inpath '/opt/module/hive/datas/dept_20200403.log' into table dept_partition2 partition(day='20200402',hour='04');
```
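
With two partition columns, each hour becomes a subdirectory of its day. A short sketch of how the table loaded above might be queried, using only the partitions created by those load statements:

```sql
-- Filtering on both partition columns reads only day=20200402/hour=03.
select * from dept_partition2 where day='20200402' and hour='03';

-- Partitions are listed as day=.../hour=... paths.
show partitions dept_partition2;
```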
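2.6 Ways to associate data with a partition:
1. Create the partition directory manually, then repair the partitions
  Create the partition directory:
  hadoop fs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition/day=20200404
  Upload the data to the partition directory:
  hadoop fs -put dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition/day=20200404
  Repair the partitions in Hive:
  msck repair table dept_partition;
2. Create the partition directory manually, then add the matching partition in Hive
  Create the partition directory:
  hadoop fs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition/day=20200405
  Upload the data to the partition directory:
  hadoop fs -put dept_20200402.log /user/hive/warehouse/mydb.db/dept_partition/day=20200405
  Add the partition manually in Hive:
  alter table dept_partition add partition(day='20200405');
3. Create the partition directory manually, then load data into that partition in Hive
  Create the partition directory:
  hadoop fs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition/day=20200406
  Load the data into the specified partition in Hive:
  load data local inpath '/opt/module/hive/datas/dept_20200403.log' into table dept_partition partition(day='20200406') ;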
2.7 Dynamic partitions
1. Create a table for dynamic partitioning
  create table dept_dy_partition (
  deptno int, dname string
  )
  partitioned by (loc string)
  row format delimited fields terminated by '\t' ;
2. Insert data into dynamic partitions
  a.
  insert into table dept_dy_partition values(11,'TEST',1000);
  b.
  insert into table dept_dy_partition partition(loc) select * from dept ;
  c.
  load data local inpath '/opt/module/hive/datas/dept.txt' into table dept_dy_partition ;
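
Whether statement b runs as written depends on the dynamic partition mode. A hedged sketch, assuming a Hive version whose default mode is `strict` (which requires at least one static partition column); the explicit column list is only an illustration that makes the position of the partition column obvious:

```sql
-- Dynamic partitioning itself is normally enabled already; shown for completeness.
set hive.exec.dynamic.partition=true;
-- nonstrict allows every partition column to be dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column (loc) must come last in the select list.
insert into table dept_dy_partition partition(loc)
select deptno, dname, loc from dept;

-- One partition should now exist per distinct loc value.
show partitions dept_dy_partition;
```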
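Bucketed tables
3.1 Bucketed table: a bucketed table splits the data files into several parts, each part corresponding to one bucket.
3.2 Creating a bucketed table
create table stu_buck
(
id int , name string
)
clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';

load data inpath '/student.txt' into table stu_buck ;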

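3.3 Points to note when loading data into a bucketed table:
1. Set the number of reducers to -1 so that the job decides how many reducers to use,
  or set it to a value greater than or equal to the number of buckets.
2. Put the data on HDFS first, then run the load.
3. Do not use local mode.
3.4 Importing data into a bucketed table with insert
insert into table stu_buck select * from student_insert ;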
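
Rows are assigned to buckets by hashing the clustering column and taking the result modulo the bucket count. A minimal sketch of how the bucketing of `stu_buck` might be checked and used for sampling:

```sql
-- The output includes the bucket count and the bucket columns.
desc formatted stu_buck;

-- Sample bucket 1 of the 4 buckets, bucketing on id.
select * from stu_buck tablesample(bucket 1 out of 4 on id);
```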
Functions
4.1 List the built-in functions
show functions ;
4.2 See how a function is used (upper stands in for any function name)
desc function upper ;
desc function extended upper ;

4.3 Common functions
1) nvl
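
A minimal sketch of nvl: `nvl(value, default_value)` returns `default_value` when `value` is NULL and `value` otherwise. The `comm` column below is an assumption about the `emp` table, not something shown in this post:

```sql
-- Replace a NULL commission with -1 (comm is a hypothetical column).
select empno, nvl(comm, -1) from emp;

-- The default can also be another column: fall back to deptno when comm is NULL.
select empno, nvl(comm, deptno) from emp;
```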
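2) CASE WHEN THEN ELSE END
  Requirement: given the data below, count the men (男) and women (女) in each department.
  name dept_id sex
  悟空 A 男
  大海 A 男
  宋宋 B 男
  凤姐 A 女
  婷姐 B 女
  婷婷 B 女
Expected result:
dept_id 男 女
A 2 1
B 1 2

Analysis:
a. Group the rows by dept_id
悟空 A 男
大海 A 男
凤姐 A 女

宋宋 B 男
婷姐 B 女
婷婷 B 女
b. Within each group, map every row to 1 or 0 with case when and add the values up with sum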
```
 select dept_id ,
 sum(case sex when '男' then 1 else 0 end) man ,
 sum(case sex when '女' then 1 else 0 end) women
 from emp_sex
 group by dept_id ;
```
+----------+------+--------+
| dept_id  | man  | women  |
+----------+------+--------+
| A        | 2    | 1      |
| B        | 1    | 2      |
+----------+------+--------+
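
The same count can also be written with Hive's `if(condition, value_if_true, value_if_false)` function. This is an equivalent alternative to the case when form, not a different technique:

```sql
-- if() yields 1 for matching rows and 0 otherwise; sum() adds them per department.
select dept_id,
       sum(if(sex = '男', 1, 0)) man,
       sum(if(sex = '女', 1, 0)) women
from emp_sex
group by dept_id;
```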

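4.4 Row-to-column conversion
Requirement: given the data below, list together the names of people who share the same constellation and blood type.
name constellation blood_type
孙悟空 白羊座 A
大海 射手座 A
宋宋 白羊座 B
猪八戒 白羊座 A
凤姐 射手座 A
苍老师 白羊座 B

Expected result:
射手座,A 大海|凤姐
白羊座,A 孙悟空|猪八戒
白羊座,B 宋宋|苍老师

Analysis: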
a. Concatenate constellation and blood_type (call the result t1)
select name , concat_ws(",",constellation,blood_type) c_b
from person_info
+---------+-----------+
| name    | c_b       |
+---------+-----------+
| 孙悟空  | 白羊座,A  |
| 大海    | 射手座,A  |
| 宋宋    | 白羊座,B  |
| 猪八戒  | 白羊座,A  |
| 凤姐    | 射手座,A  |
| 苍老师  | 白羊座,B  |
+---------+-----------+

b. Group by c_b and apply collect_set to the name values in each group
select t1.c_b , concat_ws("|",collect_set(t1.name)) names
from t1
group by t1.c_b

The rows of t1 fall into three groups:
+---------+-----------+
| name    | c_b       |
+---------+-----------+
| 孙悟空  | 白羊座,A  |
| 猪八戒  | 白羊座,A  |

| 大海    | 射手座,A  |
| 凤姐    | 射手座,A  |

| 苍老师  | 白羊座,B  |
| 宋宋    | 白羊座,B  |
+---------+-----------+

Combined:
select t1.c_b , concat_ws("|",collect_set(t1.name)) names
from (select name , concat_ws(",",constellation,blood_type) c_b
from person_info)t1
group by t1.c_b ;
```
+-----------+----------------+
| t1.c_b    | names          |
+-----------+----------------+
| 射手座,A  | 大海|凤姐      |
| 白羊座,A  | 孙悟空|猪八戒  |
| 白羊座,B  | 宋宋|苍老师    |
+-----------+----------------+
```
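
Note that `collect_set` removes duplicates, so two people with the same name in one group would appear only once. If duplicates should be kept, `collect_list` can be swapped in. A sketch of that variant on the same `person_info` table:

```sql
-- collect_list keeps repeated names; collect_set would drop them.
select t1.c_b, concat_ws("|", collect_list(t1.name)) names
from (select name, concat_ws(",", constellation, blood_type) c_b
      from person_info) t1
group by t1.c_b;
```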