Hive基础-DML&分区表&内部函数

最新推荐文章于 2021-08-15 16:54:50 发布

HBinz

最新推荐文章于 2021-08-15 16:54:50 发布

阅读量325

点赞数 1

文章标签： BigData

本文链接：https://blog.csdn.net/Binbinhb/article/details/88263768

版权

一、温故知新

1、insert into追加数据的方法实质是copy了一份表到数据仓库里

hdfs dfs -ls /user/hive/warehouse/ruozedata_emp4

找到copy的表

删除：

hdfs dfs -rm -f /user/hive/warehouse/ruozedata_emp4

2、Hive中一般的查询条件不跑MR

> = between（左闭右闭）limit in('','')等简单的sql查询语句

二、聚合函数

1、max/min/count/sum/avg：多进一出，很多数据进来，输出只有一条，走MR

（1）求部门编号等于10有多少条数据

select count(*) from ruozedata_emp4 where depno = 10;

（2）求最大工资，最小工资，平均工资，工资总额是多少

select max(salary),min(salary),avg(salary),sum(salary) from ruozedata_emp;

2、分组函数 group by

(1）求每个部门的平均工资

select deptno,avg(salary) from ruozedata_emp group by deptno

（2）select中出现的字段，如果没有出现在组函数/聚合函数中，必须出现在group by里面

（3）求每个部门、工作岗位的最高工资

select deptno,job,max(salary) from ruozedata_emp group by deptno,job

（3）求每个部门的平均工资>2000的部门

select deptno,avg(salary) from ruozedata_emp group by deptno having avg(salary) > 2000

单条筛选：where

分组后多条数据筛选：having

（4）where是需要写在group by之前

（5）where和having的执行是在什么位置？

where在group by 之前

having是在聚合函数之后

三、case when then

select ename,salary,

case

when salary>1 and salary<=1000 then 'lower'

when salary>1000 and salary<=2000 then 'middle'

when salary>2000 and salary<=4000 then 'high'

else 'highest'

end

from ruozedata_emp;

相当于给数据打上标签。

四、join

1、创建表a存放人，表b存放年龄

CREATE TABLE a(id int,name string)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t' ;

CREATE TABLE b(id int,age int)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t' ;

2、加载数据

LOAD DATA LOCAL INPATH '/opt/data/join_a.txt' OVERWRITE INTO TABLE a;

LOAD DATA LOCAL INPATH '/opt/data/join_b.txt' OVERWRITE INTO TABLE b;

3、join

（1）join默认inner join：左右表同时存在的数据。

select a.id,a.name from a join b on a.id = b.id;

（2）left join（以左表为主补全）

select a.id,a.name,b.age from a left join b on a.id = b.id;

（3）right join（以右表为主补全）

select a.id,a.name,b.age from a right join b on a.id = b.id;

（4）full join（全部字段补全，但没取上数据的补null）

select a.id,a.name,b.age from a full join b on a.id = b.id;

五、分区表 partition

1、意义：who when do what

hive hdfs + partition <= where partition

==>reduce io会少很多

2、静态分区

（1）单级分区

create table order_partition(ordernumber string,eventtime string)

partitioned by (event_month string)

row format delimited fields terminated by '\t';

加载数据（本地系统）：

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

LOAD DATA LOCAL INPATH '/opt/data/order.txt' OVERWRITE INTO TABLE order_partition PARTITION (event_month='2014-05-25')

分区文件夹：

加载数据（hdfs之上）：

创建hdfs文件夹：

hdfs dfs -mkdir /user/hive/warehouse/hive2_ruozedata.db/order_partition/event_month=2014-05-26

把order.txt放到hdfs文件夹里：

hdfs dfs -put order.txt /user/hive/warehouse/hive2_ruozedata.db/order_partition/event_month=2014-05-26

加载元数据：

打开hive对应mysql的分区表partitions

select * from partitions;

select * from partition_keys;

所有的都指向2014-05-25

需要修改：

1】看官网

2】MSCK REPAIR table order_partition;

但是这个办法太暴力,刷所有分区，一般不推荐使用，请看下面：

3】看官网

ALTER TABLE order_partition ADD IF NOT EXISTS PARTITION (event_month = '2014-05-27') ;

partition_spec:

: (partition_column = partition_col_value, partition_column = partition_col_value, ...)

4】重新创建HDFS文件夹

hdfs dfs -mkdir /user/hive/warehouse/hive2_ruozedata.db/order_partition/event_month=2014-05-27

hdfs dfs -put order.txt /user/hive/warehouse/hive2_ruozedata.db/order_partition/event_month=2014-05-27

5】Hive查询分区是否更新

6】通过add分区

ALTER TABLE order_partition ADD IF NOT EXISTS PARTITION (event_month = '2014-05-27') ;

成功了

3、加载数据到分区

（1）创建订单表

create table order_4_partition(ordernumber string,eventtime string)

row format delimited fields terminated by '\t';

LOAD DATA LOCAL INPATH '/opt/data/order.txt' OVERWRITE INTO TABLE order_4_partition;

总结：

create语句如果带分区字段，load语句必须要带分区字段。

（2）insert语句

insert overwrite table order_partition partition (event_month='2014-05-28');【给order_partition创建分区，order_partition是分区表，而order_4_partition并不是分区表】

select * from order_4_partition;【用于将order_4_partition的数据拷贝到order_partition上】

insert overwrite table order_partition partition (event_month='2014-05-28')

select * from order_4_partition;

（3）查看表里的分区

show partitions order_partition;

总结：

以上的是静态分区中的单级分区。

3、多级分区（工作上基本都是多级分区）

（1）创建多级分区表

create table order_mulit_partition(

ordernumber string,

eventtime string

)

partitioned by (event_month string,event_day string)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t' ;

（2）导入数据

LOAD DATA LOCAL INPATH '/opt/data/order.txt' OVERWRITE INTO TABLE order_mulit_partition PARTITION (event_month='2014-05',event_day='01');

（3）剖析多级分区，查看HDFS的层级结构

总结：

实际，多级分区在HDFS也只是第一层包含第二层，直到最底层的订单数据表，以文件夹形式来实现。

4、动态分区

（1）创建静态分区表

create table ruozedata_static_emp

(empno int, ename string, job string, mgr int, hiredate string, salary double, comm double)

PARTITIONED by(deptno string)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t' ;

（2）将deptno为20、30，加入静态表

insert into table ruozedata_static_emp partition(deptno='20')

select empno,ename,job,mgr,hiredate,salary,comm from ruozedata_emp

where deptno=20;

insert into table ruozedata_static_emp partition(deptno='30')

select empno,ename,job,mgr,hiredate,salary,comm from ruozedata_emp

where deptno=30;

（3）如果emp表的deptno有N个，得给静态分区表分区N次。

总结：

因此，生产需要用动态表解决。

（4）创建动态表

create table ruozedata_dynamic_emp

(empno int, ename string, job string, mgr int, hiredate string, salary double, comm double)

PARTITIONED by(deptno string)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t' ;

（5）加载分区数据

动态分区明确要求：分区字段写在select的最后面

insert into table ruozedata_dynamic_emp partition(deptno)

select empno,ename,job,mgr,hiredate,salary,comm,deptno from ruozedata_emp ;

报错：

FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict

（6）直接执行：set hive.exec.dynamic.partition.mode=nonstrict

将hive.exec.dynamic.partition.mode设置为不严格模式。

（7）set hive.exec.dynamic.partition.mode=nonstrict;

这是hive中常用的设置key=value的方式

语法格式：

set key=value; 设置

set key; 取值

总结：

1、创建表的方式是一样的，是分区的方法不一样。

2、在官网Configuration Properties中查看设置

3、默认是严格模式，以及1000分区。

六、函数

1、UDF

2、show functions;

3、将目标字段转为小写

select empno,ename,lower(ename) from ruozedata_emp;

4、将表中数据转换成时间戳格式

5、其他函数

HBinz

关注

1
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Hive基础-DML&分区表&内部函数

一、温故知新1、insert into追加数据的方法实质是copy了一份表到数据仓库里hdfs dfs -ls /user/hive/warehouse/ruozedata_emp4找到copy的表删除：hdfs dfs -rm -f /user/hive/warehouse/ruozedata_emp42、Hive中一般的查询条件不跑MR&gt; = betw...
复制链接

扫一扫