[Note 18] Hive Data Manipulation (DML)

Latest recommended article published on 2024-06-17 09:08:33

fu-jw

Article link: https://blog.csdn.net/weixin_44371151/article/details/104766556

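This post walks through data operations in Hive: importing, exporting, clearing, and querying data. It focuses on the LOAD, INSERT, and TRUNCATE commands and on writing more complex queries with RLIKE, HAVING, JOIN, and CASE WHEN. It also covers window functions, the Rank family of functions, and the three kinds of user-defined functions (UDF, UDAF, UDTF), with a worked UDF example.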

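1. Importing Data

Loading data into a table (Load)

Syntax: load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1,…)];

load data: load data into a table
local: load from the local file system into the Hive table; without it, the data is loaded from HDFS
inpath: the path of the data to load
overwrite: replace the data already in the table; without it, the data is appended
into table: which table to load into
student: the specific table
partition: load into the specified partition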

-- 1. Create a table
create table student(id string, name string) row format delimited fields terminated by '\t';
-- 2. Load an HDFS file into Hive
	-- 2.1 Upload the file to HDFS
	dfs -put /opt/module/datas/student.txt /user/atguigu/hive;
	-- 2.2 Load the data from HDFS
	load data inpath '/user/atguigu/hive/student.txt' into table default.student;
	-- 2.3 Load data and overwrite what is already in the table
	-- (loading from HDFS moves the file, so upload it again before this step)
	load data inpath '/user/atguigu/hive/student.txt' overwrite into table default.student;

Inserting data into a table from a query (Insert)

-- 1. Create a partitioned table
create table student(id int, name string) partitioned by (month string) row format delimited fields terminated by '\t';
-- Basic insert of literal values
insert into table student partition(month='201709') values(1,'wangwu');
-- Basic insert (from the result of a single query)
insert overwrite table student partition(month='201708') select id, name from student where month='201709';
-- Multi-insert (scan the source table once and write several results)
from student
insert overwrite table student partition(month='201707') select id, name where month='201709'
insert overwrite table student partition(month='201706') select id, name where month='201709';

Creating a table and loading data from a query (As Select)

create table if not exists student3 as select id, name from student;

Specifying the data path with Location when creating a table

-- 1. Create the table and specify its location on HDFS
create table if not exists student5(id int, name string)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/student5';
-- 2. Upload the data to HDFS
dfs -put /opt/module/datas/student.txt /user/hive/warehouse/student5;

Importing data into a specified Hive table (Import)

-- Note: the data must first be exported with export before it can be imported.
import table student2 partition(month='201709') from '/user/hive/warehouse/export/student';

2. Exporting Data

Exporting with Insert

-- 1. Export query results to the local file system
insert overwrite local directory '/opt/module/datas/export/student' select * from student;
-- 2. Export formatted query results to the local file system
insert overwrite local directory '/opt/module/datas/export/student1' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select * from student;
-- 3. Export query results to HDFS (no local keyword)
insert overwrite directory '/user/atguigu/student2' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select * from student;

Exporting to the local file system with Hadoop commands

dfs -get /user/hive/warehouse/student/month=201709/000000_0 /opt/module/datas/export/student3.txt;

Exporting with the Hive shell

# Basic syntax: hive -f/-e <statement or script> > file
hive -e 'select * from default.student;' > /opt/module/datas/export/student4.txt;

Exporting to HDFS with Export

export table default.student to '/user/hive/warehouse/export/student';

Exporting with Sqoop

3. Clearing Table Data (Truncate)

-- Note: Truncate can only clear managed tables; it cannot delete data in external tables.
truncate table student;

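4. Queries

RLIKE
The RLIKE clause is a Hive extension of LIKE: it lets you specify the match condition with Java regular expressions, a more powerful pattern language.

-- Find employees whose salary starts with 2
select * from emp where sal LIKE '2%';
-- Find employees whose salary has 2 as its second digit
select * from emp where sal LIKE '_2%';
-- Find employees whose salary contains a 2
select * from emp where sal RLIKE '[2]';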
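
The difference between the two operators can be sketched in Python. This is only an illustration under stated assumptions: the salary strings are made up, `sql_like` and `rlike` are hypothetical helper names, and Python's `re` module stands in for Java regular expressions (the two dialects agree on the simple patterns used here).

```python
import re

def sql_like(value: str, pattern: str) -> bool:
    """Emulate LIKE: '%' matches any run of characters, '_' exactly one.
    The whole value has to match the pattern."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None

def rlike(value: str, pattern: str) -> bool:
    """Emulate RLIKE: true if any substring of value matches the regex."""
    return re.search(pattern, value) is not None

# Hypothetical salary values, written as strings the way LIKE sees them
salaries = ["2450.0", "1250.0", "800.0", "3200.0"]

starts_with_2 = [s for s in salaries if sql_like(s, "2%")]
second_is_2 = [s for s in salaries if sql_like(s, "_2%")]
contains_2 = [s for s in salaries if rlike(s, "[2]")]
print(starts_with_2, second_is_2, contains_2)
```

LIKE has to match the entire value, while RLIKE succeeds as soon as the pattern matches anywhere inside it, which is why `'[2]'` finds every salary containing a 2.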

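Differences between Having and Where

where acts on columns of the table and filters rows before grouping; having acts on columns of the query result and filters the grouped output.
Aggregate functions cannot appear after where, but they can be used after having.
having is only used in group by aggregation statements.

-- Find the departments whose average salary is greater than 2000
select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;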
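
To see the two filters side by side, here is a small sketch that runs comparable SQL on an in-memory SQLite database through Python's standard `sqlite3` module. SQLite is only a stand-in for Hive, and the `emp` rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text, deptno integer, sal real)")
# Hypothetical employees: (name, department, salary)
conn.executemany(
    "insert into emp values (?, ?, ?)",
    [("A", 10, 2500.0), ("B", 10, 1900.0),
     ("C", 20, 3000.0), ("D", 20, 800.0),
     ("E", 30, 2800.0)],
)

# WHERE removes individual rows before the groups are formed
where_rows = conn.execute(
    "select deptno, avg(sal) from emp where sal > 2000 "
    "group by deptno order by deptno"
).fetchall()

# HAVING removes whole groups after avg(sal) has been computed
having_rows = conn.execute(
    "select deptno, avg(sal) from emp "
    "group by deptno having avg(sal) > 2000 order by deptno"
).fetchall()

print(where_rows)
print(having_rows)
```

Department 20 survives the WHERE version because one of its rows exceeds 2000, but it is dropped by the HAVING version because its average (1900) does not.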

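Join statements

Hive supports the usual SQL JOIN statements. In older releases, including the 1.2.1 version used in this note, only equi-joins are supported; non-equi join conditions are accepted starting with Hive 2.2.0.

-- Match employees and departments on department number, and query the employee number, employee name, and department name
select e.empno, e.ename, d.deptno, d.dname from emp e join dept d on e.deptno = d.deptno;

Table aliases
Aliases simplify queries;
prefixing columns with the table name can improve execution efficiency.

-- Join the employee table and the department table
select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno = d.deptno;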

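Multi-table joins
In most cases Hive launches one MapReduce job for each pair of joined tables. In the example below, a first MapReduce job joins tables e and d, and a second MapReduce job then joins the output of the first job with table l.
Note: why are tables d and l not joined first? Because Hive always executes joins from left to right.

SELECT e.ename, d.deptno, l.loc_name
FROM   emp e
JOIN   dept d
ON     d.deptno = e.deptno
JOIN   location l
ON     d.loc = l.loc;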

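Sorting

Order By: global sort, using a single reducer
Sort By: sorts within each reducer
Distribute By: partitions rows across reducers, usually combined with Sort By
Cluster By: shorthand for Distribute By plus Sort By on the same column

--	Order By
select ename, sal*2 twosal from emp order by twosal;

--	Sort By
-- 1. Set the number of reducers
set mapreduce.job.reduces=3;
-- 2. Check the number of reducers
set mapreduce.job.reduces;
-- 3. View employee information sorted by employee number in descending order within each reducer
select * from emp sort by empno desc;
-- 4. Write the query result to files (sorted by department number in descending order)
insert overwrite local directory '/opt/module/datas/sortby-result' select * from emp sort by deptno desc;

--	Distribute By
--	Similar to the partitioner in MapReduce: it partitions the rows and is used together with sort by.
--	Hive requires the DISTRIBUTE BY clause to come before the SORT BY clause.
--	To test distribute by, be sure to allocate several reducers; otherwise its effect is not visible.
-- Partition by department number first, then sort by employee number in descending order.
set mapreduce.job.reduces=3;
insert overwrite local directory '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;
--	Cluster By
--	When distribute by and sort by use the same column, cluster by can be used instead.
--	cluster by combines the behavior of distribute by and sort by.
--	However, the sort is always ascending; ASC or DESC cannot be specified.
select * from emp distribute by deptno sort by deptno;
-- equivalent to
select * from emp cluster by deptno;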
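
The interplay of distribute by and sort by can be pictured with a small Python simulation. It is a sketch under assumptions: the `emp` rows are invented, `distribute_and_sort` is a hypothetical helper, and the reducer is chosen as `deptno % num_reducers`, which mimics hash partitioning of an integer key but is not guaranteed to match what a real cluster does.

```python
from collections import defaultdict

def distribute_and_sort(rows, num_reducers, distribute_key, sort_key, descending=False):
    """Send each row to a reducer based on distribute_key,
    then sort the rows inside each reducer by sort_key."""
    reducers = defaultdict(list)
    for row in rows:
        # Assumed partitioning rule: integer key modulo the reducer count
        reducers[row[distribute_key] % num_reducers].append(row)
    return {
        reducer: sorted(part, key=lambda r: r[sort_key], reverse=descending)
        for reducer, part in sorted(reducers.items())
    }

# Hypothetical employee rows
emp = [
    {"empno": 7369, "deptno": 20},
    {"empno": 7499, "deptno": 30},
    {"empno": 7782, "deptno": 10},
    {"empno": 7788, "deptno": 20},
    {"empno": 7839, "deptno": 10},
    {"empno": 7900, "deptno": 30},
]

# distribute by deptno sort by empno desc, with 3 reducers
result = distribute_and_sort(emp, 3, "deptno", "empno", descending=True)
for reducer, part in result.items():
    print(reducer, [r["empno"] for r in part])
```

Each reducer's output is ordered, but there is no ordering across reducers; that global guarantee is what order by adds at the cost of a single reducer.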

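Bucketing and sampling queries

Partitioning applies to the storage path of the data; bucketing applies to the data files.
Partitioning offers a convenient way to isolate data and optimize queries. However, not every data set can be partitioned sensibly, especially given the difficulty of choosing a suitable partition size.
Bucketing is another technique for breaking a data set into more manageable parts.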

-- Create a bucketed table
create table stu_buck(id int, name string)
clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';
-- Inspect the table structure
desc formatted stu_buck;
-- Load data into the bucketed table
load data local inpath '/opt/module/datas/student.txt' into table stu_buck;

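Note: the create statement above specifies 4 buckets, but load data copies the file as-is, so the table ends up with a single bucket file. Set the following properties first, then populate the table with an insert from a query on a regular table instead of load data.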

set hive.enforce.bucketing=true;
set mapreduce.job.reduces=-1;

For very large data sets, users sometimes need a representative sample of the results rather than all of them. Hive supports this by sampling a table.

select * from stu_buck tablesample(bucket 1 out of 4 on id);

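Syntax: TABLESAMPLE(BUCKET x OUT OF y [ON colname])

y must be a multiple or a factor of the table's total number of buckets. Hive decides the sampling ratio from y. For example, if the table has 4 buckets, y=2 samples (4/2=) 2 buckets of data, and y=8 samples (4/8=) 1/2 of a bucket.
x is the bucket to start from; if several buckets are needed, each subsequent bucket number is the current one plus y. For example, with 4 buckets in total, tablesample(bucket 1 out of 2) samples (4/2=) 2 buckets: bucket 1 (x) and bucket 3 (x+y).
x must be less than or equal to y.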
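
The bucket arithmetic described above can be written out as a short Python function. This is a sketch, not Hive's implementation: `sampled_buckets` is a hypothetical helper, and it only covers the case where y is a factor of the total bucket count (the fractional case, such as y=8 with 4 buckets, is left out).

```python
from typing import List

def sampled_buckets(x: int, y: int, total_buckets: int) -> List[int]:
    """Return the 1-based bucket numbers read by
    TABLESAMPLE(BUCKET x OUT OF y) when y divides total_buckets."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if total_buckets % y != 0:
        raise ValueError("this sketch only handles y being a factor of total_buckets")
    # Start at bucket x and step by y: x, x+y, x+2y, ...
    return list(range(x, total_buckets + 1, y))

print(sampled_buckets(1, 2, 4))  # the example from the text
print(sampled_buckets(1, 4, 4))  # the query on stu_buck
```

The number of buckets returned is always total_buckets / y, which is the sampling ratio the text refers to.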

5. CASE WHEN

-- Count the men and women in each department ('男' = male, '女' = female)
select
  dept_id,
  sum(case sex when '男' then 1 else 0 end) male_count,
  sum(case sex when '女' then 1 else 0 end) female_count
from
  emp_sex
group by
  dept_id;
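
The source does not show the contents of emp_sex, so the following sketch invents a few rows and runs the same conditional aggregation on an in-memory SQLite database via Python's `sqlite3` module. SQLite stands in for Hive here; the table contents and the `name` column are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp_sex (name text, dept_id text, sex text)")
# Hypothetical rows: three people in department A, three in department B
conn.executemany(
    "insert into emp_sex values (?, ?, ?)",
    [("p1", "A", "男"), ("p2", "A", "男"), ("p3", "B", "男"),
     ("p4", "A", "女"), ("p5", "B", "女"), ("p6", "B", "女")],
)

# Each CASE turns a row into 1 or 0, and SUM adds them up per department
rows = conn.execute(
    """
    select dept_id,
           sum(case sex when '男' then 1 else 0 end) as male_count,
           sum(case sex when '女' then 1 else 0 end) as female_count
    from emp_sex
    group by dept_id
    order by dept_id
    """
).fetchall()
print(rows)
```

The pattern turns a row-level condition into a per-group count without needing one query per category.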

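6. Rows to Columns

Related functions:

CONCAT(string A/col, string B/col…): returns the concatenation of the input strings; any number of input strings is supported.
CONCAT_WS(separator, str1, str2,…): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments, and it can be any string, just like the other arguments. If the separator is NULL, the result is NULL. The function skips any NULL values after the separator argument (empty strings are kept). The separator is inserted between the strings being joined.
COLLECT_SET(col): accepts only primitive data types; it de-duplicates and gathers the values of a column into a field of type array.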

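# Source data (白羊座 = Aries, 射手座 = Sagittarius):
孙悟空	白羊座	A
大海	射手座	A
宋宋	白羊座	B
猪八戒	白羊座	A
凤姐	射手座	A

# Requirement:
# Group together the people who share the same constellation and blood type. Expected result:
射手座,A	大海|凤姐
白羊座,A	孙悟空|猪八戒
白羊座,B	宋宋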

-- Create the Hive table and load the data:
create table person_info(
name string,
constellation string,
blood_type string)
row format delimited fields terminated by "\t";
load data local inpath "/opt/module/datas/person_info.txt" into table person_info;

-- Query:
select
    t1.base,
    concat_ws('|', collect_set(t1.name)) name
from
    (select
        name,
        concat(constellation, ",", blood_type) base
    from
        person_info) t1
group by
    t1.base;
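
What the query does can be traced step by step in plain Python. This is an illustrative sketch: `concat_ws` is a hand-written stand-in for the Hive function, and the element order inside each group follows the input order here, whereas Hive's collect_set makes no ordering promise.

```python
from collections import defaultdict

# The rows of person_info: (name, constellation, blood_type)
person_info = [
    ("孙悟空", "白羊座", "A"),
    ("大海", "射手座", "A"),
    ("宋宋", "白羊座", "B"),
    ("猪八戒", "白羊座", "A"),
    ("凤姐", "射手座", "A"),
]

def concat_ws(separator, values):
    """Join the non-NULL values with the separator, like CONCAT_WS."""
    return separator.join(v for v in values if v is not None)

groups = defaultdict(list)
for name, constellation, blood_type in person_info:
    base = constellation + "," + blood_type  # concat(constellation, ",", blood_type)
    if name not in groups[base]:             # collect_set keeps each value once
        groups[base].append(name)

# group by base, then concat_ws('|', collect_set(name))
result = {base: concat_ws("|", names) for base, names in groups.items()}
for base, names in result.items():
    print(base, names)
```

The inner subquery builds the grouping key, and the outer query collapses each group's names into one delimited string.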

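7. Columns to Rows

Related functions:

EXPLODE(col): splits a complex array or map column in Hive into multiple rows.
LATERAL VIEW
Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias
Explanation: used together with UDTFs such as explode (often applied to the output of split); it expands one column into multiple rows, and the expanded rows can then be aggregated.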

# Source data:
《疑犯追踪》	悬疑,动作,科幻,剧情
《Lie to me》	悬疑,警匪,动作,心理,剧情
《战狼2》	战争,动作,灾难

# Requirement:
# Expand the array of categories for each movie into one row per category. Expected result:
《疑犯追踪》	悬疑
《疑犯追踪》	动作
《疑犯追踪》	科幻
《疑犯追踪》	剧情
《Lie to me》	悬疑
《Lie to me》	警匪
《Lie to me》	动作
《Lie to me》	心理
《Lie to me》	剧情
《战狼2》	战争
《战狼2》	动作
《战狼2》	灾难

-- Create the Hive table and load the data
create table movie_info(
    movie string, 
    category array<string>) 
row format delimited fields terminated by "\t"
collection items terminated by ",";

load data local inpath "/opt/module/datas/movie.txt" into table movie_info;

-- Query for the requirement:
select
    movie,
    category_name
from 
    movie_info lateral view explode(category) table_tmp as category_name;
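
The effect of lateral view explode can be mimicked with a Python generator. This is a sketch: `lateral_view_explode` is a hypothetical helper, and the comma split stands in for the `collection items terminated by ","` clause that turns the raw text into an array.

```python
# The rows of movie.txt: (movie, raw category field)
movie_info = [
    ("《疑犯追踪》", "悬疑,动作,科幻,剧情"),
    ("《Lie to me》", "悬疑,警匪,动作,心理,剧情"),
    ("《战狼2》", "战争,动作,灾难"),
]

def lateral_view_explode(rows):
    """For each input row, emit one output row per array element,
    repeating the other columns alongside it."""
    for movie, raw_categories in rows:
        categories = raw_categories.split(",")  # the array<string> column
        for category_name in categories:
            yield movie, category_name

exploded = list(lateral_view_explode(movie_info))
for movie, category_name in exploded:
    print(movie, category_name)
```

explode alone yields only the array elements; the lateral view part is what keeps the movie column attached to each generated row.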

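8. Window Functions

Related functions:

OVER(): specifies the size of the data window the analytic function works on; the window may change from row to row
CURRENT ROW: the current row
n PRECEDING: the n rows before the current row
n FOLLOWING: the n rows after the current row
UNBOUNDED: the boundary; UNBOUNDED PRECEDING means from the start of the partition, UNBOUNDED FOLLOWING means to the end of the partition
LAG(col,n): the value of col from the n-th row before the current row
LEAD(col,n): the value of col from the n-th row after the current row
NTILE(n): distributes the rows of an ordered partition into n groups, numbered from 1; for each row, NTILE returns the number of the group it belongs to. Note: n must be of type int.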

# Source data:
name,orderdate,cost
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

-- Create the Hive table and load the data
create table business(
name string, 
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

load data local inpath "/opt/module/datas/business.txt" into table business;

-- Queries for each requirement
-- 1) Customers who made a purchase in April 2017, and the total number of such customers
select name,count(*) over ()
from business
where substring(orderdate,1,7) = '2017-04'
group by name;

-- 2) Each customer's purchase details and the monthly purchase total
select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from business;
-- 3) Same scenario, but accumulate cost by date
select name,orderdate,cost,
sum(cost) over() as sample1, -- sum over all rows
sum(cost) over(partition by name) as sample2, -- group by name and sum within each group
sum(cost) over(partition by name order by orderdate) as sample3, -- group by name and accumulate within each group
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row) as sample4, -- same as sample3: aggregate from the start of the partition to the current row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, -- current row plus the previous row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING) as sample6, -- current row plus the previous and the next row
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING) as sample7 -- current row plus all following rows
from business;

-- 4) Each customer's previous purchase date
select name,orderdate,cost,
lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate) as time1,
lag(orderdate,2) over(partition by name order by orderdate) as time2
from business;

-- 5) Orders in the earliest 20% by time
select * from (
    select name,orderdate,cost, ntile(5) over(order by orderdate) sorted
    from business
) t
where sorted = 1;
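
The ROWS BETWEEN frames used in query 3 can be reproduced by hand for one customer. The sketch below takes jack's five orders from the source data and computes sample4 to sample7 in plain Python; `window_sums` is a hypothetical helper in which `None` plays the role of UNBOUNDED.

```python
def window_sums(values, preceding, following):
    """Sum over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.
    None means UNBOUNDED; 0 means the frame stops at the current row."""
    n = len(values)
    sums = []
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n if following is None else min(n, i + following + 1)
        sums.append(sum(values[start:end]))
    return sums

# jack's rows from the source data: partition by name
jack = [
    ("2017-01-01", 10),
    ("2017-02-03", 23),
    ("2017-01-05", 46),
    ("2017-04-06", 42),
    ("2017-01-08", 55),
]
# order by orderdate inside the partition
costs = [cost for _, cost in sorted(jack)]

sample4 = window_sums(costs, None, 0)  # unbounded preceding .. current row
sample5 = window_sums(costs, 1, 0)     # 1 preceding .. current row
sample6 = window_sums(costs, 1, 1)     # 1 preceding .. 1 following
sample7 = window_sums(costs, 0, None)  # current row .. unbounded following
print(costs, sample4, sample5, sample6, sample7)
```

Frames are clipped at the partition edges, which is why the first row of sample5 and the last row of sample6 sum fewer values than the rows in the middle.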

9. Rank

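Function descriptions:

RANK(): tied rows share the same rank and the following ranks are skipped, so the highest rank still equals the row count
DENSE_RANK(): tied rows share the same rank and no ranks are skipped, so the highest rank shrinks
ROW_NUMBER(): numbers the rows consecutively in sort order, ties included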

# Source data:
name	subject	score
孙悟空	语文	87
孙悟空	数学	95
孙悟空	英语	68
大海	语文	94
大海	数学	56
大海	英语	84
宋宋	语文	64
宋宋	数学	86
宋宋	英语	84
婷婷	语文	65
婷婷	数学	85
婷婷	英语	78

# Requirement:
# Compute the ranking of scores within each subject.

-- Create the Hive table and load the data
create table score(
name string,
subject string, 
score int) 
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/datas/score.txt' into table score;

-- Query for the requirement
select name,
subject,
score,
rank() over(partition by subject order by score desc) rp,
dense_rank() over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rmp
from score;

# Result:
name    subject score   rp      drp     rmp
孙悟空  数学    95      1       1       1
宋宋    数学    86      2       2       2
婷婷    数学    85      3       3       3
大海    数学    56      4       4       4
宋宋    英语    84      1       1       1
大海    英语    84      1       1       2
婷婷    英语    78      3       2       3
孙悟空  英语    68      4       3       4
大海    语文    94      1       1       1
孙悟空  语文    87      2       2       2
婷婷    语文    65      3       3       3
宋宋    语文    64      4       4       4
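
How the three columns diverge on ties can be checked with a few lines of Python. The sketch uses the 英语 (English) scores from the table above; `rank_columns` is a hypothetical helper, not a Hive API.

```python
def rank_columns(scores):
    """Return (score, rank, dense_rank, row_number) for scores
    ordered from highest to lowest, as in 'order by score desc'."""
    result = []
    rank = dense_rank = 0
    previous = None
    for row_number, score in enumerate(sorted(scores, reverse=True), start=1):
        if score != previous:
            rank = row_number   # RANK jumps to the row position after a tie
            dense_rank += 1     # DENSE_RANK advances by exactly one
            previous = score
        result.append((score, rank, dense_rank, row_number))
    return result

english_scores = [68, 84, 84, 78]  # the 英语 partition from the source data
ranked = rank_columns(english_scores)
for row in ranked:
    print(row)
```

The two scores of 84 share rank 1 in both RANK and DENSE_RANK; the next score gets 3 under RANK and 2 under DENSE_RANK, while ROW_NUMBER simply keeps counting.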

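10. User-Defined Functions

User-defined functions fall into three categories:

UDF (User-Defined Function)
one row in, one row out
UDAF (User-Defined Aggregation Function)
aggregate function, many rows in, one row out; similar to count/max/min
UDTF (User-Defined Table-Generating Function)
one row in, many rows out, e.g. lateral view explode()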
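
The three shapes can be contrasted with plain Python functions. This is only a conceptual sketch of the input/output cardinality; the function names are hypothetical and nothing here uses Hive's actual Java interfaces.

```python
from typing import Iterable, Iterator, Optional

def udf_lower(value: Optional[str]) -> Optional[str]:
    """UDF shape: one value in, one value out."""
    return None if value is None else value.lower()

def udaf_max(values: Iterable[Optional[int]]) -> Optional[int]:
    """UDAF shape: many values in, one value out (like max); NULLs are ignored."""
    result = None
    for v in values:
        if v is not None and (result is None or v > result):
            result = v
    return result

def udtf_explode(values: Iterable[str]) -> Iterator[str]:
    """UDTF shape: one collection value in, many rows out (like explode)."""
    for v in values:
        yield v

print(udf_lower("SMITH"))
print(udaf_max([3, None, 7, 5]))
print(list(udtf_explode(["a", "b", "c"])))
```

In Hive the same distinction decides where a function may appear: a UDF in any expression, a UDAF together with group by or a window, and a UDTF typically behind lateral view.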

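Programming steps:

1) Extend org.apache.hadoop.hive.ql.exec.UDF
2) Implement the evaluate function; evaluate can be overloaded
3) Create the function in the Hive command-line window
a) Add the jar
add jar linux_jar_path
b) Create the function
create [temporary] function [dbname.]function_name AS class_name;
4) Drop the function in the Hive command-line window
Drop [temporary] function [if exists] [dbname.]function_name;

Note: a UDF must have a return type; it may return null, but the return type cannot be void.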

1. Create a Maven project named Hive
2. Add the dependency

<dependencies>
		<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
		<dependency>
			<groupId>org.apache.hive</groupId>
			<artifactId>hive-exec</artifactId>
			<version>1.2.1</version>
		</dependency>
</dependencies>

3. Create a class

// A UDF that converts its input string to lower case
package com.atguigu.hive;
import org.apache.hadoop.hive.ql.exec.UDF;

public class Lower extends UDF {
	public String evaluate (final String s) {	
		if (s == null) {
			return null;
		}		
		return s.toLowerCase();
	}
}

4. Package it as a jar and upload it to the server at /opt/module/jars/udf.jar
5. Add the jar to Hive's classpath

add jar /opt/module/jars/udf.jar;

6. Create a temporary function bound to the compiled Java class

create temporary function mylower as "com.atguigu.hive.Lower";

7. The custom function can now be used in HQL

select ename, mylower(ename) lowername from emp;