Hive高阶语句（Load、Insert、排序、聚合运算）

最新推荐文章于 2024-05-14 15:23:36 发布

小五家的二哈

最新推荐文章于 2024-05-14 15:23:36 发布

阅读量1.3k

点赞数 1

文章标签： hive 大数据数据仓库

本文链接：https://blog.csdn.net/qq_43012693/article/details/108673769

版权

我们在前面的文章中介绍了Hive的基础语法和高阶查询，本篇将继续带大家学习Hive的高阶语句。

装载数据：Load

我们在前面的文章中也有使用过load的案例，这里再详细给大家介绍一下。Load用于在Hive中移动数据，可以实现数据的导入，上传。
语法格式：
load data [local] inpath ‘路径’ insert [overwrite] into table 表名 [partition()]
local表示文件地址在本地，如果不加则表名是从HDFS上进行传输。overwrite表示覆盖现有数据，如果不加这是追加。
我们以下面的employee数据为例，分别是姓名、地区、个人信息、科目得分、工种

Michael|Montreal,Toronto|Male,30|DB:80|Product:DeveloperLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead Shelley|New
York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89|Sales:Lead

create table employee(
name string,
address array<string>,
personnalInfo array<string>,
technol map<string,int>,
jobs map<string,string>)
row format delimited
fields terminated by '|' 
collection items terminated by ',' 
map keys terminated by ':' 
lines terminated by '\n';

load data local inpath '/opt/employee.txt'
overwrite into table employee;

用select查看结果
在这里插入图片描述
在用追加的方式：

load data local inpath '/opt/employee.txt'
into table employee;

数据在原有的基础上又增加了四行
在这里插入图片描述
load也支持分区时插入，这里就不再演示。

insert插入

当我们想实现把一个表中的数据直接插入到另一张表的时候，就可以用的本节所介绍的内容，当然，两个表对应的结构要一致。insert插入也支持overwrite和into。insert的用法还是挺广泛的，下面我们一一来介绍。

多插入

格式：
from 调用数据表的表名
insert overwrite/into [table]插入的表1 select *
insert overwrite/into [table]插入的表2 select *
…
需要注意的是，当我们使用overwrite时，table不能省略，使用into时可以省略。
我们还是以上面的数据为例。这里我直接复制了employee表，建立了两个新表firstEmp和secondEmp.然后把employee表中的数据导入到这两个表中：

from employee
insert overwrite table firstEmp select *
insert into secondEMp select *;

查询结果：
在这里插入图片描述

插入到分区

格式：
*from 调用数据表的表名
insert overwrite/into [table]插入的表 partition(分区名 [类型])
select ,‘分区名’;
我们先对上面的数据做个分区表:

create table employee_partition(
name string,
address array<string>,
personnalInfo array<string>,
technol map<string,int>,
jobs map<string,string>)
partitioned by (year string,month int)
row format delimited
fields terminated by '|' 
collection items terminated by ',' 
map keys terminated by ':' 
lines terminated by '\n';

开启分区：

#开启动态分区，默认是false
set hive.exec.dynamic.partition=true;
#开启允许所有分区都是动态的，否则必须要有静态分区才能使用。
set hive.exec.dynamic.partition.mode=nonstrict;

把数据插入到分区：

from employee
insert overwrite table employee_partition partition(year,month)
select * ,'2020',9;

插入或导出到文件

文件插入只支持overwrite.
把表格中的数据插入到本地文件：
格式：
*from 表名 insert overwrite local directory ‘本地路径’ select ;

from employee
insert overwrite local directory '/opt/b.txt'select *;

执行完成后，会在本地创建一个文件夹，数据内容就在这个文件夹下的文件里。即使最后路径命名为.txt也不会自动创建一个文件：
在这里插入图片描述
数据上传到HDFS：
格式：
*from 表名 insert overwrite directory ‘HDFS路径’ select ;

from employee
insert overwrite directory '/data/outer'select *;

执行结果：
在这里插入图片描述
导出到文件：
*from 表名1 insert overwrite table 表名2 select ;

from employee
insert overwrite table newEmp select *;

插入完成后查看表信息：
在这里插入图片描述
指定格式插入数据：

insert overwrite directory '/data/outer2'
row format delimited
fields terminated by ','
select * from employee;

执行结果：
在这里插入图片描述

数据交换

数据交换包括Import数据导入和export数据导出，常用于数据迁移，除了数据库，可导入导出所有数据和元数据。
导出数据：
数据是导出到HDFS上

export table employee to '/data/emp';

导出数据后，会在目录下生产一个_metadata文件和data文件夹，而我们copy的数据就在data文件夹下。
在这里插入图片描述
导入数据：
导入数据就是把我们导入到HDFS上的数据导入到一张新表中，文件新表再导入前是不存在的，无法从本地导入数据到一张表中：

import table e from '/data/emp';

查询结果：
在这里插入图片描述

数据交换也可以实现分区的导入和导出，例如：

export table employee_partition partition(year='2021',month=9) to '/data/partition';

数据排序

Hive的数据排序主要有四种，order by、sort by 、distribute by和cluster by。
order by

Hive中的order by类似于标准SQL，对全局进行排序。我们以下列数据为例，这是一个employye_hr表：

McGrirl|1|945-639-8596|2011-11-24 Gabriela
Feldheim|2|706-232-4166|2017-12-16 Billy
O’Driscoll|3|660-841-7326|2017-02-17 Kevina
Rawet|4|955-643-0317|2012-01-05 Patty
Entreis|5|571-792-2285|2013-06-11 Claudetta
Sanderson|6|350-766-4559|2016-11-04 Bentley
Oddie|7|446-519-0975|2016-05-02 Theressa
Dowker|8|864-330-9976|2012-09-26 Jenica
Belcham|9|347-248-4379|2011-05-02 Reube
Preskett|10|918-740-2357|2015-03-26 Mary
Skeldon|11|361-159-8710|2016-03-09 Ethelred
Divisek|12|995-145-7392|2016-10-18 Field
McGraith|13|149-133-9607|2015-10-06 Andeee
Wiskar|14|315-207-5431|2012-05-10 Lloyd
Nayshe|15|366-495-5398|2014-06-28 Mike
Luipold|16|692-803-9373|2011-05-14 Tallie
Swaine|17|570-709-6561|2011-08-06 Worth
Ledbetter|18|905-586-2348|2012-09-25 Reine
Leyborne|19|322-644-5798|2015-01-05 Norby
Bellson|20|736-881-5785|2012-12-31 Nellie
Jewar|21|551-505-3957|2017-06-18 Hoebart
Deeth|22|780-240-0213|2011-09-19

创建表：

create  table emp_hr(
name string,
id int,
iphone string,
date string)
partitioned by (year int,month int)
row format delimited
fields terminated by '|'
lines terminated by '\n';
stored as textfile
location '/tmp/data/employee';

我们以id列做排序，order by 默认是正序排序，如果想要倒叙需要在order by id后加desc，即 order by id desc;

select * from employee_hr order by id;

排序结果：
在这里插入图片描述
order by的特点是只使用一个Reduce执行全局数据排序，它还支持使用case when或表达式，支持按位置编号排序。
例如我们想把id为15的人排在第一位，其他不变就可以用下列代码实现：

set hive.groupby.orderby.position.alias=true;

select * from employee_hr order by case when id=15 then 1 else 0 end desc;

按照倒叙排序，把id为15的人放第一位
在这里插入图片描述
因为order by是对全局进行排序，所以它的速度较慢，在做排序前，应提前做好数据过滤。
sort by
sort by和order by最大的不同在于，它是局部排序，而且排序列必须在select 的列表中。

select name,id,iphone from employee_hr sort by id;

在这里插入图片描述
如果select的列表中不包含id则会报错：

distribute by
distribute by类似于SQL中的group by
不过group by 可以使用聚合函数，distribute by是按照一定的分组把数据分到不同的reduce中。我们直到，当数据进入到Reduce端的时候是有顺序的，但是ditribute by本身并不会对各个Reduce的输出进行排序所以其一般和sort by结合使用，用在sort by前。
我们用下面的数据来做个演示：

id course score
2016001 PE 91
2016001 YY 60
2016001 SX 88
2016001 YW 91
2016002 SX 77
2016002 YW 33

select * from score distribute by id sort by scores ;

在这里插入图片描述

cluster by
cluster by=distribyte by+sort by。

select * from score cluster by scores;

在这里插入图片描述
可以看到执行结果和用cluster by、sort by的结果一样。
注意：cluster by 不支持desc/asc，为了充分利用所有的Reducer来执行全局程序，可以先使用cluster by，然后使用order by.
区别

sort by ：局部排序，在进入Reduce前完成排序，只在单节点上排序
order by：全局排序，只有一个reduce；
distribute by:按照分组把数据分到不同的reduce中，通常使用的sort by语句前；
cluster by:除了distribute by的功能外，还加入了sort by局部排序，分组的同时排序；

聚合运算

Group by

group by用于分组，常与聚合函数一起使用，如max,min,count,sum等。如果没有指定group by子句则默认聚合整个表。除聚合函数外，所选的其他列也必须包含在group by中。同样，group by也支持case when语句和按位置编号分组。
还是以上面的数据为例：

select id,count(1) from score group by id;

以id为分组，就可以查出每个id各有多少条数据
在这里插入图片描述
使用case when表达式：

select case when id='2016001' then 1 else 0 end as ids,max(scores)
from score group by id;

把id为2016001的表示为1，其他的为2，求出他们各自所得分数的最大值：
在这里插入图片描述
Having
Having 的作用是对聚合的结果条件过滤。它和where不同。where是在语句执行完后，对最终的结果进行条件筛选。
以上面的数据为列，我们要想实现筛选出考试科目大于2门的学生id：

select id,count(1)cnt from score group by id having cnt>2;

在查询完数据后，筛选出数据条数大于2的。
在这里插入图片描述
高级聚合
Grouping sets 实现对同一数据集进行多重group by操作，其本质是多个group by 进行union all 操作：

在这里插入图片描述
例如：

select id,course,sum(scores) from score
group by id,course
grouping sets((id,course),id);

结果为按id和科目分组，计算出各个科目的总分，再加上以id分组的总分数。
在这里插入图片描述

小五家的二哈

关注

1
点赞
踩
5

收藏

觉得还不错? 一键收藏
0
评论
Hive高阶语句（Load、Insert、排序、聚合运算）

我们在前面的文章中介绍了Hive的基础语法和高阶查询，本篇将继续带大家学习Hive的高阶语句。目录一级目录二级目录三级目录装载数据：Loadinsert插入多插入插入到分区插入或导出到文件数据交换一级目录二级目录三级目录装载数据：Load我们在前面的文章中也有使用过load的案例，这里再详细给大家介绍一下。Load用于在Hive中移动数据，可以实现数据的导入，上传。语法格式：load data [local] inpath ‘路径’ insert [overwrite] into table
复制链接

扫一扫