大数据测试-Hive DML语句与函数使用2-CSDN博客

本文链接：https://blog.csdn.net/qq_42114233/article/details/130654659

一、Hive SQL DML语法之加载数据

Hive SQL-DML-Load加载数据

回顾

在Hive中建表成功之后，就会在HDFS上创建一个与之对应的文件夹，且文件夹名字就是表名；

文件夹父路径是由参数hive.metastore.warehouse.dir控制，默认值是/user/hive/warehouse；
不管路径在哪里，只有把数据文件移动到对应的表文件夹下面，Hive才能映射解析成功;
最原始暴力的方式就是使用hadoop fs –put|-mv等方式直接将数据移动到表文件夹下；
但是，Hive官方推荐使用Load命令将数据加载到表中

Load语法功能

Load英文单词的含义为：加载、装载；
所谓加载是指：将数据文件移动到与Hive表对应的位置，移动时是纯复制、移动操作。
纯复制、移动指在数据load加载到表中时，Hive不会对表中的数据内容进行任何转换，任何操作。

Load语法规则

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename;

语法规则之filepath

filepath表示待移动数据的路径。可以指向文件（在这种情况下，Hive将文件移动到表中），也可以指向目录（在这种情况下，Hive将把该目录中的所有文件移动到表中）。
filepath文件路径支持下面三种形式，要结合LOCAL关键字一起考虑：

1. 相对路径，例如：project/data1

2. 绝对路径，例如：/user/hive/project/data1

3. 具有schema的完整URI ，例如：hdfs://namenode:9000/user/hive/project/data1

语法规则之LOCAL

指定 LOCAL ，将在本地文件系统中查找文件路径。

若指定相对路径，将相对于用户的当前工作目录进行解释；
用户也可以为本地文件指定完整的URI-例如：file:///user/hive/project/data1。

没有指定 LOCAL 关键字。

如果filepath指向的是一个完整的URI，会直接使用这个URI；
如果没有指定schema，Hive会使用在hadoop配置文件中参数fs.default.name指定的（不出意外，都是HDFS）

LOCAL 本地是哪里？

如果对HiveServer2服务运行此命令

本地文件系统 指的是 Hiveserver2服务所在机器的本地Linux文件系统，不是Hive客户端所在的本地文件系统。

练习

Load Data From Local FS or HDFS

1、练习Load Data From Local FS

2、练习Load Data From HDFS

3、理解Local关键字的含义

Step1: 建表

--创建库
create database if not exists ods_test comment 'test db' with dbproperties ('createdBy'='yasar');
--使用库
use ods_test;
--step1:建表
--建表student_local 用于演示从本地加载数据
create table if not exists student_local(num int COMMENT '分数',name string COMMENT '姓名',sex string COMMENT '性别',age int COMMENT '年龄',dept string COMMENT '部门') ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
--建表student_HDFS 用于演示从HDFS加载数据
create external table if not exists student_HDFS(num int COMMENT '分数',name string COMMENT '姓名',sex string COMMENT '性别',age int COMMENT '年龄',dept string COMMENT '部门') ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Step2:load 加载数据

--建议使用beeline客户端 可以显示出加载过程日志信息
--step2:加载数据
-- 从本地加载数据 数据位于HS2（node1）本地文件系统 本质是hadoop fs -put上传操作
load data local inpath '/export/tmp/students.txt' into table ods_test.student_local;
--从HDFS加载数据 数据位于HDFS文件系统根目录下 本质是hadoop fs -mv 移动操作
--先把数据上传到HDFS上 hadoop fs -put /root/hivedata/students.txt /
load data inpath '/test_data/students.txt' into table ods_test.student_HDFS;

Hive SQL-DML-Insert插入数据

Insert语法功能

Hive官方推荐加载数据的方式：清洗数据成为结构化文件，再使用Load语法加载数据到表中。这样的效率更高。
也可以使用insert语法把数据插入到指定的表中，最常用的配合是把查询返回的结果插入到另一张表中。

insert+select

insert+select表示：将后面查询返回的结果作为内容插入到指定表中。

1. 需要保证查询结果列的数目和需要插入数据表格的列数目一致。

2. 如果查询出来的数据类型和插入表格对应的列数据类型不一致，将会进行转换，但是不能保证转换一定成功，转换失败的数据将会为NULL。

INSERT INTO TABLE tablename select statement1 FROM from_statement;


--step1:创建一张源表student
drop table if exists ods_test.student;
create table ods_test.student(num int,name string,sex string,age int,dept string) row format delimited fields terminated by ',';
--加载数据
load data inpath '/test_data/students.txt' into table ods_test.student;
--step2：创建一张目标表 只有两个字段
create table ods_test.student_info(sno int,sname string);
--使用insert+select插入数据到新表中
insert into table ods_test.student_info select num,name from ods_test.student;
select * from ods_test.student_info;

二、Hive SQL DML语法之查询数据

Select语法树

从哪里查询取决于FROM关键字后面的table_reference，这是我们写查询SQL的首先要确定的事即你查询谁？
表名和列名不区分大小写。

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[LIMIT [offset,] rows]；

（1）select_expr

select_expr表示检索查询返回的列，必须至少有一个select_expr。

--1、select_expr
--查询所有字段或者指定字段
select * from ods_test.student_local;
select name from ods_test.student_local;
--查询当前数据库
select current_database(); --省去from关键字

（2）ALL 、DISTINCT

用于指定查询返回结果中重复的行如何处理。

1. 如果没有给出这些选项，则默认值为ALL（返回所有匹配的行）。

2. DISTINCT指定从结果集中删除重复的行。

--返回所有匹配的行
select name from ods_test.student_local;
--相当于
select all name from ods_test.student_local;
--返回所有匹配的行 去除重复的结果
select distinct name from ods_test.student_local;
--多个字段distinct 整体去重
select distinct age,name from ods_test.student_local;

（3）WHERE

WHERE后面是一个布尔表达式（结果要么为true，要么为false），用于查询过滤，当布尔表达式为true时，返回 select后面expr表达式的结果，否则返回空。
在WHERE表达式中，可以使用Hive支持的任何函数和运算符，但聚合函数除外。

--3、WHERE CAUSE
select * from ods_test.student_local where 1 > 2; -- 1 > 2 返回false
select * from ods_test.student_local where 1 = 1; -- 1 = 1 返回true
--找出来自于California州的疫情数据
select * from ods_test.student_local where name = "李勇";
--where条件中使用函数 找出州名字母长度超过10位的有哪些
select * from ods_test.student_local where length(name) >1 ;
--注意：where条件中不能使用聚合函数
-- --报错 SemanticException:Not yet supported place for UDAF ‘sum'
--聚合函数要使用它的前提是结果集已经确定。
--而where子句还处于“确定”结果集的过程中，因而不能使用聚合函数。
select name,sum(age) from ods_test.student_local where sum(age) >100 group by state;
--可以使用Having实现
select name,sum(age) from ods_test.student_local group by name having sum(age) > 20;

比较运算、逻辑运算

特殊条件（空值判断、between、in）

聚合操作

SQL中拥有很多可用于计数和计算的内建函数，其使用的语法是：SELECT function(列) FROM 表。
这里我们要介绍的叫做聚合（Aggregate）操作函数，如：Count、Sum、Max、Min、Avg等函数.

聚合函数的最大特点是不管原始数据有多少行记录，经过聚合操作只返回一条数据，这一条数据就是聚合的结果。

常见的聚合操作函数

AVG(column)	返回某列的平均值
COUNT(column)	返回某列的行数（不包括 NULL 值）
COUNT(*)	返回被选行数
MAX(column)	返回某列的最高值
MIN(column)	返回某列的最低值
SUM(column)	返回某列的总和

--4、聚合操作
--统计总共有多少学生
select count(*) from  ods_test.student_local;
-- 统计年龄20岁的学生总数
select count(*) from  ods_test.student_local where age = 20;
--所有男生的年龄总数
select sum(age) from  ods_test.student_local where sex = "男";
--年龄最大
select max(age) from  ods_test.student_local;
--最小年龄
select min(age) from  ods_test.student_local;
--平均年龄
select avg(age) from  ods_test.student_local;

（5）GROUP BY概念

GROUP BY语句用于结合聚合函数， 根据一个或多个列对结果集进行分组 ；

如果没有group by语法，则表中的所有行数据当成一组。

出现在GROUP BY中select_expr的字段： 要么是 GROUP BY 分组的字段 ； 要么是被聚合函数应用的字段 。

原因：避免出现一个字段多个值的歧义。

1. 分组字段出现select_expr中，一定没有歧义，因为就是基于该字段分组的，同一组中必相同；

2. 被聚合函数应用的字段，也没歧义，因为聚合函数的本质就是多进一出，最终返回一个结果。

（6）HAVING

在SQL中增加HAVING子句原因是，WHERE关键字无法与聚合函数一起使用。
HAVING子句可以让我们筛选分组后的各组数据,并且可以在Having中使用聚合函数，因为此时where，group by 已经执行结束，结果集已经确定。

HAVING与WHERE区别

having是在分组后对数据进行过滤
where是在分组前对数据进行过滤
having后面可以使用聚合函数
where后面不可以使用聚合函数

（7）ORDER BY

ORDER BY 语句用于 根据指定的列对结果集进行排序 。

ORDER BY 语句默认按照升序（ ASC ）对记录进行排序。如果您希望按照降序对记录进行排序，可以使用 DESC 关键字

（8）LIMIT

LIMIT用于限制SELECT语句返回的行数。
LIMIT接受一个或两个数字参数，这两个参数都必须是非负整数常量。
第一个参数指定要返回的第一行的偏移量（从 Hive 2.0.0开始），第二个参数指定要返回的最大行数。当给出单个参数时，它代表最大行数，并且偏移量默认为0。

二、Hive SQL Join关联查询

Hive Join语法规则

在Hive中，使用最多，最重要的两种join分别是：
inner join（内连接）、left join（左连接）

table_reference：是join查询中使用的表名。

table_factor：与table_reference相同,是联接查询中使用的表名。

j oin_condition：join查询关联的条件，如果在两个以上的表上需要连接，则使用AND关键字

join_table:
table_reference [INNER] JOIN table_factor [join_condition]
| table_reference {LEFT} [OUTER] JOIN table_reference join_condition
join_condition:
ON expression

join查询数据环境准备

为了更好的练习、学习掌握Hive中的join语法，下面我们去创建3张表并且加载数据到表中。

表1：employee 员工表；

表2：employee_address 员工住址信息表；

表3：employee_connection 员工联系方式表；

--table1: 员工表
CREATE TABLE employee(
id int,
name string,
deg string,
salary int,
dept string
) row format delimited
fields terminated by ',';
--table2:员工家庭住址信息表
CREATE TABLE employee_address (
id int,
hno string,
street string,
city string
) row format delimited
fields terminated by ',';
--table3:员工联系方式信息表
CREATE TABLE employee_connection (
id int,
phno string,
email string
) row format delimited
fields terminated by ',';

join查询数据环境准备

--加载数据到表中
load data local inpath '/export/tmp/employee.txt' into table employee;
load data local inpath '/export/tmp/employee_address.txt' into table employee_address;
load data local inpath '/export/tmp/employee_connection.txt' into table employee_connection;

inner join 内连接

内连接是最常见的一种连接，它也被称为普通连接，其中inner可以省略：inner join == join ；
只有进行连接的两个表中都存在与连接条件相匹配的数据才会被留下来。

inner join 内连接

--1、inner join
select e.id,e.name,e_a.city,e_a.street
from employee e inner join employee_address e_a
on e.id =e_a.id;
--等价于 inner join=join
select e.id,e.name,e_a.city,e_a.street
from employee e join employee_address e_a
on e.id =e_a.id;
--等价于 隐式连接表示法
select e.id,e.name,e_a.city,e_a.street
from employee e , employee_address e_a
where e.id =e_a.id;

left join 左连接

left join中文叫做是左外连接(Left Outer Join)或者左连接，其中outer可以省略，left outer join是早期的写法。
left join的核心就在于left左。左指的是join关键字左边的表，简称左表。
通俗解释：join时以左表的全部数据为准，右边与之关联；左表数据全部返回，右表关联上的显示返回，关联不上的显示null返回。

--2、left join
select e.id,e.name,e_conn.phno,e_conn.email
from employee e left join employee_connection e_conn
on e.id =e_conn.id;
--等价于 left outer join
select e.id,e.name,e_conn.phno,e_conn.email
from employee e left outer join employee_connection e_conn
on e.id =e_conn.id;

三、 Hive 常用函数入门

Hive 函数概述及分类标准

概述

Hive内建了不少函数，用于满足用户不同使用需求，提高SQL编写效率：

1. 使用 show functions 查看当下可用的所有函数；

2. 通过 describe function extended funcname 来查看函数的使用方式。

分类标准

Hive的函数分为两大类： 内置函数 （Built-in Functions）、 用户定义函数 UDF （User-Defined Functions）：

内置函数可分为：数值类型函数、日期类型函数、字符串类型函数、集合函数、条件函数等；
用户定义函数根据输入输出的行数可分为3类：UDF、UDAF、UDTF。

用户定义函数UDF分类标准

根据函数输入输出的行数：

UDF （User-Defined-Function）普通函数，一进一出

UDAF （User-Defined Aggregation Function）聚合函数，多进一出

UDTF （User-Defined Table-Generating Functions）表生成函数，一进多出

UDF分类标准扩大化

1、UDF分类标准本来针对的是用户自己编写开发实现的函数。 UDF分类标准可以扩大到Hive的所有函数中：包括内置函数和用户自定义函数。

因为不管是什么类型的函数，一定满足于输入输出的要求，那么从输入几行和输出几行上来划分没有任何问题。千万不要被UD（User-Defined）这两个字母所迷惑，照成视野的狭隘。

2、比如Hive官方文档中，针对聚合函数的标准就是内置的UDAF类型

Hive 常用的内置函数

内置函数（build-in）指的是Hive开发实现好，直接可以使用的函数,也叫做内建函数。
内置函数根据应用归类整体可以分为8大种类型，我们将对其中重要的，使用频率高的函数使用进行详细讲解

（1）String Functions 字符串函数

•字符串长度函数：length

•字符串反转函数：reverse

•字符串连接函数：concat

•带分隔符字符串连接函数：concat_ws

•字符串截取函数：substr,substring

------------String Functions 字符串函数------------
--长度
select length("yasar");
--倒序
select reverse("yasar");
连接字符串
select concat("angela","baby");
--带分隔符字符串连接函数：concat_ws(separator, [string | array(string)]+)
select concat_ws('.', 'www', array('baidu', 'com'));
select concat_ws('.', 'www', 'baidu', 'com');

--字符串截取函数：substr(str, pos[, len]) 或者 substring(str, pos[, len])
select substr("angelababy",-2); --pos是从1开始的索引，如果为负数则倒着数
select substr("angelababy",2,2);
--分割字符串函数: split(str, regex)
select split('apache hive', ' ');

（2）Date Functions 日期函数

----------- Date Functions 日期函数 -----------------
--获取当前日期: current_date
select current_date();
--获取当前UNIX时间戳函数: unix_timestamp
select unix_timestamp();
--日期转UNIX时间戳函数: unix_timestamp
select unix_timestamp("2011-12-07 13:01:03");
--指定格式日期转UNIX时间戳函数: unix_timestamp
select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss');
--UNIX时间戳转日期函数: from_unixtime
select from_unixtime(1618238391);
select from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');
--日期比较函数: datediff 日期格式要求'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'
select datediff('2012-12-08','2012-05-09');
--日期增加函数: date_add
select date_add('2012-02-28',10);
--日期减少函数: date_sub
select date_sub('2012-01-1',10);

（3）Mathematical Functions 数学函数

----Mathematical Functions 数学函数-------------
--取整函数: round 返回double类型的整数值部分 （遵循四舍五入）
select round(3.1415926);
--指定精度取整函数: round(double a, int d) 返回指定精度d的double类型
select round(3.1415926,4);
--取随机数函数: rand 每次执行都不一样 返回一个0到1范围内的随机数
select rand();
--指定种子取随机数函数: rand(int seed) 得到一个稳定的随机数序列
select rand(3)

（4）Conditional Functions 条件函数

主要用于条件判断、逻辑判断转换这样的场合

-----Conditional Functions 条件函数------------------
--使用之前课程创建好的student表数据
select * from student limit 3;
--if条件判断: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
select if(1=2,100,200);
select if(sex ='男','M','W') from student limit 3;
--空值转换函数: nvl(T value, T default_value)
select nvl("allen","yasar");
select nvl(null,"yasar");
--条件转换函数: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end;
select case sex when '男' then 'male' else 'female' end from student limit 3;

四、Sampling采样

# 随机桶抽取， 分配桶是有规则的
# 可以按照列的hash取模分桶
# 按照完全随机分桶
-- 其它条件不变的话，每一次运行结果一致
select username, orderId, totalmoney FROM itheima.orders
tablesample(bucket 3 out of 10 on username);
-- 完全随机，每一次运行结果不同
select * from itheima.orders
tablesample(bucket 3 out of 10 on rand());
# 数据块抽取，按顺序抽取，每次条件不变，抽取结果不变
-- 抽取100条
select * from itheima.orders
tablesample(100 rows);
-- 取1%数据
select * from itheima.orders
tablesample(1 percent);
-- 取 1KB数据
select * from itheima.orders
tablesample(1K);