6 | 史上最全大数据笔记-Hive（下）

凸头到佬

于 2024-06-04 12:37:06 发布

阅读量923

点赞数 30

分类专栏：大数据技术学习笔记文章标签：大数据 hadoop hive 笔记

本文链接：https://blog.csdn.net/m0_61696517/article/details/139439508

版权

大数据技术学习笔记专栏收录该内容

6 篇文章 0 订阅

订阅专栏

第五章 Hive Shell技巧

5.1 只执行一次Hive命令

通过shell的参数 -e 可以执行一次就运行完的命令

[root@hadoop01 hive]#  hive -e "select * from cat"

小技巧2:可以通过外部命令快速查询某个变量值:

hive -S -e "set" |grep cli.print

-S 是静默模式,会省略到多余的输出

5.2 单独执行一个sql文件

通过参数-f +file文件名就可以,经常用在以后的sql文件单独执行,导入数据场景中

[root@hadoop01 hive]#  hive -f /path/cat.sql

小技巧:在Hive的客户端内，可以通过source命令来执行一个sql

5.3 执行Linux命令

在Hive的shell中加上前缀! 最后以分号;结尾,可以执行linux的命令

 hive>   !pwd ;

5.4 执行HDFS命令

用户可以在Hive的shell中执行HDFS的DFS命令,不用敲入前缀hdfs或者hadoop

 hive> dfs -ls /tmp

5.5 使用历史命令和自动补全

在Hive的Shell操作中可以使用上下箭头查看历史记录

如果忘记了命令可以用tab键进行补全

5.6 显示当前库

下面是通过配置文件hive-site.xml显示

 <property>
     <name>hive.cli.print.current.db</name>
     <value>false</value>
     <description>Whether to include the current database in the Hive prompt.</description>
 </property>

5.7 当前session里设置该参数

 hive> set hive.cli.print.current.db=true;

5.8 查看当前参数设置的值

小技巧1:可以在shell中输入set命令,可以看到hive已经设定好的参数

  hive> set hive.cli.print.current.db;

5.9 local模式

 set hive.exec.mode.local.auto=true； (建议打开) 
 hive.exec.mode.local.auto.inputbytes.max=134217728
 hive.exec.mode.local.auto.input.files.max=4

如果出现打开文件过多的错误：

 vim /etc/sectrity/limits.conf
 
 在下方添加：
 root soft nofile 65535
 root hard nofile 65535

第六章 Hive基本查询语法

6.1 基本使用规则

6.1.1 基本查询语句组成

 select ..
 from ..
         join [tableName] on ..
         where ..
         group by ..
         having ..
         order by ..
         sort by ..
         limit ..
 union | union all ...

6.1.2 执行顺序

 第一步: FROM <left_table>
 第二步: ON <join_condition>
 第三步: <join_type> JOIN <right_table>
 第四步: WHERE <where_condition>
 第五步: GROUP BY <group_by_list>
 第六步: HAVING <having_condition>
 第七步: SELECT
 第八步: DISTINCT <select_list>
 第九步: ORDER BY <order_by_condition>
 第十步: LIMIT <limit_number>
 
 标准sql语句的一些规则：
 -1. 列别名的使用，必须完全符合执行顺序，不能提前使用。（mysql除外）
 -2. 在分组查询时，select子句中只能含有分组字段和聚合函数，不能有其他普通字段。(mysql除外)

6.1.3 查询原则

 1. 尽量不使用子查询、尽量不使用in 或者not in (可以使用 [not] exists替代)
 2. 尽量避免join连接查询，但是通常避免不了
 3. 查询永远是小表驱动大表（小表作为驱动表）
   --注意：内连接时，默认是左表是驱动表，因此左表一定要是小表。
   --         外连接看需求而定。

6.2 常用子句回顾

6.2.1 where语句特点

 where后不能使用聚合函数，可以使用子查询，也可以是普通函数。
 条件可以是：
 1. 关系表达式： =, >,>=,<,<=,!=,<>
 2. 连接符号：  or，and, between .. and ..
 3. 模糊查询： like   
                          %:通配符
                          _:占位符
 4. [not] in
     >all(set)  >any();
 
 注意事项：在hive的where中如果使用了子查询作为条件，等号“=”不好使，需要使用[not] in.
 换句话说，即使子查询返回的是唯一的一个值，也是集合形式。

6.2.2 group by语句特点

 group by: 分组，通常和聚合函数搭配使用
 
 查询的字段要么出现在group by 后面，要么出现在聚合函数里面
 
 聚合函数：count(),sum(),max(),min(),avg()
 
 count的执行
 1. 执行效果上：
         - count(*)包括了所有的列，相当于行数，在统计结果的时候不会忽略null值
         - count(1)包括了所有列，用1代表行，在统计结果的时候也不会忽略null值
         - count(列名)只包括列名那一列，在统计结果时，会忽略null值
 
 2.执行效率上：
         - 列名为主键，count(列名)会比count(1)快
         - 列名不为主键，count(1)会比count(列名)快
         - 如果表中有多个列并且没有主键，count（1）的效率高于count(*)
         - 如果有主键count(主键)效率是最高的
         - 如果表中只有一个字段count(*)效率最高

6.2.3 having子句特点

 对分组以后的结果集进行过滤。可以使用聚合函数。

6.2.4 order by子句

 对查询的数据进行排序。
 desc 降序
 asc  升序
 
 语法：
 order by colName [desc|asc][,colName [desc|asc]]

6.2.5 limit语句特点

 limit ： 从结果集中取数据的条数
 将set hive.limit.optimize.enable=true 时，limit限制数据时就不会全盘扫描，而是根据限制的数量           进行抽样。 
 
 同时还有两个配置项需要注意：
 hive.limit.row.max.size        这个是控制最大的抽样数量
 hive.limit.optimize.limit.file 这个是抽样的最大文件数量
 
 
 注意：limit 在mysql中 可以有两个参数 limit [m,] n
             在hive中，只能有一个参数 limit n;  查询前n条。
             一般情况下，在使用limit时，都会先order by排序。

6.2.6 union | union all

 union all：将两个或者多个查询的结果集合并到一起。不去重
 union：将两个或者多个查询的结果集合并到一起，去重合并后的数据并排序
 union语句字段的个数要求相同，字段的顺序要求相同。

6.3 join连接

6.3.1 join知识点回顾

有的业务所需要的数据，不是在一张表中，通常会存在多张表中，而这些表中通常应该会存在"有关系"的字段。多表查询时，使用关联字段"连接"（join）在一起，组合成一个新的数据集，就是连接查询。

连接查询操作分为两大类：内连接和外连接，而外连接有细分为三种类型。参考下图

 1. 内连接:  [inner] join
 2. 外连接 (outer join)：（引出一个驱动表的概念：驱动表里的数据全部显示)
   - 左外连接:left [outer] join, 左表是驱动表
   - 右外连接:right [outer] join, 右表是驱动表
   - 全外连接:full [outer] join, hive支持，mysql不支持.两张表里的数据全部显示出来
 3. 注意： join连接只支持等值连接

需要大家注意的是，两张表的关联字段的值往往是不一致的。比如，表 A 包含张三和李四，表 B 包含李四和王五，匹配的只有李四这一条记录。从上图很容易看出，一共有四种处理方式和结果。下图就是四种连接的图示，这张图比上面的维恩图更易懂，也更准确。

上图中，表 A 的记录是 123，表 B 的记录是 ABC，颜色表示匹配关系。返回结果中，如果另一张表没有匹配的记录，则用 null 填充。

笛卡尔积

指的是表 A 和表 B 不存在关联字段，这时表 A（共有 n 条记录）与表 B （共有 m 条记录）连接后，会产生一张包含 n x m 条记录(笛卡尔积)的新表。

案例演示：

 准备数据
 u1文件中的数据如下：
 1,a
 2,b
 3,c
 4,d
 7,y
 8,u
 
 u2文件中的数据如下：
 2,bb
 3,cc
 7,yy
 9,pp

 create table if not exists u1(
 id int,
 name string
 )
 row format delimited 
 fields terminated by ','
 ;
 
 create table if not exists u2(
 id int,
 name string
 )
 row format delimited fields terminated by ','
 ;
 
 load data local inpath './data/u1.txt' into table u1;
 load data local inpath './data/u2.txt' into table u2;

6.4 Hive日志

Hive中的日志分为两种，一种是系统日志，记录了hive的运行情况，错误状况。Job 日志，记录了Hive 中job的执行的历史过程。

在hive-log4j2.properties记录着日志文件的存储位置

 property.hive.log.dir = ${sys:java.io.tmpdir}/${sys:user.name}
 property.hive.log.file = hive.log

6.5 HQL的几种运行方式

6.5.1 在hive的client中运行

 1. 本地模式下使用hive进行client
 2. 远程模式下使用beeline工具进入client
 3. 远程模式下使用hive进入client

6.5.2 在linux命令行中执行HQL

[root@hadoop01 ~]$ hive -e 'hql query'

案例：
[root@hadoop01 ~]$ hive --database mydb -e 'select * from studentinfo';
[root@hadoop01 ~]$ hive --database exercise --hivevar ls=2 --hiveconf tn=student -e 'select * from ${hiveconf:tn} limit ${hivevar:ls}';


参数选项说明: 
-e  用于执行hql语句
-f  用于执行sql脚本文件
-S  静音模式，不显示mapreduce执行过程
-i  启动hive时初始化一个文件

6.5.3 在命令行中执行HQL文件

[root@hadoop01 ~]$ hive -f 'sql script'

案例：
[root@hadoop01 ~]$ vi hfile.sql
use exercise;
select count(*) from student where s_id<5;
[root@hadoop01 ~]$ hive -f ./hfile.sql或者使用静音模式
[root@hadoop01 ~]$ hive -S -f ./hfile.sql

第七章数据类型的讲解

在hive中，数据类型分为基础数据类型和复杂数据类型两大类型

7.1 数据类型

分类	类型	描述	字面量示例
基本类型	BOOLEAN	true/false	TRUE
	TINYINT	1字节的有符号整数 -128~127	1Y
	SMALLINT	2个字节的有符号整数，-32768~32767	1S
	INT	4个字节的带符号整数	1
	BIGINT	8字节带符号整数	1L
	FLOAT	4字节单精度浮点数	1.0
	DOUBLE	8字节双精度浮点数	1.0
	DEICIMAL	任意精度的带符号小数	1.0
	STRING	字符串，变长	“a”,’b’
	VARCHAR	变长字符串,要设置长度	“a”,’b’
	CHAR	固定长度字符串	“a”,’b’
	BINARY	字节数组	无法表示
	TIMESTAMP	时间戳，纳秒精度	122327493795，另一种“yyyy-MM-dd HH:mm:ss”
	DATE	日期	‘2016-03-29’
复杂类型	ARRAY	有序的的同类型的集合	array(1,2)
	MAP	key-value,key必须为原始类型，value可以任意类型	map(‘a’,1,’b’,2)
	STRUCT	字段集合,类型可以不同	struct(‘1’,1,1.0), named_stract(‘col1’,’1’,’col2’,1,’clo3’,1.0)
	UNION	在有限取值范围内的一个值	create_union(1,’a’,63)

示例:

 create table if not exists datatype1(
 id1 tinyint,
 id2 smallint,
 id3 int,
 id4 bigint,
 slary float,
 comm double,
 isok boolean,
 content binary,
 dt timestamp
 )
 row format delimited 
 fields terminated by ','
 ;

 
 233,12,342523,455345345,30000,60000,nihao,helloworld,2017-06-02
 126,13,342526,455345346,80000,100000,true,helloworld1,2017-06-02 11:41:30
 
 -timestamp 如果是年月日时分秒的格式，必须是写全，才能映射成功。
 
 load data local inpath './data/datatype.txt' into table datatype1;
 - 在做运算时，小范围类型都可以自动转为大范围类型做运算

7.2 复杂数据类型之array

7.2.1 定义格式如下：

 create table tableName(
 ......
 colName array<基本类型>
 ......
 )
 
 说明：下标从0开始，越界不报错，以null代替

7.2.2 案例准备：

 zhangsan        78,89,92,96
 lisi        67,75,83,94
 王五        23,12

 create table if not exists arr1(
 name string,
 scores array<String>
 )
 row format delimited 
 fields terminated by '\t'
 ;
 
 drop table arr2;
 create table if not exists arr2(
 name string,
 scores array<String>
 )
 row format delimited 
 fields terminated by '\t'
 collection items terminated by ','
 ;
 
 load data local inpath './data/arr1.txt' into table arr1;
 load data local inpath './data/arr1.txt' into table arr2;

7.2.3 查询语句：

 select * from arr1;
 select name,scores[1] from arr2 where size(scores) > 3;
 
 --统计arr2中的每个人的总成绩
 select scores[0]+scores[1]+nvl(scores[2],0)+nvl(scores[3],0) from arr2;

7.2.4 想要一种效果：也就是将数组类型的数据元素展开，换句话说，就是列转行

 zhangsan        78,89,92,96
 lisi        67,75,83,94
 王五        23,12
 将上述效果转成下面的效果，更方便统计每个人的总成绩。
 zhangsan        78
 zhangsan        89
 zhangsan        92
 zhangsan        96
 lisi        67
 lisi        75
 lisi        83
 lisi        94
 王五        23
 王五        12

7.3 展开函数的使用

7.3.1 简介

 - explode：
         展开函数(UDTF函数中的一种),作用是：接受一个数据行，然后返回产生多个数据行  
 - lateral view:虚拟表。
         会将UDTF函数生成的结果放到一个虚拟表中，然后这个虚拟表会和输入行进行join来达到数据聚合的目的

7.3.2 上案例：

 - select explode(score) score from arr2;
 
 - select name,cj from arr2  lateral view explode(scores) mytable as cj;
 
 
 - 统计每个学生的总成绩：
 select name,sum(cj) as totalscore from arr2 lateral view explode(scores) mytable as cj 
 group by name;

7.3.3 需求：向array字段中动态加载数据，不直接load加载，而是insert。

7.3.3.1 准备数据

 create table arr_temp
 as
 select name,cj from arr2 lateral view explode(scores) score as cj;

7.3.3.2 借助collect_set函数，列转行函数，有去重效果。collect_list函数没有去重效果

 drop table arr3;
 create table if not exists arr3(
 name string,
 scores array<string>
 )
 row format delimited 
 fields terminated by ' '
 collection items terminated by ','
 ;
 
 将数据写成array格式：
 insert into arr3
 select name,collect_set(cj) from arr_temp group by name;
 
 查询每个人的最后一科的成绩
 select name,scores[size(scores)-1] lastsubject from arr3;

7.4 复杂数据类型之map

7.4.1 定义格式如下：

 create table tableName(
 .......
 colName map<T,T>
 ......
 )

7.4.2 案例准备：

 zhangsan        chinese:90,math:87,english:63,nature:76
 lisi        chinese:60,math:30,english:78,nature:0
 wangwu        chinese:89,math:25
 
 create table if not exists map1(
 name string,
 score map<string,int>
 )
 row format delimited 
 fields terminated by '\t'
 collection items terminated by ','
 map keys terminated by ':'
 ;
 
 load data local inpath './data/map1.txt' into table map1;

7.4.3 查询语句：

 #查询数学大于35分的学生的英语和自然成绩：
 select 
 m.name,
 m.score['english'] ,
 m.score['nature']
 from map1 m
 where m.score['math'] > 35
 ;
 
 #查看每个人的前两科的成绩总和
 select 
 m.name,
 m.score['chinese']+m.score['math']
 from map1 m;

7.4.4 展开查询

 - 展开效果
 zhangsan        chinese                90
 zhangsan        math        87
 zhangsan        english         63
 zhangsan        nature                76
 
 - explode展开数据：
 select explode(score) as (m_object,m_score) from map1;
 
 - 使用lateral view explode 结合查询：
 select name,m_object,m_score from map1 lateral view explode(score) score as 
 m_object,m_score;
 
 
 - 统计每个人的总成绩
 select name,sum(m_score)
 from map1 lateral view explode(score) score as 
 m_object,m_score
 group by name;

7.4.5 将数据动态写入map字段中

 将下面的数据格式
 zhangsan        chinese 90
 zhangsan        math    87
 zhangsan        english 63
 zhangsan        nature  76
 lisi    chinese 60
 lisi    math    30
 lisi    english 78
 lisi    nature  0
 wangwu  chinese 89
 wangwu  math    25
 wangwu  english 81
 wangwu  nature  9
 转成：
 zhangsan chinese:90,math:87,english:63,nature:76
 lisi chinese:60,math:30,english:78,nature:0
 wangwu chinese:89,math:25,english:81,nature:9

7.4.5.1 准备数据

 create table map_temp
 as
 select name,m_subject,m_score from map1 lateral view explode(score) t1 as m_subject,m_score;

7.4.5.2 开始写：

 第一步：将科目和成绩组合在一起，concat
 select name,concat(m_subject,':',m_score) as score from map_temp;
 
 第二步: 将所有属于同一个人的数据组合在一起
 select name,collect_set(concat(m_subject,":",m_score)) 
 from map_temp
 group by name
 ;
 
 第三步:将数组变成一个字符串concat_ws
 select name,concat_ws(",",collect_set(concat(m_subject,":",m_score)))
 from map_temp
 group by name
 ;
 
 第四步:将字符串转成map 使用函数str_to_map(text, delimiter1, delimiter2)
 text：是字符串
 delimiter1：多个键值对之间的分隔符
 delimiter2：key和value之间的分隔符
 
 select
 name,
 str_to_map(concat_ws(",",collect_set(concat(m_subject,":",m_score))),',',':')
 from map_temp
 group by name
 ;
 
 第五步：存储准备的表中
 create table map2 as
 select
 name,
 str_to_map(concat_ws(",",collect_set(concat(m_subject,":",m_score))),',',':') score
 from map_temp
 group by name
 ;

7.5 复杂数据类型 struct

7.5.1 简介

struct类型，类似于java编程语言中对象实例的模板，即类的结构体。如地址类型的结构体:

 public class Address{
         String provinces;
         String city;
         String street;
         .......
 }

使用struct类型来定义一个字段的类型，语法格式为:

 create table tableName(
 ........
 colName struct<subName1:Type,subName2:Type,........>
 ........
 )

调用语法：

 colName.subName

7.5.2 案例演示：

1）数据准备

 zhangsan        90,87,63,76
 lisi        60,30,78,0
 wangwu        89,25,81,9
 
 create table if not exists struct1(
 name string,
 score struct<chinese:int,math:int,english:int,natrue:int>
 )
 row format delimited 
 fields terminated by '\t'
 collection items terminated by ',';
 
 导入数据：
 load data local inpath './data/arr1.txt' into table struct1;

2）需求：查询数学大于35分的学生的英语和语文成绩：

 select name,
 score.english,
 score.chinese
 from str2
 where score.math > 35
 ;