4.hive知识点

最新推荐文章于 2024-05-15 04:37:52 发布

努力成为一名资深大数据选手

最新推荐文章于 2024-05-15 04:37:52 发布

阅读量652

点赞数

分类专栏：大数据

本文链接：https://blog.csdn.net/qq_50166024/article/details/114648260

版权

大数据专栏收录该内容

58 篇文章 4 订阅

订阅专栏

一、复习

1.1 Linux

- linux的安装
- 基本命令
- 相对路径和绝对路径
- 重定向符号：> 和 >>   
- 管道符号：|
- VIM/VI编辑器
- 高阶命令
- 软件管理
- 克隆
- 免密登录认证（重点）
- 静态IP的管理（重点）
- 服务的管理
- 时间同步（重点）
- shell（重点）

1.2 HDFS

- 大数据的概念
	（1）概念
	（2）特征
	（3）大数据工作流程：
		-1）数据采集：   数据源的分类有行为数据(日志文件)、业务数据(RDBMS)、内容数据、第三方数据
		-2）数据存储
		-3）数据清洗
		-4）数据分析
		-5）数据展示
- hadoop的简介（重点）
	（1）hadoop是什么，有什么特点
	（2）三大核心模块：hdfs，mr,yarn
	（3）三篇论文：GFS，Mapreduce,bigtable
- hdfs的搭建模式
	（1）独立模式
	（2）伪分布式模式
	（3）完全分布式模式
	（4）启动|关闭脚本
- hdfs的命令
	（1）创建  hdfs dfs(hadoop fs) -mkdir
- hdfs的管理命令
	         hdfs dfsadmin -safemode [enter|leave|get|wait]
	         hdfs dfsadmin -report
	         hdfs dfsadmin -rollEdits
	         hdfs dfsadmin -refreshNodes
- hdfs的块机制（重点）
- hdfs的体系结构（重点）
- hdfs的工作机制
	（1）开机启动过程（重点）
	（2）安全模式
	（3）心跳机制（重点）
	（4）检查点机制（重点）
	（5）机架感知策略（副本冗余策略）（重点）
	（6）datanode动态上下线
- hdfs的api
- hdfs的数据流（重点）
	(1) 读流程
	(2) 写流程

1.3 zookeeper

- 概念和特点
- 数据模型（重点）
- 应用场景（重点）
	- 集群的管理，比如HA
	- 集群应用的统一配置管理，比如db.properties里的配置
	- 集群节点的动态上下线感知
	- 软负载均衡
	- 分布式队列
	- 分布式锁
- 安装模式
- 常用指令
	- ls ls2       
	- create
	- delete 
	- rmr 
- 节点的类型 (重点)
- 开启时的选举机制
- 监听和通知机制（zookeeper的核心技术，重要）
- 写流程
- HA
    zookeeper提供了  在/hadoop-ha/active-
     
    两个namenode节点上都各有一个zkfc（zookeeper客户端，但是是专门为namenode做故障转移的服务）

1.4 MapReduce

- 概念
	容错性：当一个task失败后，还会有几次尝试。
- 设计思想（重点）
-  阶段分类和数据扭转
-  编程模型
-  入门案例
-  分区器（重点）
-  IDE运行MR的三种方式
- hadoop序列化机制
- mr运行流程概述
- 分片机制（重点）
    （1）分片描述的就是maptask要处理的原始数据的这些描述信息，如path,start,length,hosts ，我们称之为逻辑数据，就是输入分片（inputsplit）,简称分片。
    （2）分片大小如何选择
- maptask流程（重点）
- reducetask流程（重点）
- shuffle流程（重点）
- yarn（重点）

day01—————

一、Hive的概述

1.1 Hive的简介

1. 起源于facebook开发团队，后来捐给了apache基金会
2. hive是一个运行在hadoop上的一款数据仓库管理工具，可以将具有结构化的大数据文件映射成一张表，并提供了类sql的语法来管理和分析这些文件(表)。
3. 用于描述映射关系的数据，称之为元数据。
4. 这套类sql语言，称之为hive sql,简称hql
5. 默认情况下，hive的执行引擎(底层应用的框架)是mapreduce,
6. 执行引擎也可以使用spark,tez等计算框架来替换

1.2 Hive的优缺点

1.2.1 优点

- 学习成本低
- 扩展性： 集群增加hive的相关节点特别容易，不需要重启
- 容错性： hive在运行时(mr)如果出错，会再别的节点上再次尝试运行
- 延展性： 内置的函数不够用，可以使用java语言，自定义一个函数
- 适合做OLAP(离线分析处理)的工作：比较适合做相对来说高延迟的工作。（OLTP,实时分析）
- 元数据统一管理：使用关系型数据库来保存元数据（比如mysql,oracle）

1.2.2 缺点

- hive的sql表达能力有限： 不是所有的算法，逻辑都可以使用sql来表达
	如：pageRank,kmeans算法
- hive本身底层是mapreduce
    （1）本身翻译出来的mapreduce不能满足所有用户的需求
    （2）调优粒度粗糙。

二、Hive的体系结构和工作流程

2.1 hive的体系结构（重点）

第一层：指的是连接hive的客户端接口层
	- CLI (command line interface),指的是hive提供的shell命令
	- JDBC/ODBC,指的是可以使用编程语言来来连接hive.比如，java，c++,ruby等
	- webui, hive提供了web访问接口
第二层：
	- thrift server :第三方软件服务项框架，当使用远程方式连接时，需要使用这个服务。
	- 元数据存储：指的是元数据存储的位置，可以是mysql,oracle，derby等
第三层：驱动层
    - 解析器(SQLParser):
		作用就是解析语法的正确性，比如表是否存在、字段是否存在、SQL语义是否有误。	
	- 编译器(Compiler)：
	    作用就是将正确的语法翻译成mr程序，会与元数据库进行交互，获取对应的元数据，比如表名对应的文件路径。
		最后形成job程序
	- 优化器(Optimizer)：
		将执行计划进行优化，减少不必要的列、使用分区、使用索引等。优化job。
	- 执行器(Executer):
	将优化后的执行计划提交给hadoop的yarn上执行。提交job。

2.2 hive的工作流程(重点)

1. 用户提交查询等sql任务给Driver。
2. 驱动程序里的解析器解析Hql语法的正确性，然后发送编译器，生成查询计划。
3. 编译器Compiler根据用户任务去MetaStore中获取需要的Hive的元数据信息。
4. 编译器得到元数据信息后，继续对任务进行编译，生成完整的job程序，然后通过优化器进一步优化。
5. 将最终的计划提交给Driver。到此为止，查询解析和编译完成。
6. Driver将计划Plan转交给ExecutionEngine去执行。
7. 在内部，执行作业的过程是一个MapReduce工作。执行引擎发送作业给yarn（yarn的job执行流程）。
7.1 与此同时,在执行时,执行引擎可以通过Metastore执行元数据操作（比如，create....as select操作）
8. 执行引擎接收来自数据节点的结果。
9. 执行引擎发送这些结果值给驱动程序。
10. 驱动程序将结果发送给Hive接口。

2.3 hive与mapreduce的区别

hive就是对mapreduce的封装，相当于在mapreduce外套了一个壳子。

2.4 hive与传统型数据库的区别

Hive和数据库的比较如下表：

比较项	关系数据库	Hive
ANSI SQL	支持	不完全支持
更新	UPDATE INSERT DELETE	INSERT OVERWRITE\ INTO TABLE
事务	支持	支持(部分支持)
模式	写模式	读模式
存储位置	块设备、本地文件系统	HDFS
延时	低	高
多表插入	不支持	支持
子查询	完全支持	只能用在From子句中
视图	Updatable	Read-only
可扩展性	低	高
数据规模	小	大
实时响应	毫秒级	秒级

三、hive的安装

3.1 内嵌模式

3.1.1 说明

内嵌模式指的是hive使用的元数据库是内置的derby数据库。缺点非常明显，只支持单session，而且还有一个特点，就是derby数据库的位置位于启动时所在的工作路径里，只要更改工作路径，就失效。

3.1.2 步骤

步骤1）上传，解压，并更名

步骤2）配置环境变量

步骤3）配置hive的环境脚本hive-env.sh

export HIVE_CONF_DIR=/usr/local/hive/conf
export JAVA_HOME=/usr/local/jdk
export HADOOP_HOME=/usr/local/hadoop
export HIVE_AUX_JARS_PATH=/usr/local/hive/lib

步骤4) 配置hive的用户自定义配置文件hive-site.xml

<!-- 删除所有的内容，只需要配置一下四个即可 -->
<property>
    <name>hive.exec.local.scratchdir</name>
    <value>/usr/local/hive/iotmp/root</value>
</property>
<property>
    <name>hive.downloaded.resources.dir</name>
    <value>/usr/local/hive/iotmp/${hive.session.id}_resources</value>
</property>
<property>
    <name>hive.querylog.location</name>
    <value>/usr/local/hive/iotmp/root</value>
</property>
<property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/usr/local/hive/iotmp/root/operation_logs</value>
</property>

步骤5）启动hadoop

[root@qianfeng01 hadoop]# start-dfs.sh
注意：当执行hsql语句时，涉及到mr程序，就需要启动yarn了
[root@qianfeng01 hadoop]# start-yarn.sh

步骤6）初始化元数据库

[root@qianfeng01 hive]# schematool –initSchema –dbType derby

步骤7）测试连接

#使用bin目录下的hive指令启动hive
[root@qianfeng01 hive]# bin/hive

3.2 本地模式

3.2.1 说明

--1. 本地模式指的是hive指令在运行时，内置会启动服务项metastore。 （本地模式的本质）
--2. 然后metastore的作用就是连接元数据库mysql.
--3. mysql是独立的一款关系型数据库，在哪一台机器上无所谓。
--4. 本地模式与mysql在哪一台机器上是无关的。

3.2.2 安装步骤

步骤1）安装mysql

##1) 卸载mariadb
[root@qianfeng03 ~]# rpm -qa | grep mariadb
mariadb-libs-5.5.64-1.el7.x86_64
[root@qianfeng03 ~]# rpm -e --nodeps mariadb-libs-5.5.64-1.el7.x86_64
[root@qianfeng03 ~]# rpm -qa | grep mariadb

##2) 上传，拆包
[root@qianfeng03 ~]# tar -xvf mysql-5.7.28-1.el7.x86_64.rpm-bundle.tar

##3) 安装下面顺序安装mysql
[root@qianfeng03 ~]# rpm -ivh mysql-community-common-5.7.28-1.el7.x86_64.rpm
[root@qianfeng03 ~]# rpm -ivh mysql-community-libs-5.7.28-1.el7.x86_64.rpm
[root@qianfeng03 ~]# rpm -ivh mysql-community-client-5.7.28-1.el7.x86_64.rpm
[root@qianfeng03 ~]# rpm -ivh mysql-community-server-5.7.28-1.el7.x86_64.rpm

##4) 启动服务项
[root@qianfeng03 ~]# systemctl start mysqld

##5) 查询服务项
[root@qianfeng03 ~]# systemctl status mysqld

##6）查看临时密码，然后登陆
[root@qianfeng03 ~]# cat /var/log/mysqld.log | grep password
[root@qianfeng03 ~]# mysql -uroot -p

##7) 修改密码策略
mysql> set global validate_password_policy=low;
mysql> set global validate_password_length=6;
查看密码策略
mysql> show variables like '%validate_password%';

##8）修改密码
alter user root@localhost identified by '@Mm123123';

##9) 远程授权
grant all privileges on *.* to root@'%' identified by '远程密码' ;

步骤2）上传hive安装包，解压，更名，配置环境变量

步骤3）配置hive的环境脚本

export HIVE_CONF_DIR=/usr/local/hive/conf
export JAVA_HOME=/usr/local/jdk
export HADOOP_HOME=/usr/local/hadoop
export HIVE_AUX_JARS_PATH=/usr/local/hive/lib

步骤4）配置hive的用户自定义配置文件

<!--hive仓库在hdfs的位置-->
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
</property>

<!-- 该参数主要指定Hive的临时文件存储目录  -->
<property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive</value>
</property>

<!-- 删除所有的内容，只需要配置一下四个即可 -->
<property>
    <name>hive.exec.local.scratchdir</name>
    <value>/usr/local/hive/iotmp/root</value>
</property>
<property>
    <name>hive.downloaded.resources.dir</name>
    <value>/usr/local/hive/iotmp/${hive.session.id}_resources</value>
</property>
<property>
    <name>hive.querylog.location</name>
    <value>/usr/local/hive/iotmp/root</value>
</property>
<property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/usr/local/hive/iotmp/root/operation_logs</value>
</property>

<!-- 配置metastore服务项连接mysql的参数 -->

<!--配置mysql的连接字符串-->
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://qianfeng03:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<!--配置mysql的连接驱动-->
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<!--配置登录mysql的用户-->
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
</property>
<!--配置登录mysql的密码-->
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>@Mm123123</value>
</property>

步骤5）将mysql的驱动包上传到hive的lib目录

步骤6）初始化元数据库mysql

[root@qianfeng03 ~]# schematool -initSchema -dbType mysql

步骤7）连接测试

[root@qianfeng03 ~]#  hive

再次说明：启动hive指令时，会内置启动一个metastore服务，此服务项会主动连接mysql元数据库

day02 —————

一 hive的安装（续）

1.3 远程模式

1.3.1 说明

--1. hive的客户端连接的服务项必须单独启动，这就是远程模式。
--2. 单独启动的服务项有两种，分别是hiveserver2和metastore

1.3.2 服务端的配置

配置hive的环境脚本hive-env.sh

export HIVE_CONF_DIR=/usr/local/hive/conf
export JAVA_HOME=/usr/local/jdk
export HADOOP_HOME=/usr/local/hadoop
export HIVE_AUX_JARS_PATH=/usr/local/hive/lib

配置hive-site.xml

<!-- 在hive-site.xml中配置以下属性 -->
<property>
    <name>hive.exec.local.scratchdir</name>
    <value>/usr/local/hive/iotmp/root</value>
</property>
<property>
    <name>hive.downloaded.resources.dir</name>
    <value>/usr/local/hive/iotmp/${hive.session.id}_resources</value>
</property>
<property>
    <name>hive.querylog.location</name>
    <value>/usr/local/hive/iotmp/root</value>
</property>
<property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/usr/local/hive/iotmp/root/operation_logs</value>
</property>

<!--配置mysql的连接字符串-->
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://mei03:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<!--配置mysql的连接驱动-->
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<!--配置登录mysql的用户-->
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
</property>
<!--配置登录mysql的密码-->
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>@Mm123123</value>
</property>

说明：使用远程模式，需要在hadoop的core-site.xml文件中添加一下属性，别忘记分发

<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>

1.3.3 metastore服务项的应用

1.3.3.1 metastore服务项的启动

##方式1：前台启动
	hive --service  metastore
##方式2：后台运行
	hive --service  metastore &

1.3.3.2 hive指令连接远程metastrore服务

注意：
--1. metastore服务项必须使用hive指令连接
--2. hive客户端的配置中必须要有9083端口号的配置（metastore服务项的端口号就是9083）
<property>
	<name>hive.metastore.uris</name> 
	<value>thrift://ip:9083</value>
</property>

ip指的是metastore服务项所在的ip地址

1.3.4 hiveserver2服务项的应用

1.3.4.1 服务项的启动

注意：
--1.  hiveserver2会内置开启metastore服务项
--2.  hiveserver2服务项必须使用beeline工具连接
##方式1：前台启动
	hive --service  hiveserver2
##方式2：后台运行
	hive --service  hiveserver2 &

1.3.4.2 beeline工具连接hiveserver2

##方式1： 进入beeline命令行中连接
beeline
!connect jdbc:hive2://mei02:10000  回车
用户名和密码可以不用输入

##方式2： beeline直连（在linux的命令行上）
beeline -u jdbc:hive2://mei02:10000 回车

注解：
hive2   是连接hiveserver2服务的协议名称
10000   是hiveserver2服务项的端口号

注意：

当本机上的hive-site.xml文件里配置了9083端口号：
--1. 那么本机就不能开启hiveserver2服务项
--2. 也不能使用本地模式，只能使用远程模式的metastore服务项

1.4 扩展:hive命令行上的一些小应用

1）查看hive的内置属性

hive>  set;

2）可以在hive命令行上显示某一些属性的默认值

hive>  set hive.cli.print.current.db;

3）临时修改某一些属性的默认值

hive>  set hive.cli.print.current.db = true;

4）当然，也可以在hive-site.xml里永久配置相关属性

<property>
	<name>hive.cli.print.current.db</name> 
	<value>true</value>
</property>

5）可以在hive的命令行上使用linux或者是hdfs的命令

##linux的命令
hive> !pwd;
hive> !ls -a;

##hdfs的命令
hive> dfs -ls /;
hive> dfs -rmr /output2;

二、Hive的库与表的DDL操作

2.1 语法的注意事项

--1. 表名、库名等名称都不区分大小写
--2. 命名时数字不能开头，不能使用特殊符号
--3. 命名时不能使用关键字

2.2 库的DDL操作

1）创建数据库

方式1：create database dbName;
方式2：create database if not exists dbName;
方式3：create database if not exists dbName comment '。。。。。。';

create database dbName default character set utf8;

注意:

--1. 创建库的本质，是在hdfs上创建目录，目录名为库名.db
--2. 创建库的时候，也会更新元数据库里的元数据表中的元数据

2）查看所有的数据库

show databases;

3）查看建库语句

show create database dbName;

显示当前数据库

set hive.cli.print.current.db=true;

切换数据库

use dbName;

查看某一个库的详细信息

desc database testdb1;
describe database testdb1;
desc database extended testdb1;

删除库

drop database dbName;
如果库中有表，只能强制删除
drop database dbName cascade;

2.3 表的DDL操作

0）数据类型

Hive的数据类型分为基本数据类型和复杂数据类型,下面是基本数据类型(复杂类型到后期再讲)

其中加粗体是重点要掌握的类型

分类	类型	描述	字面量示例
基本类型	BOOLEAN	true/false	TRUE
	TINYINT	1字节的有符号整数 -128~127	1Y
	SMALLINT	2个字节的有符号整数，-32768~32767	1S
	INT	4个字节的带符号整数	1
	BIGINT	8字节带符号整数	1L
	FLOAT	4字节单精度浮点数	1.0
	DOUBLE	8字节双精度浮点数	1.0
	DEICIMAL	任意精度的带符号小数	1.0
	STRING	字符串，可变长度	“a”,’b’
	VARCHAR	变长字符串,要设置长度	“a”,’b’
	CHAR	固定长度字符串	“a”,’b’
	BINARY	字节数组	无法表示
	TIMESTAMP	时间戳，纳秒精度	122327493795
	DATE	日期	‘2016-03-29’
复杂类型	ARRAY	有序的的同类型的集合	array(1,2)
	MAP	key-value,key必须为原始类型，value可以任意类型	map(‘a’,1,’b’,2)
	STRUCT	字段集合,类型可以不同	struct(‘1’,1,1.0)
	UNION	在有限取值范围内的一个值	create_union(1,’a’,63)

1）创建表

##写法1：
create table tableName(
colName colType comment '......',
.......
);

上述写法的记录分隔符是换行符，列分隔符不是空格，而是^A(\001)

写法2：
create table tableName(
colName colType comment '......',
.......
)
row format delimited
fields terminated by ' '
lines terminated by by '\n'
stored as textfile;


##写法3：
create table if not exists tableName(
colName colType comment '......',
.......
)
comment '....';

注意：

建表的本质其实就是在相应的库目录下创建表目录

create table if not exists score (
sid int,
score double
)
row format delimited 
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;


drop table if exists score;
create table if not exists score (
sid int,
score double
)
comment 'this is my table' 
row format delimited 
fields terminated by '|';

2）显示当前库下的所有表

show tables;

显示别的库下的所有表

show tables in dbName;

4）显示建表语句

show create table tableName;

5）查看表结构

desc tableName;
describe tableName;
desc extended tableName

6）删除表

drop table [if exists] tableName;

7）修改表结构

- 修改表名
	alter table oldTableName rename to newTableName;

- 修改列名：change column	
	alter table tableName change column oldName newName colType;
- 修改字段类型：
	alter table tableName change column colName colName colType;

- 增加字段：add columns
	alter table tableName add columns (sex int,...);


- 修改列的位置:  注意，2.x版本后，必须是相同类型进行移动位置。
	alter table tableName change column colName colName colType after colName1;   
	alter table tableName change column colName colName colType first;
	
- 删除字段：replace columns	#注意，2.x版本后，注意类型的问题，替换操作，其实涉及到位置的移动问题。
	alter table tableName replace columns(
    id int,
    name int,
    size int,
    pic string
    );
	注意：实际上是保留小括号内的字段。

三、表数据的导入和导出

3.1 表数据的导入

方式1）使用hdfs的指令将文件上传到对应的表目录下

案例演示：
##1. 建表
create table if not exists emp(
empno int,
ename string,
job string,
mgr int,
hiredate date,
sal double,
comm double,
deptno int
)
row format delimited 
fields terminated by ','
lines terminated by '\n'
stored as textfile;

##2. 上传数据文件
hdfs dfs -put /root/data/emp.txt /user/hive/warehouse/sz2003.db/emp/
##3. 测试
select * from emp
select * from emp limit 3;

方式2）在hive的命令行使用load指令导入数据

案例演示：
##1. 建表
create table if not exists dept(
deptno int,
dname string,
loc string
)
row format delimited 
fields terminated by ','
lines terminated by '\n'
stored as textfile;

##2. 导入数据
语法：load data [local] inpath '具体路径' [overwrite] into table tableName
解释：  如果带local，表示加载linux本地的文件
       如果不带local,表示加载hdfs上的文件到表目录里
       带overwrite 表示覆盖表目录里的所有文件

演示：
    load data local inpath '/root/data/dept.txt' into table dept;

方式3）动态加载数据到表中

动态的意思就是数据来源于其他的表。

##案例演示：
create table dept_1 as select deptno,dname from dept;

##扩展： 因为hive默认使用的hadoop集群的配置，而配置中使用的是yarn模式。在测试时，性能比较慢，因此可以通过修改hive的参数来达到执行hadoop的本地模式。
hive> set hive.exec.mode.local.auto=true;



##扩展一下：使用hql向多张表中插入数据
create table if not exists dept_10(
deptno int,
dname string,
loc string
)
row format delimited 
fields terminated by ','
lines terminated by '\n'
stored as textfile;

drop table dept_11;
create table if not exists dept_11(
deptno int,
dname string
)
row format delimited 
fields terminated by ','
lines terminated by '\n'
stored as textfile;

#同时向多张表中插入
from dept
insert into dept_10 select * 
insert into dept_11 select deptno,dname where deptno<30;

3.2 表数据的导出

3.2.1 将表数据导出到文件系统的目录中

--1. 导出到本地文件系统的目录中
insert overwrite local directory '/root/data/dir1' select * from dept;
--2. 导出到hdfs文件系统的目录中
insert overwrite  directory '/input' select deptno,loc from dept;

##注意：
如果在导出时，不指定目标文件的分隔符，则使用默认的分隔符^A

--3. 修改导出后的列与列之间的格式：
insert overwrite local directory '/root/out/dir2'
row format delimited fields terminated by ','
select * from dept;

3.2.2 将表数据导出到具体文件名中

因为要指定具体文件名，所以应该在linux的命令行上使用重定向功能
hive -e "select empno,ename,job,hiredate from emp where deptno<30" >> /root/data/result

注意：此时文件中的数据分隔符是制表符

四、表类型的概述

4.1 说明

在hive中，表类型（说的是tableType这个属性）只有两种
##1. 内部表，又叫管理表（MANAGED_TABLE）,
	（1）默认创建的表就是管理表
	（2）表目录的位置位于所在库目录下
##2. 外部表
	 (1)创建外部表必须使用external关键字
	 (2)外部表可以使用location关键字来灵活指定表目录的具体位置,如果不使用location,位置与内部表一致
	 
	 
##演示使用关键字external，不使用location
create external table if not exists student(
sid int,
sname string,
gender char(1),
age int
)
row format delimited 
fields terminated by ',';

## 使用location
create external table if not exists student_1(
sid int,
sname string,
gender char(1),
age int
)
row format delimited 
fields terminated by ','
location 'hdfs://mei01:8020/student_1';

4.2 表类型的转换

##1. 内部表转外部表
alter table tablename set tblproperties('EXTERNAL'='TRUE');
注意：true必须大写

##2. 外部表转内部表
alter table tablename set tblproperties('EXTERNAL'='false');
注意：false不区分大小写

4.3 外部表与内部表的区别

##1. 两种类型的表在删除时的区别
--1. 内部表的元数据和原始数据(数据文件)都会被删除
--2. 外部表的元数据会被删除，但是原始数据不会被删除，

##2. 在创建时的区别
--1. 外部表需要加external或者location关键字

##3. 在实际生产环境中
--1. 内部表通常用于测试，不怕被删除
--2. 外部表通常用于存储不能轻易删除的数据文件。

五、hive的基本查询语法

5.1 扩展内容：使用idea连接hive

5.1.1 说明

使用idea充当客户端连接hive时，hive必须开启hiveserver2服务（idea使用beeline工具连接hive）

5.1.2 步骤

##步骤1：
	在idea任意一个项目下，创建一个扩展名为sql的文件，比如hive.sql
##步骤2：
	选项文件最右边的database选项，
##步骤3：
	弹出窗口后，选择+，然后选择Data Source里的Apache hive,然后弹出以下窗口，按照下图，填写内容

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-X1mzy88I-1615385700161)(ClassNotes.assets/image-20201010160128629.png)]

##步骤4 
	选择上图中的Driver：Apache Hive，进入另一个窗口，选择hive的驱动jar包，如下图操作，选择+，然后选择自己的驱动包，点击应用，然后再点击向左箭头，回去之前的窗口，进行测试。

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-xmryh7YY-1615385700165)(ClassNotes.assets/image-20201010160513433.png)]

##步骤5：
	之后就可以在文件里写sql了，然后点击绿色按钮运行。

5.2 hql的基本语法和执行顺序

5.2.1 基本语法子句

select [distinct]....
from leftTable
	[join rightTable] [on condition]
[where condition]
[group by colName[,.....]]
[having condtition]
[order by colName[asc|desc][,......]]
[limit [m,]n]
[union all]

5.2.2 子句的执行顺序

第一步：from子句
第二步：on 条件
第三步：join...
第四步：where子句
第五步：group by子句
第六步：having子句
第七步：select子句
第八步：order by 子句
第九步：limit
第十步：union all

5.3 各个子句的回顾

5.3.1 where子句

--1. 作用：where子句是用来过滤和筛选数据的
--2. 条件有以下形式：
	(1)关系表达式:   >,>=,<,<=,=,!=,<>
	               where age>23;
	               where age<>23;
	(2)连接符号： and ，or, between .. and ..
    			   where age>=23 and age <=40  相当于 where age between 23 and 40
                   where age<23 or age >40
                   where age<23 and name<>'zhangsan'
    (3)模糊查询： 
    	关键字：like
    	统配符号：  %（匹配任意长度的字符）
    			  _(占位符)
    			   where name like '_A%'
    (4)集合操作：[not] in:
    	  			where age in (10,20,10,30): 查询年龄是10，或者是20，或者是30的信息
    	  			相当于 where age =10 or age = 20 or age = 30
    	  			where age not in (10,20,30):查询年龄既不是10，也不是20，也不是30的信息
    	  			相当于 where age!=10 and age!=20 and age!=30
    (5)where条件中可以使用子查询。
    	案例：查看员工表emp，工资小于smith工资的所有人的信息
    	     select * from emp where sal<(select sal from emp where ename='smith');
注意：
--1. 在hive的where子句中，支持子查询，但是子查询被认为是一个集合，因此不能使用关系表达式符号，只能使用in或者是not in
--2. 只要涉及到子查询，建议要使用表别名，列名也带上表别名

5.3.2 group by子句

##1. 作用
	 是按照某一个字段值或者一些字段值组合分组，然后查询每组中的统计数据，比如个数，和，最大值，最小值，平均值。
   注意:在多个字段分组查询时，分组的组合情况是不可以跨行交叉组合在一起的，只能是同一行的数据进行组合，组合的数据一样的话，就是一组。
##2. 分组查询一般都要使用聚合函数（分组函数，也被称之为分析函数）
   sum(colName)
   count(colName|*|1):统计个数
   max(colName)
   min(colName)
   avg(colName)
   
   特点：聚合函数都会忽略null值。sum，max,min这三个函数忽略空值，是不影响结果的。
        count和avg都影响结果。
        count:如果要统计的是表中的所有记录个数，那么不应该指定具体字段，而应该使用*或者是1.
        avg: 如果要统计的平均值是除以所有行数时，那么就应该做null值处理。
        	 在hive中，通常要使用nvl函数，在mysql中使用ifnull函数
        	 avg(nvl(score,0)):if score is null 就使用0这个值
 ## 分组查询的输出结果的特点
 	--1. 多条变一条。
 	--2. 分组查询时不能显示每一组中的详细信息
 ## 分组查询中的select子句中只能有分组函数和分组字段，其他字段是不可以存在的。
 select ename,deptno,count(sal) from emp group by deptno;

day03 —————

一 Hive的基本查询语法（续）

1.1 各个子句的回顾（续）

1.1.1）having子句

--作用： 对分组查询后的数据进行再次过滤和筛选。
--注意事项：
   (1)可以使用分组函数作为条件的一部分。where子句中不可以使用分组函数
   (2)hive中的having子句也能使用子查询，但是也只能是[not] in
      
案例演示：查询部门平均工资 不等于 10号部门平均工资的部门编号
select A.deptno,avg(nvl(A.sal,0)) from emp A group by A.deptno having avg(nvl(A.sal,0)) not in (select avg(nvl(sal,0)) from emp  where deptno = 10);

1.1.2) order by 子句

--作用： 对select选择的字段数据进行排序，升序或者是降序
--语法： order by colName1 [asc|desc][,colName1 [asc|desc],.....]
--解析： asc: 升序，可以不写。默认就是升序
		desc:降序

1.1.3) limit子句

--作用： 对排序后的数据进行分页查询（抽样查询）
--特点： 分页前应该使用order by进行排序
--语法： limit [m,] n
--解析： m表示从表行数的第m条开始查询，如果m不写，默认从第一条开始查询。第一条的下标为0
		n表示显示m后的n条
       案例： limit 5,10    表示从第六条开始查询10条记录
--扩展：  findByPage(int pageNum,int pageSize)
		 limit (pageNum-1)*pageSize,pageSize
       
         1   0    5
         2   5    5
         3   10   5
         4   15   5

1.1.4) union [all]

--作用： 连接（联合）两个sql的结果，组成一张表。 第二个sql的结果从后面追加到第一个sql的结果后。
--特点： 
	（1）两个sql的字段个数要一致，当不一致时，需要使用常量或者null来占位，以达到字段个数一致。
	（2）两个sql的字段名的顺序和类型，也应该一致，否则无意义。
	（3） union all  不会去重
	（4） union 有去重效果
--案例： 10、20号部门的所有信息与20、30号部门的信息进行联合
select empno,ename,job,deptno,sal from emp  where deptno =10 or deptno = 20  
union 
select empno,ename,job,deptno,null from emp where deptno =30 or deptno = 20
order by deptno desc;

1.2 hive的join连接

1.2.1）join连接查询的回顾

--概念：因为需求中的数据在多张表中，这个时候需要连接多张表，才能查询查询出想要的数据，多张表关联可以使用关键字join，进行连接，这种操作就是join连接查询。
--join连接查询的分类：
	（1）内连接： lefttable [inner] join  righttable  on  关联条件
		特点：返回的数据是满足关联条件的所有数据的组合。
		默认左表是驱动表
	（2）外连接： outter join,再细分：
    	 -左外连接：   lefttable left [outter] join  righttable  on 关联条件
    	 	  左表中除了显示满足条件的记录，还包含不满足条件的其他记录，也就是整张表的数据都要显示。
    	 	  不满足条件的记录与null组合。
    	 -右外连接：   lefttable right [outter] join  righttable  on 关联条件
    	 	  右表中除了显示满足条件的记录，还包含不满足条件的其他记录，也就是整张表的数据都要显示。
    	 	  不满足条件的记录与null组合。
    	 -驱动表：显示全部数据的那一张表，就是驱动表。左外连接左表是驱动表，右外连接游标是驱动表	  
    	 -全(满)外连接： lefttable full join righttable on 关联条件
    	 	  两张表除了显示满足条件的数据组合外，不满足条件的数据也都与null组合进行显示
    (3) 自连接：两张表是同一张表，这种连接就是自连接
        案例演示：查询员工的信息及其领导的姓名和职位
7369    SMITH   CLERK   		7902    1980-12-17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-02-20      1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-02-22      1250.0  500.0   30
7566    JONES   MANAGER 		7839    1981-04-02      2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-09-28      1250.0  1400.0  30
7698    BLAKE   MANAGER 		7839    1981-05-01      2850.0  NULL    30
7782    CLARK   MANAGER 		7839    1981-06-09      2450.0  NULL    10
7788    SCOTT   ANALYST 		7566    1987-04-19      3000.0  NULL    20
7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
7844    TURNER  SALESMAN        7698    1981-09-08      1500.0  0.0     30
7876    ADAMS   CLERK   		7788    1987-05-23      1100.0  NULL    20
7900    JAMES   CLERK   		7698    1981-12-03      950.0   NULL    30
7902    FORD    ANALYST 		7566    1981-12-02      3000.0  NULL    20
7934    MILLER  CLERK   		7782    1982-01-23      1300.0  NULL    10        
案例演示1：查询员工的信息及其领导的姓名和职位   
分析：如果A表作为员工表，那么A表中的mgr才是关联字段
           B表作为领导表，那么B表中的empno才是关联字段
   select A.*,B.ename `领导姓名`,B.job `领导职位`
   from  emp A join emp B on  A.mgr = B.empno
案例演示2：查询员工的信息及其下属的姓名和职位
分析： 表A作为员工，关联字段应该是empno
	  表B作为下属表，关联字段应该是mgr
select A.*,B.ename `下属姓名`,B.job `下属职位`
from  emp A join emp B on  A.empno = B.mgr  



sales shanghai  7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
aaaa   sz  7499    ALLEN   SALESMAN        7698    1981-02-20      1600.0  300.0   30

1.2.2）left semi join

--左半开连接：就是使用关联条件判断左表中的数据是否在右表中有满足条件的数据，如果有，就显示，如果没有，就不显示

select * from u1 left semi join u2 on  u1.id = u2.id;
+--------+----------+--+
| u1.id  | u1.name  |
+--------+----------+--+
| 2      | b        |
| 3      | c        |
| 7      | y        |
+--------+----------+--+

--查询效果：与内连接只显示左表的数据效果一模一样
	与这个sql语句select u1.* from u1 join u2 on u1.id = u2.id;结果一致。
--查询原理：与exists原理一致。

--扩展：exists的用法
（1）作用：
	
案例演示1：使用exists查询有下属的员工信息。
写法1：select A.* from emp A join emp B on A.empno = B.mgr;
写法2：select * from emp A where exists (select 'total'  from emp B where A.empno = B.mgr)

案例演示2：使用not exists查询没有下属的员工信息。												 select * from emp A where not exists (select 'total'  from emp B where A.empno = B.mgr)

案例演示3：使用exists查询有领导的员工信息。	
写法1：select * from emp where mgr is not null;
写法2：select * from emp A where exists (select 1 from emp B where A.mgr = B.empno)

1.2.3 ) map-side join

-- 什么是map端join：
   将小文件的数据提前加载到map端的内存中，然后map函数读取大文件时，直接进行join连接组合。这样的组合数据就排出了没有满足条件的数据。因此shuffle阶段中的网络IO，就会大大减少数据的传输。从而提高效率。
-- hive 1.x版本以后，已经自动开启了map端join的优化。
   前提1：join中的两张表，必须有一个小表，小表的数据必须少于25000000字节
   hive.auto.convert.join=true
   hive.mapjoin.smalltable.filesize=25000000   约23.8MB
   前提2：innerjoin时 小表文件需要写在左表的位置

1.3 hive的日志

日志的位置由以下属性控制：
第一个是由：hive-site.xml里的hive.querylog.location属性控制
第二个是由conf里的hive-log4j2.properties文件控制，但是需要由hive-log4j2.properties.template拷贝而来

property.hive.log.dir = ${sys:java.io.tmpdir}/${sys:user.name}
property.hive.log.file = hive.log

1.4 Hql的三种执行方式

1.4.1 hive的shell命令

就是使用本地连接模式或者是远程连接模式登陆到hive的命令行客户端里运行hql语句

1.4.2 linux的命令行上使用hive -e

--说明：在linux的命令行上使用hive指令，并使用-e（execute）参数，只能执行一条hql语句。
--语法：[root@qianfeng01 ~]$ hive -e 'hql query'

案例：
[root@qianfeng01 ~]$ hive --database mydb -e 'select * from studentinfo';
[root@qianfeng01 ~]$ hive --database exercise --hivevar ls=2 --hiveconf tn=student -e 'select * from ${hiveconf:tn} limit ${hivevar:ls}';

参数选项说明: 
-e	用于执行hql语句
-f  用于执行sql脚本文件
-S  静音模式，不显示mapreduce执行过程
-i  启动hive时初始化一个文件

1.4.3 linux的命令行上使用hive -f

[root@qianfeng01 ~]# vim myhive.sql

--扩展:也可以在shell脚本中写hql

1.5 总结

--1. 什么时候hive的sql不执行mr程序
	(1) select * from tablename
	(2) select * from tablename limit [m,]n;
--2. hive对子查询支持的不完全。
	(1) hql有子查询时，通常都会有表别名和列别名的应用
	(2) 子查询可以在from子句中，与mysql的用法一致
	(3) 子查询可以在where子句中，但是只能应用[not] in 或者 [not] exists
	(4) 子查询可以在having子句中，但是只能应用[not] in 或者 [not] exists
	(5) 子查询在having子句中时，如果用的是in，并且有聚合函数，那么聚合函数应该在select子句出现
	(6) 子查询不支持在select子句中。
--3. 查询原则
	（1）能不使用子查询，就不用子查询，通常子查询都可以使用join来替换
	（2）能不用join就别用join，但是通常避免不了
	（3）[not] in 通常要替换成 [not] exists
	（4）join时，一定要注意小表驱动大表
--4，聚合函数的count的优化
	#1. 执行效果上：
	- count(*)包括了所有的列，相当于行数，在统计结果的时候不会忽略null值
	- count(1)包括了所有列，用1代表行，在统计结果的时候也不会忽略null值
	- count(列名)只包括列名那一列，在统计结果时，会忽略null值

	#2. 执行效率上：
	- 列名为主键，count(列名)会比count(1)快，count(主键)效率是最高的
	- 列名不为主键，count(1)会比count(列名)快
	- 如果表中有多个列并且没有主键，count（1）的效率高于count(*)
	- 如果表中只有一个字段count(*)效率最高

day04—————

一、数据类型的应用

1.1 Hive数据类型的分类

分为基础数据类型（原始数据类型）和复杂数据类型，如下图

分类	类型	描述	字面量示例
基本类型	BOOLEAN	true/false	TRUE
	TINYINT	1字节的有符号整数 -128~127	1Y
	SMALLINT	2个字节的有符号整数，-32768~32767	1S
	INT	4个字节的带符号整数	1
	BIGINT	8字节带符号整数	1L
	FLOAT	4字节单精度浮点数	1.0
	DOUBLE	8字节双精度浮点数	1.0
	DEICIMAL	任意精度的带符号小数	1.0
	STRING	字符串，变长	“a”,’b’
	VARCHAR	变长字符串,要设置长度	“a”,’b’
	CHAR	固定长度字符串	“a”,’b’
	BINARY	字节数组	无法表示
	TIMESTAMP	时间戳，纳秒精度	122327493795，另一种“yyyy-MM-dd HH:mm:ss”
	DATE	日期	‘2016-03-29’
复杂类型	ARRAY	有序的的同类型的集合	array(1,2)
	MAP	key-value,key必须为原始类型，value可以任意类型	map(‘a’,1,’b’,2)
	STRUCT	字段集合,类型可以不同	struct(‘1’,1,1.0), named_stract(‘col1’,’1’,’col2’,1,’clo3’,1.0)
	UNION	在有限取值范围内的一个值	create_union(1,’a’,63)

1.2 基础数据类型的使用

create table if not exists datatype1(
id1 tinyint,
id2 smallint,
id3 int,
id4 bigint,
salary float,
comm double,
isok boolean,
content binary,
dt timestamp
)
row format delimited 
fields terminated by '\t'
;

233	12	342523	455345345	30000	60000	nihao	helloworld	2017-06-02
126	13	342526	455345346	80000	100000	true	helloworld1	2017-06-02 11:41:30

load data local inpath './data/datatype.txt' into table datatype1;

注意：
（1） 超范围的数据，会解析成null。
（2） 小类型的数据会自动解析成大类型的数据，如int的数据可以自动解析成double/float类型的数据（隐式转换）
（3）timestamp 如果是年月日时分秒的格式，必须是写全，才能映射成功。

1.3 复杂类型之Array类型

1.3.1 说明

--介绍：Array是一个有序的数组类型，可以将一列的多个值映射成数组元素
比如：  zhangsan  100,98,45,60这一行记录中的第二列可以当成一个分组的数组来解析
--使用语法：在建表时，规定一个字段是数组类型
create table tablename(
....
scores array<type>
....
)

1.3.2 案例演示：

--数据准备：
zhangsan	78,89,92,96
lisi	67,75,83,94
王五	23,12

--建表1: 没有指定数组元素分隔符
create table array1(
name string,
scores array<int>
)
row format delimited
fields terminated by '\t';
load data local inpath '/root/data/array.txt' into table array1;


--建表2： 指定数组元素分隔符
create table array2(
name string,
scores array<int>
)
row format delimited
fields terminated by '\t'
collection items terminated by ',';
load data local inpath '/root/data/array.txt' into table array2;

--建表3: 没有指定数组元素分隔符
create table array3(
name string,
scores array<string>
)
row format delimited
fields terminated by '\t';
load data local inpath '/root/data/array.txt' into table array3;

--查询：
select * from array1;   --可以知道类型不匹配，映射数据不成功，用null占位
select * from array3;   --可以发现字符串数组元素 带双引号
select size(scores) from array3; --计算数组的长度
select name,scores[0],scores[10],scores[-1] from array3; --下标可以越界，不报异常

--求：array2表中的学生姓名，第一科成绩，第二科成绩，总科目数
select name,scores[0],scores[1],size(scores) from array2;

--求：有两科成绩的学生信息
select * from array2 where size(scores)=2;

--求：array2表中的每个人的总成绩
select name,scores[0]+scores[1]+nvl(scores[2],0)+nvl(scores[3],0) `total` from array2;

--总结：
 (1) 数组元素的访问使用 colName[index]
 (2) 数组的下标从0开始，下标越界不会报异常，会返回null
 (3) 数组的长度可以使用函数size(colname)来统计
 (4) 数组元素求和时，要注意空值处理，就体现了缺点（如果元素过多，需要写很多nvl）

1.3.3 展开函数和虚拟表的应用

--需求：将表中如下数据
zhangsan	78,89,92,96
lisi	67,75,83,94
王五	23,12

转成如下格式：
zhangsan 78
zhangsan 89
zhangsan 92
zhangsan 96
lisi 67
lisi 75
lisi 83
lisi 94
王五 23
王五 12

-- hive中提供了一个展开函数explode，可以完成以上效果，是UDTF函数的一种
-- 作用：explode可以将集合字段，比如array,map等展开，一行中的一列变多行显示（列转行）
-- 语法： explode(colname)
将 array2表中的scores字段展开

select explode(scores) from array2;   展开后的效果如下
+------+--+
| col  |
+------+--+
| 78   |
| 89   |
| 92   |
| 96   |
| 67   |
| 75   |
| 83   |
| 94   |
| 23   |
| 12   |
+------+--+

-- 将开展后的数据与对应行的其他字段进行关联。需要借助虚拟表 lateral view
-- 虚拟表的作用是可以存储展开函数的数据

案例演示：
select name, score from array2 lateral view explode(scores)  A as score

+-----------+--------+--+
|   name    | score  |
+-----------+--------+--+
| zhangsan  | 78     |
| zhangsan  | 89     |
| zhangsan  | 92     |
| zhangsan  | 96     |
| lisi      | 67     |
| lisi      | 75     |
| lisi      | 83     |
| lisi      | 94     |
| 王五        | 23     |
| 王五        | 12     |
+-----------+--------+--+

案例需求：查询每个学生的总成绩

select name,sum(score)
from array2 lateral view explode(scores)  A as score
group by name

1.3.4 搜集函数的应用

搜集函数在hive中有两个，分别是：
--1. collect_set(colName)  ：搜集的数据会去重
--2. collect_list(colName)：搜集的数据不会去重
--3. 搜集后的结果是一个数组形式。


##数据准备：使用克隆的方式创建array4
create table array4 as select name, score from array2 lateral view explode(scores)  A as score;
insert into array4 values ('王五',12);

--使用collect_set查询，
select name,collect_set(score) from array4 group by name; --结果没有重复的数据
--使用collect_list查询，
select name,collect_list(score) from array4 group by name; --结果有重复的数据

案例演示：不分组和分组的情况
select collect_set(score) from array4;  --将整张表的数据当成一组，搜集到一起
select name,collect_set(score) from array4 group by name;

案例演示：将以下表形式
+--------------+---------------+--+
| array4.name  | array4.score  |
+--------------+---------------+--+
| zhangsan     | 78            |
| zhangsan     | 89            |
| zhangsan     | 92            |
| zhangsan     | 96            |
| lisi         | 67            |
| lisi         | 75            |
| lisi         | 83            |
| lisi         | 94            |
| 王五           | 23            |
| 王五           | 12            |
| 王五           | 12            |
+--------------+---------------+--+
转为：
+--------------+----------------+--+
| array2.name  | array2.scores  |
+--------------+----------------+--+
| zhangsan     | [78,89,92,96]  |
| lisi         | [67,75,83,94]  |
| 王五           | [23,12,12]        |
+--------------+----------------+--+
借助搜集函数完成。
create table array5
as 
select name,collect_list(score) scores from array4 group by name;

1.4 复杂类型之Map类型

1.4.1 案例需求

有以下数据文件
zhangsan	chinese:78,math:89,english:92,nature:96
lisi	chinese:67,math:75,english:83,nature:94
王五	english:23,nature:12

就可以使用复杂类型map类型来对应这个字段。

--语法：
create table tableName(
.....
colName map<T,T>
.....
)

##案例演示：
--建表：
create table map1(
name string,
scores map<string,int>
)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';

load data local inpath '/root/data/map.txt' into table map1;


-- map类型的访问语法： colName[key]
-- 查询每个人的数学成绩：
select name,scores['math'] from map1;
-- 计算每个人的总成绩
select name,nvl(scores['chinese'],0)+nvl(scores['math'],0)+nvl(scores['english'],0)+nvl(scores['nature'],0) `total` from map1;
-- 计算每个人的考试科目数
select name,size(scores) from map1;


-- 案例需求：将map类型的数据进行展开练习，展开如下效果
zhangsan	chinese	78
zhangsan	math	89
zhangsan	english	92
zhangsan	nature	96
lisi	chinese	67
lisi	math	75
lisi	english	83
lisi	nature	94
王五	english	23
王五	nature	12

--（1） 展开map类型的字段
select explode(scores) as (course,score) from map1;
+----------+--------+--+
|  course  | score  |
+----------+--------+--+
| chinese  | 78     |
| math     | 89     |
| english  | 92     |
| nature   | 96     |
| chinese  | 67     |
| math     | 75     |
| english  | 83     |
| nature   | 94     |
| english  | 23     |
| nature   | 12     |
+----------+--------+--+
-- (2) 借助虚拟表，与表中其他字段关联
select name,course,score from map1 lateral view explode(scores) t as course,score
+-----------+----------+--------+--+
|   name    |  course  | score  |
+-----------+----------+--------+--+
| zhangsan  | chinese  | 78     |
| zhangsan  | math     | 89     |
| zhangsan  | english  | 92     |
| zhangsan  | nature   | 96     |
| lisi      | chinese  | 67     |
| lisi      | math     | 75     |
| lisi      | english  | 83     |
| lisi      | nature   | 94     |
| 王五        | english  | 23     |
| 王五        | nature   | 12     |
+-----------+----------+--------+--+

1.4.2 将数据动态保存到map字段里的应用

需求：将以下的表数据
+-----------+----------+--------+--+
|   name    |  course  | score  |
+-----------+----------+--------+--+
| zhangsan  | chinese  | 78     |
| zhangsan  | math     | 89     |
| zhangsan  | english  | 92     |
| zhangsan  | nature   | 96     |
| lisi      | chinese  | 67     |
| lisi      | math     | 75     |
| lisi      | english  | 83     |
| lisi      | nature   | 94     |
| 王五        | english  | 23     |
| 王五        | nature   | 12     |
+-----------+----------+--------+--+
转成以下的形式
+------------+----------------------------------------------------+--+
| map1.name  |                    map1.scores                     |
+------------+----------------------------------------------------+--+
| zhangsan   | {"chinese":78,"math":89,"english":92,"nature":96}  |
| lisi       | {"chinese":67,"math":75,"english":83,"nature":94}  |
| 王五         | {"english":23,"nature":12}                         |
+------------+----------------------------------------------------+--+



--1.  数据准备
create table map2 
as 
select name,course,score from map1 lateral view explode(scores) t as course,score;
+-----------+----------+--------+--+
|   name    |  course  | score  |
+-----------+----------+--------+--+
| zhangsan  | chinese  | 78     |
| zhangsan  | math     | 89     |
| zhangsan  | english  | 92     |
| zhangsan  | nature   | 96     |
| lisi      | chinese  | 67     |
| lisi      | math     | 75     |
| lisi      | english  | 83     |
| lisi      | nature   | 94     |
| 王五        | english  | 23     |
| 王五        | nature   | 12     |
+-----------+----------+--------+--+
--2. 借助concat函数将course和score拼接到一起，拼接符号为冒号
 select name,concat(course,":",score) `result` from map2;
 +-----------+-------------+--+
|   name    |   result    |
+-----------+-------------+--+
| zhangsan  | chinese:78  |
| zhangsan  | math:89     |
| zhangsan  | english:92  |
| zhangsan  | nature:96   |
| lisi      | chinese:67  |
| lisi      | math:75     |
| lisi      | english:83  |
| lisi      | nature:94   |
| 王五        | english:23  |
| 王五        | nature:12   |
+-----------+-------------+--+

--3. 使用搜集函数搜集每一个人的所有科目信息,注意，搜集函数的返回值是数组形式
select name,collect_set(concat(course,":",score))  from map2 group by name
+-----------+----------------------------------------------------+--+
|   name    |                         c1                         |
+-----------+----------------------------------------------------+--+
| lisi      | ["chinese:67","math:75","english:83","nature:94"]  |
| zhangsan  | ["chinese:78","math:89","english:92","nature:96"]  |
| 王五        | ["english:23","nature:12"]                         |
+-----------+----------------------------------------------------+--+

--4. 将数组中的所有元素使用concat_ws函数拼接成一个字符串形式
select name, concat_ws(",",collect_set(concat(course,":",score))) from map2 group by name;
+-----------+------------------------------------------+--+
|   name    |                    c1                    |
+-----------+------------------------------------------+--+
| lisi      | chinese:67,math:75,english:83,nature:94  |
| zhangsan  | chinese:78,math:89,english:92,nature:96  |
| 王五        | english:23,nature:12                     |
+-----------+------------------------------------------+--+

--5. 借助str_to_map函数将字符串转为map类型
select name,str_to_map(concat_ws(",",collect_set(concat(course,":",score)))) from map2 group by name;

--6. 使用克隆的方式，存储查询出来的数据。
create table map3
as
select name,str_to_map(concat_ws(",",collect_set(concat(course,":",score)))) scores from map2 group by name;

--7. 查看表结构
desc map3;
OK
+-----------+---------------------+----------+--+
| col_name  |      data_type      | comment  |
+-----------+---------------------+----------+--+
| name      | string              |          |
| scores    | map<string,string>  |          |
+-----------+---------------------+----------+--+

--8. 计算每个人的英语和自然的成绩之和
select name,scores['english']+scores['nature'] from map3;
OK
+-----------+--------+--+
|   name    |   c1   |
+-----------+--------+--+
| lisi      | 177.0  |
| zhangsan  | 188.0  |
| 王五        | 35.0   |
+-----------+--------+--+

--9：从8的结果中可以得出纯数字的字符串可以自动转成浮点型进行运算。

1.5 复杂类型之Struct类型

1.5.1 说明

struct类型可以理解为java的普通类型的定义

比如使用java定义一个学生类型：
public class student{
	int id,
    string name,
    char gender,
    int age
    .....
}

如果使用hive的struct类型来描述的话，要用一下语法：
create table tablename(
.....
colname struct<propertyName:type,propertyName:type,.......>,
.....
)

1.5.2 案例演示：

-- 数据说明：
文件里没有显示成绩分别是哪一个科目的
zhangsan	78,89,92,96
lisi	67,75,83,94
王五	23,12

--建表：建表时要使用struct类型来描述成绩都是哪一科的
create table struct1(
name string,
scores struct<chinese:int,math:int,english:int,nature:int>
)
row format delimited
fields terminated by '\t'
collection items terminated by ',';
load data local inpath '/root/data/struct.txt' into table struct1;

--查询一下数据：
+---------------+--------------------------------------------------------+--+
| struct1.name  |                     struct1.scores                     |
+---------------+--------------------------------------------------------+--+
| zhangsan      | {"chinese":78,"math":89,"english":92,"nature":96}      |
| lisi          | {"chinese":67,"math":75,"english":83,"nature":94}      |
| 王五            | {"chinese":23,"math":12,"english":null,"nature":null}  |
+---------------+--------------------------------------------------------+--+
--查看结构体：
desc struct1;
OK
+-----------+------------------------------------------------------+----------+--+
| col_name  |                      data_type                       | comment  |
+-----------+------------------------------------------------------+----------+--+
| name      | string                                               |          |
| scores    | struct<chinese:int,math:int,english:int,nature:int>  |          |
+-----------+------------------------------------------------------+----------+--+

--struct类型中属性的访问：colname.propertyname

--查询1：查询每个人的语文成绩
select name,scores.chinese from struct1;

二、Hive函数的应用

2.1 说明

hive函数分两大类，分别是内置函数，和用户自定义函数
--1. 内置函数有常用三种基础函数，日期函数、字符串函数、数值函数
     还有窗口分析函数
--2. 用户可以通过写api来自定义函数，然后加载到hive环境中进行使用  

--3. 查看所有的内置函数： show functions
--4. 查看某一个函数的用法： desc function funcName

2.2 常用内置函数

2.2.1 日期函数

当前系统时间函数：
    current_date()、current_timestamp()、unix_timestamp()
日期转时间戳函数：
	unix_timestamp()
时间戳转日期函数：
	from_unixtime
计算时间差函数：
	datediff()、months_between()
日期时间分量函数：
	year()、month()、day()、hour()、minute()、second()
日期定位函数：
	last_day()、next_day()
日期加减函数：
	date_add()、date_sub()、add_months()
字符串转日期：
	to_date()
日期转字符串(格式化)函数：
	date_format

2.2.2 字符串函数

lower--（转小写）
select lower('ABC');

upper--（转大写）
select upper('abc');

length--（字符串长度，字符数）
select length('abc');

concat--（字符串拼接）
select concat("A", 'B');

concat_ws --（指定分隔符）
select concat_ws('-','a' ,'b','c');

substr--（求子串）
select substr('abcde',3);

split(str,regex)--切分字符串，返回数组。
select split("a-b-c-d-e-f","-");

--练习：统计"hello world hello java hello c++"字符串里的单词出现的频率
select word,count(1) from (select explode(split("hello world hello java hello c++"," ")) word) t group by word;

2.2.3 数值函数

round --四舍五入((42.3 =>42))
select round(42.3);

ceil --向上取整(42.3 =>43)
select ceil(42.3);

floor --向下取整(42.3 =>42)
select floor(42.3);

2.2.4 其他常用函数

cast(value as type) -- 类型转换
select cast('123' as int)+1;

nvl(value,default value):如果value为null,则使用default value，否则使用本身value.

isnull()
isnotnull()

case when  then ....when ...then.. else... end
if(p1,p2,p3)
coalesce(col1,col2,col3...)返回第一个不为空的

2.3 窗口函数（重点）

2.3.1 没有窗口函数的查询

--需求：查询每个员工的信息及其所在部门的平均工资
--1： 先查询每个部门的平均工资
     select deptno, avg(nvl(sal,0)) from emp group by deptno;
--2. 将上述的查询结果当做一张表与emp表关联
 select A.*,B.avg_sal 
 from emp A join ( select deptno, avg(nvl(sal,0)) avg_sal from emp group by deptno) B on A.deptno = B.deptno;
 
--3. 上述需求的最终结果里，有表的明细（每一条记录信息），同时又有统计数据。想要完成上述需求，一个简单的sql不可能完成。

因此：如果能有一个函数同时做到既能显示明细，又能查询统计信息，就会方便很多。这个函数就是开窗函数，又叫窗口函数。

2.3.2 开窗函数over的简介

-1) 窗口函数又名开窗函数，属于分析函数的一种。
-2) 是一种用于解决复杂报表统计需求的函数。
-3) 窗口函数常用于计算基于组的某种值，它和聚合函数的不同之处是：开窗函数对于每组数据返回多行，而聚合函数对于每组数据只返回一行。
	简单的说窗口函数对每条详细记录开一个窗口,进行聚合统计的查询
-4) 开窗函数指定了分析函数工作的数据窗口大小，这个数据窗口大小可能会随着行的变化而变化。
-5) 窗口函数一般不单独使用，要配置其他分析函数一起使用，比如聚合函数，排名函数，序列函数等
-6) 窗口函数内也可以分组和排序

2.3.3 案例演示

-- 数据准备：
姓名,购买日期,购买数量
saml	2018-01-01	10
saml	2018-01-08	55
tony	2018-01-07	50
saml	2018-01-05	46
tony	2018-01-04	29
tony	2018-01-02	15
saml	2018-02-03	23
mart	2018-04-13	94
saml	2018-04-06	42
mart	2018-04-11	75
mart	2018-04-09	68
mart	2018-04-08	62
neil	2018-05-10	12
neil	2018-06-12	80
-- 创建order表:
create table if not exists t_order
(
    name      string,
    orderdate string,
    cost      int
)  row format delimited fields terminated by '\t';
--加载数据:
load data local inpath "./data/order.txt" into table t_order;

--练习1：求每个人的明细及其每个人每个月购买的总数
select A.*,
sum(cost) over(distribute by name, month(orderdate))
from t_order A;

--练习2： 查询每个人的明细及其每个月购买的总数，同时按照日期升序排序
select A.*,
sum(cost) over(distribute by name, month(orderdate) sort by orderdate)
from t_order A;

--注意：
	（1）当over函数中涉及到排序时，就会出现逐行累加的情况。
	（2）当over函数中没有排序只有分组时，每一行显示都是整组的累加之和
	
--练习3：over()里什么都不加 
select A.*,
sum(cost) over(),
max(cost) over(), 
min(cost) over(),
avg(cost) over(),
count(cost) over()
from t_order A;

2.3.4 over函数中添加分组子句

--分组子句：distrubite by colname[,colname........]
--作用：将窗口内的数据按照字段进行分组

--查看顾客的购买明细及商城每个月的销售总额
select *, sum(cost) over (distribute by month(orderdate)) from t_order;

2.3.5 over函数中添加排序子句

-- 排序子句： sort by colName [asc|desc][,colName [asc|desc]...]
-- 作用：将窗口内的数据按照指定字段进行排序

--查看顾客的购买明细及商城每个月的销售总额,并按照日期升序
select *, sum(cost) over (distribute by month(orderdate) sort by orderdate asc) from t_order;
--演示一下over函数中只有排序子句，没有分组子句
select *, sum(cost) over (sort by orderdate asc) from t_order;

2.3.6 over函数的子句总结

--1. distribute by和sort by是一对组合，每个子句也可以单独使用
--2. partition by和order by是一对组合，可以用来替换distribute by和 sort by，功能一致。
--3. 当over函数中有1或2中的组合时，并且是升序时，排序字段和分组字段是同一个时，可以使用cluster by来替换


select *, sum(cost) over(order by orderdate) from t_order;
select *, sum(cost) over(partition by month(orderdate)) from t_order;
select *, sum(cost) over(partition by month(orderdate) order by orderdate) from t_order;

--扩展：研究一下cluster by子句
-- 案例演示：使用distribute by和sort by组合来求每个人的明细和商城每个月的销售总额
select *, sum(cost) over(distribute by orderdate sort by orderdate) from t_order;
select *, sum(cost) over(cluster by orderdate) from t_order1;

2.4 排名函数

共有三个排名函数：
--1. row_number(): 用于给同一组的所有行，按照自然数从1开始分配行号。也就是排名效果
排名的名次是连续的，且不重复的
        98     1
        97     2
        97     3
        96     4
        96     5
        95     6
        		  
--2. rank()： 用于指定字段值的排名。  排名的名次特点是：有并列，但是会出现跳跃效果
        98     1
        97     2
        97     2
        96     4
        96     4
        95     6

--3. dense_rank()：   用于指定字段的排名。 排名的名次特点是：有并列，连续的效果。不会出现名次跳跃情况
        98     1
        97     2
        97     2
        96     3
        96     3
        95     4

--4. 这三种函数不能单独使用，必须配合over函数一起使用。

--练习：数据准备
userid	classno	score
1 gp1808 80   
2 gp1808 92
3 gp1808 84
4 gp1808 86
5 gp1808 88
6 gp1808 70
7 gp1808 98
8 gp1808 84
9 gp1808 86
10 gp1807 90
11 gp1807 92
12 gp1807 84
13 gp1807 86
14 gp1807 88
15 gp1807 80
16 gp1807 92
17 gp1807 84
18 gp1807 86
19 gp1805 80
20 gp1805 92
21 gp1805 94
22 gp1805 86
23 gp1805 88
24 gp1805 80
25 gp1805 92
26 gp1805 94
27 gp1805 86

create table if not exists stu_score(
userid int,
classno string,
score int
)
row format delimited 
fields terminated by ' ';

load data local inpath './data/stu_score.txt' overwrite into table stu_score;

--1. 查询全年级排名，显示三种效果
select *,
row_number() over(sort by score desc) `第一种效果`,
rank() over(sort by score desc) `第二种效果`,
dense_rank() over(sort by score desc) `第三种效果`
from stu_score;

--2. 查询每个班级内的排序，显示三种效果
select *,
row_number() over(distribute by classno sort by score desc) `第一种效果`,
rank() over(distribute by classno sort by score desc) `第二种效果`,
dense_rank() over(partition by classno order by score desc) `第三种效果`
from stu_score;


--3 练习：求每个班级的前三名的学生信息
select * from
(
select *,
row_number() over(distribute by classno sort by score desc) rn
from stu_score) t
where t.rn<4;

day05—————

一、Hive函数的应用（续）

1.1 开窗函数over的总结

--作用：over函数对表的数据进行单独开一个窗口，进行计算统计信息(最大值，最小值，平均值，排名，序列等)，并显示详细信息。
select deptno,max(sal) from emp group by deptno : 按照字段分组，每一组中多条记录最终汇总成一条记录
10，sales, 5000
20，manager,2350

select deptno,max(sal) over(distribute by deptno) from emp: over函数负责规定窗口大小，一组的所有数据就是一个窗口，窗口里的数据是明细，因此有多少条就显示多少条，同时配合其他分析函数统计相应的数据
10,sales,7369,5000
10,sales,7788,5000
10,sales,7799,5000
20,manager,1234,3600
20,manager,2234,3600
.......

--注意事项：
（0）over函数不能单独使用，必须配合其他的统计函数才能使用
（1）over函数里可以没有任何子句，表示就有一个窗口此时表示整张表的数据是一组,在一个窗口里面，窗口不会变大或变小，每条详细信息后的统计数据都是相同的值。
（2）over函数中可以有分组子句（distribute by），表示开窗函数为每一组数据开一个窗口。窗口大小不会变化，
每条详细信息后的统计数据都是这一组的统计信息，是相同的值。
（3）over函数中可以只有排序子句（sort by）,将整张表的数据排序到窗口中，同时统计窗口内现有数据的统计信息，也就是统计信息会变化。这个时候，窗口的特点是会变大。
（4）over函数中有分组子句和排序子句可以共存（distribute by+sort by）:为每一组的数据单独开一个窗口，窗口内的数据进行排序。
（5）partition by 和 order by是一个组合，partiton by可以替换distribute by,order by 可以替换sort by
（6）cluster by相当于 distribute by + sort by asc,而且是同一个字段

1.2 排名函数

row_number() over(...):   名次是连续不重复的
rank() over(....)： 名次是重复不连续
dense_rank()(....)： 名次是重复连续的。

总结：
（1）排名函数必须配合over函数使用，over函数中必须要至少有一个排序子句。
（2）只有排序子句时，是对整张表指定字段进行排名
（3）又有分组子句时，是对组内的数据进行排名

1.3 window子句

--作用，可以使用window子句更细粒度来控制窗口中的逻辑运算。
--关键字：
PRECEDING：往前 
FOLLOWING：往后 
CURRENT ROW：当前行 
UNBOUNDED:无边界
UNBOUNDED PRECEDING：表示从前面的起点， 
UNBOUNDED FOLLOWING：表示到后面的终点 

--案例演示：
select name,orderdate,cost,
sum(cost) over() as sample1,--所有行相加
sum(cost) over(partition by name) as sample2,-- 按name分组，组内数据相加
sum(cost) over(partition by name order by orderdate) as sample3,-- 按name分组，组内数据累加

sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row )  as sample4 ,-- 与sample3一样，由起点到当前行的聚合
       
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING   and current row) as sample5, -- 当前行和前面一行做聚合
       
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING   AND 1 FOLLOWING  ) as sample6,-- 当前行和前边一行及后面一行
       
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 -- 当前行及后面所有行
       
from t_order;

-- 练习：统计当前行和当前行的前两行的和，平均值，最大值，最小值, 使用分组子句并使用排序子句
select *,
sum(cost) over(distribute by name sort by orderdate rows between 2 preceding and current row) sum_a,
avg(cost) over(distribute by name sort by orderdate rows between 2 preceding and current row) avg_a,
max(cost) over(distribute by name sort by orderdate rows between 2 preceding and current row) max_a,
min(cost) over(distribute by name sort by orderdate rows between 2 preceding and current row) min_a from t_order;

1.3 序列函数

1）NTILE

--1. ntile的作用是将同一组内的数据平均分发到num个bucket中，每条记录后标记的就是bucket对应的号码。如果不能均分，那么会优先放入号码比较小的那一个bucket中。
语法： ntile(num) over(....)
--2. 应用场景：用于做抽象查询
案例演示：

select * , ntile(3) over(distribute by name sort by orderdate) rn
from t_order

--练习：分桶后，取每组中的第一桶中的数据
select *
from 
(select * , ntile(3) over(distribute by name) rn
from t_order) A
where A.rn = 1;

2）LAG和LEAD

lag(colname,num[,default value]): 取当前行的前第num行 
lead(colname,num[,default value]): 取当前行的后第num行
--逻辑解析：当指定位置的colname的值为null时，使用default value


--案例演示：  列出每个人详细信息和上一次购买的数量
select *,lag(cost,1,10000) over(distribute by name sort by orderdate) `上次购买数量`
from t_order;
--案例演示：  列出每个人详细信息和大上次购买的数量，以及当前日期的下一次购买的数量
select *,lag(cost,2) over(distribute by name sort by orderdate) `大上次购买数量`,
lead(cost,1) over(distribute by name sort by orderdate) `下次购买数量`
from t_order;

--案例演示： 列出每个人的详细信息和上一次购买数据，以及与上一次购买数据的差值。
select *,
lag(cost,1,10000) over(distribute by name sort by orderdate) last_cost,
cost-lag(cost,1,10000) over(distribute by name sort by orderdate) as sub
from t_order;

3）FIRST_VALUE和LAST_VALUE

first_value(colname): 返回组内排序后的第一条记录指定字段的值
last_value(colname): 返回组内排序后相对于当前行来说的最后一条记录指定字段的值

--案例演示：
--需求：求分组排序后的第一个值
select *,
first_value(cost) over(distribute by name sort by orderdate)
from t_order;

--需求：求分组排序后的最后一个值，但是相当当前行来说，当前行就是最后一行。
select *,
last_value(cost) over(distribute by name sort by orderdate)
from t_order;

--需求：求组内排序的最后一个值，可以变换思维，使用降序排序后，使用first_value取第一个值。
select *,
first_value(cost) over(distribute by name sort by orderdate desc)
from t_order;

1.4 自定义函数

1.4.1 简介

--1. 当hive的内置函数不能满足我们的需求时，我们可以通过自定义函数来帮助完成我们的需求。
--2. 自定义函数分三类：
	（1）udf函数：  一对一的输入和输出
	（2）udaf函数： 多对一的输入和输出，比如聚合函数的count,max，min等
	（3）udtf函数： 一对多的输入和输出，比如展开函数 explode

1.4.2 自定义udf函数

--如何自定义udf函数？？？？
（1） 自定义一个java类型，继承UDF类型
（2） 在自定义的类内部定义一个必须叫evaluate名的方法。方法的形式参数的类型和个数，开发人员根据业务需求决定。当在翻译sql中的函数时，会自动查询自定义类型中的形参个数和类型匹配的evaluate方法，找到了就执行方法逻辑，找不到不报异常提醒。

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>2.1.1</version>
</dependency>

1.4.3 第一个udf函数案例：小写转大写

package com.qf.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * 需求：自定义一个函数，完成小写字母转大写字母的需求。
 *
 * （1） 继承UDF类型
 * （2） 定义一个evaluate方法
 */
public class ToUpper extends UDF {
    /**
     *
     * @param content  是需要转换的字符串
     * @return   转换完的大写的字符串
     */
    public String evaluate(String content){
        //第一步，判断参数是否有意义
        if(content==null||"".equals(content)){
            throw new RuntimeException("传入的数据不符合要求");
        }
        //直接调用字符串类型的toUpperCase()方法
        return content.toUpperCase();
    }

    public static void main(String[] args) {
        ToUpper toUpper = new ToUpper();
        System.out.println(toUpper.evaluate("AbcDe"));
    }
}

1.4.4 自定义函数的使用方式

方法1）命令加载，只针对当前session有效

1）打包成jar文件，上传到linux上
2）在hive的命令行上，将jar文件引用到hive的classpath里（jar也可以直接放到hive的lib目录下，此步可以省略）
   add jar filepath
3）create temporary function funcName as '类全名' 
4）show functions;
5）删除临时函数：drop temporary function funcName

方法2）使用hive -i 加载配置文件的形式（当前会话生效）

参考文档

方式3）使用hive的bin目录下的.hiverc文件的形式

参考文件
特点：使用hive指令时，会主动加载.hiverc文件的内容

1.4.5 案例之出生日期转换成年龄

参考文档，或者代码

1.4.6 案例之通过key取出value值

sex=1&hight=180&weight=100&sal=2000000
sex=0&hight=176&weight=101&sal=1000000
sex=1&hight=170&weight=102&sal=2000000
sex=0&hight=169&weight=103&sal=3000000
sex=0&hight=188&weight=104&sal=7000000
sex=1&hight=187&weight=105&sal=9000000

create table content(
text string
);
load data loca inpath '/root/data/content.txt' into table content;

--克隆表：
create table content1 
as
select 
superfunction(text,"sex") sex,
superfunction(text,"sex") hight,
superfunction(text,"sex") weight,
superfunction(text,"sex") sal
from content;

参考代码中的如下方法：
public String evaluate(String str, String key)

1.4.7 案例之正则表达式

create table log100(
log string
);

load data local inpath './data/access.log' into table log100;

-- 查询应用
select superfunction(log,1.0) from log100 limit 10;

216.244.66.231  20150409 063401 GET     /robots.txt     HTTP    502     Mozilla
220.181.108.104 20150409 070020 GET     /xref/linux-3.18.6/     HTTP    502     Mozilla
216.244.66.249  20150409 070143 GET     /robots.txt     HTTP    502     Mozilla
211.0.154.125   20150409 071139 GET     /       HTTP    502     Mozilla
180.76.15.27    20150409 071139 GET     /robots.txt     HTTP    502     Mozilla
180.76.15.152   20150409 073033 GET     /       HTTP    502     Mozilla
180.76.15.15    20150409 073111 GET     /       HTTP    502     Mozilla
216.244.66.249  20150409 073140 GET     /robots.txt     HTTP    502     Mozilla

--参考代码中的：public String evaluate(String log,double label)

1.4.8 案例之json字符串解析

create table if not exists movie_json
(json string); 

load data local inpath './data/rating.json' into table moive_json;


--查询应用：
select superfunction(json,1,"aaa") from movie_json limit 10;

1193    5       978300760       1
661     3       978302109       1
914     3       978301968       1
3408    4       978300275       1
2355    5       978824291       1
1197    3       978302268       1
1287    5       978302039       1
2804    5       978300719       1
594     4       978302268       1
919     4       978301368       1

--参考代码中的：public String evaluate(String json,long label,String l)

--克隆表
create table if not exists movie_rate
as
select 
split(superfunction(json),'\t')[0] as movie,
split(superfunction(json),'\t')[1] as rate,
split(superfunction(json),'\t')[2] as times,
split(superfunction(json),'\t')[3] as uid
from movie_json;

1.4.9 内置json解析函数的应用

select get_json_object(json,"$.movie") movie,
get_json_object(json,"$.rate") rate,
get_json_object(json,"$.datetime") datetime,
get_json_object(json,"$.uid") uid
from movie_json;

二、hive的分区

2.1 为什么要分区

-- 就是为了避免hive查询时进行全表扫描而降低效率,引入分区(partition)的概念.
-- 就是在建表时指定分区，然后将表数据存在不同的分区（本质就是表目录中的子目录）中
-- 查询时，可以指定某一个分区进行查询，从而避免全表扫描

2.2 如何分区

--1. 根据具体的数据内容进行设置分区的级别和名称。
--2. 通常使用的是日期、地域等
--3. 比如按照日期分区，就可以设置一级分区：只需要设置不同的年份即可，同一年的数据都放到同一个子目录
	 比如按照日期分区，可以设置二级分区：一级是年份，二级是月份，数据是存在月份目录下
	 比如按照日期分区，可以设置三级分区：一级是年份，二级是月份，三级是日，数据存在三级目录下

--4.建表语法：
create table tableName(
....
)
partitioned by(colname type[,....])
--5. 从建表语句中可知分区字段属于表外字段，不是表内字段。

2.3 注意事项

--1. 分区名不区分大小写，不能用中文
--2. 分区的本质是子目录，目录的名字  “分区名=值”
--3. 分区字段是表外字段，但是desc查询表结构时，可以查看出来，同时可以当正常字段进行使用。
--4. 分区字段的值没有存在文件中，在元数据库里维护的。

2.4 分区的应用

2.4.1 一级分区的使用

--创建分区表
create table part1(
id int,
name string,
age int
)
partitioned by(dt string)
row format delimited
fields terminated by '|';

load data local inpath '/root/data/user1.txt' into table part1 
partition(dt="2020-05-05");
load data local inpath '/root/data/user2.txt' into table part1 
partition(dt="2020-05-05");

2.4.2 二级分区的使用

create table part2(
id int,
name string,
age int
)
partitioned by(year string,month string)
row format delimited
fields terminated by '\t';

load data local inpath '/root/data/user1.txt' into table part2 partition(year="2020",month="10");
load data local inpath '/root/data/user2.txt' into table part2 partition(year="2019",month="11");

2.4.3 三级分区的使用

create table part3(
id int,
name string,
age int
)
partitioned by(year string,month string,day string)
row format delimited
fields terminated by '\t';

load data local inpath '/root/data/user1.txt' into table part3 partition(year="2020",month="10",day="13");
load data local inpath '/root/data/user2.txt' into table part3 partition(year="2020",month="10",day="14");

2.5 分区的查看

show partitions tableName

2.6 分区的增删改

2.6.1 修改分区

--说明：修改分区指的是使用location更换分区映射的路径(location后接的hdfs路径需要写成完全路径)


alter table part3 partition(year='2019',month='10',day='23') set location '/user/hive/warehouse/mydb1.db/part1/dt=2018-03-21';    --错误使用

#:修改分区，指的是修改分区字段值对应的映射位置。

alter table part3 partition(year='2020',month='10',day='13') set location 'hdfs://mei01:8020/user/hive/warehouse/part1/dt=2020-05-06';

2.6.2 增加分区

1）新增分区（空）
alter table part3 add partition(year='2020',month='05',day='02');

alter table part3 add partition(year='2020',month='05',day='03') partition(year='2020',month='05',day='04');

2）新增分区 (带数据)
alter table part3 add partition(year='2020',month='05',day='05') location '/user/hive/warehouse/part1/dt=2020-05-06';


3） 新增多分区
alter table part3 add 
partition(year='2020',month='05',day='06') 
location '/user/hive/warehouse/part1/dt=2020-05-05'
partition(year='2020',month='05',day='07') 
location '/user/hive/warehouse/part1/dt=2020-05-06';

2.6.3 删除分区

1）删除单个分区
alter table part3 drop partition(year='2020',month='05',day='07');

2）删除多个分区
alter table part3 drop partition(year='2020',month='05',day='06'),partition(year='2020',month='05',day='06');

day06 —————

一、Hive的分区(续)

2.7 分类的类型

2.7.1 简介

根据加载数据的方式可以将分区分为以下三种类型：
（1）静态分区：使用load指令向指定分区加载数据，这种分区数据静态分区
（2）动态分区：使用insert into方式向表中加载数据，分区值是由数据来决定的，不是开发人员控制的
（3）混合分区： 加载数据时，有静态分区，还有动态分区

-- 属性说明：
hive.exec.dynamic.partition=true：是否支持动态分区操作
hive.exec.dynamic.partition.mode=strict/nonstrict:  严格模式/非严格模式
  严格模式：动态分区导入数据时，必须指定至少一个静态值。
  非严格模式：动态分区导入数据时，可以不指定静态值
hive.exec.max.dynamic.partitions=1000: 总共允许创建的动态分区的最大数量
  动态导入数据时，每个表支持的最大动态分区数量
hive.exec.max.dynamic.partitions.pernode=100:in each mapper/reducer node
 每个MapTask/reduceTask最多同时处理的分区数量

2.7.2 动态分区案例演示

--1. 创建分区表（建表时是不分动态和静态形式）

create external table dy_part1(
sid int,
name string,
gender string,
age int,
academy string
)
partitioned by (dt string)
row format delimited fields terminated by ','
;

--2. 为分区表动态导入数据：分区的值由数据文件中的日期来决定 ，必须使用insert into
(1) 先维护一个临时表：
create external table tmp_part1(
sid int,
name string,
gender string,
age int,
academy string,
dat string
)
row format delimited 
fields terminated by ','
;
95001,李勇,男,20,CS,2017-8-31
95002,刘晨,女,19,IS,2017-8-31
95003,王敏,女,22,MA,2017-8-31
95004,张立,男,19,IS,2017-8-31
95005,刘刚,男,18,MA,2018-8-31
95006,孙庆,男,23,CS,2018-8-31
95007,易思玲,女,19,MA,2018-8-31
95008,李娜,女,18,CS,2018-8-31
95009,梦圆圆,女,18,MA,2018-8-31
95010,孔小涛,男,19,CS,2017-8-31
95011,包小柏,男,18,MA,2019-8-31
95012,孙花,女,20,CS,2017-8-31
95013,冯伟,男,21,CS,2019-8-31
95014,王小丽,女,19,CS,2017-8-31
95015,王君,男,18,MA,2019-8-31
95016,钱国,男,21,MA,2019-8-31
95017,王风娟,女,18,IS,2019-8-31
95018,王一,女,19,IS,2019-8-31
95019,邢小丽,女,19,IS,2018-8-31
95020,赵钱,男,21,IS,2019-8-31
95021,周二,男,17,MA,2018-8-31
95022,郑明,男,20,MA,2018-8-31

load data local inpath './data/student2.txt' into table tmp_part1;    
(2) 使用insert into 向分区表中动态导入数据

 先修改为非严格模式：set hive.exec.dynamic.partition.mode=strict;
 insert into dy_part3 partition(dt) select * from tmp_part1;

--注意：想要从临时表中动态导入数据到分区表时，要赋值给分区字段的临时表中的字段必须在最后一列

2.7.3 混合分区案例演示：

--1. 创建分区表
create table dy_part2(
id int,
name string
)
partitioned by (year string,month string,day string)
row format delimited fields terminated by ','
;

--2. 向分区表中导入混合分区数据
（1） 维护一个临时表
create table temp_part2(
id int,
name string,
year string,
month string,
day string
)
row format delimited fields terminated by ','
;
数据如下：
1,廉德枫,2019,06,25
2,刘浩(小),2019,06,25
3,王鑫,2019,06,25
5,张三,2019,06,26
6,张小三,2019,06,26
7,王小四,2019,06,27
8,夏流,2019,06,27

load data local inpath './data/temp_part2.txt' into table temp_part2;
（2）向分区表中动态导入数据
insert into dy_part2 partition(year="2020",month,day) select id,name,month,day from temp_part2;

2.8 分区表注意事项

1. hive的分区使用的是表外字段，分区字段是一个伪列，但是分区字段是可以做查询过滤。
2. 分区字段不建议使用中文
3. 一般不建议使用动态分区，因为动态分区会使用mapreduce来进行查询数据，如果分区数据过多，导致namenode和resourcemanager的性能瓶颈。所以建议在使用动态分区前尽可能预知分区数量。
4. 分区属性的修改都可以修改元数据和hdfs数据内容。

二、Hive的分桶

2.1 分桶的简介

1）为什么要分桶
	当某一个分区的数据或者没分区的表的数据量特别大时，可以将数据分散到不同的文件（一个文件就是一个桶）中进行存储，然后可以指定某一个桶进行查询，这样就可以更好的提高查询性能。
2）分桶的原理
	调用分桶字段的hash值对桶的数量取模，某一条记录应该存入哪一个桶中，由这一条记录中的分桶字段值取模后的结果来决定，比如有5个桶，某一个记录的分桶字段值对5取模等于1，那么这一条记录就存储1对应的文件中。
	与mr的hashPartitioner一样
3）分桶的意义
	-- 将数据存储到不同的桶文件中
	-- 可以使用分桶查询做抽样
	-- 多表join查询时，使用分桶效率更高
4）分桶建表语法
create table tableName(
.......
)
clustered by (colName) [sorted by(colname [asc|desc])] into num buckets
....
5）分桶表导入数据，只能使用动态方式导入数据

2.2 案例演示

--1. 创建分桶表
drop table student1;

create table student1(
sno int,
name string,
sex string,
age int,
academy string
)
clustered by (sno) sorted by (age desc) into 5 buckets
row format delimited 
fields terminated by ','
;

--2. 导入数据: 应该使用insert into 动态导入数据，使用load导入不会出现分桶情况。
（1）先维护一个临时表
create table temp_student(
sno int,
name string,
sex string,
age int,
academy string
)
row format delimited 
fields terminated by ','
;

load data local inpath './data/student2.txt' into table temp_student;
（2）动态导入数据
insert into student1 select * from temp_student distribute by(sno)  sort by (age desc)
;
或者
insert overwrite student1 select * from temp_student distribute by(sno) sort by (age desc)
;

2.3 注意事项

--1. 在2.1.1版本里，底层实现了强制分桶，强制排序策略
   即： 正规写法要带上distribute by(分桶字段)[sort by 排序字段]，如果没有带上，也会分桶和排序。
--2. 使用insert into时可以不加关键字table. 使用insert overwrite时必须带关键字table.
--3. 因为底层实行了强制分桶策略，所以修改mapreduce.job.reduces的个数，不会影响桶文件数据。但是会影响真正执行时reduceTask的数量。是
     真正的reduceTask的数量是最接近mapreduce.job.reduces的数量的因子。如果是素数，就使用本身。

2.4 分桶查询

--分桶查询语法：tablesample(bucket x out of y on colName);
x：表示查询第几桶的数据
y: 表示桶的总数，在2.1.1.版本以后，y可以自定义，也就是不一定非要是建表时指定的桶的数量

--案例1： 查询整张表的数据
select * from student1;
select * from student1 tablesample(bucket 1 out of 1 on sno);
--案例2：查询八桶中的第三桶
select * from student1 tablesample(bucket 3 out of 8 on sno);
--案例3：查询八桶中的第三桶和第五桶
select * from student1 tablesample(bucket 3 out of 8 on sno)
union
select * from student1 tablesample(bucket 5 out of 8 on sno);
--案例4：查询八通中的第二桶和第六桶
select * from student1 tablesample(bucket 2 out of 4 on sno);

	01 09 %8  =  1   对4取模时，就会得到相同的值       %4     = 1
	05 13 %8  =  5         						   %4     = 1
--案例5： 查询7桶的第2桶
select * from student1 tablesample(bucket 2 out of 7 on sno);

2.5 总结

--1. 分区和分桶在建表时，关键字上都带ed
    (1)分区的是partitioned by
    (2)分桶的是clustered by .. sorted by ...
--2. 动态导入数据时，
	(1)分区的是partition by(colname="static value",colname)
	(2)分桶的是distribute by(colname) sort by(colname)
    (3)当分桶的字段和排序字段一致，并且是升序时，可以使用cluster by(colname)	
--3. 本质区别：
	（1）分区的本质是分多个子目录来管理表数据
	（2）分桶的本质是将目录中的大文件划分为多个小文件（桶文件）来管理数据
	（3）两者都是优化手段，但是分桶比分区更细粒度。
	（4）分区是表外字段，分桶是表内字段。

三、Hive的SerDe

3.1 hql读取/动态加载数据的解析

--建表语句：

create table t2(
id int,
name string
)
row format delimited
fields terminated by '\t';

--第一个类型：
实际上delimited 对应一个类型：org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
fields terminated by 后指定的分隔符，是要传入LazySimpleSerDe类型中的切分方法的形参的。

--底层的第二个类型：
INPUTFORMAT 对应的具体的输入规则的类型，默认是：org.apache.hadoop.mapred.TextInputFormat
此类型主要决定记录阅读区如何解析一行记录，通常使用换行符作为记录分隔符。

--底层的第三个类型：org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
OUTPUTFORMAT 指向的具体的输出格式的类型，决定了在动态加载数据时，如何存储数据到文字中，通常会忽略Key3.


--select 过程：
     是数据文件“反序列化”到内存中的一个过程。  数据文件中的一条记录----->row对象
--insert 过程：
     是内存中的数据“序列化”到磁盘文件中的过程。  row对象-------->数据文件里的记录

3.2 常用的SerDe类型

-- 默认SerDe
-- CSVSerDe
-- JSONSerDe
-- RegexSerDe
-- TSVSerde

row format delimited fields terminated by '符号'  -- 指定列分隔符
line terminated by  '\n'                   -- 指定行分隔符

3.2.1 LazySimpleSerde

-- LazySimpleSerde是hive表默认使用的SerDe类型  
-- 对应的默认行分隔符是\n
-- 对应的默认列分隔符是^A ( ctrl+V 和ctrl+A), 对应的ascii码是'\001'

-- 建表语句如下：
create table if not exists csv1(
uid int,
uname string,
age int,
gender char(1)
);  <--------格式都没有指定，默认使用的就是LazySimpleSerde,记录分隔符是\n，列分隔符是^A

-- 可以使用fields terminated by子句来修改默认的列分隔符
drop table csv2;
create table if not exists csv2(
uid int,
uname string,
age int,
gender char(1)
)
row format delimited 
fields terminated by '\001';  <------使用^A的ascii码来指定分隔符


-- 可以使用fields terminated by子句来修改默认的列分隔符
create table if not exists csv3(
uid int,
uname string,
age int,
gender char(1)
)
row format delimited 
fields terminated by ',';  <------告诉SerDe里的方法使用逗号作为分隔符。

3.2.2 CSVSerDe

-- CSV 是Comma-Separated Values的简写，逗号分隔值，也可以称之字符分隔值
-- CSV 对应的具体SerDe类型是：org.apache.hadoop.hive.serde2.OpenCSVSerde.class
-- 默认列分隔值是逗号
-- 默认行分隔值是\n
-- 常用的三个属性：
    1. 默认转义字符(DEFAULT_ESCAPE_CHARACTER): 	\ 	<--反斜线
    2. 默认引用字符(DEFAULT_QUOTE_CHARACTER):		"	<--双引号    
    3. 默认分隔符(DEFAULT_SEPARATOR):			 ，	<-逗号
    
    
-- 案例演示：使用CSVSerDe
-- 1. 不修改默认列的分隔符建表
drop table csv2;
create table if not exists csv2(
uid int,
uname string,
age int
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
;
-- 2. 导入数据
1001,zhangsan,23
1002,lisi,24
1003,wangwu,25
1004,,26

load data local inpath './data/csv2.txt' into table csv2;


--案例演示：修改默认列分隔符的建表形式：
create table csv3(
uid int,
uname string,
age int
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"="7",
"qutoeChar"="'",
"escapeChar"="\\")
stored as textfile
;

--导入数据：
10017zhangsan723
10027lisi724
10037wangwu725
10047726
load data local inpath './data/csv3.txt' overwrite into table csv3;

3.2.3 JsonSerde

--1. hive本身没有JsonSerDe类型
--2. 可以使用自己编写的，也可以使用第三方的JsonSerDe类型
	json-serde-1.3.8-jar-with-dependencies.jar
--3. 在使用JsonSerDe时，应该将jar包加载到classpath下
	（1）可以使用add jar临时加载
	（2）可以将jar包放入到hive的lib目录下


-- 数据如下：是一个json格式的数据，
{"uid":"1","uname":"gaoyuanyuan","age":"18"}
{"uid":"2","uname":"zhaoyouting","age":"42"}
{"uname":"fanbingbing","uid":"3","age":"42"}

load data local inpath './data/json1.txt' overwrite into table json1;
-- 为上述数据文件，创建表，使用JsonSerDe类型
drop table json1;
create table json1(
uid int,
uname string,
age int
)
row format serde 'org.openx.data.jsonserde.JsonSerDe';

JSON的应用场景：复杂类型的使用

--1. 原始数据：
zs math:100,98,76 chinese:98,96,100 english:100,99,90  
ls math:60,80,70 chinese:90,66,91 english:50,51,70  

-------------------------------------------
create table t1(
uname string,
math map<string,array<int>>,
....
)
row format delimited 
fields terminated by ' '
map keys terminiated by ':'
collection items terminated by','
----上述的表语句需要执行map的元素分隔符，还要指定数组的元素分隔符，都用到了collection.因此一个collection无法指定多个分割符。这种情况，只能换思路，使用比较合适的serde去导入数据(jsonSerde最适合)


-- 清洗数据：
可以使用编程语言转成下面json字符串  
{"uname":"zs","score":{"math":[100,98,76],"chinese":[98,96,100],"english":[100,99,90]}}  
{"uname":"ls","score":{"math":[60,80,70],"chinese":[90,66,91],"english":[50,51,70]}}

--建表：
create table if not exists complex(
uname string,
score map<String,array<int>>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
stored as textfile
;
--导入数据
load data local inpath './data/json2.txt' into table complex;

3.2.4 RegexSerDe

-- 为什么要提供RegexSerde
	因为其他的SerDe都不支持多字符作为分隔符，因此提供了正则SerDe.
-- 正则SerDe对应的具体类型：org.apache.hadoop.hive.serde2.RegexSerDe


--案例演示：
01||zhangsan||23
02||lisi||24


create table if not exists t_regex(
id string,
uname string,
age int,
gender char(1)
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties(
'input.regex'='(.*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)',
'output.format.string'='%1$s %2$s %3$s %4$s'
)
stored as textfile
;

load data local inpath './data/regex.txt' into table t_regex;

day07 —————

一、Hive的文件存储格式

1.1 文件存储格式说明

hvie的文件存储格式指的是在写建表语句时，stored as "" 子句指定的格式，会在hdfs上使用相应的数据存储格式存储数据。

对应的属性：
--1. 默认的存储格式属性设置
<property>
    <name>hive.default.fileformat</name>
    <value>TextFile</value>
</property>

--2. 内部表（管理表的）默认存储格式，当设置为none时，使用hive.default.fileformat对应的值。
<property>
    <name>hive.default.fileformat.managed</name>
    <value>none</value>
</property>

--3. 常用的存储格式分两大类，分别是纯文本文件和二进制文件
（1）纯文本文件指的是TextFile，也是默认的
（2）二进制存储文件有：sequenceFile,rcfile,orc,parquet

1.2 常用的存储格式案例演示：

参考文档

二、Hive的索引和视图

2.1 索引

2.1.1 索引简介

(1)索引是SQL的一个对象（表、视图、库、序列这些都是SQL中的对象）
(2)关键字是Index
(3)使用索引的目的是提高查询效率，避免全表扫描
(4)用户只需要为相关的表创建索引对象，当sql中的条件中含有索引对应的字段时，就会自动使用索引数据。
(5)当有索引的表的数据更新（增加，修改，删除）时，索引信息要重新维护
(6)索引数据是有序的，需要单独存储的。

2.1.2 索引案例演示：

参考文档

注意：hive默认是没有启动索引查询的，需要修改以下属性为true
SET hive.optimize.index.filter=true;

2.2 视图

2.2.1 视图的简介

（1）视图也是SQL中的一个对象。
（2）视图的关键字是view
（3）使用视图的目的是简化查询，表中局部暴露（隐藏敏感数据）
（4）视图的本质就是真实表或者一个DQL语句的一个投影
（5）hive的视图是只读的，不能通过视图对原表进行修改。

2.2.2 视图的应用

参考文档

三、Hive的压缩机制

3.1 Hive的压缩机制简介

（1）hive的压缩机制针对的是insert into操作  
（2）hql都会被翻译成MR,所以hive的压缩机制指的就是MR的压缩机制
（3）压缩的位置有两处，分别是map阶段的输出和reduce阶段的输出
（4）map阶段的输出，由于是临时存储，所以选择的压缩算法应该选择一个不能太影响cpu的性能开销的算法
	（deflate算法，snappy算法，lzo算法）
（5）	参数整理：
	--map端的属性
		hive.exec.compress.intermediate=[false|true],默认值是false
		hive.intermediate.compression.codec  压缩编码器对应的类文件。
		hive.intermediate.compression.type   压缩时的类型，可以是按照RECORD或者BLOCK压缩
	--reduce端的属性
    	hive.exec.compress.output=[false|true],默认是false，表示不开启压缩
    	如果开启压缩，使用的压缩算法是map端指定的

3.2 压缩算法对应的类型

压缩格式	压缩编码器
deflate	org.apache.hadoop.io.compress.DefaultCodec
gzip	org.apache.hadoop.io.compress.GzipCodec
bzip2	org.apache.hadoop.io.compress.BZip2Codec
lzo	com.hadoop.compression.lzo.LzopCodec（中间输出使用）
snappy	org.apache.hadoop.io.compress.SnappyCodec（中间输出使用）

3.3 常用的压缩格式比较

压缩格式	压缩比	压缩速度	需要安装	支持切分
bzip2	最高	慢	否	是
gzip	很高	比较快	否	否
snappy	比较高	很快	是	否
lzo	比较高	很快	是	是(需要建立索引)

四、hive的优化

4.1 hql查询时的严格模式

hive.strict.checks.large.query=[false|true],默认是false，表示不开启检查大数据查询

如果开启了会禁止一下操作：
Cartesian Product.   笛卡尔积
No partition being picked up for a query. ：分区表查询时没有指定分区
Orderby without limit.	：order by子句后面没有limit子句
Comparing bigints and strings.  ： bigint和string的比较
Comparing bigints and doubles.	： bigint和double的比较

重点划分：

--1：hive是什么
--2. hive的体系结构和工作流程
--3. hive的本地模式和远程模式的区别
--4. hive表类型
--5. hive的复杂数据类型
--6. hive的开窗函数
--7. 分区和分桶
--8. serde
--9. 存储格式


distribte by（分桶）: 调用数据的hash值计算属于哪一个桶里的，然后将数据存储对应的桶中。
sort by:    排序是组内排序，   
partition by（动态分区）： 相当于group by分组
order by:    在having后的order by的，对正个表的所有数据进行全排序，reduceTask的个数是1
cluster by ：当distribute by 和sort by组合时，并且是同一个字段，排序是升序时，就可以使用cluster by替换。

努力成为一名资深大数据选手

关注

0
点赞
踩
3

收藏

觉得还不错? 一键收藏
0
评论
4.hive知识点

一、复习1.1 Linux- linux的安装- 基本命令- 相对路径和绝对路径- 重定向符号：> 和 >> - 管道符号：|- VIM/VI编辑器- 高阶命令- 软件管理- 克隆- 免密登录认证（重点）- 静态IP的管理（重点）- 服务的管理- 时间同步（重点）- shell（重点）1.2 HDFS- 大数据的概念（1）概念（2）特征（3）大数据工作流程： -1）数据采集：数据源的分类有行为数据(日志文件)、业务数据(RDBM
复制链接

扫一扫

专栏目录