Apache Hive

MySQL 5.6 Offline Installation

# Versions
	CentOS 7
	MySQL 5.6
# Prepare a Linux virtual machine
	1 GB memory
	1 CPU
	ip / hosts / hostname configured
	firewall disabled
	passwordless SSH login
 # 1. Upload the MySQL installation package
 # 2. Uninstall mariadb
 	Check whether mariadb is installed:
 	rpm -qa | grep mariadb
 	Remove the mariadb package:
 	rpm -e --nodeps mariadb-libs
 # 3. Install the tools MySQL depends on
 	autoconf, net-tools, perl, openssl-devel
 	yum install -y autoconf net-tools perl openssl-devel
 # 4. Install MySQL
 	Extract:
 	tar xvf MySQL-5.6.46-1.el7.x86_64.rpm-bundle.tar -C mysql/
	Install (in this order):
		MySQL-client-5.6.46-1.el7.x86_64.rpm
		MySQL-devel-5.6.46-1.el7.x86_64.rpm
		MySQL-embedded-5.6.46-1.el7.x86_64.rpm
		MySQL-shared-5.6.46-1.el7.x86_64.rpm
		MySQL-shared-compat-5.6.46-1.el7.x86_64.rpm
		MySQL-test-5.6.46-1.el7.x86_64.rpm
		MySQL-server-5.6.46-1.el7.x86_64.rpm
	Installation commands:
		rpm -ivh MySQL-client-5.6.46-1.el7.x86_64.rpm
		rpm -ivh MySQL-devel-5.6.46-1.el7.x86_64.rpm
		rpm -ivh MySQL-embedded-5.6.46-1.el7.x86_64.rpm
		rpm -ivh MySQL-shared-5.6.46-1.el7.x86_64.rpm
		rpm -ivh MySQL-shared-compat-5.6.46-1.el7.x86_64.rpm
		rpm -ivh MySQL-test-5.6.46-1.el7.x86_64.rpm
		rpm -ivh MySQL-server-5.6.46-1.el7.x86_64.rpm
# 5. Initial MySQL configuration
	Get the generated default password:
		[root@hive41 mysql]# cat /root/.mysql_secret
		 password: rzvx0yhh4qn368ng
	Start MySQL:
		[root@hive41 mysql]# systemctl start mysql
	Log in:
		[root@hive41 mysql]# mysql -u root -p
		password: the password obtained above
	Change the password:
		mysql> set password = password('mysql');
	Enable remote access (commands):
		show databases;
		use mysql;  -- switch into the mysql database
		select host,user,password from user;
		grant all privileges on *.* to root@'%';  -- grant all privileges to root from any IP
		grant all privileges on *.* to root@'%' identified by 'mysql';  -- same grant, but also sets the password for that account
		select host,user from user;
		select host,user,password from user;  -- check that a record with host '%' now exists
	Set the password for remote access (updates every password):
		update user set password = password("mysql");
	Reload the privilege tables:
		flush privileges;
	Exit, then restart the MySQL service:
		systemctl restart mysql


# Verify access from a remote Linux client tool
ip:192.168.153.41
user:root
password:mysql

MySQL password expiry

1. In the mysql database's user table, setting the password_expired column to 'N' means the password never expires.
2. update user set password_expired='N'  (the full sequence is sketched below)
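
A minimal sketch of the full sequence, assuming the change should apply to the root account configured above:

	use mysql;
	update user set password_expired = 'N' where user = 'root';
	flush privileges;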

Hive

Overview

The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extract-transform-load (ETL) and a mechanism for storing, querying, and analyzing large-scale data kept in Hadoop. Hive (in the 1.x line) executes queries strictly through Hadoop's MapReduce job model: the HiveQL statements written by the user are translated by an interpreter into jobs submitted to the Hadoop cluster, Hadoop monitors the job execution, and the results are returned to the user. Hive is not designed for online transaction processing: it provides neither real-time queries nor row-level updates. Its sweet spot is batch processing over large data sets, for example web log analysis.
Online transaction processing (OLTP) refers broadly to RDBMSs: mature transaction handling and low latency.

Hive (like Impala) is a data warehouse tool based on Hadoop. It maps structured data files onto database tables and provides full SQL query capability by translating SQL statements into MapReduce jobs. Its advantage is a low learning curve: simple MapReduce-style statistics can be expressed quickly with SQL-like statements, without writing a dedicated MapReduce application, which makes it a good fit for statistical analysis in a data warehouse. Hive defines a simple SQL-like query language called HQL that lets users familiar with SQL query the data; it also lets developers familiar with MapReduce plug in custom Mappers and Reducers for complex analysis that the built-in ones cannot handle.

Why Hive

Writing a dedicated MapReduce program for every analysis is expensive; Hive lets anyone who knows SQL run distributed computations over data stored in Hadoop without writing MapReduce code.

Introduction

Hive was open-sourced by Facebook and donated to Apache, where it is a top-level project (hive.apache.org). It is a data-warehouse (DataWareHouse) technology built on the big-data stack: it translates the SQL statements written by users into MapReduce code and submits the jobs to the MR framework, completing the SQL-to-MapReduce conversion. It can map structured data files onto database tables and offers SQL-like querying.

Summary

  • Hive is a data warehouse (a database-like system)
  • Hive is built on top of HDFS
  • Hive lets programmers run distributed computations with SQL commands; the computation itself runs on YARN (Hive translates the SQL into MR jobs)

Pros:

It lowers the development burden: you write SQL instead of MapReduce, which cuts the learning cost for developers.

Cons:

Latency is high, so Hive suits offline processing of big data (TB/PB-scale data with results delivered about a day later); for small data sets the latency is still high, so it is a poor fit.

  • Database (DataBase)

    • small data volume, high value per record
  • Data warehouse (DataWareHouse)

    • huge data volume, low value per record

Architecture

Hive exposes a SQL-style interface: users run offline analysis with Hive's Query Language, which resembles SQL. Unlike a traditional database, however, Hive is only a SQL parsing tool built on top of Hadoop MapReduce; it does not itself manage or store the data, and all data is loaded only when a job actually runs.

In short, Hive has the following characteristics:

  • Hive stores its files in Hadoop HDFS, whereas a relational database uses the server's local file system.
  • Hive's computation model is MapReduce, whereas a relational database has its own execution engine.
  • Relational databases are designed for OLTP; Hive is designed for mining massive data sets, so its real-time performance is poor, which makes its use cases very different from those of a relational database.
  • Hive scales its storage and compute capacity easily, a property inherited from Hadoop; relational databases are much weaker in this respect.

Installing Hive

Notes

Hive stores table metadata, so in production a MySQL database service must be installed before Hive (note: the MySQL database must use the Latin1 character set). Hive is built on Hadoop storage and compute, so HDFS and MapReduce must be up and running before Hive is configured and installed. Finally, the host running Hive must have the HADOOP_HOME environment variable set, so that Hive can discover the location and size of the user's Hadoop cluster. 1 GB of memory is enough for the virtual machine.

# Before installing
1. Install the MySQL database
2. Install Hadoop
# HDFS and YARN should be configured and running (jps output below)
[root@hive40 ~]# jps
1651 NameNode
2356 NodeManager
2533 Jps
1815 DataNode
2027 SecondaryNameNode
2237 ResourceManager
# Install Hive
1. Upload the Hive package to Linux
2. Extract it
	[root@hive42 ~]# tar -zxvf apache-hive-1.2.1-bin.tar.gz -C /opt/install
	[root@hive42 ~]# mv /opt/install/apache-hive-1.2.1-bin /opt/install/hive1.2.1
3. Configure the environment variables
	export HIVE_HOME=/opt/install/hive1.2.1
	export PATH=$PATH:$HIVE_HOME/bin
4. Reload the profile
	[root@hive42 ~]# source /etc/profile
5. Configure Hive (hive-env.sh)
	# point to the Hadoop installation
	HADOOP_HOME=/opt/install/hadoop2.9.2/
	# point to Hive's configuration directory
	export HIVE_CONF_DIR=/opt/install/hive1.2.1/conf/
hive-site.xml (create this file)
	<?xml version="1.0" encoding="UTF-8" standalone="no"?>
	<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- Hive's metadata lives in MySQL, so the MySQL connection is configured here -->
    <!-- url: use the IP address here -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://192.168.153.42:3306/hive</value>
        <!-- "hive" is the metastore database; it must be created in MySQL first -->
    </property>
    <!--drivername-->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <!--username-->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <!--password-->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>mysql</value>
    </property>
</configuration>
# Note:
	1. Copy the MySQL JDBC driver jar (mysql-connector-java) into Hive's lib directory.
	2. Create the hive database in MySQL.
# 2. Create the database
	> Log in to MySQL and create the hive database on the command line (see the sketch below)
	> create database hive
	> restart MySQL
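
Since the notes above require the metastore database to use the Latin1 encoding, the create statement can specify it explicitly (a sketch of the MySQL command; the database name matches the ConnectionURL above):

	create database hive default character set latin1;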
# Startup (start Hadoop's HDFS and YARN, and MySQL, first)

1. Start Hadoop

start-dfs.sh
start-yarn.sh


2. Start Hive locally

Start local Hive and enter the client (CLI) mode:

[root@hadoop40 installs]# hive
Logging initialized using configuration in jar:file:/opt/installs/hive1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive>

3. The Hive command line

a. Local Linux commands

b. HDFS commands

c. HQL statements

# 1. Local Linux commands (e.g. java commands)
Syntax: !<linux command>;
!java;
!clear;
!javac;
# 2. HDFS commands
Syntax: dfs <hdfs command>;
	e.g. dfs -ls /;
	1. List the files in HDFS:
	dfs -ls /;
	2. List the files under /user in HDFS:
	dfs -ls /user;
	3. List everything under /user recursively:
	dfs -lsr /user;
# 3. HQL statements
	1. List all databases in the Hive cluster
		hive> show databases;
	2. Create a database
		hive> create database btrc;
	3. Switch into a database
		hive> use btrc;
	4. List the tables in the current database
		hive> show tables;
	5. Create a table
		hive> create table t_user(id string,name string,age int);
	6. Insert a row (translated into an MR job)
		hive> insert into t_user values('1001','yangdd',42);
		hive> insert into t_user values('1002','houqw',18);
		hive> insert into t_user values('1003','wyunt',24);
		hive> insert into t_user values('1004','wjh',23);
	7. Show the table structure
		hive> desc t_user;
	8. Show the table's schema / DDL (table metadata and description)
		hive> show create table t_user;
		# it is clearly visible that the table's data is stored in HDFS
		a Hive database is a directory in HDFS;
		a Hive table is a directory or a file in HDFS;
	9. Show the database structure
		hive> desc database btrc;
	10. Show the current database
		hive> select current_database();
	11. Other SQL (a plain scan does not trigger MapReduce)
		select * from t_user;
		select count(*) from t_user;  (aggregations trigger MapReduce)
		select * from t_user order by id;  (sorting triggers MapReduce)

4. Ways to start Hive

  • Single-user mode

    Start local Hive and enter client mode (administrator mode)

[root@hive42 conf]# hive

Logging initialized using configuration in jar:file:/opt/install/hive1.2.1/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive>
  • Multi-user mode
# a. Foreground start: the process is RunJar; once started, this terminal can no longer accept commands
[root@hive42 ~]# hiveserver2

# b. beeline mode (only usable once hiveserver2 is already running)
# 1. Enter the beeline shell
	beeline
# 2. Connect to the Hive database
	beeline>!connect jdbc:hive2://192.168.153.42:10000   <Enter>
	> MySQL user name: root		<Enter>
	> MySQL password: mysql		<Enter>
	Or do it all at once: beeline -u jdbc:hive2://hive42:10000 -n root
	(-u specifies the URL, -n specifies the user name)
# 3. Notes:
	1) beeline keeps a session connection open, whereas the hive CLI is immediate-mode: every command is an independent connection.
	2) beeline statements must end with a semicolon, otherwise you get errors such as:
Error: Error while compiling statement: FAILED: ParseException line 2:0 missing EOF at 'show' near 'databases' (state=42000,code=40000)
0: jdbc:hive2://192.168.153.42:10000>
	Fix: press Ctrl+C and restart the session.
# c. Background start
	hiveserver2 &
[root@hive42 conf]# hiveserver2 &
[1] 63003
[root@hive42 conf]#

5. Accessing Hive over JDBC

# Add the dependencies
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.9.2</version>
</dependency>
# Operating Hive through JDBC
package demo1;
import java.sql.*;
/**
 * Query the t_user table in the btrc database.
 */
public class HiveJdbc {
    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        // 1. Load the driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // 2. Obtain a connection
        Connection conn = DriverManager.getConnection("jdbc:hive2://192.168.153.42:10000/btrc", "root", "mysql");
        // 3. Build the SQL
        String sql = "select id,name,age from t_user";
        // 4. Execute the SQL
        PreparedStatement pstm = conn.prepareStatement(sql);
        ResultSet resultSet = pstm.executeQuery();
        // 5. Process the result set
        while (resultSet.next()){
            String id = resultSet.getString("id");
            String name = resultSet.getString("name");
            int age = resultSet.getInt("age");
            System.out.println(id+":"+name+":"+age);
        }
        // 6. Release the resources
        resultSet.close();
        pstm.close();
        conn.close();
    }
}

Hive data types

Data types (primitive, array, map, struct)

  • primitive (primitive types):

    Hive type    Bytes   Notes
    TINYINT      1       integer, like Java's byte
    SMALLINT     2       integer, like Java's short
    INT          4       integer, like Java's int
    BIGINT       8       integer, like Java's long
    BOOLEAN              boolean
    FLOAT        4       floating point
    DOUBLE       8       floating point
    STRING               string, unlimited length
    VARCHAR              string, varchar(20) holds at most 20 characters
    CHAR                 string, char(20) is fixed at 20 characters
    BINARY               binary type
    TIMESTAMP            timestamp type
    DATE                 date type

    Complex types

    Type     Description                                                Literal example
    ARRAY    ordered collection of elements of the same type            array(1,2)
    MAP      key-value pairs; keys must be primitive, values any type   map('a',1,'b',2)
    STRUCT   a set of fields whose types may differ                     struct('1',1,1.0), named_struct('col1','1','col2',1,'col3',1.0)
  • array (array type):

    array<element type>
    # create the table
    create table t_tab(
    	score array<float>
    );
    Access:
    	array_column[index]
    	select id,name,score[1] from t_tab
    
  • map (key-value type): MAP <primitive_type, data_type>

    map<key type, value type>
    # create the table
    create table t_tab(
    	score map<string,float>
    );
    Access:
    	map_column['k']
    	select id,name,score['k'] from t_tab
    
  • struct (struct type): STRUCT <col_name:data_type, …>

    object-like type:
    	struct<field1:type, field2:type, field3:type, ...>
    # create the table
    create table t_tab(
    	info struct<name:string,age:int,sex:char(1)>
    );
    Access:
    	struct_column.field_name
    	select id,name,info.age from t_tab
    

A table built from the types above:

create table t_person(
 id int,
 name string,
 salary double,
 birthDay date,
 sex char(1),
 hobbies array<string>,
 cards map<string,string>,
 addr struct<zipCode:string,city:string>
);
-- note: type names inside the struct (e.g. string) must be written in lower case

*** Note: Hive tables are created without constraints (no primary or foreign keys, etc.).

# 1. Show the table structure
desc t_person;
# 2. Show detailed table information (including where it lives in HDFS)
show create table t_person;
# 3. Show even more detail
desc formatted t_person;

Importing data into Hive

1 Default delimiters

Delimiter   How to type it     Meaning
\n          newline in vi      separates records: one row per line.
^A          Ctrl+V Ctrl+A      separates fields (columns); written as \001 in HQL DDL.
^B          Ctrl+V Ctrl+B      separates array elements, struct members, and map entries (between k-v pairs); written as \002 in HQL DDL.
^C          Ctrl+V Ctrl+C      separates a map key from its value; written as \003 in HQL DDL.
# Sample data (typed in vi; the separator control characters above appear as blanks below)
1 张三 8000.0 2019-9-9 1 抽烟 喝酒 烫头 123456 中国银行 22334455 建设银行 北京 10010
2 李四 9000.0 2019-8-9 0 抽烟 喝酒 烫头 123456 中国银行 22334455 建设银行 郑州 45000
3 王五 7000.0 2019-7-9 1 喝酒 烫头 123456 中国银行 22334455 建设银行 北京 10010
4 赵6 1000.0 2019-10-9 0 抽烟 烫头 123456 中国银行 22334455 建设银行 郑州 45000
5 于谦 1000.0 2019-10-9 0 抽烟 喝酒 123456 中国银行 22334455 建设银行 北京 10010
6 郭德纲 1000.0 2019-10-9 1 抽烟 烫头 123456 中国银行 22334455 建设银行 天津 20010
# Importing the data:
# from the local Linux file system (overwrite replaces any existing data)
load data local inpath '/opt/data/hive/person.txt' [overwrite] into table t_person;
# from HDFS (overwrite replaces any existing data)
load data inpath '/data/hive/person.txt' [overwrite] into table t_person;
# Verify the data
1. Query it
select * from t_person;
2. Look at the table's data file in HDFS


2 Custom delimiters

# Delimiter design
Delimiter   Meaning                                                       Clause
,           separates the values of the columns                           fields
-           separates array elements, struct values, and map k-v pairs    collection items
|           separates a map key from its value                            map keys
\n          separates records: one per line                               lines
# Create the table
create table t_person1(
    id string,
    name string,
    salary double,
    birthday date,
    sex char(1),
    hobbies array<string>,
    cards map<string,string>,
    addr struct<city:string,zipCode:string>
) row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by '|'
lines terminated by '\n';
# Test data
1,张三,8000.0,2019-9-9,1,抽烟-喝酒-烫头,123456|中国银行-22334455|建设银行,北京-10010
2,李四,9000.0,2019-8-9,0,抽烟-喝酒-烫头,123456|中国银行-22334455|建设银行,郑州-45000
3,王五,7000.0,2019-7-9,1,喝酒-烫头,123456|中国银行-22334455|建设银行,北京-10010
4,赵6,100.0,2019-10-9,0,抽烟-烫头,123456|中国银行-22334455|建设银行,郑州-45000
5,于谦,1000.0,2019-10-9,0,抽烟-喝酒,123456|中国银行-22334455|建设银行,北京-10010
6,郭德纲,1000.0,2019-10-9,1,抽烟-烫头,123456|中国银行-22334455|建设银行,天津-20010
# Load the data
# run in the hive command line
-- overwrite means replace existing data; it may be omitted.
load data local inpath '/opt/doc/person1.txt' [overwrite] into table t_person1;
# Essentially this just uploads the file into HDFS (the data is managed by Hive: drop the table and the HDFS data is gone too)
dfs -ls /user/hive/warehouse/baizhi.db/t_person 
# A file with the agreed delimiters can also be uploaded into this directory directly with dfs commands, and it becomes queryable. A quick query sketch follows.
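
As a quick sanity check on the complex columns (a sketch; the column names follow the t_person1 definition above, and '123456' is one of the bank-card keys in the sample data):

	select name, hobbies[0], cards['123456'], addr.city, addr.zipCode from t_person1;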

3. JSON format

# 1. Create a JSON file locally
{"id":1,"name":"zhangsan","sex":true,"birth":"1991-02-08"}
{"id":2,"name":"lisi","sex":true,"birth":"1991-02-08"}
Add the SerDe jar that parses JSON (a local hive client command)
# run in the hive client
add jar /opt/install/hive1.2.1/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar;
# Create the table
create table t_person2(
    id string,
    name string,
    sex char(1),
    birth date
)row format serde 'org.apache.hive.hcatalog.data.JsonSerDe';
# Load the file (a local hive client command)
# Note: the imported JSON data cannot be viewed in DBeaver (the table is, in essence, just this JSON file).
load data local inpath '/opt/data/hive/t_person2.json' into table t_person2;
# Query the data
select * from t_person2;


4 Mapping CSV files

# Data
# file name: user_csv.csv
1001,张三,1,1999-9-9
1002,李四,0,1999-9-9
1003,王五,1,1999-9-9
# Add the jar that parses CSV files
# run in the hive client
add jar /opt/installs/hive1.2.1/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar
# Create the table
create table t_person3(
    id string,
    name string,
    sex char(1),
    birth date
)row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'  -- the SerDe class that parses CSV files
with serdeproperties('separatorChar'=',');  -- specify the separator
# Load the data
load data local inpath '/opt/data/hive/t_person3.csv' into table t_person3;
# Query the table
select * from t_person3;

5 Regex-based parsing

# Data: access.log
INFO 192.168.1.1 2019-10-19 QQ com.baizhi.service.IUserService#login
INFO 192.168.1.1 2019-10-19 QQ com.baizhi.service.IUserService#login
ERROR 192.168.1.3 2019-10-19 QQ com.baizhi.service.IUserService#save
WARN 192.168.1.2 2019-10-19 QQ com.baizhi.service.IUserService#login
DEBUG 192.168.1.3 2019-10-19 QQ com.baizhi.service.IUserService#login
ERROR 192.168.1.1 2019-10-19 QQ com.baizhi.service.IUserService#register
# Create-table statement
create table t_service_log(
    level string,
    ip string,
    log_time date,
    app string,
    service string,
    method string
)row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'   
with serdeproperties("input.regex"="(.*)\\s(.*)\\s(.*)\\s(.*)\\s(.*)#(.*)");
-- (.*) matches any run of characters and \\s matches a whitespace character; RegexSerDe maps each capture group to a column
# Load the data
load data local inpath '/opt/data/hive/service.log' into table t_service_log;
# Query the data
select * from t_service_log;
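
Once the log file is mapped, ordinary HQL works on it; for example, a sketch that counts records per log level:

	select level, count(*) as cnt from t_service_log group by level;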


Table types in Hive

Managed tables (internal tables)

Requirement: what if you do not want Hive to delete the data in HDFS together with the table?

Tables fully managed by Hive

"Managed" refers to whether Hive has authority over the data. If a table is a managed table, then when a user drops the table Hive also deletes the data belonging to it. In production, to guard against data loss from an accidental drop, tables are therefore usually changed to non-managed (external) tables.

Drawback: the data is not safe.

Internal (managed) table: Hive has full authority over it (statistics, and deleting the data on drop table); dropping it removes the table from Hive and also deletes the corresponding data files in HDFS.
	the table is created by Hive
	data is loaded with the load data command
	data is inserted with insert
Drawback: data may be lost by an accidental drop.

External tables

Hive references data that already exists in HDFS and manages it as a table, but dropping the table cannot delete the data.

The biggest difference between an external table and a managed table is what happens on drop: dropping an external table only removes the table's metadata from MySQL and does not delete the data in HDFS, so external tables can share data with third-party applications. To create one, just add the keyword "external": create external table xxx()…

External table: Hive maps an existing data file to a table by reference.
	1. Hive only maps the file's data into a table.
	2. The table's metadata is stored in MySQL (the MetaStore).
	3. Hive can run every HQL query against the table's data.
	Key property: once the external table is dropped, only the table's metadata in MySQL is removed;
		the data file in HDFS is kept.
# Creating an external table
1. Prepare the data file t_personhdfs.txt
2. Upload it to HDFS. The data file must live in a dedicated directory; every file in that directory is treated as table data.
 	hdfs dfs -mkdir -p /data/hive/btrc/personhdfs
3. Create the table, and use location at the end to point to the HDFS directory that holds the data files.
    create external table t_personhdfs(
        id int,
        name string,
        salary double,
        birthday date,
        sex char(1),
        hobbies array<string>,
        cards map<string,string>,
        addr struct<city:string,zipCode:string>
    )row format delimited
    fields terminated by ','
    collection items terminated by '-'
    map keys terminated by '|'
    lines terminated by '\n'
    location '/data/hive/btrc/personhdfs';
    
    -- ',' separates the columns
    -- '-' separates array elements, struct fields, and map k-v pairs
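
A sketch of loading data into the external table and checking it (the file name is an assumption; dfs commands can be run from the hive prompt):

	dfs -put /opt/data/hive/t_personhdfs.txt /data/hive/btrc/personhdfs;
	select id, name, addr.city from t_personhdfs;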

Partitioned tables

Partitioning splits a table by some column according to a fixed rule, which narrows the range of data that has to be scanned when the volume is huge and so improves query performance.

Example: a movie table, a user table

Partition scheme: by user region, by movie genre

Usage: based on the actual business, choose columns that appear in query conditions as partition columns, so that MapReduce scans less data and runs faster.

Choosing the partition column:

in offline Hive workloads, use the filter-condition columns (chosen according to the data volume and distribution).

Using partitions:

when a query filters on the partition column, the Hive optimizer automatically locates the data by partition.

Creating a partitioned table
# Source data files
# file "bj.txt" (the china/bj data)
1001,张三,1999-1-9,1000.0
1002,李四,1999-2-9,2000.0
1008,孙帅,1999-9-8,50000.0
1010,王宇希,1999-10-9,10000.0
1009,刘春阳,1999-9-9,10.0
# file "tj.txt" (the china/tj data)
1006,郭德纲,1999-6-9,6000.0
1007,胡鑫喆,1999-7-9,7000.0
# Create the partitioned table and load the data
create external table t_user_part(
	id string,
	name string,
	birth date,
	salary double
)partitioned by(country string,city string)  -- partition columns: partition by country and city
row format delimited
fields terminated by ',' 
lines terminated by '\n';
# load the china / bj data
load data local inpath "/opt/data/hive/part/bj.txt" into table t_user_part partition(country='china',city='bj');
# load the china / tj data
load data local inpath "/opt/data/hive/part/tj.txt" into table t_user_part partition(country='china',city='tj');
# Show the partition information
	show partitions t_user_part;
# Query by partition: it is enough that the filter condition uses a partition column
	select * from t_user_part where city = 'bj';
# Drop a partition (its data is deleted with it)
# note: the partition's data is deleted together with it
	alter table t_user_part drop partition(country='china',city='bj');
# Add a partition
# for an external partitioned table, dropping means Hive no longer manages the data, but the data files remain
alter table t_user_part add partition(country='china',city='heb') location '/file/t_user_part/heb';
Summary:
	the multiple partitions of a table are managed per partition
	1. Deletion happens per partition: dropping a partition also deletes that partition's data (for a managed table; for an external table the data files survive).
	2. For queries and statistics, all partitions are managed as one table:
		select * from <table> where <partition column> = ...;  (a sketch follows below)
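
As noted in point 2 of the summary, filtering on the partition columns prunes the scan to the matching directories; a sketch that reads only one partition:

	select id, name, salary from t_user_part where country = 'china' and city = 'tj';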
Bucketed tables

A partitioned table coarsely isolates files into partition directories; a bucketed table instead hashes a column to compute which bucket each row belongs to, and can additionally keep the rows inside each bucket sorted.

CREATE EXTERNAL TABLE t_employee_bucket(
    id INT,
    name STRING,
    job STRING,
    manager INT,
    hiredate TIMESTAMP,
    salary DECIMAL(7,2),
    deptno INT)
    CLUSTERED BY(id) SORTED BY(salary ASC) INTO 6 BUCKETS  
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY "\t"
    LOCATION '/hive/employee_bucket';
0: jdbc:hive2://CentOS:10000> set hive.enforce.bucketing = true;
No rows affected (0.024 seconds)
0: jdbc:hive2://CentOS:10000> INSERT INTO TABLE t_employee_bucket(id,name,job,manager,hiredate,salary,deptno) SELECT id,name,job,manager,hiredate,salary,deptno  FROM t_employee where deptno=10;

CLUSTERED BY(id) means the rows are assigned to buckets by hashing the id column; SORTED BY(salary ASC) keeps the rows inside each bucket sorted by salary. Note that for bucketing to take effect, hive.enforce.bucketing must be enabled (as above) and the data has to be written through INSERT ... SELECT; a sampling sketch follows below.
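
One practical benefit of bucketing is efficient sampling; a sketch that reads only one of the six buckets defined above:

	select id, name, salary from t_employee_bucket tablesample(bucket 1 out of 6 on id);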

  • Temporary tables

A temporary table is visible only to the current session (one connection). Its data is stored in the user's scratch directory and is deleted when the session ends. If a temporary table has the same name as a permanent table, any reference to that name resolves to the temporary table rather than the permanent one. Temporary tables also have two limitations: partition columns are not supported, and indexes cannot be created on them.

CREATE TEMPORARY TABLE if not exists emp_temp(
    id INT,
    name STRING,
    job STRING,
    manager INT,
    hiredate TIMESTAMP,
    salary DECIMAL(7,2),
    deptno INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
    LOCATION '/hive/emp_temp';
Copying a table structure
CREATE TEMPORARY EXTERNAL TABLE  IF NOT EXISTS  t_employee_copy2  LIKE t_employee_bucket location '/hive/t_empolyee_copy2'

This copies only the table structure; no data is copied.

Altering tables
  • Rename
0: jdbc:hive2://CentOS:10000> ALTER TABLE t_user1 RENAME TO t_user;
No rows affected (0.224 seconds)
0: jdbc:hive2://CentOS:10000> show tables;
+--------------------+--+
|      tab_name      |
+--------------------+--+
| t_employee         |
| t_employee_bucket  |
| t_employee_copy    |
| t_employee_copy2   |
| t_employee_copy3   |
| t_tem              |
| t_user             |
+--------------------+--+
7 rows selected (0.088 seconds)
  • Modify columns (change type, change order, add columns)
0: jdbc:hive2://CentOS:10000> desc t_user;
+-----------+-------------------------------------+----------+--+
| col_name  |              data_type              | comment  |
+-----------+-------------------------------------+----------+--+
| id        | int                                 |          |
| name      | string                              |          |
| sex       | boolean                             |          |
| age       | int                                 |          |
| salary    | double                              |          |
| hobbies   | array<string>                       |          |
| card      | map<string,string>                  |          |
| address   | struct<country:string,city:string>  |          |
+-----------+-------------------------------------+----------+--+
8 rows selected (0.173 seconds)
0: jdbc:hive2://CentOS:10000> alter table t_user change id id bigint;
No rows affected (0.226 seconds)
0: jdbc:hive2://CentOS:10000> desc t_user;
+-----------+-------------------------------------+----------+--+
| col_name  |              data_type              | comment  |
+-----------+-------------------------------------+----------+--+
| id        | bigint                              |          |
| name      | string                              |          |
| sex       | boolean                             |          |
| age       | int                                 |          |
| salary    | double                              |          |
| hobbies   | array<string>                       |          |
| card      | map<string,string>                  |          |
| address   | struct<country:string,city:string>  |          |
+-----------+-------------------------------------+----------+--+
8 rows selected (0.194 seconds)
0: jdbc:hive2://CentOS:10000> alter table t_user change id id bigint after name;
No rows affected (0.236 seconds)
0: jdbc:hive2://CentOS:10000> desc t_user;
+-----------+-------------------------------------+----------+--+
| col_name  |              data_type              | comment  |
+-----------+-------------------------------------+----------+--+
| name      | string                              |          |
| id        | bigint                              |          |
| sex       | boolean                             |          |
| age       | int                                 |          |
| salary    | double                              |          |
| hobbies   | array<string>                       |          |
| card      | map<string,string>                  |          |
| address   | struct<country:string,city:string>  |          |
+-----------+-------------------------------------+----------+--+
8 rows selected (0.174 seconds)
0: jdbc:hive2://CentOS:10000> alter table t_user change name name string after id;
No rows affected (0.249 seconds)
0: jdbc:hive2://CentOS:10000> desc t_user;
+-----------+-------------------------------------+----------+--+
| col_name  |              data_type              | comment  |
+-----------+-------------------------------------+----------+--+
| id        | bigint                              |          |
| name      | string                              |          |
| sex       | boolean                             |          |
| age       | int                                 |          |
| salary    | double                              |          |
| hobbies   | array<string>                       |          |
| card      | map<string,string>                  |          |
| address   | struct<country:string,city:string>  |          |
+-----------+-------------------------------------+----------+--+
0: jdbc:hive2://CentOS:10000> alter table t_user add columns (deptno int);
No rows affected (0.256 seconds)
0: jdbc:hive2://CentOS:10000> desc t_user;
+-----------+-------------------------------------+----------+--+
| col_name  |              data_type              | comment  |
+-----------+-------------------------------------+----------+--+
| id        | bigint                              |          |
| name      | string                              |          |
| sex       | boolean                             |          |
| age       | int                                 |          |
| salary    | double                              |          |
| hobbies   | array<string>                       |          |
| card      | map<string,string>                  |          |
| address   | struct<country:string,city:string>  |          |
| deptno    | int                                 |          |
+-----------+-------------------------------------+----------+--+
9 rows selected (0.137 seconds)
Truncating a table
0: jdbc:hive2://CentOS:10000> truncate table t_employee_copy; -- must be a managed (internal) table
0: jdbc:hive2://CentOS:10000> truncate table t_employee partition(deptno=20);-- truncate a single partition

The table being truncated must be a managed (internal) table.

Dropping tables and partitions
0: jdbc:hive2://CentOS:10000> drop table t_employee_copy3
0: jdbc:hive2://CentOS:10000> show partitions t_employee;
+------------+--+
| partition  |
+------------+--+
| deptno=10  |
| deptno=20  |
+------------+--+
2 rows selected (0.185 seconds)
0: jdbc:hive2://CentOS:10000> show partitions t_employee;
+------------+--+
| partition  |
+------------+--+
| deptno=10  |
| deptno=20  |
+------------+--+
2 rows selected (0.185 seconds)
0: jdbc:hive2://CentOS:10000> alter table t_employee drop partition(deptno=20);
INFO  : Dropped the partition deptno=20
No rows affected (0.212 seconds)
0: jdbc:hive2://CentOS:10000> show partitions t_employee;
+------------+--+
| partition  |
+------------+--+
| deptno=10  |
+------------+--+
1 row selected (0.209 seconds)
0: jdbc:hive2://CentOS:10000> alter table t_employee add partition(deptno=20);

If the table being dropped is a managed table, the system deletes the table's data as well. If the HDFS trash feature is enabled, the data is actually moved to the trash; to bypass the trash, use drop table <table name> PURGE;

Additional note:

core-site.xml

<property>
  <name>fs.trash.interval</name>
  <value>5</value> <!-- keep deleted data in the trash for at most 5 minutes -->
</property>
  • Inserting query results into a table

Insert the result of a query into another table:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]]   
select_statement1 FROM from_statement; -- overwrite

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] 
select_statement1 FROM from_statement; -- append
{"EMPNO": 7369,"ENAME": "SMITH","JOB": "CLERK","MGR": 7902,"HIREDATE": "1980-12-17 00:00:00","SAL": 800.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7499,"ENAME": "ALLEN","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-02-20 00:00:00","SAL": 1600.00,"COMM": 300.00,"DEPTNO": 30}
{"EMPNO": 7521,"ENAME": "WARD","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-02-22 00:00:00","SAL": 1250.00,"COMM": 500.00,"DEPTNO": 30}
{"EMPNO": 7566,"ENAME": "JONES","JOB": "MANAGER","MGR": 7839,"HIREDATE": "1981-04-02 00:00:00","SAL": 2975.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7654,"ENAME": "MARTIN","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-09-28 00:00:00","SAL": 1250.00,"COMM": 1400.00,"DEPTNO": 30}
{"EMPNO": 7698,"ENAME": "BLAKE","JOB": "MANAGER","MGR": 7839,"HIREDATE": "1981-05-01 00:00:00","SAL": 2850.00,"COMM": null,"DEPTNO": 30}
{"EMPNO": 7782,"ENAME": "CLARK","JOB": "MANAGER","MGR": 7839,"HIREDATE": "1981-06-09 00:00:00","SAL": 2450.00,"COMM": null,"DEPTNO": 10}
{"EMPNO": 7788,"ENAME": "SCOTT","JOB": "ANALYST","MGR": 7566,"HIREDATE": "1987-04-19 00:00:00","SAL": 1500.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7839,"ENAME": "KING","JOB": "PRESIDENT","MGR": null,"HIREDATE": "1981-11-17 00:00:00","SAL": 5000.00,"COMM": null,"DEPTNO": 10}
{"EMPNO": 7844,"ENAME": "TURNER","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-09-08 00:00:00","SAL": 1500.00,"COMM": 0.00,"DEPTNO": 30}
{"EMPNO": 7876,"ENAME": "ADAMS","JOB": "CLERK","MGR": 7788,"HIREDATE": "1987-05-23 00:00:00","SAL": 1100.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7900,"ENAME": "JAMES","JOB": "CLERK","MGR": 7698,"HIREDATE": "1981-12-03 00:00:00","SAL": 950.00,"COMM": null,"DEPTNO": 30}
{"EMPNO": 7902,"ENAME": "FORD","JOB": "ANALYST","MGR": 7566,"HIREDATE": "1981-12-03 00:00:00","SAL": 3000.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7934,"ENAME": "MILLER","JOB": "CLERK","MGR": 7782,"HIREDATE": "1982-01-23 00:00:00","SAL": 1300.00,"COMM": null,"DEPTNO": 10}
0: jdbc:hive2://CentOS:10000> add jar /usr/apache-hive-1.2.2-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.2.jar;
CREATE EXTERNAL TABLE t_emp_json(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2),
    deptno INT)
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
0: jdbc:hive2://CentOS:10000> load data local  inpath '/root/baizhi/t_emp_json' overwrite into table t_emp_json;
    
0: jdbc:hive2://CentOS:10000> select empno,ename,job,mgr,hiredate,sal,comm,deptno from t_emp_json;
CREATE TABLE t_employee(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2)
    )
    PARTITIONED BY (deptno INT)   
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
0: jdbc:hive2://CentOS:10000> INSERT OVERWRITE TABLE t_employee PARTITION (deptno=10) select empno,ename,job,mgr,hiredate,sal,comm from t_emp_json where deptno=10;
0: jdbc:hive2://CentOS:10000> INSERT INTO TABLE t_employee PARTITION (deptno=10) select empno,ename,job,mgr,hiredate,sal,comm from t_emp_json where deptno=10;

A more common approach:
0: jdbc:hive2://CentOS:10000> create TEMPORARY table t_tmp as select empno,ename,job,mgr,hiredate,sal,comm,deptno from t_emp_json where deptno=10;

Inserting query results into multiple tables

FROM from_statement
INSERT OVERWRITE TABLE tablename1 
[PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
from t_emp_json 
INSERT OVERWRITE TABLE t_employee partition(deptno=10) select empno,ename,job,mgr,hiredate,sal,comm  where deptno=10 
INSERT OVERWRITE TABLE t_employee partition(deptno=20) select empno,ename,job,mgr,hiredate,sal,comm  where deptno=20 
INSERT OVERWRITE TABLE t_employee partition(deptno=30) select empno,ename,job,mgr,hiredate,sal,comm  where deptno=30

Dynamic partition insert

0: jdbc:hive2://CentOS:10000> set hive.exec.dynamic.partition.mode=nonstrict; # enable non-strict dynamic partition mode
No rows affected (0.016 seconds)
0: jdbc:hive2://CentOS:10000> INSERT OVERWRITE TABLE t_employee PARTITION (deptno)  SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM t_emp_json ;
Writing results out to the file system
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format] 
  SELECT ... FROM ...
0: jdbc:hive2://CentOS:10000> INSERT OVERWRITE DIRECTORY '/employee'  row format delimited fields terminated by ',' STORED AS textfile SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM t_emp_json ;
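
The LOCAL variant writes the result to a directory on the local file system of the machine running the job instead of HDFS; a sketch with an illustrative output directory:

	INSERT OVERWRITE LOCAL DIRECTORY '/tmp/employee_export'
	  row format delimited fields terminated by ',' STORED AS textfile
	  SELECT empno,ename,sal,deptno FROM t_emp_json;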

Advanced HQL (Hive's SQL language)

Simple queries

# 0. Accessing fields of each data type (array, map, struct)
	select name,salary,hobbies[1],cards['123456'],addr.city from t_person;
# 1. Filter conditions: = != >= <=
select * from t_person where addr.city='郑州';
# 2. and / or / between ... and
select * from t_person where salary>5000 and array_contains(hobbies,'抽烟');
# 3. order by (a MapReduce job is started to sort)
select * from t_person order by salary desc;
# 4. limit (Hive's limit has no start offset)
select * from t_person order by salary desc limit 5;
# 5. Deduplication
select distinct addr.city from t_person;
select distinct(addr.city) from t_person;
# 6. Matching column names with a regex (select uuid plus every column ending in "String")
	# enable regex column names
	set hive.support.quoted.identifiers=none;
	# run the query
	select uuid,`.*String` from logs;
# use btrc;
# 1. Query the array, map and struct columns
select id,name,salary,hobbies[1] hobby2,cards ['123456'] yinhang,addr.city city
from t_person;
# 2. Filter conditions: > < >= <= != =
select * from t_person where addr.city='北京';

# 3. Logical operators: and, or, between
select * from t_person where addr.city='北京' and salary>5000;

# 4. Sorting: order by <column> desc|asc
select * from t_person order by salary desc;

# 5. Paging / top-N: limit <n>  (e.g. the three highest salaries)
# limit 3 returns only the first three rows of the result
select * from t_person order by salary desc limit 3;

# 6. Deduplication (distinct)
select distinct addr.zipcode from t_person;
select distinct(addr.zipcode) from t_person;

# 7. Matching columns by regex: returns userid plus every column whose name ends in "string"
# query userid and all columns ending in "string"
# first enable regex column names
set hive.support.quoted.identifiers=none; 
select userid,`.*string` from logs;
# Data preparation for the regex-column example (creates the logs table used above)
create table logs(
    uuid string,
    userid string ,
    fromUrl string ,
    dateString string,
    timeString string,
    ipAddress string,
    browserName string,
    pcSystemNameOrmobileBrandName string ,
    systemVersion string,
    language string,
    cityName string
)partitioned by (day string)
row format delimited fields terminated
by ' ';
# Load the data
load data local inpath '/opt/data/hive/access.log' into table logs partition(day='2019-05-09');
# Enable regex column names; the query returns userid plus every column ending in "string"
set hive.support.quoted.identifiers=none; 
select userid,`.*string` from logs;

HQL functions (single-row functions)

# Single-row functions (show functions lists all built-in functions)
1. length(col): length of the string.
	select id,name,length(salary) from t_person;
2. concat(col,col): concatenation
	select concat('李','旭');
3. to_date('1999-9-10'): converts a string to a date
	select to_date("1999-9-10");
4. Extract the year of a date (year)
	select id,name,year(birthday) from t_person;
  Extract the month of a date (month)
  	select id,name,month(birthday) from t_person;
5. Date arithmetic: add or subtract days
	date_add(date, days)
	select id,name,date_add(birthday,9) from t_person;
	date_sub(date, days)
	select id,name,date_sub(birthday,9) from t_person;
	which is equivalent to select id,name,date_add(birthday,-9) from t_person;
6. List Hive's built-in functions
	show functions;
7. array_contains(array_col, value): does the array contain the value
	select * from t_person where array_contains(hobbies ,'喝酒');	
8. Show a function's description
	desc function max;
9. Show a function's detailed documentation
	desc function extended max;
# Grouping and aggregate functions
	max, min, sum, avg, count and the other aggregate functions.
	select max(salary) from t_person where addr.city='北京';
	select count(id) from t_person;
1. select count(*) from t_person where addr.city='北京';  -- how many people are in the city 北京
2. group by (average salary per city; the column you group by should also appear in the select list)
	select addr.city,avg(salary) from t_person group by addr.city;
3. having (the highest salary among cities whose average salary exceeds 5000)
	select max(salary) maxSalary,addr.city,avg(salary) from t_person1 group by addr.city having avg(salary)>5000;
4. having (cities whose average salary exceeds 5000, together with that average)
	select addr.city,avg(salary) from t_person1 group by addr.city having avg(salary)>5000;
# Subqueries
Hive's subquery support is limited: subqueries in the FROM clause work, and since Hive 0.13 IN / NOT IN / EXISTS subqueries are allowed in the WHERE clause, but general subqueries (e.g. in the select list) are not supported. A FROM-clause sketch follows below.
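
For example, a FROM-clause subquery that first computes the average salary per city and then filters on it (column names follow t_person1 above):

	select t.city, t.avgSalary
	from (select addr.city as city, avg(salary) as avgSalary from t_person1 group by addr.city) t
	where t.avgSalary > 5000;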

Hive table joins

# Joins
join
left join
right join
# Inner join
select ... from <table1> t1 join <table2> t2 on t1.col = t2.col
1. Find people of different sex who have the same salary.
select 
	/*+streamtable(t1)*/   -- marks t1 as the large (streamed) table, reducing Hive's memory pressure
    t1.name,t1.sex,t1.salary,
    t2.name,t2.sex,t2.salary
from t_person1 t1 join t_person1 t2 on t1.salary = t2.salary 
where t1.sex != t2.sex;
2. Find people who share the same second hobby but live in different cities.
SELECT 
	t1.name,t1.salary,t1.hobbies,t1.addr.city,
	t2.name,t2.salary,t2.hobbies,t2.addr.city
from t_person1 t1 join t_person1 t2 on t1.hobbies[1]==t2.hobbies[1]
where t1.addr.city != t2.addr.city;


Hive join optimization

When Hive joins tables it works from left to right: it caches the left-hand table in memory and then compares the following table against it row by row.

Problem: if the table on the left of the join is very large, a lot of memory is used and wasted, and performance drops.

# Solutions
1. When joining, put the small table on the left whenever possible.
2. Use Hive's hint mechanism to tell the query optimizer:
   add /*+streamtable(alias of the big table)*/ right after select
SELECT 
/*+streamtable(t1)*/ -- the query optimizer streams this table as the big one and caches the small table first
	t1.name,t1.salary,t1.hobbies,t1.addr.city,
	t2.name,t2.salary,t2.hobbies,t2.addr.city
from t_person1 t1 join t_person1 t2 on t1.hobbies[1]==t2.hobbies[1]
where t1.addr.city != t2.addr.city;

Global vs. local sorting

# Hive's order by:
	1. uses a single ReduceTask
	2. sorts the whole data set
# Trade-off:
	drawback: the sorting load on that one ReduceTask is very heavy.
	advantage: the result is globally sorted.
# Global sort (implemented by MapReduce)
Syntax: select * from <table> order by <column> asc|desc;
-- sort by salary in descending order
select * from t_person1 order by salary desc;
# Local sort
	keyword: sort by
	by default there is only one ReduceTask and therefore only one partition, so the effect is the same as a global sort.
	1. the number of ReduceTasks can be increased.
	2. each ReduceTask sorts its own portion internally.
Syntax: select * from <table> sort by <column> asc|desc;
# Example:
-- 1. Increase the number of reducers
	-- set the number of reducers
	set mapreduce.job.reduces = 2;
    -- check the number of reducers
    set mapreduce.job.reduces;
-- 2. Sort with sort by
	select * from t_person1 sort by salary desc;
	-- the result is not globally ordered; to control which rows go to which reducer, specify a distribution key
-- 3. Partitioned sorting in Hive (specify the reducer distribution key)
	-- put people from the same city together
	Syntax: select * from <table> distribute by <column> sort by <column> desc|asc;
	Example:
	select * from t_person1 distribute by addr.city sort by salary desc;

User-defined functions in Hive

Function lookup commands

# List Hive's built-in functions
	show functions;
# Show a function's description
	desc function max;
# Show a function's detailed documentation and examples
	desc function extended max;

# Single-row functions
	characteristic: they operate on rows before grouping; the number of rows does not change.
	length, array_contains, to_date, date_add
# Aggregate (group) functions
	characteristic: one result per group of rows
	count, max, sum, avg, min
# Deterministic functions: the result is determined by the input (deterministic)
	to_date()
	date_add()
# Non-deterministic functions: the result may differ from call to call (nondeterministic)
	current_date()

User-defined functions (UDF)

# Steps for writing a custom function
# 0. Add the dependencies
	<dependencies>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>1.2.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.9.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>1.2.1</version>
        </dependency>
    </dependencies>
# 1. Write a class that extends UDF
	1. it must extend UDF
	2. the method must be named evaluate
	3. it must return a value
# 2. Package the class into a jar, upload it to Linux, and add it to Hive's classpath (using beeline)
	mvn clean
	mvn package
	add jar <local jar path on Linux>: enter beeline, then run add jar /opt/data/hive-1.0-SNAPSHOT.jar;	
# 3. Register a function for the class in the jar.
	create [temporary] function <name> as "<fully qualified UDF class name>";
	create temporary function hello as "function.HelloUDF";
	[temporary] means the function only exists for the session
# 4. Use it
	select hello(name,addr.city) from t_person1;
Installing the awkward transitive dependency (pentaho-aggdesigner-algorithm)
# Download
https://public.nexus.pentaho.org/repository/proxied-pentaho-public-repos-group/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde-javadoc.jar
# Put it in a local directory whose path contains no non-ASCII characters
D:\work\pentaho-aggdesigner-algorithm-5.1.5-jhyde-javadoc.jar
# Install it into the local Maven repository
D:\work> mvn install:install-file -DgroupId=org.pentaho -DartifactId=pentaho-aggdesigner-algorithm  -Dversion=5.1.5-jhyde  -Dpackaging=jar  -Dfile=pentaho-aggdesigner-algorithm-5.1.5-jhyde-javadoc.jar
# 1. Write a class that extends UDF
package function;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
@Description(
        // write these strings in English: Chinese here tends to show up garbled later
        name = "hello",          // function name
        value = "hello(s1,s2) return 's1 hello s2 play good'",         // function description
        extended = "select hello(c1,c2) from table"       // usage example
)
public class HelloUDF extends UDF {
    // the method must be named evaluate
    public String evaluate(String s1,String s2){
        return s1+" hello,"+s2+" good?";
    }
}
# 2. Package the class into a jar, upload it to Linux, and add it to Hive's classpath (using beeline)
<properties>
    <!-- avoid GBK encoding problems -->
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<build>
     <finalName>DateStyleTrans</finalName>
     <plugins>
       <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-jar-plugin</artifactId>
          <version>2.4</version>
         <configuration>
             <includes>
                <!-- package every class under the function package and its sub-packages -->
                <include>**/function/**</include>
             </includes>
          </configuration>
       </plugin>
     </plugins>
</build>
	mvn clean
	mvn package
	add jar <local jar path on Linux>: enter beeline, then run add jar /opt/data/hive-1.0-SNAPSHOT.jar;
# 3. Register a function for the class in the jar.
	create [temporary] function <name> as "<fully qualified UDF class name>";
	create temporary function hello as "function.HelloUDF";
	[temporary] means the function only exists for the session
	To drop the imported function:
	drop [temporary] function hello;
# 4. Inspect and use the function
-- 1. Inspect the function
desc function hello;
desc function extended hello;
-- 2. Use the function in a query
select hello(name,addr.city) from t_person1;

An auto-incrementing column (a non-deterministic function)

# Define a function get_num()
select get_num() num,id,name,salary from t_person1;


# Implementation
1. Define a Java class
	extend UDF
	write an evaluate method
package function;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

/**
 * Hive creates only one instance of this object and calls evaluate once per row:
 * UDF udf = new NumberUDF();
 * while (row.next) {
 *     udf.evaluate();
 * }
 */
@UDFType(deterministic = false) // non-deterministic function
public class NumberUDF extends UDF {
    private long index = 0;
    public  long evaluate(){
        index++;
        return index;
    }
}
2. Package it
	mvn clean package
3. Upload it to Linux under /opt/data
4. Add it to Hive's classpath
	add jar /opt/data/hive-1.0-SNAPSHOT.jar;
5. Create the function
	create temporary function get_num as "function.NumberUDF";
6. Use it
	select get_num() num,id,name,salary from t_person1;

Row-to-column transposition

A row/column transposition example

# Example table and data
--## table (movie viewing log)
create table t_visit_video (
    username string,
    video_name string,
    video_date date
)row format delimited fields terminated by ',';

load data local inpath "/opt/data/hive/video.txt" into table t_visit_video;

--## data: video.txt
张三,大唐双龙传,2020-03-21
李四,天下无贼,2020-03-21
张三,神探狄仁杰,2020-03-21
李四,霸王别姬,2020-03-21
李四,霸王别姬,2020-03-21
王五,机器人总动员,2020-03-21
王五,放牛班的春天,2020-03-21
王五,盗梦空间,2020-03-21


# collect_list (aggregate function)
Purpose: after grouping, collect the values of a column within each group into a list (duplicates kept).
Syntax: select collect_list(col) from <table> group by <grouping column>;
select username,collect_list(video_name) from t_visit_video group by username;


# collect_set (aggregate function)
Purpose: after grouping, collect the values of a column within each group into a set (deduplicated).
Syntax: select collect_set(col) from <table> group by <grouping column>;
select username,collect_set(video_name) from t_visit_video group by username;


# concat_ws (single-row function)
Purpose: if a column is an array, join its elements into one string with the given separator.
Syntax: concat_ws(separator, array_column)
select id,name,concat_ws(',',hobbies) from t_person;
--# collapse t_visit_video to one row per user
-- list, per person, the movies watched on 2020-03-21
select username,concat_ws(',',collect_set(video_name)) from t_visit_video group by username;


Saving query results into a table

Insert the result of a SQL statement into another table:

create table <table> as <select statement>

Example:
create table t_visit_result as select username,concat_ws(',',collect_set(video_name)) from t_visit_video group by username;

Hive vs. HBase

What is the difference?

Hive: a data warehouse built on top of the Hadoop infrastructure. Through Hive you can query data stored on HDFS with HQL. (It is not meant for CRUD-style access, and latency is high.)

Essence: a SQL execution engine. It turns SQL into MapReduce programs and hands the MR jobs to YARN to run; think of it as a tool.

HBase: a database built on HDFS that manages the underlying storage layout of the data and its in-memory handling.

It is a key/value store that runs on top of HDFS.

Core role

Hive: a SQL execution engine whose core purpose is to save you from writing MapReduce code.

HBase: a distributed NoSQL database built on HDFS, designed for real-time lookups over massive data sets (it stores and manages the data).

Use cases

Hive: offline analysis of massive data, e.g. trend calculations over web site logs (results about a day behind).

HBase: real-time queries over massive data (second-level latency).

Analytical queries

Data preparation

7369,SMITH,CLERK,7902,1980-12-17 00:00:00,800,\N,20
7499,ALLEN,SALESMAN,7698,1981-02-20 00:00:00,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22 00:00:00,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02 00:00:00,2975,\N,20
7654,MARTIN,SALESMAN,7698,1981-09-28 00:00:00,1250,1400,30
7698,BLAKE,MANAGER,7839,1981-05-01 00:00:00,2850,\N,30
7782,CLARK,MANAGER,7839,1981-06-09 00:00:00,2450,\N,10
7788,SCOTT,ANALYST,7566,1987-04-19 00:00:00,1500,\N,20
7839,KING,PRESIDENT,\N,1981-11-17 00:00:00,5000,\N,10
7844,TURNER,SALESMAN,7698,1981-09-08 00:00:00,1500,0,30
7876,ADAMS,CLERK,7788,1987-05-23 00:00:00,1100,\N,20
7900,JAMES,CLERK,7698,1981-12-03 00:00:00,950,\N,30
7902,FORD,ANALYST,7566,1981-12-03 00:00:00,3000,\N,20
7934,MILLER,CLERK,7782,1982-01-23 00:00:00,1300,\N,10
CREATE TABLE t_employee(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2),
    deptno INT)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
CREATE TABLE t_dept(
    DEPTNO INT,
    DNAME STRING,
    LOC STRING)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;
0: jdbc:hive2://CentOS:10000> select empno,ename,job,mgr,hiredate,sal,comm,deptno  from t_employee;
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| empno  |  ename  |    job     |  mgr  |        hiredate        |  sal  | comm  | deptno  |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| 7369   | SMITH   | CLERK      | 7902  | 1980-12-17 00:00:00.0  | 800   | NULL  | 20      |
| 7499   | ALLEN   | SALESMAN   | 7698  | 1981-02-20 00:00:00.0  | 1600  | 300   | 30      |
| 7521   | WARD    | SALESMAN   | 7698  | 1981-02-22 00:00:00.0  | 1250  | 500   | 30      |
| 7566   | JONES   | MANAGER    | 7839  | 1981-04-02 00:00:00.0  | 2975  | NULL  | 20      |
| 7654   | MARTIN  | SALESMAN   | 7698  | 1981-09-28 00:00:00.0  | 1250  | 1400  | 30      |
| 7698   | BLAKE   | MANAGER    | 7839  | 1981-05-01 00:00:00.0  | 2850  | NULL  | 30      |
| 7782   | CLARK   | MANAGER    | 7839  | 1981-06-09 00:00:00.0  | 2450  | NULL  | 10      |
| 7788   | SCOTT   | ANALYST    | 7566  | 1987-04-19 00:00:00.0  | 1500  | NULL  | 20      |
| 7839   | KING    | PRESIDENT  | NULL  | 1981-11-17 00:00:00.0  | 5000  | NULL  | 10      |
| 7844   | TURNER  | SALESMAN   | 7698  | 1981-09-08 00:00:00.0  | 1500  | 0     | 30      |
| 7876   | ADAMS   | CLERK      | 7788  | 1987-05-23 00:00:00.0  | 1100  | NULL  | 20      |
| 7900   | JAMES   | CLERK      | 7698  | 1981-12-03 00:00:00.0  | 950   | NULL  | 30      |
| 7902   | FORD    | ANALYST    | 7566  | 1981-12-03 00:00:00.0  | 3000  | NULL  | 20      |
| 7934   | MILLER  | CLERK      | 7782  | 1982-01-23 00:00:00.0  | 1300  | NULL  | 10      |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
14 rows selected (0.047 seconds)
0: jdbc:hive2://CentOS:10000> select deptno,dname,loc from t_dept;
+---------+-------------+-----------+--+
| deptno  |    dname    |    loc    |
+---------+-------------+-----------+--+
| 10      | ACCOUNTING  | NEW YORK  |
| 20      | RESEARCH    | DALLAS    |
| 30      | SALES       | CHICAGO   |
| 40      | OPERATIONS  | BOSTON    |
+---------+-------------+-----------+--+
4 rows selected (0.046 seconds)
CREATE TABLE t_employee_partition(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2))
PARTITIONED BY(deptno INT) 
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;
0: jdbc:hive2://CentOS:10000> set hive.exec.dynamic.partition.mode=nonstrict
0: jdbc:hive2://CentOS:10000> INSERT OVERWRITE TABLE t_employee_partition PARTITION (deptno)  SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM t_employee;

SQL queries

Single-table query
0: jdbc:hive2://CentOS:10000> select empno,ename,job,mgr,hiredate,sal,comm,deptno  from t_employee;
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| empno  |  ename  |    job     |  mgr  |        hiredate        |  sal  | comm  | deptno  |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| 7369   | SMITH   | CLERK      | 7902  | 1980-12-17 00:00:00.0  | 800   | NULL  | 20      |
| 7499   | ALLEN   | SALESMAN   | 7698  | 1981-02-20 00:00:00.0  | 1600  | 300   | 30      |
| 7521   | WARD    | SALESMAN   | 7698  | 1981-02-22 00:00:00.0  | 1250  | 500   | 30      |
| 7566   | JONES   | MANAGER    | 7839  | 1981-04-02 00:00:00.0  | 2975  | NULL  | 20      |
| 7654   | MARTIN  | SALESMAN   | 7698  | 1981-09-28 00:00:00.0  | 1250  | 1400  | 30      |
| 7698   | BLAKE   | MANAGER    | 7839  | 1981-05-01 00:00:00.0  | 2850  | NULL  | 30      |
| 7782   | CLARK   | MANAGER    | 7839  | 1981-06-09 00:00:00.0  | 2450  | NULL  | 10      |
| 7788   | SCOTT   | ANALYST    | 7566  | 1987-04-19 00:00:00.0  | 1500  | NULL  | 20      |
| 7839   | KING    | PRESIDENT  | NULL  | 1981-11-17 00:00:00.0  | 5000  | NULL  | 10      |
| 7844   | TURNER  | SALESMAN   | 7698  | 1981-09-08 00:00:00.0  | 1500  | 0     | 30      |
| 7876   | ADAMS   | CLERK      | 7788  | 1987-05-23 00:00:00.0  | 1100  | NULL  | 20      |
| 7900   | JAMES   | CLERK      | 7698  | 1981-12-03 00:00:00.0  | 950   | NULL  | 30      |
| 7902   | FORD    | ANALYST    | 7566  | 1981-12-03 00:00:00.0  | 3000  | NULL  | 20      |
| 7934   | MILLER  | CLERK      | 7782  | 1982-01-23 00:00:00.0  | 1300  | NULL  | 10      |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
14 rows selected (0.056 seconds)
WHERE query
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM t_employee WHERE empno > 7782 AND deptno = 10;
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| empno  |  ename  |    job     |  mgr  |        hiredate        |  sal  | comm  | deptno  |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| 7839   | KING    | PRESIDENT  | NULL  | 1981-11-17 00:00:00.0  | 5000  | NULL  | 10      |
| 7934   | MILLER  | CLERK      | 7782  | 1982-01-23 00:00:00.0  | 1300  | NULL  | 10      |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
2 rows selected (0.067 seconds)
DISTINCT query
0: jdbc:hive2://CentOS:10000> select distinct(job) from t_employee;
+------------+--+
|    job     |
+------------+--+
| ANALYST    |
| CLERK      |
| MANAGER    |
| PRESIDENT  |
| SALESMAN   |
+------------+--+
Partition query
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM t_employee_partition e  WHERE e.deptno >= 20 AND e.deptno <= 40;
+--------+---------+-----------+-------+------------------------+-------+-------+---------+--+
| empno  |  ename  |    job    |  mgr  |        hiredate        |  sal  | comm  | deptno  |
+--------+---------+-----------+-------+------------------------+-------+-------+---------+--+
| 7369   | SMITH   | CLERK     | 7902  | 1980-12-17 00:00:00.0  | 800   | NULL  | 20      |
| 7566   | JONES   | MANAGER   | 7839  | 1981-04-02 00:00:00.0  | 2975  | NULL  | 20      |
| 7788   | SCOTT   | ANALYST   | 7566  | 1987-04-19 00:00:00.0  | 1500  | NULL  | 20      |
| 7876   | ADAMS   | CLERK     | 7788  | 1987-05-23 00:00:00.0  | 1100  | NULL  | 20      |
| 7902   | FORD    | ANALYST   | 7566  | 1981-12-03 00:00:00.0  | 3000  | NULL  | 20      |
| 7499   | ALLEN   | SALESMAN  | 7698  | 1981-02-20 00:00:00.0  | 1600  | 300   | 30      |
| 7521   | WARD    | SALESMAN  | 7698  | 1981-02-22 00:00:00.0  | 1250  | 500   | 30      |
| 7654   | MARTIN  | SALESMAN  | 7698  | 1981-09-28 00:00:00.0  | 1250  | 1400  | 30      |
| 7698   | BLAKE   | MANAGER   | 7839  | 1981-05-01 00:00:00.0  | 2850  | NULL  | 30      |
| 7844   | TURNER  | SALESMAN  | 7698  | 1981-09-08 00:00:00.0  | 1500  | 0     | 30      |
| 7900   | JAMES   | CLERK     | 7698  | 1981-12-03 00:00:00.0  | 950   | NULL  | 30      |
+--------+---------+-----------+-------+------------------------+-------+-------+---------+--+
11 rows selected (0.123 seconds)
LIMIT query
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM t_employee  ORDER BY sal DESC LIMIT 5;
+--------+--------+------------+-------+------------------------+-------+-------+---------+--+
| empno  | ename  |    job     |  mgr  |        hiredate        |  sal  | comm  | deptno  |
+--------+--------+------------+-------+------------------------+-------+-------+---------+--+
| 7839   | KING   | PRESIDENT  | NULL  | 1981-11-17 00:00:00.0  | 5000  | NULL  | 10      |
| 7902   | FORD   | ANALYST    | 7566  | 1981-12-03 00:00:00.0  | 3000  | NULL  | 20      |
| 7566   | JONES  | MANAGER    | 7839  | 1981-04-02 00:00:00.0  | 2975  | NULL  | 20      |
| 7698   | BLAKE  | MANAGER    | 7839  | 1981-05-01 00:00:00.0  | 2850  | NULL  | 30      |
| 7782   | CLARK  | MANAGER    | 7839  | 1981-06-09 00:00:00.0  | 2450  | NULL  | 10      |
+--------+--------+------------+-------+------------------------+-------+-------+---------+--+
5 rows selected (14.294 seconds)
GROUP BY query
0: jdbc:hive2://CentOS:10000> set hive.map.aggr=true;
0: jdbc:hive2://CentOS:10000> SELECT deptno,SUM(sal) as total FROM t_employee GROUP BY deptno;
+---------+--------+--+
| deptno  | total  |
+---------+--------+--+
| 10      | 8750   |
| 20      | 9375   |
| 30      | 9400   |
+---------+--------+--+
3 rows selected (12.645 seconds)

hive.map.aggr controls how aggregation is executed (it defaults to true in Hive 1.x). When enabled, Hive performs a first, partial aggregation already in the map phase, which makes aggregation more efficient but consumes more memory.

ORDER AND SORT

You can sort query results with ORDER BY or SORT BY. The sort key can be an integer or a string: integers sort numerically, strings sort lexicographically. The difference between ORDER BY and SORT BY: ORDER BY sends all rows through a single reducer, which guarantees a global ordering; SORT BY only sorts within each reducer, so each reducer's output is ordered but there is no global order. Because ORDER BY can take a long time, strict mode (hive.mapred.mode = strict) requires it to be followed by a LIMIT clause.

  • sort by
0: jdbc:hive2://CentOS:10000> set mapreduce.job.reduces=2
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal from t_employee sort by sal desc;
+--------+---------+-------+--+
| empno  |  ename  |  sal  |
+--------+---------+-------+--+
| 7902   | FORD    | 3000  |
| 7566   | JONES   | 2975  |
| 7844   | TURNER  | 1500  |
| 7788   | SCOTT   | 1500  |
| 7521   | WARD    | 1250  |
| 7654   | MARTIN  | 1250  |
| 7876   | ADAMS   | 1100  |
| 7900   | JAMES   | 950   |
| 7369   | SMITH   | 800   |
| 7839   | KING    | 5000  |
| 7698   | BLAKE   | 2850  |
| 7782   | CLARK   | 2450  |
| 7499   | ALLEN   | 1600  |
| 7934   | MILLER  | 1300  |
+--------+---------+-------+--+
14 rows selected (14.474 seconds)
  • order by
0: jdbc:hive2://CentOS:10000> set mapreduce.job.reduces=3;
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal from t_employee order by sal desc;
+--------+---------+-------+--+
| empno  |  ename  |  sal  |
+--------+---------+-------+--+
| 7839   | KING    | 5000  |
| 7902   | FORD    | 3000  |
| 7566   | JONES   | 2975  |
| 7698   | BLAKE   | 2850  |
| 7782   | CLARK   | 2450  |
| 7499   | ALLEN   | 1600  |
| 7844   | TURNER  | 1500  |
| 7788   | SCOTT   | 1500  |
| 7934   | MILLER  | 1300  |
| 7654   | MARTIN  | 1250  |
| 7521   | WARD    | 1250  |
| 7876   | ADAMS   | 1100  |
| 7900   | JAMES   | 950   |
| 7369   | SMITH   | 800   |
+--------+---------+-------+--+
14 rows selected (13.049 seconds)
0: jdbc:hive2://CentOS:10000> set hive.mapred.mode = strict;
No rows affected (0.004 seconds)
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal from t_employee order by sal desc;
Error: Error while compiling statement: FAILED: SemanticException 1:48 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'sal' (state=42000,code=40000)
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal from t_employee order by sal desc limit 5; 
+--------+--------+-------+--+
| empno  | ename  |  sal  |
+--------+--------+-------+--+
| 7839   | KING   | 5000  |
| 7902   | FORD   | 3000  |
| 7566   | JONES  | 2975  |
| 7698   | BLAKE  | 2850  |
| 7782   | CLARK  | 2450  |
+--------+--------+-------+--+
5 rows selected (12.468 seconds)

HAVING filter

0: jdbc:hive2://CentOS:10000> SELECT deptno,SUM(sal) total FROM t_employee GROUP BY deptno HAVING SUM(sal)>9000;
+---------+--------+--+
| deptno  | total  |
+---------+--------+--+
| 30      | 9400   |
| 20      | 9375   |
+---------+--------+--+
2 rows selected (18.361 seconds)
DISTRIBUTE BY

By default, MapReduce hashes the map output keys and distributes them evenly across all reducers. If you want rows with the same key to be processed by the same reducer, use the DISTRIBUTE BY clause. Note that DISTRIBUTE BY guarantees that rows with equal keys land on the same reducer, but it does not guarantee that the data on each reducer is sorted.

0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal, deptno  FROM t_employee distribute BY deptno;
+--------+---------+-------+---------+--+
| empno  |  ename  |  sal  | deptno  |
+--------+---------+-------+---------+--+
| 7654   | MARTIN  | 1250  | 30      |
| 7900   | JAMES   | 950   | 30      |
| 7698   | BLAKE   | 2850  | 30      |
| 7521   | WARD    | 1250  | 30      |
| 7844   | TURNER  | 1500  | 30      |
| 7499   | ALLEN   | 1600  | 30      |
| 7934   | MILLER  | 1300  | 10      |
| 7839   | KING    | 5000  | 10      |
| 7782   | CLARK   | 2450  | 10      |
| 7788   | SCOTT   | 1500  | 20      |
| 7566   | JONES   | 2975  | 20      |
| 7876   | ADAMS   | 1100  | 20      |
| 7902   | FORD    | 3000  | 20      |
| 7369   | SMITH   | 800   | 20      |
+--------+---------+-------+---------+--+
14 rows selected (15.504 seconds)
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal, deptno  FROM t_employee distribute BY deptno sort by sal desc;
+--------+---------+-------+---------+--+
| empno  |  ename  |  sal  | deptno  |
+--------+---------+-------+---------+--+
| 7698   | BLAKE   | 2850  | 30      |
| 7499   | ALLEN   | 1600  | 30      |
| 7844   | TURNER  | 1500  | 30      |
| 7521   | WARD    | 1250  | 30      |
| 7654   | MARTIN  | 1250  | 30      |
| 7900   | JAMES   | 950   | 30      |
| 7839   | KING    | 5000  | 10      |
| 7782   | CLARK   | 2450  | 10      |
| 7934   | MILLER  | 1300  | 10      |
| 7902   | FORD    | 3000  | 20      |
| 7566   | JONES   | 2975  | 20      |
| 7788   | SCOTT   | 1500  | 20      |
| 7876   | ADAMS   | 1100  | 20      |
| 7369   | SMITH   | 800   | 20      |
+--------+---------+-------+---------+--+
14 rows selected (16.528 seconds)
CLUSTER BY

If SORT BY and DISTRIBUTE BY specify the same column and the SORT BY direction is ASC, the pair can be replaced by CLUSTER BY.

0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal, deptno  FROM t_employee cluster by deptno;
+--------+---------+-------+---------+--+
| empno  |  ename  |  sal  | deptno  |
+--------+---------+-------+---------+--+
| 7934   | MILLER  | 1300  | 10      |
| 7839   | KING    | 5000  | 10      |
| 7782   | CLARK   | 2450  | 10      |
| 7876   | ADAMS   | 1100  | 20      |
| 7788   | SCOTT   | 1500  | 20      |
| 7369   | SMITH   | 800   | 20      |
| 7566   | JONES   | 2975  | 20      |
| 7902   | FORD    | 3000  | 20      |
| 7844   | TURNER  | 1500  | 30      |
| 7499   | ALLEN   | 1600  | 30      |
| 7698   | BLAKE   | 2850  | 30      |
| 7654   | MARTIN  | 1250  | 30      |
| 7521   | WARD    | 1250  | 30      |
| 7900   | JAMES   | 950   | 30      |
+--------+---------+-------+---------+--+
14 rows selected (25.847 seconds)
Table JOIN queries

Hive supports inner joins, outer joins (left, right, and full), and cartesian (cross) joins, with the same semantics as in traditional databases. One point deserves emphasis: the join condition must be given with ON, not with WHERE; otherwise Hive (at least in older versions) first computes the cartesian product and only then filters, which is expensive and may not return what you expect.
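A sketch of the two styles (not taken from the session below): the first puts the join condition in ON; the second relies on WHERE alone, which older Hive versions evaluate as a cross join followed by a filter, and which strict mode may reject outright:

SELECT e.ename, d.dname FROM t_employee e JOIN t_dept d ON e.deptno = d.deptno;
SELECT e.ename, d.dname FROM t_employee e JOIN t_dept d WHERE e.deptno = d.deptno;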

  • Inner join
0: jdbc:hive2://CentOS:10000>  SELECT e.empno,e.ename,e.sal,d.dname,d.deptno FROM t_employee e JOIN t_dept d ON e.deptno = d.deptno WHERE e.empno=7369;
+----------+----------+--------+-----------+-----------+--+
| e.empno  | e.ename  | e.sal  |  d.dname  | d.deptno  |
+----------+----------+--------+-----------+-----------+--+
| 7369     | SMITH    | 800    | RESEARCH  | 20        |
+----------+----------+--------+-----------+-----------+--+
1 row selected (10.419 seconds)
  • Outer joins
0: jdbc:hive2://CentOS:10000>  SELECT e.empno,e.ename,e.sal,d.dname,d.deptno FROM t_employee e LEFT OUTER JOIN t_dept d ON e.deptno = d.deptno;
+----------+----------+--------+-------------+-----------+--+
| e.empno  | e.ename  | e.sal  |   d.dname   | d.deptno  |
+----------+----------+--------+-------------+-----------+--+
| 7369     | SMITH    | 800    | RESEARCH    | 20        |
| 7499     | ALLEN    | 1600   | SALES       | 30        |
| 7521     | WARD     | 1250   | SALES       | 30        |
| 7566     | JONES    | 2975   | RESEARCH    | 20        |
| 7654     | MARTIN   | 1250   | SALES       | 30        |
| 7698     | BLAKE    | 2850   | SALES       | 30        |
| 7782     | CLARK    | 2450   | ACCOUNTING  | 10        |
| 7788     | SCOTT    | 1500   | RESEARCH    | 20        |
| 7839     | KING     | 5000   | ACCOUNTING  | 10        |
| 7844     | TURNER   | 1500   | SALES       | 30        |
| 7876     | ADAMS    | 1100   | RESEARCH    | 20        |
| 7900     | JAMES    | 950    | SALES       | 30        |
| 7902     | FORD     | 3000   | RESEARCH    | 20        |
| 7934     | MILLER   | 1300   | ACCOUNTING  | 10        |
+----------+----------+--------+-------------+-----------+--+
14 rows selected (11.424 seconds)
0: jdbc:hive2://CentOS:10000>  SELECT e.empno,e.ename,e.sal,d.dname,d.deptno FROM t_employee e RIGHT OUTER JOIN t_dept d ON e.deptno = d.deptno;
+----------+----------+--------+-------------+-----------+--+
| e.empno  | e.ename  | e.sal  |   d.dname   | d.deptno  |
+----------+----------+--------+-------------+-----------+--+
| 7782     | CLARK    | 2450   | ACCOUNTING  | 10        |
| 7839     | KING     | 5000   | ACCOUNTING  | 10        |
| 7934     | MILLER   | 1300   | ACCOUNTING  | 10        |
| 7369     | SMITH    | 800    | RESEARCH    | 20        |
| 7566     | JONES    | 2975   | RESEARCH    | 20        |
| 7788     | SCOTT    | 1500   | RESEARCH    | 20        |
| 7876     | ADAMS    | 1100   | RESEARCH    | 20        |
| 7902     | FORD     | 3000   | RESEARCH    | 20        |
| 7499     | ALLEN    | 1600   | SALES       | 30        |
| 7521     | WARD     | 1250   | SALES       | 30        |
| 7654     | MARTIN   | 1250   | SALES       | 30        |
| 7698     | BLAKE    | 2850   | SALES       | 30        |
| 7844     | TURNER   | 1500   | SALES       | 30        |
| 7900     | JAMES    | 950    | SALES       | 30        |
| NULL     | NULL     | NULL   | OPERATIONS  | 40        |
+----------+----------+--------+-------------+-----------+--+
15 rows selected (11.063 seconds)
0: jdbc:hive2://CentOS:10000>  SELECT e.empno,e.ename,e.sal,d.dname,d.deptno FROM t_employee e FULL OUTER JOIN t_dept d ON e.deptno = d.deptno;
+----------+----------+--------+-------------+-----------+--+
| e.empno  | e.ename  | e.sal  |   d.dname   | d.deptno  |
+----------+----------+--------+-------------+-----------+--+
| 7934     | MILLER   | 1300   | ACCOUNTING  | 10        |
| 7839     | KING     | 5000   | ACCOUNTING  | 10        |
| 7782     | CLARK    | 2450   | ACCOUNTING  | 10        |
| 7876     | ADAMS    | 1100   | RESEARCH    | 20        |
| 7788     | SCOTT    | 1500   | RESEARCH    | 20        |
| 7369     | SMITH    | 800    | RESEARCH    | 20        |
| 7566     | JONES    | 2975   | RESEARCH    | 20        |
| 7902     | FORD     | 3000   | RESEARCH    | 20        |
| 7844     | TURNER   | 1500   | SALES       | 30        |
| 7499     | ALLEN    | 1600   | SALES       | 30        |
| 7698     | BLAKE    | 2850   | SALES       | 30        |
| 7654     | MARTIN   | 1250   | SALES       | 30        |
| 7521     | WARD     | 1250   | SALES       | 30        |
| 7900     | JAMES    | 950    | SALES       | 30        |
| NULL     | NULL     | NULL   | OPERATIONS  | 40        |
+----------+----------+--------+-------------+-----------+--+
15 rows selected (24.703 seconds)

12、LEFT SEMI JOIN

LEFT SEMI JOIN is a more efficient way to express an IN/EXISTS subquery.

  • The right-hand table can only be filtered in the ON clause;
  • The result contains data from the left table only, so only left-table columns may appear in the SELECT list.
0: jdbc:hive2://CentOS:10000> SELECT e.empno,e.ename,e.deptno FROM t_employee e LEFT SEMI JOIN t_dept d ON e.deptno = d.deptno AND d.loc="NEW YORK";
+----------+----------+-----------+--+
| e.empno  | e.ename  | e.deptno  |
+----------+----------+-----------+--+
| 7782     | CLARK    | 10        |
| 7839     | KING     | 10        |
| 7934     | MILLER   | 10        |
+----------+----------+-----------+--+
3 rows selected (10.119 seconds)
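For comparison, the same result can be expressed as an IN subquery (supported in the WHERE clause since Hive 0.13); the LEFT SEMI JOIN above is the classic, usually more efficient way to write it:

SELECT e.empno, e.ename, e.deptno FROM t_employee e WHERE e.deptno IN (SELECT d.deptno FROM t_dept d WHERE d.loc = "NEW YORK");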
JOIN optimization
  • STREAMTABLE

When several tables are joined and every ON clause uses the same join column, Hive optimizes the query into a single map/reduce job. Hive also assumes that the last table in the query is the largest one: while joining each row it tries to buffer the other tables in memory and stream the last table through. The user should therefore arrange the tables so that their sizes increase from left to right.

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key) JOIN c ON (c.key = b.key)

However, the largest table does not always have to be placed last: Hive provides the /*+ STREAMTABLE(...) */ hint to mark which table should be streamed, avoiding the memory pressure caused by buffering an oversized table. For example:

0: jdbc:hive2://CentOS:10000> SELECT /*+ STREAMTABLE(e) */ e.empno,e.ename,d.dname,d.deptno FROM t_employee e JOIN t_dept d ON e.deptno = d.deptno WHERE job='CLERK';
+----------+----------+-------------+-----------+--+
| e.empno  | e.ename  |   d.dname   | d.deptno  |
+----------+----------+-------------+-----------+--+
| 7369     | SMITH    | RESEARCH    | 20        |
| 7876     | ADAMS    | RESEARCH    | 20        |
| 7900     | JAMES    | SALES       | 30        |
| 7934     | MILLER   | ACCOUNTING  | 10        |
+----------+----------+-------------+-----------+--+
4 rows selected (11.645 seconds)
  • MAPJOIN

If one side of a join is very small, the join can be pushed down to the map phase; this is the classic "huge table joined to a tiny table" scenario. There are two ways to do it: 1. add a /*+ MAPJOIN(b) */ hint, where b is the alias of the small table; 2. set the parameter hive.optimize.bucketmapjoin = true. For example:

0: jdbc:hive2://CentOS:10000> SELECT /*+ MAPJOIN(d) */ e.empno, e.ename,d.dname FROM t_employee e  JOIN t_dept d ON d.deptno = e.deptno;
+----------+----------+-------------+--+
| e.empno  | e.ename  |   d.dname   |
+----------+----------+-------------+--+
| 7369     | SMITH    | RESEARCH    |
| 7499     | ALLEN    | SALES       |
| 7521     | WARD     | SALES       |
| 7566     | JONES    | RESEARCH    |
| 7654     | MARTIN   | SALES       |
| 7698     | BLAKE    | SALES       |
| 7782     | CLARK    | ACCOUNTING  |
| 7788     | SCOTT    | RESEARCH    |
| 7839     | KING     | ACCOUNTING  |
| 7844     | TURNER   | SALES       |
| 7876     | ADAMS    | RESEARCH    |
| 7900     | JAMES    | SALES       |
| 7902     | FORD     | RESEARCH    |
| 7934     | MILLER   | ACCOUNTING  |
+----------+----------+-------------+--+
14 rows selected (11.416 seconds)
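On recent Hive versions the hint is often unnecessary: with hive.auto.convert.join enabled, the optimizer converts the join into a map join automatically when the small table is below hive.mapjoin.smalltable.filesize. A minimal sketch (the exact defaults depend on the Hive version):

set hive.auto.convert.join=true;
SELECT e.empno, e.ename, d.dname FROM t_employee e JOIN t_dept d ON d.deptno = e.deptno;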
Window functions
0: jdbc:hive2://CentOS:10000> select e.empno ,e.ename,e.sal,e.deptno,rank() over(partition by e.deptno order by e.sal desc) as rank from t_employee e; 
+----------+----------+--------+-----------+-------+--+
| e.empno  | e.ename  | e.sal  | e.deptno  | rank  |
+----------+----------+--------+-----------+-------+--+
| 7839     | KING     | 5000   | 10        | 1     |
| 7782     | CLARK    | 2450   | 10        | 2     |
| 7934     | MILLER   | 1300   | 10        | 3     |
| 7902     | FORD     | 3000   | 20        | 1     |
| 7566     | JONES    | 2975   | 20        | 2     |
| 7788     | SCOTT    | 1500   | 20        | 3     |
| 7876     | ADAMS    | 1100   | 20        | 4     |
| 7369     | SMITH    | 800    | 20        | 5     |
| 7698     | BLAKE    | 2850   | 30        | 1     |
| 7499     | ALLEN    | 1600   | 30        | 2     |
| 7844     | TURNER   | 1500   | 30        | 3     |
| 7654     | MARTIN   | 1250   | 30        | 4     |
| 7521     | WARD     | 1250   | 30        | 4     |
| 7900     | JAMES    | 950    | 30        | 6     |
+----------+----------+--------+-----------+-------+--+
0: jdbc:hive2://CentOS:10000> select e.empno ,e.ename,e.sal,e.deptno,dense_rank() over(partition by e.deptno order by e.sal desc) as rank from t_employee e; 
+----------+----------+--------+-----------+-------+--+
| e.empno  | e.ename  | e.sal  | e.deptno  | rank  |
+----------+----------+--------+-----------+-------+--+
| 7839     | KING     | 5000   | 10        | 1     |
| 7782     | CLARK    | 2450   | 10        | 2     |
| 7934     | MILLER   | 1300   | 10        | 3     |
| 7902     | FORD     | 3000   | 20        | 1     |
| 7566     | JONES    | 2975   | 20        | 2     |
| 7788     | SCOTT    | 1500   | 20        | 3     |
| 7876     | ADAMS    | 1100   | 20        | 4     |
| 7369     | SMITH    | 800    | 20        | 5     |
| 7698     | BLAKE    | 2850   | 30        | 1     |
| 7499     | ALLEN    | 1600   | 30        | 2     |
| 7844     | TURNER   | 1500   | 30        | 3     |
| 7654     | MARTIN   | 1250   | 30        | 4     |
| 7521     | WARD     | 1250   | 30        | 4     |
| 7900     | JAMES    | 950    | 30        | 5     |
+----------+----------+--------+-----------+-------+--+
14 rows selected (24.262 seconds)
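Besides rank() (which skips positions after ties) and dense_rank() (which does not), row_number() numbers rows consecutively even across ties; a sketch along the same lines, output omitted:

select e.empno,e.ename,e.sal,e.deptno,row_number() over(partition by e.deptno order by e.sal desc) as rn from t_employee e;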

1、Show each employee together with the average salary of the employee's department

select e1.empno,e1.ename,e3.avg,e3.dno from t_employee e1 left join (select e2.deptno dno,avg(e2.sal) avg from t_employee e2 group by e2.deptno) e3 on e1.deptno=e3.dno;
The window-function equivalent:
select e1.empno,e1.ename, avg(e1.sal) over (partition by e1.deptno) from t_employee e1;

2、Show each employee together with the employee's salary rank within their department

select e1.empno,e1.ename,e1.sal,e1.deptno,sum(1) over (partition by e1.deptno ) from t_employee e1;
An equivalent form with the window frame written out explicitly (sum(1) over the whole partition is simply the department headcount):
select e1.empno,e1.ename,e1.sal,e1.deptno,sum(1) over (partition by e1.deptno order by e1.sal rows between unbounded preceding and unbounded following) from t_employee e1;

Restricting the frame to the rows up to the current one turns sum(1) into a running count, i.e. the employee's position by salary within the department (ties get consecutive numbers, like row_number):
select e1.empno,e1.ename,e1.sal,e1.deptno,sum(1) over (partition by e1.deptno order by e1.sal desc rows between unbounded preceding and current row) from t_employee e1 order by e1.deptno;
The window frame clauses used above:
rows between unbounded preceding and unbounded following   (the whole partition, from its first row to its last)
rows between unbounded preceding and current row            (from the first row of the partition up to the current row)
rows between current row and unbounded following            (from the current row to the last row of the partition)
rows between 1 preceding and 1 following                    (from one row before to one row after the current row)
CUBE analysis
0: jdbc:hive2://CentOS:10000> select e.deptno,e.job,avg(e.sal) avg,max(e.sal) max,min(e.sal) min from t_employee e group by e.deptno,e.job with cube;
+-----------+------------+--------------+-------+-------+--+
| e.deptno  |   e.job    |     avg      |  max  |  min  |
+-----------+------------+--------------+-------+-------+--+
| NULL      | ANALYST    | 2250         | 3000  | 1500  |
| 10        | CLERK      | 1300         | 1300  | 1300  |
| 20        | CLERK      | 950          | 1100  | 800   |
| 30        | CLERK      | 950          | 950   | 950   |
| 20        | ANALYST    | 2250         | 3000  | 1500  |
| NULL      | PRESIDENT  | 5000         | 5000  | 5000  |
| 10        | PRESIDENT  | 5000         | 5000  | 5000  |
| NULL      | SALESMAN   | 1400         | 1600  | 1250  |
| NULL      | MANAGER    | 2758.333333  | 2975  | 2450  |
| 30        | SALESMAN   | 1400         | 1600  | 1250  |
| 10        | MANAGER    | 2450         | 2450  | 2450  |
| 20        | MANAGER    | 2975         | 2975  | 2975  |
| 30        | MANAGER    | 2850         | 2850  | 2850  |
| NULL      | NULL       | 1966.071429  | 5000  | 800   |
| NULL      | CLERK      | 1037.5       | 1300  | 800   |
| 10        | NULL       | 2916.666667  | 5000  | 1300  |
| 20        | NULL       | 1875         | 3000  | 800   |
| 30        | NULL       | 1566.666667  | 2850  | 950   |
+-----------+------------+--------------+-------+-------+--+
18 rows selected (25.037 seconds)
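WITH CUBE aggregates over every combination of the grouped columns: (deptno, job), deptno only, job only, and the grand total (the NULL rows above). Assuming a Hive version that supports GROUPING SETS, the same result can be requested explicitly:

select e.deptno,e.job,avg(e.sal) avg,max(e.sal) max,min(e.sal) min from t_employee e group by e.deptno,e.job grouping sets ((e.deptno,e.job), e.deptno, e.job, ());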
Row-to-column (pivot)
Sample data (id, course, score):
1,语文,100
1,数学,100
1,英语,100
2,数学,79
2,语文,80
2,英语,100
CREATE TABLE t_student(
    id INT,
    course STRING,
    score double)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;
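Assuming the six sample lines above are saved in a local file such as /root/student.txt (an illustrative path), they can be loaded with:

load data local inpath '/root/student.txt' overwrite into table t_student;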
0: jdbc:hive2://CentOS:10000> select * from t_student;
+---------------+-------------------+------------------+--+
| t_student.id  | t_student.course  | t_student.score  |
+---------------+-------------------+------------------+--+
| 1             | 语文                | 100.0            |
| 1             | 数学                | 100.0            |
| 1             | 英语                | 100.0            |
| 2             | 数学                | 79.0             |
| 2             | 语文                | 80.0             |
| 2             | 英语                | 100.0            |
+---------------+-------------------+------------------+--+
6 rows selected (0.05 seconds)
0: jdbc:hive2://CentOS:10000> select id,max(case course when '语文' then score else 0 end) as chinese,max(case course when '数学' then score else 0 end ) as math,max(case course when '英语' then score else 0 end ) as english from t_student group by id ;

+-----+----------+--------+----------+--+
| id  | chinese  |  math  | english  |
+-----+----------+--------+----------+--+
| 1   | 100.0    | 100.0  | 100.0    |
| 2   | 80.0     | 79.0   | 100.0    |
+-----+----------+--------+----------+--+
2 rows selected (25.617 seconds)

The reverse direction (column-to-row: collapse each student's course rows into one concatenated column) can be written with collect_set and concat_ws:

SELECT id,concat_ws(',', collect_set(concat(course, ':', score))) 成绩 FROM t_student GROUP BY id;

UDF&UDTF

A UDF is essentially just an ordinary function: after Hive translates the SQL into a MapReduce program, the Java method is invoked during execution, much like plugging an extension into the MapReduce job. Besides its built-in functions, Hive exposes the following extension interfaces:
UDF: operates on a single row and returns a single value; extend UDF.
UDTF: takes one row and produces multiple rows or columns; extend GenericUDTF.
UDAF: aggregates multiple rows into a single row, mainly for statistics; extend AbstractGenericUDAFResolver.
Since Hive already ships with powerful aggregate functions, only the following two examples are given here, one UDF and one UDTF:

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.9.2</version>
</dependency>

UDF

package com.jiangzz;

import org.apache.hadoop.hive.ql.exec.UDF;

public class DeptUDF extends UDF {
    // called once per input row; appends the suffix "部门" to the incoming value
    public String evaluate(String value){
        return value+"部门";
    }
}

The parameter list and return type of the evaluate method are defined by the user according to the needs of the function.
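Once the class is packaged into a jar, the function can be registered and called from the beeline session. A minimal sketch; the jar path and the function name dept_udf are placeholders:

add jar /root/hive-udf.jar;
create temporary function dept_udf as 'com.jiangzz.DeptUDF';
select dept_udf(deptno) from t_dept;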

UDTF

package com.jiangzz;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UserUDTF extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        List<ObjectInspector> objectInspectors=new ArrayList<ObjectInspector>();

        objectInspectors.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);
        objectInspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        objectInspectors.add(PrimitiveObjectInspectorFactory.javaBooleanObjectInspector);

        // declare the names and object inspectors of the output columns (id, name, sex)
        return ObjectInspectorFactory.getStandardStructObjectInspector(Arrays.asList("id","name","sex"),objectInspectors);
    }

    public void process(Object[] args) throws HiveException {
        // args[0] is the input string, args[1] the separator used to split it
        String value = args[0].toString();
        String sep = args[1].toString();
        String[] tokens = value.split(sep);
        // emit one output row matching the three columns declared in initialize()
        forward(Arrays.asList(Integer.parseInt(tokens[0]), tokens[1], Boolean.valueOf(tokens[2])));
    }

    public void close() throws HiveException {}
}
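A UDTF is registered in the same way and is typically invoked either on its own in the SELECT list or through LATERAL VIEW. A sketch only; the jar path, the function name parse_user, and the table t_user_line with a single string column line are placeholders:

add jar /root/hive-udf.jar;
create temporary function parse_user as 'com.jiangzz.UserUDTF';
select parse_user(line, ',') from t_user_line;
select t.id, t.name, t.sex from t_user_line lateral view parse_user(line, ',') t as id, name, sex;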
