Big Data Component Notes -- Hive

L小Ray想有腮

Article link: https://blog.csdn.net/weixin_42480750/article/details/114377874

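Contents

1. Basic Concepts
- 1.1 Hive Compared with Databases
- 1.2 Installing Hive
- 1.3 Starting Hive
- 1.4 Using Hive
  - 1.4.1 shell beeline
  - 1.4.2 DBeaver
2. Data Types
- 2.1 Primitive Data Types
- 2.2 Collection Data Types
- 2.3 Type Conversion
3. DDL: Data Definition
- 3.1 Databases
- 3.2 Managed Tables
- 3.3 External Tables
- 3.4 Partitioned Tables
- 3.5 Bucketed Tables
4. DML: Data Manipulation
- 4.1 Importing Data
- 4.2 Exporting Data
- 4.3 Deleting Data
5. Queries
- 5.1 Basic Queries
- 5.2 Multi-table Queries
- 5.3 Sorting Query Results
- 5.4 Bucket Sampling Queries
6. Functions
- 5.1 Common Query Functions
- 5.2 Built-in Functions
- 5.3 Date Functions
- 5.4 User-defined Functions
7. Compression and Storage
- 7.1 Compression
- 7.2 File Storage Formats
- 7.3 Comparing the Mainstream File Storage Formats
- 7.4 Combining Storage and Compression
8. Enterprise-level Tuning
- 8.1 Fetch Tasks
- 8.2 Local Mode
- 8.3 Table Optimization
- 8.4 Choosing the Number of Map and Reduce Tasks
- 8.5 Parallel Execution
- 8.6 Strict Mode
- 8.7 JVM Reuse
- 8.8 Speculative Execution
- 8.8 Execution Plans
9. Hive in Practice: Gulivideo (谷粒影音)
- 9.1 Preparation
- 9.2 Business Analysis
10. Permission Management
- 10.1 Access Control
- 10.2 Role Management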

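1. Basic Concepts

What is Hive

Hive: a data statistics tool open-sourced by Facebook for analyzing massive volumes of structured logs.
Hive is a data warehouse tool built on Hadoop. It maps structured data files to tables and provides SQL-like querying.
In essence, it translates HQL into MapReduce programs.

The data Hive processes is stored in HDFS.
The analysis is implemented underneath with MapReduce.
The resulting programs run on YARN.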

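Advantages

The interface uses SQL-like syntax, which enables rapid development (simple and easy to pick up).
There is no need to write MapReduce by hand, which lowers the learning cost for developers.
Hive's execution latency is fairly high, so it is typically used for data analysis where real-time response is not required.
Hive's strength is processing large data sets; it has no advantage on small ones because of that high execution latency.
Hive supports user-defined functions, so users can implement functions that fit their own needs.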

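Disadvantages

HQL has limited expressive power
1.1 Iterative algorithms cannot be expressed
1.2 Hive is weak at data mining: the MapReduce processing flow prevents more efficient algorithms from being implemented.
Hive is relatively inefficient
2.1 The MapReduce jobs that Hive generates automatically are usually not very smart
2.2 Hive is hard to tune, and the tuning is coarse-grained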

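Architecture

User interface (Client): CLI (command-line interface), JDBC/ODBC (access Hive through JDBC), and WebUI (access Hive from a browser)
Metadata (Metastore): the metadata includes table names, the database each table belongs to (default is "default"), table owners, column and partition fields, table types (whether the table is external), and the directory holding each table's data. By default it is stored in the embedded Derby database; MySQL is recommended for the Metastore instead.
Hadoop: HDFS is used for storage and MapReduce for computation.
Driver
4.1 Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST), usually with a third-party library such as ANTLR, and then analyzes the AST, for example checking whether tables and columns exist and whether the SQL is semantically valid.
4.2 Compiler (Physical Plan): compiles the AST into a logical execution plan.
4.3 Optimizer (Query Optimizer): optimizes the logical execution plan.
4.4 Executor (Execution): converts the logical execution plan into a runnable physical plan. For Hive, that means MR/Spark.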

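How it runs

Hive receives the user's instructions (SQL) through a set of interactive interfaces.
Its Driver, together with the metadata (MetaStore), translates those instructions into MapReduce jobs and submits them to Hadoop for execution.
Finally, the results are returned to the user interface.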

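1.1 Hive Compared with Databases

Because Hive uses HQL (Hive Query Language), a query language similar to SQL, it is easy to mistake Hive for a database. Structurally, however, Hive and a database have nothing in common apart from the similar query language. A database can serve online applications, whereas Hive is designed for data warehousing.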

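Query language

Because SQL is widely used in data warehousing, HQL was designed as a SQL-like query language tailored to Hive's characteristics.
Developers who know SQL can start using Hive easily.

Data updates

Hive is designed for data warehouse workloads, which read far more than they write.
Hive therefore discourages rewriting data; all data is fixed at load time.
Data in a database, by contrast, is modified frequently, so INSERT INTO … VALUES is used to add rows and UPDATE … SET to modify them.

Execution latency

When Hive queries data it has no index to use, so it has to scan the whole table, which makes latency high.
The MapReduce framework is the other cause: MapReduce itself has high latency, so Hive queries executed through it do too. Databases, in comparison, have low execution latency.
That holds only while the data is small, though. Once the data outgrows what a database can handle, Hive's parallel computation shows its advantage.

Data scale

Because Hive runs on a cluster and can use MapReduce for parallel computation, it supports very large data sets.
A database, correspondingly, supports smaller ones.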

1.2 Installing Hive

For MySQL installation, see: https://blog.csdn.net/weixin_42480750/article/details/108678732

Install Hive

[omm@bigdata01 soft]$ pwd
/opt/soft
[omm@bigdata01 soft]$ ll *hive*
-rw-rw-r-- 1 omm omm 278813748 Mar  4 21:20 apache-hive-3.1.2-bin.tar.gz
[omm@bigdata01 soft]$ tar -zxf apache-hive-3.1.2-bin.tar.gz -C /opt/module/
[omm@bigdata01 soft]$ ln -s /opt/module/apache-hive-3.1.2-bin /opt/module/hive
[omm@bigdata01 soft]$ cd /opt/module/
[omm@bigdata01 module]$ cp /opt/module/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar /opt/module/hive/lib/ # resolve the guava JAR version conflict
[omm@bigdata01 module]$ rm /opt/module/hive/lib/guava-19.0.jar
[omm@bigdata01 module]$ mv /opt/module/hive/lib/log4j-slf4j-impl-2.10.0.jar{,.bak} # avoid the logging JAR conflict
[omm@bigdata01 module]$

Configure environment variables

[omm@bigdata01 module]$ vi /etc/profile
[omm@bigdata01 module]$ sudo vi /etc/profile
[omm@bigdata01 module]$ tail -3 /etc/profile
# Hive
export HIVE_HOME=/opt/module/hive
export PATH=$PATH:$HIVE_HOME/bin
[omm@bigdata01 module]$ source /etc/profile
[omm@bigdata01 module]$

Point the Hive metastore at MySQL

[omm@bigdata01 ~]$ cp /opt/soft/mysql-connector-java-5.1.48.jar $HIVE_HOME/lib # copy the JDBC driver
[omm@bigdata01 ~]$ echo $HIVE_HOME
/opt/module/hive
[omm@bigdata01 ~]$ vi $HIVE_HOME/conf/hive-site.xml # configure the Metastore to use MySQL
[omm@bigdata01 ~]$ cat /opt/module/hive/conf/hive-site.xml 
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://bigdata01:3306/metastore?useSSL=false</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>abcd1234..</value>
    </property>

    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>

    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>

    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://bigdata01:9083</value>
    </property>

    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>

    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>bigdata01</value>
    </property>

    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>

</configuration>
[omm@bigdata01 ~]$
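
Before starting the services it can help to confirm that the file really contains the properties you expect. The following is a minimal sketch, not part of the original setup: it parses a trimmed copy of the hive-site.xml above with Python's standard library. The helper names `load_properties` and `missing_keys` are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the hive-site.xml shown above (only three properties kept).
HIVE_SITE = """<?xml version="1.0"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://bigdata01:3306/metastore?useSSL=false</value>
    </property>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://bigdata01:9083</value>
    </property>
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
</configuration>
"""


def load_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def missing_keys(props, required):
    """List the required property names that the config does not define."""
    return [key for key in required if key not in props]


if __name__ == "__main__":
    props = load_properties(HIVE_SITE)
    print(props["hive.metastore.uris"])
    print(missing_keys(props, ["hive.metastore.uris", "hive.execution.engine"]))
```

In this trimmed copy `hive.execution.engine` is reported as missing, which matches the order of the steps here: that property is only added later, when Tez is configured.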

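Install the Tez engine

Tez is an execution engine for Hive that outperforms MR. The figure below shows why.

[Figure: MR jobs versus a Tez job]

When Hive runs directly on MR, suppose there are four MR jobs that depend on one another. In the figure, green marks the Reduce Tasks and the cloud shapes mark write barriers, where intermediate results have to be persisted to HDFS.

Tez can merge several dependent jobs into one, so HDFS is written only once and there are fewer intermediate nodes, which greatly improves job performance.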

[omm@bigdata01 ~]$ # install Tez
[omm@bigdata01 ~]$ tar -zxf /opt/soft/apache-tez-0.9.2-bin.tar.gz -C /opt/module/
[omm@bigdata01 ~]$ ln -s /opt/module/apache-tez-0.9.2-bin /opt/module/tez
[omm@bigdata01 ~]$ # upload the Tez dependencies to HDFS
[omm@bigdata01 ~]$ hadoop fs -rm -r hdfs://bigdata01:8020/* # WARNING: deletes everything in HDFS; skip this on a cluster that holds data
[omm@bigdata01 ~]$ hadoop fs -mkdir /tez
[omm@bigdata01 ~]$ cd /opt/module/tez/share/
[omm@bigdata01 share]$ ll
total 45172
-rw-r--r-- 1 omm omm 46254263 Mar 20  2019 tez.tar.gz
[omm@bigdata01 share]$ hadoop fs -put tez.tar.gz /tez
[omm@bigdata01 ~]$ # create tez-site.xml
[omm@bigdata01 ~]$ cd /opt/module/hadoop/etc/hadoop/
[omm@bigdata01 hadoop]$ vi tez-site.xml
[omm@bigdata01 hadoop]$ cat tez-site.xml 
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>${fs.defaultFS}/tez/tez.tar.gz</value>
  </property>
  <property>
    <name>tez.use.cluster.hadoop-libs</name>
    <value>true</value>
  </property>
  <property>
    <name>tez.history.logging.service.class</name>
    <value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
  </property>
</configuration>
[omm@bigdata01 hadoop]$ # edit the Hadoop environment variables
[omm@bigdata01 hadoop]$ pwd
/opt/module/hadoop/etc/hadoop
[omm@bigdata01 hadoop]$ vi hadoop-env.sh 
[omm@bigdata01 hadoop]$ tail -3 hadoop-env.sh 
export TEZ_CONF_DIR=$HADOOP_HOME/etc/hadoop
export TEZ_JARS=/opt/module/tez
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${TEZ_CONF_DIR}/*:${TEZ_JARS}/*:${TEZ_JARS}/lib/*
[omm@bigdata01 ~]$ # switch Hive's execution engine
[omm@bigdata01 ~]$ vi /opt/module/hive/conf/hive-site.xml 
[omm@bigdata01 ~]$ tail -5 $HIVE_HOME/conf/hive-site.xml
  <property>
    <name>hive.execution.engine</name>
    <value>tez</value>
  </property>
</configuration>
[omm@bigdata01 ~]$ # resolve the logging JAR conflict
[omm@bigdata01 ~]$ rm /opt/module/tez/lib/slf4j-log4j12-1.7.10.jar

1.3 Starting Hive

[omm@bigdata01 ~]$ mysql -uroot -pabcd1234..

mysql> create database metastore;
Query OK, 1 row affected (0.00 sec)

mysql> quit
Bye
[omm@bigdata01 ~]$

[omm@bigdata01 ~]$ schematool -initSchema -dbType mysql -verbose
...
beeline> 
beeline> Initialization script completed
schemaTool completed
[omm@bigdata01 ~]$

[omm@bigdata01 hadoop]$ cd $HIVE_HOME/conf
[omm@bigdata01 conf]$ cp hive-log4j2.properties.template hive-log4j2.properties
[omm@bigdata01 conf]$ vi hive-log4j2.properties
[omm@bigdata01 conf]$ grep logs hive-log4j2.properties
property.hive.log.dir = /opt/module/hive/logs
[omm@bigdata01 conf]$

[omm@bigdata01 hadoop]$ vim $HIVE_HOME/bin/hiveservices.sh
[omm@bigdata01 hadoop]$ cat $HIVE_HOME/bin/hiveservices.sh
#!/bin/bash
HIVE_LOG_DIR=$HIVE_HOME/logs
META_PID=/tmp/meta.pid
SERVER_PID=/tmp/server.pid

if [ ! -d $HIVE_LOG_DIR ]; then mkdir -p $HIVE_LOG_DIR; fi

function hive_start()
{
    nohup hive --service metastore >$HIVE_LOG_DIR/metastore.log 2>&1 &
    echo $! > $META_PID
    sleep 8
    nohup hive --service hiveserver2 >$HIVE_LOG_DIR/hiveserver2.log 2>&1 &
    echo $! > $SERVER_PID
}

function hive_stop()
{
    if [ -f $META_PID ]
    then
        cat $META_PID | xargs kill -9
        rm $META_PID
    else
        echo "Meta PID file is missing; please stop the service manually"
    fi
    if [ -f $SERVER_PID ]
    then
        cat $SERVER_PID | xargs kill -9
        rm $SERVER_PID
    else
        echo "Server2 PID file is missing; please stop the service manually"
    fi

}

case $1 in
"start")
    hive_start
    ;;
"stop")
    hive_stop
    ;;
"restart")
    hive_stop
    sleep 2
    hive_start
    ;;
*)
    echo Invalid Args!
    echo 'Usage: '$(basename $0)' start|stop|restart'
    ;;
esac
[omm@bigdata01 hadoop]$ chmod +x $HIVE_HOME/bin/hiveservices.sh

[omm@bigdata01 ~]$ hiveservices.sh start
[omm@bigdata01 ~]$ jps
2864 NodeManager
3236 RunJar
2249 NameNode
3098 RunJar
2364 DataNode
3342 Jps
[omm@bigdata01 ~]$ ss -tlunp | grep -E 3236\|3098
tcp    LISTEN     0      50     [::]:10000              [::]:*                   users:(("java",pid=3236,fd=534))
tcp    LISTEN     0      50     [::]:10002              [::]:*                   users:(("java",pid=3236,fd=535))
tcp    LISTEN     0      50     [::]:9083               [::]:*                   users:(("java",pid=3098,fd=544))
[omm@bigdata01 ~]$
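
The same check can be scripted. Below is a small sketch, offered as an assumption rather than as part of the original setup, that tests whether the metastore (9083) and HiveServer2 (10000) ports accept TCP connections. The function names are hypothetical, and HiveServer2 can need some time after startup before its port opens.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolved host names.
        return False


def check_services(host, ports):
    """Map each service name to True/False depending on whether its port answers."""
    return {name: is_listening(host, port) for name, port in ports.items()}
```

On this article's host it could be called as `check_services("bigdata01", {"metastore": 9083, "hiveserver2": 10000})`.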

1.4 Using Hive

1.4.1 shell beeline

Create a table

[omm@bigdata01 ~]$ beeline -u jdbc:hive2://bigdata01:10000 -n omm
0: jdbc:hive2://bigdata01:10000> show databases;
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
1 row selected (0.138 seconds)
0: jdbc:hive2://bigdata01:10000> show tables;
+-----------+
| tab_name  |
+-----------+
+-----------+
No rows selected (0.038 seconds)
0: jdbc:hive2://bigdata01:10000> create table student(id int, value string);
0: jdbc:hive2://bigdata01:10000> show tables;
+-----------+
| tab_name  |
+-----------+
| student   |
+-----------+
1 row selected (0.044 seconds)
0: jdbc:hive2://bigdata01:10000>

Insert data

0: jdbc:hive2://bigdata01:10000> insert into table student values(1001,"rayslee");

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 6.11 s     
----------------------------------------------------------------------------------------------

0: jdbc:hive2://bigdata01:10000> select id,count(*) from student group by id;

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 3.04 s     
----------------------------------------------------------------------------------------------

+-------+------+
|  id   | _c1  |
+-------+------+
| 1001  | 1    |
+-------+------+
1 row selected (3.93 seconds)
0: jdbc:hive2://bigdata01:10000>

Describe a table

0: jdbc:hive2://bigdata01:10000> desc student;

+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| id        | int        |          |
| value     | string     |          |
+-----------+------------+----------+
2 rows selected (0.06 seconds)

0: jdbc:hive2://bigdata01:10000> desc formatted student;

+-------------------------------+----------------------------------------------------+----------------------------------------------------+
|           col_name            |                     data_type                      |                      comment                       |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name                    | data_type                                          | comment                                            |
| id                            | int                                                |                                                    |
| value                         | string                                             |                                                    |
|                               | NULL                                               | NULL                                               |
| # Detailed Table Information  | NULL                                               | NULL                                               |
| Database:                     | default                                            | NULL                                               |
| OwnerType:                    | USER                                               | NULL                                               |
| Owner:                        | omm                                                | NULL                                               |
| CreateTime:                   | Fri Mar 05 10:37:30 CST 2021                       | NULL                                               |
| LastAccessTime:               | UNKNOWN                                            | NULL                                               |
| Retention:                    | 0                                                  | NULL                                               |
| Location:                     | hdfs://bigdata01:8020/user/hive/warehouse/student  | NULL                                               |
| Table Type:                   | MANAGED_TABLE                                      | NULL                                               |
| Table Parameters:             | NULL                                               | NULL                                               |
|                               | COLUMN_STATS_ACCURATE                              | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"id\":\"true\",\"value\":\"true\"}} |
|                               | bucketing_version                                  | 2                                                  |
|                               | numFiles                                           | 1                                                  |
|                               | numRows                                            | 1                                                  |
|                               | rawDataSize                                        | 12                                                 |
|                               | totalSize                                          | 13                                                 |
|                               | transient_lastDdlTime                              | 1614912691                                         |
|                               | NULL                                               | NULL                                               |
| # Storage Information         | NULL                                               | NULL                                               |
| SerDe Library:                | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL                                               |
| InputFormat:                  | org.apache.hadoop.mapred.TextInputFormat           | NULL                                               |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL                                               |
| Compressed:                   | No                                                 | NULL                                               |
| Num Buckets:                  | -1                                                 | NULL                                               |
| Bucket Columns:               | []                                                 | NULL                                               |
| Sort Columns:                 | []                                                 | NULL                                               |
| Storage Desc Params:          | NULL                                               | NULL                                               |
|                               | serialization.format                               | 1                                                  |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
32 rows selected (0.085 seconds)

0: jdbc:hive2://bigdata01:10000>

1.4.2 DBeaver

DBeaver download: https://dbeaver.io/download/

Hive JDBC driver: extract apache-hive-3.1.2-bin.tar.gz and locate apache-hive-3.1.2-bin\jdbc\hive-jdbc-3.1.2-standalone.jar

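2. Data Types

2.1 Primitive Data Types

Hive's STRING type corresponds to a database VARCHAR: it is a variable-length string, but no maximum length can be declared for it. In theory it can hold up to 2 GB of characters.

Hive type	Java type	Size	Example
TINYINT	byte	1-byte signed integer	20
SMALLINT	short	2-byte signed integer	20
INT	int	4-byte signed integer	20
BIGINT	long	8-byte signed integer	20
BOOLEAN	boolean	Boolean, true or false	TRUE FALSE
FLOAT	float	Single-precision floating point	3.14159
DOUBLE	double	Double-precision floating point	3.14159
STRING	string	Character sequence. A character set can be specified. Single or double quotes may be used.	'now is the time' "for all good men"
TIMESTAMP		Time type
BINARY		Byte array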

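2.2 Collection Data Types

Type overview

Hive has three complex data types: ARRAY, MAP, and STRUCT. ARRAY and MAP resemble Array and Map in Java, while STRUCT resembles a struct in C: it wraps a set of named fields. Complex types can be nested to any depth.

Data type	Description	Syntax example
STRUCT	Like a struct in C; elements are accessed with dot notation. For example, if a column has type STRUCT<first:STRING, last:STRING>, the first element is referenced as column.first.	struct(), for example struct<street:string, city:string>
MAP	A MAP is a collection of key-value pairs whose entries are accessed with array notation. For example, if a column has type MAP and holds the pairs 'first'->'John' and 'last'->'Doe', the value for 'last' is retrieved with column['last'].	map(), for example map<string, int>
ARRAY	An array is a collection of variables of the same type and name. These variables are the array's elements, and each has an index starting from zero. For example, if the array value is ['John', 'Doe'], the second element is referenced as array_name[1].	array(), for example array<string>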

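Hands-on example

Suppose a table holds the row below, written here in JSON form to show its structure. This is how the data looks when accessed in Hive:

{
    "name": "songsong",
    "friends": ["bingbing" , "lili"] ,       // Array (list)
    "children": {                      // Map (key-value pairs)
        "xiao song": 18 ,
        "xiaoxiao song": 19
    },
    "address": {                      // Struct
        "street": "hui long guan" ,
        "city": "beijing"
    }
}

Based on this structure, we create the corresponding table in Hive and load data into it.
Create a local test file, test.txt.

Note: the separator between elements of a MAP, STRUCT, or ARRAY can be the same character for all three; "_" is used here.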

songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing
yangyang,caicai_susu,xiao yang:18_xiaoxiao yang:19,chao yang_beijing

Create the test table test in Hive

create table test(
  name string,
  friends array<string>,
  children map<string, int>,
  address struct<street:string, city:string>
)
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':';

Clause explanation:

row format delimited fields terminated by ','  -- column separator
collection items terminated by '_'  	-- separator for MAP, STRUCT, and ARRAY elements
map keys terminated by ':'				-- separator between key and value inside a MAP
lines terminated by '\n';					-- line separator
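
To make the roles of the three delimiters concrete, here is a rough sketch that splits one line of test.txt by hand in the same order Hive applies them: fields first, then collection items, then map keys. The function `parse_row` is only an illustration of the delimiter rules, not how Hive's SerDe is implemented.

```python
FIELD_SEP = ","   # fields terminated by ','
ITEM_SEP = "_"    # collection items terminated by '_'
KEY_SEP = ":"     # map keys terminated by ':'


def parse_row(line):
    """Split one line of test.txt into name, friends, children and address."""
    name, friends, children, address = line.rstrip("\n").split(FIELD_SEP)
    street, city = address.split(ITEM_SEP)
    return {
        "name": name,
        # ARRAY: items separated by the collection delimiter
        "friends": friends.split(ITEM_SEP),
        # MAP: entries separated by the collection delimiter, key/value by ':'
        "children": {
            key: int(value)
            for key, value in (pair.split(KEY_SEP) for pair in children.split(ITEM_SEP))
        },
        # STRUCT: values separated by the collection delimiter, in declared order
        "address": {"street": street, "city": city},
    }


if __name__ == "__main__":
    row = parse_row("songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing")
    print(row["friends"][1], row["children"]["xiao song"], row["address"]["city"])
```

The three values printed correspond to the Hive expressions friends[1], children['xiao song'] and address.city.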

Load the text data into the test table

[omm@bigdata01 ~]$ cd /opt/module/
[omm@bigdata01 module]$ mkdir datas
[omm@bigdata01 module]$ cd datas/
[omm@bigdata01 datas]$ ll
total 383008
-rw-rw-r-- 1 omm omm 120734753 Mar  5 12:56 bigtable
-rw-rw-r-- 1 omm omm       266 Mar  5 12:56 business.txt
-rw-rw-r-- 1 omm omm       129 Mar  5 12:56 constellation.txt
-rw-rw-r-- 1 omm omm        71 Mar  5 12:56 dept.txt
-rw-rw-r-- 1 omm omm        78 Mar  5 12:56 emp_sex.txt
-rw-rw-r-- 1 omm omm       656 Mar  5 12:56 emp.txt
-rw-rw-r-- 1 omm omm        37 Mar  5 12:56 location.txt
-rw-rw-r-- 1 omm omm  19014993 Mar  5 12:56 log.data
-rw-rw-r-- 1 omm omm       136 Mar  5 12:56 movie.txt
-rw-rw-r-- 1 omm omm 118645854 Mar  5 12:56 nullid
-rw-rw-r-- 1 omm omm 121734744 Mar  5 12:56 ori
-rw-rw-r-- 1 omm omm       213 Mar  5 12:56 score.txt
-rw-rw-r-- 1 omm omm  12018355 Mar  5 12:56 smalltable
-rw-rw-r-- 1 omm omm       144 Mar  5 12:56 test.txt
[omm@bigdata01 datas]$

0: jdbc:hive2://bigdata01:10000> load data local inpath '/opt/module/datas/test.txt' into table test;

0: jdbc:hive2://bigdata01:10000> select * from test;

+------------+----------------------+--------------------------------------+----------------------------------------------+
| test.name  |     test.friends     |            test.children             |                 test.address                 |
+------------+----------------------+--------------------------------------+----------------------------------------------+
| songsong   | ["bingbing","lili"]  | {"xiao song":18,"xiaoxiao song":19}  | {"street":"hui long guan","city":"beijing"}  |
| yangyang   | ["caicai","susu"]    | {"xiao yang":18,"xiaoxiao yang":19}  | {"street":"chao yang","city":"beijing"}      |
+------------+----------------------+--------------------------------------+----------------------------------------------+
2 rows selected (0.148 seconds)

0: jdbc:hive2://bigdata01:10000>

Access data in the three collection columns; the query below shows the ARRAY, MAP, and STRUCT access syntax in turn

0: jdbc:hive2://bigdata01:10000> select friends[1],children['xiao song'],address.city from test where name="songsong";

+-------+------+----------+
|  _c0  | _c1  |   city   |
+-------+------+----------+
| lili  | 18   | beijing  |
+-------+------+----------+
1 row selected (0.204 seconds)

0: jdbc:hive2://bigdata01:10000>

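2.3 Type Conversion

Implicit conversion

Hive's primitive types can be converted implicitly, much like type conversion in Java. For example, if an expression expects INT, a TINYINT is converted to INT automatically.
Hive does not convert in the reverse direction: if an expression expects TINYINT, an INT is not narrowed automatically, and an error is returned unless CAST is used.

Implicit conversion rules

Any integer type can be converted implicitly to a wider type: TINYINT to INT, INT to BIGINT, and so on.
All integer types, FLOAT, and STRING can be converted implicitly to DOUBLE.
TINYINT, SMALLINT, and INT can all be converted to FLOAT.
BOOLEAN cannot be converted to any other type.

Explicit conversion

For example, CAST('1' AS INT) converts the string '1' to the integer 1. If the cast fails, as in CAST('X' AS INT), the expression returns NULL.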

0: jdbc:hive2://bigdata01:10000> select '1'+2, cast('1'as int) + 2;

+------+------+
| _c0  | _c1  |
+------+------+
| 3.0  | 3    |
+------+------+
1 row selected (0.171 seconds)

0: jdbc:hive2://bigdata01:10000>
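
The two columns in that result differ because of the rules above: in '1'+2 the string is promoted to DOUBLE, while the explicit cast keeps the arithmetic in integers. The sketch below imitates just these two behaviours in Python. It is a simplified model covering only the cases named in this section (None stands in for NULL), not a reimplementation of Hive's conversion logic.

```python
def cast_as_int(text):
    """Imitate CAST(text AS INT) for plain integer strings; None plays the role of NULL."""
    try:
        return int(text)
    except ValueError:
        return None


def plus(left, right):
    """Imitate '+' where a string operand is implicitly promoted to DOUBLE."""
    def promote(value):
        return float(value) if isinstance(value, str) else value
    return promote(left) + promote(right)


if __name__ == "__main__":
    print(plus("1", 2))               # string operand promoted to a float
    print(plus(cast_as_int("1"), 2))  # stays an integer
    print(cast_as_int("X"))           # failed cast
```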

3. DDL: Data Definition

3.1 Databases

Create

CREATE DATABASE [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];

By default a database is created at /user/hive/warehouse/*.db

0: jdbc:hive2://bigdata01:10000> create database db_hive2;

0: jdbc:hive2://bigdata01:10000> dfs -ls /user/hive/warehouse;
+----------------------------------------------------+
|                     DFS Output                     |
+----------------------------------------------------+
| Found 3 items                                      |
| drwxr-xr-x   - omm supergroup          0 2021-03-05 13:37 /user/hive/warehouse/db_hive2.db |
| drwxr-xr-x   - omm supergroup          0 2021-03-05 10:51 /user/hive/warehouse/student |
| drwxr-xr-x   - omm supergroup          0 2021-03-05 12:59 /user/hive/warehouse/test |
+----------------------------------------------------+
4 rows selected (0.065 seconds)

0: jdbc:hive2://bigdata01:10000>

An HDFS path can also be specified

0: jdbc:hive2://bigdata01:10000> create database db_hive location '/db_hive.db';

0: jdbc:hive2://bigdata01:10000> dfs -ls /;

+----------------------------------------------------+
|                     DFS Output                     |
+----------------------------------------------------+
| Found 4 items                                      |
| drwxr-xr-x   - omm supergroup          0 2021-03-05 13:40 /db_hive.db |
| drwxr-xr-x   - omm supergroup          0 2021-03-04 22:09 /tez |
| drwx-wx-wx   - omm supergroup          0 2021-03-05 10:46 /tmp |
| drwxr-xr-x   - omm supergroup          0 2021-03-05 10:37 /user |
+----------------------------------------------------+
5 rows selected (0.015 seconds)

0: jdbc:hive2://bigdata01:10000>

Query

Command	Description
use [db_name];	Switch database
show databases;	List all databases
show databases like 'db_*';	List databases matching a pattern
desc database [extended] db_hive;	Show database information

Alter

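The ALTER DATABASE command sets key-value pairs in a database's DBPROPERTIES to describe the database.
The database name cannot be changed. The directory location of existing data is effectively fixed as well: on Hive versions that accept ALTER DATABASE ... SET LOCATION, the statement changes only the default parent directory for tables created afterwards and does not move existing data.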

0: jdbc:hive2://bigdata01:10000> alter database db_hive set dbproperties('createtime'='20170830');

0: jdbc:hive2://bigdata01:10000> desc database extended db_hive;

+----------+----------+-----------------------------------+-------------+-------------+------------------------+
| db_name  | comment  |             location              | owner_name  | owner_type  |       parameters       |
+----------+----------+-----------------------------------+-------------+-------------+------------------------+
| db_hive  |          | hdfs://bigdata01:8020/db_hive.db  | omm         | USER        | {createtime=20170830}  |
+----------+----------+-----------------------------------+-------------+-------------+------------------------+

0: jdbc:hive2://bigdata01:10000>

Drop

Command	Description
drop database [db_name];	Drop an empty database
drop database [db_name] cascade;	Drop a non-empty database

3.2 Managed Tables

Table creation syntax

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name 
[(col_name data_type [COMMENT col_comment], ...)] 
[COMMENT table_comment] 
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] 
[CLUSTERED BY (col_name, col_name, ...) 
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] 
[ROW FORMAT row_format] 
[STORED AS file_format] 
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement];

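Clause reference

Clause	Description
CREATE TABLE	Creates a table with the given name. If a table with the same name already exists, an exception is thrown; the IF NOT EXISTS option suppresses it.
EXTERNAL	Creates an external table; a path to the actual data (LOCATION) can be given at creation time. Dropping a managed (internal) table deletes both its metadata and its data, whereas dropping an external table deletes only the metadata and leaves the data in place.
COMMENT	Adds a comment to a table or column.
PARTITIONED BY	Creates a partitioned table
CLUSTERED BY	Creates a bucketed table
SORTED BY	Rarely used; additionally sorts one or more columns within each bucket
ROW FORMAT	DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char] | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)] A custom SerDe or the built-in one can be chosen at table creation. If ROW FORMAT is omitted, or ROW FORMAT DELIMITED is given, the built-in SerDe is used. The table's columns must also be declared at creation time; Hive uses the SerDe to work out the data for each declared column. SerDe is short for Serialize/Deserialize; Hive uses it to serialize and deserialize row objects.
STORED AS	Sets the file format. Common formats: SEQUENCEFILE (binary sequence file), TEXTFILE (text), and RCFILE (columnar format). Use STORED AS TEXTFILE for plain text data, and STORED AS SEQUENCEFILE if the data needs compression.
LOCATION	Sets the table's storage location on HDFS.
AS	Followed by a query; creates the table from the query result.
LIKE	Copies the structure of an existing table without copying its data.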

Create a table

CREATE TABLE test2
(id int COMMENT "ID", name string COMMENT "Name")
COMMENT "Test Table"
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION "/test2"
TBLPROPERTIES("time"="2021-03-06 09:58:42");

Inspect a table

desc test2;
desc formatted test2;

Alter a table

-- rename
ALTER TABLE table_name RENAME TO new_table_name;
alter table test2 rename to new_test2;
-- change a column
ALTER TABLE table_name CHANGE col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name];
alter table new_test2 change id id string;
-- add or replace columns
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...);
alter table new_test2 add columns(new_field string);
alter table new_test2 replace columns(id string,name string);

Drop a table

drop table new_test2;

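3.3 External Tables

Definition

CREATE EXTERNAL TABLE table_name …

The EXTERNAL keyword creates an external table.
Dropping an external table does not delete its data; only the metadata describing the table is removed.

Conversion

-- managed to external
alter table student2 set tblproperties('EXTERNAL'='TRUE');
-- external to managed
alter table student2 set tblproperties('EXTERNAL'='FALSE');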

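3.4 Partitioned Tables

Hive cannot rely on indexes the way a typical database does. To cut the amount of data scanned, a table can be partitioned by directory or bucketed by the hash of a column.

Partitioning

A partition corresponds to a separate directory on HDFS that holds all the data files of that partition.
Partitioning in Hive simply means splitting by directory: a large data set is divided into smaller ones according to business needs.
A query selects the partitions it needs through expressions in the WHERE clause, which makes it much more efficient.

Bucketing

Bucketing picks one column of the table and distributes rows across bucket files by the hash of that column modulo the bucket count, which spreads them roughly evenly.
Bucketing changes how data is stored: rows with the same hash remainder land in the same bucket file, which can speed up queries.
When two tables bucketed on the same column are joined, only buckets holding the same column values need to be joined with each other.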
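
The bucket-wise join idea can be sketched in a few lines of Python. This is a toy model under stated assumptions: the hash of an integer key is taken to be the key itself, whereas real Hive applies its own hash function that depends on the column type and bucketing version, and both tables are assumed to use the same number of buckets. The names `bucket_of`, `to_buckets` and `bucket_join` are hypothetical.

```python
from collections import defaultdict

NUM_BUCKETS = 4


def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Simplification: treat the hash of an integer key as the integer itself.
    return key % num_buckets


def to_buckets(rows, num_buckets=NUM_BUCKETS):
    """Group (key, value) rows by bucket number, like writing them to bucket files."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    return buckets


def bucket_join(left_rows, right_rows, num_buckets=NUM_BUCKETS):
    """Join two bucketed tables by comparing only buckets with the same number."""
    left = to_buckets(left_rows, num_buckets)
    right = to_buckets(right_rows, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Equal keys always share a bucket number, so bucket b on the left
        # only ever needs to be compared with bucket b on the right.
        for lkey, lval in left[b]:
            for rkey, rval in right[b]:
                if lkey == rkey:
                    joined.append((lkey, lval, rval))
    return sorted(joined)


if __name__ == "__main__":
    students = [(1001, "a"), (1002, "b"), (1005, "c")]
    scores = [(1001, 90), (1005, 70), (1006, 60)]
    print(bucket_join(students, scores))
```

Each bucket pair is independent of the others, which is what lets an engine process them in parallel and skip comparing rows from unrelated buckets.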

Single-level partitioned table

-- create a partitioned table
create table stu_par
(id int, name string)
partitioned by (class string)
row format delimited fields terminated by '\t';

-- load data into the table
load data local inpath '/opt/module/datas/student.txt' into table stu_par partition (class='01');
load data local inpath '/opt/module/datas/student.txt' into table stu_par partition (class='02');
load data local inpath '/opt/module/datas/student.txt' into table stu_par partition (class='03');

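-- filtering on the partition column reduces the amount of data scanned
select * from stu_par where class="01";
-- without a partition filter, every partition is scanned
select * from stu_par where id=1001;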

-- list the table's partitions
show partitions stu_par;

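-- if the data was placed on HDFS in advance but the partition metadata is missing, it can be repaired as follows
-- 1. add the partition
alter table stu_par add partition(class="03");
-- 2. repair directly
msck repair table stu_par;
-- 3. specify the partition when loading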
load data local inpath '/opt/module/datas/student.txt' into table stu_par
partition (class='03');

Two-level partitioned table

-- create a two-level partitioned table
create table stu_par2
(id int, name string)
partitioned by (grade string, class string)
row format delimited fields terminated by '\t';

-- load data into a specific second-level partition
load data local inpath '/opt/module/datas/student.txt' into table stu_par2
partition (grade='01', class='03');
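
Since a partition is just a directory named column=value, the layout on HDFS can be predicted from the partition spec. The sketch below builds such paths and shows how a filter on a partition column narrows the set of directories to read. It assumes the tables live under the warehouse directory /user/hive/warehouse configured earlier; `partition_path` and `prune` are hypothetical helpers that only mimic the naming convention.

```python
def partition_path(table_dir, spec):
    """Build a partition directory; spec is an ordered list of (column, value) pairs."""
    return "/".join([table_dir.rstrip("/")] + [f"{col}={val}" for col, val in spec])


def prune(partition_dirs, column, value):
    """Keep only the directories that contain the path component column=value."""
    wanted = f"{column}={value}"
    return [d for d in partition_dirs if wanted in d.split("/")]


if __name__ == "__main__":
    # Two-level partition from the statement above.
    print(partition_path("/user/hive/warehouse/stu_par2", [("grade", "01"), ("class", "03")]))

    # Single-level table: a filter on class touches one directory out of three.
    dirs = [partition_path("/user/hive/warehouse/stu_par", [("class", c)]) for c in ("01", "02", "03")]
    print(prune(dirs, "class", "01"))
```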

Adding and dropping partitions

-- add a partition
alter table stu_par add partition(class="05");
-- add several partitions at once
alter table stu_par add partition(class="06") partition(class="07");
-- drop a partition
alter table stu_par drop partition(class="05");
-- drop several partitions at once
alter table stu_par drop partition(class="06"), partition(class="07");

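3.5 Bucketed Tables

Why bucket

Partitioning offers a convenient way to isolate data and optimize queries, but not every data set can be partitioned sensibly (an ID column, for example).
Within a table or a partition, Hive can organize data further into buckets, a finer-grained division of the data.
Bucketing is another technique for breaking a data set into more manageable parts.
Partitioning applies to the data's storage path; bucketing applies to the data files.
Hive hashes the value of the bucketing column and takes the remainder after dividing by the number of buckets to decide which bucket a record goes into.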

0: jdbc:hive2://bigdata01:10000> create table stu_buck(id int, name string)
. . . . . . . . . . . . . . . .> clustered by(id) into 4 buckets
. . . . . . . . . . . . . . . .> row format delimited fields terminated by '\t';
No rows affected (0.054 seconds)
0: jdbc:hive2://bigdata01:10000> set hive.enforce.bucketing=true;
No rows affected (0.004 seconds)
0: jdbc:hive2://bigdata01:10000> set mapreduce.job.reduces=-1;
No rows affected (0.003 seconds)
0: jdbc:hive2://bigdata01:10000> load data local inpath '/opt/module/datas/student.txt' into table stu_buck;


4. DML: Data Manipulation

4.1 Importing Data

Syntax

load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1,…)];

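load data: load data
local: load from the local file system into the Hive table; without it, load from HDFS
inpath: the path of the data to load
overwrite: overwrite the data already in the table; without it, append
into table: which table to load into
student: the target table
partition: load into the specified partition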

-- import from local disk or HDFS
load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1,…)];

-- example
load data local inpath '/opt/module/datas/student.txt' overwrite into table student;

-- first upload a copy of student.txt to hdfs://hadoop102:8020/xxx
-- loading from HDFS moves the file, while loading from local disk copies it
load data inpath '/xxx/student.txt' overwrite into table student;

-- import with INSERT
insert into table student select id, name from stu_par where class="01";

-- create the table with AS SELECT
create table stu_result as select * from stu_par where id=1001;

-- load through LOCATION at table creation
-- first upload a copy of student.txt to hdfs://hadoop102:8020/xxx
create external table student2
(id int, name string)
row format delimited fields terminated by '\t'
location '/xxx';
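
The difference between the two LOAD DATA variants (a local load copies the file, an HDFS load moves it) is easy to overlook, so here is a small simulation. It is only an analogy: temporary local directories stand in for both the local disk and HDFS, and the hypothetical `load_data` function mimics nothing more than the copy-versus-move behaviour.

```python
import os
import shutil
import tempfile


def load_data(src_file, table_dir, local):
    """Place src_file under table_dir: copy when local=True, move otherwise."""
    os.makedirs(table_dir, exist_ok=True)
    target = os.path.join(table_dir, os.path.basename(src_file))
    if local:
        shutil.copy(src_file, target)   # like LOAD DATA LOCAL INPATH: the source is kept
    else:
        shutil.move(src_file, target)   # like LOAD DATA INPATH: the source disappears
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "student.txt")
        with open(src, "w") as handle:
            handle.write("1001\tss1\n")

        load_data(src, os.path.join(root, "warehouse", "student"), local=True)
        print("source still present after local load:", os.path.exists(src))

        load_data(src, os.path.join(root, "warehouse", "student2"), local=False)
        print("source still present after HDFS-style load:", os.path.exists(src))
```

In practice this means a file that was loaded with a plain LOAD DATA INPATH will no longer be found at its original HDFS path afterwards.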

4.2 Exporting Data

-- export with INSERT
insert overwrite local directory '/opt/module/datas/export/student'
select * from student;

-- export with formatting
insert overwrite local directory '/opt/module/datas/export/student1'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from student;

# export from the bash command line
hive -e "select * from default.student;" > /opt/module/datas/export/test.txt

--整张表export到HDFS
export table student to '/export/student';

--从导出结果导入到Hive
import table student3 from '/export/student';

4.3 数据删除

--只删表数据，不删表本身
truncate table student;

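5. Queries

Notes

SQL is case-insensitive.
SQL can be written on one line or across multiple lines.
Keywords cannot be abbreviated or split across lines.
Each clause is generally written on its own line.
Use indentation to improve readability.

Preparation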

create table if not exists dept(
deptno int,
dname string,
loc int
)
row format delimited fields terminated by '\t';

create table if not exists emp(
empno int,
ename string,
job string,
mgr int,
hiredate string, 
sal double, 
comm double,
deptno int)
row format delimited fields terminated by '\t';

load data local inpath '/opt/module/datas/dept.txt' into table dept;
load data local inpath '/opt/module/datas/emp.txt' into table emp;

5.1 Basic Queries

Full-table query

0: jdbc:hive2://bigdata01:10000> select * from dept;
+--------------+-------------+-----------+
| dept.deptno  | dept.dname  | dept.loc  |
+--------------+-------------+-----------+
| 10           | ACCOUNTING  | 1700      |
| 20           | RESEARCH    | 1800      |
| 30           | SALES       | 1900      |
| 40           | OPERATIONS  | 1700      |
+--------------+-------------+-----------+
4 rows selected (1.034 seconds)
0: jdbc:hive2://bigdata01:10000> select * from emp;
+------------+------------+------------+----------+---------------+----------+-----------+-------------+
| emp.empno  | emp.ename  |  emp.job   | emp.mgr  | emp.hiredate  | emp.sal  | emp.comm  | emp.deptno  |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+
| 7369       | SMITH      | CLERK      | 7902     | 1980-12-17    | 800.0    | NULL      | 20          |
| 7499       | ALLEN      | SALESMAN   | 7698     | 1981-2-20     | 1600.0   | 300.0     | 30          |
| 7521       | WARD       | SALESMAN   | 7698     | 1981-2-22     | 1250.0   | 500.0     | 30          |
| 7566       | JONES      | MANAGER    | 7839     | 1981-4-2      | 2975.0   | NULL      | 20          |
| 7654       | MARTIN     | SALESMAN   | 7698     | 1981-9-28     | 1250.0   | 1400.0    | 30          |
| 7698       | BLAKE      | MANAGER    | 7839     | 1981-5-1      | 2850.0   | NULL      | 30          |
| 7782       | CLARK      | MANAGER    | 7839     | 1981-6-9      | 2450.0   | NULL      | 10          |
| 7788       | SCOTT      | ANALYST    | 7566     | 1987-4-19     | 3000.0   | NULL      | 20          |
| 7839       | KING       | PRESIDENT  | NULL     | 1981-11-17    | 5000.0   | NULL      | 10          |
| 7844       | TURNER     | SALESMAN   | 7698     | 1981-9-8      | 1500.0   | 0.0       | 30          |
| 7876       | ADAMS      | CLERK      | 7788     | 1987-5-23     | 1100.0   | NULL      | 20          |
| 7900       | JAMES      | CLERK      | 7698     | 1981-12-3     | 950.0    | NULL      | 30          |
| 7902       | FORD       | ANALYST    | 7566     | 1981-12-3     | 3000.0   | NULL      | 20          |
| 7934       | MILLER     | CLERK      | 7782     | 1982-1-23     | 1300.0   | NULL      | 10          |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+
14 rows selected (0.129 seconds)
0: jdbc:hive2://bigdata01:10000>

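Column aliases

1. Rename a column
2. Make calculations easier
3. The alias follows the column name directly; the keyword AS may also be placed between the column name and the alias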

0: jdbc:hive2://bigdata01:10000> select empno AS no,ename name from emp limit 3;
+-------+--------+
|  no   |  name  |
+-------+--------+
| 7369  | SMITH  |
| 7499  | ALLEN  |
| 7521  | WARD   |
+-------+--------+
3 rows selected (0.115 seconds)
0: jdbc:hive2://bigdata01:10000>

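Arithmetic operators

Operator	Description
A+B	A plus B
A-B	A minus B
A*B	A multiplied by B
A/B	A divided by B
A%B	Remainder of A divided by B
A&B	Bitwise AND of A and B
A|B	Bitwise OR of A and B
A^B	Bitwise XOR of A and B
~A	Bitwise NOT of A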

Common functions

# Total number of rows (count)
hive (default)> select count(*) cnt from emp;
# Maximum salary (max)
hive (default)> select max(sal) max_sal from emp;
# Minimum salary (min)
hive (default)> select min(sal) min_sal from emp;
# Sum of salaries (sum)
hive (default)> select sum(sal) sum_sal from emp; 
# Average salary (avg)
hive (default)> select avg(sal) avg_sal from emp;

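Comparison operators

Operator	Supported types	Description
A=B	Primitive types	TRUE if A equals B, otherwise FALSE
A<=>B	Primitive types	TRUE if both A and B are NULL, FALSE if only one of them is NULL; for non-NULL operands the result is the same as the equals (=) operator
A<>B, A!=B	Primitive types	NULL if A or B is NULL; TRUE if A is not equal to B, otherwise FALSE
A<B	Primitive types	NULL if A or B is NULL; TRUE if A is less than B, otherwise FALSE
A<=B	Primitive types	NULL if A or B is NULL; TRUE if A is less than or equal to B, otherwise FALSE
A>B	Primitive types	NULL if A or B is NULL; TRUE if A is greater than B, otherwise FALSE
A>=B	Primitive types	NULL if A or B is NULL; TRUE if A is greater than or equal to B, otherwise FALSE
A [NOT] BETWEEN B AND C	Primitive types	NULL if any of A, B or C is NULL. TRUE if A is greater than or equal to B and less than or equal to C, otherwise FALSE. The NOT keyword inverts the result.
A IS NULL	All types	TRUE if A is NULL, otherwise FALSE
A IS NOT NULL	All types	TRUE if A is not NULL, otherwise FALSE
IN(value1, value2)	All types	Matches rows whose value appears in the list
A [NOT] LIKE B	STRING	B is a simple SQL wildcard pattern. TRUE if A matches it, otherwise FALSE. Pattern examples: 'x%' means A must start with the letter 'x', '%x' means A must end with the letter 'x', and '%x%' means A contains the letter 'x' at the start, the end, or anywhere in between. The NOT keyword inverts the result.
A RLIKE B, A REGEXP B	STRING	B is a Java regular expression. TRUE if A matches it, otherwise FALSE. Matching uses the JDK regular expression implementation, so the pattern follows its rules. The pattern only needs to match some substring of A, not the whole string; for example, 'foobar' RLIKE 'foo' is TRUE.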

# Find all employees whose salary equals 5000
hive (default)> select * from emp where sal =5000;
# Find employees whose salary is between 500 and 1000
hive (default)> select * from emp where sal between 500 and 1000;
# Find all employees whose comm is NULL
hive (default)> select * from emp where comm is null;
# Find employees whose salary is 1500 or 5000
hive (default)> select * from emp where sal IN (1500, 5000);
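
The NULL rules in the comparison table can be modelled with a short Python sketch, using `None` for NULL. The function names `sql_eq` and `null_safe_eq` are made up for this illustration; they only mimic the behaviour described for `=` and `<=>` on values that may be NULL.

```python
def sql_eq(a, b):
    """Model of A = B with NULL operands: any NULL makes the result NULL."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a, b):
    """Model of A <=> B: never returns NULL."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b


# A WHERE clause keeps a row only when the predicate is TRUE,
# so "comm = NULL" keeps nothing while "comm IS NULL" keeps the NULL rows.
comm_values = [None, 300.0, 0.0]
kept_by_eq_null = [c for c in comm_values if sql_eq(c, None) is True]
kept_by_is_null = [c for c in comm_values if c is None]
print(kept_by_eq_null, kept_by_is_null)
```

This is why the example above uses `comm is null` rather than `comm = null`.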

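LIKE and RLIKE

Use the LIKE operator to select similar values.
The pattern may contain literal characters or digits: % stands for zero or more characters (any number of characters); _ stands for exactly one character.
RLIKE is a Hive extension of this feature that lets you specify the match condition with the more powerful Java regular expression language.

# Find employees whose salary starts with 2
hive (default)> select * from emp where sal LIKE '2%';
# Find employees whose salary has 2 as its second character
hive (default)> select * from emp where sal LIKE '_2%';
# Find employees whose salary contains a 2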
hive (default)> select * from emp where sal RLIKE '[2]';
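
The difference between the two operators can be sketched in Python. This is an approximation only: `like` and `rlike` are hypothetical helper names, escape characters in LIKE patterns are not handled, and Python's `re` module stands in for Java regular expressions, which differ in some details.

```python
import re


def like(value: str, pattern: str) -> bool:
    """LIKE: % matches any run of characters, _ matches exactly one,
    and the pattern must cover the whole value."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None


def rlike(value: str, regex: str) -> bool:
    """RLIKE: the regex only has to match somewhere inside the value."""
    return re.search(regex, value) is not None


print(like("2450.0", "2%"))     # salary starts with 2
print(like("1250.0", "_2%"))    # second character is 2
print(rlike("1250.0", "[2]"))   # contains a 2 anywhere
```

The salaries are written as strings here because the comparison is done on their text form.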

Logical operators

# Salary greater than 1000 and department 30
hive (default)> select * from emp where sal>1000 and deptno=30;
# Salary greater than 1000 or department 30
hive (default)> select * from emp where sal>1000 or deptno=30;
# Employees in any department other than 20 and 30
hive (default)> select * from emp where deptno not IN(30, 20);

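Grouping

GROUP BY is usually used together with aggregate functions: it groups the result by one or more columns and then runs the aggregation on each group.

# Average salary of each department in emp
hive (default)> select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;
# Highest salary for each job within each department in emp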
hive (default)> select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno, t.job;

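HAVING clause

Aggregate functions cannot be used after WHERE, but they can be used after HAVING.
HAVING is only used in GROUP BY grouping statements.

# Compute the average salary of each department
hive (default)> select deptno, avg(sal) from emp group by deptno;
# Find the departments whose average salary is greater than 2000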
hive (default)> select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;
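
As a rough Python model of the query above, grouping happens first, the aggregate is computed per group, and HAVING then filters the groups. The `(deptno, sal)` pairs are copied from the 14-row emp listing in the full-table query; the variable names are only for illustration.

```python
from collections import defaultdict

# (deptno, sal) pairs from the emp table shown in the full-table query
rows = [
    (20, 800.0), (30, 1600.0), (30, 1250.0), (20, 2975.0), (30, 1250.0),
    (30, 2850.0), (10, 2450.0), (20, 3000.0), (10, 5000.0), (30, 1500.0),
    (20, 1100.0), (30, 950.0), (20, 3000.0), (10, 1300.0),
]

# GROUP BY deptno
groups = defaultdict(list)
for deptno, sal in rows:
    groups[deptno].append(sal)

# avg(sal) per group
avg_sal = {deptno: sum(sals) / len(sals) for deptno, sals in groups.items()}

# HAVING avg_sal > 2000 filters groups, not individual rows
having = {deptno: avg for deptno, avg in avg_sal.items() if avg > 2000}
print(sorted(having))
```

A WHERE condition, by contrast, would be applied to the individual rows before `groups` is built.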

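5.2 Multi-table Queries

Inner join

Hive supports the usual SQL JOIN syntax. Older Hive versions support only equi-joins; non-equi conditions in the ON clause are supported starting with Hive 2.2.0.

# Join the employee and department tables on equal department numbers and query the employee number, employee name and department name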
hive (default)> select e.empno, e.ename, d.deptno, d.dname from emp e join dept d on e.deptno = d.deptno;

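Left outer join

All records from the table on the left of the JOIN operator that satisfy the WHERE clause are returned; where the right table has no match, its columns are NULL.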

0: jdbc:hive2://bigdata01:10000> select e.empno, e.ename, d.deptno from emp e left join dept d on e.deptno = d.deptno;
+----------+-----------+-----------+
| e.empno  |  e.ename  | d.deptno  |
+----------+-----------+-----------+
| 7335     | zhangsan  | NULL      |
| 7369     | SMITH     | 20        |
| 7499     | ALLEN     | 30        |
| 7521     | WARD      | 30        |
| 7566     | JONES     | 20        |
| 7654     | MARTIN    | 30        |
| 7698     | BLAKE     | 30        |
| 7782     | CLARK     | 10        |
| 7788     | SCOTT     | 20        |
| 7839     | KING      | 10        |
| 7844     | TURNER    | 30        |
| 7876     | ADAMS     | 20        |
| 7900     | JAMES     | 30        |
| 7902     | FORD      | 20        |
| 7934     | MILLER    | 10        |
+----------+-----------+-----------+
15 rows selected (16.069 seconds)
0: jdbc:hive2://bigdata01:10000>

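Right outer join

All records from the table on the right of the JOIN operator that satisfy the WHERE clause are returned; where the left table has no match, its columns are NULL.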

0: jdbc:hive2://bigdata01:10000> select e.empno, e.ename, d.deptno from emp e right join dept d on e.deptno = d.deptno;
+----------+----------+-----------+
| e.empno  | e.ename  | d.deptno  |
+----------+----------+-----------+
| 7782     | CLARK    | 10        |
| 7839     | KING     | 10        |
| 7934     | MILLER   | 10        |
| 7369     | SMITH    | 20        |
| 7566     | JONES    | 20        |
| 7788     | SCOTT    | 20        |
| 7876     | ADAMS    | 20        |
| 7902     | FORD     | 20        |
| 7499     | ALLEN    | 30        |
| 7521     | WARD     | 30        |
| 7654     | MARTIN   | 30        |
| 7698     | BLAKE    | 30        |
| 7844     | TURNER   | 30        |
| 7900     | JAMES    | 30        |
| NULL     | NULL     | 40        |
+----------+----------+-----------+
15 rows selected (16.161 seconds)
0: jdbc:hive2://bigdata01:10000>

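Full outer join

All records from both tables that satisfy the WHERE clause are returned. If either table has no matching value for the join column, NULL is used in its place.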

0: jdbc:hive2://bigdata01:10000> select e.empno, e.ename, d.deptno from emp e full join dept d on e.deptno = d.deptno;
+----------+-----------+-----------+
| e.empno  |  e.ename  | d.deptno  |
+----------+-----------+-----------+
| 7934     | MILLER    | 10        |
| 7839     | KING      | 10        |
| 7782     | CLARK     | 10        |
| 7876     | ADAMS     | 20        |
| 7369     | SMITH     | 20        |
| 7788     | SCOTT     | 20        |
| 7566     | JONES     | 20        |
| 7902     | FORD      | 20        |
| 7844     | TURNER    | 30        |
| 7698     | BLAKE     | 30        |
| 7654     | MARTIN    | 30        |
| 7521     | WARD      | 30        |
| 7499     | ALLEN     | 30        |
| 7900     | JAMES     | 30        |
| NULL     | NULL      | 40        |
| 7335     | zhangsan  | NULL      |
+----------+-----------+-----------+
16 rows selected (19.338 seconds)
0: jdbc:hive2://bigdata01:10000>
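
The three outer joins can be compared side by side with a small Python sketch. It uses a reduced sample (three employees and three departments picked from the tables above) and naive nested loops; the helper names are hypothetical, `None` stands for the NULL-padded side, and this says nothing about how Hive actually executes joins.

```python
emp = [(7335, "zhangsan", 50), (7369, "SMITH", 20), (7782, "CLARK", 10)]
dept = [(10, "ACCOUNTING"), (20, "RESEARCH"), (40, "OPERATIONS")]


def left_join(left, right, lkey, rkey):
    """Keep every left row; pad with None when no right row matches."""
    out = []
    for l in left:
        matches = [r for r in right if r[rkey] == l[lkey]]
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))
    return out


def right_join(left, right, lkey, rkey):
    """A right join is a left join with the two sides swapped."""
    return [(l, r) for (r, l) in left_join(right, left, rkey, lkey)]


def full_join(left, right, lkey, rkey):
    """Left join plus the right rows that never matched."""
    out = left_join(left, right, lkey, rkey)
    matched = {r for _, r in out if r is not None}
    out.extend((None, r) for r in right if r not in matched)
    return out


left_rows = left_join(emp, dept, 2, 0)
right_rows = right_join(emp, dept, 2, 0)
full_rows = full_join(emp, dept, 2, 0)
print(len(left_rows), len(right_rows), len(full_rows))
```

In this sample zhangsan (department 50) only survives on the left side and OPERATIONS (department 40) only on the right, which mirrors the NULL rows in the outputs above.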

Multi-table join

Preparation

0: jdbc:hive2://bigdata01:10000> create table if not exists location(
. . . . . . . . . . . . . . . .> loc int,
. . . . . . . . . . . . . . . .> loc_name string
. . . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . . .> row format delimited fields terminated by '\t';
No rows affected (0.121 seconds)
0: jdbc:hive2://bigdata01:10000> load data local inpath '/opt/module/datas/location.txt' into table location;
No rows affected (0.223 seconds)
0: jdbc:hive2://bigdata01:10000> select * from location;
+---------------+--------------------+
| location.loc  | location.loc_name  |
+---------------+--------------------+
| 1700          | Beijing            |
| 1800          | London             |
| 1900          | Tokyo              |
+---------------+--------------------+
3 rows selected (0.13 seconds)
0: jdbc:hive2://bigdata01:10000>

Execution

0: jdbc:hive2://bigdata01:10000> SELECT e.ename, d.dname, l.loc_name
. . . . . . . . . . . . . . . .> FROM   emp e 
. . . . . . . . . . . . . . . .> JOIN   dept d
. . . . . . . . . . . . . . . .> ON     d.deptno = e.deptno 
. . . . . . . . . . . . . . . .> JOIN   location l
. . . . . . . . . . . . . . . .> ON     d.loc = l.loc;
+----------+-------------+-------------+
| e.ename  |   d.dname   | l.loc_name  |
+----------+-------------+-------------+
| SMITH    | RESEARCH    | London      |
| ALLEN    | SALES       | Tokyo       |
| WARD     | SALES       | Tokyo       |
| JONES    | RESEARCH    | London      |
| MARTIN   | SALES       | Tokyo       |
| BLAKE    | SALES       | Tokyo       |
| CLARK    | ACCOUNTING  | Beijing     |
| SCOTT    | RESEARCH    | London      |
| KING     | ACCOUNTING  | Beijing     |
| TURNER   | SALES       | Tokyo       |
| ADAMS    | RESEARCH    | London      |
| JAMES    | SALES       | Tokyo       |
| FORD     | RESEARCH    | London      |
| MILLER   | ACCOUNTING  | Beijing     |
+----------+-------------+-------------+
14 rows selected (17.856 seconds)
0: jdbc:hive2://bigdata01:10000>

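In most cases Hive launches one MapReduce job for each pair of JOIN operands. In this example it first launches a MapReduce job to join table e with table d, then launches a second MapReduce job to join the output of the first job with table l.
When three or more tables are joined and every ON clause uses the same join key, only one MapReduce job is produced.

Cartesian product

A Cartesian product is produced under the following conditions:

The join condition is omitted
The join condition is invalid
Every row of every table is joined with every row of the other tables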

0: jdbc:hive2://bigdata01:10000> select empno, dname from emp, dept;
+--------+-------------+
| empno  |    dname    |
+--------+-------------+
| 7335   | ACCOUNTING  |
| 7335   | RESEARCH    |
| 7335   | SALES       |
| 7335   | OPERATIONS  |
| 7369   | ACCOUNTING  |
| 7369   | RESEARCH    |
| 7369   | SALES       |
| 7369   | OPERATIONS  |
+--------+-------------+
60 rows selected (16.239 seconds)
0: jdbc:hive2://bigdata01:10000>
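
The row count reported above follows directly from the sizes of the two tables. A minimal Python check, using the 15 employee numbers and 4 department names that appear in the earlier outputs:

```python
from itertools import product

empnos = [7335, 7369, 7499, 7521, 7566, 7654, 7698, 7782,
          7788, 7839, 7844, 7876, 7900, 7902, 7934]
dnames = ["ACCOUNTING", "RESEARCH", "SALES", "OPERATIONS"]

# Without a join condition every employee is paired with every department
pairs = list(product(empnos, dnames))
print(len(pairs))  # 15 * 4 = 60
```

The result size grows as the product of the table sizes, which is why an accidental Cartesian product on large tables is expensive.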

5.3 Sorting Query Results

Global sort

0: jdbc:hive2://bigdata01:10000> select * from emp order by sal;
+------------+------------+------------+----------+---------------+----------+-----------+-------------+
| emp.empno  | emp.ename  |  emp.job   | emp.mgr  | emp.hiredate  | emp.sal  | emp.comm  | emp.deptno  |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+
| 7335       | zhangsan   | NULL       | NULL     | NULL          | NULL     | NULL      | 50          |
| 7369       | SMITH      | CLERK      | 7902     | 1980-12-17    | 800.0    | NULL      | 20          |
| 7900       | JAMES      | CLERK      | 7698     | 1981-12-3     | 950.0    | NULL      | 30          |
| 7876       | ADAMS      | CLERK      | 7788     | 1987-5-23     | 1100.0   | NULL      | 20          |
| 7521       | WARD       | SALESMAN   | 7698     | 1981-2-22     | 1250.0   | 500.0     | 30          |
| 7654       | MARTIN     | SALESMAN   | 7698     | 1981-9-28     | 1250.0   | 1400.0    | 30          |
| 7934       | MILLER     | CLERK      | 7782     | 1982-1-23     | 1300.0   | NULL      | 10          |
| 7844       | TURNER     | SALESMAN   | 7698     | 1981-9-8      | 1500.0   | 0.0       | 30          |
| 7499       | ALLEN      | SALESMAN   | 7698     | 1981-2-20     | 1600.0   | 300.0     | 30          |
| 7782       | CLARK      | MANAGER    | 7839     | 1981-6-9      | 2450.0   | NULL      | 10          |
| 7698       | BLAKE      | MANAGER    | 7839     | 1981-5-1      | 2850.0   | NULL      | 30          |
| 7566       | JONES      | MANAGER    | 7839     | 1981-4-2      | 2975.0   | NULL      | 20          |
| 7788       | SCOTT      | ANALYST    | 7566     | 1987-4-19     | 3000.0   | NULL      | 20          |
| 7902       | FORD       | ANALYST    | 7566     | 1981-12-3     | 3000.0   | NULL      | 20          |
| 7839       | KING       | PRESIDENT  | NULL     | 1981-11-17    | 5000.0   | NULL      | 10          |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+
15 rows selected (17.044 seconds)
0: jdbc:hive2://bigdata01:10000>

Local sort

SORT BY produces one sorted file per reducer. Each reducer sorts its own data, so the overall result set is not globally sorted.


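Partitioned sort

In some cases we need to control which reducer a particular row goes to, usually for a subsequent aggregation. The DISTRIBUTE BY clause does this.
DISTRIBUTE BY is similar to a (custom) partitioner in MR: it partitions the data and is used together with SORT BY.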


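DISTRIBUTE BY partitions rows by taking the hash code of the distribution column modulo the number of reducers; rows with the same remainder go to the same partition.
Hive requires the DISTRIBUTE BY clause to be written before the SORT BY clause.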
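
A minimal Python sketch of DISTRIBUTE BY followed by SORT BY, assuming the hash of an integer key is the integer itself and using three hypothetical reducers. The function name and the sample rows (a few `(deptno, sal)` pairs from emp) are only for illustration.

```python
def distribute_and_sort(rows, dist_key, sort_key, num_reducers):
    """Route each row to a reducer by key modulo reducer count,
    then sort inside each reducer independently."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: hash(int) == int, so the modulo is applied directly
        reducers[row[dist_key] % num_reducers].append(row)
    return [sorted(part, key=lambda r: r[sort_key]) for part in reducers]


rows = [(20, 800.0), (30, 1600.0), (30, 1250.0),
        (20, 2975.0), (10, 2450.0), (10, 5000.0)]

parts = distribute_and_sort(rows, dist_key=0, sort_key=1, num_reducers=3)
for index, part in enumerate(parts):
    print(index, part)
```

Each reducer's output is sorted by salary, but concatenating the three outputs does not give a globally sorted result, which is the difference from ORDER BY.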

Cluster By

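When the DISTRIBUTE BY and SORT BY columns are the same, CLUSTER BY can be used instead.
CLUSTER BY combines the functions of DISTRIBUTE BY and SORT BY. However, the sort is always ascending; you cannot specify ASC or DESC.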

0: jdbc:hive2://bigdata01:10000> select * from emp cluster by deptno;
+------------+------------+------------+----------+---------------+----------+-----------+-------------+
| emp.empno  | emp.ename  |  emp.job   | emp.mgr  | emp.hiredate  | emp.sal  | emp.comm  | emp.deptno  |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+
| 7654       | MARTIN     | SALESMAN   | 7698     | 1981-9-28     | 1250.0   | 1400.0    | 30          |
| 7900       | JAMES      | CLERK      | 7698     | 1981-12-3     | 950.0    | NULL      | 30          |
| 7698       | BLAKE      | MANAGER    | 7839     | 1981-5-1      | 2850.0   | NULL      | 30          |
| 7521       | WARD       | SALESMAN   | 7698     | 1981-2-22     | 1250.0   | 500.0     | 30          |
| 7844       | TURNER     | SALESMAN   | 7698     | 1981-9-8      | 1500.0   | 0.0       | 30          |
| 7499       | ALLEN      | SALESMAN   | 7698     | 1981-2-20     | 1600.0   | 300.0     | 30          |
| 7934       | MILLER     | CLERK      | 7782     | 1982-1-23     | 1300.0   | NULL      | 10          |
| 7839       | KING       | PRESIDENT  | NULL     | 1981-11-17    | 5000.0   | NULL      | 10          |
| 7782       | CLARK      | MANAGER    | 7839     | 1981-6-9      | 2450.0   | NULL      | 10          |
| 7788       | SCOTT      | ANALYST    | 7566     | 1987-4-19     | 3000.0   | NULL      | 20          |
| 7566       | JONES      | MANAGER    | 7839     | 1981-4-2      | 2975.0   | NULL      | 20          |
| 7876       | ADAMS      | CLERK      | 7788     | 1987-5-23     | 1100.0   | NULL      | 20          |
| 7902       | FORD       | ANALYST    | 7566     | 1981-12-3     | 3000.0   | NULL      | 20          |
| 7369       | SMITH      | CLERK      | 7902     | 1980-12-17    | 800.0    | NULL      | 20          |
| 7335       | zhangsan   | NULL       | NULL     | NULL          | NULL     | NULL      | 50          |
+------------+------------+------------+----------+---------------+----------+-----------+-------------+
15 rows selected (24.525 seconds)
0: jdbc:hive2://bigdata01:10000>

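5.4 Bucket Sampling Queries

For very large datasets, users sometimes need a representative sample of the result rather than the full result. Hive supports this by sampling the table.

# Statement: split the data into 4 parts by id and take one of them.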
0: jdbc:hive2://bigdata01:10000> select * from stu_buck tablesample(bucket 1 out of 4 on id);
+--------------+----------------+
| stu_buck.id  | stu_buck.name  |
+--------------+----------------+
| 1016         | ss16           |
| 1004         | ss4            |
| 1009         | ss9            |
| 1002         | ss2            |
| 1003         | ss3            |
+--------------+----------------+
5 rows selected (0.07 seconds)
0: jdbc:hive2://bigdata01:10000>
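
The idea behind `tablesample(bucket x out of y on id)` can be sketched as follows: rows are assigned to y groups by hashing the column, and the rows of group x are returned. This Python model assumes the hash of an integer is the integer itself and uses hypothetical ids 1001 to 1016; Hive's real hash function may differ, so the ids actually returned (as in the output above) need not match this sketch.

```python
def tablesample(ids, x, y):
    """Return the ids that fall into bucket x (1-based) out of y buckets.

    Assumption: hash(int) == int; the mask keeps the value non-negative.
    """
    return [i for i in ids if (i & 0x7FFFFFFF) % y == x - 1]


ids = list(range(1001, 1017))  # hypothetical ids 1001..1016

sample = tablesample(ids, x=1, y=4)
print(sample)
```

Whatever the hash function, roughly one row in y is selected, which is what makes the sample representative.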

6. Functions

6.1 Common Query Functions

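Assigning values to NULL fields

NVL: assigns a value to data that is NULL. Its format is NVL(value, default_value).
If value is NULL, NVL returns default_value; otherwise it returns value. If both arguments are NULL, it returns NULL.

# Substitute a default value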
0: jdbc:hive2://bigdata01:10000> select ename,job,NVL(comm, -1) from emp limit 3;
+-----------+-----------+--------+
|   ename   |    job    |  _c2   |
+-----------+-----------+--------+
| zhangsan  | NULL      | -1.0   |
| SMITH     | CLERK     | -1.0   |
| ALLEN     | SALESMAN  | 300.0  |
+-----------+-----------+--------+
3 rows selected (0.075 seconds)
0: jdbc:hive2://bigdata01:10000> 

# Substitute the value of another column
0: jdbc:hive2://bigdata01:10000> select ename,mgr,nvl(comm,mgr) from emp limit 3;
+-----------+-------+---------+
|   ename   |  mgr  |   _c2   |
+-----------+-------+---------+
| zhangsan  | NULL  | NULL    |
| SMITH     | 7902  | 7902.0  |
| ALLEN     | 7698  | 300.0   |
+-----------+-------+---------+
3 rows selected (0.086 seconds)
0: jdbc:hive2://bigdata01:10000>
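
The NVL rule is small enough to state as a one-line Python function, with `None` standing for NULL. The name `nvl` simply mirrors the Hive function; type conversion of the result (such as 7902 being shown as 7902.0 above) is not modelled.

```python
def nvl(value, default_value):
    """Return default_value when value is NULL (None), otherwise value."""
    return default_value if value is None else value


print(nvl(None, -1))     # NULL comm replaced by the default
print(nvl(300.0, -1))    # non-NULL comm kept as is
print(nvl(None, None))   # both NULL: the result stays NULL
```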

Case When

0: jdbc:hive2://bigdata01:10000> create table emp_sex(
. . . . . . . . . . . . . . . .> name string, 
. . . . . . . . . . . . . . . .> dept_id string, 
. . . . . . . . . . . . . . . .> sex string) 
. . . . . . . . . . . . . . . .> row format delimited fields terminated by "\t";
No rows affected (0.068 seconds)
0: jdbc:hive2://bigdata01:10000> load data local inpath '/opt/module/datas/emp_sex.txt' into table emp_sex;
No rows affected (0.119 seconds)
0: jdbc:hive2://bigdata01:10000> select 
. . . . . . . . . . . . . . . .>   dept_id,
. . . . . . . . . . . . . . . .>   sum(case sex when '男' then 1 else 0 end) male_count,
. . . . . . . . . . . . . . . .>   sum(case sex when '女' then 1 else 0 end) female_count
. . . . . . . . . . . . . . . .> from 
. . . . . . . . . . . . . . . .>   emp_sex
. . . . . . . . . . . . . . . .> group by
. . . . . . . . . . . . . . . .>   dept_id;
+----------+-------------+---------------+
| dept_id  | male_count  | female_count  |
+----------+-------------+---------------+
| A        | 2           | 1             |
| B        | 1           | 2             |
+----------+-------------+---------------+
2 rows selected (17.335 seconds)
0: jdbc:hive2://bigdata01:10000>

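Rows to columns

CONCAT(string A/col, string B/col…): returns the concatenation of the input strings; any number of input strings is supported;
CONCAT_WS(separator, str1, str2,...): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments. The separator can be any string, just like the remaining arguments. If the separator is NULL, the result is also NULL. The function skips any NULL values after the separator argument. The separator is inserted between the strings being joined;
COLLECT_SET(col): accepts only primitive types; it deduplicates and collects the values of a column into a field of type array.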

0: jdbc:hive2://bigdata01:10000> create table person_info(
. . . . . . . . . . . . . . . .> name string, 
. . . . . . . . . . . . . . . .> constellation string, 
. . . . . . . . . . . . . . . .> blood_type string) 
. . . . . . . . . . . . . . . .> row format delimited fields terminated by "\t";
No rows affected (0.073 seconds)
0: jdbc:hive2://bigdata01:10000> load data local inpath "/opt/module/datas/constellation.txt" into table person_info;
No rows affected (0.12 seconds)
0: jdbc:hive2://bigdata01:10000> select
. . . . . . . . . . . . . . . .>     t1.base,
. . . . . . . . . . . . . . . .>     concat_ws('|', collect_set(t1.name)) name
. . . . . . . . . . . . . . . .> from
. . . . . . . . . . . . . . . .>     (select
. . . . . . . . . . . . . . . .>         name,
. . . . . . . . . . . . . . . .>         concat(constellation, ",", blood_type) base
. . . . . . . . . . . . . . . .>     from
. . . . . . . . . . . . . . . .>         person_info) t1
. . . . . . . . . . . . . . . .> group by
. . . . . . . . . . . . . . . .>     t1.base;
+----------+----------+
| t1.base  |   name   |
+----------+----------+
| 射手座,A    | 大海|凤姐    |
| 白羊座,A    | 孙悟空|猪八戒  |
| 白羊座,B    | 宋宋|苍老师   |
+----------+----------+
3 rows selected (17.668 seconds)
0: jdbc:hive2://bigdata01:10000> select * from person_info;
+-------------------+----------------------------+-------------------------+
| person_info.name  | person_info.constellation  | person_info.blood_type  |
+-------------------+----------------------------+-------------------------+
| 孙悟空               | 白羊座                        | A                       |
| 大海                | 射手座                        | A                       |
| 宋宋                | 白羊座                        | B                       |
| 猪八戒               | 白羊座                        | A                       |
| 凤姐                | 射手座                        | A                       |
| 苍老师               | 白羊座                        | B                       |
+-------------------+----------------------------+-------------------------+
6 rows selected (0.072 seconds)
0: jdbc:hive2://bigdata01:10000>
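
The rows-to-columns query above can be mirrored in plain Python: build the `constellation,blood_type` key, collect the distinct names per key, then join them with `|`. The data is copied from the person_info listing; insertion order is used for the names here, whereas Hive does not promise any particular order inside the collected set.

```python
people = [
    ("孙悟空", "白羊座", "A"),
    ("大海", "射手座", "A"),
    ("宋宋", "白羊座", "B"),
    ("猪八戒", "白羊座", "A"),
    ("凤姐", "射手座", "A"),
    ("苍老师", "白羊座", "B"),
]

groups = {}
for name, constellation, blood_type in people:
    base = f"{constellation},{blood_type}"   # concat(constellation, ",", blood_type)
    names = groups.setdefault(base, [])      # group by base
    if name not in names:                    # collect_set drops duplicates
        names.append(name)

# concat_ws('|', ...) joins the collected names
result = {base: "|".join(names) for base, names in groups.items()}
for base, joined in result.items():
    print(base, joined)
```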

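Columns to rows

EXPLODE(col): splits a complex array or map in a Hive column into multiple rows.
LATERAL VIEW
2.1 Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias
2.2 Explanation: used together with UDTFs such as explode (often combined with split). It turns one column of data into multiple rows, and the resulting rows can then be aggregated.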

0: jdbc:hive2://bigdata01:10000> create table movie_info(
. . . . . . . . . . . . . . . .>     movie string, 
. . . . . . . . . . . . . . . .>     category array<string>) 
. . . . . . . . . . . . . . . .> row format delimited fields terminated by "\t"
. . . . . . . . . . . . . . . .> collection items terminated by ",";
No rows affected (0.052 seconds)
0: jdbc:hive2://bigdata01:10000> load data local inpath "/opt/module/datas/movie.txt" into table movie_info;
No rows affected (0.152 seconds)
0: jdbc:hive2://bigdata01:10000> select
. . . . . . . . . . . . . . . .>     movie,
. . . . . . . . . . . . . . . .>     category_name
. . . . . . . . . . . . . . . .> from 
. . . . . . . . . . . . . . . .>     movie_info lateral view explode(category) table_tmp as category_name;
+--------------+----------------+
|    movie     | category_name  |
+--------------+----------------+
| 《疑犯追踪》       | 悬疑             |
| 《疑犯追踪》       | 动作             |
| 《疑犯追踪》       | 科幻             |
| 《疑犯追踪》       | 剧情             |
| 《Lie to me》  | 悬疑             |
| 《Lie to me》  | 警匪             |
| 《Lie to me》  | 动作             |
| 《Lie to me》  | 心理             |
| 《Lie to me》  | 剧情             |
| 《战狼2》        | 战争             |
| 《战狼2》        | 动作             |
| 《战狼2》        | 灾难             |
+--------------+----------------+
12 rows selected (0.059 seconds)
0: jdbc:hive2://bigdata01:10000> select * from movie_info;
+-------------------+-----------------------------+
| movie_info.movie  |     movie_info.category     |
+-------------------+-----------------------------+
| 《疑犯追踪》            | ["悬疑","动作","科幻","剧情"]       |
| 《Lie to me》       | ["悬疑","警匪","动作","心理","剧情"]  |
| 《战狼2》             | ["战争","动作","灾难"]            |
+-------------------+-----------------------------+
3 rows selected (0.075 seconds)
0: jdbc:hive2://bigdata01:10000>
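
In Python terms, `lateral view explode(category)` behaves like a nested loop that repeats the movie for each element of its array, after which the exploded rows can be aggregated. The data is copied from the movie_info listing; the per-category count is an extra illustration, not a query from the original notes.

```python
from collections import Counter

movies = [
    ("《疑犯追踪》", ["悬疑", "动作", "科幻", "剧情"]),
    ("《Lie to me》", ["悬疑", "警匪", "动作", "心理", "剧情"]),
    ("《战狼2》", ["战争", "动作", "灾难"]),
]

# One output row per (movie, category element)
exploded = [(movie, category) for movie, categories in movies
            for category in categories]
print(len(exploded))

# Aggregating on the exploded rows: how many movies per category
per_category = Counter(category for _, category in exploded)
print(per_category["动作"])
```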

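Window functions

OVER(): specifies the size of the data window that the analytic function works on; this window may change as the current row changes.

Parameter	Description
CURRENT ROW	The current row
n PRECEDING	The n rows before the current row
n FOLLOWING	The n rows after the current row
UNBOUNDED	The boundary: UNBOUNDED PRECEDING means from the start of the partition, UNBOUNDED FOLLOWING means to the end of the partition
LAG(col,n,default_val)	The value from the n-th row before the current row
LEAD(col,n, default_val)	The value from the n-th row after the current row
NTILE(n)	Distributes the rows of an ordered window into the specified number of groups. Groups are numbered starting from 1, and for each row NTILE returns the number of the group the row belongs to. Note: n must be of type int.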

0: jdbc:hive2://bigdata01:10000> create table business(
. . . . . . . . . . . . . . . .> name string, 
. . . . . . . . . . . . . . . .> orderdate string,
. . . . . . . . . . . . . . . .> cost int
. . . . . . . . . . . . . . . .> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
No rows affected (0.082 seconds)
0: jdbc:hive2://bigdata01:10000> load data local inpath "/opt/module/datas/business.txt" into table business;
No rows affected (0.12 seconds)
0: jdbc:hive2://bigdata01:10000> 

# (1) Query the customers who made a purchase in April 2017 and the total number of such customers
0: jdbc:hive2://bigdata01:10000> select name,count(*) over () 
. . . . . . . . . . . . . . . .> from business 
. . . . . . . . . . . . . . . .> where substring(orderdate,1,7) = '2017-04' 
. . . . . . . . . . . . . . . .> group by name;
+-------+-----------------+
| name  | count_window_0  |
+-------+-----------------+
| mart  | 2               |
| jack  | 2               |
+-------+-----------------+
2 rows selected (35.215 seconds)

# (2) Query each customer's purchase details together with the monthly purchase total
0: jdbc:hive2://bigdata01:10000> select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from
. . . . . . . . . . . . . . . .>  business;
+-------+-------------+-------+---------------+
| name  |  orderdate  | cost  | sum_window_0  |
+-------+-------------+-------+---------------+
| jack  | 2017-01-01  | 10    | 205           |
| jack  | 2017-01-08  | 55    | 205           |
| tony  | 2017-01-07  | 50    | 205           |
| jack  | 2017-01-05  | 46    | 205           |
| tony  | 2017-01-04  | 29    | 205           |
| tony  | 2017-01-02  | 15    | 205           |
| jack  | 2017-02-03  | 23    | 23            |
| mart  | 2017-04-13  | 94    | 341           |
| jack  | 2017-04-06  | 42    | 341           |
| mart  | 2017-04-11  | 75    | 341           |
| mart  | 2017-04-09  | 68    | 341           |
| mart  | 2017-04-08  | 62    | 341           |
| neil  | 2017-05-10  | 12    | 12            |
| neil  | 2017-06-12  | 80    | 80            |
+-------+-------------+-------+---------------+
14 rows selected (18.13 seconds)

# (3) In the scenario above, accumulate each customer's cost by date
0: jdbc:hive2://bigdata01:10000> select name,orderdate,cost, 
. . . . . . . . . . . . . . . .> sum(cost) over() as sample1,--sum over all rows 
. . . . . . . . . . . . . . . .> sum(cost) over(partition by name) as sample2,--partition by name, sum within each partition 
. . . . . . . . . . . . . . . .> sum(cost) over(partition by name order by orderdate) as sample3,--partition by name, running total within each partition 
. . . . . . . . . . . . . . . .> sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--same as sample3: aggregate from the start to the current row 
. . . . . . . . . . . . . . . .> sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --aggregate the current row and the previous row 
. . . . . . . . . . . . . . . .> sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--the current row, the previous row and the next row 
. . . . . . . . . . . . . . . .> sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --the current row and all following rows 
. . . . . . . . . . . . . . . .> from business;
+-------+-------------+-------+----------+----------+----------+----------+----------+----------+----------+
| name  |  orderdate  | cost  | sample1  | sample2  | sample3  | sample4  | sample5  | sample6  | sample7  |
+-------+-------------+-------+----------+----------+----------+----------+----------+----------+----------+
| jack  | 2017-01-01  | 10    | 661      | 176      | 10       | 10       | 10       | 56       | 176      |
| jack  | 2017-01-05  | 46    | 661      | 176      | 56       | 56       | 56       | 111      | 166      |
| jack  | 2017-01-08  | 55    | 661      | 176      | 111      | 111      | 101      | 124      | 120      |
| jack  | 2017-02-03  | 23    | 661      | 176      | 134      | 134      | 78       | 120      | 65       |
| jack  | 2017-04-06  | 42    | 661      | 176      | 176      | 176      | 65       | 65       | 42       |
| mart  | 2017-04-08  | 62    | 661      | 299      | 62       | 62       | 62       | 130      | 299      |
| mart  | 2017-04-09  | 68    | 661      | 299      | 130      | 130      | 130      | 205      | 237      |
| mart  | 2017-04-11  | 75    | 661      | 299      | 205      | 205      | 143      | 237      | 169      |
| mart  | 2017-04-13  | 94    | 661      | 299      | 299      | 299      | 169      | 169      | 94       |
| neil  | 2017-05-10  | 12    | 661      | 92       | 12       | 12       | 12       | 92       | 92       |
| neil  | 2017-06-12  | 80    | 661      | 92       | 92       | 92       | 92       | 92       | 80       |
| tony  | 2017-01-02  | 15    | 661      | 94       | 15       | 15       | 15       | 44       | 94       |
| tony  | 2017-01-04  | 29    | 661      | 94       | 44       | 44       | 44       | 94       | 79       |
| tony  | 2017-01-07  | 50    | 661      | 94       | 94       | 94       | 79       | 79       | 50       |
+-------+-------------+-------+----------+----------+----------+----------+----------+----------+----------+
14 rows selected (48.472 seconds)
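
The ROWS BETWEEN frames used in sample4 to sample7 can be reproduced for a single partition with a short Python sketch. The helper `windowed_sum` is hypothetical; `None` stands for UNBOUNDED, and the input is jack's costs in orderdate order, taken from the output above.

```python
def windowed_sum(values, preceding, following):
    """Sum over a ROWS frame around each position.

    preceding / following are row counts; None means UNBOUNDED.
    """
    n = len(values)
    out = []
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n if following is None else min(n, i + following + 1)
        out.append(sum(values[start:end]))
    return out


jack = [10, 46, 55, 23, 42]  # jack's cost column ordered by orderdate

sample4 = windowed_sum(jack, None, 0)   # UNBOUNDED PRECEDING .. CURRENT ROW
sample5 = windowed_sum(jack, 1, 0)      # 1 PRECEDING .. CURRENT ROW
sample6 = windowed_sum(jack, 1, 1)      # 1 PRECEDING .. 1 FOLLOWING
sample7 = windowed_sum(jack, 0, None)   # CURRENT ROW .. UNBOUNDED FOLLOWING
print(sample4, sample5, sample6, sample7)
```

The four lists line up with jack's sample4 to sample7 columns in the result table; sample3 gives the same numbers as sample4 here because there are no duplicate order dates within the partition.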

# (4) Show each customer's previous purchase date
0: jdbc:hive2://bigdata01:10000> select name,orderdate,cost, 
. . . . . . . . . . . . . . . .> lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1, lag(orderdate,2) over (partition by name order by orderdate) as time2 
. . . . . . . . . . . . . . . .> from business;
+-------+-------------+-------+-------------+-------------+
| name  |  orderdate  | cost  |    time1    |    time2    |
+-------+-------------+-------+-------------+-------------+
| jack  | 2017-01-01  | 10    | 1900-01-01  | NULL        |
| jack  | 2017-01-05  | 46    | 2017-01-01  | NULL        |
| jack  | 2017-01-08  | 55    | 2017-01-05  | 2017-01-01  |
| jack  | 2017-02-03  | 23    | 2017-01-08  | 2017-01-05  |
| jack  | 2017-04-06  | 42    | 2017-02-03  | 2017-01-08  |
| mart  | 2017-04-08  | 62    | 1900-01-01  | NULL        |
| mart  | 2017-04-09  | 68    | 2017-04-08  | NULL        |
| mart  | 2017-04-11  | 75    | 2017-04-09  | 2017-04-08  |
| mart  | 2017-04-13  | 94    | 2017-04-11  | 2017-04-09  |
| neil  | 2017-05-10  | 12    | 1900-01-01  | NULL        |
| neil  | 2017-06-12  | 80    | 2017-05-10  | NULL        |
| tony  | 2017-01-02  | 15    | 1900-01-01  | NULL        |
| tony  | 2017-01-04  | 29    | 2017-01-02  | NULL        |
| tony  | 2017-01-07  | 50    | 2017-01-04  | 2017-01-02  |
+-------+-------------+-------+-------------+-------------+
14 rows selected (18.216 seconds)
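
LAG and LEAD simply look a fixed number of rows backwards or forwards inside the ordered partition and fall back to a default when no such row exists. A minimal Python model for one partition, with hypothetical helper names and jack's order dates from the output above:

```python
def lag(values, n=1, default=None):
    """Value from n rows earlier, or default when there is no such row."""
    return [values[i - n] if i - n >= 0 else default
            for i in range(len(values))]


def lead(values, n=1, default=None):
    """Value from n rows later, or default when there is no such row."""
    return [values[i + n] if i + n < len(values) else default
            for i in range(len(values))]


jack_dates = ["2017-01-01", "2017-01-05", "2017-01-08",
              "2017-02-03", "2017-04-06"]

time1 = lag(jack_dates, 1, "1900-01-01")
time2 = lag(jack_dates, 2)
next_visit = lead(jack_dates, 1)
print(time1, time2, next_visit)
```

`time1` and `time2` match jack's rows in the result table, with `None` playing the role of NULL.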

# (5) Query the orders that fall in the earliest 20% by date
0: jdbc:hive2://bigdata01:10000> select * from (
. . . . . . . . . . . . . . . .>     select name,orderdate,cost, ntile(5) over(order by orderdate) sorted
. . . . . . . . . . . . . . . .>     from business
. . . . . . . . . . . . . . . .> ) t
. . . . . . . . . . . . . . . .> where sorted = 1;
+---------+--------------+---------+-----------+
| t.name  | t.orderdate  | t.cost  | t.sorted  |
+---------+--------------+---------+-----------+
| jack    | 2017-01-01   | 10      | 1         |
| tony    | 2017-01-02   | 15      | 1         |
| tony    | 2017-01-04   | 29      | 1         |
+---------+--------------+---------+-----------+
3 rows selected (17.041 seconds)
0: jdbc:hive2://bigdata01:10000> 
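
How NTILE splits 14 ordered rows into 5 groups can be sketched in Python. The sketch assumes the usual SQL rule that group sizes differ by at most one and that the larger groups come first; the function name is only illustrative.

```python
def ntile(num_rows, n):
    """Return the group number (1..n) for each of num_rows ordered rows."""
    base, extra = divmod(num_rows, n)
    numbers = []
    for group in range(1, n + 1):
        # The first `extra` groups receive one additional row
        size = base + (1 if group <= extra else 0)
        numbers.extend([group] * size)
    return numbers


groups = ntile(14, 5)
print(groups)
print(groups.count(1))  # rows in the first group
```

Under this rule the first group holds 3 of the 14 orders, consistent with the three rows returned by `where sorted = 1` above.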

# (6) Find customers who visited the store on two consecutive days
0: jdbc:hive2://bigdata01:10000> select *,row_number() over(partition by name order by orderdate) from business;
+----------------+---------------------+----------------+----------------------+
| business.name  | business.orderdate  | business.cost  | row_number_window_0  |
+----------------+---------------------+----------------+----------------------+
| jack           | 2017-01-01          | 10             | 1                    |
| jack           | 2017-01-05          | 46             | 2                    |
| jack           | 2017-01-08          | 55             | 3                    |
| jack           | 2017-02-03          | 23             | 4                    |
| jack           | 2017-04-06          | 42             | 5                    |
| mart           | 2017-04-08          | 62             | 1                    |
| mart           | 2017-04-09          | 68             | 2                    |
| mart           | 2017-04-11          | 75             | 3                    |
| mart           | 2017-04-13          | 94             | 4                    |
| neil           | 2017-05-10          | 12             | 1                    |
| neil           | 2017-06-12          | 80             | 2                    |
| tony           | 2017-01-02          | 15             | 1                    |
| tony           | 2017-01-04          | 29             | 2                    |
| tony           | 2017-01-07          | 50             | 3                    |
+----------------+---------------------+----------------+----------------------+
14 rows selected (18.233 seconds)
0: jdbc:hive2://bigdata01:10000>

0: jdbc:hive2://bigdata01:10000> select *,date_sub(orderdate, rn) temp from (select *,row_number() over(partition by name order by orderdate) rn from business) t1;
+----------+---------------+----------+--------+-------------+
| t1.name  | t1.orderdate  | t1.cost  | t1.rn  |    temp     |
+----------+---------------+----------+--------+-------------+
| jack     | 2017-01-01    | 10       | 1      | 2016-12-31  |
| jack     | 2017-01-05    | 46       | 2      | 2017-01-03  |
| jack     | 2017-01-08    | 55       | 3      | 2017-01-05  |
| jack     | 2017-02-03    | 23       | 4      | 2017-01-30  |
| jack     | 2017-04-06    | 42       | 5      | 2017-04-01  |
| mart     | 2017-04-08    | 62       | 1      | 2017-04-07  |
| mart     | 2017-04-09    | 68       | 2      | 2017-04-07  |
| mart     | 2017-04-11    | 75       | 3      | 2017-04-08  |
| mart     | 2017-04-13    | 94       | 4      | 2017-04-09  |
| neil     | 2017-05-10    | 12       | 1      | 2017-05-09  |
| neil     | 2017-06-12    | 80       | 2      | 2017-06-10  |
| tony     | 2017-01-02    | 15       | 1      | 2017-01-01  |
| tony     | 2017-01-04    | 29       | 2      | 2017-01-02  |
| tony     | 2017-01-07    | 50       | 3      | 2017-01-04  |
+----------+---------------+----------+--------+-------------+
14 rows selected (17.486 seconds)
0: jdbc:hive2://bigdata01:10000> 

0: jdbc:hive2://bigdata01:10000> select name,temp,count(*) c from (select *,date_sub(orderdate, rn) temp from (select *,row_number() over(partition by name order by orderdate) rn from business) t1) t2 group by name,temp;
+-------+-------------+----+
| name  |    temp     | c  |
+-------+-------------+----+
| jack  | 2016-12-31  | 1  |
| jack  | 2017-01-03  | 1  |
| jack  | 2017-01-05  | 1  |
| jack  | 2017-01-30  | 1  |
| jack  | 2017-04-01  | 1  |
| mart  | 2017-04-07  | 2  |
| mart  | 2017-04-08  | 1  |
| mart  | 2017-04-09  | 1  |
| neil  | 2017-05-09  | 1  |
| neil  | 2017-06-10  | 1  |
| tony  | 2017-01-01  | 1  |
| tony  | 2017-01-02  | 1  |
| tony  | 2017-01-04  | 1  |
+-------+-------------+----+
13 rows selected (35.076 seconds)
0: jdbc:hive2://bigdata01:10000> 

0: jdbc:hive2://bigdata01:10000> select name,temp,count(*) c from (select *,date_sub(orderdate, rn) temp from (select *,row_number() over(partition by name order by orderdate) rn from business) t1) t2 group by name,temp having c>=2;
+-------+-------------+----+
| name  |    temp     | c  |
+-------+-------------+----+
| mart  | 2017-04-07  | 2  |
+-------+-------------+----+
1 row selected (35.229 seconds)
0: jdbc:hive2://bigdata01:10000>
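
The queries above rely on one trick: subtracting the row number from the order date gives the same value for all dates in a consecutive run. A minimal plain-Java sketch of that idea follows; the class and method names are hypothetical, and the input is assumed to be one customer's distinct dates sorted ascending.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Hive API: the "date minus row number" trick in plain Java.
class ConsecutiveDaysSketch {
    // Returns the length of the longest run of consecutive days
    // in a list of distinct dates sorted ascending.
    static int longestRun(List<LocalDate> sortedDates) {
        Map<LocalDate, Integer> groups = new HashMap<>();
        int best = 0;
        for (int i = 0; i < sortedDates.size(); i++) {
            // Same idea as date_sub(orderdate, rn), where rn starts at 1.
            LocalDate anchor = sortedDates.get(i).minusDays(i + 1);
            int count = groups.merge(anchor, 1, Integer::sum);
            best = Math.max(best, count);
        }
        return best;
    }

    public static void main(String[] args) {
        List<LocalDate> mart = Arrays.asList(
                LocalDate.parse("2017-04-08"), LocalDate.parse("2017-04-09"),
                LocalDate.parse("2017-04-11"), LocalDate.parse("2017-04-13"));
        System.out.println(longestRun(mart)); // prints 2
    }
}
```

Filtering on a run length of at least 2 corresponds to the having c>=2 clause in the last query.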

Rank

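RANK(): rows with equal values share the same rank, and the ranks that follow are skipped, so the highest rank still equals the row count.
DENSE_RANK(): rows with equal values share the same rank and no ranks are skipped, so the highest rank shrinks.
ROW_NUMBER(): numbers the rows sequentially in sort order, so ties still receive distinct numbers.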
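
The difference between the three functions is easiest to see on a tie. The following is a minimal plain-Java sketch (hypothetical class and method names, not Hive code) that computes all three numbers for scores already sorted in descending order within one partition.

```java
import java.util.Arrays;

// Hypothetical helper, not a Hive API: rank, dense_rank and row_number on sorted scores.
class RankSketch {
    // For each score (sorted descending) returns {rank, denseRank, rowNumber}.
    static int[][] rankAll(int[] sortedDesc) {
        int[][] out = new int[sortedDesc.length][3];
        int rank = 0;
        int dense = 0;
        for (int i = 0; i < sortedDesc.length; i++) {
            boolean tie = i > 0 && sortedDesc[i] == sortedDesc[i - 1];
            if (!tie) {
                rank = i + 1; // rank jumps to the current position, leaving gaps after ties
                dense++;      // dense rank only advances by one
            }
            out[i][0] = rank;
            out[i][1] = dense;
            out[i][2] = i + 1; // row number always advances
        }
        return out;
    }

    public static void main(String[] args) {
        // Scores with one tie: 84, 84, 78, 68
        System.out.println(Arrays.deepToString(rankAll(new int[] {84, 84, 78, 68})));
        // prints [[1, 1, 1], [1, 1, 2], [3, 2, 3], [4, 3, 4]]
    }
}
```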

0: jdbc:hive2://bigdata01:10000> create table score(
. . . . . . . . . . . . . . . .> name string,
. . . . . . . . . . . . . . . .> subject string, 
. . . . . . . . . . . . . . . .> score int) 
. . . . . . . . . . . . . . . .> row format delimited fields terminated by "\t";
No rows affected (0.042 seconds)
0: jdbc:hive2://bigdata01:10000> load data local inpath '/opt/module/datas/score.txt' into table score;
No rows affected (0.122 seconds)
0: jdbc:hive2://bigdata01:10000> select name,
. . . . . . . . . . . . . . . .> subject,
. . . . . . . . . . . . . . . .> score,
. . . . . . . . . . . . . . . .> rank() over(partition by subject order by score desc) rp,
. . . . . . . . . . . . . . . .> dense_rank() over(partition by subject order by score desc) drp,
. . . . . . . . . . . . . . . .> row_number() over(partition by subject order by score desc) rmp
. . . . . . . . . . . . . . . .> from score;
+-------+----------+--------+-----+------+------+
| name  | subject  | score  | rp  | drp  | rmp  |
+-------+----------+--------+-----+------+------+
| 孙悟空   | 数学       | 95     | 1   | 1    | 1    |
| 宋宋    | 数学       | 86     | 2   | 2    | 2    |
| 婷婷    | 数学       | 85     | 3   | 3    | 3    |
| 大海    | 数学       | 56     | 4   | 4    | 4    |
| 宋宋    | 英语       | 84     | 1   | 1    | 1    |
| 大海    | 英语       | 84     | 1   | 1    | 2    |
| 婷婷    | 英语       | 78     | 3   | 2    | 3    |
| 孙悟空   | 英语       | 68     | 4   | 3    | 4    |
| 大海    | 语文       | 94     | 1   | 1    | 1    |
| 孙悟空   | 语文       | 87     | 2   | 2    | 2    |
| 婷婷    | 语文       | 65     | 3   | 3    | 3    |
| 宋宋    | 语文       | 64     | 4   | 4    | 4    |
+-------+----------+--------+-----+------+------+
12 rows selected (17.044 seconds)
0: jdbc:hive2://bigdata01:10000>

5.2 Built-in Functions

# List the built-in functions
show functions;

# Show the usage of a built-in function
0: jdbc:hive2://bigdata01:10000> desc function upper;
+----------------------------------------------------+
|                      tab_name                      |
+----------------------------------------------------+
| upper(str) - Returns str with all characters changed to uppercase |
+----------------------------------------------------+
1 row selected (0.015 seconds)

# Show the detailed usage of a built-in function
0: jdbc:hive2://bigdata01:10000> desc function extended upper;
+----------------------------------------------------+
|                      tab_name                      |
+----------------------------------------------------+
| upper(str) - Returns str with all characters changed to uppercase |
| Synonyms: ucase                                    |
| Example:                                           |
|   > SELECT upper('Facebook') FROM src LIMIT 1;     |
|   'FACEBOOK'                                       |
| Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper |
| Function type:BUILTIN                              |
+----------------------------------------------------+
7 rows selected (0.013 seconds)
0: jdbc:hive2://bigdata01:10000>

5.3 Date Functions

0: jdbc:hive2://bigdata01:10000> show functions like "*date*";
+---------------+
|   tab_name    |
+---------------+
| current_date  |
| date_add      |
| date_format   |
| date_sub      |
| datediff      |
| to_date       |
+---------------+
6 rows selected (0.013 seconds)
0: jdbc:hive2://bigdata01:10000> select current_date();
+-------------+
|     _c0     |
+-------------+
| 2021-03-22  |
+-------------+
1 row selected (0.065 seconds)
0: jdbc:hive2://bigdata01:10000> select date_add(current_date(),7);
+-------------+
|     _c0     |
+-------------+
| 2021-03-29  |
+-------------+
1 row selected (0.059 seconds)
0: jdbc:hive2://bigdata01:10000> select date_format("2021-03-22 20:28:53", "yyyy-MM-dd HH:mm:ss");
+----------------------+
|         _c0          |
+----------------------+
| 2021-03-22 20:28:53  |
+----------------------+
1 row selected (0.062 seconds)
0: jdbc:hive2://bigdata01:10000> select date_sub(current_date(),7);
+-------------+
|     _c0     |
+-------------+
| 2021-03-15  |
+-------------+
1 row selected (0.073 seconds)
0: jdbc:hive2://bigdata01:10000> select datediff(current_date, "1997-06-14");
+-------+
|  _c0  |
+-------+
| 8682  |
+-------+
1 row selected (0.074 seconds)
0: jdbc:hive2://bigdata01:10000> SELECT to_date('2021-03-22');
+-------------+
|     _c0     |
+-------------+
| 2021-03-22  |
+-------------+
1 row selected (0.061 seconds)
0: jdbc:hive2://bigdata01:10000>

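5.4 User-Defined Functions

Categories of user-defined functions

UDF (User-Defined Function): one row in, one row out
UDAF (User-Defined Aggregation Function): an aggregate function, many rows in, one row out
UDTF (User-Defined Table-Generating Function): one row in, many rows out

Steps to create a user-defined function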

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
</dependency>

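package com.simwor.bigdata.hive;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Returns the length of the string form of its single primitive argument (0 for NULL).
public class MyUDF extends GenericUDF {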

    @Override
    public ObjectInspector initialize(ObjectInspector[] objectInspectors) throws UDFArgumentException {
        if (objectInspectors.length != 1)
            throw new UDFArgumentLengthException("Wrong arguments length, accept one");
        if(!objectInspectors[0].getCategory().equals(ObjectInspector.Category.PRIMITIVE))
            throw new UDFArgumentTypeException(0, "Wrong argument type, accept primitive");
        return PrimitiveObjectInspectorFactory.javaIntObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] deferredObjects) throws HiveException {
        Object o = deferredObjects[0].get();

        if(o == null)
            return 0;

        return o.toString().length();
    }

    @Override
    public String getDisplayString(String[] strings) {
        return "";
    }
}

[omm@bigdata01 lib]$ ll | grep hive-1.0
-rw-r--r-- 1 omm omm     3141 Mar 23 20:29 hive-1.0-SNAPSHOT.jar
[omm@bigdata01 lib]$ hiveservices.sh restart
[omm@bigdata01 lib]$

[omm@bigdata01 ~]$ beeline -u jdbc:hive2://bigdata01:10000 -n omm

0: jdbc:hive2://bigdata01:10000> create temporary function my_len as "com.simwor.bigdata.hive.MyUDF";
No rows affected (2.3 seconds)

0: jdbc:hive2://bigdata01:10000> select ename,my_len(ename) from emp;
+-----------+------+
|   ename   | _c1  |
+-----------+------+
| zhangsan  | 8    |
| SMITH     | 5    |
| ALLEN     | 5    |
| WARD      | 4    |
| JONES     | 5    |
| MARTIN    | 6    |
| BLAKE     | 5    |
| CLARK     | 5    |
| SCOTT     | 5    |
| KING      | 4    |
| TURNER    | 6    |
| ADAMS     | 5    |
| JAMES     | 5    |
| FORD      | 4    |
| MILLER    | 6    |
+-----------+------+
15 rows selected (5.95 seconds)

0: jdbc:hive2://bigdata01:10000> select ename,my_len(ename,ename) from emp;
Error: Error while compiling statement: FAILED: SemanticException [Error 10015]: Line 1:13 Arguments length mismatch 'ename': Wrong arguments length, accept one (state=21000,code=10015)

0: jdbc:hive2://bigdata01:10000> select my_len(split("aa,bb", ","));
Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 1:14 Argument type mismatch '","': Wrong argument type, accept primitive (state=42000,code=10016)
0: jdbc:hive2://bigdata01:10000>

7. Compression and Storage

7.1 Compression

Compression formats supported by Hadoop

[omm@bigdata01 ~]$ hadoop checknative
Native library checking:
hadoop:  true /opt/module/hadoop-3.1.4/lib/native/libhadoop.so.1.0.0
zlib:    true /lib64/libz.so.1
zstd  :  false 
snappy:  true /lib64/libsnappy.so.1
lz4:     true revision:10301
bzip2:   true /lib64/libbz2.so.1
openssl: false Cannot load libcrypto.so (libcrypto.so: cannot open shared object file: No such file or directory)!
ISA-L:   false libhadoop was built without ISA-L support
PMDK:    false The native code was built without PMDK support.
[omm@bigdata01 ~]$

Hadoop codecs

Compression format	Codec
DEFLATE	org.apache.hadoop.io.compress.DefaultCodec
gzip	org.apache.hadoop.io.compress.GzipCodec
bzip2	org.apache.hadoop.io.compress.BZip2Codec
LZO	com.hadoop.compression.lzo.LzopCodec
Snappy	org.apache.hadoop.io.compress.SnappyCodec

Compression performance comparison

Algorithm	Original size	Compressed size	Compression speed	Decompression speed
gzip	8.3GB	1.8GB	17.5MB/s	58MB/s
bzip2	8.3GB	1.1GB	2.4MB/s	9.5MB/s
LZO	8.3GB	2.9GB	49.3MB/s	74.6MB/s

Compression parameters

Parameter	Default	Stage	Recommendation
io.compression.codecs (configured in core-site.xml)	org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.Lz4Codec	Input compression	Hadoop uses the file extension to determine whether a codec is supported
mapreduce.map.output.compress	false	Mapper output	Set to true to enable compression
mapreduce.map.output.compress.codec	org.apache.hadoop.io.compress.DefaultCodec	Mapper output	Use the LZO, LZ4, or Snappy codec to compress data at this stage
mapreduce.output.fileoutputformat.compress	false	Reducer output	Set to true to enable compression
mapreduce.output.fileoutputformat.compress.codec	org.apache.hadoop.io.compress.DefaultCodec	Reducer output	Use a standard tool or codec such as gzip or bzip2
mapreduce.output.fileoutputformat.compress.type	RECORD	Reducer output	Compression type used for SequenceFile output: NONE, RECORD, or BLOCK

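7.2 File Storage Formats

Hive mainly supports the following storage formats: TEXTFILE, SEQUENCEFILE, ORC, and PARQUET.

Columnar storage vs. row storage

Row storage: when a query needs an entire row that matches a condition, columnar storage has to visit each column's storage area to collect that row's values, whereas row storage only has to locate one value because the rest of the row sits right next to it. Row storage is therefore faster for this kind of query.
Columnar storage: because the data of each field is stored together, a query that needs only a few fields reads far less data. All values of a field share the same data type, so a compression algorithm can be tailored to each column.
TEXTFILE and SEQUENCEFILE are row-based storage formats; ORC and PARQUET are column-based.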

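TextFile format

The default format. Data is not compressed, so both disk usage and parsing overhead are high.
It can be combined with Gzip or Bzip2, but with Gzip Hive cannot split the data, so it cannot process the file in parallel.

ORC format

Each ORC file consists of one or more stripes. A stripe is typically the size of an HDFS block and contains many records, which are stored independently column by column. A stripe corresponds to the row group concept in Parquet.
Each stripe has three parts: Index Data, Row Data, and Stripe Footer.

Index Data: a lightweight index, built every 10,000 rows by default. The index presumably only records the offsets of a row's fields within Row Data.
Row Data: holds the actual data. A batch of rows is taken and then stored column by column. Each column is encoded and split into multiple Streams for storage.
Stripe Footer: stores the type, length, and other information of each Stream.
Each file also has a File Footer, which stores the row count of each stripe, the data type of each column, and so on. At the very end of each file is a PostScript, which records the compression type of the whole file and the length of the File Footer.
When reading a file, the reader seeks to the end of the file and reads the PostScript, parses the File Footer length from it, reads the File Footer, parses the information about each stripe from it, and then reads the stripes. In other words, the file is read from back to front.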

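Parquet format

Parquet files are stored in binary form, so they cannot be read directly as text. A file contains both its data and its metadata, so the Parquet format is self-describing.

Row Group: each row group contains a certain number of rows. An HDFS file stores at least one row group, similar to the stripe concept in ORC.
Column Chunk: within a row group, each column is stored in one column chunk, and all columns of the row group are stored contiguously in the row group. All values in a column chunk have the same type, and different column chunks may be compressed with different algorithms.
Page: each column chunk is divided into multiple pages. A page is the smallest unit of encoding, and different pages of the same column chunk may use different encodings.

Usually, when storing Parquet data, the row group size is set according to the block size. Since the smallest unit of data a Mapper task processes is generally one block, each row group can then be handled by one Mapper task, which increases the parallelism of task execution.

[Figure: layout of a Parquet file]

The figure above shows the contents of a Parquet file. One file can store multiple row groups. The file begins and ends with its Magic Code, which is used to verify that it is a Parquet file. Footer length records the size of the file metadata, and the offset of the metadata can be computed from this value and the file length. The file metadata includes the metadata of every row group and the schema of the data stored in the file. Besides the per-row-group metadata in the file, each page starts with its own metadata. Parquet has three types of pages: data pages, dictionary pages, and index pages. A data page stores the values of the column in the current row group. A dictionary page stores the encoding dictionary of the column values, and each column chunk contains at most one dictionary page. An index page stores the index of the column in the current row group; Parquet does not currently support index pages.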

7.3 Comparison Experiment of Mainstream File Storage Formats

Create the tables

-- Create the text table
create table log_text (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as textfile;

-- Load the data
load data local inpath '/opt/module/datas/log.data' into table log_text;

hive (default)> dfs -du -h /user/hive/warehouse/log_text;
18.1 M  /user/hive/warehouse/log_text/log.data

-- Create an uncompressed ORC table
create table log_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc
tblproperties("orc.compress"="NONE");

-- Insert the data
insert into log_orc select * from log_text;

hive (default)> dfs -du -h /user/hive/warehouse/log_orc/ ;
2.8 M  /user/hive/warehouse/log_orc/000000_0

-- Create the Parquet table
create table log_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as parquet;

-- Insert the data
insert into log_parquet select * from log_text;

hive (default)> dfs -du -h /user/hive/warehouse/log_parquet/ ;
13.1 M  /user/hive/warehouse/log_parquet/000000_0

Comparison

Compression ratio of the stored files: ORC > Parquet > TextFile. Query speeds are similar.

# 1. TextFile
hive (default)> select count(*) from log_text;
_c0
100000
Time taken: 21.54 seconds, Fetched: 1 row(s)
Time taken: 21.08 seconds, Fetched: 1 row(s)
Time taken: 19.298 seconds, Fetched: 1 row(s)

# 2. ORC
hive (default)> select count(*) from log_orc;
_c0
100000
Time taken: 20.867 seconds, Fetched: 1 row(s)
Time taken: 22.667 seconds, Fetched: 1 row(s)
Time taken: 18.36 seconds, Fetched: 1 row(s)

# 3. Parquet
hive (default)> select count(*) from log_parquet;
_c0
100000
Time taken: 22.922 seconds, Fetched: 1 row(s)
Time taken: 21.074 seconds, Fetched: 1 row(s)
Time taken: 18.384 seconds, Fetched: 1 row(s)

7.4 Combining Storage and Compression

Compression for ORC storage

Key	Default	Notes
orc.compress	ZLIB	high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size	262,144	number of bytes in each compression chunk
orc.stripe.size	268,435,456	number of bytes in each stripe
orc.row.index.stride	10,000	number of rows between index entries (must be >= 1000)
orc.create.index	true	whether to create row indexes
orc.bloom.filter.columns	""	comma separated list of column names for which bloom filter should be created
orc.bloom.filter.fpp	0.05	false positive probability for bloom filter (must >0.0 and <1.0)

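Testing storage and compression

In real projects, Hive tables are usually stored as ORC or Parquet, and the compression codec is usually Snappy or LZO.

-- ORC table with ZLIB compression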
create table log_orc_zlib(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc
tblproperties("orc.compress"="ZLIB");

-- Insert the data
insert into log_orc_zlib select * from log_text;

-- ORC table with Snappy compression
create table log_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc
tblproperties("orc.compress"="SNAPPY");

-- Insert the data
insert into log_orc_snappy select * from log_text;

-- Parquet table with Snappy compression
create table log_par_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as parquet
tblproperties("parquet.compression"="SNAPPY");

-- Insert the data
insert into log_par_snappy select * from log_text;

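8. Enterprise-Level Tuning

8.1 Fetch Task

Fetch means that Hive can answer certain queries without running a MapReduce job. For example, for SELECT * FROM employees; Hive can simply read the files in the storage directory of the employees table and print the result to the console.
In hive-default.xml.template, hive.fetch.task.conversion defaults to more (older Hive versions default to minimal). With the value more, full-table selects, column selects, and limit queries do not go through MapReduce.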

<property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
      Expects one of [none, minimal, more].
      Some select queries can be converted to single FETCH task minimizing latency.
      Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incurs RS), lateral views and joins.
      0. none : disable hive.fetch.task.conversion
      1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
      2. more  : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
    </description>
  </property>

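8.2 Local Mode

Most Hadoop jobs need the full scalability that Hadoop provides to process large data sets.
Sometimes, however, the input to a Hive query is very small.
In that case, the time spent launching the tasks for the query can far exceed the actual execution time of the job.
For most such cases, Hive can run all tasks on a single machine in local mode, which noticeably shortens the execution time for small data sets.

-- Enable local MapReduce
set hive.exec.mode.local.auto=true;
-- Maximum input size for local MapReduce; local mode is used when the input is smaller than this value (default 134217728, i.e. 128 MB)
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- Maximum number of input files for local MapReduce; local mode is used when the number of input files is smaller than this value (default 4)
set hive.exec.mode.local.auto.input.files.max=10;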

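8.3 Table Optimization

8.3.1 Large Table Join Large Table

Filtering null keys

A join sometimes times out because some keys have too many rows. Rows with the same key are all sent to the same reducer, which then runs out of memory.
Analyze these abnormal keys carefully. In many cases their rows are bad data and should be filtered out in the SQL statement.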

# Create the source table, the null-id table, and the table for the join result
-- Create the source table
create table ori(id bigint, t bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

-- Create the null-id table
create table nullidtable(id bigint, t bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

-- Create the table that holds the join result
create table jointable(id bigint, t bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

# Load the source data and the null-id data into their tables
hive (default)> load data local inpath '/opt/module/datas/ori' into table ori;
hive (default)> load data local inpath '/opt/module/datas/nullid' into table nullidtable;

# Test without filtering null ids
hive (default)> insert overwrite table jointable select n.* from nullidtable n left join ori o on n.id = o.id;
Time taken: 42.038 seconds
Time taken: 37.284 seconds

# Test with null ids filtered out
hive (default)> insert overwrite table jointable select n.* from (select * from nullidtable where id is not null ) n  left join ori o on n.id = o.id;
Time taken: 31.725 seconds
Time taken: 28.876 seconds

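Converting null keys

Sometimes a null key has many rows, yet those rows are valid data and must appear in the join result.
In that case, we can assign a random value to the null key column of table a, so that these rows are spread randomly and evenly across different reducers.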

insert overwrite table jointable
select n.* from nullidtable n full join ori o on 
nvl(n.id,rand()) = o.id;
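
To illustrate why replacing a null key with a random value helps, here is a minimal plain-Java simulation. It is only a sketch under stated assumptions: the partition method imitates a hash partitioner of the form (hash & Integer.MAX_VALUE) % reducers, null is treated as hash 0, and all class and method names are hypothetical rather than Hive or Hadoop APIs.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical simulation, not Hive code: how null join keys land on reducers.
class NullKeySaltingSketch {
    // Imitates hash partitioning; a null key is assumed to hash to 0.
    static int partition(Object key, int reducers) {
        int hash = (key == null) ? 0 : key.hashCode();
        return (hash & Integer.MAX_VALUE) % reducers;
    }

    // Counts rows per reducer for nullRows rows whose join key is NULL.
    // With salt = true, each NULL key is replaced by a random double, like nvl(id, rand()).
    static int[] distribute(int nullRows, int reducers, boolean salt, long seed) {
        Random random = new Random(seed);
        int[] counts = new int[reducers];
        for (int i = 0; i < nullRows; i++) {
            Object key = salt ? Double.valueOf(random.nextDouble()) : null;
            counts[partition(key, reducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(distribute(1000, 5, false, 42L))); // all rows on one reducer
        System.out.println(Arrays.toString(distribute(1000, 5, true, 42L)));  // rows spread out
    }
}
```

Without salting, every null-key row goes to the same reducer; with salting, the rows are spread over several reducers, which is the effect the statement above aims for.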

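8.3.2 MapJoin (Small Table Join Large Table)

If MapJoin is not specified or its conditions are not met, the Hive parser converts the join into a Common Join, which completes the join in the Reduce phase and is prone to data skew.
MapJoin loads the whole small table into memory and performs the join on the map side, avoiding reducer processing.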

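# (1) Enable automatic MapJoin selection (default is true)
set hive.auto.convert.join = true;
# (2) Threshold between large and small tables (tables under 25 MB are treated as small by default):
set hive.mapjoin.smalltable.filesize=25000000;
# (3) Run the small table JOIN large table statement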
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
left join bigtable  b
on s.id = b.id;
Time taken: 24.594 seconds
# (4) Run the large table JOIN small table statement
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable  b
left join smalltable  s
on s.id = b.id;
Time taken: 24.315 seconds

8.3.3 Group By

By default, all rows with the same key in the Map phase are sent to one reducer, so a key with too much data causes skew.

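Not every aggregation has to be completed on the Reduce side. Many aggregations can be partially computed on the Map side, with the final result produced on the Reduce side.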

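# Parameters for map-side aggregation
# (1) Whether to aggregate on the Map side (default true)
set hive.map.aggr = true;
# (2) Number of rows on which map-side aggregation is performed
set hive.groupby.mapaggr.checkinterval = 100000;
# (3) Balance the load when the data is skewed (default false)
set hive.groupby.skewindata = true;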

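When this option is set to true, the generated query plan has two MR jobs. In the first job, the map output is distributed randomly to the reducers, and each reducer performs a partial aggregation and outputs the result. Rows with the same Group By key may therefore go to different reducers, which balances the load.
The second MR job then distributes the pre-aggregated results to reducers by Group By key (which guarantees that the same Group By key reaches the same reducer) and completes the final aggregation.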
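
The two-job plan can be sketched in plain Java as a count aggregation in two stages. This is only an illustration with hypothetical names: round-robin assignment stands in for the random distribution of the first job, and in-memory maps stand in for reducers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation, not Hive code: two-stage aggregation for a skewed key.
class SkewedGroupBySketch {
    // Stage 1: spread rows over reducers regardless of key (round-robin here,
    // standing in for random distribution) and pre-aggregate on each reducer.
    static List<Map<String, Long>> partialCounts(List<String> keys, int reducers) {
        List<Map<String, Long>> partials = new ArrayList<>();
        for (int r = 0; r < reducers; r++) {
            partials.add(new HashMap<>());
        }
        for (int i = 0; i < keys.size(); i++) {
            partials.get(i % reducers).merge(keys.get(i), 1L, Long::sum);
        }
        return partials;
    }

    // Stage 2: merge the partial results by key so each key is finalized exactly once.
    static Map<String, Long> finalCounts(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, count) -> totals.merge(key, count, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>(Collections.nCopies(9, "hot"));
        keys.add("cold");
        List<Map<String, Long>> partials = partialCounts(keys, 3);
        System.out.println(partials);              // the hot key is split across 3 reducers
        System.out.println(finalCounts(partials)); // totals: hot=9, cold=1
    }
}
```

The second stage only sees one pre-aggregated record per key and reducer, so a single hot key no longer overloads one reducer with raw rows.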

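8.3.4 Count(Distinct) Deduplicated Counting

With small data this does not matter. With large data, COUNT DISTINCT is a full aggregation, so even if the number of reduce tasks is set (set mapred.reduce.tasks=100;), Hive starts only one reducer.
That single reducer then has too much data to process and the job struggles to finish. COUNT DISTINCT is therefore usually replaced with GROUP BY followed by COUNT: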

# 1. Create a large table
hive (default)> create table bigtable(id bigint, t bigint, uid string, keyword
string, url_rank int, click_num int, click_url string) row format delimited
fields terminated by '\t';

# 2. Load the data
hive (default)> load data local inpath '/opt/module/datas/bigtable' into table
 bigtable;

# 3. Set the number of reducers to 5
set mapreduce.job.reduces = 5;

# 4. Count distinct ids
hive (default)> select count(distinct id) from bigtable;
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.12 sec   HDFS Read: 120741990 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 120 msec
OK
c0
100001
Time taken: 23.607 seconds, Fetched: 1 row(s)

# 5. Deduplicate ids with GROUP BY
hive (default)> select count(id) from (select id from bigtable group by id) a;
Stage-Stage-1: Map: 1  Reduce: 5   Cumulative CPU: 17.53 sec   HDFS Read: 120752703 HDFS Write: 580 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 4.29 sec   HDFS Read: 9409 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 21 seconds 820 msec
OK
_c0
100001
Time taken: 50.795 seconds, Fetched: 1 row(s)

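Although this takes one more job, it is well worth it when the data volume is large.

8.3.5 Cartesian Product

Avoid Cartesian products where possible. When a join has no on condition or an invalid on condition, Hive can use only one reducer to compute the Cartesian product.

8.3.6 Row and Column Filtering

Column handling: in SELECT, fetch only the columns you need, use partition filters whenever partitions exist, and avoid SELECT *.
Row handling: with partition pruning in an outer join, if the filter on the secondary table is written in the WHERE clause, the full tables are joined first and filtered afterwards. For example: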

# 1. Join the two tables first, then filter with a where condition
hive (default)> select o.id from bigtable b
join ori o on o.id = b.id
where o.id <= 10;
Time taken: 34.406 seconds, Fetched: 100 row(s)

# 2. Filter in a subquery first, then join
hive (default)> select b.id from bigtable b
join (select id from ori where id <= 10 ) o on b.id = o.id;
Time taken: 30.058 seconds, Fetched: 100 row(s)

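8.3.7 Dynamic Partition Tuning

In a relational database, when inserting into a partitioned table, the database automatically places each row in the right partition according to the partition column value. Hive offers a similar mechanism, dynamic partitioning (Dynamic Partition), but it requires some configuration.

Parameters for dynamic partitioning

Enable dynamic partitioning (default true, enabled): hive.exec.dynamic.partition=true
Set non-strict mode (the dynamic partition mode defaults to strict, which requires at least one partition to be static; nonstrict allows all partition columns to be dynamic): hive.exec.dynamic.partition.mode=nonstrict
Maximum number of dynamic partitions that can be created in total across all nodes running MR (default 1000): hive.exec.max.dynamic.partitions=1000
Maximum number of dynamic partitions that can be created on each node running MR. Set this according to the actual data. For example, if the source data covers one year, so the day field has 365 distinct values, this parameter must be greater than 365; the default of 100 would cause an error: hive.exec.max.dynamic.partitions.pernode=100
Maximum number of HDFS files that can be created in the whole MR job (default 100000): hive.exec.max.created.files=100000
Whether to throw an exception when an empty partition is generated; this usually does not need to be set (default false): hive.error.on.empty.partition=false

Hands-on example

Requirement: insert the data of the dept table into the matching partitions of the target table dept_partition according to region (the loc field).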

# (1) Create the target partitioned table
hive (default)> create table dept_partition(id int, name string) partitioned
by (location int) row format delimited fields terminated by '\t';
# (2) Enable dynamic partitioning in non-strict mode
set hive.exec.dynamic.partition.mode = nonstrict;
hive (default)> insert into table dept_partition partition(location) select deptno, dname, loc from dept;
# (3) Check the partitions of the target table
hive (default)> show partitions dept_partition;

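8.4 Setting a Reasonable Number of Maps and Reduces

Normally a job creates one or more map tasks from the input directory. The main factors are the total number of input files, the input file sizes, and the block size configured for the cluster.
Is a larger number of maps always better? No. If a job has many small files (far smaller than the 128 MB block size), each small file is treated as a block and handled by its own map task. When a map task takes far longer to start and initialize than to run its logic, a lot of resources are wasted. The number of maps that can run at the same time is also limited.
Is it enough to ensure that each map handles a file block close to 128 MB? Not necessarily. Take a 127 MB file, which would normally be handled by one map. If the file has only one or two small fields but tens of millions of records, and the map logic is fairly complex, doing it all in one map task will certainly be slow.

8.4.1 Increasing the Number of Maps for Complex Files

When input files are large, the task logic is complex, and maps run very slowly, consider increasing the number of maps so that each map processes less data and the task runs more efficiently.
To add maps, use the formula computeSplitSize(Math.max(minSize, Math.min(maxSize, blocksize))) = blocksize = 128M and adjust maxSize. Setting maxSize below blocksize increases the number of maps.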
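
The formula can be tried out with a few lines of plain Java. This is a minimal sketch with hypothetical names: computeSplitSize mirrors the max/min expression above, and estimateSplits is a rough ceiling division that ignores details such as the slack Hadoop allows on the last split or the file combining done by CombineHiveInputFormat.

```java
// Hypothetical helper, not a Hadoop API: evaluates the split size formula.
class SplitSizeSketch {
    // Math.max(minSize, Math.min(maxSize, blockSize)), as in the formula above.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Rough estimate of the number of splits for one file (ceiling division).
    static long estimateSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;
        // With a very large maxSize the split size equals the block size.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE)); // prints 134217728
        // Lowering maxSize below the block size shrinks the split, so more maps are needed.
        System.out.println(computeSplitSize(blockSize, 1L, 100L)); // prints 100
    }
}
```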

# 1. Run the query
hive (default)> select count(*) from emp;
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
# 2. Set the maximum split size to 100 bytes
hive (default)> set mapreduce.input.fileinputformat.split.maxsize=100;
hive (default)> select count(*) from emp;
Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 1

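8.4.2 Merging Small Files

# (1) Merge small files before the map phase to reduce the number of maps: CombineHiveInputFormat merges small files (it is the system default format). HiveInputFormat does not merge small files.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
# (2) Settings for merging small files at the end of a Map-Reduce job:
# Merge small files at the end of a map-only job (default true)
SET hive.merge.mapfiles = true;
# Merge small files at the end of a map-reduce job (default false)
SET hive.merge.mapredfiles = true;
# Size of the merged files (default 256 MB)
SET hive.merge.size.per.task = 268435456;
# When the average output file size is below this value, start a separate map-reduce job to merge the files
SET hive.merge.smallfiles.avgsize = 16777216;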

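8.4.3 Setting a Reasonable Number of Reduces

# 1. Method one for adjusting the number of reducers
# (1) Data volume processed by each reducer (default 256 MB)
hive.exec.reducers.bytes.per.reducer=256000000
# (2) Maximum number of reducers per job (default 1009)
hive.exec.reducers.max=1009
# (3) Formula for the number of reducers
N=min(parameter 2, total input size / parameter 1)
# 2. Method two for adjusting the number of reducers
# Change it in Hadoop's mapred-default.xml
# Set the number of reducers for each job
set mapreduce.job.reduces = 15;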
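
As a worked example of the formula N = min(parameter 2, total input size / parameter 1), here is a minimal plain-Java sketch. The names are hypothetical, and two details are assumptions not stated in the formula: the division is rounded up, and at least one reducer is always returned.

```java
// Hypothetical helper, not a Hive API: estimates the reducer count from the formula.
class ReducerCountSketch {
    static int estimateReducers(long totalInputBytes, long bytesPerReducer, int maxReducers) {
        // Assumption: round the division up so a partial chunk still gets a reducer.
        long byData = (totalInputBytes + bytesPerReducer - 1) / bytesPerReducer;
        long capped = Math.min(maxReducers, byData);
        // Assumption: never go below one reducer.
        return (int) Math.max(1, capped);
    }

    public static void main(String[] args) {
        // 1 GB of input at 256,000,000 bytes per reducer
        System.out.println(estimateReducers(1_000_000_000L, 256_000_000L, 1009)); // prints 4
        // 1 TB of input is capped by hive.exec.reducers.max
        System.out.println(estimateReducers(1_000_000_000_000L, 256_000_000L, 1009)); // prints 1009
    }
}
```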

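More reducers is not always better

Starting and initializing too many reducers also costs time and resources.
In addition, there are as many output files as reducers. If many small files are generated and then used as input to the next task, the small-file problem appears again.
When setting the number of reducers, keep two principles in mind: use a suitable number of reducers for large data volumes, and keep the data volume handled by a single reducer appropriate.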

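8.5 Parallel Execution

Hive converts a query into one or more stages, such as a MapReduce stage, a sampling stage, a merge stage, a limit stage, or other stages Hive may need during execution. By default, Hive executes only one stage at a time. A particular job may contain many stages, though, and these stages may not all depend on each other, which means some can run in parallel and shorten the job's total execution time. The more stages that can run in parallel, the sooner the job is likely to finish.
Setting hive.exec.parallel to true enables concurrent execution. On a shared cluster, note that cluster utilization rises as a job runs more stages in parallel.

-- Enable parallel execution of tasks
set hive.exec.parallel=true;
-- Maximum degree of parallelism allowed for one SQL statement (default 8)
set hive.exec.parallel.thread.number=16;

Of course, this only helps when system resources are relatively idle. Without spare resources, parallel stages cannot start.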

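8.6 Strict Mode

Hive provides a strict mode that prevents users from running queries that may have unintended harmful effects. The property hive.mapred.mode defaults to nonstrict. To enable strict mode, set hive.mapred.mode to strict. Strict mode forbids three types of queries.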

<property>
    <name>hive.mapred.mode</name>
    <value>strict</value>
    <description>
      The mode in which the Hive operations are being performed. 
      In strict mode, some risky queries are not allowed to run. They include:
        Cartesian Product.
        No partition being picked up for a query.
        Comparing bigints and strings.
        Comparing bigints and doubles.
        Orderby without limit.
	</description>
</property>

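For partitioned tables, a query is not allowed to run unless the where clause contains a filter on a partition column to limit the range. In other words, users may not scan all partitions. The reason is that partitioned tables usually hold very large data sets that grow quickly, and a query without a partition filter could consume an unacceptable amount of resources.
For queries that use order by, a limit clause is required. To perform the sort, order by sends all result rows to a single reducer, so forcing users to add LIMIT prevents that reducer from running for a very long time.
Cartesian product queries are restricted. Users who know relational databases well may expect to write a JOIN without an ON clause and use a where clause instead, since a relational optimizer can efficiently turn the WHERE clause into the ON clause. Unfortunately, Hive does not perform this optimization, so if the tables are large enough the query gets out of control.

8.7 JVM Reuse

JVM reuse is a Hadoop tuning parameter that has a large impact on Hive performance, especially in scenarios where small files are hard to avoid or there are very many tasks, most of which run only briefly.
By default Hadoop typically forks a new JVM for each map and reduce task. JVM startup can then add considerable overhead, especially when a job contains hundreds or thousands of tasks. JVM reuse lets a JVM instance be reused N times within the same job. N is configured in Hadoop's mapred-site.xml and is usually between 10 and 20; the exact value should be determined by testing against the specific workload.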

<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit. 
  </description>
</property>

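The drawback is that JVM reuse keeps the task slots it has used occupied for reuse until the job completes. If an "unbalanced" job has a few reduce tasks that take much longer than the others, the reserved slots sit idle and cannot be used by other jobs until all tasks have finished.

8.8 Speculative Execution

In a distributed cluster, program bugs (including bugs in Hadoop itself), unbalanced load, or uneven resource distribution can make the tasks of one job run at different speeds. Some tasks may be noticeably slower than others (for example, one task of a job is only 50% done while all other tasks have finished), and such tasks slow down the whole job. To avoid this, Hadoop uses speculative execution (Speculative Execution): it identifies the straggling tasks according to certain rules and starts a backup task for each of them. The backup task processes the same data at the same time as the original task, and the result of whichever finishes successfully first is used as the final result.
Speculative execution is configured in Hadoop's mapred-site.xml and defaults to true.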

<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some map tasks 
               may be executed in parallel.</description>
</property>

<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some reduce tasks 
               may be executed in parallel.</description>
</property>

Hive itself also provides a setting to control reduce-side speculative execution (default true):

 <property>
    <name>hive.mapred.reduce.tasks.speculative.execution</name>
    <value>true</value>
    <description>Whether speculative execution for reducers should be turned on. </description>
  </property>

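It is hard to give concrete advice on tuning these speculative execution settings. If users are very sensitive to runtime deviations, these features can be turned off. If map or reduce tasks run for a long time because the input is very large, the waste caused by speculative execution is enormous.

8.9 Execution Plan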

EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query;

# (1) View the execution plan of the following statements
hive (default)> explain select * from emp;
hive (default)> explain select deptno, avg(sal) avg_sal from emp group by deptno;
# (2) View the detailed execution plan
hive (default)> explain extended select * from emp;
hive (default)> explain extended select deptno, avg(sal) avg_sal from emp group by deptno;

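9. Hive in Practice: Gulivideo

9.1 Preparation

Video table

Field	Note	Description
video id	Unique video id (String)	11-character string
uploader	Video uploader (String)	User name of the uploader (String)
age	Video age (int)	Whole number of days the video has been on the platform
category	Video category (Array)	Categories assigned to the video at upload
length	Video length (Int)	Video length as an integer
views	View count (Int)	Number of times the video was viewed
rate	Video rating (Double)	Maximum score is 5
Ratings	Traffic (Int)	Traffic of the video, an integer
comments	Comment count (Int)	Number of comments on the video, an integer

User table

Field	Note	Type
uploader	Uploader user name	string
videos	Number of uploaded videos	int
friends	Number of friends	int

Raw data preparation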

[omm@bigdata01 datas]$ ls user
user.txt
[omm@bigdata01 datas]$ ls video/
0.txt  1.txt  2.txt  3.txt  4.txt
[omm@bigdata01 datas]$ head -3 user/user.txt 
barelypolitical	151	5106
bonk65	89	144
camelcars	26	674
[omm@bigdata01 datas]$ head -1 video/0.txt 
LKh7zAJ4nwo	TheReceptionist	653	Entertainment	424	13021	4.34	1305	744	DjdA-5oKYFQ	NxTDlnOuybo   c-8VuICzXtU	DH56yrIO5nI	W1Uo5DQTtzc	E-3zXq_r4w0	1TCeoRPg5dE	yAr26YhuYNY	2ZgXx72XmoE	-7ClGo-YgZ0   vmdPOOd6cxI	KRHfMQqSHpk	pIMpORZthYw	1tUDzOp10pk	heqocRij5P0	_XIuvoH6rUg	LGVU5DsezE0	uO2kj6_D8B4   xiDqywcDQRM	uX81lMev6_o
[omm@bigdata01 datas]$ hdfs dfs -mkdir -p /gulivideo/user
[omm@bigdata01 datas]$ hdfs dfs -mkdir -p /gulivideo/video
[omm@bigdata01 datas]$ hdfs dfs -put user/* /gulivideo/user/
[omm@bigdata01 datas]$ hdfs dfs -put video/* /gulivideo/video/
[omm@bigdata01 datas]$

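Raw data ETL

Looking at the raw data, a video can belong to multiple categories, separated by the & character with a space on each side. A video can also have multiple related videos, separated by "\t".
To make it easier to work with fields that hold multiple sub-elements during analysis, we first restructure and clean the data.
That is: separate all categories with "&" and strip the surrounding spaces, and join the related video ids with "&" as well.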

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-api</artifactId>
    <version>3.1.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-runtime</artifactId>
    <version>3.1.3</version>
</dependency>

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

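public class VideoMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    // Reused for every record so that map() does not allocate new objects each time
    private final StringBuilder sb = new StringBuilder();
    private final Text result = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");

        // A usable record has at least the 9 fixed fields; shorter lines are dropped
        if (fields.length < 9)
            return;

        // Category field: remove the spaces around "&", e.g. "A & B" becomes "A&B"
        fields[3] = fields[3].replace(" ", "");
        sb.setLength(0);
        for (int i = 0; i < fields.length; i++) {
            if (i == fields.length - 1)
                sb.append(fields[i]);               // last field: no trailing separator
            else if (i <= 8)
                sb.append(fields[i]).append("\t");  // fixed fields stay tab-separated
            else
                sb.append(fields[i]).append("&");   // related video ids are joined with "&"
        }
        result.set(sb.toString());
        context.write(result, NullWritable.get());
    }
}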

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class VideoDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance();

        job.setJarByClass(VideoDriver.class);

        job.setMapperClass(VideoMapper.class);
        job.setNumReduceTasks(0);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

[omm@bigdata01 lib]$ yarn jar etl-1.0-SNAPSHOT.jar com.simwor.etl.VideoDriver /gulivideo/video /gulivideo/video_etl
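
Before submitting the job to the cluster, it can help to check the cleaning rule on a single line. The following is a minimal plain-Java sketch of the same rule with no Hadoop dependency. The class name VideoLineCleaner, the method clean, and the sample record are made up for illustration and are not part of the original project.

```java
import java.util.StringJoiner;

/** Plain-Java sketch of the cleaning rule applied by VideoMapper (hypothetical helper). */
final class VideoLineCleaner {

    /** Returns the cleaned line, or null when the record has fewer than 9 fields. */
    static String clean(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 9) {
            return null; // incomplete record, dropped just like in the mapper
        }
        // Category field: remove the spaces around "&"
        fields[3] = fields[3].replace(" ", "");

        // The first 9 fields stay tab-separated
        StringJoiner head = new StringJoiner("\t");
        for (int i = 0; i < 9; i++) {
            head.add(fields[i]);
        }
        if (fields.length == 9) {
            return head.toString(); // the video has no related videos
        }
        // All related video ids are joined with "&" into a single tenth field
        StringJoiner related = new StringJoiner("&");
        for (int i = 9; i < fields.length; i++) {
            related.add(fields[i]);
        }
        return head + "\t" + related;
    }

    public static void main(String[] args) {
        // Made-up sample record; the values are illustrative only
        String raw = "vid00000001\tsomeUser\t10\tCatA & CatB\t120\t300\t4.5\t20\t7\trelA\trelB\trelC";
        System.out.println(clean(raw));
    }
}
```

After cleaning, every line has at most 10 tab-separated fields, which is what lets the table definitions below use "\t" for fields and "&" for collection items.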

Loading the ETL output into tables

-- video table
create external table video_ori(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
location '/gulivideo/video_etl';

-- user table
create external table user_ori(
    uploader string,
    videos int,
    friends int)
row format delimited fields terminated by "\t" 
location '/gulivideo/user';

-- video_orc table
create table video_orc(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
stored as orc
tblproperties("orc.compress"="SNAPPY");

-- user_orc table
create table user_orc(
    uploader string,
    videos int,
    friends int)
stored as orc
tblproperties("orc.compress"="SNAPPY");

-- insert the data from the external tables
insert into table video_orc select * from video_ori;
insert into table user_orc select * from user_ori;

9.2 Business analysis

9.2.1 Top 10 videos by view count

SELECT
    videoid,
    views
FROM
    video_orc
ORDER BY
    views DESC
LIMIT 10;

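9.2.2 Top 10 video categories by popularity

Define category popularity (assumed here to be the number of videos in the category)
Explode the categories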

  SELECT
      videoid,
      cate
  FROM
      video_orc LATERAL VIEW explode(category) tbl as cate;

Taking the result above as t1, count how many videos each category has, sort, and take the top 10

SELECT
    cate,
    COUNT(videoid) n
FROM
    t1
GROUP BY
    cate
ORDER BY
    n desc limit 10;
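
To make the explode-then-count step more concrete, here is a minimal plain-Java sketch of what the two queries above compute together. The class name CategoryHotness, the method topCategories, and the sample data are assumptions made for illustration. It only mirrors the semantics of the queries and says nothing about how Hive actually executes them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory version of "explode the categories, then count per category". */
final class CategoryHotness {

    /** Returns up to `limit` (category, count) entries, highest count first. */
    static List<Map.Entry<String, Integer>> topCategories(Map<String, List<String>> videoCategories, int limit) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> video : videoCategories.entrySet()) {
            // LATERAL VIEW explode(category): one (videoid, cate) row per array element
            for (String cate : video.getValue()) {
                // GROUP BY cate with COUNT(videoid)
                counts.merge(cate, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        // ORDER BY n DESC
        rows.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        // LIMIT
        return new ArrayList<>(rows.subList(0, Math.min(limit, rows.size())));
    }

    public static void main(String[] args) {
        // Made-up sample: video id mapped to its category array
        Map<String, List<String>> videos = new LinkedHashMap<>();
        videos.put("v1", Arrays.asList("Music", "Comedy"));
        videos.put("v2", Arrays.asList("Music"));
        videos.put("v3", Arrays.asList("Comedy", "Entertainment"));
        videos.put("v4", Arrays.asList("Music"));

        for (Map.Entry<String, Integer> row : topCategories(videos, 2)) {
            System.out.println(row.getKey() + "\t" + row.getValue());
        }
    }
}
```

A video with two categories contributes one row to each of them after the explode, so the counts add up to more than the number of videos.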

9.2.3 Categories of the 20 most-viewed videos, and how many Top 20 videos each category contains

Find the top 20 videos and their categories

SELECT
    videoid,
    views,
    category
FROM
    video_orc
ORDER BY
    views DESC
LIMIT 20;

Explode the categories (t1 is the result of the previous query)

SELECT
    videoid,
    cate
FROM
    t1 LATERAL VIEW explode(category) tbl as cate;

Count the videos per category (t2 is the result of the previous query)

SELECT
    cate,
    COUNT(videoid) n
FROM
    t2
GROUP BY
    cate
ORDER BY
    n DESC;

9.2.4 Ranking the categories of the videos related to the Top 50 most-viewed videos

Find the related videos of the 50 most-viewed videos

SELECT
    videoid,
    views,
    relatedid
FROM
    video_orc
ORDER BY
    views DESC
LIMIT 50;

Explode the related videos (t1 is the result of the previous query)

SELECT
    explode(relatedid) videoid
FROM
    t1;

Join with the original table to get the categories of the related videos (t2 is the result of the previous query)

SELECT
    DISTINCT t2.videoid,
    v.category
FROM
    t2
JOIN video_orc v on
    t2.videoid = v.videoid;

Explode the categories (t3 is the result of the previous query)

SELECT
    explode(category) cate
FROM
    t3;

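Join with the category popularity table (t5, the per-category video counts as computed in 9.2.2) and sort; t4 is the result of the previous query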

SELECT
    DISTINCT t4.cate,
    t5.n
FROM
    t4
JOIN t5 ON
    t4.cate = t5.cate
ORDER BY
    t5.n DESC;

9.2.5 Top 10 videos by popularity in each category, using Music as an example

Explode the categories of the video table into an intermediate table, video_category

CREATE TABLE video_category
STORED AS orc
TBLPROPERTIES("orc.compress"="SNAPPY")
AS
SELECT
    videoid,
    uploader,
    age,
    cate,
    length,
    views,
    rate,
    ratings,
    comments,
    relatedid
FROM
    video_orc LATERAL VIEW explode(category) tbl as cate;

Query the top 10 Music videos directly from video_category

SELECT
    videoid,
    views
FROM
    video_category
WHERE
    cate ="Music"
ORDER BY
    views DESC
LIMIT 10;

9.2.6 Top 10 videos by traffic in each category, using Music as an example

Query the top 10 Music videos by traffic directly from video_category

SELECT
    videoid,
    ratings
FROM
    video_category
WHERE
    cate ="Music"
ORDER BY
    ratings DESC
LIMIT 10;

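9.2.7 Top 10 users by number of uploads, and their videos that rank in the top 20 by view count

Interpretation 1: the top 20 videos of each of the top 10 users

Find the top 10 users by number of uploaded videos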

SELECT
    uploader,
    videos
FROM
    user_orc
ORDER BY
    videos DESC
LIMIT 10;

Join with video_orc to find the videos uploaded by these users and rank them by popularity (t1 is the result of the previous query)

SELECT
    t1.uploader,
    v.videoid,
    RANK() OVER(PARTITION BY t1.uploader ORDER BY v.views DESC) hot
FROM
    t1
LEFT JOIN video_orc v ON
    t1.uploader = v.uploader;

Take each user's top 20 (t2 is the result of the previous query)

SELECT
    t2.uploader,
    t2.videoid,
    t2.hot
FROM
    t2
WHERE
    hot <= 20;

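Interpretation 2: videos in the overall top 20 that were uploaded by the top 10 users

Find the top 10 users by number of uploaded videos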

SELECT
    uploader,
    videos
FROM
    user_orc
ORDER BY
    videos DESC
LIMIT 10;

Find the 20 most-viewed videos

SELECT
    videoid,
    uploader,
    views
FROM
    video_orc
ORDER BY
    views DESC
LIMIT 20;

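Join the two results (t1: the top 10 users, t2: the top 20 videos) to see whether any of the videos were uploaded by those users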

SELECT
    t1.uploader,
    t2.videoid
FROM
    t1
LEFT JOIN t2 ON
    t1.uploader = t2.uploader;

9.2.8 Top 10 videos by view count in each category

Rank the videos of each category by view count, using the video_category table

SELECT
    cate,
    videoid,
    views,
    RANK() OVER(PARTITION BY cate ORDER BY views DESC) hot
FROM
    video_category;

Take the top 10 of each category (t1 is the result of the previous query)

SELECT
    cate,
    videoid,
    views
FROM
    t1
WHERE
    hot <= 10;
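
The ranking step is easier to reason about with a small in-memory model. Below is a minimal plain-Java sketch of RANK() OVER(PARTITION BY cate ORDER BY views DESC) followed by the filter on hot. The class TopNPerCategory, its nested Row class, and the sample rows are hypothetical names and data introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory version of "rank within each category, keep the top N". */
final class TopNPerCategory {

    /** One row of video_category, reduced to the columns the query uses. */
    static final class Row {
        final String cate;
        final String videoId;
        final int views;
        int hot; // rank within the category, filled in by topN()

        Row(String cate, String videoId, int views) {
            this.cate = cate;
            this.videoId = videoId;
            this.views = views;
        }
    }

    static List<Row> topN(List<Row> rows, int n) {
        // PARTITION BY cate
        Map<String, List<Row>> partitions = new LinkedHashMap<>();
        for (Row r : rows) {
            partitions.computeIfAbsent(r.cate, k -> new ArrayList<>()).add(r);
        }
        List<Row> result = new ArrayList<>();
        for (List<Row> partition : partitions.values()) {
            // ORDER BY views DESC
            partition.sort(Comparator.comparingInt((Row r) -> r.views).reversed());
            for (int i = 0; i < partition.size(); i++) {
                Row current = partition.get(i);
                // RANK(): tied rows share a rank, and the next distinct value skips ahead
                boolean tiedWithPrevious = i > 0 && current.views == partition.get(i - 1).views;
                current.hot = tiedWithPrevious ? partition.get(i - 1).hot : i + 1;
                // WHERE hot <= n
                if (current.hot <= n) {
                    result.add(current);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows; "a" and "b" are tied on views
        List<Row> rows = Arrays.asList(
                new Row("Music", "a", 100),
                new Row("Music", "b", 100),
                new Row("Music", "c", 90),
                new Row("Comedy", "e", 50));
        for (Row r : topN(rows, 2)) {
            System.out.println(r.cate + "\t" + r.videoId + "\t" + r.views + "\t" + r.hot);
        }
    }
}
```

In the sample, "a" and "b" both get rank 1 and "c" gets rank 3, so filtering on hot <= 2 keeps two Music rows and drops "c". With ties at the boundary, a RANK()-based filter can return more than N rows per category. If exactly N rows are required, ROW_NUMBER() is the usual alternative.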

10. Permission management

10.1 Permission control

Enabling

To use Hive's authorization mechanism, two parameters must be set in hive-site.xml:

<property>
	<name>hive.security.authorization.enabled</name>
	<value>true</value>
	<description>enable or disable the hive client authorization</description>
</property>

<property>
	<name>hive.security.authorization.createtable.owner.grants</name>
	<value>ALL</value>
	<description>the privileges automatically granted to the owner whenever a table gets created. An example like "select,drop" will grant select and drop privilege to the owner of the table</description>
</property>

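Notes

Hive currently supports simple permission management. It is disabled by default, in which case all users have the same privileges;
Hive permission management can be based on metadata, or on the file storage level;
The core concepts of Hive authorization are users, groups, and roles. Users and groups are those of the Linux machine, whereas roles have to be created by yourself;
A role in Hive can be understood as a collection of users, groups, or roles that share some common "attributes"; a role can itself be a collection of roles.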
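
The idea that a role can contain other roles can be pictured as a small graph. The sketch below is a toy model in plain Java and not Hive's implementation. The class RoleModel and its methods are hypothetical, and it assumes that a principal inherits the privileges of every role it holds, directly or through nested roles.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of nested roles (hypothetical; only illustrates the concept). */
final class RoleModel {
    // role -> privileges granted directly to that role
    private final Map<String, Set<String>> privileges = new HashMap<>();
    // grantee (user or role) -> roles granted to it
    private final Map<String, Set<String>> memberships = new HashMap<>();

    void grantPrivilege(String role, String privilege) {
        privileges.computeIfAbsent(role, k -> new HashSet<>()).add(privilege);
    }

    void grantRole(String role, String grantee) {
        memberships.computeIfAbsent(grantee, k -> new HashSet<>()).add(role);
    }

    /** Collects the privileges of every role reachable from the principal. */
    Set<String> effectivePrivileges(String principal) {
        Set<String> result = new TreeSet<>();
        collect(principal, new HashSet<>(), result);
        return result;
    }

    private void collect(String principal, Set<String> visited, Set<String> result) {
        if (!visited.add(principal)) {
            return; // already handled; also protects against cyclic grants
        }
        result.addAll(privileges.getOrDefault(principal, Collections.emptySet()));
        for (String role : memberships.getOrDefault(principal, Collections.emptySet())) {
            collect(role, visited, result);
        }
    }

    public static void main(String[] args) {
        RoleModel model = new RoleModel();
        model.grantPrivilege("analyst", "SELECT");
        model.grantPrivilege("writer", "INSERT");
        model.grantRole("analyst", "writer"); // the writer role includes the analyst role
        model.grantRole("writer", "alice");   // the user alice holds the writer role
        System.out.println(model.effectivePrivileges("alice"));
    }
}
```

For simplicity the model keeps users and roles in one namespace and ignores groups and the objects (databases, tables) that privileges are granted on.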

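Privilege types

SELECT privilege: grants read access to an object.
INSERT privilege: allows adding data to an object (table).
UPDATE privilege: allows running update queries on an object (table).
DELETE privilege: allows deleting data from an object (table).
ALL PRIVILEGES: grants all privileges (expands to all of the privileges above).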

10.2 Role management

-- create and drop a role
create role role_name;
drop role role_name;

-- list all roles
show roles;

-- grant privileges to a role
grant select on database db_name to role role_name;
grant select on [table] t_name to role role_name;

-- show the privileges of a role
show grant role role_name on database db_name;
show grant role role_name on [table] t_name;

-- grant a role to a user
grant role role_name to user user_name;

-- revoke privileges from a role
revoke select on database db_name from role role_name;
revoke select on [table] t_name from role role_name;

-- list all roles of a user
show role grant user user_name;