Apache Hive Programming Guide - 蒋中洲

Apache Hive

Overview

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extract-transform-load (ETL) work, and a mechanism for storing, querying, and analyzing large-scale data kept in Hadoop. Hive query execution strictly follows the Hadoop MapReduce job execution model: Hive translates the user's HiveQL statements into MapReduce jobs and submits them to the Hadoop cluster; Hadoop monitors job execution and returns the results to the user. Hive is not designed for online transaction processing and does not provide real-time queries or row-level updates. Hive is best suited for batch jobs over large data sets, such as web log analysis.
[Figure: Hive structure]
Hive is a data warehousing tool based on Hadoop. It maps structured data files onto database tables and provides full SQL querying by translating SQL statements into MapReduce jobs. Its main advantage is the low learning curve: simple MapReduce-style statistics can be expressed in SQL-like statements without developing a dedicated MapReduce application, which makes it well suited to statistical analysis in a data warehouse. Hive defines a simple SQL-like query language called HQL that lets users familiar with SQL query the data; it also lets developers familiar with MapReduce plug in custom Mappers and Reducers for complex analyses that the built-in operators cannot handle.

Architecture

Hive offers a SQL-style command interface that lets users run offline analyses using its SQL-like Hive Query Language. Unlike a traditional database, however, Hive is only a SQL translation layer built on Hadoop MapReduce: it neither manages nor stores the data itself, and data is actually read only when a job runs.
[Figure: Hive execution flow]
Overall, Hive has the following characteristics:

  • Hive and relational databases store their files differently: Hive uses Hadoop's HDFS, while a relational database uses the server's local file system.

  • Hive uses MapReduce as its computation model, whereas relational databases implement their own execution engines.

  • Relational databases are designed for OLTP; Hive is designed for mining massive data sets and has poor real-time performance, which gives the two very different application scenarios.

  • Hive scales its storage and compute capacity easily, a property inherited from Hadoop; relational databases are much weaker in this respect.

Hive Installation

Because Hive stores table metadata, a production installation requires a MySQL database service to be installed first (note: the MySQL database must use the latin1 character set). Hive is also a tool built on top of Hadoop's storage and compute layers, so HDFS and MapReduce must be up and running before Hive is configured and installed. Finally, the Hive host must have the HADOOP_HOME environment variable set, so that Hive can discover the location and size of the user's Hadoop cluster.

MySQL Setup

[root@CentOS ~]# yum install -y mysql-server
[root@CentOS ~]# service mysqld start
[root@CentOS ~]# chkconfig mysqld on
[root@CentOS ~]# mysqladmin -uroot password 'root'
[root@CentOS ~]# mysql -u root -proot
mysql> use mysql
mysql> delete from user where password='';
mysql> grant all privileges on *.* to 'root'@'%' identified by 'root' with grant option;
mysql> flush privileges;
mysql> create database hive;
mysql> show databases;

Hadoop Setup

1. Configure etc/hadoop/core-site.xml

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://CentOS:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/hadoop-2.9.2/hadoop-${user.name}</value>
</property>

2. Configure etc/hadoop/hdfs-site.xml

<!-- Block replication factor -->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<!-- Host running the Secondary NameNode -->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>CentOS:50090</value>
</property>
<!-- Maximum number of files a DataNode can serve concurrently -->
<property>
        <name>dfs.datanode.max.xcievers</name>
        <value>4096</value>
</property>
<!-- DataNode handler thread count -->
<property>
        <name>dfs.datanode.handler.count</name>
        <value>6</value>
</property>

3. Configure etc/hadoop/mapred-site.xml

<!-- Resource manager implementation used by the MapReduce framework -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

4. Configure etc/hadoop/yarn-site.xml

<!-- Auxiliary shuffle service required by the MapReduce framework -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Host running the ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>CentOS</value>
</property>
<!-- Disable physical memory checks -->
<property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
</property>
<!-- Disable virtual memory checks -->
<property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
</property>

5. Set the HADOOP_HOME environment variable in .bashrc

HADOOP_HOME=/usr/hadoop-2.9.2
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export JAVA_HOME
export PATH
export HADOOP_HOME
export CLASSPATH

6. Start Hadoop's storage and compute services

[root@CentOS ~]# start-dfs.sh
[root@CentOS ~]# start-yarn.sh
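
If everything came up, jps should list the HDFS and YARN daemons (a quick sanity check; the process IDs below are illustrative and will differ on your machine):

[root@CentOS ~]# jps
2130 NameNode
2261 DataNode
2436 SecondaryNameNode
2689 ResourceManager
2793 NodeManager
2903 Jps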

Hive Setup

1. Extract the Hive archive into /usr

[root@CentOS ~]# tar -zxf apache-hive-1.2.2-bin.tar.gz -C /usr/
[root@CentOS ~]# cd /usr/apache-hive-1.2.2-bin/

2. Create conf/hive-site.xml with the following content

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://CentOS:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
  </property>

</configuration>

3. Copy the MySQL driver jar into the lib directory

MacBookPro:~ admin$ scp .m2/repository/mysql/mysql-connector-java/5.1.48/mysql-connector-java-5.1.48.jar root@CentOS:/usr/apache-hive-1.2.2-bin/lib
root@centos's password: 
mysql-connector-java-5.1.48.jar               100%  983KB 103.5MB/s   00:00  

4. Set the HIVE_HOME environment variable

HADOOP_HOME=/usr/hadoop-2.9.2
HIVE_HOME=/usr/apache-hive-1.2.2-bin
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin
CLASSPATH=.
export JAVA_HOME
export PATH
export HADOOP_HOME
export CLASSPATH
export HIVE_HOME

5. Start Hive

  • Single-user mode

    [root@CentOS ~]# hive
    
    Logging initialized using configuration in jar:file:/usr/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
    hive> show tables;
    OK
    Time taken: 0.726 seconds
    hive> 
    
    
  • Multi-user mode

    [root@CentOS apache-hive-1.2.2-bin]#  ./bin/hiveserver2  >/dev/null 2>&1 &
    [3] 26279
    
    [root@CentOS apache-hive-1.2.2-bin]# ./bin/beeline -u jdbc:hive2://CentOS:10000 -n root
    Connecting to jdbc:hive2://CentOS:10000
    Connected to: Apache Hive (version 1.2.2)
    Driver: Hive JDBC (version 1.2.2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Beeline version 1.2.2 by Apache Hive
    0: jdbc:hive2://CentOS:10000> 
    

Table Operations

Hive supports primitive and complex data types. The primitive types include the numeric types, BOOLEAN, strings, and timestamps; the complex types are arrays, maps, and structs. The tables below summarize Hive's data types:

Primitive Types

Type      | Description                             | Literal example
BOOLEAN   | boolean, true or false                  | true
TINYINT   | 1-byte signed integer, -128 to 127      | 1Y
SMALLINT  | 2-byte signed integer, -32768 to 32767  | 1S
INT       | 4-byte signed integer                   | 1
BIGINT    | 8-byte signed integer                   | 1L
FLOAT     | 4-byte single-precision floating point  | 1.0
DOUBLE    | 8-byte double-precision floating point  | 1.0
DECIMAL   | arbitrary-precision signed decimal      | 1.0
STRING    | variable-length string                  | 'Abc'
VARCHAR   | variable-length string                  | 'Abc'
CHAR      | fixed-length string                     | 'a', 'b'
BINARY    | byte array                              |
TIMESTAMP | timestamp with nanosecond precision     | 122327493795
DATE      | date                                    | '2020-01-01'

Complex Types

Type   | Description                                                      | Literal example
ARRAY  | ordered collection of elements of the same type                  | array(1,2)
MAP    | key-value pairs; keys must be primitive, values may be any type  | map('a',1,'b',2)
STRUCT | collection of named fields whose types may differ                | struct('1',1,1.0), named_struct('col1','1','col2',1,'col3',1.0)
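
The literal constructors in the table can be tried directly; recent Hive versions (0.13 and later) allow a SELECT without a FROM clause, so no table is needed for this quick check:

0: jdbc:hive2://CentOS:10000> select array(1,2), map('a',1,'b',2), named_struct('col1','1','col2',1,'col3',1.0);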

Worked Example

1. Connect to the Hive service and create the test database

[root@CentOS ~]# beeline -u jdbc:hive2://CentOS:10000 -n root
Connecting to jdbc:hive2://CentOS:10000
Connected to: Apache Hive (version 1.2.2)
Driver: Hive JDBC (version 1.2.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.2 by Apache Hive
0: jdbc:hive2://CentOS:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
+----------------+--+
1 row selected (0.051 seconds)
0: jdbc:hive2://CentOS:10000> create database test;
No rows affected (0.193 seconds)
0: jdbc:hive2://CentOS:10000> use test;
No rows affected (0.044 seconds)
0: jdbc:hive2://CentOS:10000> select current_database();
+-------+--+
|  _c0  |
+-------+--+
| test  |
+-------+--+

2. Create the t_employee table in the test database

create table if not exists t_employee (
  id int,
  name varchar(32),
  age int,
  salary double,
  birthDay date,
  hobbies array<string>,
  address struct<street:string,country:string>,
  detail map<string,double>
);

3. View the table definition (desc formatted <table_name> shows more detail)

0: jdbc:hive2://CentOS:10000> desc t_employee;
+-----------+---------------------------------------+----------+--+
| col_name  |               data_type               | comment  |
+-----------+---------------------------------------+----------+--+
| id        | int                                   |          |
| name      | varchar(32)                           |          |
| age       | int                                   |          |
| salary    | double                                |          |
| birthday  | date                                  |          |
| hobbies   | array<string>                         |          |
| address   | struct<street:string,country:string>  |          |
| detail    | map<string,double>                    |          |
+-----------+---------------------------------------+----------+--+
8 rows selected (0.076 seconds)

Data Formats

Default Delimiters

After the table is created, the next step is preparing data. Hive's default field delimiters are listed below, so the CREATE TABLE statement above is equivalent to the explicit form that follows:

Delimiter   | Description
\n          | For text files, each line is one record, so records are separated by \n
^A (Ctrl+A) | Separates columns; written as the octal escape '\001' in CREATE TABLE
^B (Ctrl+B) | Separates elements of an ARRAY or STRUCT and entries of a MAP; written as '\002'
^C (Ctrl+C) | Separates key and value within a MAP entry; written as '\003'

create table if not exists t_employee (
  id int,
  name varchar(32),
  age int,
  salary double,
  birthDay date,
  hobbies array<string>,
  address struct<street:string,country:string>,
  detail map<string,double>
)
row format delimited
fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003'
lines terminated by '\n'
stored as textfile;

Enter data in the following format and then load it into the target table:

1^Azhangsan^A17^A15000.0^A1990-12-13^Awatch tv^Bpaly game^Abeijing^Bchina^Alevel^C1^Bdept^C2
2^Alisi^A18^A10000.0^A1995-12-13^Awatch tv^Bpaly game^Ahanghai^Bchina^Alevel^C1^Bdept^C2
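
The ^A/^B/^C markers above stand for single control bytes, not two printable characters. One way to produce such a file from the shell is printf with octal escapes (a sketch; the output path /root/t_employee_data is hypothetical):

[root@CentOS ~]# printf '1\001zhangsan\00117\00115000.0\0011990-12-13\001watch tv\002paly game\001beijing\002china\001level\0031\002dept\0032\n' > /root/t_employee_data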

load data local inpath '<path>' overwrite into table <table>;

0: jdbc:hive2://CentOS:10000> select id,name,age,salary,address,hobbies from t_employee;
+-----+-----------+------+----------+-----------------------------------------+---------------------------+--+
| id  |   name    | age  |  salary  |                 address                 |          hobbies          |
+-----+-----------+------+----------+-----------------------------------------+---------------------------+--+
| 1   | zhangsan  | 17   | 15000.0  | {"street":"beijing","country":"china"}  | ["watch tv","paly game"]  |
| 2   | lisi      | 18   | 10000.0  | {"street":"hanghai","country":"china"}  | ["watch tv","paly game"]  |
+-----+-----------+------+----------+-----------------------------------------+---------------------------+--+
0: jdbc:hive2://CentOS:10000> select id,name,hobbies,detail from t_employee;
+-----+-----------+---------------------------+---------------------------+--+
| id  |   name    |          hobbies          |          detail           |
+-----+-----------+---------------------------+---------------------------+--+
| 1   | zhangsan  | ["watch tv","paly game"]  | {"level":1.0,"dept":2.0}  |
| 2   | lisi      | ["watch tv","paly game"]  | {"level":1.0,"dept":2.0}  |
+-----+-----------+---------------------------+---------------------------+--+
2 rows selected (0.053 seconds)
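
Elements of the complex columns can be addressed individually: array elements by index, map values by key, and struct fields with dot notation, for example:

0: jdbc:hive2://CentOS:10000> select name, hobbies[0], detail['level'], address.street from t_employee;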

Custom Delimiters

1,zhangsan,true,18,15000,TV|Game,001>建设|002>招商,china|bj 
2,lisi,true,28,15000,TV|Game,001>建设|002>招商,china|bj 
3,wangwu,false,38,5000,TV|Game,001>建设|002>招商,china|sh
create table if not exists t_user(
  id int,
  name string,
  sex boolean,
  age int,
  salary double,
  hobbies array<string>,
  card map<string,string>,
  address struct<country:string,city:string>
)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;
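
Assuming the three sample rows above are saved to a local file (the path /root/t_user is hypothetical), they can be loaded with:

0: jdbc:hive2://CentOS:10000> load data local inpath '/root/t_user' overwrite into table t_user;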

Regex-Formatted Data

192.168.0.1 qq com.xx.xx.XxxService#xx 2018-10-10 10:10:00
192.168.2.1 qq com.xx.xx.XxxService#xx 2018-10-10 10:10:00
192.168.0.1 xx com.xx.xx.XxxService#xx 2018-10-10 10:10:00
192.168.202.1 qq com.xx.xx.XxxService#xx 2018-10-10 10:10:00
create table if not exists t_access(
     ip string,
     app varchar(32),
     service string,
     last_time timestamp
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex"="^(.*)\\s(.*)\\s(.*)\\s(.*\\s.*)"
);
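
Each capture group in input.regex maps to one column in declaration order, so the four groups above populate ip, app, service, and last_time. Loading the sample lines (the path /root/t_access is hypothetical):

0: jdbc:hive2://CentOS:10000> load data local inpath '/root/t_access' overwrite into table t_access;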

CSV Files

1,apple,3,20.5
2,orange,2,21.5
3,pear,2,22.5
4,banana,1,23.0
CREATE TABLE if not exists t_product(
  id int,
  item string,
  count int,
  price double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = ",",
   "escapeChar"    = "\\"
);
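
Note that OpenCSVSerde exposes every column as STRING regardless of the declared type, so numeric filters may need an explicit cast. Loading the sample data (the path /root/t_product is hypothetical):

0: jdbc:hive2://CentOS:10000> load data local inpath '/root/t_product' overwrite into table t_product;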

JSON Files

{"id":1,"name":"zhangsan","sex":true,"register":"1991-02-08","score":100.0}
{"id":2,"name":"lisi","sex":true,"register":"1991-02-08","score":80.0}
create table if not exists t_student(
   id int,
   name varchar(32),
   sex boolean,
   register date,
   score double
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

The JsonSerDe lives in a separate jar that must be registered first: add jar /usr/apache-hive-1.2.2-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.2.jar. Use the delete jar command to unregister it.
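
A minimal session (the local path /root/t_student is hypothetical):

0: jdbc:hive2://CentOS:10000> add jar /usr/apache-hive-1.2.2-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.2.jar;
0: jdbc:hive2://CentOS:10000> load data local inpath '/root/t_student' overwrite into table t_student;
0: jdbc:hive2://CentOS:10000> select id,name,sex,register,score from t_student;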

Common Hive Operations

Hive DDL

Database
List databases
0: jdbc:hive2://CentOS:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| test           |
+----------------+--+
2 rows selected (0.441 seconds)
Use a database
0: jdbc:hive2://CentOS:10000> use test;
No rows affected (0.03 seconds)
Create a database
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name   -- DATABASE and SCHEMA are interchangeable
  [COMMENT database_comment] -- database comment
  [LOCATION hdfs_path] -- location on HDFS
  [WITH DBPROPERTIES (property_name=property_value, ...)]; -- extra properties
CREATE DATABASE IF NOT EXISTS hive_test COMMENT 'database for test' WITH DBPROPERTIES ('create'='jiangzz');
Describe a database
DESC DATABASE [EXTENDED] db_name; -- EXTENDED also shows the extra properties
0: jdbc:hive2://CentOS:10000> DESC DATABASE hive_test;
+------------+--------------------+------------------------------------------------------+-------------+-------------+-------------+--+
|  db_name   |      comment       |                       location                       | owner_name  | owner_type  | parameters  |
+------------+--------------------+------------------------------------------------------+-------------+-------------+-------------+--+
| hive_test  | database for test  | hdfs://CentOS:9000/user/hive/warehouse/hive_test.db  | root        | USER        |             |
+------------+--------------------+------------------------------------------------------+-------------+-------------+-------------+--+
1 row selected (0.051 seconds)
Drop a database
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

The default behavior is RESTRICT, which fails if the database still contains tables. To drop a database together with its tables, use CASCADE.

0: jdbc:hive2://CentOS:10000> drop database if exists test cascade;
No rows affected (0.018 seconds)
Show the current database
0: jdbc:hive2://CentOS:10000> select current_database();
+------------+--+
|    _c0     |
+------------+--+
| hive_test  |
+------------+--+
1 row selected (0.041 seconds)
Table
Create tables
  • Managed tables

Managed tables (also called internal tables) are marked MANAGED_TABLE. They are stored under /user/hive/warehouse by default, though a different path can be given with LOCATION. Dropping a managed table deletes both the data and the metadata.

create table if not exists t_user(
  id int,
  name string,
  sex boolean,
  age int,
  salary double,
  hobbies array<string>,
  card map<string,string>,
  address struct<country:string,city:string>
)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;
  • External tables

External tables are marked EXTERNAL_TABLE. The directory can be chosen at creation time with LOCATION, and dropping the table deletes only the metadata, never the data.

create external table if not exists t_access(
     ip string,
     app varchar(32),
     service string,
     last_time timestamp
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex"="^(.*)\\s(.*)\\s(.*)\\s(.*\\s.*)"
)
LOCATION '/hive/t_access';
  • Partitioned tables

A Hive table corresponds to a directory on HDFS; by default a query scans the whole table, which is very costly in both time and resources. A partition is a subdirectory of the table directory, and data is stored per partition in those subdirectories. If the WHERE clause contains a partition condition, only that partition's directory is scanned instead of the whole table, so a sensible partitioning scheme can improve query performance dramatically. Partitioning is not unique to Hive; it is in fact a very common concept. In Oracle, for example, as a table grows, queries slow down, and the table can likewise be partitioned: logically it remains one complete table, but its data is spread across multiple tablespaces (physical files), so queries no longer have to scan the entire table each time. In Hive, a partitioned table is created with the PARTITIONED BY clause; a table may have one or more partition columns, and a separate data directory is created for each distinct combination of partition-column values.

7369	SMITH	CLERK	7902	1980-12-17 00:00:00	800.00
7499	ALLEN	SALESMAN	7698	1981-02-20 00:00:00	1600.00	300.00
7521	WARD	SALESMAN	7698	1981-02-22 00:00:00	1250.00	500.00
7566	JONES	MANAGER	7839	1981-04-02 00:00:00	2975.00
7654	MARTIN	SALESMAN	7698	1981-09-28 00:00:00	1250.00	1400.00
7698	BLAKE	MANAGER	7839	1981-05-01 00:00:00	2850.00
7782	CLARK	MANAGER	7839	1981-06-09 00:00:00	2450.00
7788	SCOTT	ANALYST	7566	1987-04-19 00:00:00	1500.00
7839	KING	PRESIDENT		1981-11-17 00:00:00	5000.00
7844	TURNER	SALESMAN	7698	1981-09-08 00:00:00	1500.00	0.00
7876	ADAMS	CLERK	7788	1987-05-23 00:00:00	1100.00
7900	JAMES	CLERK	7698	1981-12-03 00:00:00	950.00
7902	FORD	ANALYST	7566	1981-12-03 00:00:00	3000.00
7934	MILLER	CLERK	7782	1982-01-23 00:00:00	1300.00
 CREATE EXTERNAL TABLE t_employee(
    id INT,
    name STRING,
    job STRING,
    manager INT,
    hiredate TIMESTAMP,
    salary DECIMAL(7,2)
    )
    PARTITIONED BY (deptno INT)   
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
    LOCATION '/hive/t_employee';
0: jdbc:hive2://CentOS:10000> load data local inpath '/root/t_emp' overwrite into table t_employee partition(deptno='10');
INFO  : Loading data to table hive_test.t_employee partition (deptno=10) from file:/root/t_emp
INFO  : Partition hive_test.t_employee{deptno=10} stats: [numFiles=1, numRows=0, totalSize=747, rawDataSize=0]
No rows affected (0.361 seconds)
0: jdbc:hive2://CentOS:10000> select id ,name,job,manager,salary,deptno from t_employee;
+-------+---------+------------+----------+---------+---------+--+
|  id   |  name   |    job     | manager  | salary  | deptno  |
+-------+---------+------------+----------+---------+---------+--+
| 7369  | SMITH   | CLERK      | 7902     | 800     | 10      |
| 7499  | ALLEN   | SALESMAN   | 7698     | 1600    | 10      |
| 7521  | WARD    | SALESMAN   | 7698     | 1250    | 10      |
| 7566  | JONES   | MANAGER    | 7839     | 2975    | 10      |
| 7654  | MARTIN  | SALESMAN   | 7698     | 1250    | 10      |
| 7698  | BLAKE   | MANAGER    | 7839     | 2850    | 10      |
| 7782  | CLARK   | MANAGER    | 7839     | 2450    | 10      |
| 7788  | SCOTT   | ANALYST    | 7566     | 1500    | 10      |
| 7839  | KING    | PRESIDENT  | NULL     | 5000    | 10      |
| 7844  | TURNER  | SALESMAN   | 7698     | 1500    | 10      |
| 7876  | ADAMS   | CLERK      | 7788     | 1100    | 10      |
| 7900  | JAMES   | CLERK      | 7698     | 950     | 10      |
| 7902  | FORD    | ANALYST    | 7566     | 3000    | 10      |
| 7934  | MILLER  | CLERK      | 7782     | 1300    | 10      |
+-------+---------+------------+----------+---------+---------+--+
14 rows selected (0.079 seconds)
  • Bucketed tables

A partitioned table isolates files coarsely into per-partition directories; a bucketed table instead hashes a column to decide which bucket each row belongs to, and the rows within each bucket can additionally be kept sorted.

 CREATE EXTERNAL TABLE t_employee_bucket(
    id INT,
    name STRING,
    job STRING,
    manager INT,
    hiredate TIMESTAMP,
    salary DECIMAL(7,2),
    deptno INT)
    CLUSTERED BY(id) SORTED BY(salary ASC) INTO 4 BUCKETS  
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
    LOCATION '/hive/employee_bucket';
0: jdbc:hive2://CentOS:10000> set hive.enforce.bucketing = true;
0: jdbc:hive2://CentOS:10000> INSERT INTO TABLE t_employee_bucket SELECT *  FROM t_employee;
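
One payoff of bucketing is efficient sampling: TABLESAMPLE can read just one of the four buckets instead of the whole table, for example:

0: jdbc:hive2://CentOS:10000> SELECT * FROM t_employee_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);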
  • Temporary tables

A temporary table is visible only to the current session: its data lives in the user's scratch directory and is deleted when the session ends. If a temporary table has the same name as a permanent table, any reference to that name resolves to the temporary table, not the permanent one. Temporary tables have two further limitations: partition columns are not supported, and indexes cannot be created on them.

  CREATE TEMPORARY TABLE if not exists emp_temp(
    id INT,
    name STRING,
    job STRING,
    manager INT,
    hiredate TIMESTAMP,
    salary DECIMAL(7,2),
    deptno INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
    LOCATION '/hive/emp_temp';
  • CTAS (create table as select)

The new table's schema is derived from the query output; the source table's layout (for example its partitioning) is not copied.

0: jdbc:hive2://CentOS:10000> create  TEMPORARY TABLE t_employee_copy1 as select * from t_employee;
  • Copy a table structure

Copies only the table's structure; no data is copied.

0: jdbc:hive2://CentOS:10000> CREATE TEMPORARY EXTERNAL TABLE  IF NOT EXISTS  t_empoyee_copy2  LIKE t_employee_bucket location '/hive/t_empoyee_copy2';
Alter tables
  • Rename a table
0: jdbc:hive2://CentOS:10000> ALTER TABLE t_user RENAME TO t_u;
  • Modify columns (retype, reorder, add)
0: jdbc:hive2://CentOS:10000> ALTER TABLE t_employee CHANGE id eid INT; -- rename a column and change its type
0: jdbc:hive2://CentOS:10000> ALTER TABLE t_employee CHANGE eid id decimal(7,2) AFTER name; -- change the column order
0: jdbc:hive2://CentOS:10000> ALTER TABLE t_employee ADD COLUMNS (address STRING);
Truncate a table
0: jdbc:hive2://CentOS:10000> truncate table t_employee partition(deptno=10);

Only managed tables can be truncated.

Drop a table
0: jdbc:hive2://CentOS:10000> drop table t_employee PURGE;

PURGE deletes the data immediately instead of moving it to the trash.

Other commands
Describe
  • Describe a database
DESCRIBE|DESC DATABASE [EXTENDED] db_name; 
0: jdbc:hive2://CentOS:10000> desc database hive_test;
+------------+--------------------+------------------------------------------------------+-------------+-------------+-------------+--+
|  db_name   |      comment       |                       location                       | owner_name  | owner_type  | parameters  |
+------------+--------------------+------------------------------------------------------+-------------+-------------+-------------+--+
| hive_test  | database for test  | hdfs://CentOS:9000/user/hive/warehouse/hive_test.db  | root        | USER        |             |
+------------+--------------------+------------------------------------------------------+-------------+-------------+-------------+--+
1 row selected (0.039 seconds)
  • Describe a table
DESCRIBE|DESC [EXTENDED|FORMATTED] table_name 
0: jdbc:hive2://CentOS:10000> desc  t_user;
+-----------+-------------------------------------+----------+--+
| col_name  |              data_type              | comment  |
+-----------+-------------------------------------+----------+--+
| id        | int                                 |          |
| name      | string                              |          |
| sex       | boolean                             |          |
| age       | int                                 |          |
| salary    | double                              |          |
| hobbies   | array<string>                       |          |
| card      | map<string,string>                  |          |
| address   | struct<country:string,city:string>  |          |
+-----------+-------------------------------------+----------+--+
8 rows selected (0.06 seconds)
Show
  • List databases
SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards'];
0: jdbc:hive2://CentOS:10000> show schemas like '*';
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| hive_test      |
| test           |
+----------------+--+
3 rows selected (0.03 seconds)
  • List tables
SHOW TABLES [IN database_name] ['identifier_with_wildcards'];
0: jdbc:hive2://CentOS:10000> show tables;
+--------------------+--+
|      tab_name      |
+--------------------+--+
| t_access           |
| t_employ_copy      |
| t_employee_bucket  |
| t_employee_copy1   |
| t_user             |
+--------------------+--+
5 rows selected (0.054 seconds)
0: jdbc:hive2://CentOS:10000> show tables in test;
+-------------+--+
|  tab_name   |
+-------------+--+
| t_access    |
| t_employee  |
| t_product   |
| t_student   |
| t_user      |
+-------------+--+
5 rows selected (0.043 seconds)
0: jdbc:hive2://CentOS:10000> show tables in test like 't_*';
+-------------+--+
|  tab_name   |
+-------------+--+
| t_access    |
| t_employee  |
| t_product   |
| t_student   |
| t_user      |
+-------------+--+
5 rows selected (0.04 seconds)
  • List partitions
0: jdbc:hive2://CentOS:10000> show partitions t_employee;
+------------+--+
| partition  |
+------------+--+
| deptno=10  |
+------------+--+
1 row selected (0.065 seconds)
  • Show the CREATE TABLE statement
0: jdbc:hive2://CentOS:10000> show create table t_employee;
+-----------------------------------------------------------------+--+
|                         createtab_stmt                          |
+-----------------------------------------------------------------+--+
| CREATE EXTERNAL TABLE `t_employee`(                             |
|   `id` int,                                                     |
|   `name` string,                                                |
|   `job` string,                                                 |
|   `manager` int,                                                |
|   `hiredate` timestamp,                                         |
|   `salary` decimal(7,2))                                        |
| PARTITIONED BY (                                                |
|   `deptno` int)                                                 |
| ROW FORMAT DELIMITED                                            |
|   FIELDS TERMINATED BY '\t'                                     |
| STORED AS INPUTFORMAT                                           |
|   'org.apache.hadoop.mapred.TextInputFormat'                    |
| OUTPUTFORMAT                                                    |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'  |
| LOCATION                                                        |
|   'hdfs://CentOS:9000/hive/t_employee'                          |
| TBLPROPERTIES (                                                 |
|   'transient_lastDdlTime'='1576961129')                         |
+-----------------------------------------------------------------+--+
19 rows selected (0.117 seconds)

For more, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

Hive DML

Load file data into a table
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
0: jdbc:hive2://CentOS:10000> load data local inpath '/root/t_employee' overwrite into table t_employee partition(deptno='10');
Insert query results into a table
  • Insert a query result into a table
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]]   
select_statement1 FROM from_statement; -- overwrite

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] 
select_statement1 FROM from_statement; -- append
{"EMPNO": 7369,"ENAME": "SMITH","JOB": "CLERK","MGR": 7902,"HIREDATE": "1980-12-17 00:00:00","SAL": 800.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7499,"ENAME": "ALLEN","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-02-20 00:00:00","SAL": 1600.00,"COMM": 300.00,"DEPTNO": 30}
{"EMPNO": 7521,"ENAME": "WARD","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-02-22 00:00:00","SAL": 1250.00,"COMM": 500.00,"DEPTNO": 30}
{"EMPNO": 7566,"ENAME": "JONES","JOB": "MANAGER","MGR": 7839,"HIREDATE": "1981-04-02 00:00:00","SAL": 2975.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7654,"ENAME": "MARTIN","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-09-28 00:00:00","SAL": 1250.00,"COMM": 1400.00,"DEPTNO": 30}
{"EMPNO": 7698,"ENAME": "BLAKE","JOB": "MANAGER","MGR": 7839,"HIREDATE": "1981-05-01 00:00:00","SAL": 2850.00,"COMM": null,"DEPTNO": 30}
{"EMPNO": 7782,"ENAME": "CLARK","JOB": "MANAGER","MGR": 7839,"HIREDATE": "1981-06-09 00:00:00","SAL": 2450.00,"COMM": null,"DEPTNO": 10}
{"EMPNO": 7788,"ENAME": "SCOTT","JOB": "ANALYST","MGR": 7566,"HIREDATE": "1987-04-19 00:00:00","SAL": 1500.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7839,"ENAME": "KING","JOB": "PRESIDENT","MGR": null,"HIREDATE": "1981-11-17 00:00:00","SAL": 5000.00,"COMM": null,"DEPTNO": 10}
{"EMPNO": 7844,"ENAME": "TURNER","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-09-08 00:00:00","SAL": 1500.00,"COMM": 0.00,"DEPTNO": 30}
{"EMPNO": 7876,"ENAME": "ADAMS","JOB": "CLERK","MGR": 7788,"HIREDATE": "1987-05-23 00:00:00","SAL": 1100.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7900,"ENAME": "JAMES","JOB": "CLERK","MGR": 7698,"HIREDATE": "1981-12-03 00:00:00","SAL": 950.00,"COMM": null,"DEPTNO": 30}
{"EMPNO": 7902,"ENAME": "FORD","JOB": "ANALYST","MGR": 7566,"HIREDATE": "1981-12-03 00:00:00","SAL": 3000.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7934,"ENAME": "MILLER","JOB": "CLERK","MGR": 7782,"HIREDATE": "1982-01-23 00:00:00","SAL": 1300.00,"COMM": null,"DEPTNO": 10}
0: jdbc:hive2://CentOS:10000> add jar /usr/apache-hive-1.2.2-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.2.jar;

CREATE EXTERNAL TABLE t_emp_json(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2),
    deptno INT)
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
    
0: jdbc:hive2://CentOS:10000> select empno,ename,job,mgr,hiredate,sal,comm,deptno from t_emp_json;

+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| empno  |  ename  |    job     |  mgr  |        hiredate        |  sal  | comm  | deptno  |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| 7369   | SMITH   | CLERK      | 7902  | 1980-12-17 00:00:00.0  | 800   | NULL  | 20      |
| 7499   | ALLEN   | SALESMAN   | 7698  | 1981-02-20 00:00:00.0  | 1600  | 300   | 30      |
| 7521   | WARD    | SALESMAN   | 7698  | 1981-02-22 00:00:00.0  | 1250  | 500   | 30      |
| 7566   | JONES   | MANAGER    | 7839  | 1981-04-02 00:00:00.0  | 2975  | NULL  | 20      |
| 7654   | MARTIN  | SALESMAN   | 7698  | 1981-09-28 00:00:00.0  | 1250  | 1400  | 30      |
| 7698   | BLAKE   | MANAGER    | 7839  | 1981-05-01 00:00:00.0  | 2850  | NULL  | 30      |
| 7782   | CLARK   | MANAGER    | 7839  | 1981-06-09 00:00:00.0  | 2450  | NULL  | 10      |
| 7788   | SCOTT   | ANALYST    | 7566  | 1987-04-19 00:00:00.0  | 1500  | NULL  | 20      |
| 7839   | KING    | PRESIDENT  | NULL  | 1981-11-17 00:00:00.0  | 5000  | NULL  | 10      |
| 7844   | TURNER  | SALESMAN   | 7698  | 1981-09-08 00:00:00.0  | 1500  | 0     | 30      |
| 7876   | ADAMS   | CLERK      | 7788  | 1987-05-23 00:00:00.0  | 1100  | NULL  | 20      |
| 7900   | JAMES   | CLERK      | 7698  | 1981-12-03 00:00:00.0  | 950   | NULL  | 30      |
| 7902   | FORD    | ANALYST    | 7566  | 1981-12-03 00:00:00.0  | 3000  | NULL  | 20      |
| 7934   | MILLER  | CLERK      | 7782  | 1982-01-23 00:00:00.0  | 1300  | NULL  | 10      |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
 CREATE TABLE t_employee(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2)
    )
    PARTITIONED BY (deptno INT)   
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
0: jdbc:hive2://CentOS:10000> insert overwrite table t_employee partition(deptno=10) select empno,ename,job,mgr,hiredate,sal,comm from t_emp_json where deptno=10; -- overwrite
0: jdbc:hive2://CentOS:10000> insert into table t_employee partition(deptno=20) select empno,ename,job,mgr,hiredate,sal,comm from t_emp_json where deptno=20; -- append
  • Insert query results into multiple tables
FROM from_statement
INSERT OVERWRITE TABLE tablename1 
[PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
from t_emp_json 
INSERT OVERWRITE TABLE t_employee partition(deptno=10) select empno,ename,job,mgr,hiredate,sal,comm  where deptno=10 
INSERT OVERWRITE TABLE t_employee partition(deptno=20) select empno,ename,job,mgr,hiredate,sal,comm  where deptno=20 
INSERT OVERWRITE TABLE t_employee partition(deptno=30) select empno,ename,job,mgr,hiredate,sal,comm  where deptno=30 
  • Dynamic-partition insert
0: jdbc:hive2://CentOS:10000> set hive.exec.dynamic.partition.mode=nonstrict;
No rows affected (0.004 seconds)
0: jdbc:hive2://CentOS:10000> INSERT OVERWRITE TABLE t_employee PARTITION (deptno)  SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM t_emp_json;

0: jdbc:hive2://CentOS:10000> show partitions t_employee;
+------------+--+
| partition  |
+------------+--+
| deptno=10  |
| deptno=20  |
| deptno=30  |
+------------+--+
3 rows selected (0.064 seconds)
Write results to the file system
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format] 
  SELECT ... FROM ...
0: jdbc:hive2://CentOS:10000> INSERT OVERWRITE  DIRECTORY '/employee' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE select * from t_employee;

For more, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML

SQL Queries

Data Preparation

7369,SMITH,CLERK,7902,1980-12-17 00:00:00,800,\N,20
7499,ALLEN,SALESMAN,7698,1981-02-20 00:00:00,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22 00:00:00,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02 00:00:00,2975,\N,20
7654,MARTIN,SALESMAN,7698,1981-09-28 00:00:00,1250,1400,30
7698,BLAKE,MANAGER,7839,1981-05-01 00:00:00,2850,\N,30
7782,CLARK,MANAGER,7839,1981-06-09 00:00:00,2450,\N,10
7788,SCOTT,ANALYST,7566,1987-04-19 00:00:00,1500,\N,20
7839,KING,PRESIDENT,\N,1981-11-17 00:00:00,5000,\N,10
7844,TURNER,SALESMAN,7698,1981-09-08 00:00:00,1500,0,30
7876,ADAMS,CLERK,7788,1987-05-23 00:00:00,1100,\N,20
7900,JAMES,CLERK,7698,1981-12-03 00:00:00,950,\N,30
7902,FORD,ANALYST,7566,1981-12-03 00:00:00,3000,\N,20
7934,MILLER,CLERK,7782,1982-01-23 00:00:00,1300,\N,10
CREATE TABLE t_employee(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2),
    deptno INT)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
CREATE TABLE t_dept(
    DEPTNO INT,
    DNAME STRING,
    LOC STRING)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;
0: jdbc:hive2://CentOS:10000> select empno,ename,job,mgr,hiredate,sal,comm,deptno  from t_employee;
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| empno  |  ename  |    job     |  mgr  |        hiredate        |  sal  | comm  | deptno  |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| 7369   | SMITH   | CLERK      | 7902  | 1980-12-17 00:00:00.0  | 800   | NULL  | 20      |
| 7499   | ALLEN   | SALESMAN   | 7698  | 1981-02-20 00:00:00.0  | 1600  | 300   | 30      |
| 7521   | WARD    | SALESMAN   | 7698  | 1981-02-22 00:00:00.0  | 1250  | 500   | 30      |
| 7566   | JONES   | MANAGER    | 7839  | 1981-04-02 00:00:00.0  | 2975  | NULL  | 20      |
| 7654   | MARTIN  | SALESMAN   | 7698  | 1981-09-28 00:00:00.0  | 1250  | 1400  | 30      |
| 7698   | BLAKE   | MANAGER    | 7839  | 1981-05-01 00:00:00.0  | 2850  | NULL  | 30      |
| 7782   | CLARK   | MANAGER    | 7839  | 1981-06-09 00:00:00.0  | 2450  | NULL  | 10      |
| 7788   | SCOTT   | ANALYST    | 7566  | 1987-04-19 00:00:00.0  | 1500  | NULL  | 20      |
| 7839   | KING    | PRESIDENT  | NULL  | 1981-11-17 00:00:00.0  | 5000  | NULL  | 10      |
| 7844   | TURNER  | SALESMAN   | 7698  | 1981-09-08 00:00:00.0  | 1500  | 0     | 30      |
| 7876   | ADAMS   | CLERK      | 7788  | 1987-05-23 00:00:00.0  | 1100  | NULL  | 20      |
| 7900   | JAMES   | CLERK      | 7698  | 1981-12-03 00:00:00.0  | 950   | NULL  | 30      |
| 7902   | FORD    | ANALYST    | 7566  | 1981-12-03 00:00:00.0  | 3000  | NULL  | 20      |
| 7934   | MILLER  | CLERK      | 7782  | 1982-01-23 00:00:00.0  | 1300  | NULL  | 10      |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
14 rows selected (0.047 seconds)
0: jdbc:hive2://CentOS:10000> select deptno,dname,loc from t_dept;
+---------+-------------+-----------+--+
| deptno  |    dname    |    loc    |
+---------+-------------+-----------+--+
| 10      | ACCOUNTING  | NEW YORK  |
| 20      | RESEARCH    | DALLAS    |
| 30      | SALES       | CHICAGO   |
| 40      | OPERATIONS  | BOSTON    |
+---------+-------------+-----------+--+
4 rows selected (0.046 seconds)
CREATE TABLE t_employee_partition(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2))
PARTITIONED BY(deptno INT) 
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;
0: jdbc:hive2://CentOS:10000> set hive.exec.dynamic.partition.mode=nonstrict;
0: jdbc:hive2://CentOS:10000> INSERT OVERWRITE TABLE t_employee_partition PARTITION (deptno)  SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM t_employee;

Queries

Single-table query
0: jdbc:hive2://CentOS:10000> select empno,ename,job,mgr,hiredate,sal,comm,deptno  from t_employee;
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| empno  |  ename  |    job     |  mgr  |        hiredate        |  sal  | comm  | deptno  |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| 7369   | SMITH   | CLERK      | 7902  | 1980-12-17 00:00:00.0  | 800   | NULL  | 20      |
| 7499   | ALLEN   | SALESMAN   | 7698  | 1981-02-20 00:00:00.0  | 1600  | 300   | 30      |
| 7521   | WARD    | SALESMAN   | 7698  | 1981-02-22 00:00:00.0  | 1250  | 500   | 30      |
| 7566   | JONES   | MANAGER    | 7839  | 1981-04-02 00:00:00.0  | 2975  | NULL  | 20      |
| 7654   | MARTIN  | SALESMAN   | 7698  | 1981-09-28 00:00:00.0  | 1250  | 1400  | 30      |
| 7698   | BLAKE   | MANAGER    | 7839  | 1981-05-01 00:00:00.0  | 2850  | NULL  | 30      |
| 7782   | CLARK   | MANAGER    | 7839  | 1981-06-09 00:00:00.0  | 2450  | NULL  | 10      |
| 7788   | SCOTT   | ANALYST    | 7566  | 1987-04-19 00:00:00.0  | 1500  | NULL  | 20      |
| 7839   | KING    | PRESIDENT  | NULL  | 1981-11-17 00:00:00.0  | 5000  | NULL  | 10      |
| 7844   | TURNER  | SALESMAN   | 7698  | 1981-09-08 00:00:00.0  | 1500  | 0     | 30      |
| 7876   | ADAMS   | CLERK      | 7788  | 1987-05-23 00:00:00.0  | 1100  | NULL  | 20      |
| 7900   | JAMES   | CLERK      | 7698  | 1981-12-03 00:00:00.0  | 950   | NULL  | 30      |
| 7902   | FORD    | ANALYST    | 7566  | 1981-12-03 00:00:00.0  | 3000  | NULL  | 20      |
| 7934   | MILLER  | CLERK      | 7782  | 1982-01-23 00:00:00.0  | 1300  | NULL  | 10      |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
14 rows selected (0.056 seconds)
WHERE query
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM t_employee WHERE empno > 7782 AND deptno = 10;
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| empno  |  ename  |    job     |  mgr  |        hiredate        |  sal  | comm  | deptno  |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| 7839   | KING    | PRESIDENT  | NULL  | 1981-11-17 00:00:00.0  | 5000  | NULL  | 10      |
| 7934   | MILLER  | CLERK      | 7782  | 1982-01-23 00:00:00.0  | 1300  | NULL  | 10      |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
2 rows selected (0.067 seconds)
DISTINCT query
0: jdbc:hive2://CentOS:10000> select distinct(job) from t_employee;
+------------+--+
|    job     |
+------------+--+
| ANALYST    |
| CLERK      |
| MANAGER    |
| PRESIDENT  |
| SALESMAN   |
+------------+--+
Partition query
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM t_employee_partition e  WHERE e.deptno >= 20 AND e.deptno <= 40;
+--------+---------+-----------+-------+------------------------+-------+-------+---------+--+
| empno  |  ename  |    job    |  mgr  |        hiredate        |  sal  | comm  | deptno  |
+--------+---------+-----------+-------+------------------------+-------+-------+---------+--+
| 7369   | SMITH   | CLERK     | 7902  | 1980-12-17 00:00:00.0  | 800   | NULL  | 20      |
| 7566   | JONES   | MANAGER   | 7839  | 1981-04-02 00:00:00.0  | 2975  | NULL  | 20      |
| 7788   | SCOTT   | ANALYST   | 7566  | 1987-04-19 00:00:00.0  | 1500  | NULL  | 20      |
| 7876   | ADAMS   | CLERK     | 7788  | 1987-05-23 00:00:00.0  | 1100  | NULL  | 20      |
| 7902   | FORD    | ANALYST   | 7566  | 1981-12-03 00:00:00.0  | 3000  | NULL  | 20      |
| 7499   | ALLEN   | SALESMAN  | 7698  | 1981-02-20 00:00:00.0  | 1600  | 300   | 30      |
| 7521   | WARD    | SALESMAN  | 7698  | 1981-02-22 00:00:00.0  | 1250  | 500   | 30      |
| 7654   | MARTIN  | SALESMAN  | 7698  | 1981-09-28 00:00:00.0  | 1250  | 1400  | 30      |
| 7698   | BLAKE   | MANAGER   | 7839  | 1981-05-01 00:00:00.0  | 2850  | NULL  | 30      |
| 7844   | TURNER  | SALESMAN  | 7698  | 1981-09-08 00:00:00.0  | 1500  | 0     | 30      |
| 7900   | JAMES   | CLERK     | 7698  | 1981-12-03 00:00:00.0  | 950   | NULL  | 30      |
+--------+---------+-----------+-------+------------------------+-------+-------+---------+--+
11 rows selected (0.123 seconds)
LIMIT query
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM t_employee  ORDER BY sal DESC LIMIT 5;
+--------+--------+------------+-------+------------------------+-------+-------+---------+--+
| empno  | ename  |    job     |  mgr  |        hiredate        |  sal  | comm  | deptno  |
+--------+--------+------------+-------+------------------------+-------+-------+---------+--+
| 7839   | KING   | PRESIDENT  | NULL  | 1981-11-17 00:00:00.0  | 5000  | NULL  | 10      |
| 7902   | FORD   | ANALYST    | 7566  | 1981-12-03 00:00:00.0  | 3000  | NULL  | 20      |
| 7566   | JONES  | MANAGER    | 7839  | 1981-04-02 00:00:00.0  | 2975  | NULL  | 20      |
| 7698   | BLAKE  | MANAGER    | 7839  | 1981-05-01 00:00:00.0  | 2850  | NULL  | 30      |
| 7782   | CLARK  | MANAGER    | 7839  | 1981-06-09 00:00:00.0  | 2450  | NULL  | 10      |
+--------+--------+------------+-------+------------------------+-------+-------+---------+--+
5 rows selected (14.294 seconds)
GROUP BY query
0: jdbc:hive2://CentOS:10000> set hive.map.aggr=true;
0: jdbc:hive2://CentOS:10000> SELECT deptno,SUM(sal) as total FROM t_employee GROUP BY deptno;
+---------+--------+--+
| deptno  | total  |
+---------+--------+--+
| 10      | 8750   |
| 20      | 9375   |
| 30      | 9400   |
+---------+--------+--+
3 rows selected (12.645 seconds)

hive.map.aggr controls how aggregation is performed. When set to true, Hive runs a partial aggregation in the map phase, which makes aggregation more efficient at the cost of extra memory.

ORDER AND SORT

Results can be sorted with ORDER BY or SORT BY; the sort column may be numeric (ordered by value) or a string (ordered lexicographically). The difference between the two: ORDER BY pushes the entire result through a single Reducer and therefore guarantees a total global order, while SORT BY sorts only within each Reducer, so every Reducer's output is ordered but the overall result is not globally ordered. Because ORDER BY can take very long, strict mode (hive.mapred.mode = strict) requires it to be followed by a LIMIT clause.

  • sort by
0: jdbc:hive2://CentOS:10000> set mapreduce.job.reduces=2;
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal from t_employee sort by sal desc;
+--------+---------+-------+--+
| empno  |  ename  |  sal  |
+--------+---------+-------+--+
| 7902   | FORD    | 3000  |
| 7566   | JONES   | 2975  |
| 7844   | TURNER  | 1500  |
| 7788   | SCOTT   | 1500  |
| 7521   | WARD    | 1250  |
| 7654   | MARTIN  | 1250  |
| 7876   | ADAMS   | 1100  |
| 7900   | JAMES   | 950   |
| 7369   | SMITH   | 800   |
| 7839   | KING    | 5000  |
| 7698   | BLAKE   | 2850  |
| 7782   | CLARK   | 2450  |
| 7499   | ALLEN   | 1600  |
| 7934   | MILLER  | 1300  |
+--------+---------+-------+--+
14 rows selected (14.474 seconds)
  • order by
0: jdbc:hive2://CentOS:10000> set mapreduce.job.reduces=3;
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal from t_employee order by sal desc;
+--------+---------+-------+--+
| empno  |  ename  |  sal  |
+--------+---------+-------+--+
| 7839   | KING    | 5000  |
| 7902   | FORD    | 3000  |
| 7566   | JONES   | 2975  |
| 7698   | BLAKE   | 2850  |
| 7782   | CLARK   | 2450  |
| 7499   | ALLEN   | 1600  |
| 7844   | TURNER  | 1500  |
| 7788   | SCOTT   | 1500  |
| 7934   | MILLER  | 1300  |
| 7654   | MARTIN  | 1250  |
| 7521   | WARD    | 1250  |
| 7876   | ADAMS   | 1100  |
| 7900   | JAMES   | 950   |
| 7369   | SMITH   | 800   |
+--------+---------+-------+--+
14 rows selected (13.049 seconds)
0: jdbc:hive2://CentOS:10000> set hive.mapred.mode = strict;
No rows affected (0.004 seconds)
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal from t_employee order by sal desc;
Error: Error while compiling statement: FAILED: SemanticException 1:48 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'sal' (state=42000,code=40000)
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal from t_employee order by sal desc limit 5; 
+--------+--------+-------+--+
| empno  | ename  |  sal  |
+--------+--------+-------+--+
| 7839   | KING   | 5000  |
| 7902   | FORD   | 3000  |
| 7566   | JONES  | 2975  |
| 7698   | BLAKE  | 2850  |
| 7782   | CLARK  | 2450  |
+--------+--------+-------+--+
5 rows selected (12.468 seconds)

HAVING

0: jdbc:hive2://CentOS:10000> SELECT deptno,SUM(sal) total FROM t_employee GROUP BY deptno HAVING SUM(sal)>9000;
+---------+--------+--+
| deptno  | total  |
+---------+--------+--+
| 30      | 9400   |
| 20      | 9375   |
+---------+--------+--+
2 rows selected (18.361 seconds)
DISTRIBUTE BY

By default, MapReduce hashes the map output keys and spreads them evenly across all Reducers. To route rows with the same key to the same Reducer, use the DISTRIBUTE BY clause. Note that DISTRIBUTE BY guarantees co-location of equal keys on one Reducer, but not any ordering of the rows there.

0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal, deptno  FROM t_employee distribute BY deptno;
+--------+---------+-------+---------+--+
| empno  |  ename  |  sal  | deptno  |
+--------+---------+-------+---------+--+
| 7654   | MARTIN  | 1250  | 30      |
| 7900   | JAMES   | 950   | 30      |
| 7698   | BLAKE   | 2850  | 30      |
| 7521   | WARD    | 1250  | 30      |
| 7844   | TURNER  | 1500  | 30      |
| 7499   | ALLEN   | 1600  | 30      |
| 7934   | MILLER  | 1300  | 10      |
| 7839   | KING    | 5000  | 10      |
| 7782   | CLARK   | 2450  | 10      |
| 7788   | SCOTT   | 1500  | 20      |
| 7566   | JONES   | 2975  | 20      |
| 7876   | ADAMS   | 1100  | 20      |
| 7902   | FORD    | 3000  | 20      |
| 7369   | SMITH   | 800   | 20      |
+--------+---------+-------+---------+--+
14 rows selected (15.504 seconds)
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal, deptno  FROM t_employee distribute BY deptno sort by sal desc;
+--------+---------+-------+---------+--+
| empno  |  ename  |  sal  | deptno  |
+--------+---------+-------+---------+--+
| 7698   | BLAKE   | 2850  | 30      |
| 7499   | ALLEN   | 1600  | 30      |
| 7844   | TURNER  | 1500  | 30      |
| 7521   | WARD    | 1250  | 30      |
| 7654   | MARTIN  | 1250  | 30      |
| 7900   | JAMES   | 950   | 30      |
| 7839   | KING    | 5000  | 10      |
| 7782   | CLARK   | 2450  | 10      |
| 7934   | MILLER  | 1300  | 10      |
| 7902   | FORD    | 3000  | 20      |
| 7566   | JONES   | 2975  | 20      |
| 7788   | SCOTT   | 1500  | 20      |
| 7876   | ADAMS   | 1100  | 20      |
| 7369   | SMITH   | 800   | 20      |
+--------+---------+-------+---------+--+
14 rows selected (16.528 seconds)
CLUSTER BY

When SORT BY and DISTRIBUTE BY name the same column and the sort is ascending, the pair can be replaced by CLUSTER BY. CLUSTER BY sends equal keys to the same Reducer and sorts each Reducer's output by that column; it does not, however, impose a total order across all Reducers (only ORDER BY does).

0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal, deptno  FROM t_employee cluster by deptno;
+--------+---------+-------+---------+--+
| empno  |  ename  |  sal  | deptno  |
+--------+---------+-------+---------+--+
| 7934   | MILLER  | 1300  | 10      |
| 7839   | KING    | 5000  | 10      |
| 7782   | CLARK   | 2450  | 10      |
| 7876   | ADAMS   | 1100  | 20      |
| 7788   | SCOTT   | 1500  | 20      |
| 7369   | SMITH   | 800   | 20      |
| 7566   | JONES   | 2975  | 20      |
| 7902   | FORD    | 3000  | 20      |
| 7844   | TURNER  | 1500  | 30      |
| 7499   | ALLEN   | 1600  | 30      |
| 7698   | BLAKE   | 2850  | 30      |
| 7654   | MARTIN  | 1250  | 30      |
| 7521   | WARD    | 1250  | 30      |
| 7900   | JAMES   | 950   | 30      |
+--------+---------+-------+---------+--+
14 rows selected (25.847 seconds)
JOIN Queries

Hive supports inner joins, left/right/full outer joins, and cartesian joins, with the same semantics as in a traditional database. One point deserves emphasis: the join condition must be specified with ON, not WHERE; otherwise Hive first builds the cartesian product and filters afterwards, which is expensive and will not give the expected result.

  • Inner join
0: jdbc:hive2://CentOS:10000>  SELECT e.empno,e.ename,e.sal,d.dname,d.deptno FROM t_employee e JOIN t_dept d ON e.deptno = d.deptno WHERE e.empno=7369;
+----------+----------+--------+-----------+-----------+--+
| e.empno  | e.ename  | e.sal  |  d.dname  | d.deptno  |
+----------+----------+--------+-----------+-----------+--+
| 7369     | SMITH    | 800    | RESEARCH  | 20        |
+----------+----------+--------+-----------+-----------+--+
1 row selected (10.419 seconds)
  • Outer join
0: jdbc:hive2://CentOS:10000>  SELECT e.empno,e.ename,e.sal,d.dname,d.deptno FROM t_employee e LEFT OUTER JOIN t_dept d ON e.deptno = d.deptno;
+----------+----------+--------+-------------+-----------+--+
| e.empno  | e.ename  | e.sal  |   d.dname   | d.deptno  |
+----------+----------+--------+-------------+-----------+--+
| 7369     | SMITH    | 800    | RESEARCH    | 20        |
| 7499     | ALLEN    | 1600   | SALES       | 30        |
| 7521     | WARD     | 1250   | SALES       | 30        |
| 7566     | JONES    | 2975   | RESEARCH    | 20        |
| 7654     | MARTIN   | 1250   | SALES       | 30        |
| 7698     | BLAKE    | 2850   | SALES       | 30        |
| 7782     | CLARK    | 2450   | ACCOUNTING  | 10        |
| 7788     | SCOTT    | 1500   | RESEARCH    | 20        |
| 7839     | KING     | 5000   | ACCOUNTING  | 10        |
| 7844     | TURNER   | 1500   | SALES       | 30        |
| 7876     | ADAMS    | 1100   | RESEARCH    | 20        |
| 7900     | JAMES    | 950    | SALES       | 30        |
| 7902     | FORD     | 3000   | RESEARCH    | 20        |
| 7934     | MILLER   | 1300   | ACCOUNTING  | 10        |
+----------+----------+--------+-------------+-----------+--+
14 rows selected (11.424 seconds)
0: jdbc:hive2://CentOS:10000>  SELECT e.empno,e.ename,e.sal,d.dname,d.deptno FROM t_employee e RIGHT OUTER JOIN t_dept d ON e.deptno = d.deptno;
+----------+----------+--------+-------------+-----------+--+
| e.empno  | e.ename  | e.sal  |   d.dname   | d.deptno  |
+----------+----------+--------+-------------+-----------+--+
| 7782     | CLARK    | 2450   | ACCOUNTING  | 10        |
| 7839     | KING     | 5000   | ACCOUNTING  | 10        |
| 7934     | MILLER   | 1300   | ACCOUNTING  | 10        |
| 7369     | SMITH    | 800    | RESEARCH    | 20        |
| 7566     | JONES    | 2975   | RESEARCH    | 20        |
| 7788     | SCOTT    | 1500   | RESEARCH    | 20        |
| 7876     | ADAMS    | 1100   | RESEARCH    | 20        |
| 7902     | FORD     | 3000   | RESEARCH    | 20        |
| 7499     | ALLEN    | 1600   | SALES       | 30        |
| 7521     | WARD     | 1250   | SALES       | 30        |
| 7654     | MARTIN   | 1250   | SALES       | 30        |
| 7698     | BLAKE    | 2850   | SALES       | 30        |
| 7844     | TURNER   | 1500   | SALES       | 30        |
| 7900     | JAMES    | 950    | SALES       | 30        |
| NULL     | NULL     | NULL   | OPERATIONS  | 40        |
+----------+----------+--------+-------------+-----------+--+
15 rows selected (11.063 seconds)
0: jdbc:hive2://CentOS:10000>  SELECT e.empno,e.ename,e.sal,d.dname,d.deptno FROM t_employee e FULL OUTER JOIN t_dept d ON e.deptno = d.deptno;
+----------+----------+--------+-------------+-----------+--+
| e.empno  | e.ename  | e.sal  |   d.dname   | d.deptno  |
+----------+----------+--------+-------------+-----------+--+
| 7934     | MILLER   | 1300   | ACCOUNTING  | 10        |
| 7839     | KING     | 5000   | ACCOUNTING  | 10        |
| 7782     | CLARK    | 2450   | ACCOUNTING  | 10        |
| 7876     | ADAMS    | 1100   | RESEARCH    | 20        |
| 7788     | SCOTT    | 1500   | RESEARCH    | 20        |
| 7369     | SMITH    | 800    | RESEARCH    | 20        |
| 7566     | JONES    | 2975   | RESEARCH    | 20        |
| 7902     | FORD     | 3000   | RESEARCH    | 20        |
| 7844     | TURNER   | 1500   | SALES       | 30        |
| 7499     | ALLEN    | 1600   | SALES       | 30        |
| 7698     | BLAKE    | 2850   | SALES       | 30        |
| 7654     | MARTIN   | 1250   | SALES       | 30        |
| 7521     | WARD     | 1250   | SALES       | 30        |
| 7900     | JAMES    | 950    | SALES       | 30        |
| NULL     | NULL     | NULL   | OPERATIONS  | 40        |
+----------+----------+--------+-------------+-----------+--+
15 rows selected (24.703 seconds)

LEFT SEMI JOIN

LEFT SEMI JOIN is a more efficient implementation of an IN/EXISTS subquery.

  • Filters on the right-hand table may appear only in the ON clause;
  • The result contains only left-table rows, so only columns of the left table can be selected.
0: jdbc:hive2://CentOS:10000> SELECT e.empno,e.ename,e.deptno FROM t_employee e LEFT SEMI JOIN t_dept d ON e.deptno = d.deptno AND d.loc="NEW YORK";
+----------+----------+-----------+--+
| e.empno  | e.ename  | e.deptno  |
+----------+----------+-----------+--+
| 7782     | CLARK    | 10        |
| 7839     | KING     | 10        |
| 7934     | MILLER   | 10        |
+----------+----------+-----------+--+
3 rows selected (10.119 seconds)
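
The same result could be written as the IN subquery that the semi join implements more efficiently:

0: jdbc:hive2://CentOS:10000> SELECT empno,ename,deptno FROM t_employee WHERE deptno IN (SELECT deptno FROM t_dept WHERE loc="NEW YORK");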
JOIN Optimization
  • STREAMTABLE

When several tables are joined and every ON clause uses the same join column, Hive optimizes the multi-table JOIN into a single map/reduce job. It also assumes that the last table in the query is the largest: while joining each row, it buffers the other tables in memory and streams the last one. Users should therefore write the tables in increasing order of size from left to right.

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key) JOIN c ON (c.key = b.key)

However, the largest table does not always have to be written last: Hive provides the /*+ STREAMTABLE(...) */ hint to mark the table to stream, avoiding the excessive memory use a large buffered table would cause. Example:

0: jdbc:hive2://CentOS:10000> SELECT /*+ STREAMTABLE(e) */ e.empno,e.ename,d.dname,d.deptno FROM t_employee e JOIN t_dept d ON e.deptno = d.deptno WHERE job='CLERK';
+----------+----------+-------------+-----------+--+
| e.empno  | e.ename  |   d.dname   | d.deptno  |
+----------+----------+-------------+-----------+--+
| 7369     | SMITH    | RESEARCH    | 20        |
| 7876     | ADAMS    | RESEARCH    | 20        |
| 7900     | JAMES    | SALES       | 30        |
| 7934     | MILLER   | ACCOUNTING  | 10        |
+----------+----------+-------------+-----------+--+
4 rows selected (11.645 seconds)
  • MAPJOIN

If one side of a join is small, the join can be moved into the map phase; this is the classic large-table/small-table join problem. There are two ways to enable it: 1. add a /*+ MAPJOIN(b) */ hint; 2. set the parameter hive.optimize.bucketmapjoin = true.

0: jdbc:hive2://CentOS:10000> SELECT /*+ MAPJOIN(d) */ e.empno, e.ename,d.dname FROM t_employee e  JOIN t_dept d ON d.deptno = e.deptno;
+----------+----------+-------------+--+
| e.empno  | e.ename  |   d.dname   |
+----------+----------+-------------+--+
| 7369     | SMITH    | RESEARCH    |
| 7499     | ALLEN    | SALES       |
| 7521     | WARD     | SALES       |
| 7566     | JONES    | RESEARCH    |
| 7654     | MARTIN   | SALES       |
| 7698     | BLAKE    | SALES       |
| 7782     | CLARK    | ACCOUNTING  |
| 7788     | SCOTT    | RESEARCH    |
| 7839     | KING     | ACCOUNTING  |
| 7844     | TURNER   | SALES       |
| 7876     | ADAMS    | RESEARCH    |
| 7900     | JAMES    | SALES       |
| 7902     | FORD     | RESEARCH    |
| 7934     | MILLER   | ACCOUNTING  |
+----------+----------+-------------+--+
14 rows selected (11.416 seconds)
Window Functions
0: jdbc:hive2://CentOS:10000> select e.empno ,e.ename,e.sal,e.deptno,rank() over(partition by e.deptno order by e.sal) as rank from t_employee e; 
+----------+----------+--------+-----------+-------+--+
| e.empno  | e.ename  | e.sal  | e.deptno  | rank  |
+----------+----------+--------+-----------+-------+--+
| 7839     | KING     | 5000   | 10        | 1     |
| 7782     | CLARK    | 2450   | 10        | 2     |
| 7934     | MILLER   | 1300   | 10        | 3     |
| 7902     | FORD     | 3000   | 20        | 1     |
| 7566     | JONES    | 2975   | 20        | 2     |
| 7788     | SCOTT    | 1500   | 20        | 3     |
| 7876     | ADAMS    | 1100   | 20        | 4     |
| 7369     | SMITH    | 800    | 20        | 5     |
| 7698     | BLAKE    | 2850   | 30        | 1     |
| 7499     | ALLEN    | 1600   | 30        | 2     |
| 7844     | TURNER   | 1500   | 30        | 3     |
| 7654     | MARTIN   | 1250   | 30        | 4     |
| 7521     | WARD     | 1250   | 30        | 4     |
| 7900     | JAMES    | 950    | 30        | 6     |
+----------+----------+--------+-----------+-------+--+
0: jdbc:hive2://CentOS:10000> select e.empno ,e.ename,e.sal,e.deptno,dense_rank() over(partition by e.deptno order by e.sal desc) as rank from t_employee e; 
+----------+----------+--------+-----------+-------+--+
| e.empno  | e.ename  | e.sal  | e.deptno  | rank  |
+----------+----------+--------+-----------+-------+--+
| 7839     | KING     | 5000   | 10        | 1     |
| 7782     | CLARK    | 2450   | 10        | 2     |
| 7934     | MILLER   | 1300   | 10        | 3     |
| 7902     | FORD     | 3000   | 20        | 1     |
| 7566     | JONES    | 2975   | 20        | 2     |
| 7788     | SCOTT    | 1500   | 20        | 3     |
| 7876     | ADAMS    | 1100   | 20        | 4     |
| 7369     | SMITH    | 800    | 20        | 5     |
| 7698     | BLAKE    | 2850   | 30        | 1     |
| 7499     | ALLEN    | 1600   | 30        | 2     |
| 7844     | TURNER   | 1500   | 30        | 3     |
| 7654     | MARTIN   | 1250   | 30        | 4     |
| 7521     | WARD     | 1250   | 30        | 4     |
| 7900     | JAMES    | 950    | 30        | 5     |
+----------+----------+--------+-----------+-------+--+
14 rows selected (24.262 seconds)
CUBE Analysis
0: jdbc:hive2://CentOS:10000> select e.deptno,e.job,avg(e.sal) avg,max(e.sal) max,min(e.sal) min from t_employee e group by e.deptno,e.job with cube;
+-----------+------------+--------------+-------+-------+--+
| e.deptno  |   e.job    |     avg      |  max  |  min  |
+-----------+------------+--------------+-------+-------+--+
| NULL      | ANALYST    | 2250         | 3000  | 1500  |
| 10        | CLERK      | 1300         | 1300  | 1300  |
| 20        | CLERK      | 950          | 1100  | 800   |
| 30        | CLERK      | 950          | 950   | 950   |
| 20        | ANALYST    | 2250         | 3000  | 1500  |
| NULL      | PRESIDENT  | 5000         | 5000  | 5000  |
| 10        | PRESIDENT  | 5000         | 5000  | 5000  |
| NULL      | SALESMAN   | 1400         | 1600  | 1250  |
| NULL      | MANAGER    | 2758.333333  | 2975  | 2450  |
| 30        | SALESMAN   | 1400         | 1600  | 1250  |
| 10        | MANAGER    | 2450         | 2450  | 2450  |
| 20        | MANAGER    | 2975         | 2975  | 2975  |
| 30        | MANAGER    | 2850         | 2850  | 2850  |
| NULL      | NULL       | 1966.071429  | 5000  | 800   |
| NULL      | CLERK      | 1037.5       | 1300  | 800   |
| 10        | NULL       | 2916.666667  | 5000  | 1300  |
| 20        | NULL       | 1875         | 3000  | 800   |
| 30        | NULL       | 1566.666667  | 2850  | 950   |
+-----------+------------+--------------+-------+-------+--+
18 rows selected (25.037 seconds)
Rows to Columns
1,语文,100
1,数学,100
1,英语,100
2,数学,79
2,语文,80
2,英语,100
CREATE TABLE t_student(
    id INT,
    course STRING,
    score double)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;
0: jdbc:hive2://CentOS:10000> select * from t_student;
+---------------+-------------------+------------------+--+
| t_student.id  | t_student.course  | t_student.score  |
+---------------+-------------------+------------------+--+
| 1             | 语文                | 100.0            |
| 1             | 数学                | 100.0            |
| 1             | 英语                | 100.0            |
| 2             | 数学                | 79.0             |
| 2             | 语文                | 80.0             |
| 2             | 英语                | 100.0            |
+---------------+-------------------+------------------+--+
6 rows selected (0.05 seconds)
0: jdbc:hive2://CentOS:10000> select id,max(case course when '语文' then score else 0 end) as chinese,max(case course when '数学' then score else 0 end ) as math,max(case course when '英语' then score else 0 end ) as english from t_student group by id ;

+-----+----------+--------+----------+--+
| id  | chinese  |  math  | english  |
+-----+----------+--------+----------+--+
| 1   | 100.0    | 100.0  | 100.0    |
| 2   | 80.0     | 79.0   | 100.0    |
+-----+----------+--------+----------+--+
2 rows selected (25.617 seconds)

The reverse direction (collapsing each student's rows into one delimited string) uses collect_set with concat_ws:

SELECT id, concat_ws(',', collect_set(concat(course, ':', score))) AS scores FROM t_student GROUP BY id;

Data Skew in Hive

Data skew is one of the most common problems in large-scale computation. When a HiveQL query or MapReduce job sits at map 100%, reduce 99% for a long time, it has usually hit data skew. In a distributed computation, some nodes are faster or have little data to process and finish early, while others are slower or receive far more data; the job then waits on those stragglers even though every other node's reduce task has already completed.

Take GROUP BY: when I used Hive for per-type statistics, one type had far more rows than all the others. Grouping by type pulls every row that shares a group-by key onto one node for aggregation, so when one group is much larger than the rest, the other groups finish while that one keeps running, and all other nodes wait on it; this is exactly the persistent map 100% / reduce 99% pattern.

Workaround:

set hive.map.aggr=true
set hive.groupby.skewindata=true

How it works:

hive.map.aggr=true turns on map-side aggregation. hive.groupby.skewindata=true makes the generated query plan contain two MR jobs. In the first job, the map output is distributed randomly across the Reducers, each of which performs a partial aggregation and emits its result; rows with the same GROUP BY key may thus land on different Reducers, which balances the load. The second job then distributes the pre-aggregated results by the GROUP BY key (now guaranteeing that identical keys reach the same Reducer) and completes the final aggregation.
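
The same two-stage idea can also be written by hand by salting the hot key with a random prefix; a sketch, where the table t_log and its type column are hypothetical:

SELECT type, SUM(partial_cnt) AS cnt
FROM (
  -- stage 1: pre-aggregate on (type, random salt) so one hot type
  -- is spread across up to 10 reducers
  SELECT type, salt, COUNT(*) AS partial_cnt
  FROM (SELECT type, CAST(rand() * 10 AS INT) AS salt FROM t_log) s
  GROUP BY type, salt
) t
-- stage 2: final aggregation on the real key
GROUP BY type;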

HBase Integration

CREATE EXTERNAL table t_user_hbase(
         id  string,
         name  string,
         age  int,
         salary  double,
         company string
  )
 STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
 WITH SERDEPROPERTIES('hbase.columns.mapping' =
':key,cf1:name,cf1:age,cf1:salary,cf1:company')
 TBLPROPERTIES('hbase.table.name' = 'baizhi:t_user');
0: jdbc:hive2://CentOS:10000> select avg(salary),max(salary),min(salary),sum(salary) ,company from t_user_hbase group by company;