Apache Hive

Background

Hive is a data-warehouse infrastructure built on top of Hadoop. It provides a set of tools for extract-transform-load (ETL) work, and a mechanism for storing, querying, and analyzing large-scale data kept in Hadoop. Hive query execution strictly follows the Hadoop MapReduce job model: Hive translates a user's HiveQL statement through its interpreter into MapReduce jobs and submits them to the Hadoop cluster; Hadoop monitors job execution and returns the results to the user. Hive is not designed for online transaction processing, and it offers neither real-time queries nor row-level updates. Its sweet spot is batch processing over large data sets, for example web log analysis.
(Figure: Hive architecture)
Hive is a data-warehouse tool based on Hadoop: it maps structured data files onto database tables and provides full SQL-style querying by translating SQL statements into MapReduce jobs. Its main advantage is the low learning curve: simple MapReduce-style statistics can be expressed quickly with SQL-like statements, without writing a dedicated MapReduce application, which makes it well suited to the statistical analysis done in data warehouses. Hive defines a simple SQL-like query language called HQL, which lets users who know SQL query the data; the language also lets developers familiar with MapReduce plug in custom Mappers and Reducers to handle complex analysis that the built-in ones cannot.

Architecture

Hive exposes a SQL command interface: users can run offline SQL analysis with the SQL-like Hive Query Language. Unlike a traditional database, however, Hive is only a SQL-parsing layer built on top of Hadoop MapReduce; it does not manage or store the data itself, and the data behind a Hive table is only actually read when a job runs.
(Figure: Hive execution flow)
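
To see this translation in action, EXPLAIN prints the plan Hive compiles a query into, including its map/reduce stages (a minimal sketch; t_employee is one of the sample tables created later in this document):

-- Shows the stage graph the query compiles to (e.g. a map-side TableScan/GroupBy
-- followed by a reduce-side GroupBy) instead of returning query results.
EXPLAIN SELECT deptno, count(*) FROM t_employee GROUP BY deptno;
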
In summary, Hive has the following characteristics:

  • Hive stores its files differently from a relational database: Hive uses Hadoop's HDFS, while a relational database uses the server's local filesystem.

  • Hive's compute model is MapReduce, whereas relational databases use their own purpose-built execution engines.

  • Relational databases are designed for OLTP; Hive is designed for mining massive data sets and has poor real-time performance, so its application scenarios differ greatly from those of relational databases.

  • Hive scales its storage and compute capacity easily, a property inherited from Hadoop; relational databases are far weaker in this respect.

Installing Hive

Hive needs to store table-creation metadata, so a production installation requires a MySQL database service to be installed first (note: the MySQL database must use the Latin1 encoding). Hive is also a tool built on Hadoop's storage and compute, so HDFS and MapReduce must be running before Hive is configured and installed. Finally, the host running Hive must have the HADOOP_HOME environment variable set, so that Hive can discover the location and size of the user's Hadoop cluster.

Installing MariaDB

1. Install MariaDB

Installing via yum is quick and simple: install mariadb-server, which pulls in mariadb as a dependency (one is the server, the other the client).

[root@CentOS ~]# yum install mariadb-server

2. Configure MariaDB

1) After installation, first start the MariaDB service and enable it to start at boot:

[root@CentOS ~]# systemctl status mariadb
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
[root@CentOS ~]# systemctl start mariadb
[root@CentOS ~]# systemctl status mariadb
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)
   Active: active (running) since 四 2020-01-09 13:11:39 CST; 2s ago
  Process: 1582 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
  Process: 1495 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
 Main PID: 1581 (mysqld_safe)
   CGroup: /system.slice/mariadb.service
           ├─1581 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
           └─1743 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql...

1月 09 13:11:37 CentOS mariadb-prepare-db-dir[1495]: MySQL manual for more i...
1月 09 13:11:37 CentOS mariadb-prepare-db-dir[1495]: Please report any probl...
1月 09 13:11:37 CentOS mariadb-prepare-db-dir[1495]: The latest information ...
1月 09 13:11:37 CentOS mariadb-prepare-db-dir[1495]: You can find additional...
1月 09 13:11:37 CentOS mariadb-prepare-db-dir[1495]: http://dev.mysql.com
1月 09 13:11:37 CentOS mariadb-prepare-db-dir[1495]: Consider joining MariaD...
1月 09 13:11:37 CentOS mariadb-prepare-db-dir[1495]: https://mariadb.org/get...
1月 09 13:11:37 CentOS mysqld_safe[1581]: 200109 13:11:37 mysqld_safe Loggi....
1月 09 13:11:37 CentOS mysqld_safe[1581]: 200109 13:11:37 mysqld_safe Start...l
1月 09 13:11:39 CentOS systemd[1]: Started MariaDB database server.
Hint: Some lines were ellipsized, use -l to show in full.
[root@CentOS ~]# systemctl enable mariadb 
Created symlink from /etc/systemd/system/multi-user.target.wants/mariadb.service to /usr/lib/systemd/system/mariadb.service.

2) On a fresh install, run the initial database configuration; the commands are the same as MySQL's. The prompts that appear are shown below:

[root@CentOS ~]# mysql_secure_installation

NOTE: RUNNING ALL PARTS OF THIS SCRIPT IS RECOMMENDED FOR ALL MariaDB
      SERVERS IN PRODUCTION USE!  PLEASE READ EACH STEP CAREFULLY!

In order to log into MariaDB to secure it, we'll need the current
password for the root user.  If you've just installed MariaDB, and
you haven't set the root password yet, the password will be blank,
so you should just press enter here.

Enter current password for root (enter for none): <press Enter>
OK, successfully used password, moving on...

Setting the root password ensures that nobody can log into the MariaDB
root user without the proper authorisation.

Set root password? [Y/n] Y
New password: <enter root>
Re-enter new password: <enter root>
Password updated successfully!
Reloading privilege tables..
 ... Success!


By default, a MariaDB installation has an anonymous user, allowing anyone
to log into MariaDB without having to have a user account created for
them.  This is intended only for testing, and to make the installation
go a bit smoother.  You should remove them before moving into a
production environment.

Remove anonymous users? [Y/n] Y
 ... Success!

Normally, root should only be allowed to connect from 'localhost'.  This
ensures that someone cannot guess at the root password from the network.

Disallow root login remotely? [Y/n] n
 ... skipping.

By default, MariaDB comes with a database named 'test' that anyone can
access.  This is also intended only for testing, and should be removed
before moving into a production environment.

Remove test database and access to it? [Y/n] n
 ... skipping.

Reloading the privilege tables will ensure that all changes made so far
will take effect immediately.

Reload privilege tables now? [Y/n] Y
 ... Success!

Cleaning up...

All done!  If you've completed all of the above steps, your MariaDB
installation should now be secure.

Thanks for using MariaDB!

3) Verify that you can log in, then grant remote access privileges:

[root@CentOS ~]# mysql -u root -proot
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 9
Server version: 5.5.64-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> use user
ERROR 1049 (42000): Unknown database 'user'
MariaDB [(none)]> use mysql
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [mysql]> grant all privileges on *.* to 'root'@'%' identified by 'root' with grant option;
Query OK, 0 rows affected (0.00 sec)

MariaDB [mysql]> flush privileges;
Query OK, 0 rows affected (0.00 sec)

MariaDB [mysql]>

Hive Installation and Configuration

1. Make sure HDFS and YARN are running

2. Make sure the MySQL database is accessible

3. Install and configure Hive

Because the downloaded Apache-Hive-1.2.2.tar.gz release has compatibility problems with HBase 1.2.4, it is recommended to compile Hive from source.

① Building Hive from source

1. Install and configure the Maven environment on CentOS

[root@CentOS ~]# tar -zxf apache-maven-3.6.3-bin.tar.gz -C /usr/
[root@CentOS ~]# vi .bashrc
M2_HOME=/usr/apache-maven-3.6.3
JAVA_HOME=/usr/java/latest
HADOOP_HOME=/usr/hadoop-2.9.2/
HBASE_HOME=/usr/hbase-1.2.4/
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$M2_HOME/bin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
export HADOOP_HOME
export HBASE_HOME
HBASE_CLASSPATH=$(/usr/hbase-1.2.4/bin/hbase classpath)
HADOOP_CLASSPATH=$HBASE_CLASSPATH:/root/mysql-connector-java-5.1.49.jar
export HADOOP_CLASSPATH
export M2_HOME
[root@CentOS ~]# source .bashrc

2. Upload the apache-hive-1.2.2-src.tar.gz source archive and edit pom.xml, changing the value of hbase.hadoop2.version to 1.2.4

3. Then run the following command in the apache-hive-1.2.2-src directory

mvn -Phadoop-2 -Pdist -DskipTests -Dmaven.javadoc.skip=true clean package

4. After the build finishes, the binary package can be found at packaging/target/apache-hive-1.2.2-bin.tar.gz; this resolves the Hive/HBase version compatibility problem!

To speed up the build, configuring the Aliyun Maven mirror is strongly recommended.
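
A commonly used Aliyun mirror entry for Maven's settings.xml (in $M2_HOME/conf/ or ~/.m2/; the id and name values below are arbitrary, a sketch):

<mirror>
  <id>aliyunmaven</id>
  <mirrorOf>central</mirrorOf>
  <name>Aliyun public repository</name>
  <url>https://maven.aliyun.com/repository/public</url>
</mirror>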

② Installing and configuring Hive

1. Unpack Hive and set the HIVE_HOME environment variable

[root@CentOS ~]# tar -zxf apache-hive-1.2.2-bin_编译版本.tar.gz -C /usr/
[root@CentOS ~]# vi .bashrc
M2_HOME=/usr/apache-maven-3.6.3
HIVE_HOME=/usr/apache-hive-1.2.2-bin
JAVA_HOME=/usr/java/latest
HADOOP_HOME=/usr/hadoop-2.9.2/
HBASE_HOME=/usr/hbase-1.2.4/
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$M2_HOME/bin:$HIVE_HOME/bin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
export HADOOP_HOME
export HBASE_HOME
HBASE_CLASSPATH=$(/usr/hbase-1.2.4/bin/hbase classpath)
HADOOP_CLASSPATH=$HBASE_CLASSPATH:/root/mysql-connector-java-5.1.49.jar
export HADOOP_CLASSPATH
export M2_HOME
export HIVE_HOME
[root@CentOS ~]# source .bashrc

2. Create conf/hive-site.xml under the Hive installation directory with the following content

[root@CentOS ~]# vi /usr/apache-hive-1.2.2-bin/conf/hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://CentOS:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
  </property>

</configuration>

3. Copy the MySQL driver jar into Hive's lib directory

[root@CentOS ~]# cp mysql-connector-java-5.1.49.jar /usr/apache-hive-1.2.2-bin/lib/

4. Hive can be started in two modes

  • Local mode / single user

    [root@CentOS ~]# cd /usr/apache-hive-1.2.2-bin/
    [root@CentOS apache-hive-1.2.2-bin]# ./bin/hive
    hive> show databases;
    OK
    default
    Time taken: 1.217 seconds, Fetched: 1 row(s)
    hive> create database baizhi;
    OK
    Time taken: 0.293 seconds
    hive> use baizhi ;
    OK
    Time taken: 0.061 seconds
    


When Hive starts, it reads HADOOP_CLASSPATH. Earlier we added HBase's classpath to HADOOP_CLASSPATH, which makes HBase's jars conflict with Hive's; removing the HBase classpath from HADOOP_CLASSPATH resolves this:

[root@CentOS ~]# vi .bashrc
M2_HOME=/usr/apache-maven-3.6.3
HIVE_HOME=/usr/apache-hive-1.2.2-bin
JAVA_HOME=/usr/java/latest
HADOOP_HOME=/usr/hadoop-2.9.2/
HBASE_HOME=/usr/hbase-1.2.4/
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$M2_HOME/bin:$HIVE_HOME/bin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
export HADOOP_HOME
export HBASE_HOME
HBASE_CLASSPATH=$(/usr/hbase-1.2.4/bin/hbase classpath)
HADOOP_CLASSPATH=/root/mysql-connector-java-5.1.49.jar
export HADOOP_CLASSPATH
export M2_HOME
export HIVE_HOME
[root@CentOS ~]# source .bashrc

Alternatively:

Copy jline-2.12.jar from Hive's lib directory into Hadoop's share/hadoop/yarn/lib/ directory; this also resolves the jar conflict!

The first time you enter the Hive shell, Hive automatically creates 29 tables in the MySQL hive database.
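
You can verify this from the MariaDB side (a sketch; the exact table list varies slightly by Hive version):

MariaDB [(none)]> use hive;
MariaDB [hive]> show tables;   -- metastore tables such as DBS, TBLS, COLUMNS_V2, SDS, PARTITIONS ...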

  • Remote mode / multi-user
[root@CentOS apache-hive-1.2.2-bin]#  ./bin/hiveserver2  >/dev/null 2>&1 &
[3] 26279

[root@CentOS apache-hive-1.2.2-bin]# ./bin/beeline -u jdbc:hive2://CentOS:10000 -n root
Connecting to jdbc:hive2://CentOS:10000
Connected to: Apache Hive (version 1.2.2)
Driver: Hive JDBC (version 1.2.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.2 by Apache Hive
0: jdbc:hive2://CentOS:10000> 

0: jdbc:hive2://CentOS:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| baizhi         |
| default        |
+----------------+--+
2 rows selected (1.951 seconds)
0: jdbc:hive2://CentOS:10000> drop database baizhi;
No rows affected (0.273 seconds)

Table Operations

Hive supports primitive and complex data types. Primitive types include the numeric types, BOOLEAN, strings, and TIMESTAMP; complex types include ARRAY, MAP, and STRUCT. The tables below summarize Hive's data types.

Primitive types

| Type | Description | Literal example |
|------|-------------|-----------------|
| BOOLEAN | Boolean value, true/false | true |
| TINYINT | 1-byte signed integer, -128 to 127 | 1Y |
| SMALLINT | 2-byte signed integer, -32768 to 32767 | 1S |
| INT | 4-byte signed integer | 1 |
| BIGINT | 8-byte signed integer | 1L |
| FLOAT | 4-byte single-precision float | 1.0 |
| DOUBLE | 8-byte double-precision float | 1.0 |
| DECIMAL | arbitrary-precision signed decimal | 1.0 |
| STRING | variable-length string | "Abc" |
| VARCHAR | variable-length string | "Aac" |
| CHAR | fixed-length string | "a", 'b' |
| BINARY | byte array | |
| TIMESTAMP | timestamp, nanosecond precision | 122327493795 |
| DATE | date | '2020-01-01' |
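
The literal suffixes in the table can be tried directly in a query (a minimal sketch):

0: jdbc:hive2://CentOS:10000> select 1Y as t, 1S as s, 1L as b, cast('2020-01-01' as date) as d;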

Complex types

| Type | Description | Literal example |
|------|-------------|-----------------|
| ARRAY | ordered collection of elements of the same type | array(1,2) |
| MAP | key-value pairs; keys must be a primitive type, values may be any type | map('a',1,'b',2) |
| STRUCT | a collection of fields whose types may differ | struct('1',1,1.0), named_struct('col1','1','col2',1,'col3',1.0) |
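
Likewise, the constructor functions above can be exercised directly (a sketch):

0: jdbc:hive2://CentOS:10000> select array(1,2)[0], map('a',1,'b',2)['b'], named_struct('col1','1','col2',2).col2;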

Examples

① Create a database

0: jdbc:hive2://CentOS:10000> create database baizhi;
No rows affected (0.324 seconds)
0: jdbc:hive2://CentOS:10000> desc database baizhi;


② Use a database

0: jdbc:hive2://CentOS:10000> use baizhi;
No rows affected (0.092 seconds)
0: jdbc:hive2://CentOS:10000> select current_database();
+---------+--+
|   _c0   |
+---------+--+
| baizhi  |
+---------+--+
1 row selected (0.877 seconds)

③ Create a table

create table if not exists t_employee (
  id int,
  name varchar(32),
  age int,
  salary double,
  birthDay date,
  hobbies array<string>,
  address struct<street:string,country:string>,
  detail map<string,double>
);

④ List all tables

0: jdbc:hive2://CentOS:10000> show tables;
+-------------+--+
|  tab_name   |
+-------------+--+
| t_employee  |
+-------------+--+
1 row selected (0.105 seconds)

⑤ Inspect a table definition

0: jdbc:hive2://CentOS:10000> desc t_employee;
+-----------+---------------------------------------+----------+--+
| col_name  |               data_type               | comment  |
+-----------+---------------------------------------+----------+--+
| id        | int                                   |          |
| name      | varchar(32)                           |          |
| age       | int                                   |          |
| salary    | double                                |          |
| birthday  | date                                  |          |
| hobbies   | array<string>                         |          |
| address   | struct<street:string,country:string>  |          |
| detail    | map<string,double>                    |          |
+-----------+---------------------------------------+----------+--+
8 rows selected (0.318 seconds)

To see fuller details of a table, use desc formatted <table name>.


Data Formats

Default delimiters

With the table created, the next step is preparing data. Hive's default field delimiters are listed below, so the CREATE TABLE statement above has the following equivalent explicit form:

| Delimiter | Description |
|-----------|-------------|
| \n | In a text file, one line is one record, so \n separates records |
| ^A (Ctrl+A) | Separates columns; written as the octal escape '\001' in CREATE TABLE |
| ^B (Ctrl+B) | Separates elements of an ARRAY, STRUCT, or MAP; octal '\002' |
| ^C (Ctrl+C) | Separates keys from values within a MAP; octal '\003' |

create table if not exists t_employee_1 (
  id int,
  name varchar(32),
  age int,
  salary double,
  birthDay date,
  hobbies array<string>,
  address struct<street:string,country:string>,
  detail map<string,double>
)
row format delimited
fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003'
lines terminated by '\n'
stored as textfile;

Prepare input data in this format, then load it into the target table:


load data local inpath '<local path>' overwrite into table <table name>

  • local - read from the local filesystem; if omitted, the path is read from HDFS
  • overwrite - overwrite the data under the table directory; if omitted, the existing data files are kept
0: jdbc:hive2://CentOS:10000> load data local inpath '/root/t_employee' overwrite into table t_employee_1;
INFO  : Loading data to table baizhi.t_employee_1 from file:/root/t_employee
INFO  : Table baizhi.t_employee_1 stats: [numFiles=1, numRows=0, totalSize=175, rawDataSize=0]
No rows affected (0.622 seconds)
Custom delimiters
1,zhangsan,true,18,15000,TV|Game,001>建设|002>招商,china|bj 
2,lisi,true,28,15000,TV|Game,001>建设|002>招商,china|bj 
3,wangwu,false,38,5000,TV|Game,001>建设|002>招商,china|sh
create table if not exists t_user (
  id int,
  name varchar(32),
  sex boolean,
  age int,
  salary double,
  hobbies array<string>,
  card map<string,string>,
  address struct<street:string,country:string>
)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;
Regex-formatted data
192.168.0.1 qq com.xx.xx.XxxService#xx 2018-10-10 10:10:00
192.168.2.1 qq com.xx.xx.XxxService#xx 2018-10-10 10:10:00
192.168.0.1 xx com.xx.xx.XxxService#xx 2018-10-10 10:10:00
192.168.202.1 qq com.xx.xx.XxxService#xx 2018-10-10 10:10:00
create table if not exists t_access(
     ip string,
     app varchar(32),
     service string,
     last_time timestamp
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex"="^(.*)\\s(.*)\\s(.*)\\s(.*\\s.*)"
);
CSV files
1,apple,3,20.5
2,orange,2,21.5
3,pear,2,22.5
4,banana,1,23.0
CREATE TABLE if not exists t_product(
  id int,
  item string,
  count int,
  price double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = ",",
   "escapeChar"    = "\\"
);
JSON files
{"id":1,"name":"zhangsan","sex":true,"register":"1991-02-08","score":100.0}
{"id":2,"name":"lisi","sex":true,"register":"1991-02-08","score":80.0}
create table if not exists t_student1(
   id int,
   name varchar(32),
   sex boolean,
   register date,
   score double
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

This requires adding a jar first: add jar /usr/apache-hive-1.2.2-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.2.jar. Use the delete jar command to remove it again.
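
For example (a sketch of the session commands just described):

0: jdbc:hive2://CentOS:10000> add jar /usr/apache-hive-1.2.2-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.2.jar;
0: jdbc:hive2://CentOS:10000> list jars;
0: jdbc:hive2://CentOS:10000> delete jar /usr/apache-hive-1.2.2-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.2.jar;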

Common Operations

DDL operations

※ Database

1. List databases

0: jdbc:hive2://CentOS:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| test           |
+----------------+--+
2 rows selected (0.441 seconds)

2. Switch the current database

0: jdbc:hive2://CentOS:10000> use test;
No rows affected (0.03 seconds)

3. Create a database

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name   -- DATABASE and SCHEMA are equivalent
  [COMMENT database_comment] -- database comment
  [LOCATION hdfs_path] -- storage location on HDFS
  [WITH DBPROPERTIES (property_name=property_value, ...)]; -- extra properties
0: jdbc:hive2://CentOS:10000> CREATE DATABASE IF NOT EXISTS test COMMENT 'database for test' LOCATION 'hdfs:///userdatabase/test.db' WITH DBPROPERTIES ('creator'='jiangzz');
No rows affected (0.186 seconds)

4. Show database details

DESC DATABASE [EXTENDED] db_name; -- EXTENDED controls whether extra properties are shown


5. Drop a database

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
0: jdbc:hive2://CentOS:10000> drop schema baizhi cascade;
No rows affected (0.653 seconds)

The default behavior is RESTRICT, which fails if the database still contains tables. To drop a database together with its tables, use CASCADE.

6. Show the current database

0: jdbc:hive2://CentOS:10000> select current_database();
+------------+--+
|    _c0     |
+------------+--+
| hive_test  |
+------------+--+
1 row selected (0.041 seconds)
√ Table

1. Creating tables

  • Managed tables - internal tables, also called MANAGED_TABLE; stored under /user/hive/warehouse by default, or at a path given by LOCATION; dropping the table deletes both the data and the metadata.
create table if not exists t_user(
  id int,
  name string,
  sex boolean,
  age int,
  salary double,
  hobbies array<string>,
  card map<string,string>,
  address struct<country:string,city:string>
)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;
  • External tables - called EXTERNAL_TABLE; you can specify the directory (LOCATION) when creating the table; dropping the table deletes only the metadata, not the data.
create external table if not exists t_access(
    ip string,
    app varchar(32),
    service string,
    last_time timestamp
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
 "input.regex"="^(.*)\\s(.*)\\s(.*)\\s(.*\\s.*)"
)
LOCATION 'hdfs:///hive/t_access';
  • Partitioned tables - a Hive table corresponds to a directory on HDFS, and by default a query scans the entire table, which is expensive in both time and performance. Partitioning creates subdirectories of the table directory on HDFS, with data stored per partition. When the query's WHERE clause contains a partition condition, only that partition's directory is read instead of scanning the whole table directory, so a sensible partition design can greatly improve query speed and performance. Partitioned tables are created in Hive with the PARTITIONED BY clause; a table may have one or more partition columns, and a separate data directory is created for each distinct combination of partition-column values.
 CREATE EXTERNAL TABLE t_employee(
    id INT,
    name STRING,
    job STRING,
    manager INT,
    hiredate TIMESTAMP,
    salary DECIMAL(7,2)
    )
    PARTITIONED BY (deptno INT)   
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
    LOCATION '/hive/t_employee';
7369	SMITH	CLERK	7902	1980-12-17 00:00:00	800.00
7499	ALLEN	SALESMAN	7698	1981-02-20 00:00:00	1600.00	300.00
7521	WARD	SALESMAN	7698	1981-02-22 00:00:00	1250.00	500.00
7566	JONES	MANAGER	7839	1981-04-02 00:00:00	2975.00
7654	MARTIN	SALESMAN	7698	1981-09-28 00:00:00	1250.00	1400.00
7698	BLAKE	MANAGER	7839	1981-05-01 00:00:00	2850.00
7782	CLARK	MANAGER	7839	1981-06-09 00:00:00	2450.00
7788	SCOTT	ANALYST	7566	1987-04-19 00:00:00	1500.00
7839	KING	PRESIDENT		1981-11-17 00:00:00	5000.00
7844	TURNER	SALESMAN	7698	1981-09-08 00:00:00	1500.00	0.00
7876	ADAMS	CLERK	7788	1987-05-23 00:00:00	1100.00
7900	JAMES	CLERK	7698	1981-12-03 00:00:00	950.00
7902	FORD	ANALYST	7566	1981-12-03 00:00:00	3000.00
7934	MILLER	CLERK	7782	1982-01-23 00:00:00	1300.00
0: jdbc:hive2://CentOS:10000> load data local inpath '/root/t_employee' overwrite into table t_employee partition(deptno='10');
INFO  : Loading data to table test.t_employee partition (deptno=10) from file:/root/t_employee
INFO  : Partition test.t_employee{deptno=10} stats: [numFiles=1, numRows=0, totalSize=747, rawDataSize=0]
No rows affected (0.539 seconds)

0: jdbc:hive2://CentOS:10000> load data local inpath '/root/t_employee' overwrite into table t_employee partition(deptno='20');
INFO  : Loading data to table test.t_employee partition (deptno=20) from file:/root/t_employee
INFO  : Partition test.t_employee{deptno=20} stats: [numFiles=1, numRows=0, totalSize=747, rawDataSize=0]
No rows affected (0.419 seconds)
0: jdbc:hive2://CentOS:10000>


  • Bucketed tables - partitioning isolates files coarsely into per-partition directories; bucketing instead hashes a chosen column to assign each row to a bucket, and the data within each bucket can additionally be sorted.
 CREATE EXTERNAL TABLE t_employee_bucket(
    id INT,
    name STRING,
    job STRING,
    manager INT,
    hiredate TIMESTAMP,
    salary DECIMAL(7,2),
    deptno INT)
    CLUSTERED BY(job) SORTED BY(salary ASC) INTO 4 BUCKETS  
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
    LOCATION '/hive/employee_bucket';
0: jdbc:hive2://CentOS:10000> set hive.enforce.bucketing = true;
0: jdbc:hive2://CentOS:10000> INSERT INTO TABLE t_employee_bucket SELECT *  FROM t_employee;


Note: SORTED BY(salary ASC) here is not actually implemented in Hive 1.2.2!
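
A practical payoff of bucketing is bucket sampling: TABLESAMPLE can read a single bucket instead of scanning the whole table (a sketch against the table above):

0: jdbc:hive2://CentOS:10000> SELECT * FROM t_employee_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON job);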

  • Temporary tables - visible only to the current session; their data is stored in the user's scratch directory and deleted when the session ends. If a temporary table shares its name with a permanent table, any reference to that name resolves to the temporary table rather than the permanent one. Temporary tables have two further limitations: partition columns are not supported, and indexes cannot be created.
CREATE TEMPORARY TABLE if not exists emp_temp(
    id INT,
    name STRING,
    job STRING,
    manager INT,
    hiredate TIMESTAMP,
    salary DECIMAL(7,2),
    deptno INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
    LOCATION '/hive/emp_temp';
  • CTAS - the created table's schema follows the query output; the source table's structure is not copied.

    0: jdbc:hive2://CentOS:10000> create  TEMPORARY TABLE t_employee_copy1 as select * from t_employee where deptno=10;
    
  • Copying a table structure - copies only the table's structure, without the data.

    0: jdbc:hive2://CentOS:10000> CREATE TEMPORARY EXTERNAL TABLE   t_empoyee_copy2  LIKE t_employee_bucket location '/hive/t_empoyee_copy2';
    No rows affected (0.128 seconds)
    0: jdbc:hive2://CentOS:10000> desc t_empoyee_copy2;
    +-----------+---------------+----------+--+
    | col_name  |   data_type   | comment  |
    +-----------+---------------+----------+--+
    | id        | int           |          |
    | name      | string        |          |
    | job       | string        |          |
    | manager   | int           |          |
    | hiredate  | timestamp     |          |
    | salary    | decimal(7,2)  |          |
    | deptno    | int           |          |
    +-----------+---------------+----------+--+
    7 rows selected (0.038 seconds)
    

2. Altering tables

  • Rename a table
0: jdbc:hive2://CentOS:10000> ALTER TABLE t_user RENAME TO t_u;
  • Modify columns (change type or order, add new ones)
0: jdbc:hive2://CentOS:10000> ALTER TABLE t_employee CHANGE id  eid INT;-- change column name & type
0: jdbc:hive2://CentOS:10000> ALTER TABLE t_employee CHANGE id id decimal(7,2)  AFTER name;-- change column order
0: jdbc:hive2://CentOS:10000> ALTER TABLE t_employee ADD COLUMNS (address STRING);
  • Truncate a table
0: jdbc:hive2://CentOS:10000> truncate table t_employee partition(deptno=10);

Only managed tables can be truncated.

  • Drop a table
0: jdbc:hive2://CentOS:10000> drop table t_employee PURGE;

PURGE deletes the data outright instead of moving it to the trash (using the trash requires HDFS trash collection to be enabled)!

3. Other commands

Describe

  • Describe a database
DESCRIBE|DESC DATABASE [EXTENDED] db_name; 
0: jdbc:hive2://CentOS:10000> desc database hive_test;
+------------+--------------------+------------------------------------------------------+-------------+-------------+-------------+--+
|  db_name   |      comment       |                       location                       | owner_name  | owner_type  | parameters  |
+------------+--------------------+------------------------------------------------------+-------------+-------------+-------------+--+
| hive_test  | database for test  | hdfs://CentOS:9000/user/hive/warehouse/hive_test.db  | root        | USER        |             |
+------------+--------------------+------------------------------------------------------+-------------+-------------+-------------+--+
1 row selected (0.039 seconds)
  • Describe a table
DESCRIBE|DESC [EXTENDED|FORMATTED] table_name 
0: jdbc:hive2://CentOS:10000> desc  t_user;
+-----------+-------------------------------------+----------+--+
| col_name  |              data_type              | comment  |
+-----------+-------------------------------------+----------+--+
| id        | int                                 |          |
| name      | string                              |          |
| sex       | boolean                             |          |
| age       | int                                 |          |
| salary    | double                              |          |
| hobbies   | array<string>                       |          |
| card      | map<string,string>                  |          |
| address   | struct<country:string,city:string>  |          |
+-----------+-------------------------------------+----------+--+
8 rows selected (0.06 seconds)

Show

  • List databases
SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards'];
0: jdbc:hive2://CentOS:10000> show schemas like '*'
0: jdbc:hive2://CentOS:10000> ;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| hive_test      |
| test           |
+----------------+--+
3 rows selected (0.03 seconds)
  • List tables
SHOW TABLES [IN database_name] ['identifier_with_wildcards'];
0: jdbc:hive2://CentOS:10000> show tables;
+--------------------+--+
|      tab_name      |
+--------------------+--+
| t_access           |
| t_employ_copy      |
| t_employee_bucket  |
| t_employee_copy1   |
| t_user             |
+--------------------+--+
5 rows selected (0.054 seconds)
0: jdbc:hive2://CentOS:10000> show tables in test;
+-------------+--+
|  tab_name   |
+-------------+--+
| t_access    |
| t_employee  |
| t_product   |
| t_student   |
| t_user      |
+-------------+--+
5 rows selected (0.043 seconds)
0: jdbc:hive2://CentOS:10000> show tables in test like 't_*';
+-------------+--+
|  tab_name   |
+-------------+--+
| t_access    |
| t_employee  |
| t_product   |
| t_student   |
| t_user      |
+-------------+--+
5 rows selected (0.04 seconds)
  • List partitions
0: jdbc:hive2://CentOS:10000> show partitions t_employee;
+------------+--+
| partition  |
+------------+--+
| deptno=10  |
+------------+--+
1 row selected (0.065 seconds)
  • Show the CREATE TABLE statement
0: jdbc:hive2://CentOS:10000> show create table t_employee;
+-----------------------------------------------------------------+--+
|                         createtab_stmt                          |
+-----------------------------------------------------------------+--+
| CREATE EXTERNAL TABLE `t_employee`(                             |
|   `id` int,                                                     |
|   `name` string,                                                |
|   `job` string,                                                 |
|   `manager` int,                                                |
|   `hiredate` timestamp,                                         |
|   `salary` decimal(7,2))                                        |
| PARTITIONED BY (                                                |
|   `deptno` int)                                                 |
| ROW FORMAT DELIMITED                                            |
|   FIELDS TERMINATED BY '\t'                                     |
| STORED AS INPUTFORMAT                                           |
|   'org.apache.hadoop.mapred.TextInputFormat'                    |
| OUTPUTFORMAT                                                    |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'  |
| LOCATION                                                        |
|   'hdfs://CentOS:9000/hive/t_employee'                          |
| TBLPROPERTIES (                                                 |
|   'transient_lastDdlTime'='1576961129')                         |
+-----------------------------------------------------------------+--+
19 rows selected (0.117 seconds)

For more, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

DML operations

Loading file data into a table
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
0: jdbc:hive2://CentOS:10000> load data local inpath '/root/t_employee' overwrite into table t_employee partition(deptno='10');
Inserting query results into a table
  • Insert query results into a single table
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]]   
select_statement1 FROM from_statement; -- overwrite

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] 
select_statement1 FROM from_statement; -- append
{"EMPNO": 7369,"ENAME": "SMITH","JOB": "CLERK","MGR": 7902,"HIREDATE": "1980-12-17 00:00:00","SAL": 800.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7499,"ENAME": "ALLEN","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-02-20 00:00:00","SAL": 1600.00,"COMM": 300.00,"DEPTNO": 30}
{"EMPNO": 7521,"ENAME": "WARD","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-02-22 00:00:00","SAL": 1250.00,"COMM": 500.00,"DEPTNO": 30}
{"EMPNO": 7566,"ENAME": "JONES","JOB": "MANAGER","MGR": 7839,"HIREDATE": "1981-04-02 00:00:00","SAL": 2975.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7654,"ENAME": "MARTIN","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-09-28 00:00:00","SAL": 1250.00,"COMM": 1400.00,"DEPTNO": 30}
{"EMPNO": 7698,"ENAME": "BLAKE","JOB": "MANAGER","MGR": 7839,"HIREDATE": "1981-05-01 00:00:00","SAL": 2850.00,"COMM": null,"DEPTNO": 30}
{"EMPNO": 7782,"ENAME": "CLARK","JOB": "MANAGER","MGR": 7839,"HIREDATE": "1981-06-09 00:00:00","SAL": 2450.00,"COMM": null,"DEPTNO": 10}
{"EMPNO": 7788,"ENAME": "SCOTT","JOB": "ANALYST","MGR": 7566,"HIREDATE": "1987-04-19 00:00:00","SAL": 1500.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7839,"ENAME": "KING","JOB": "PRESIDENT","MGR": null,"HIREDATE": "1981-11-17 00:00:00","SAL": 5000.00,"COMM": null,"DEPTNO": 10}
{"EMPNO": 7844,"ENAME": "TURNER","JOB": "SALESMAN","MGR": 7698,"HIREDATE": "1981-09-08 00:00:00","SAL": 1500.00,"COMM": 0.00,"DEPTNO": 30}
{"EMPNO": 7876,"ENAME": "ADAMS","JOB": "CLERK","MGR": 7788,"HIREDATE": "1987-05-23 00:00:00","SAL": 1100.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7900,"ENAME": "JAMES","JOB": "CLERK","MGR": 7698,"HIREDATE": "1981-12-03 00:00:00","SAL": 950.00,"COMM": null,"DEPTNO": 30}
{"EMPNO": 7902,"ENAME": "FORD","JOB": "ANALYST","MGR": 7566,"HIREDATE": "1981-12-03 00:00:00","SAL": 3000.00,"COMM": null,"DEPTNO": 20}
{"EMPNO": 7934,"ENAME": "MILLER","JOB": "CLERK","MGR": 7782,"HIREDATE": "1982-01-23 00:00:00","SAL": 1300.00,"COMM": null,"DEPTNO": 10}
0: jdbc:hive2://CentOS:10000> add  jar /usr/apache-hive-1.2.2-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.2.jar

CREATE EXTERNAL TABLE t_emp_json(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2),
    deptno INT)
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
    
0: jdbc:hive2://CentOS:10000> select empno,ename,job,mgr,hiredate,sal,comm,deptno from t_emp_json;

+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| empno  |  ename  |    job     |  mgr  |        hiredate        |  sal  | comm  | deptno  |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| 7369   | SMITH   | CLERK      | 7902  | 1980-12-17 00:00:00.0  | 800   | NULL  | 20      |
| 7499   | ALLEN   | SALESMAN   | 7698  | 1981-02-20 00:00:00.0  | 1600  | 300   | 30      |
| 7521   | WARD    | SALESMAN   | 7698  | 1981-02-22 00:00:00.0  | 1250  | 500   | 30      |
| 7566   | JONES   | MANAGER    | 7839  | 1981-04-02 00:00:00.0  | 2975  | NULL  | 20      |
| 7654   | MARTIN  | SALESMAN   | 7698  | 1981-09-28 00:00:00.0  | 1250  | 1400  | 30      |
| 7698   | BLAKE   | MANAGER    | 7839  | 1981-05-01 00:00:00.0  | 2850  | NULL  | 30      |
| 7782   | CLARK   | MANAGER    | 7839  | 1981-06-09 00:00:00.0  | 2450  | NULL  | 10      |
| 7788   | SCOTT   | ANALYST    | 7566  | 1987-04-19 00:00:00.0  | 1500  | NULL  | 20      |
| 7839   | KING    | PRESIDENT  | NULL  | 1981-11-17 00:00:00.0  | 5000  | NULL  | 10      |
| 7844   | TURNER  | SALESMAN   | 7698  | 1981-09-08 00:00:00.0  | 1500  | 0     | 30      |
| 7876   | ADAMS   | CLERK      | 7788  | 1987-05-23 00:00:00.0  | 1100  | NULL  | 20      |
| 7900   | JAMES   | CLERK      | 7698  | 1981-12-03 00:00:00.0  | 950   | NULL  | 30      |
| 7902   | FORD    | ANALYST    | 7566  | 1981-12-03 00:00:00.0  | 3000  | NULL  | 20      |
| 7934   | MILLER  | CLERK      | 7782  | 1982-01-23 00:00:00.0  | 1300  | NULL  | 10      |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
 CREATE TABLE t_employee(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2)
    )
    PARTITIONED BY (deptno INT)   
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";
0: jdbc:hive2://CentOS:10000> insert overwrite  table t_employee partition(deptno=10) select empno,ename,job,mgr,hiredate,sal,comm from t_emp_json where deptno=10;-- overwrite
0: jdbc:hive2://CentOS:10000> insert into  table t_employee partition(deptno=20) select empno,ename,job,mgr,hiredate,sal,comm from t_emp_json where deptno=20;-- append
  • Insert query results into multiple tables
FROM from_statement
INSERT OVERWRITE TABLE tablename1 
[PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
from t_emp_json 
INSERT OVERWRITE TABLE t_employee partition(deptno=10) select empno,ename,job,mgr,hiredate,sal,comm  where deptno=10 
INSERT OVERWRITE TABLE t_employee partition(deptno=20) select empno,ename,job,mgr,hiredate,sal,comm  where deptno=20 
INSERT OVERWRITE TABLE t_employee partition(deptno=30) select empno,ename,job,mgr,hiredate,sal,comm  where deptno=30 
  • Dynamic partition insert
0: jdbc:hive2://CentOS:10000> set hive.exec.dynamic.partition.mode=nonstrict;
No rows affected (0.004 seconds)
0: jdbc:hive2://CentOS:10000> INSERT OVERWRITE TABLE t_employee PARTITION (deptno)  SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM t_emp_json;

0: jdbc:hive2://CentOS:10000> show partitions t_employee;
+------------+--+
| partition  |
+------------+--+
| deptno=10  |
| deptno=20  |
| deptno=30  |
+------------+--+
3 rows selected (0.064 seconds)
Writing results out to the filesystem
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format] 
  SELECT ... FROM ...
0: jdbc:hive2://CentOS:10000> INSERT OVERWRITE  DIRECTORY '/employee' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE select * from t_employee;

For more, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML

Hive Query Examples

Setup

1. Prepare some test data

-t_employee-
7369,SMITH,CLERK,7902,1980-12-17 00:00:00,800,\N,20
7499,ALLEN,SALESMAN,7698,1981-02-20 00:00:00,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22 00:00:00,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02 00:00:00,2975,\N,20
7654,MARTIN,SALESMAN,7698,1981-09-28 00:00:00,1250,1400,30
7698,BLAKE,MANAGER,7839,1981-05-01 00:00:00,2850,\N,30
7782,CLARK,MANAGER,7839,1981-06-09 00:00:00,2450,\N,10
7788,SCOTT,ANALYST,7566,1987-04-19 00:00:00,1500,\N,20
7839,KING,PRESIDENT,\N,1981-11-17 00:00:00,5000,\N,10
7844,TURNER,SALESMAN,7698,1981-09-08 00:00:00,1500,0,30
7876,ADAMS,CLERK,7788,1987-05-23 00:00:00,1100,\N,20
7900,JAMES,CLERK,7698,1981-12-03 00:00:00,950,\N,30
7902,FORD,ANALYST,7566,1981-12-03 00:00:00,3000,\N,20
7934,MILLER,CLERK,7782,1982-01-23 00:00:00,1300,\N,10
-t_dept-
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON

2. Prepare the script /root/hivescript.sql with the required statements

drop database if exists baizhi cascade;

create database if not exists baizhi;

use baizhi;

CREATE TABLE if not exists t_employee(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2),
    deptno INT)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;

CREATE TABLE  if not exists  t_dept(
    DEPTNO INT,
    DNAME STRING,
    LOC STRING)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;


load data local inpath '/root/t_employee' overwrite  into table  t_employee;
load data local inpath '/root/t_dept' overwrite  into table  t_dept;

3. Run the script

[root@CentOS ~]# hive -f /root/hivescript.sql
To run a SQL statement directly:
[root@CentOS ~]# hive -e '<SQL statement>'

1. Single-table query

0: jdbc:hive2://CentOS:10000> select empno,ename,job,mgr,hiredate,sal,comm,deptno  from t_employee;
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| empno  |  ename  |    job     |  mgr  |        hiredate        |  sal  | comm  | deptno  |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| 7369   | SMITH   | CLERK      | 7902  | 1980-12-17 00:00:00.0  | 800   | NULL  | 20      |
| 7499   | ALLEN   | SALESMAN   | 7698  | 1981-02-20 00:00:00.0  | 1600  | 300   | 30      |
| 7521   | WARD    | SALESMAN   | 7698  | 1981-02-22 00:00:00.0  | 1250  | 500   | 30      |
| 7566   | JONES   | MANAGER    | 7839  | 1981-04-02 00:00:00.0  | 2975  | NULL  | 20      |
| 7654   | MARTIN  | SALESMAN   | 7698  | 1981-09-28 00:00:00.0  | 1250  | 1400  | 30      |
| 7698   | BLAKE   | MANAGER    | 7839  | 1981-05-01 00:00:00.0  | 2850  | NULL  | 30      |
| 7782   | CLARK   | MANAGER    | 7839  | 1981-06-09 00:00:00.0  | 2450  | NULL  | 10      |
| 7788   | SCOTT   | ANALYST    | 7566  | 1987-04-19 00:00:00.0  | 1500  | NULL  | 20      |
| 7839   | KING    | PRESIDENT  | NULL  | 1981-11-17 00:00:00.0  | 5000  | NULL  | 10      |
| 7844   | TURNER  | SALESMAN   | 7698  | 1981-09-08 00:00:00.0  | 1500  | 0     | 30      |
| 7876   | ADAMS   | CLERK      | 7788  | 1987-05-23 00:00:00.0  | 1100  | NULL  | 20      |
| 7900   | JAMES   | CLERK      | 7698  | 1981-12-03 00:00:00.0  | 950   | NULL  | 30      |
| 7902   | FORD    | ANALYST    | 7566  | 1981-12-03 00:00:00.0  | 3000  | NULL  | 20      |
| 7934   | MILLER  | CLERK      | 7782  | 1982-01-23 00:00:00.0  | 1300  | NULL  | 10      |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
14 rows selected (0.056 seconds)

2. WHERE queries

0: jdbc:hive2://CentOS:10000> SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM t_employee WHERE empno > 7782 AND deptno = 10;
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| empno  |  ename  |    job     |  mgr  |        hiredate        |  sal  | comm  | deptno  |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
| 7839   | KING    | PRESIDENT  | NULL  | 1981-11-17 00:00:00.0  | 5000  | NULL  | 10      |
| 7934   | MILLER  | CLERK      | 7782  | 1982-01-23 00:00:00.0  | 1300  | NULL  | 10      |
+--------+---------+------------+-------+------------------------+-------+-------+---------+--+
2 rows selected (0.067 seconds)

3. DISTINCT queries

0: jdbc:hive2://CentOS:10000> select distinct(job) from t_employee;
+------------+--+
|    job     |
+------------+--+
| ANALYST    |
| CLERK      |
| MANAGER    |
| PRESIDENT  |
| SALESMAN   |
+------------+--+

4. LIMIT queries

0: jdbc:hive2://CentOS:10000> SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM t_employee  ORDER BY sal DESC LIMIT 5;
+--------+--------+------------+-------+------------------------+-------+-------+---------+--+
| empno  | ename  |    job     |  mgr  |        hiredate        |  sal  | comm  | deptno  |
+--------+--------+------------+-------+------------------------+-------+-------+---------+--+
| 7839   | KING   | PRESIDENT  | NULL  | 1981-11-17 00:00:00.0  | 5000  | NULL  | 10      |
| 7902   | FORD   | ANALYST    | 7566  | 1981-12-03 00:00:00.0  | 3000  | NULL  | 20      |
| 7566   | JONES  | MANAGER    | 7839  | 1981-04-02 00:00:00.0  | 2975  | NULL  | 20      |
| 7698   | BLAKE  | MANAGER    | 7839  | 1981-05-01 00:00:00.0  | 2850  | NULL  | 30      |
| 7782   | CLARK  | MANAGER    | 7839  | 1981-06-09 00:00:00.0  | 2450  | NULL  | 10      |
+--------+--------+------------+-------+------------------------+-------+-------+---------+--+
5 rows selected (14.294 seconds)

5. GROUP BY queries

0: jdbc:hive2://CentOS:10000>  SELECT deptno,SUM(sal) as total FROM t_employee GROUP BY deptno;
+---------+--------+--+
| deptno  | total  |
+---------+--------+--+
| 10      | 60900  |
| 20      | 69400  |
| 30      | 75200  |
| 107369  | 9100   |
+---------+--------+--+
4 rows selected (73.336 seconds)

Other common aggregates such as max, min, avg, and sum work the same way. hive.map.aggr controls how aggregation runs; its default is false. When set to true, Hive performs a first aggregation pass in the map phase, which improves aggregation efficiency at the cost of more memory.
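
For example, enabling map-side aggregation for the GROUP BY above (a sketch):

0: jdbc:hive2://CentOS:10000> set hive.map.aggr=true;
0: jdbc:hive2://CentOS:10000> SELECT deptno,SUM(sal) as total FROM t_employee GROUP BY deptno;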

6. ORDER BY and SORT BY

Query results can be sorted with ORDER BY or SORT BY. The sort key may be numeric (sorted by value) or a string (sorted lexicographically). The difference: ORDER BY uses a single reducer to sort the entire result set, guaranteeing global order; SORT BY sorts within each reducer, so each reducer's output is ordered but global order is not guaranteed. Because ORDER BY can take a very long time, strict mode (hive.mapred.mode = strict) requires it to be followed by a LIMIT clause.

  • sort by
0: jdbc:hive2://CentOS:10000> set mapreduce.job.reduces=2;
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal from t_employee sort by sal desc;
+--------+---------+-------+--+
| empno  |  ename  |  sal  |
+--------+---------+-------+--+
| 7902   | FORD    | 3000  |
| 7566   | JONES   | 2975  |
| 7844   | TURNER  | 1500  |
| 7788   | SCOTT   | 1500  |
| 7521   | WARD    | 1250  |
| 7654   | MARTIN  | 1250  |
| 7876   | ADAMS   | 1100  |
| 7900   | JAMES   | 950   |
| 7369   | SMITH   | 800   |
| 7839   | KING    | 5000  |
| 7698   | BLAKE   | 2850  |
| 7782   | CLARK   | 2450  |
| 7499   | ALLEN   | 1600  |
| 7934   | MILLER  | 1300  |
+--------+---------+-------+--+
14 rows selected (14.474 seconds)
  • order by
0: jdbc:hive2://CentOS:10000> set mapreduce.job.reduces=3;
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal from t_employee order by sal desc;
+--------+---------+-------+--+
| empno  |  ename  |  sal  |
+--------+---------+-------+--+
| 7839   | KING    | 5000  |
| 7902   | FORD    | 3000  |
| 7566   | JONES   | 2975  |
| 7698   | BLAKE   | 2850  |
| 7782   | CLARK   | 2450  |
| 7499   | ALLEN   | 1600  |
| 7844   | TURNER  | 1500  |
| 7788   | SCOTT   | 1500  |
| 7934   | MILLER  | 1300  |
| 7654   | MARTIN  | 1250  |
| 7521   | WARD    | 1250  |
| 7876   | ADAMS   | 1100  |
| 7900   | JAMES   | 950   |
| 7369   | SMITH   | 800   |
+--------+---------+-------+--+
14 rows selected (13.049 seconds)

If strict mode is enabled, a LIMIT clause must be added:

0: jdbc:hive2://CentOS:10000> set hive.mapred.mode = strict;
No rows affected (0.004 seconds)
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal from t_employee order by sal desc;
Error: Error while compiling statement: FAILED: SemanticException 1:48 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'sal' (state=42000,code=40000)
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal from t_employee order by sal desc limit 5; 
+--------+--------+-------+--+
| empno  | ename  |  sal  |
+--------+--------+-------+--+
| 7839   | KING   | 5000  |
| 7902   | FORD   | 3000  |
| 7566   | JONES  | 2975  |
| 7698   | BLAKE  | 2850  |
| 7782   | CLARK  | 2450  |
+--------+--------+-------+--+
5 rows selected (12.468 seconds)

7. HAVING filters

0: jdbc:hive2://CentOS:10000> SELECT deptno,SUM(sal) total FROM t_employee GROUP BY deptno HAVING total>9000;
+---------+--------+--+
| deptno  | total  |
+---------+--------+--+
| 30      | 9400   |
| 20      | 9375   |
+---------+--------+--+
2 rows selected (91.233 seconds)

8. DISTRIBUTE BY

By default, MapReduce hashes the keys of the map output and distributes them evenly across all reducers. To send rows with the same key to the same reducer for processing, use the DISTRIBUTE BY clause. Note that DISTRIBUTE BY guarantees rows with equal keys land on the same reducer, but it does not guarantee the data on each reducer is sorted.

0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal, deptno  FROM t_employee distribute BY deptno;
+--------+---------+-------+---------+--+
| empno  |  ename  |  sal  | deptno  |
+--------+---------+-------+---------+--+
| 7654   | MARTIN  | 1250  | 30      |
| 7900   | JAMES   | 950   | 30      |
| 7698   | BLAKE   | 2850  | 30      |
| 7521   | WARD    | 1250  | 30      |
| 7844   | TURNER  | 1500  | 30      |
| 7499   | ALLEN   | 1600  | 30      |
| 7934   | MILLER  | 1300  | 10      |
| 7839   | KING    | 5000  | 10      |
| 7782   | CLARK   | 2450  | 10      |
| 7788   | SCOTT   | 1500  | 20      |
| 7566   | JONES   | 2975  | 20      |
| 7876   | ADAMS   | 1100  | 20      |
| 7902   | FORD    | 3000  | 20      |
| 7369   | SMITH   | 800   | 20      |
+--------+---------+-------+---------+--+
14 rows selected (15.504 seconds)
0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal, deptno  FROM t_employee distribute BY deptno sort by sal desc;
+--------+---------+-------+---------+--+
| empno  |  ename  |  sal  | deptno  |
+--------+---------+-------+---------+--+
| 7698   | BLAKE   | 2850  | 30      |
| 7499   | ALLEN   | 1600  | 30      |
| 7844   | TURNER  | 1500  | 30      |
| 7521   | WARD    | 1250  | 30      |
| 7654   | MARTIN  | 1250  | 30      |
| 7900   | JAMES   | 950   | 30      |
| 7839   | KING    | 5000  | 10      |
| 7782   | CLARK   | 2450  | 10      |
| 7934   | MILLER  | 1300  | 10      |
| 7902   | FORD    | 3000  | 20      |
| 7566   | JONES   | 2975  | 20      |
| 7788   | SCOTT   | 1500  | 20      |
| 7876   | ADAMS   | 1100  | 20      |
| 7369   | SMITH   | 800   | 20      |
+--------+---------+-------+---------+--+
14 rows selected (16.528 seconds)

9. CLUSTER BY

If SORT BY and DISTRIBUTE BY specify the same field and the SORT BY order is ASC, the pair can be replaced with CLUSTER BY.

0: jdbc:hive2://CentOS:10000> SELECT empno,ename,sal, deptno  FROM t_employee cluster by deptno;
+--------+---------+-------+---------+--+
| empno  |  ename  |  sal  | deptno  |
+--------+---------+-------+---------+--+
| 7934   | MILLER  | 1300  | 10      |
| 7839   | KING    | 5000  | 10      |
| 7782   | CLARK   | 2450  | 10      |
| 7876   | ADAMS   | 1100  | 20      |
| 7788   | SCOTT   | 1500  | 20      |
| 7369   | SMITH   | 800   | 20      |
| 7566   | JONES   | 2975  | 20      |
| 7902   | FORD    | 3000  | 20      |
| 7844   | TURNER  | 1500  | 30      |
| 7499   | ALLEN   | 1600  | 30      |
| 7698   | BLAKE   | 2850  | 30      |
| 7654   | MARTIN  | 1250  | 30      |
| 7521   | WARD    | 1250  | 30      |
| 7900   | JAMES   | 950   | 30      |
+--------+---------+-------+---------+--+
14 rows selected (25.847 seconds)

10. Joins

Hive supports inner joins, outer joins (left, right, and full), and Cartesian joins, with the same semantics as in traditional databases. One point deserves special emphasis: the join condition must be given with ON, not WHERE; otherwise Hive first computes the Cartesian product and only then filters, which will not produce the result you expect. The sketch below contrasts the two forms.
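
Both statements below are sketches over this section's sample tables:

-- Join condition in ON: rows are matched while the join runs (recommended)
SELECT e.ename, d.dname FROM t_employee e JOIN t_dept d ON e.deptno = d.deptno;
-- Condition only in WHERE: the Cartesian product is built first, then filtered
-- (strict mode rejects this form outright)
SELECT e.ename, d.dname FROM t_employee e JOIN t_dept d WHERE e.deptno = d.deptno;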

  • Inner join
0: jdbc:hive2://CentOS:10000>  SELECT e.empno,e.ename,e.sal,d.dname,d.deptno FROM t_employee e JOIN t_dept d ON e.deptno = d.deptno WHERE e.empno=7369;
+----------+----------+--------+-----------+-----------+--+
| e.empno  | e.ename  | e.sal  |  d.dname  | d.deptno  |
+----------+----------+--------+-----------+-----------+--+
| 7369     | SMITH    | 800    | RESEARCH  | 20        |
+----------+----------+--------+-----------+-----------+--+
1 row selected (10.419 seconds)

  • Outer joins
0: jdbc:hive2://CentOS:10000>  SELECT e.empno,e.ename,e.sal,d.dname,d.deptno FROM t_employee e LEFT OUTER JOIN t_dept d ON e.deptno = d.deptno;
+----------+----------+--------+-------------+-----------+--+
| e.empno  | e.ename  | e.sal  |   d.dname   | d.deptno  |
+----------+----------+--------+-------------+-----------+--+
| 7369     | SMITH    | 800    | RESEARCH    | 20        |
| 7499     | ALLEN    | 1600   | SALES       | 30        |
| 7521     | WARD     | 1250   | SALES       | 30        |
| 7566     | JONES    | 2975   | RESEARCH    | 20        |
| 7654     | MARTIN   | 1250   | SALES       | 30        |
| 7698     | BLAKE    | 2850   | SALES       | 30        |
| 7782     | CLARK    | 2450   | ACCOUNTING  | 10        |
| 7788     | SCOTT    | 1500   | RESEARCH    | 20        |
| 7839     | KING     | 5000   | ACCOUNTING  | 10        |
| 7844     | TURNER   | 1500   | SALES       | 30        |
| 7876     | ADAMS    | 1100   | RESEARCH    | 20        |
| 7900     | JAMES    | 950    | SALES       | 30        |
| 7902     | FORD     | 3000   | RESEARCH    | 20        |
| 7934     | MILLER   | 1300   | ACCOUNTING  | 10        |
+----------+----------+--------+-------------+-----------+--+
14 rows selected (11.424 seconds)
0: jdbc:hive2://CentOS:10000>  SELECT e.empno,e.ename,e.sal,d.dname,d.deptno FROM t_employee e RIGHT OUTER JOIN t_dept d ON e.deptno = d.deptno;
+----------+----------+--------+-------------+-----------+--+
| e.empno  | e.ename  | e.sal  |   d.dname   | d.deptno  |
+----------+----------+--------+-------------+-----------+--+
| 7782     | CLARK    | 2450   | ACCOUNTING  | 10        |
| 7839     | KING     | 5000   | ACCOUNTING  | 10        |
| 7934     | MILLER   | 1300   | ACCOUNTING  | 10        |
| 7369     | SMITH    | 800    | RESEARCH    | 20        |
| 7566     | JONES    | 2975   | RESEARCH    | 20        |
| 7788     | SCOTT    | 1500   | RESEARCH    | 20        |
| 7876     | ADAMS    | 1100   | RESEARCH    | 20        |
| 7902     | FORD     | 3000   | RESEARCH    | 20        |
| 7499     | ALLEN    | 1600   | SALES       | 30        |
| 7521     | WARD     | 1250   | SALES       | 30        |
| 7654     | MARTIN   | 1250   | SALES       | 30        |
| 7698     | BLAKE    | 2850   | SALES       | 30        |
| 7844     | TURNER   | 1500   | SALES       | 30        |
| 7900     | JAMES    | 950    | SALES       | 30        |
| NULL     | NULL     | NULL   | OPERATIONS  | 40        |
+----------+----------+--------+-------------+-----------+--+
15 rows selected (11.063 seconds)
0: jdbc:hive2://CentOS:10000>  SELECT e.empno,e.ename,e.sal,d.dname,d.deptno FROM t_employee e FULL OUTER JOIN t_dept d ON e.deptno = d.deptno;
+----------+----------+--------+-------------+-----------+--+
| e.empno  | e.ename  | e.sal  |   d.dname   | d.deptno  |
+----------+----------+--------+-------------+-----------+--+
| 7934     | MILLER   | 1300   | ACCOUNTING  | 10        |
| 7839     | KING     | 5000   | ACCOUNTING  | 10        |
| 7782     | CLARK    | 2450   | ACCOUNTING  | 10        |
| 7876     | ADAMS    | 1100   | RESEARCH    | 20        |
| 7788     | SCOTT    | 1500   | RESEARCH    | 20        |
| 7369     | SMITH    | 800    | RESEARCH    | 20        |
| 7566     | JONES    | 2975   | RESEARCH    | 20        |
| 7902     | FORD     | 3000   | RESEARCH    | 20        |
| 7844     | TURNER   | 1500   | SALES       | 30        |
| 7499     | ALLEN    | 1600   | SALES       | 30        |
| 7698     | BLAKE    | 2850   | SALES       | 30        |
| 7654     | MARTIN   | 1250   | SALES       | 30        |
| 7521     | WARD     | 1250   | SALES       | 30        |
| 7900     | JAMES    | 950    | SALES       | 30        |
| NULL     | NULL     | NULL   | OPERATIONS  | 40        |
+----------+----------+--------+-------------+-----------+--+
15 rows selected (24.703 seconds)

11. LEFT SEMI JOIN

A LEFT SEMI JOIN is a more efficient implementation of an IN/EXISTS subquery.

  • The table on the right of the JOIN may only be filtered inside the ON clause;
  • The result contains only data from the left table, so only left-table columns can be SELECTed.
0: jdbc:hive2://CentOS:10000> SELECT e.empno,e.ename,e.deptno FROM t_employee e LEFT SEMI JOIN t_dept d ON e.deptno = d.deptno AND d.loc="NEW YORK";
+----------+----------+-----------+--+
| e.empno  | e.ename  | e.deptno  |
+----------+----------+-----------+--+
| 7782     | CLARK    | 10        |
| 7839     | KING     | 10        |
| 7934     | MILLER   | 10        |
+----------+----------+-----------+--+
3 rows selected (10.119 seconds)
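
The semi join above is equivalent to the following IN subquery, which it generally executes more efficiently (a sketch):

0: jdbc:hive2://CentOS:10000> SELECT e.empno,e.ename,e.deptno FROM t_employee e WHERE e.deptno IN (SELECT deptno FROM t_dept WHERE loc='NEW YORK');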

12. JOIN optimization

  • STREAMTABLE

When several tables are joined and every ON clause uses the same join column, Hive optimizes the query so that all the joins run in a single map/reduce job. It also assumes the last table in the query is the largest: while joining each row, it tries to cache the other tables and stream through the last one. Users should therefore order the tables so that their sizes increase from left to right.

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key) JOIN c ON (c.key = b.key)

However, you do not always have to put the largest table last: Hive provides the /*+ STREAMTABLE(t) */ hint to mark the large table explicitly, which avoids the memory problems caused by caching an oversized table. Example:

0: jdbc:hive2://CentOS:10000> SELECT /*+ STREAMTABLE(e) */ e.empno,e.ename,d.dname,d.deptno FROM t_employee e JOIN t_dept d ON e.deptno = d.deptno WHERE job='CLERK';
+----------+----------+-------------+-----------+--+
| e.empno  | e.ename  |   d.dname   | d.deptno  |
+----------+----------+-------------+-----------+--+
| 7369     | SMITH    | RESEARCH    | 20        |
| 7876     | ADAMS    | RESEARCH    | 20        |
| 7900     | JAMES    | SALES       | 30        |
| 7934     | MILLER   | ACCOUNTING  | 10        |
+----------+----------+-------------+-----------+--+
4 rows selected (11.645 seconds)

  • MAPJOIN

If one table in a join is very small, the join can be moved into the map phase. This is the classic very-large-table/very-small-table join problem, and there are two ways to handle it: 1. add a /*+ MAPJOIN(b) */ hint; 2. set hive.optimize.bucketmapjoin = true.

0: jdbc:hive2://CentOS:10000> set hive.optimize.bucketmapjoin = true ;
No rows affected (0.006 seconds)
0: jdbc:hive2://CentOS:10000> SELECT /*+ MAPJOIN(d) */ e.empno, e.ename,d.dname FROM t_employee e  JOIN t_dept d ON d.deptno = e.deptno;
+----------+----------+-------------+--+
| e.empno  | e.ename  |   d.dname   |
+----------+----------+-------------+--+
| 7369     | SMITH    | RESEARCH    |
| 7499     | ALLEN    | SALES       |
| 7521     | WARD     | SALES       |
| 7566     | JONES    | RESEARCH    |
| 7654     | MARTIN   | SALES       |
| 7698     | BLAKE    | SALES       |
| 7782     | CLARK    | ACCOUNTING  |
| 7788     | SCOTT    | RESEARCH    |
| 7839     | KING     | ACCOUNTING  |
| 7844     | TURNER   | SALES       |
| 7876     | ADAMS    | RESEARCH    |
| 7900     | JAMES    | SALES       |
| 7902     | FORD     | RESEARCH    |
| 7934     | MILLER   | ACCOUNTING  |
+----------+----------+-------------+--+
14 rows selected (11.416 seconds)

13. Window functions

0: jdbc:hive2://CentOS:10000> select e.empno ,e.ename,e.sal,e.deptno,rank() over(partition by e.deptno order by e.sal desc) as rank from t_employee e; 
+----------+----------+--------+-----------+-------+--+
| e.empno  | e.ename  | e.sal  | e.deptno  | rank  |
+----------+----------+--------+-----------+-------+--+
| 7839     | KING     | 5000   | 10        | 1     |
| 7782     | CLARK    | 2450   | 10        | 2     |
| 7934     | MILLER   | 1300   | 10        | 3     |
| 7902     | FORD     | 3000   | 20        | 1     |
| 7566     | JONES    | 2975   | 20        | 2     |
| 7788     | SCOTT    | 1500   | 20        | 3     |
| 7876     | ADAMS    | 1100   | 20        | 4     |
| 7369     | SMITH    | 800    | 20        | 5     |
| 7698     | BLAKE    | 2850   | 30        | 1     |
| 7499     | ALLEN    | 1600   | 30        | 2     |
| 7844     | TURNER   | 1500   | 30        | 3     |
| 7654     | MARTIN   | 1250   | 30        | 4     |
| 7521     | WARD     | 1250   | 30        | 4     |
| 7900     | JAMES    | 950    | 30        | 6     |
+----------+----------+--------+-----------+-------+--+
0: jdbc:hive2://CentOS:10000> select e.empno ,e.ename,e.sal,e.deptno,dense_rank() over(partition by e.deptno order by e.sal desc) as rank from t_employee e; 
+----------+----------+--------+-----------+-------+--+
| e.empno  | e.ename  | e.sal  | e.deptno  | rank  |
+----------+----------+--------+-----------+-------+--+
| 7839     | KING     | 5000   | 10        | 1     |
| 7782     | CLARK    | 2450   | 10        | 2     |
| 7934     | MILLER   | 1300   | 10        | 3     |
| 7902     | FORD     | 3000   | 20        | 1     |
| 7566     | JONES    | 2975   | 20        | 2     |
| 7788     | SCOTT    | 1500   | 20        | 3     |
| 7876     | ADAMS    | 1100   | 20        | 4     |
| 7369     | SMITH    | 800    | 20        | 5     |
| 7698     | BLAKE    | 2850   | 30        | 1     |
| 7499     | ALLEN    | 1600   | 30        | 2     |
| 7844     | TURNER   | 1500   | 30        | 3     |
| 7654     | MARTIN   | 1250   | 30        | 4     |
| 7521     | WARD     | 1250   | 30        | 4     |
| 7900     | JAMES    | 950    | 30        | 5     |
+----------+----------+--------+-----------+-------+--+
14 rows selected (24.262 seconds)

14. CUBE analysis

0: jdbc:hive2://CentOS:10000> select e.deptno,e.job,avg(e.sal) avg,max(e.sal) max,min(e.sal) min from t_employee e group by e.deptno,e.job with cube;
+-----------+------------+--------------+-------+-------+--+
| e.deptno  |   e.job    |     avg      |  max  |  min  |
+-----------+------------+--------------+-------+-------+--+
| NULL      | ANALYST    | 2250         | 3000  | 1500  |
| 10        | CLERK      | 1300         | 1300  | 1300  |
| 20        | CLERK      | 950          | 1100  | 800   |
| 30        | CLERK      | 950          | 950   | 950   |
| 20        | ANALYST    | 2250         | 3000  | 1500  |
| NULL      | PRESIDENT  | 5000         | 5000  | 5000  |
| 10        | PRESIDENT  | 5000         | 5000  | 5000  |
| NULL      | SALESMAN   | 1400         | 1600  | 1250  |
| NULL      | MANAGER    | 2758.333333  | 2975  | 2450  |
| 30        | SALESMAN   | 1400         | 1600  | 1250  |
| 10        | MANAGER    | 2450         | 2450  | 2450  |
| 20        | MANAGER    | 2975         | 2975  | 2975  |
| 30        | MANAGER    | 2850         | 2850  | 2850  |
| NULL      | NULL       | 1966.071429  | 5000  | 800   |
| NULL      | CLERK      | 1037.5       | 1300  | 800   |
| 10        | NULL       | 2916.666667  | 5000  | 1300  |
| 20        | NULL       | 1875         | 3000  | 800   |
| 30        | NULL       | 1566.666667  | 2850  | 950   |
+-----------+------------+--------------+-------+-------+--+
18 rows selected (25.037 seconds)
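
The WITH CUBE query above is equivalent to enumerating all 2^n grouping combinations explicitly with GROUPING SETS; a sketch for illustration, not from the original transcript:

select e.deptno, e.job, avg(e.sal) avg, max(e.sal) max, min(e.sal) min
from t_employee e
group by e.deptno, e.job
grouping sets ((e.deptno, e.job), (e.deptno), (e.job), ());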

15. Rows to Columns (Pivot)

Each raw row below is a (student id, course, score) triple; the goal is one output row per student with one column per course.

1,语文,100
1,数学,100
1,英语,100
2,数学,79
2,语文,80
2,英语,100
Create a table whose fields match the comma-separated file:

CREATE TABLE t_student(
    id INT,
    course STRING,
    score double)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;
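
Assuming the six sample rows above are saved to a local file such as /root/student.csv (the path is illustrative), they can be loaded with:

load data local inpath '/root/student.csv' overwrite into table t_student;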
0: jdbc:hive2://CentOS:10000> select id,max(case course when '语文' then score else 0 end) as chinese,max(case course when '数学' then score else 0 end ) as math,max(case course when '英语' then score else 0 end ) as english from t_student group by id ;

+-----+----------+--------+----------+--+
| id  | chinese  |  math  | english  |
+-----+----------+--------+----------+--+
| 1   | 100.0    | 100.0  | 100.0    |
| 2   | 80.0     | 79.0   | 100.0    |
+-----+----------+--------+----------+--+
2 rows selected (25.617 seconds)

Hive HBase Integration

create external table t_employee_hive(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2),
    deptno INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES("hbase.columns.mapping" = ":key,cf1:name,cf1:job,cf1:mgr,cf1:hiredate,cf1:sal,cf1:comm,cf1:deptno") 
TBLPROPERTIES("hbase.table.name" = "baizhi:t_employee");

This maps the baizhi:t_employee table in HBase onto the Hive table t_employee_hive: the HBase row key is bound to empno and the remaining columns to qualifiers in column family cf1. Rows inserted through the Hive table are written directly into HBase:

0: jdbc:hive2://CentOS:10000> insert into table t_employee_hive select empno,ename,job,mgr,hiredate,sal,comm,deptno from t_employee;

Note that seamless integration requires a Hive build that is compatible with your HBase version. The author runs a self-compiled Hive, so the statement above works as-is; if you run into integration problems, see https://blog.csdn.net/weixin_38231448/article/details/103935613.
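
As a quick sanity check (not part of the original transcript), a few rows can be read back through the mapped table; the same data is now also visible from the HBase shell:

select empno, ename, job, sal from t_employee_hive limit 3;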

JDBC Integration with Hive

1. Add the Hive JDBC dependencies

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.2.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.9.2</version>
</dependency>

2. Test with the standard JDBC API (HiveServer2 must be running and reachable, here at CentOS:10000)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HiveJdbcTest {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection("jdbc:hive2://CentOS:10000", "root", "");
        PreparedStatement pstm = conn.prepareStatement(
                "select empno,ename,job,rank() over (partition by deptno order by sal desc) rank from baizhi.t_employee_hive");

        ResultSet resultSet = pstm.executeQuery();
        while (resultSet.next()) {
            Integer empno = resultSet.getInt("empno");
            String name = resultSet.getString("ename");
            String job = resultSet.getString("job");
            Integer rank = resultSet.getInt("rank");
            System.out.println(empno + "\t" + name + "\t" + job + "\t" + rank);
        }
        resultSet.close();
        pstm.close();
        conn.close();
    }
}

※ Custom UDFs

A UDF is just an ordinary Java method: after Hive compiles a query into a MapReduce program, the method is invoked during execution, much like a plugin added to the MapReduce pipeline, which makes Hive easy to extend. Besides its built-in functions, Hive exposes the following extension interfaces:

  • UDF: operates on a single row and returns a single value; extend UDF
  • UDTF: consumes one row and produces multiple rows or multiple columns; extend GenericUDTF
  • UDAF: consumes multiple rows and produces a single row, mainly for aggregation; extend AbstractGenericUDAFResolver

Since Hive already ships with a rich set of built-in aggregate functions, this article only walks through UDF and UDTF implementations.

1. Add the dependencies for custom functions

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.9.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.2</version>
</dependency>

UDF Definition

import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * Concatenates two columns. Hive resolves UDF methods by name,
 * so the method must be called evaluate.
 */
public class UserDefineUDF extends UDF {

    public String evaluate(Object col1, Object col2) {
        // String.valueOf guards against NULL columns, which would
        // otherwise throw a NullPointerException
        return String.valueOf(col1) + String.valueOf(col2);
    }
}

Package the code as a jar, then register the function with CREATE [TEMPORARY] FUNCTION <function name> AS '<fully qualified class name>':

0: jdbc:hive2://CentOS:10000> add jar /root/Hive-1.0-SNAPSHOT.jar;
INFO  : Added [/root/Hive-1.0-SNAPSHOT.jar] to class path
INFO  : Added resources: [/root/Hive-1.0-SNAPSHOT.jar]
No rows affected (0.02 seconds)


0: jdbc:hive2://CentOS:10000> create TEMPORARY function userconcat  as 'com.baizhi.UserDefineUDF';
No rows affected (0.043 seconds)

0: jdbc:hive2://CentOS:10000> select userconcat(empno,ename) ,sal from t_employee;
+-------------+-------+--+
|     _c0     |  sal  |
+-------------+-------+--+
| 7369SMITH   | 800   |
| 7499ALLEN   | 1600  |
| 7521WARD    | 1250  |
| 7566JONES   | 2975  |
| 7654MARTIN  | 1250  |
| 7698BLAKE   | 2850  |
| 7782CLARK   | 2450  |
| 7788SCOTT   | 1500  |
| 7839KING    | 5000  |
| 7844TURNER  | 1500  |
| 7876ADAMS   | 1100  |
| 7900JAMES   | 950   |
| 7902FORD    | 3000  |
| 7934MILLER  | 1300  |
+-------------+-------+--+
14 rows selected (0.179 seconds)
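
A TEMPORARY function disappears when the session ends. Since Hive 0.13 a function can also be registered permanently from a jar on HDFS; a sketch, where the HDFS path is hypothetical:

CREATE FUNCTION baizhi.userconcat AS 'com.baizhi.UserDefineUDF' USING JAR 'hdfs:///libs/Hive-1.0-SNAPSHOT.jar';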

UDTF

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class UserDefineUDTF extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        // Declare the three output columns: id INT, name STRING, sex BOOLEAN
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();

        fieldOIs.add(PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(PrimitiveObjectInspector.PrimitiveCategory.INT));
        fieldOIs.add(PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(PrimitiveObjectInspector.PrimitiveCategory.STRING));
        fieldOIs.add(PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(PrimitiveObjectInspector.PrimitiveCategory.BOOLEAN));

        return ObjectInspectorFactory.getStandardStructObjectInspector(Arrays.asList("id","name","sex"), fieldOIs);
    }

    /**
     * Called once per input row; objects holds the call arguments,
     * e.g. ['1,zhangsan,true', ','] - the line to split and the separator.
     */
    @Override
    public void process(Object[] objects) throws HiveException {
        String value = objects[0].toString();
        String sep = objects[1].toString();

        String[] tokens = value.split(sep);

        Integer id = Integer.parseInt(tokens[0]);
        String name = tokens[1];
        Boolean sex = Boolean.parseBoolean(tokens[2]);
        // Emit one output row
        this.forward(Arrays.asList(id, name, sex));
    }

    @Override
    public void close() throws HiveException {
        // no resources to release
    }
}

Package the code as a jar and register it, again with CREATE [TEMPORARY] FUNCTION <function name> AS '<fully qualified class name>':

0: jdbc:hive2://CentOS:10000> create TEMPORARY function usercolexplode  as 'com.baizhi.UserDefineUDTF';
No rows affected (0.021 seconds)
0: jdbc:hive2://CentOS:10000> select usercolexplode('1,zhangsan,true',',');
+-----+-----------+-------+--+
| id  |   name    |  sex  |
+-----+-----------+-------+--+
| 1   | zhangsan  | true  |
+-----+-----------+-------+--+
1 row selected (0.118 seconds)
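
In practice a UDTF is usually applied to every row of a table through LATERAL VIEW. A sketch, assuming a hypothetical table t_lines with a single STRING column named line holding values such as '1,zhangsan,true':

select t.id, t.name, t.sex
from t_lines
lateral view usercolexplode(line, ',') t as id, name, sex;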