Hive is a data warehouse platform built on top of Hadoop that provides many features for data warehouse management:
- It defines a SQL-like language, HiveQL, which can be thought of as a mapper from SQL to MapReduce jobs.
- It provides interfaces such as the Hive shell, JDBC/ODBC, and Thrift clients.
An architecture diagram comes first, to give an overall picture.
Installing and Configuring Hive
Installation modes:
(*) Embedded mode: no MySQL needed; Hive's bundled Derby database stores the metadata
(*) Local mode and remote mode: both require MySQL
Preparation:
1. Extract the archive: tar -zxvf apache-hive-2.3.0-bin.tar.gz -C ~/training/
2. Set the environment variables: vi ~/.bash_profile
HIVE_HOME=/root/training/apache-hive-2.3.0-bin
export HIVE_HOME
PATH=$HIVE_HOME/bin:$PATH
export PATH
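To make the variables take effect in the current session (a standard step, implied but not written out above):
source ~/.bash_profile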
The core configuration file is conf/hive-site.xml.
1. Installing Hive in embedded mode
Characteristics: (1) uses the bundled Derby database
(2) supports only a single connection
(3) intended for development and testing
Create hive-site.xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>
<property>
<name>hive.metastore.local</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>file:///root/training/apache-hive-2.3.0-bin/warehouse</value>
</property>
</configuration>
Initialize the metastore:
schematool -dbType derby -initSchema
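As a quick sanity check (assuming HDFS is already running), the Hive CLI should now start against the Derby metastore:
hive
show databases;    -- run inside the Hive CLI; it should list the default database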
2. Installing and configuring MySQL
Install MySQL on the virtual machine:
yum remove mysql-libs
rpm -ivh mysql-community-common-5.7.18-1.el7.x86_64.rpm
rpm -ivh mysql-community-libs-5.7.18-1.el7.x86_64.rpm
rpm -ivh mysql-community-client-5.7.18-1.el7.x86_64.rpm
rpm -ivh mysql-community-server-5.7.18-1.el7.x86_64.rpm
rpm -ivh mysql-community-devel-5.7.18-1.el7.x86_64.rpm (optional, but worth installing; it will be needed later when installing HUE)
Start MySQL: service mysqld start
or: systemctl start mysqld.service
Look up the initial root password: cat /var/log/mysqld.log | grep password
Log in: mysql -uroot -pxxxxx
After logging in, change the password: alter user 'root'@'localhost' identified by 'Welcome_1';
Configure the MySQL database:
Create a new database: create database hive;
Create a new user:
create user 'hiveowner'@'%' identified by 'Welcome_1';
Grant the user privileges:
grant all on hive.* TO 'hiveowner'@'%';
grant all on hive.* TO 'hiveowner'@'localhost' identified by 'Welcome_1';
A free GUI tool: http://www.mysqlfront.de/
3. Local mode and remote mode
Create hive-site.xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveowner</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>Welcome_1</value>
</property>
</configuration>
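Two more steps are needed before this configuration works (standard Hive 2.x procedure; the connector jar name below is only an example and depends on the version you download):
cp mysql-connector-java-5.1.43-bin.jar $HIVE_HOME/lib/    (the MySQL JDBC driver must be on Hive's classpath)
schematool -dbType mysql -initSchema    (creates the metastore tables in the hive database)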
Hive's design idea:
A relational database is used to keep the structural metadata (i.e., the table definitions), while the actual data is stored on HDFS.
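A quick way to see this split (a sketch, assuming the emp1 table created in the next section and the default warehouse location):
desc formatted emp1;    -- the schema and Location shown here come from the metastore
dfs -ls /user/hive/warehouse/emp1;    -- run inside the Hive CLI; lists the actual data files on HDFS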
IV. (Most important) Hive's data model
1. Internal (managed) tables: similar to tables in MySQL or Oracle
The default column delimiter is the '\001' (Ctrl-A) control character.
Example: create an employee table named emp
Sample data: 7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30
create table emp
(empno int,
ename string,
job string,
mgr int,
hiredate string,
sal int,
comm int,
deptno int
);
Create an employee table emp1 and specify comma as the delimiter:
create table emp1
(empno int,
ename string,
job string,
mgr int,
hiredate string,
sal int,
comm int,
deptno int
)row format delimited fields terminated by ',';
Loading data:
insert statements
load statements: like Ctrl+X (cut and paste); loading a file that is already on HDFS moves it into the table's directory, while loading with local copies it from the Linux filesystem
(1) Load data from HDFS: load data inpath '/scott/emp.csv' into table emp;
(2) Load data from the local Linux filesystem: load data local inpath '/root/temp/emp.csv' into table emp1;
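A quick check that the load worked:
select * from emp1 limit 5;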
2. Partitioned tables: improve query performance
Example: create a partitioned table, partitioned by the department number deptno
create table emp_part
(empno int,
ename string,
job string,
mgr int,
hiredate string,
sal int,
comm int
)partitioned by (deptno int)
row format delimited fields terminated by ',';
Insert data into the partitioned table using subqueries:
insert into table emp_part partition(deptno=10) select empno,ename,job,mgr,hiredate,sal,comm from emp1 where deptno=10;
insert into table emp_part partition(deptno=20) select empno,ename,job,mgr,hiredate,sal,comm from emp1 where deptno=20;
insert into table emp_part partition(deptno=30) select empno,ename,job,mgr,hiredate,sal,comm from emp1 where deptno=30;
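Each deptno value becomes its own partition, and a query that filters on deptno only reads that partition (compare the execution plans further down):
show partitions emp_part;
select * from emp_part where deptno=10;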
3. External tables: only the table schema is defined; the data lives in a directory on HDFS
create external table ext_student
(sid int,sname string,age int)
row format delimited fields terminated by ','
location '/students';
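The table simply reflects whatever files exist under /students; for example (the CSV file name here is hypothetical):
hdfs dfs -mkdir -p /students
hdfs dfs -put student01.csv /students
select * from ext_student;    -- run in Hive; returns the rows from student01.csv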
4. Bucketed tables: similar to hash partitioning
Create a bucketed table on the employee's job column:
create table emp_bucket
(empno int,
ename string,
job string,
mgr int,
hiredate string,
sal int,
comm int,
deptno int
)clustered by (job) into 4 buckets
row format delimited fields terminated by ',';
Set the Hive property first: set hive.enforce.bucketing = true;
Insert data:
insert into table emp_bucket select * from emp1;
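Each bucket ends up as a separate file under the table's directory (the path below assumes the default warehouse location):
dfs -ls /user/hive/warehouse/emp_bucket;    -- run inside the Hive CLI; should show 4 bucket files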
5. Views
(*) A view is a virtual table (it stores no data)
(*) Purpose: simplify complex queries
create view view10
as
select * from emp1 where deptno=10;
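The view is then queried like an ordinary table:
select * from view10;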
(*) Note: materialized views, by contrast, can cache data
(*) Note: how do we tell whether performance has actually improved? ----> Key point: relational databases (e.g. Oracle) and Hive approach this the same way
----> by looking at the SQL execution plan
Execution plans in Hive
Query the employees of department 10
Case 1: querying the internal table
explain select * from emp1 where deptno=10;
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: emp1
          Statistics: Num rows: 1 Data size: 629 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (deptno = 10) (type: boolean)
            Statistics: Num rows: 1 Data size: 629 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: int), comm (type: int), 10 (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
              Statistics: Num rows: 1 Data size: 629 Basic stats: COMPLETE Column stats: NONE
              ListSink
Case 2: querying the partitioned table. Unlike Case 1, the plan below contains no Filter Operator: the deptno=10 predicate is resolved by partition pruning, so only the deptno=10 partition directory is read.
explain select * from emp_part where deptno=10;
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: emp_part
          Statistics: Num rows: 3 Data size: 118 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: int), comm (type: int), 10 (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 3 Data size: 118 Basic stats: COMPLETE Column stats: NONE
            ListSink
Hive user-defined functions (UDFs)
- Under the udf package, create a class, here CheckSalaryGrade, that extends UDF and implements an evaluate method containing the business logic (a minimal sketch is shown after these steps)
- Package it into a jar, named myudf.jar here
- Deploy the jar (the custom function):
Add the jar to Hive's classpath:
add jar /root/temp/myudf.jar;
Create an alias for the custom function:
create temporary function checksal as 'udf.CheckSalaryGrade';
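The steps above only outline the flow, so here is a minimal sketch of what the class from the first step could look like (old-style Hive UDF API; the salary thresholds and grade labels are invented for illustration, and only the package, class, and function names match the commands above):
package udf;

import org.apache.hadoop.hive.ql.exec.UDF;

// Minimal sketch: maps a salary to a grade string (the thresholds are made up).
public class CheckSalaryGrade extends UDF {
    // Hive calls evaluate() once per row; its parameter and return types define the function's signature.
    public String evaluate(int salary) {
        if (salary < 1000) {
            return "Grade A";
        } else if (salary < 3000) {
            return "Grade B";
        } else {
            return "Grade C";
        }
    }
}

After add jar and create temporary function, checksal can be used like a built-in function, for example:
select ename, sal, checksal(sal) from emp1;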
This only briefly outlines the UDF workflow; the following article covers it in much more detail: https://blog.csdn.net/HG_Harvey/article/details/77688735
This article explains the underlying principles clearly and is also worth a read: https://www.cnblogs.com/qingyunzong/p/8707885.html