Hive Study Notes

Article link: https://blog.csdn.net/goldlone/article/details/82531684

Hive

0. Introduction

The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.

1. Environment setup

1.1 Hive 1.X

Prerequisite: Hive needs a working Hadoop environment.
Pick a mirror (http://www.apache.org/dyn/closer.cgi/hive/) and download apache-hive-1.2.2-bin.tar.gz
Unpack it: tar -zxvf apache-hive-1.2.2-bin.tar.gz
Edit the configuration files

# Create hive-env.sh from the template
cp conf/hive-env.sh.template conf/hive-env.sh
vim conf/hive-env.sh
# ------
# Uncomment and set two entries: HADOOP_HOME and HIVE_CONF_DIR
HADOOP_HOME=/hadoop/hadoop-2.6.5 # your Hadoop installation directory
export HIVE_CONF_DIR=/hadoop/hive-1.2.2/conf # <Hive installation directory>/conf

hadoop fs -mkdir /tmp
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse

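Start Hive: bin/hive

Possible error

Terminal initialization failed; falling back to unsupported

Cause: the old jline-0.9.94.jar shipped with YARN is incompatible with Hive's jline-2.12.jar

# Back up YARN's jline
mv ../hadoop-2.6.5/share/hadoop/yarn/lib/jline-0.9.94.jar ../hadoop-2.6.5/share/hadoop/yarn/lib/jline-0.9.94.jar.bak
# Copy Hive's jline into YARN's lib directory
cp lib/jline-2.12.jar ../hadoop-2.6.5/share/hadoop/yarn/lib/

By default Hive manages its metadata with Derby and stores it in a metastore_db directory created in whatever directory Hive was started from. This has two drawbacks: several users cannot start Hive at the same time, and starting Hive from a different directory appears to lose the metadata (in fact a new, empty metastore_db is created there).

Using MySQL to hold the Hive metadata solves both problems.

In the conf directory: cp hive-default.xml.template hive-site.xml
Edit hive-site.xml: delete everything except <configuration></configuration>, then add the following inside it: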

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hh1:3306/hivemeta?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>root</value>
</property>

Copy the MySQL JDBC driver jar into hive/lib
Start Hive and check that the configuration works

1.2 Hive 2.X

To be added

1.3 Some reference settings

<property>
  <name>hive.cli.print.header</name>
  <value>true</value>
  <description>Whether to print column headers with query results</description>
</property>
<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
  <description>Whether the hive shell prompt shows the current database (for example: hive (test)>)</description>
</property>
<property>
  <name>hive.exec.reducers.max</name>
  <value>1009</value>
  <description>Maximum number of reducers that may be used</description>
</property>
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>256000000</value>
  <description>Amount of data handled by each reducer; the default is about 256 MB (256000000 bytes)</description>
</property>

<!-- Lets Hive answer simple queries with a fetch task instead of launching an MR job -->
<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Expects one of [none, minimal, more].
    Some select queries can be converted to single FETCH task minimizing latency.
    Currently the query should be single sourced not having any subquery and should not have
    any aggregations or distincts (which incurs RS), lateral views and joins.
    0. none : disable hive.fetch.task.conversion
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
  </description>
</property>
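To make the two reducer settings above (hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max) more concrete, here is a minimal Java sketch of how they interact: the input size is divided by the bytes-per-reducer value, rounded up, and capped at the maximum. The class and method names are hypothetical, and the formula is a simplification; Hive's real planner also considers other factors, such as an explicitly set mapreduce.job.reduces.

```java
// Hypothetical helper, not part of Hive: illustrates how
// hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max interact.
class ReducerEstimate {

    // Rough estimate: ceil(inputBytes / bytesPerReducer), clamped to [1, maxReducers].
    static int estimate(long inputBytes, long bytesPerReducer, int maxReducers) {
        long needed = (inputBytes + bytesPerReducer - 1) / bytesPerReducer; // ceiling division
        needed = Math.max(1, needed);
        return (int) Math.min(needed, maxReducers);
    }

    public static void main(String[] args) {
        long bytesPerReducer = 256_000_000L; // value from the settings above
        int maxReducers = 1009;              // value from the settings above
        // About 1 GB of input needs only a handful of reducers.
        System.out.println(estimate(1_000_000_000L, bytesPerReducer, maxReducers));
        // About 10 TB of input would need far more than the cap allows.
        System.out.println(estimate(10_000_000_000_000L, bytesPerReducer, maxReducers));
    }
}
```

With the values shown, roughly 1 GB of input yields 4 reducers, while 10 TB hits the 1009 cap.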

1.4 Logging

Copy hive-log4j.properties.template and rename the copy to hive-log4j.properties
Set the log directory: hive.log.dir=/hadoop/hive-1.2.2/logs

1.5 Connecting to Hive with beeline (nicer output formatting)

Start hiveserver2: bin/hiveserver2
Start beeline: bin/beeline -u jdbc:hive2://hostname:10000 (with this setup no user name or password is needed)

-u: URL
-n: user name
-p: password

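2. How Hive works

2.1 Databases

Creating a database in Hive:
1. records it in the Hive metastore
2. creates a directory named "<db_name>.db" under the default warehouse path on HDFS (/user/hive/warehouse/)

2.2 Tables

Creating a table in Hive:
1. records it in the Hive metastore
2. creates a directory named after the table under the database directory on HDFS (/user/hive/warehouse/<db_name>.db)
Differences between managed (internal) tables and external tables:
1. A managed table needs no location at creation time; it defaults to /user/hive/warehouse/<db_name>.db. An external table is normally created with an explicit LOCATION.
2. Dropping a managed table deletes its data directory; dropping an external table removes only the metadata and leaves the data directory in place.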

3. Common operations

Reference wiki: https://cwiki.apache.org/confluence/display/Hive/Home

3.1 Databases

Create a database

create database db_test;

Switch to a database

use db_test;

Drop a database

drop database db_test;

3.2 Tables

Create a table

-- 1. Managed table (the default)
--    A partition column must not repeat a column from the table's own column list
create table tab_stu(id int, name string)
  comment 'this is a stu table'
  partitioned by (school string)
  row format delimited
    fields terminated by '\t'
  stored as textfile;

-- 2. External table (stored at the given location; nothing is created under Hive's warehouse)
CREATE EXTERNAL TABLE tab_ip_ext(id int, name string, ip STRING, country STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/external/user';

-- 3. Create a table from a query result (CTAS), typically to hold intermediate data
CREATE TABLE tab_ip_ctas
AS
SELECT id new_id, name new_name, ip new_ip,country new_country
FROM tab_ip_ext
SORT BY new_id;

-- 4. Copy a table's structure
create table stu_like like stu;

-- 5. Array columns
create table tab_array(a array<int>,b array<string>)
  row format delimited
  fields terminated by '\t'
  collection items terminated by ',';
-- Sample data: 1,2,3,4   hh1,hh2,hh3,hh4

-- 6. Map columns
create table tab_map(name string, info map<string,string>)
  row format delimited
  fields terminated by '\t'
  collection items terminated by ','
  map keys terminated by ':';
-- Sample data: name1   ext1:1,ext2:2,ext3:3

-- 7. Struct columns
create table tab_struct(name string,info struct<age:int,tel:string,addr:string>)
  row format delimited
  fields terminated by '\t'
  collection items terminated by ',';
-- Sample data: name1   18,10086,beijing

Alter a table

alter table tab_name change id id_alter string;
alter table tab_name add partition(partCol='dt') location '/external/hive/dt';

Drop a table

drop table tab_name;

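3.3 Data manipulation

3.3.1 Loading data

-- 1. Load from the local file system (the file is uploaded into the Hive warehouse)
load data local inpath '/home/hadoop/ip.txt' into table tab_ext;

-- 2. Load from HDFS (the file is moved into the Hive warehouse)
load data inpath 'hdfs://ns1/aa/bb/data.log' into table tab_user;

-- 3. Load into a specific partition (write "overwrite into table" to replace the existing contents)
load data local inpath '/home/hadoop/data.log' into table tab_ip_part partition(year='1990');

-- 4. Bulk insert from a select query (overwrite replaces the existing contents; use "insert into table" to append)
insert overwrite table tab_ip_like
select * from tab_ip;

3.3.2 Exporting data

-- 1. Write the query result to a local directory, replacing its contents (the path is treated as a directory even though it ends in .txt)
insert overwrite local directory '/home/hadoop/hivetemp/test.txt' select * from tab_ip_part where part_flag='part1';
-- 2. Write the query result to an HDFS directory, replacing its contents
insert overwrite directory '/hiveout.txt' select * from tab_ip_part where part_flag='part1';
-- 3. Export table data
export table table_name to 'hdfs_path';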

3.3.3 Queries

Query a single partition

select * from tab_ip_part  where part_flag='part2';

Query array columns

select a[0], b[1] from tab_array;

Query map columns

select name, info['age'] from tab_map;

Query struct columns

select name, info.age, info.tel, info.addr from tab_struct;

Four ways to sort

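-- Set the number of reducers so that the effect of the per-reducer clauses is visible
--     set mapreduce.job.reduces=3;
-- Table: AA(id int, name string)
-- 1. order by (global ordering; all rows go through a single reducer)
select * from AA order by id;

-- 2. sort by (sorts within each reducer only; ascending by default)
select * from AA sort by id desc;

-- 3. distribute by (decides which reducer each row goes to; it does not sort by itself)
select * from AA distribute by id;

-- 4. cluster by (shorthand for distribute by plus sort by on the same column; ascending only)
insert overwrite local directory '/datas/sort4/'
select * from AA cluster by id;

-- 5. sort by combined with distribute by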
set mapreduce.job.reduces=3;

insert overwrite local directory '/datas/sort5/' 
row format delimited fields terminated by '\t' 
select * from AA distribute by AA.id sort by AA.name;
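
The last query above can be pictured as two steps: each row is first routed to a reducer based on id, then every reducer sorts only its own rows by name. The Java sketch below simulates this in memory. It is not Hive code; the class names are hypothetical, and the routing rule (a non-negative hash modulo the reducer count) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical simulation, not Hive code: mimics "distribute by id sort by name"
// for rows of a table AA(id int, name string).
class DistributeSortDemo {

    static class Row {
        final int id;
        final String name;
        Row(int id, String name) { this.id = id; this.name = name; }
        @Override public String toString() { return id + ":" + name; }
    }

    // Routes each row to one of `reducers` groups by id, then sorts each group by name.
    static List<List<Row>> distributeAndSort(List<Row> rows, int reducers) {
        List<List<Row>> outputs = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            outputs.add(new ArrayList<>());
        }
        for (Row r : rows) {
            // Assumption: a simple non-negative hash modulo the reducer count.
            int target = (Integer.hashCode(r.id) & Integer.MAX_VALUE) % reducers;
            outputs.get(target).add(r);
        }
        for (List<Row> out : outputs) {
            // sort by: ordering holds inside one reducer's output only, not globally.
            out.sort(Comparator.comparing((Row r) -> r.name));
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(1, "tom"));
        rows.add(new Row(4, "amy"));
        rows.add(new Row(2, "bob"));
        rows.add(new Row(7, "al"));
        List<List<Row>> outputs = distributeAndSort(rows, 3);
        for (int i = 0; i < outputs.size(); i++) {
            System.out.println("reducer " + i + ": " + outputs.get(i));
        }
    }
}
```

Each inner list corresponds to one output file under '/datas/sort5/': rows with the same id always end up in the same file, and each file is ordered by name.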

Inner and outer joins

-- Tables: A, B(id int, name string)
-- 1. Inner join
select A.id,A.name,B.id,B.name from A inner join B on A.id=B.id;
-- 2. Left outer join
select A.id,A.name,B.id,B.name from A left join B on A.id=B.id;
-- 3. Right outer join
select A.id,A.name,B.id,B.name from A right join B on A.id=B.id;
-- 4. Full outer join
select A.id,A.name,B.id,B.name from A full join B on A.id=B.id;

3.3.4 Deleting

Done with overwrite: only the rows that satisfy condition xxx are kept

insert overwrite table tab_name
select * from tab_name where xxx;

Remove all rows from a table

truncate table table_name;

3.3.5 Virtual columns

Hive 0.8.0 provides support for two virtual columns:

INPUT__FILE__NAME: which is the input file’s name for a mapper task.
BLOCK__OFFSET__INSIDE__FILE: which is the current global file position.

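3.4 Partitions

Create a partitioned table (a directory per partition value is created under the table directory)

-- The partition column must not repeat a column from the table's own column list
create table stu_part(id int, name string)
  partitioned by (class int)
  row format delimited
  fields terminated by '\t';

List partitions

show partitions stu_part;

Storage layout

(Figure: Hive partition directories)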
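
As a rough illustration of the partition layout, the Java sketch below builds the HDFS directory in which the rows of one partition of stu_part would live. The helper is hypothetical and not a Hive API, the database name db_test is assumed, and tables in the default database sit directly under the warehouse directory instead of under a "<db_name>.db" directory.

```java
// Hypothetical helper, not a Hive API: builds the HDFS directory that holds
// one partition, following the <warehouse>/<db>.db/<table>/<col>=<value> layout.
class PartitionPath {

    static String partitionDir(String warehouse, String db, String table,
                               String partCol, String partValue) {
        return warehouse + "/" + db + ".db/" + table + "/" + partCol + "=" + partValue;
    }

    public static void main(String[] args) {
        // Assumed database name db_test; stu_part and class come from the DDL above.
        System.out.println(
            partitionDir("/user/hive/warehouse", "db_test", "stu_part", "class", "1"));
    }
}
```

A query with where class=1 only has to read files under that one directory, which is why filtering on the partition column is cheap.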

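3.5 Buckets

Create a bucketed table (several files are created under the partition directory, or directly under the table directory)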

create table tab_cluster(id int, name string) 
  clustered by (id) into 4 buckets
  row format delimited 
  fields terminated by '\t';
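
To give a feel for how rows are spread over the 4 buckets, here is a small Java sketch. It assumes the bucket is chosen as (hash & Integer.MAX_VALUE) % numBuckets and that an int column hashes to its own value; this matches the commonly described default behaviour, but treat the formula and the class name as assumptions, not as Hive's actual implementation.

```java
// Hypothetical illustration, not Hive code: picks a 0-based bucket for an int
// clustering column under the assumption stated in the lead-in.
class BucketAssign {

    static int bucketFor(int id, int numBuckets) {
        // Masking with Integer.MAX_VALUE keeps the result non-negative for negative ids.
        return (Integer.hashCode(id) & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4; // "into 4 buckets" from the DDL above
        for (int id = 1; id <= 8; id++) {
            System.out.println("id " + id + " -> bucket file " + bucketFor(id, numBuckets));
        }
    }
}
```

Under this assumption, ids 1 and 5 land in the same bucket file, as do 4 and 8.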

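Load data into a bucketed table

-- A plain load from a local file does not bucket the rows; the data has to go through an MR job
-- Bucketing must be enabled first
set hive.enforce.bucketing=true;
-- Insert from a temporary staging table
insert into table tab_cluster select * from tab_tmp;

Storage layout

(Figure: Hive bucketing result)

Sampling a bucketed table

-- Starting at bucket 1 and taking every 2nd bucket samples 4/2=2 buckets: buckets 1 and 3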
select * from tab_cluster tablesample(bucket 1 out of 2 on id);

-- With 4 stored buckets, each requested bucket is (4/8)=1/2 of a stored bucket, so this picks out half of the 1st bucket
select * from tab_cluster tablesample(bucket 1 out of 8 on id);
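
The first sampling query above can be checked with a small Java sketch that lists which stored buckets tablesample(bucket x out of y) reads when y divides the table's bucket count n. The class is hypothetical, and the case y > n (a fraction of one bucket, as in the second query) is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Hive code: lists the 1-based buckets read in full by
// tablesample(bucket x out of y) on a table with n buckets, when y divides n.
class BucketSample {

    static List<Integer> sampledBuckets(int x, int y, int n) {
        if (x < 1 || x > y) {
            throw new IllegalArgumentException("x must be between 1 and y");
        }
        if (y > n || n % y != 0) {
            throw new IllegalArgumentException("this sketch only covers y dividing n");
        }
        List<Integer> buckets = new ArrayList<>();
        // Start at bucket x and step by y; this yields n / y buckets in total.
        for (int b = x; b <= n; b += y) {
            buckets.add(b);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // tab_cluster has 4 buckets; "bucket 1 out of 2" from the query above.
        System.out.println(sampledBuckets(1, 2, 4));
    }
}
```

For the 4-bucket table, bucket 1 out of 2 gives buckets 1 and 3, matching the comment on that query.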

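3.6 Functions

List the built-in functions

show functions;

User-defined functions
1. Extend org.apache.hadoop.hive.ql.exec.UDF
2. Provide an evaluate() method, for example public String evaluate(String s); it can be overloaded for different argument types
3. Package the class into a jar
4. Load the jar: hive>add jar /home/hadoop/myudf.jar;
5. Create a temporary function: hive>CREATE TEMPORARY FUNCTION function_name AS 'package.ClassName';
6. Use it: select function_name(name) from tab_ext;
7. Note: a temporary function is gone once the Hive client is restarted
Some commonly used built-in functions
- over(): the window clause; it lets you partition and order rows within a window and then compute an aggregate or ranking function over that window
- row_number(): returns the sequential number of the current row within its window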
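
The user-defined function steps above can be sketched as follows. The class name ToLowerUdf and its lower-casing behaviour are made up for illustration. In a real build the class would extend org.apache.hadoop.hive.ql.exec.UDF as step 1 says; the superclass is left out here so that the sketch compiles without the Hive jars on the classpath.

```java
import java.util.Locale;

// Hypothetical UDF body. In a real Hive build this would be declared as
//   public class ToLowerUdf extends org.apache.hadoop.hive.ql.exec.UDF
// (step 1 above); the superclass is omitted so the sketch compiles stand-alone.
class ToLowerUdf {

    // evaluate() is the method Hive looks up on a UDF; returning null for a
    // null input keeps SQL NULL semantics.
    public String evaluate(String input) {
        if (input == null) {
            return null;
        }
        return input.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        ToLowerUdf udf = new ToLowerUdf();
        System.out.println(udf.evaluate("Hello HIVE"));
    }
}
```

After packaging and registering it (steps 3 to 5), it would be called like any built-in function, for example select function_name(name) from tab_ext;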

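3.7 Running HQL from scripts

# -S: silent mode, -e: run the quoted statement, -f: run the statements in a file
hive -S -e 'select * from tab_stu;'
hive -S -f xx_file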

4. Connecting to Hive over JDBC

4.1 Start the remote service

hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10010

4.2 JDBC connection

4.2.1 Copy the jars

hadoop-common-*.jar
slf4j-api-*.jar

Copy the following jars from Hive's lib directory

commons-logging-*.jar
hive-exec-*.jar
hive-jdbc-*.jar
hive-metastore-*.jar
hive-service-*.jar
httpclient-*.jar
httpcore-*.jar
libfb303-*.jar

4.2.2 Test the connection

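import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveJdbcTest {

    public static void main(String[] args) {

        Connection conn = null;
        Statement stmt = null;
        ResultSet rs = null;

        try {
            // 0. Load the driver class
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // 1. Open a connection (host hh1, port 10010 as set in 4.1)
            conn = DriverManager.getConnection("jdbc:hive2://hh1:10010");
            // 2. Create a statement
            stmt = conn.createStatement();
            // 3. Run the query
            rs = stmt.executeQuery("show databases");
            // 4. Read the result
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        } catch (ClassNotFoundException | SQLException e) {
            e.printStackTrace();
        } finally {
            // 5. Release resources in reverse order of creation
            close(rs);
            close(stmt);
            close(conn);
        }
    }

    // Closes a resource if it was opened, logging any failure instead of rethrowing it
    public static void close(AutoCloseable resource) {
        try {
            if (resource != null)
                resource.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}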