study-notes (9 Hive)

GraysonWP

Tags: Hive

Article link: https://blog.csdn.net/wpwbb510582246/article/details/83588419

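These notes summarize the technologies I have studied, organized by module. They cover JavaSE, JavaWeb (SpringMVC, Spring, MyBatis, SpringBoot, SpringCloud, and so on), Linux, Hadoop, MapReduce, Hive, Scala, Spark, and more. Writing them helps me understand these technologies more deeply, and I hope they also help others who enjoy this kind of work. I will keep updating the notes and will record whatever I learn next in the same form. The notes are shared on GitHub, where you can download the Markdown files. If you have questions while reading, email me, leave a comment below, or reach me on GitHub. I hope we can learn from each other and improve together.

GitHub link for this article: https://github.com/wpwbb510582246/study-notes/blob/master/9 Hive/9 Hive.md

Project on GitHub: https://github.com/wpwbb510582246/study-notes

Email: weipengweibeibei@163.com

Blog: https://blog.csdn.net/wpwbb510582246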

9 Hive

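In Hive's load and insert statements, the keyword local means the path is on the local filesystem, and omitting it means the path is on HDFS. The keyword into appends to the existing data, and overwrite replaces it.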

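9.1 Principles

9.1.1 Differences between a data warehouse and a database

1. A database is designed around transactions; a data warehouse is designed around subjects.

2. A database mainly stores current business data; a data warehouse mainly stores historical data.

3. A database is mainly used to capture data, for example receiving the data submitted from a page in a JavaEE application, processing it, and returning a response. A data warehouse is mainly used to analyze data: data is read from the warehouse and then analyzed.

4. A database should avoid redundant data as far as possible; a data warehouse sometimes introduces redundancy deliberately to make analysis easier.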

9.2 DDL

CREATE TABLE employee (
NAME string,
salary FLOAT,
subordinated array < string >,
deductions map < string, FLOAT >,
address struct < province:string, city:string, state:string, zip:INT > 
)
partitioned by (province string, state string)
clustered by (salary)
into 4 buckets
ROW format delimited
FIELDS TERMINATED BY '\t'
collection items TERMINATED BY ','
map KEYS TERMINATED BY '=';

Test data:
John Doe    10000.0    Mary Smith,Todd Jones    Federal Taxes=0.2,State Taxes=0.1,Insurance=0.1    1 Michigan Ave.,Chicago,IL,60600
Mary Smith    80000.0    Bill King    Federal Taxes=0.2,State Taxes=0.05,Insurance=0.1    100 Ontario St.,Chicago,IL,60601
Todd Jones    70000.0        Federal Taxes=0.15,State Taxes=0.03,Insurance=0.1    200 Chicago Ave.,Oak Park,NY,60700
Bill King    60000.0        Federal Taxes=0.15,State Taxes=0.03,Insurance=0.1    300 Obscure Dr.,Obscur,CA,6010

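9.2.1 Partitioned tables

9.2.1.1 Benefits of partitioned tables

Partitioned tables mainly help queries: they narrow the range of data a query has to scan and so speed up retrieval. They also let data be managed according to defined rules and conditions. Each partition of a table corresponds to one directory, and the partition's data is the set of files in that directory.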
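To make the partition-to-directory mapping concrete, here is a minimal Java sketch that builds the directory path for a partition in the name=value layout. The class name, the table directory, and the partition values are hypothetical and chosen only for illustration; this is not Hive's own code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Illustrative sketch only: shows how a partition spec maps to a directory path.
final class PartitionPathSketch {

    // Appends one "column=value" directory level per partition column, in declaration order.
    static String partitionPath(String tableDir, Map<String, String> partitionSpec) {
        StringJoiner path = new StringJoiner("/");
        path.add(tableDir);
        for (Map.Entry<String, String> column : partitionSpec.entrySet()) {
            path.add(column.getKey() + "=" + column.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // Hypothetical partition values for the employee table defined above.
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("province", "p1");
        spec.put("state", "s1");
        System.out.println(partitionPath("/user/hive/warehouse/databaseName/employee", spec));
    }
}
```

A query that filters on province and state only needs to read the files under the matching directory, which is why partitioning narrows the scan.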

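9.2.1.2 Common operations on partitioned tables

1. Load data from a file into a partitioned table

-- Load the local file 'file' into table tableName, into the partition where partitionName1 = 'partitionValue1' and partitionName2 = 'partitionValue2'
load data local inpath 'file' into table tableName partition(partitionName1='partitionValue1', partitionName2='partitionValue2');

2. Insert data into a partitioned table dynamically

-- Query tableName2 and insert the rows into tableName1, letting Hive pick the partition for each row.
-- field1 and field2 are the regular columns of tableName1; the last two select expressions supply the values of the partition columns partitionName1 and partitionName2.
-- Dynamic partitioning must be enabled, and nonstrict mode is required when every partition column is dynamic.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table tableName1 partition (partitionName1, partitionName2) select field1, field2, field3 as partitionName1, field4 as partitionName2 from tableName2;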

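9.2.2 Bucketed tables

9.2.2.1 Benefits of bucketed tables

Bucketed tables can be used to sample data. Because of how bucketing works, they can also make joins on the bucketing column cheaper, which improves efficiency. Bucketing is analogous to partitioning in MapReduce: each bucket corresponds to one file, and the bucket's data is the data in that file. The bucket for a row is determined by hashing the value of the bucketing column and taking the result modulo the number of buckets.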
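The modulo rule can be sketched in a few lines of Java. This is a simplified illustration, not Hive's actual code: it assumes an integer bucketing column whose hash is the value itself, and the class and method names are made up for the example.

```java
// Illustrative sketch only: assigns a row to a bucket from the hash of its bucketing column.
final class BucketSketch {

    // Clears the sign bit so negative hashes still map to a valid bucket index,
    // then takes the remainder modulo the number of buckets.
    static int bucketFor(int hash, int numBuckets) {
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4; // matches "into 4 buckets" in the table definition above
        for (int id = 1; id <= 8; id++) {
            // Assumption: for an int column the hash is simply the value.
            System.out.println("id " + id + " -> bucket " + bucketFor(Integer.hashCode(id), numBuckets));
        }
    }
}
```

Rows with the same bucketing value always land in the same bucket file, which is what makes sampling by bucket and bucket-wise joins possible.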

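9.2.2.2 Common operations on bucketed tables

1. Before writing to a bucketed table, enable bucketing enforcement (this applies to Hive 1.x; from Hive 2.0 on, bucketing is always enforced and the property is no longer needed)

set hive.enforce.bucketing = true;

2. Set the number of reducers (this can be skipped when the table definition already specifies the number of buckets, in which case the number of reducers equals the number of buckets)

set mapreduce.job.reduces=4;

3. Insert data into the bucketed table

-- Query the rows of tableName2 and insert them into tableName1
insert into tableName1 select * from tableName2 cluster by (clusterName);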

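9.2.3 Managed (internal) tables and external tables

9.2.3.1 Differences between managed and external tables

1. On drop: dropping a managed table deletes both the metadata and the data directory; dropping an external table deletes only the metadata and leaves the data directory in place.

2. On create: a managed table's data directory is created under the database directory in the warehouse; an external table usually points at a path given with location, so its data does not have to live under the database directory.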

9.2.4 Common DDL commands

9.2.4.1 Altering a table

1. Add partitions

alter table tableName add partition(partitionName1=partitionValue1) location '/user/hive/warehouse/databaseName/tableName/partitionName1=partitionValue1' partition(partitionName2=partitionValue2) location '/user/hive/warehouse/databaseName/tableName/partitionName2=partitionValue2';

2. Drop a partition

alter table tableName drop if exists partition(partitionName1=partitionValue1, partitionName2=partitionValue2);

3. Rename a partition

alter table tableName partition(partitionName=partitionValue) rename to partition(partitionName=newPartitionValue);

4. Add a column

alter table tableName add columns(name string);

5. Change a column

-- The syntax is: change oldName newName type [first | after otherColumn]; repeat the name to keep it unchanged
alter table tableName change id id int;

alter table tableName change id id int after name;

alter table tableName change id id int first;

6. Rename a table

alter table tableName rename to newTableName;

7. like and create table as select

-- Create tableName1 with the same structure as tableName2 but no data
create table tableName1 like tableName2;

-- Create tableName1 with the same columns and the same data as tableName2
create table tableName1 as select * from tableName2;

9.2 DML

9.2.1 Common DML commands

1. insert

-- Insert the rows selected from tableName2 into tableName1; the query result must have the same structure as tableName1
insert into table tableName1 select * from tableName2;

-- Multi-insert: scan tableName1 once, writing id into tableName2 and name into tableName3
from tableName1
insert into table tableName2
select id
insert into table tableName3
select name;

-- Write the rows selected from tableName into the local directory 'file' (Hive only supports overwrite when writing to a directory)
insert overwrite local directory 'file' select * from tableName;

-- Scan tableName1 once, writing id into the local directory 'file1' and name into the local directory 'file2'
from tableName1
insert overwrite local directory 'file1'
select id
insert overwrite local directory 'file2'
select name;

2. select

-- order by: all rows go to a single reducer, which sorts them globally; with only one reducer, large inputs take a long time
-- sort by: sorts the rows within each reducer, so the output is only locally ordered; a final merge of the sorted outputs then yields a global order, which makes a full sort cheaper
-- distribute by: controls how rows are divided among reducers; for a table with columns mid and name, distribute by (mid) sends all rows with the same mid to the same reducer; it is usually combined with sort by
-- cluster by: shorthand for distribute by plus sort by on the same columns; cluster by (id) is equivalent to distribute by (id) sort by (id)
select * from tableName order by id asc;
select * from tableName distribute by (id) sort by (salary);
select * from tableName cluster by (id);
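The difference between distributing and sorting is easier to see in a small simulation. The Java sketch below is only a model of the idea, not how Hive executes a query: rows are int pairs of {id, salary}, reducers are plain lists, and the routing rule (hash of id modulo the reducer count) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: models "distribute by (id) sort by (salary)" with in-memory lists.
final class DistributeSortSketch {

    // Each row is {id, salary}. Returns one list of rows per simulated reducer.
    static List<List<int[]>> distributeAndSort(List<int[]> rows, int numReducers) {
        List<List<int[]>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new ArrayList<>());
        }
        // distribute by (id): rows with the same id always reach the same reducer.
        for (int[] row : rows) {
            int target = (Integer.hashCode(row[0]) & Integer.MAX_VALUE) % numReducers;
            reducers.get(target).add(row);
        }
        // sort by (salary): each reducer sorts only its own rows, so order is local, not global.
        for (List<int[]> reducerRows : reducers) {
            reducerRows.sort(Comparator.comparingInt((int[] row) -> row[1]));
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<int[]> rows = new ArrayList<>();
        rows.add(new int[] {1, 300});
        rows.add(new int[] {2, 100});
        rows.add(new int[] {1, 200});
        rows.add(new int[] {3, 50});
        rows.add(new int[] {2, 400});
        List<List<int[]>> result = distributeAndSort(rows, 2);
        for (int r = 0; r < result.size(); r++) {
            for (int[] row : result.get(r)) {
                System.out.println("reducer " + r + ": id=" + row[0] + " salary=" + row[1]);
            }
        }
    }
}
```

With order by, the same rows would all go to one reducer and come out in a single global order. Here each reducer's output is sorted on its own, and concatenating the outputs is not globally sorted.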

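9.3 Built-in functions

9.3.1 regexp_replace

-- Replace every substring that matches the regular expression with the replacement string (here, strip the double quotes)
select regexp_replace('"http://www.taobao.com/3c/items?id=001&name=book"', '\"', '');

Result: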
+--------------------------------------------------+--+
|                       _c0                        |
+--------------------------------------------------+--+
| http://www.taobao.com/3c/items?id=001&name=book  |
+--------------------------------------------------+--+

9.3.2 split

-- Split a string on the given delimiter and return an array
select split('beijing,shanghai,guangzhou,shenzhen', ',');

Result:
+------------------------------------------------+--+
|                      _c0                       |
+------------------------------------------------+--+
| ["beijing","shanghai","guangzhou","shenzhen"]  |
+------------------------------------------------+--+

9.3.3 array

-- Build an array from the arguments
select array('a','b','c');

Result:
+----------------+--+
|      _c0       |
+----------------+--+
| ["a","b","c"]  |
+----------------+--+

9.3.4 explode

-- Turn each element of an array into its own row
select explode(split('beijing,shanghai,guangzhou,shenzhen', ','));

Result:
+------------+--+
|    col     |
+------------+--+
| beijing    |
| shanghai   |
| guangzhou  |
| shenzhen   |
+------------+--+

9.3.5 parse_url_tuple

-- Extract the host, path, query string, and a single query parameter from a URL string
select  parse_url_tuple('http://www.taobao.com/3c/items?id=001&name=book','HOST','PATH', 'QUERY', 'QUERY:name') as (host,path,query,name);

Result:
+-----------------+------------+-------------------+-------+--+
|      host       |    path    |       query       | name  |
+-----------------+------------+-------------------+-------+--+
| www.taobao.com  | /3c/items  | id=001&name=book  | book  |
+-----------------+------------+-------------------+-------+--+
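For comparison, the same four pieces can be pulled out of the URL in plain Java with java.net.URI. This is a minimal sketch of the idea rather than Hive's implementation; the class and method names are hypothetical, and the query-parameter lookup does no URL decoding.

```java
import java.net.URI;

// Illustrative sketch only: extracts host, path, query, and one query parameter from a URL.
final class ParseUrlSketch {

    // Returns the value of the first "key=value" pair whose key matches, or null if absent.
    static String queryParam(String query, String key) {
        if (query == null) {
            return null;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals(key)) {
                return pair.substring(eq + 1);
            }
        }
        return null;
    }

    // Returns {host, path, query, value of the requested query parameter}.
    static String[] parseUrlTuple(String url, String paramName) {
        URI uri = URI.create(url);
        return new String[] {
            uri.getHost(), uri.getPath(), uri.getQuery(), queryParam(uri.getQuery(), paramName)
        };
    }

    public static void main(String[] args) {
        String[] parts = parseUrlTuple("http://www.taobao.com/3c/items?id=001&name=book", "name");
        System.out.println(String.join(" | ", parts));
    }
}
```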

9.3.6 union

-- Combine the rows returned by two queries into a single result set
select 'http://www.taobao.com/3c/items?id=001&name=book' as url union select 'http://www.tmall.com/3c/items?id=001&name=book&date=2018' as url;

Result:
+-----------------------------------------------------------+--+
|                          _u2.url                          |
+-----------------------------------------------------------+--+
| http://www.taobao.com/3c/items?id=001&name=book           |
| http://www.tmall.com/3c/items?id=001&name=book&date=2018  |
+-----------------------------------------------------------+--+

9.3.7 lateral view

-- Join each row of a table with the rows that a table-generating function (here explode) produces from that same row
select a.name, b.* from (select 'zhangsan' as name, 'beijing,shanghai' as location union select 'lisi' as name, '广州,shenzhen,海南' as location) a lateral view explode(split(location, ',')) b as kkk;

Result:
+-----------+-----------+--+
|  a.name   |   b.kkk   |
+-----------+-----------+--+
| lisi      | 广州        |
| lisi      | shenzhen  |
| lisi      | 海南        |
| zhangsan  | beijing   |
| zhangsan  | shanghai  |
+-----------+-----------+--+
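Conceptually, lateral view with explode behaves like a nested loop: for every input row, split the column and emit one output row per element, repeating the other columns. The Java sketch below models that with string arrays of {name, location}; the names are hypothetical and the row order is simply input order, which Hive does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: models "lateral view explode(split(location, ','))" as a nested loop.
final class LateralViewSketch {

    // Each input row is {name, commaSeparatedLocations}; each output row is {name, oneLocation}.
    static List<String[]> lateralExplode(List<String[]> rows) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            for (String location : row[1].split(",")) {
                out.add(new String[] {row[0], location});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"zhangsan", "beijing,shanghai"});
        rows.add(new String[] {"lisi", "guangzhou,shenzhen"});
        for (String[] row : lateralExplode(rows)) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```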

9.4 UDF

9.4.1 Writing a custom UDF

1. Add the required dependencies to pom.xml

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
</dependency>

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
</dependency>

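2. Create a class that extends UDF and defines one or more evaluate methods

package app.logs.udf;

import java.util.Date;

import org.apache.hadoop.hive.ql.exec.UDF;
// DateUtil is the author's own helper class (not shown in these notes); import it from its actual package.

/**
 * UDF that returns the start of a day as a millisecond timestamp.
 *
 * @author Grayson
 * @date 2018/8/28 21:11
 */
public class DateStartUDF extends UDF {

    /**
     * Returns the start of the current day, in milliseconds.
     */
    public long evaluate() {
        return evaluate(new Date());
    }

    /**
     * Returns the start of the day that contains the given date, in milliseconds.
     *
     * @param date the date to truncate
     */
    public long evaluate(Date date) {
        return DateUtil.getStartTimeLong(date);
    }

    /**
     * Returns the start of the day given as a string, in milliseconds.
     *
     * @param dateString the date, in the format DateUtil.getDateFromString1 expects
     */
    public long evaluate(String dateString) {
        return evaluate(DateUtil.getDateFromString1(dateString));
    }

    /**
     * Returns the start of the day given as a string in a caller-supplied format, in milliseconds.
     *
     * @param dateString the date
     * @param dateFormatString the pattern used to parse dateString
     */
    public long evaluate(String dateString, String dateFormatString) {
        return evaluate(DateUtil.getDateFromStringDefined(dateString, dateFormatString));
    }

}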
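The DateUtil helper is not included in these notes. As a rough idea of what a method like getStartTimeLong might compute, here is a minimal, self-contained sketch using java.time. The class name, method name, and signature are assumptions made for illustration and are not the author's actual helper.

```java
import java.time.Instant;
import java.time.ZoneId;

// Illustrative sketch only: a stand-in for the start-of-day calculation, not the author's DateUtil.
final class DayStartSketch {

    // Truncates an epoch-millisecond timestamp to midnight of the same calendar day in the given zone.
    static long startOfDayMillis(long epochMillis, ZoneId zone) {
        return Instant.ofEpochMilli(epochMillis)
                .atZone(zone)
                .toLocalDate()
                .atStartOfDay(zone)
                .toInstant()
                .toEpochMilli();
    }

    public static void main(String[] args) {
        // Start of today in the JVM's default time zone.
        System.out.println(startOfDayMillis(System.currentTimeMillis(), ZoneId.systemDefault()));
    }
}
```

The result depends on the time zone: the same instant falls on different calendar days in different zones, so a UDF like this should run with the zone the data is meant to be interpreted in.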

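3. Package the project as a jar and upload it to the server

4. Add the jar to Hive's classpath

add jar /usr/local/distribute/hive/udf/app-logs-hive-1.0-SNAPSHOT.jar;

5. Create the function

5.1 Create a temporary function

-- getstartday is the name of the function to create; app.logs.udf.DateStartUDF is the fully qualified class name of DateStartUDF (package plus class name)
create temporary function getstartday as 'app.logs.udf.DateStartUDF';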

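5.2 Create a permanent function

5.2.1 The Hive warehouse lives on HDFS, so the jar must be uploaded to HDFS before a permanent function is created

# Create the directory (-p also creates missing parent directories)
hdfs dfs -mkdir -p /file/hiveUDF
# Upload the jar to that directory
hdfs dfs -put app-logs-hive-1.0-SNAPSHOT.jar /file/hiveUDF

5.2.2 Create the permanent function

create function getstartday as 'app.logs.udf.DateStartUDF' using jar 'hdfs:///file/hiveUDF/app-logs-hive-1.0-SNAPSHOT.jar';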

6. Use the function

select getstartday();

Result:

OK
_c0
1535558400000
Time taken: 3.749 seconds, Fetched: 1 row(s)
