Hive All-in-One Guide

Latest recommended article published 2024-04-10 13:19:08

龙卷风摧毁停车场!


Categories: Programming, java; Tags: Hive, Programming Guide (Complete Edition)

Article link: https://blog.csdn.net/internation985/article/details/107035638


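1. What is Hive

1.1 The basic idea of Hive
Hive is a data warehouse tool built on Hadoop for offline (batch) analysis. It maps structured data files onto database tables and provides SQL query capability over them.

1.2 Why use Hive

Problems with using Hadoop directly:

 - The learning curve for developers is steep.
 - Project schedules are usually too short for it.
 - Complex query logic is hard to implement in MapReduce.

What Hive offers:

 - A SQL-like interface, which enables rapid development.
 - No need to write MapReduce by hand, which lowers the learning cost for developers.
 - Easy functional extension.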

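1.3 Features of Hive

Scalability:

 	A Hive cluster can be scaled out freely, generally without restarting the service.

Extensibility:

 	Hive supports user-defined functions, so users can implement functions for their own needs.

Fault tolerance:

 	Good fault tolerance: a SQL statement can still complete when a node fails.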

2. Basic architecture of Hive
JobTracker is a Hadoop 1.x component. Its role corresponds roughly to:
ResourceManager + MRAppMaster

TaskTracker corresponds roughly to:
NodeManager + YarnChild

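3. Installing Hive

3.1 Simplest installation: embedded Derby as the metastore database

Prerequisite: the machine must have a Hadoop environment (an installation directory and the HADOOP_HOME environment variable). Installation: simply unpack a Hive release package.
A Hive instance installed this way uses its embedded Derby database to store the metadata.
This mode makes it hard for team members to share data and collaborate.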

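3.2 Standard installation: MySQL as the metastore database
3.2.1 Installing MySQL

 - Install with yum; see
 - https://blog.csdn.net/HANLIPENGHANLIPENG/article/details/77445045

1. Grant the root user permission to log in to the MySQL server from any machine. MySQL must accept remote logins for this setup to work.
The most direct test is to connect from Windows with Navicat. If the connection succeeds, nothing more is needed. If it fails, grant the privilege from the command-line client on the MySQL machine:

mysql -uroot -proot

mysql>grant all privileges on *.* to 'root'@'%' identified by 'your-root-password' with grant option;
mysql>flush privileges;

(This grant syntax applies to MySQL 5.x. MySQL 8.0 and later no longer accept identified by inside grant; create the user first and then grant.)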

3.2.2 Configuring the Hive metastore database
vi hive-site.xml

<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>
</configuration>

2. Upload a MySQL JDBC driver jar to the lib directory of the Hive installation.
3. Add HADOOP_HOME and HIVE_HOME to the system environment variables in /etc/profile.
4. Start Hive to test the setup by opening the interactive shell:

[root@hdp20-04 ~]# hive

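4. Ways to use Hive
4.1 The most basic way
Start an interactive Hive shell:
bin/hive
hive>

A few settings make Hive more convenient to use, for example:

Show the current database in the prompt:
hive>set hive.cli.print.current.db=true;
Show column names in query results: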

hive>set hive.cli.print.header=true; 
set hive.resultset.use.unique.column.names=false;
set hive.exec.mode.local.auto=true;

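4.2 Scripted execution

Typing a large number of Hive queries into the interactive shell is clearly inefficient, so production systems mostly run Hive from scripts.
The core of this mechanism is that hive can run a given HQL statement as a one-off command:
[root@hdp20-04 ~]# hive -e "insert into table t_dest select * from t_src;"
The command can then be placed in a shell script, which makes it possible to script Hive jobs and to control and schedule many of them. For example:
vi t_order_etl.sh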

#!/bin/bash
hive -e "select * from db_order.t_order"
hive -e "select * from default.t_user"
hql="create table  default.t_bash as select * from db_order.t_order"
hive -e "$hql"

If the HQL to run is particularly complex, write the statements into a file:

select * from db_order.t_order;

select count(1) from db_order.t_user;

Then run the file with hive -f /root/x.hql

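5. Creating databases and tables, and importing data

5.1 Creating a database
Hive has a default database:
Database name: default
Database directory: hdfs://hdp20-01:9000/user/hive/warehouse

Create a new database: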

create database db_order;

After the database is created, a database directory is generated in HDFS:

hdfs://hdp20-01:9000/user/hive/warehouse/db_order.db

5.2 Creating tables
5.2.1 Basic create table statement

use db_order;
create table t_order(id string,create_time string,amount float,uid string);

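After the table is created, a table directory is generated under the directory of its database:

/user/hive/warehouse/db_order.db/t_order

However, with a table created this way, Hive assumes that the field delimiter in the data files is ^A (\001).

For comma-separated data files, the correct statement is: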

create table t_order(id string,create_time string,amount float,uid string)
row format delimited
fields terminated by ',';

This declares that the field delimiter in the table's data files is ",".
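Why the delimiter matters can be seen with a small sketch. This is plain Python rather than Hive, and the sample line is a hypothetical t_order row; it only illustrates that a comma-separated line is not split at all when the reader expects ^A, in which case Hive would put the whole line in the first column and leave the others NULL.

```python
# Sketch: how one text line is split into fields under two delimiter settings.
CTRL_A = "\x01"  # Hive's default field delimiter (^A, octal \001)

def split_fields(line, delimiter=CTRL_A):
    """Split one line of a data file into its field values."""
    return line.rstrip("\n").split(delimiter)

# Hypothetical row for t_order(id, create_time, amount, uid)
line = "1001,2020-05-27 10:00:00,35.5,u01"

default_row = split_fields(line)      # no ^A in the line: one single field
comma_row = split_fields(line, ",")   # four fields, one per column
```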

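5.2.2 Dropping a table

drop table t_order;
Dropping a table has two effects:
Hive removes the table's information from the metastore database;
Hive also deletes the table's directory from HDFS.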

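5.2.3 Managed (internal) and external tables

Managed table (MANAGED_TABLE): the table directory follows Hive's conventions and lives under the Hive warehouse directory /user/hive/warehouse.
External table (EXTERNAL_TABLE): the table directory is specified by the user who creates the table.
create external table t_access(ip string,url string,access_time string)
row format delimited
fields terminated by ','
location '/access/log';
Differences between managed and external tables:
1. A managed table's directory is inside the Hive warehouse directory, whereas an external table's directory is specified by the user.
2. Dropping a managed table: Hive removes the metadata and deletes the table's data directory.
3. Dropping an external table: Hive removes only the metadata.

The lowest-level tables in a Hive data warehouse usually come from external systems. To avoid interfering with how those systems work, create external tables in Hive that map onto the data directories they produce.
The intermediate tables produced by later ETL steps are better created as managed tables.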

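5.2.4 Partitioned tables
In essence, a partitioned table has partition subdirectories for its data files inside the table directory, so that at query time the MR job can process only the data in the relevant subdirectories and read less data.

For example, a website produces browsing records every day, and a table should be created to store them. Sometimes, however, only the records of one particular day need to be analysed.
In that case the table can be created as a partitioned table, with each day's data loaded into its own partition. Each daily partition directory needs a name, which is derived from the partition column.

5.2.4.1 Example with one partition column
1. Create a partitioned table:
create table t_access(ip string,url string,access_time string)
partitioned by(dt string)
row format delimited
fields terminated by ',';

Note: a partition column must not be one of the columns already defined in the table.

2. Load data into partitions:

load data local inpath '/root/access.log.2020-08-04.log' into table t_access partition(dt='20200804');
load data local inpath '/root/access.log.2020-08-05.log' into table t_access partition(dt='20200805');

3. Query partition data
a. Total PV for August 4:

select count(*) from t_access where dt='20200804';

In essence, the partition column is used like an ordinary table column, so a where clause can select the partition.

b. Total PV across all data in the table:

select count(*) from t_access;

In essence, simply leave out the partition condition.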
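The directory layout behind a partitioned table can be sketched as follows. This is plain Python rather than Hive; the warehouse path assumes the table lives in the default database and the dt values are illustrative. It shows that each partition is a subdirectory named column=value, and that a filter on the partition column only needs the matching subdirectories.

```python
def partition_dir(table_dir, column, value):
    """Directory that holds one partition, e.g. .../t_access/dt=20200804."""
    return f"{table_dir}/{column}={value}"

def prune(partition_dirs, column, value):
    """Keep only the directories that a `where column = value` filter needs."""
    wanted = f"{column}={value}"
    return [d for d in partition_dirs if d.rsplit("/", 1)[-1] == wanted]

# Assumption: t_access is stored under the default warehouse directory.
table_dir = "/user/hive/warehouse/t_access"
dirs = [partition_dir(table_dir, "dt", day) for day in ("20200804", "20200805")]

# where dt='20200804' reads one directory; no filter reads them all.
selected = prune(dirs, "dt", "20200804")
```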

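5.3 Importing and exporting data
5.3.1 Loading data files into a Hive table
Method 1: put the file into the table directory by hand with hdfs commands.
Method 2: use the load command in the Hive interactive shell to load a file into the table directory. With the local keyword the path refers to the local filesystem; without it, as in the statement below, the path refers to HDFS.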

hive>load data inpath '/access.log.2017-08-06.log' into table t_access;

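Note the difference between loading a local file and loading an HDFS file:
Local file (load data local inpath): the file is copied into the table directory.
HDFS file (load data inpath): the file is moved into the table directory.

5.3.2 Exporting data from a Hive table to files at a given path
1. Export data from a Hive table to files in HDFS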

insert overwrite directory '/root/access-data'
row format delimited fields terminated by '::'
select * from student;

2. Export data from a Hive table to files on the local disk

insert overwrite local directory '/root/access-data'
row format delimited fields terminated by ','
select * from t_access limit 100000;

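5.4 Data types
5.4.1 Numeric types

TINYINT (1-byte integer)
SMALLINT (2-byte integer)
INT/INTEGER (4-byte integer)
BIGINT (8-byte integer, equivalent to Java long)
FLOAT (4-byte single-precision floating point)
DOUBLE (8-byte double-precision floating point)

create table t_test(a string,b int,c bigint,d float,e double,f tinyint,g smallint);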

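5.4.2 Date and time types
TIMESTAMP (a timestamp holding year, month, day, hour, minute and second)
DATE (a date holding only year, month and day)
Example: suppose we have the following data file: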

1,zhangsan,1985-06-30
2,lisi,1986-07-10
3,wangwu,1985-08-09

A table can then be created to map the data:

create table t_customer(id int,name string,birthday date)
row format delimited fields terminated by ',';

Then load the data:

load data local inpath '/root/customer.dat' into table t_customer;

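After that, the data can be queried correctly.
5.4.3 String types
VARCHAR (variable-length string of 1 to 65535 characters; longer values are truncated)
CHAR (fixed-length string, maximum length 255)

5.4.4 Other types
BOOLEAN: true or false
BINARY: binary data

5.4.5 Complex types
5.4.5.1 array type
arrays: ARRAY<data_type>
Example: using the array type
Suppose the following data needs to be mapped to a Hive table: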

流浪地球,吴京:吴孟达:李光洁,2019-02-06
三生三世十里桃花,杨幂:迪丽热巴:高伟光,2017-08-20
都挺好,姚晨:倪大伟:郭京飞,2019-03-01

Idea: it is convenient to map the lead actors to an array.

Create the table:

create table t_movie(moive_name string,actors array<string>,first_show date)
row format delimited fields terminated by ','
collection items terminated by ':';

Load the data:

load data local inpath '/root/movie.dat' into table t_movie;

Query:

select moive_name,actors[0] from t_movie;
select moive_name,size(actors) from t_movie;

5.4.5.2 map type

maps: MAP<primitive_type, data_type>

1) Suppose we have the following data:

1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoli,28
2,lisi,father:mayun#mother:huangyi#brother:mahuateng,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26

The map type can describe the family members in this data.
2) Create the table:

create table t_person(id int,name string,family_members map<string,string>,age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';

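3) Query

select id,name,family_members['father'] as father from t_person;
select id,name,map_keys(family_members) as relations from t_person;

5.4.5.3 struct type

For a table t_person_struct whose info column has the type struct<age:int,sex:string,addr:string>, the fields of the struct are accessed with dot notation:

select * from t_person_struct;
select id,name,info.age from t_person_struct;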
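The delimiter clauses in the t_movie and t_person definitions can be read as nested splits. The sketch below is plain Python rather than Hive's own parser, with hypothetical helper names; it splits a line by the field delimiter first, then splits the collection column by the item delimiter, and for a map splits each item by the key delimiter.

```python
def parse_movie(line):
    """t_movie: fields split by ',', array items split by ':'."""
    title, actors, first_show = line.split(",")
    return {"moive_name": title, "actors": actors.split(":"), "first_show": first_show}

def parse_person(line):
    """t_person: fields by ',', map entries by '#', key and value by ':'."""
    pid, name, family, age = line.split(",")
    members = dict(entry.split(":", 1) for entry in family.split("#"))
    return {"id": int(pid), "name": name, "family_members": members, "age": int(age)}

movie = parse_movie("流浪地球,吴京:吴孟达:李光洁,2019-02-06")
person = parse_person("4,mayun,father:mayongzhen#mother:angelababy,26")

first_actor = movie["actors"][0]                   # like actors[0]
father = person["family_members"]["father"]        # like family_members['father']
brother = person["family_members"].get("brother")  # missing key: None, as Hive gives NULL
```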

6. Hive query syntax
The clocks of the machines that submit Hive jobs need to be synchronized:

yum install -y ntpdate
ntpdate pool.ntp.org

Tip: when testing queries on small data, Hive can hand the MR job to the local runner. Set the following parameter in the Hive session:

hive> set hive.exec.mode.local.auto=true;

6.1 Basic query examples

select * from t_access;
select count(*) from t_access;
select max(age) from t_person;
select min(age) from t_person;
select avg(age) from t_person;

6.2 Conditional queries

select * from t_access where access_time<'2017-08-06 15:30:20';
select * from t_access where access_time<'2017-08-06 16:30:20' and ip>'192.168.33.3';

6.3 Join query examples

select * from a right join b on a.name=b.name;

Suppose there is a file a.txt:

a,10
b,15
c,20
d,25

Suppose there is a file b.txt:

a,xx
b,yy
d,zz
e,ww

Assume the two files are loaded into tables t_a(name string, numb int) and t_b(name string, nick string). The different kinds of join then behave as follows.
1. inner join (join)

select
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
join t_b b
on a.name=b.name;

Result:

+--------+--------+--------+--------+--+
| aname  | anumb  | bname  | bnick  |
+--------+--------+--------+--------+--+
| a      | 10     | a      | xx     |
| b      | 15     | b      | yy     |
| d      | 25     | d      | zz     |
+--------+--------+--------+--------+--+

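2. left join (left outer join)

select
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
left outer join t_b b
on a.name=b.name;

Result (for the a.txt and b.txt data above):

+--------+--------+--------+--------+--+
| aname  | anumb  | bname  | bnick  |
+--------+--------+--------+--------+--+
| a      | 10     | a      | xx     |
| b      | 15     | b      | yy     |
| c      | 20     | NULL   | NULL   |
| d      | 25     | d      | zz     |
+--------+--------+--------+--------+--+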
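3. right join (right outer join)

select
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
right outer join t_b b
on a.name=b.name;

Result (for the a.txt and b.txt data above):

+--------+--------+--------+--------+--+
| aname  | anumb  | bname  | bnick  |
+--------+--------+--------+--------+--+
| a      | 10     | a      | xx     |
| b      | 15     | b      | yy     |
| d      | 25     | d      | zz     |
| NULL   | NULL   | e      | ww     |
+--------+--------+--------+--------+--+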
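4. full join (full outer join)

select
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
full join t_b b
on a.name=b.name;

Result (for the a.txt and b.txt data above; row order may differ):

+--------+--------+--------+--------+--+
| aname  | anumb  | bname  | bnick  |
+--------+--------+--------+--------+--+
| a      | 10     | a      | xx     |
| b      | 15     | b      | yy     |
| c      | 20     | NULL   | NULL   |
| d      | 25     | d      | zz     |
| NULL   | NULL   | e      | ww     |
+--------+--------+--------+--------+--+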
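The four join types can be compared side by side with a small model. This is a plain-Python sketch, not how Hive executes joins; the rows are the a.txt and b.txt data from this section, and None stands for NULL.

```python
t_a = [("a", 10), ("b", 15), ("c", 20), ("d", 25)]          # rows of a.txt
t_b = [("a", "xx"), ("b", "yy"), ("d", "zz"), ("e", "ww")]  # rows of b.txt

def join(left, right, how="inner"):
    """Join on the name field; rows are (aname, anumb, bname, bnick)."""
    rows = []
    matched = set()  # names from the right table that found a partner
    for name, numb in left:
        partners = [(b, nick) for b, nick in right if b == name]
        for bname, nick in partners:
            rows.append((name, numb, bname, nick))
            matched.add(bname)
        if not partners and how in ("left", "full"):
            rows.append((name, numb, None, None))       # unmatched left row
    if how in ("right", "full"):
        for bname, nick in right:
            if bname not in matched:
                rows.append((None, None, bname, nick))  # unmatched right row
    return rows

inner = join(t_a, t_b)
left = join(t_a, t_b, "left")
right = join(t_a, t_b, "right")
full = join(t_a, t_b, "full")
```

Only the outer variants keep rows without a partner: c survives in the left and full joins, e in the right and full joins.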
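6.4 group by aggregation (the author's own example)
1. Count the number of agricultural product markets (market) in each province (province)

select a.province,count(*) from
(select province,market from t_farm
group by province,market) a
group by a.province
having a.province is not null
and a.province <> '';

The t_person_struct table queried in the struct examples of section 5.4.5 is created with:

create table t_person_struct(id int,name string,info struct<age:int,sex:string,addr:string>)
row format delimited fields terminated by ','
collection items terminated by ':';

2. Count the number of product types (name) in each province (province) and find the top three provinces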

select b.province,count(*)counts from
(select province,name from t_farm
group by province,name)b
group by b.province
order by counts desc
limit 3;

Sample rows from the t_farm data file:

香菜,2.80,4.00,4.00,4.00,2.20,山西汾阳市晋阳农副产品批发市场,山西,汾阳
大葱,2.80,2.80,2.80,2.80,2.60,山西汾阳市晋阳农副产品批发市场,山西,汾阳
葱头,1.60,1.60,1.60,1.60,1.60,山西汾阳市晋阳农副产品批发市场,山西,汾阳
大蒜,3.60,3.60,3.60,3.60,3.00,山西汾阳市晋阳农副产品批发市场,山西,汾阳
蒜苔,6.20,6.40,6.40,6.40,5.20,山西汾阳市晋阳农副产品批发市场,山西,汾阳
韭菜,5.60,5.60,5.60,5.60,4.60,山西汾阳市晋阳农副产品批发市场,山西,汾阳
青椒,5.20,5.00,5.00,5.00,4.80,山西汾阳市晋阳农副产品批发市场,山西,汾阳
茄子,5.40,4.40,4.40,4.40,5.40,山西汾阳市晋阳农副产品批发市场,山西,汾阳
西红柿,4.80,5.00,5.00,5.00,5.00,山西汾阳市晋阳农副产品批发市场,山西,汾阳
黄瓜,3.40,4.00,4.00,4.00,2.60,山西汾阳市晋阳农副产品批发市场,山西,汾阳
青冬瓜,1.60,1.60,1.60,1.60,1.50,山西汾阳市晋阳农副产品批发市场,山西,汾阳
西葫芦,2.80,3.00,3.00,3.00,2.60,山西汾阳市晋阳农副产品批发市场,山西,汾阳
白萝卜,1.20,1.20,1.20,1.20,0.80,山西汾阳市晋阳农副产品批发市场,山西,汾阳
胡萝卜,1.50,1.50,1.50,1.50,1.50,山西汾阳市晋阳农副产品批发市场,山西,汾阳
土豆,1.80,2.00,2.00,2.00,1.80,山西汾阳市晋阳农副产品批发市场,山西,汾阳
豆角,9.00,10.40,10.40,10.40,8.60,山西汾阳市晋阳农副产品批发市场,山西,汾阳
尖椒,5.40,5.40,5.40,5.40,4.40,山西汾阳市晋阳农副产品批发市场,山西,汾阳
面粉,3.44,3.44,3.44,3.44,3.44,山西汾阳市晋阳农副产品批发

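Note: once a query has a group by clause, the select clause may contain only grouping columns and aggregate functions.

Why must where be written before group by, and why can conditions after group by only use having?

Because where filters the data before the query logic actually runs,
while having filters the results again after the group by aggregation.

Execution logic of such statements:
1. where filters out the rows that do not satisfy its condition.
2. The aggregate functions and group by compute the aggregated results.
3. having filters out the aggregated results that do not satisfy its condition.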
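The three steps can be sketched as a single pass over the rows. This is plain Python with hypothetical (province, price) rows, not a description of Hive's execution plan; it only shows that where sees individual rows while having sees one aggregated value per group.

```python
from collections import defaultdict

def count_by_group(rows, key, where=lambda row: True, having=lambda k, n: True):
    counts = defaultdict(int)
    for row in rows:
        if where(row):              # step 1: where filters individual rows
            counts[key(row)] += 1   # step 2: group by key, count(*)
    # step 3: having filters the aggregated groups
    return {k: n for k, n in counts.items() if having(k, n)}

# Hypothetical (province, price) rows
rows = [("Shanxi", 2.8), ("Shanxi", 1.6), ("Shanxi", 3.6),
        ("Beijing", 5.0), ("Beijing", 1.0)]

# select province,count(*) from t where price>=2 group by province having count(*)>=2
result = count_by_group(
    rows,
    key=lambda row: row[0],
    where=lambda row: row[1] >= 2,
    having=lambda province, n: n >= 2,
)
```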

6.5 Subqueries

Given the following t_person data:

1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26

-- Find the people who have a brother

select id,name,brother
from 
(select id,name,family_members['brother'] as brother from t_person) tmp
where brother is not null;

Another way to write it:

select id,name,family_members['brother']
from t_person where array_contains(map_keys(family_members),'brother');

7. Using Hive functions
Tip: to try out functions, you can prepare a dedicated dual table:
create table dual(x string);
insert into table dual values('');
In practice, functions can be tested directly with constants:

select substr("abcdefg",1,3);

The full Hive function reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inTable-GeneratingFunctions(UDTF)

7.1 Common built-in functions
7.1.1 Type conversion functions
select id,cast(salary as int) from t_fun1;   -- salary is a column
select cast('2017-08-03' as date);
select cast(current_timestamp as date);

Example:

1	1995-05-05 13:30:59	1200.3
2	1994-04-05 13:30:59	2200
3	1996-06-01 12:20:30	80000.5

create table t_fun1(id string,birthday string,salary string)
row format delimited fields terminated by '\t';

select id,cast(birthday as date) as bir, salary from t_fun1;

7.1.2 Math functions

select round(5.4);   -- 5, rounds to the nearest value
select round(5.1345,3);   -- 5.135
select ceil(5.4);   -- 6, rounds up; same as select ceiling(5.4) from dual;
select floor(5.4);   -- 5, rounds down
select abs(-5.4);   -- 5.4, absolute value
Examples:
select max(age) from t_person;   -- aggregate function
select min(age) from t_person;   -- aggregate function

7.1.3 String functions

substr(string str, int start)   -- extract a substring
substring(string str, int start)
Example: select substr("abcdefg",2) from dual;
substr(string, int start, int len)
substring(string, int start, int len)
Example: select substr("abcdefg",2,3) from dual;

concat(string A, string B...)   -- concatenate strings
concat_ws(string SEP, string A, string B...)
Example: select concat("ab","xy") from dual;   -- abxy
select concat_ws(".","192","168","33","44") from dual;   -- 192.168.33.44

length(string A)
Example: select length("192.168.33.44") from dual;   -- 13

split(string str, string pat)
Example: select split("192.168.33.44",".") from dual;   -- wrong: "." is a special character in regular expression syntax
select split("192.168.33.44","\\.") from dual;   -- correct: the dot is escaped

upper(string str)   -- convert to upper case
lower(string str)   -- convert to lower case
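The split pitfall comes from the second argument being a regular expression. The sketch below uses Python's re module as a stand-in for the Java regex engine that Hive uses; the two agree on what an unescaped dot means, though this is an illustration and not Hive itself.

```python
import re

ip = "192.168.33.44"

# "." as a regex matches every character, so every character is treated
# as a separator and only empty strings remain between them.
wrong = re.split(".", ip)

# Escaping the dot makes it a literal separator.
right = re.split(r"\.", ip)
```

In HQL the backslash is doubled ("\\.") because the string literal itself consumes one level of escaping before the pattern reaches the regex engine.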

7.1.4 Date and time functions

select current_timestamp;   -- current timestamp (full date and time)
select current_date;   -- current date

-- Current time as a Unix timestamp: seconds elapsed since 1970-01-01 00:00:00 UTC
select unix_timestamp();

-- Unix timestamp to string
from_unixtime(bigint unixtime[, string format])
Example: select from_unixtime(unix_timestamp());
select from_unixtime(unix_timestamp(),"yyyy/MM/dd HH:mm:ss");

-- String to Unix timestamp
unix_timestamp(string date, string pattern)
Example: select unix_timestamp("2017-08-10 17:50:30");
select unix_timestamp("2017-08-10 17:50:30","yyyy-MM-dd HH:mm:ss");

-- String to date
select to_date("2017-09-17 16:58:32");

7.1.5 Conditional functions
7.1.5.1 case when

Syntax:

CASE   [ expression ]
       WHEN condition1 THEN result1
       WHEN condition2 THEN result2
       ...
       WHEN conditionn THEN resultn
       ELSE result
END

Example:
select id,name,
case
when age<28 then 'young'
when age>27 and age<40 then 'middle-aged'
else 'old'
end
from t_user;

7.1.5.2 IF
select id,if(age>25,'working','worked') from t_user;

select moive_name,if(array_contains(actors,'吴京'),'good movie','bad movie')
from t_movie;

7.1.6 Common aggregate functions
Used together with group by:
sum
avg
max
min
count

7.1.7 Table-generating functions
7.1.7.1 explode(): turning the elements of a collection into rows

Suppose we have the following data:

1,zhangsan,化学:物理:数学:语文
2,lisi,化学:数学:生物:生理:卫生
3,wangwu,化学:语文:英语:体育:生物
4,zhaoliu,数学:物理:化学:英语

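Map it to a table:
create table stu_subject(id string,name string,subjects array<string>)
row format delimited fields terminated by ','
collection items terminated by ':';

Use explode() to "explode" the array column, producing one row per element:
select explode(subjects) as sub from stu_subject;
Then use the explode result to find the distinct subjects: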

select distinct tmp.sub
from 
(select explode(subjects) as sub from stu_subject) tmp;

7.1.7.2 lateral view with a table-generating function
select id,name,tmp.sub
from stu_subject lateral view explode(subjects) tmp as sub;

select a.id,a.name,a.sub from
(select id,name,tmp.sub
from stu_subject lateral view explode(subjects) tmp as sub) a
where a.sub='化学';

How to read this: lateral view behaves like a join between two tables.
Left table: the original table.
Right table: the table produced by explode(some collection column).
The join only pairs data that comes from the same row.

This makes further queries easy.
For example, to find the students who take biology (生物):
select a.id,a.name,a.sub from
(select id,name,tmp.sub as sub from stu_subject lateral view explode(subjects) tmp as sub) a
where sub='生物';
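The "join within the same row" reading of lateral view explode can be modelled in a few lines. This is a plain-Python sketch over the stu_subject data from this section, not Hive's implementation; the function name is hypothetical.

```python
stu_subject = [
    ("1", "zhangsan", ["化学", "物理", "数学", "语文"]),
    ("2", "lisi", ["化学", "数学", "生物", "生理", "卫生"]),
    ("3", "wangwu", ["化学", "语文", "英语", "体育", "生物"]),
    ("4", "zhaoliu", ["数学", "物理", "化学", "英语"]),
]

def lateral_view_explode(rows):
    """Pair each row's id and name with every element of its own array."""
    for sid, name, subjects in rows:
        for sub in subjects:
            yield (sid, name, sub)

exploded = list(lateral_view_explode(stu_subject))

# select distinct sub ...
distinct_subjects = {sub for _sid, _name, sub in exploded}

# ... where sub='生物'
biology = [(sid, name) for sid, name, sub in exploded if sub == "生物"]
```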

7.1.8 JSON parsing function (a table-generating function)
Requirement: we have movie rating data in the following JSON format:

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}

Various statistics need to be computed on it.
Querying JSON directly with SQL is inconvenient, so the JSON data should first be parsed into an ordinary structured table. Hive's built-in json_tuple() function can do this.

Steps:
1. Create a raw table that holds the original JSON lines:
create table t_json(json string);
load data local inpath '/root/rating.json' into table t_json;

2. Parse the JSON data with json_tuple.
A test query:
select json_tuple(json,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid) from t_json limit 10;
Then parse the whole JSON table and insert the result into a new table:
create table t_movie_rate
as
select json_tuple(json,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid) from t_json;
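What json_tuple does to one line can be sketched with Python's json module. This is a stand-in written for illustration, not Hive's implementation; the function below only mirrors the idea of pulling several top-level keys out of a JSON string in one call, with None playing the role of NULL for a missing key.

```python
import json

def json_tuple(json_text, *keys):
    """Return the values of the given top-level keys, None when a key is absent."""
    obj = json.loads(json_text)
    return tuple(obj.get(key) for key in keys)

line = '{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}'
movie, rate, ts, uid = json_tuple(line, "movie", "rate", "timeStamp", "uid")
```

The values come back as strings here because they are strings in the source JSON, so a numeric analysis would still need a cast.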

7.1.9 Analytic function: row_number() over(), for top N per group

7.1.9.1 Requirement
We have the following data:

1,18,a,male
2,19,b,male 
3,22,c,female
4,16,d,female
5,30,e,male
6,26,f,female
7,32,g,male
8,36,h,female
9,30,j,male
10,46,k,female

The task is to find the 2 oldest records for each sex.
create table t_topn(id int,age int,name string,sex string)
row format delimited fields terminated by ',';

7.1.9.2 Implementation
Use the row_number function to group the rows by sex, sort each group by age in descending order, and number the rows.

HQL:

select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rank
from t_topn;

Then filter that result for rank<=2 to get the final answer:

select id,age,name,sex
from
(select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rank
from t_topn) tmp
where rank<=2;
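The numbering step can be modelled as sort, group, enumerate. The following is a plain-Python sketch over the sample rows of this section, not Hive's windowing engine. One assumption to note: rows with equal age keep their input order here, whereas Hive gives no guarantee about which of two tied rows gets the lower number.

```python
from itertools import groupby

rows = [
    (1, 18, "a", "male"), (2, 19, "b", "male"), (3, 22, "c", "female"),
    (4, 16, "d", "female"), (5, 30, "e", "male"), (6, 26, "f", "female"),
    (7, 32, "g", "male"), (8, 36, "h", "female"), (9, 30, "j", "male"),
    (10, 46, "k", "female"),
]

def top_n_per_group(rows, n):
    """Model of row_number() over(partition by sex order by age desc) <= n."""
    ordered = sorted(rows, key=lambda r: (r[3], -r[1]))  # by sex, then age desc
    result = []
    for _sex, group in groupby(ordered, key=lambda r: r[3]):
        for row_number, row in enumerate(group, start=1):
            if row_number <= n:
                result.append(row + (row_number,))
    return result

top2 = top_n_per_group(rows, 2)
female_ids = [r[0] for r in top2 if r[3] == "female"]
male_ids = [r[0] for r in top2 if r[3] == "male"]
```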

Exercise: for the movie rating data, find the top N highest-rated records for each user.

7.2 User-defined functions
7.2.1 Requirement
We have the following data:

a,1000,5000,120
b,2200,150,200
c,2200,450,2200
d,1100,1500,320
e,2200,200,4200
f,2200,3500,620

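The four fields are: user id, base salary, sales commission, and equity income.
The task is to find which of the three kinds of income is the highest for each person.

No single built-in Hive function returns this directly, so we can develop our own Hive function (Hive provides a mechanism for this).

7.2.2 Approach

A Hive function is just encapsulated logic that takes arguments and returns a result; in essence it is no different from a Java method.

Hive lets users write a Java method that implements the desired function,
then define a function with a name of their choice in Hive and associate it with the class that contains that Java method.

7.2.3 Steps

Write a Java class that extends Hive's UDF base class and define an evaluate() method.
What the method does:
Input: 3 integer values
Output: the position (1, 2 or 3) of the largest value
In IDEA: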

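package cn.kgc;

import org.apache.hadoop.hive.ql.exec.UDF;

public class ZiDingYi extends UDF {

    // Returns the 1-based position of the largest of the three values.
    // On a tie, the earliest position wins.
    public int evaluate(int a, int b, int c) {
        if (a >= b && a >= c) {
            return 1;
        } else if (b >= c) {
            return 2;
        } else {
            return 3;
        }
    }
}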
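The logic of evaluate() can be checked outside Hive before packaging the jar. The sketch below is a plain-Python model of "position of the largest value", applied to the sample rows of this section; resolving ties to the earliest position is an assumption made for this sketch, and the English labels are illustrative.

```python
def getmax(a, b, c):
    """1-based position of the largest value; the first one wins on a tie."""
    values = [a, b, c]
    return values.index(max(values)) + 1  # list.index returns the first match

rows = [
    ("a", 1000, 5000, 120), ("b", 2200, 150, 200), ("c", 2200, 450, 2200),
    ("d", 1100, 1500, 320), ("e", 2200, 200, 4200), ("f", 2200, 3500, 620),
]
labels = {1: "base salary", 2: "commission", 3: "equity"}
report = {uid: labels[getmax(s, t, g)] for uid, s, t, g in rows}
```

Row c is the interesting case: base salary and equity are both 2200, so the answer depends entirely on the tie rule chosen.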

Package the Java project as a jar and upload it to the machine where Hive runs.
At the Hive prompt, add the jar to Hive's runtime classpath:

hive> add jar /root/kgc.jar;

Then, still at the Hive prompt, create a temporary function and associate it with the Java class in the jar:

hive> create temporary function getmax as 'cn.kgc.ZiDingYi';

The function can now be used in SQL queries. Create a table:
create table t_employee(uid string,salary int,ticheng int,guquan int)
row format delimited fields terminated by ',';
Load the data:
load data local inpath '/root/censhi.txt' into table t_employee;
Query with the custom function:
select uid,salary,ticheng,guquan,getmax(salary,ticheng,guquan) as idx
from t_employee;
Use case-when to show readable values:
select uid,salary,ticheng,guquan, case when getmax(salary,ticheng,guquan)=1 then 'base salary' when getmax(salary,ticheng,guquan)=2 then 'commission' when getmax(salary,ticheng,guquan)=3 then 'equity income' end from t_employee;

-- Note: a temporary function is valid only within one Hive session and is gone after the session restarts.

If the function is needed regularly, consider creating a permanent function.
Copy the jar into Hive's classpath:

cp  kgc.jar  /usr/local/apache-hive-1.2.1/lib/

Create the function:

create function getmax as 'cn.kgc.ZiDingYi';

Drop a function:

DROP TEMPORARY FUNCTION [IF EXISTS] function_name;
DROP FUNCTION [IF EXISTS] function_name;

8. Comprehensive query example
8.1 Word count in HQL
Given the following text file:

hello tom hello jim
hello rose hello tom
tom love rose rose love jim
jim love tom love is what
what is love

the task is to do a word count with Hive.
-- Create a table to map the file
create table t_wc(words string);
-- Load the data
load data local inpath '/root/wc.txt' into table t_wc;

HQL solution:

SELECT tmp.word,count(1) as cnts
FROM (
    SELECT explode(split(words, ' ')) AS word
    FROM t_wc
    ) tmp
GROUP BY tmp.word
order by cnts desc
;
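The expected answer for this file can be cross-checked outside Hive. The sketch below is plain Python and mirrors the three parts of the query: split each line on spaces, flatten the pieces into one word per row (the explode step), then count per word and sort by count in descending order.

```python
from collections import Counter

lines = [
    "hello tom hello jim",
    "hello rose hello tom",
    "tom love rose rose love jim",
    "jim love tom love is what",
    "what is love",
]

def word_count(lines):
    # split(words, ' ') followed by explode: one word per row
    words = (word for line in lines for word in line.split(" "))
    # group by word, count(1), order by cnts desc
    counts = Counter(words)
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

result = word_count(lines)
counts = dict(result)
```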

That is everything I have compiled so far; I hope it helps. More on big data, microservices and machine learning is on the way.
