hive的高级操作

why do not

已于 2022-04-04 12:40:56 修改

阅读量834

点赞数

分类专栏： hive 文章标签： hive

于 2020-08-25 15:34:47 首次发布

本文链接：https://blog.csdn.net/qq_42994177/article/details/90741785

版权

hive 专栏收录该内容

4 篇文章 0 订阅

订阅专栏

Hive自定义函数UDF

当hive提供的内置函数无法满足业务处理需求时，此时就可以考虑使用用户自定义函数

UDF( user-defined function )作用域单个数据行，产生一个数据行作为输出。（数学函数，字符串函数）

UDAF( 用户自定义聚集函数User-Defined Aggregation Function )：接受多个输入数据行，并产生一个输出数据行（count，max）

UDTF( 表格生成函数User-Defined Table Function)：接受一行输入，输出多行（explode）

用户自定义函数开发：

1、新建一个java项目，导入hive的jar包

2、新建一个类：UDF_Test 必须继承 UDF

UDF_Test extends UDF

3、在UDF_Test类的内部，实现一个到多个重载的名字叫做evaluate的方法

4、将编写好的用户自定义函数类导成jar添加到hive的classpath中

hive > add jar /home/hadoop/udf.jar;

hive > list jar; //查看当前加入的jar ，可以写成list jars;

5、最重要的一步

创建一个临时函数关联到用户自定义的类

create temporary function myfunc as "com.ghgj.hive.udf.UDF_Test"; //类的权限定名称

6、验证

show functions; 能看到自定义的函数myfunc

select myfunc(3,4);

select myfunc(4);

select myfunc("String", true)

注意：

当前添加的用户自定义函数，仅仅只是临时性的。所以，当当前这个会话断开时，这个临时函数就失效。下次再使用，请重新添加，重复4 、5步

永久函数：修改源代码不建议使用

Json数据解析UDF开发

现有原始json数据(rating.json)如下：

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}

求：1. 求评分最高的3部电影（超过50次点评）

求被评分次数最高的3部电影

两种方式：

自定义函数

自定义一个函数 parse_json：能解析json格式的字符串数据，然后输出：movieid

Select parse_json(t.line) as movieid,count(*) as total from test t group by parse_json(t.line)

Group by 在select前执行

解析json格式的数据有最常用的三种方式：

gson
fastjson
jsckson

其实，hive给提供了一个内置函数使用

get_json_object();

第二题：select get_json_object(t.line,’$.movie’) as movieid,count(*) as total from t_json t group by get_json_object(t.line,’$.movie’) order by total desc limit 3;

hive中的json解析：

map+数组的格式 [ {}，{} ] {“tt”:[{},{}]}

用户观影的评分数据：

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}

{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}

{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}

{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}

需求：

评分最高的电影

select * from movie order by rate desc limit 1;

关键是我的原始数据是json格式的

普通的建表：

create table test_json(movie string,rate int,times string,uid int) row format delimited fields terminated by ',';

加载数据：

load data local inpath '/home/hadoop/apps/movie.json' into table test_json;

查询：

{"movie":"3408" "rate":"4" "timeStamp":"978300275" "uid":"1"}

{"movie":"1193" NULL "timeStamp":"978300760" NULL

怎么解析json：

1）自定义udf

2）内置的udf： get_json_object

语法：

get_json_object( 需要解析的json串, 需要从第一个参数中取的值的路径 )

$ : Root object json串的根目录第一级目录

. : Child operator 子节点 $.movie

[] : Subscript operator for array 取数组的下标的从0开始

* : Wildcard for [] 取数组的所有的

select get_json_object('{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}','$.movie');

select get_json_object('{

"test1": "wx9fdb8ble7ce3c68f",

"test2": "123456789",

"testData1": {

"testdatason1": "97895455"

"testdatason2":3,

"testData2": [

{

"testshuzu1":"12"

"testshuzu1":"11"

{

"testshuzu2":"13"

"testshuzu2":"14"

{

"testshuzu3":"15"

"testshuzu3":"16"

}

]

}

}','$.testData1.testData2[0].testshuzu1')

正确的建表：

create table test_json02(line string);

load data local inpath '/home/hadoop/apps/movie.json' into table test_json02;

方式1：

建最终的表：

create table test_json_final(movie string,rate int,times string,uid int);

//对原来的数据进行解析插入

insert into table test_json_final select get_json_object(t.line,'$.movie'),get_json_object(t.line,'$.rate'),get_json_object(t.line,'$.tim eStamp'),get_json_object(t.line,'$.uid') from test_json02 t;

方式2：

create table movies as select get_json_object(t.line,'$.movie') movieid,get_json_object(t.line,'$.rate')rate,get_json_object(t.line,'$.timeStamp')times,get_j son_object(t.line,'$.uid') uid from test_json02 t;

select * from test_json_final order by rate desc limit 1;

Transform实现

Json 数据：

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}

需求：把timestamp的值转换成日期编号

需求：统计movies这个表中周一到周日期间哪一天的观影人数最多

时间戳的解析：

1）自定义的udf

2）内置函数

3）transform方式脚本的方式： shell 、python

python脚本：

#!/usr/bin/python

import sys

import datetime

for line in sys.stdin: //对系统的标准输入进行循环遍历for(String line:sys.stdin)

line = line.strip() //=line.trim 去前后空格

movie,rate,unixtime,userid = line.split('\t')

weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()

print '\t'.join([movie, rate, str(weekday),userid])

脚本的使用:

1)将脚本放在classpath下

add file /home/hadoop/apps/mypython.py;

shell-----.sh sh ....show python ....py

检测：

list files;

2)建表语句

create table lastjsontable(movie int, rate int, weekday int, userid int) row format delimited fields terminated by '\t';

3）使用脚本解析数据向表中插入

insert into table lastjsontable select transform(movie,rate,times,uid) using 'python mypython.py' as (movie, rate, weekday, userid) from test_json_final;

select count(*) count_person from lastjsontable group by weekday order by count_person desc limit 1;

hive的多字节分隔符：

hive不支持多字节分隔符

建表：

create table test_hive_delimit(id int,name string) row format delimited fields terminated by '\|\|';

load data local inpath '/home/hadoop/apps/test_hive' into table test_hive_delimit;

查询：

原始数据：

1 | |zs

hive的内部并不支持多字节分隔符：

解决方案：

1）将多字节分隔符进行替换为单字节分隔符 ||--->:

1||zs

2）建表的时候自定义分割符自定义多字节的分隔符

SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe 默认只支持单字节的分隔符

InputFormat: org.apache.hadoop.mapred.TextInputFormat

OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputForma

需要将输入和分隔符换掉输入换成正则表达式的输入（不需要记住）

create table test_hive_delimit01(id int,name string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties('input.regex'='(.*)\\|\\|(.*)','output.format.string'='%1$s %2$s') stored as textfile;

加载数据：

load data local inpath '/home/hadoop/apps/test_hive' into table test_hive_delimit01;