sql系列——hive之array、map、struct、java函数（udf）、python函数、分隔符、json_tuple的处理

最新推荐文章于 2023-04-27 22:33:15 发布

qq924178473

最新推荐文章于 2023-04-27 22:33:15 发布

阅读量1.2k

点赞数

分类专栏： hive 文章标签： hive

原文链接：https://www.cnblogs.com/h-kang/p/10916609.html

版权

hive 专栏收录该内容

5 篇文章 0 订阅

订阅专栏

https://www.cnblogs.com/h-kang/p/10916609.html

原始数据

1 huangbo guangzhou,xianggang,shenzhen a1:30,a2:20,a3:100 beijing,112233,13522334455,500
2 xuzheng xianggang b2:50,b3:40 tianjin,223344,13644556677,600
3 wangbaoqiang beijing,zhejinag c1:200 chongqinjg,334455,15622334455,20

表：

use class;
create table cdt(
id int, 
name string, 
work_location array<string>, 
piaofang map<string,bigint>, 
address struct<location:string,zipcode:int,phone:string,value:int>) 
row format delimited 
fields terminated by "\t" 
collection items terminated by "," 
map keys terminated by ":" 
lines terminated by "\n";

1、数据类型处理

1.1、array数组：选取全部，按索引选取

1.2、map：查询全部，查询某个键

1.3、struct：选取全部，选取某个字段

2、函数、udf、udaf、udtf

2.1、显示自定义函数的信息

show functions;
desc function substr;
desc function extended substr;

2.2、udf、udaf、udtf

当 Hive 提供的内置函数无法满足业务处理需要时，此时就可以考虑使用用户自定义函数。

UDF（user-defined function）作用于单个数据行，产生一个数据行作为输出。（数学函数，字符串函数）

UDAF（用户定义聚集函数 User- Defined Aggregation Funcation）：接收多个输入数据行，并产生一个输出数据行。（count，max）

UDTF（表格生成函数 User-Defined Table Functions）：接收一行输入，输出多行（explode）

2.3、java中使用jar包方式使用udf

A、ToLowerCase.java：

import org.apache.hadoop.hive.ql.exec.UDF;

public class ToLowerCase extends UDF{
    
    // 必须是 public，并且 evaluate 方法可以重载
    public String evaluate(String field) {
    String result = field.toLowerCase();
    return result;
    }
    
}

B、打成 jar 包上传到服务器

C、将 jar 包添加到 hive 的 classpath

add JAR /home/hadoop/udf.jar;

D、查看jar包

list jar

E、创建临时函数与开发好的 class 关联起来

create temporary function tolowercase as 'com.study.hive.udf.ToLowerCase';

F、至此，便可以在 hql 在使用自定义的函数

2.4、使用transform+python方法创建函数

先编辑一个 python 脚本文件

########python######代码
## vi weekday_mapper.py
#!/bin/python
import sys
import datetime
for line in sys.stdin:
 line = line.strip()
 movie,rate,unixtime,userid = line.split('\t')
 weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
 print '\t'.join([movie, rate, str(weekday),userid])

保存文件然后，将文件加入 hive 的 classpath，其中rate是\t分隔的表：

hive>add file /home/hadoop/weekday_mapper.py;
hive> insert into table lastjsontable select transform(movie,rate,unixtime,userid)
using 'python weekday_mapper.py' as(movie,rate,weekday,userid) from rate;

创建最后的用来存储调用 python 脚本解析出来的数据的表：lastjsontable

create table lastjsontable(movie int, rate int, weekday int, userid int) row format delimited
fields terminated by '\t';

最后查询看数据是否正确

select distinct(weekday) from lastjsontable;

3、特殊符号（分隔符）的处理

Hive 对文件中字段的分隔符默认情况下只支持单字节分隔符，如果数据文件中的分隔符是多字符的，如下所示：

01||huangbo

02||xuzheng

03||wangbaoqiang

创建表

create table t_bi_reg(id string,name string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)\\|\\|(.*)','output.format.string'='%1$s %2$s')
stored as textfile;

导入数据并查询

0: jdbc:hive2://hadoop3:10000> load data local inpath '/home/hadoop/data.txt' into table t_bi_reg;
No rows affected (0.747 seconds)
0: jdbc:hive2://hadoop3:10000> select a.* from t_bi_reg a;

4、json_tuple+lateral view的使用

使用json_tuple:

json_tuple(jsonStr, k1, k2, ...)

qq924178473

关注

0
点赞
踩
3

收藏

觉得还不错? 一键收藏
0
评论
sql系列——hive之array、map、struct、java函数（udf）、python函数、分隔符、json_tuple的处理

https://www.cnblogs.com/h-kang/p/10916609.html原始数据1 huangbo guangzhou,xianggang,shenzhen a1:30,a2:20,a3:100 beijing,112233,13522334455,5002 xuzheng xianggang b2:50,b3:40 tianjin,223344,13644556677,6003 wangbaoqiang beijing,zhejinag c1:200 chongqinjg
复制链接

扫一扫