UDF
- UDF UDTF UDAF
Look these up as needed, based on the type of function required.
UDAF: aggregate functions that combine many rows into one value, e.g. max, min, ..., collect_set(col), ...
UDTF: functions that split one row into multiple rows, e.g. explode(ARRAY), json_tuple(jsonStr, k1, k2, ...), parse_url_tuple(url, p1, p2, ...)
Using "SELECT udtf(col) AS colAlias..." has the following restrictions:
1、No other expressions are allowed in SELECT
SELECT pageid, explode(adid_list) AS myCol... is not supported
2、UDTFs can't be nested
SELECT explode(explode(adid_list)) AS myCol... is not supported
3、GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY are not supported
SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported
To avoid these restrictions, use LATERAL VIEW.
- LateralView
1、Syntax
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
2、How it works
A lateral view first applies the UDTF to each row of the base table and then joins the resulting output rows to the input rows to form a virtual table with the supplied table alias.
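The apply-then-join-back behavior can be sketched in plain Python (a hypothetical `lateral_view` helper; the sample data mirrors the pageAds example below):

```python
# A minimal sketch of LATERAL VIEW semantics: apply the UDTF to each row,
# then join every output value back onto the row that produced it.
def lateral_view(rows, udtf, col, alias):
    out = []
    for row in rows:
        for value in udtf(row[col]):   # the UDTF may emit 0..n values
            joined = dict(row)         # copy of the source row ...
            joined[alias] = value      # ... plus the UDTF output column
            out.append(joined)
    return out

pages = [
    {"pageid": "front_page", "adid_list": [1, 2, 3]},
    {"pageid": "contact_page", "adid_list": [3, 4, 5]},
]

# explode() as a UDTF: one output row per array element
virtual = lateral_view(pages, lambda xs: xs, "adid_list", "adid")
print(len(virtual))        # 6 rows in the virtual table
print(virtual[0]["adid"])  # 1
```

Each output row keeps all the base-table columns, which is why other columns and GROUP BY become usable again.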
3、Example
Suppose we have a table pageAds:
**col name** **col type**
adid_list Array<int>
pageid STRING
with two rows of data:
pageid adid_list
front_page [1, 2, 3]
contact_page [3, 4, 5]
To count the total number of times each adid appears:
SELECT adid, count(1)
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid
GROUP BY adid;
This lifts the UDTF restrictions above: other columns may appear alongside the UDTF output, and GROUP BY works.
Output:
adid count(1)
1 1
2 1
3 2
4 1
5 1
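The same per-adid count, sketched in plain Python over the two sample rows (a sketch of what the query computes, not of how Hive executes it):

```python
from collections import Counter

pages = [
    ("front_page", [1, 2, 3]),
    ("contact_page", [3, 4, 5]),
]

# explode + GROUP BY adid + count(1), in one comprehension
counts = Counter(adid for _, adid_list in pages for adid in adid_list)
for adid in sorted(counts):
    print(adid, counts[adid])  # 1 1, 2 1, 3 2, 4 1, 5 1
```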
4、Multiple Lateral Views
SELECT * FROM exampleTable
LATERAL VIEW explode(col1) myTable1 AS myCol1
LATERAL VIEW explode(myCol1) myTable2 AS myCol2;
Each additional LATERAL VIEW runs over the rows produced by the preceding one. Here the second view explodes the column produced by the first, which flattens a nested array; if the two views instead explode independent columns, the result is their Cartesian product.
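Both behaviors can be sketched with nested loops over hypothetical data:

```python
# Chained lateral views: the second "explode" runs over the rows produced
# by the first. Exploding the first view's output flattens a nested array:
col1 = [[1, 2], [3, 4]]  # Array<Array<int>>
flattened = [x for inner in col1 for x in inner]
print(flattened)  # [1, 2, 3, 4]

# Exploding two independent array columns yields their Cartesian product:
a, b = [1, 2], ["a", "b"]
product = [(x, y) for x in a for y in b]
print(product)  # [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
```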
5、Outer Lateral Views
A LATERAL VIEW can generate zero output rows; in that case the source row would never appear in the results. LATERAL VIEW OUTER guarantees the source row is kept, with NULL substituted for the UDTF's output columns.
For example:
SELECT * FROM src LATERAL VIEW explode(array()) C AS a limit 10;
produces no output, whereas
SELECT * FROM src LATERAL VIEW OUTER explode(array()) C AS a limit 2;
outputs:
238 val_238 NULL
86 val_86 NULL
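The difference can be sketched in Python (a hypothetical helper; `None` stands in for Hive's NULL):

```python
def lateral_view(rows, outer=False):
    # rows: (key, array) pairs; the "UDTF" here is explode() over the array
    out = []
    for key, arr in rows:
        if arr:
            out.extend((key, x) for x in arr)
        elif outer:
            out.append((key, None))  # OUTER keeps the row, NULL-padded
        # a plain LATERAL VIEW silently drops the row when arr is empty
    return out

rows = [("238", []), ("86", [])]
print(lateral_view(rows))              # [] -- no results at all
print(lateral_view(rows, outer=True))  # [('238', None), ('86', None)]
```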
- Examples
1、explode
Array<int> myCol
[100,200,300]
[400,500,600]
SELECT explode(myCol) AS myNewCol FROM myTable;
Output:
100
200
300
400
500
600
If the column is of map type, the map can be exploded into key/value pairs:
SELECT explode(myMap) AS (myMapKey, myMapValue) FROM myMapTable;
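Conceptually, exploding a map emits one (key, value) row per entry; in Python terms (hypothetical map value):

```python
my_map = {"myMapKey1": "v1", "myMapKey2": "v2"}
rows = list(my_map.items())  # one (key, value) row per map entry
print(rows)  # [('myMapKey1', 'v1'), ('myMapKey2', 'v2')]
```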
2、posexplode
Array<int> myCol
[100,200,300]
[400,500,600]
SELECT posexplode(myCol) AS pos, myNewCol FROM myTable;
Output (positions start at 0):
pos myNewCol
0 100
1 200
2 300
0 400
1 500
2 600
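posexplode is explode plus the element's position; Python's `enumerate`, which is also 0-based, captures the same idea:

```python
my_col = [100, 200, 300]
rows = list(enumerate(my_col))  # (position, element) pairs, 0-based
print(rows)  # [(0, 100), (1, 200), (2, 300)]
```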
3、json_tuple
json_tuple parses the JSON string once, which is much more efficient than calling GET_JSON_OBJECT once per key to retrieve more than one key from a single JSON string. Instead of:
select a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appevents, '$.eventname') from log a;
use:
select a.timestamp, b.*
from log a lateral view json_tuple(a.appevents, 'eventid', 'eventname') b as f1, f2;
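The efficiency argument is simply one parse versus many; a Python sketch with a hypothetical event string:

```python
import json

appevents = '{"eventid": "e1", "eventname": "click"}'  # hypothetical value

# get_json_object style: the string is parsed once per extracted key
eventid = json.loads(appevents)["eventid"]
eventname = json.loads(appevents)["eventname"]

# json_tuple style: parse once, pull all keys from the parsed object
parsed = json.loads(appevents)
f1, f2 = parsed.get("eventid"), parsed.get("eventname")
print(f1, f2)  # e1 click
```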
4、parse_url_tuple
SELECT b.*
FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT 1;
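What parse_url_tuple extracts, sketched with Python's standard `urllib.parse` (hypothetical URL; Hive performs the same extraction in one pass over the string):

```python
from urllib.parse import urlparse, parse_qs

fullurl = "http://example.com/path/p.php?id=7&k=v"
u = urlparse(fullurl)
host, path, query = u.netloc, u.path, u.query      # HOST, PATH, QUERY
query_id = parse_qs(u.query).get("id", [None])[0]  # QUERY:id
print(host, path, query, query_id)  # example.com /path/p.php id=7&k=v 7
```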
- GROUPing and SORTing on f(column)
select f(col) as fc, count(*) from table_name group by fc; fails with:
FAILED: Error in semantic analysis: line 1:69 Invalid Table Alias or Column Reference fc
Repeat the expression in the GROUP BY instead:
select f(col) as fc, count(*) from table_name group by f(col);
HivePlugins
- References
https://cwiki.apache.org/confluence/display/Hive/HivePlugins
https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy
- Creating Custom UDFs
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) { return null; }
    return new Text(s.toString().toLowerCase());
  }
}
- Deploying Jars for User Defined Functions and User Defined SerDes
hive> add jar my_jar.jar;
Added my_jar.jar to class path
Or with a full path:
hive> add jar /tmp/my_jar.jar;
Added /tmp/my_jar.jar to class path
- Listing jars
hive> list jars;
my_jar.jar
- create function
create temporary function my_lower as 'com.example.hive.udf.Lower';
Invoke the function:
hive> select my_lower(title), sum(freq) from titles group by my_lower(title);
You can also have the jar added automatically as part of creating the function:
CREATE FUNCTION myfunc AS 'myclass' USING JAR 'hdfs:///path/to/jar';
Transform
- Reference
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
- Key points
1、Input columns fed into the script are converted to a tab-delimited string; NULL values are converted to the literal string \N
2、The script's output is split on tabs into String-typed columns; \N is parsed back into a NULL value, and the String values are automatically cast to the declared column types
3、User scripts can output debug information to standard error, which will be shown on the task detail page on Hadoop
- Example
* hql
hive -e "add file /data0/user/jialin5/shell2/join_json_strs.py;
SELECT transform(MAP('a','av','b','bv'), '{\"c\":\"cv\"}') USING 'python join_json_strs.py' AS str1
FROM mds_tblog_expo_dtl_feed limit 3;"
- /data0/user/jialin5/shell2/join_json_strs.py
#!/usr/bin/python
import sys
reload(sys)
sys.setdefaultencoding('utf8')
import json

for line in sys.stdin:
    try:
        cols = line.strip().split('\t')
        col_dict = {}
        for col in cols:
            col_dict.update(eval(col))
        json_str = json.dumps(col_dict)
    except Exception, e:
        json_str = '{}'
    finally:
        print json_str
- Output
{"a": "av", "c": "cv", "b": "bv"}
{"a": "av", "c": "cv", "b": "bv"}
{"a": "av", "c": "cv", "b": "bv"}