Big Data: Hive Functions


UDF

  • UDF / UDTF / UDAF
Look these up as needed, based on the kind of function you want:
UDF: operates on a single row and returns a single value, e.g. lower(str), trim(str)
UDAF: aggregate functions; multiple rows are aggregated into one value, e.g. max, min, collect_set(col), ...
UDTF: turns one row into multiple rows, e.g. explode(ARRAY), json_tuple(jsonStr, k1, k2, ...), parse_url_tuple(url, p1, p2, ...)
Using "SELECT udtf(col) AS colAlias..." has the following restrictions:
1. No other expressions are allowed in SELECT
    SELECT pageid, explode(adid_list) AS myCol... is not supported
2. UDTFs can't be nested
    SELECT explode(explode(adid_list)) AS myCol... is not supported
3. GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported
    SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported

To avoid these restrictions, use LATERAL VIEW.
  • LateralView
    1. Syntax
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*

2. How it works

A lateral view first applies the UDTF to each row of base table and then joins resulting output rows to the input rows to form a virtual table having the supplied table alias.
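The join-back behavior can be sketched in a few lines of Python (a hypothetical `lateral_view` helper for illustration, not Hive code):

```python
def lateral_view(rows, udtf, col, alias):
    # Sketch of LATERAL VIEW semantics: the UDTF emits zero or more
    # values per input row, and each value is joined back to its
    # source row under the supplied column alias.
    out = []
    for row in rows:
        for value in udtf(row[col]):
            joined = dict(row)     # keep all original columns
            joined[alias] = value  # add the UDTF output column
            out.append(joined)
    return out

rows = [
    {"pageid": "front_page", "adid_list": [1, 2, 3]},
    {"pageid": "contact_page", "adid_list": [3, 4, 5]},
]
# explode() over an array simply yields its elements
virtual_table = lateral_view(rows, lambda xs: xs, "adid_list", "adid")
```

Each of the 6 output rows carries both the original columns and the exploded `adid`.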

3. Example

Suppose we have a table pageAds:
**col name**    **col type**
adid_list       Array<int>
pageid          STRING

with two rows of data:
pageid          adid_list
front_page      [1, 2, 3]
contact_page    [3, 4, 5]

To count the total number of times each adid appears:
SELECT adid, count(1)
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid
GROUP BY adid;

This sidesteps the restrictions above, such as a UDTF not being able to appear with other expressions or be used with GROUP BY.

Output:
adid    count(1)
1           1
2           1
3           2
4           1
5           1
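The same aggregation can be checked with a few lines of Python (an illustration of the explode-then-group-by logic, not Hive):

```python
from collections import Counter

page_ads = [
    ("front_page", [1, 2, 3]),
    ("contact_page", [3, 4, 5]),
]
# Equivalent of: explode(adid_list) ... GROUP BY adid, count(1)
ad_counts = Counter(adid for _, adid_list in page_ads for adid in adid_list)
```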

4. Multiple Lateral Views

SELECT * FROM exampleTable
LATERAL VIEW explode(col1) myTable1 AS myCol1
LATERAL VIEW explode(myCol1) myTable2 AS myCol2;
Here col1 is an array of arrays: the first view explodes the outer array into myCol1, and the second explodes each inner array, so the nesting is flattened.
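With col1 an array of arrays, the two chained views amount to a nested flattening; a Python sketch (illustration only):

```python
# col1 is Array<Array<int>>: the first explode yields each inner
# array (myCol1), the second yields each of its elements (myCol2).
col1 = [[1, 2], [3, 4]]
rows = [(my_col1, my_col2) for my_col1 in col1 for my_col2 in my_col1]
```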

5. Outer Lateral Views

A LATERAL VIEW can produce zero rows; in that case the source row would never appear in the results. OUTER guarantees the source row is kept: when the UDTF produces no output, its columns are filled with NULL instead.

For example:
SELECT * FROM src LATERAL VIEW explode(array()) C AS a LIMIT 10;
produces no output at all.

SELECT * FROM src LATERAL VIEW OUTER explode(array()) C AS a limit 2;
Output:
238 val_238 NULL
86 val_86 NULL
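The OUTER behavior works like a left outer join against the UDTF output; a hypothetical Python sketch (illustration, not Hive code):

```python
def lateral_view_outer(rows, udtf, col, alias):
    # OUTER variant: if the UDTF emits nothing for a row,
    # keep the row once with a NULL (None) output column.
    out = []
    for row in rows:
        values = list(udtf(row[col])) or [None]
        for value in values:
            out.append({**row, alias: value})
    return out

src = [
    {"key": 238, "value": "val_238", "arr": []},
    {"key": 86, "value": "val_86", "arr": []},
]
# explode(array()) emits nothing, but OUTER keeps each source row
result = lateral_view_outer(src, lambda xs: xs, "arr", "a")
```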
  • Examples
    1. explode
Array<int> myCol
[100,200,300]
[400,500,600]
SELECT explode(myCol) AS myNewCol FROM myTable;
Output:
100
200
300
400
500
600
If the column is of map type, you can write:
SELECT explode(myMap) AS (myMapKey, myMapValue) FROM myMapTable;

2. posexplode

Array<int> myCol
[100,200,300]
[400,500,600]
SELECT posexplode(myCol) AS pos, myNewCol FROM myTable;
Output (positions are 0-based):
pos myNewCol
0   100
1   200
2   300
0   400
1   500
2   600
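posexplode pairs each element with its index within its array, which in Python is simply enumerate (illustration only):

```python
my_col = [[100, 200, 300], [400, 500, 600]]
# posexplode: emit (position, element) for every array element;
# the position restarts at 0 for each source row.
rows = [(pos, v) for arr in my_col for pos, v in enumerate(arr)]
```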

3. json_tuple

This is much more efficient than calling GET_JSON_OBJECT to retrieve more than one key from a single JSON string. Instead of:

select a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appevents, '$.eventname') from log a;

use:

select a.timestamp, b.*
from log a lateral view json_tuple(a.appevents, 'eventid', 'eventname') b as f1, f2;
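The efficiency gain comes from parsing the JSON string once and extracting all requested keys, rather than re-parsing per key; a hypothetical Python sketch of that idea:

```python
import json

def json_tuple(json_str, *keys):
    # Parse once, then pull every key -- unlike calling
    # get_json_object repeatedly, which re-parses per key.
    try:
        obj = json.loads(json_str)
    except ValueError:
        return tuple(None for _ in keys)
    return tuple(obj.get(k) for k in keys)

pair = json_tuple('{"eventid": "e1", "eventname": "click"}',
                  "eventid", "eventname")
```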

4. parse_url_tuple

SELECT b.*
FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT 1;
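parse_url_tuple likewise parses the URL once and extracts several parts; a rough Python equivalent using urllib.parse (illustration only, not the Hive implementation):

```python
from urllib.parse import urlparse, parse_qs

def parse_url_tuple(url, *parts):
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    out = []
    for part in parts:
        if part == "HOST":
            out.append(parsed.hostname)
        elif part == "PATH":
            out.append(parsed.path)
        elif part == "QUERY":
            out.append(parsed.query)
        elif part.startswith("QUERY:"):  # QUERY:<key> extracts one parameter
            values = query.get(part.split(":", 1)[1])
            out.append(values[0] if values else None)
        else:
            out.append(None)
    return tuple(out)

row = parse_url_tuple("http://example.com/a/b?id=7&x=1",
                      "HOST", "PATH", "QUERY", "QUERY:id")
```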
  • GROUPing and SORTing on f(column)
select f(col) as fc, count(*) from table_name group by fc; will fail with:
FAILED: Error in semantic analysis: line 1:69 Invalid Table Alias or Column Reference fc

Use this instead:
select f(col) as fc, count(*) from table_name group by f(col);

HivePlugins

  • References
https://cwiki.apache.org/confluence/display/Hive/HivePlugins
https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy
  • Creating Custom UDFs
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) { return null; }
    return new Text(s.toString().toLowerCase());
  }
}
  • Deploying Jars for User Defined Functions and User Defined SerDes
hive> add jar my_jar.jar;
Added my_jar.jar to class path

Or specify the full path:
hive> add jar /tmp/my_jar.jar;
Added /tmp/my_jar.jar to class path
  • Listing jars
hive> list jars;
my_jar.jar
  • create function
create temporary function my_lower as 'com.example.hive.udf.Lower';
Calling the function:
hive> select my_lower(title), sum(freq) from titles group by my_lower(title);

You can also register the jar at function creation time:
CREATE FUNCTION myfunc AS 'myclass' USING JAR 'hdfs:///path/to/jar';

Transform

References

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform

Key points

1. The input columns are converted to a tab-separated string before being passed to the script; NULL values become \N.
2. The script's output is split on tabs into String-typed columns; \N is parsed back into NULL, and the String values are automatically cast to the declared column types.
3. User scripts can output debug information to standard error, which will be shown on the task detail page on Hadoop.
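The tab/\N encoding described in points 1-2 can be sketched with two hypothetical helpers (illustration only, not Hive internals):

```python
def encode_row(cols):
    # Columns -> tab-separated line; NULL (None) -> \N
    return "\t".join("\\N" if c is None else str(c) for c in cols)

def decode_line(line):
    # Tab-separated line -> string columns; \N -> NULL (None)
    return [None if c == "\\N" else c for c in line.split("\t")]

line = encode_row(["a", None, 3])
cols = decode_line(line)
```

Note that decoded values come back as strings; the cast to the declared column types happens on the Hive side.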

Example
  • HQL

hive -e "add file /data0/user/jialin5/shell2/join_json_strs.py;
SELECT transform(MAP('a','av','b','bv'), '{\"c\":\"cv\"}') USING 'python join_json_strs.py' AS str1 
FROM mds_tblog_expo_dtl_feed limit 3;"
  • /data0/user/jialin5/shell2/join_json_strs.py

#!/usr/bin/python

import sys
import json

# Merge the tab-separated input columns (each one a dict/JSON literal)
# into a single JSON string per row.
for line in sys.stdin:
    try:
        cols = line.strip().split('\t')
        col_dict = {}
        for col in cols:
            col_dict.update(eval(col))  # each column parses as a dict literal
        json_str = json.dumps(col_dict)
    except Exception:
        json_str = '{}'
    print(json_str)
  • Output
{"a": "av", "c": "cv", "b": "bv"}
{"a": "av", "c": "cv", "b": "bv"}
{"a": "av", "c": "cv", "b": "bv"}

References

Official documentation
