UDF
- UDF UDTF UDAF
Look these up as needed, based on the type of function required.
UDAF: aggregate functions that combine many rows into one value, e.g. max, min, ..., collect_set(col), ...
UDTF: functions that split one row into multiple rows, e.g. explode(ARRAY), json_tuple(jsonStr, k1, k2, ...), parse_url_tuple(url, p1, p2, ...)
Using "SELECT udtf(col) AS colAlias..." has the following restrictions:
1、No other expressions are allowed in SELECT
SELECT pageid, explode(adid_list) AS myCol... is not supported
2、UDTFs can't be nested
SELECT explode(explode(adid_list)) AS myCol... is not supported
3、GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY are not supported
SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported
To avoid these restrictions, use LATERAL VIEW.
- LateralView
1、Syntax
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
2、How it works
A lateral view first applies the UDTF to each row of the base table and then joins the resulting output rows to the input rows to form a virtual table with the supplied table alias.
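The apply-then-join-back behavior can be sketched in plain Python (a hypothetical `lateral_view` helper; the sample data mirrors the pageAds example below):

```python
# A minimal sketch of LATERAL VIEW semantics: apply the UDTF to each row,
# then join every output value back onto the row that produced it.
def lateral_view(rows, udtf, col, alias):
    out = []
    for row in rows:
        for value in udtf(row[col]):   # the UDTF may emit 0..n values
            joined = dict(row)         # copy of the source row ...
            joined[alias] = value      # ... plus the UDTF output column
            out.append(joined)
    return out

pages = [
    {"pageid": "front_page", "adid_list": [1, 2, 3]},
    {"pageid": "contact_page", "adid_list": [3, 4, 5]},
]

# explode() as a UDTF: one output row per array element
virtual = lateral_view(pages, lambda xs: xs, "adid_list", "adid")
print(len(virtual))        # 6 rows in the virtual table
print(virtual[0]["adid"])  # 1
```

Each output row keeps all the base-table columns, which is why other columns and GROUP BY become usable again.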
3、Example
Suppose we have a table pageAds:
**col name** **col type**
adid_list Array<int>
pageid STRING
with two rows of data:
pageid adid_list
front_page [1, 2, 3]
contact_page [3, 4, 5]
To count the total number of times each adid appears:
SELECT adid, count(1)
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid
GROUP BY adid;
This lifts the UDTF restrictions above: other columns may appear alongside the UDTF output, and GROUP BY works.
Output:
adid count(1)
1 1
2 1
3 2
4 1
5 1
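The same per-adid count, sketched in plain Python over the two sample rows (a sketch of what the query computes, not of how Hive executes it):

```python
from collections import Counter

pages = [
    ("front_page", [1, 2, 3]),
    ("contact_page", [3, 4, 5]),
]

# explode + GROUP BY adid + count(1), in one comprehension
counts = Counter(adid for _, adid_list in pages for adid in adid_list)
for adid in sorted(counts):
    print(adid, counts[adid])  # 1 1, 2 1, 3 2, 4 1, 5 1
```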
4、Multiple Lateral Views
SELECT * FROM exampleTable
LATERAL VIEW explode(col1) myTable1 AS myCol1
LATERAL VIEW explode(myCol1) myTable2 AS myCol2;
Each additional LATERAL VIEW runs over the rows produced by the preceding one. Here the second view explodes the column produced by the first, which flattens a nested array; if the two views instead explode independent columns, the result is their Cartesian product.
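Both behaviors can be sketched with nested loops over hypothetical data:

```python
# Chained lateral views: the second "explode" runs over the rows produced
# by the first. Exploding the first view's output flattens a nested array:
col1 = [[1, 2], [3, 4]]  # Array<Array<int>>
flattened = [x for inner in col1 for x in inner]
print(flattened)  # [1, 2, 3, 4]

# Exploding two independent array columns yields their Cartesian product:
a, b = [1, 2], ["a", "b"]
product = [(x, y) for x in a for y in b]
print(product)  # [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
```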
5、Outer Lateral Views
A LATERAL VIEW can generate zero output rows; in that case the source row would never appear in the results. LATERAL VIEW OUTER guarantees the source row is kept, with NULL substituted for the UDTF's output columns.
For example:
SELECT * FROM src LATERAL VIEW explode(array()) C AS a limit 10;
produces no output, whereas
SELECT * FROM src LATERAL VIEW OUTER explode(array()) C AS a limit 2;
outputs:
238 val_238 NULL
86 val_86 NULL
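The difference can be sketched in Python (a hypothetical helper; `None` stands in for Hive's NULL):

```python
def lateral_view(rows, outer=False):
    # rows: (key, array) pairs; the "UDTF" here is explode() over the array
    out = []
    for key, arr in rows:
        if arr:
            out.extend((key, x) for x in arr)
        elif outer:
            out.append((key, None))  # OUTER keeps the row, NULL-padded
        # a plain LATERAL VIEW silently drops the row when arr is empty
    return out

rows = [("238", []), ("86", [])]
print(lateral_view(rows))              # [] -- no results at all
print(lateral_view(rows, outer=True))  # [('238', None), ('86', None)]
```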
- Examples
1、explode
Array<int> myCol
[100,200,300]
[400,500,600]
SELECT explode(myCol) AS myNewCol FROM myTable;
Output:
100
200
300
400
500
600
If the column is of map type, the map can be exploded into key/value pairs:
SELECT explode(myMap) AS (myMapKey, myMapValue) FROM myMapTable;
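Conceptually, exploding a map emits one (key, value) row per entry; in Python terms (hypothetical map value):

```python
my_map = {"myMapKey1": "v1", "myMapKey2": "v2"}
rows = list(my_map.items())  # one (key, value) row per map entry
print(rows)  # [('myMapKey1', 'v1'), ('myMapKey2', 'v2')]
```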
2、posexplode
Array<int> myCol
[100,200,300]
[400,500,600]
SELECT posexplode(myCol) AS pos, myNewCol FROM myTable;
Output (positions start at 0):
pos myNewCol
0 100
1 200
2 300
0 400
1 500
2 600
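posexplode is explode plus the element's position; Python's `enumerate`, which is also 0-based, captures the same idea:

```python
my_col = [100, 200, 300]
rows = list(enumerate(my_col))  # (position, element) pairs, 0-based
print(rows)  # [(0, 100), (1, 200), (2, 300)]
```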
3、json_tuple
json_tuple parses the JSON string once, which is much more efficient than calling GET_JSON_OBJECT once per key to retrieve more than one key from a single JSON string. Instead of:
select a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appevents, '$.eventname') from log a;
use:
select a.timestamp, b.*
from log a lateral view json_tuple(a.appevents, 'eventid', 'eventname') b as f1, f2;
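The efficiency argument is simply one parse versus many; a Python sketch with a hypothetical event string:

```python
import json

appevents = '{"eventid": "e1", "eventname": "click"}'  # hypothetical value

# get_json_object style: the string is parsed once per extracted key
eventid = json.loads(appevents)["eventid"]
eventname = json.loads(appevents)["eventname"]

# json_tuple style: parse once, pull all keys from the parsed object
parsed = json.loads(appevents)
f1, f2 = parsed.get("eventid"), parsed.get("eventname")
print(f1, f2)  # e1 click
```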
4、parse_url_tuple
SELECT b.*
FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT 1;
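What parse_url_tuple extracts, sketched with Python's standard `urllib.parse` (hypothetical URL; Hive performs the same extraction in one pass over the string):

```python
from urllib.parse import urlparse, parse_qs

fullurl = "http://example.com/path/p.php?id=7&k=v"
u = urlparse(fullurl)
host, path, query = u.netloc, u.path, u.query      # HOST, PATH, QUERY
query_id = parse_qs(u.query).get("id", [None])[0]  # QUERY:id
print(host, path, query, query_id)  # example.com /path/p.php id=7&k=v 7
```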
- GROUPing and SORTing on f(column)
select f(col) as fc, count(*) from table_name group by fc; fails with:
FAILED: Error in semantic analysis: line 1:69 Invalid Table Alias or Column Reference fc
Repeat the expression in the GROUP BY instead:
select f(col) as fc, count(*) from table_name group by f(col);
HivePlugins
- References
https://cwiki.apache.org/confluence/display/Hive/HivePlugins
https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy
- Creating Custom UDFs
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) { return null; }
    return new Text(s.toString().toLowerCase());
  }
}
- Deploying Jars for User Defined Functions and User Defined SerDes
hive> add jar my_jar.jar;
Added my_jar.jar to class path
Or with a full path:
hive> add jar /tmp/my_jar.jar;
Added /tmp/my_jar.jar to class path
- Listing jars
hive> list jars;
my_jar.jar
- create function
create temporary function my_lower as 'com.example.hive.udf.Lower';
Invoke the function:
hive> select my_lower(title), sum(freq) from titles group by my_lower(title);
You can also have the jar added automatically as part of creating the function:
CREATE FUNCTION myfunc AS 'myclass' USING JAR 'hdfs:///path/to/jar';
Transform
- Reference
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
- Key points
1、Input columns fed into the script are converted to a tab-delimited string; NULL values are converted to the literal string \N
2、The script's output is split on tabs into String-typed columns; \N is parsed back into a NULL value, and the String values are automatically cast to the declared column types
3、User scripts can output debug information to standard error, which will be shown on the task detail page on Hadoop
- Example
* hql
hive -e "add file /data0/user/jialin5/shell2/join_json_strs.py;
SELECT transform(MAP('a','av','b','bv'), '{\"c\":\"cv\"}') USING 'python join_json_strs.py' AS str1
FROM mds_tblog_expo_dtl_feed limit 3;"
- /data0/user/jialin5/shell2/join_json_strs.py
#!/usr/bin/python
import sys
reload(sys)
sys.setdefaultencoding('utf8')
import json

for line in sys.stdin:
    try:
        cols = line.strip().split('\t')
        col_dict = {}
        for col in cols:
            col_dict.update(eval(col))
        json_str = json.dumps(col_dict)
    except Exception, e:
        json_str = '{}'
    finally:
        print json_str
- Output
{"a": "av", "c": "cv", "b": "bv"}
{"a": "av", "c": "cv", "b": "bv"}
{"a": "av", "c": "cv", "b": "bv"}