UDF stands for user-defined function, i.e. a function customized for Hive. Hive ships with many built-in functions, and the `desc function` command makes it easy to check how an existing function is used:
hive (default)> desc function split;
OK
tab_name
split(str, regex) - Splits str around occurances that match regex
Time taken: 0.934 seconds, Fetched: 1 row(s)
hive (default)> desc function extended split;
OK
tab_name
split(str, regex) - Splits str around occurances that match regex
Example:
> SELECT split('oneAtwoBthreeC', '[ABC]') FROM src LIMIT 1;
["one", "two", "three"]
Time taken: 0.112 seconds, Fetched: 4 row(s)
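The full list of registered functions can also be filtered by a quoted name pattern, which is handy when you only remember part of a function's name. A small example (pattern syntax per Hive's SHOW FUNCTIONS):

```sql
-- list built-in functions whose names match the pattern
show functions like 'spl*';
```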
In practice we often need to develop such functions ourselves to meet business requirements; for a worked example, refer to the official documentation: Hive Plugins.
UDFs fall into three categories:
UDF - one in, one out: takes a single value and returns a single processed value
UDAF - many in, one out: an aggregate function, similar to count or sum
UDTF - one in, many out: takes a single input and returns multiple rows as the result
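Each category has familiar built-in representatives, which can serve as a quick illustration (using only Hive built-ins, plus the emp table from this session):

```sql
-- UDF: one value in, one value out
select upper('hive');             -- returns 'HIVE'

-- UDAF: many rows in, one value out
select count(*) from emp;

-- UDTF: one input, multiple rows out
select explode(array(1, 2, 3));   -- returns three rows: 1, 2, 3
```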
Following the UDF example in the official documentation, we write a function that converts a string to lowercase:
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class MyLower extends UDF {
    // Hive calls evaluate() once per input row; parameters and return
    // value use Hadoop writable types (Text), not Java primitives
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }

    // simple local smoke test, runnable outside Hive
    public static void main(String[] args) {
        System.out.println(new MyLower().evaluate(new Text("HIVE")));
    }
}
Note that a UDF class may overload the evaluate method; its parameters should be Hadoop writable types rather than Java primitives, and the method must not return void (returning null is allowed). Save the code above to MyLower.java and upload it to /opt/datas/com/example/hive/udf; compiling and running it looks like this:
[root@node3 datas]# mkdir -p com/example/hive/udf
[root@node3 datas]# ls
000000_0 com distby-res exp_res.txt hive_exp_emp2
clustby-res dept.txt emp.txt hive_exp_emp sortby-res
[root@node3 datas]# javac -classpath .:/opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/jars/* com/example/hive/udf/MyLower.java
Note: com/example/hive/udf/MyLower.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
[root@node3 datas]# java -classpath .:/opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/jars/* com/example/hive/udf/MyLower
hive
[root@node3 datas]# jar cf MyLower.jar com/example/hive/udf/MyLower.class
[root@node3 datas]# ls
000000_0 com distby-res exp_res.txt hive_exp_emp2 sortby-res
clustby-res dept.txt emp.txt hive_exp_emp MyLower.jar
If compilation succeeds, MyLower.jar is generated. The jar then needs to be loaded into Hive with the following command:
add jar /opt/datas/MyLower.jar;
Then create a temporary function:
create temporary function my_lower as "com.example.hive.udf.MyLower";
The result looks like this:
hive (default)> add jar /opt/datas/MyLower.jar;
Added [/opt/datas/MyLower.jar] to class path
Added resources: [/opt/datas/MyLower.jar]
hive (default)> create temporary function my_lower as "com.example.hive.udf.MyLower" ;
OK
Time taken: 2.516 seconds
hive (default)> show functions my_lower;
SHOW FUNCTIONS is deprecated, please use SHOW FUNCTIONS LIKE instead.
OK
tab_name
my_lower
Time taken: 0.559 seconds, Fetched: 1 row(s)
Our new function has been registered successfully. Let's try it out:
hive (default)> select ename, my_lower(ename) lowername from emp limit 3 ;
Query ID = hive_20190221222424_216b7dd7-a0c0-463b-b4cc-dcdcb16aa741
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1550060164760_0027, Tracking URL = http://node1:8088/proxy/application_1550060164760_0027/
Kill Command = /opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/lib/hadoop/bin/hadoop job -kill job_1550060164760_0027
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-02-21 22:26:34,976 Stage-1 map = 0%, reduce = 0%
2019-02-21 22:27:35,513 Stage-1 map = 0%, reduce = 0%
2019-02-21 22:28:07,558 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 27.74 sec
MapReduce Total cumulative CPU time: 27 seconds 740 msec
Ended Job = job_1550060164760_0027
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 27.74 sec HDFS Read: 5024 HDFS Write: 34 SUCCESS
Total MapReduce CPU Time Spent: 27 seconds 740 msec
OK
ename lowername
SMITH smith
ALLEN allen
WARD ward
Time taken: 204.818 seconds, Fetched: 3 row(s)
As shown, the employees' uppercase names were converted to lowercase.
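A temporary function like this disappears when the Hive session ends, but it can also be removed explicitly; a minimal example:

```sql
-- temporary functions exist only for the current session
drop temporary function if exists my_lower;
```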
A function can also be created from a jar stored on HDFS; first upload the jar to an HDFS directory:
hive (default)> dfs -mkdir -p /user/hive/user1/jars;
hive (default)> dfs -put /opt/datas/MyLower.jar /user/hive/user1/jars;
hive (default)> dfs -ls /user/hive/user1/jars;
Found 1 items
-rw-r--r-- 3 hive hive 984 2019-02-21 22:36 /user/hive/user1/jars/MyLower.jar
Once the jar is on HDFS, the custom function can be created:
hive (default)> CREATE FUNCTION self_lower AS 'com.example.hive.udf.MyLower' USING JAR 'hdfs://node1:8020/user/hive/user1/jars/MyLower.jar';
converting to local hdfs://node1:8020/user/hive/user1/jars/MyLower.jar
Added [/tmp/781e39e0-1cfa-40cc-94d6-9313102f4104_resources/MyLower.jar] to class path
Added resources: [hdfs://node1:8020/user/hive/user1/jars/MyLower.jar]
OK
Time taken: 3.75 seconds
hive (default)> select ename, self_lower(ename) lowername from emp limit 3 ;
Query ID = hive_20190221224444_82e00c3a-deda-4ce7-9416-859464c1e277
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1550060164760_0028, Tracking URL = http://node1:8088/proxy/application_1550060164760_0028/
Kill Command = /opt/cloudera/parcels/CDH-5.15.2-1.cdh5.15.2.p0.3/lib/hadoop/bin/hadoop job -kill job_1550060164760_0028
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-02-21 22:45:20,111 Stage-1 map = 0%, reduce = 0%
2019-02-21 22:45:43,217 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 16.3 sec
MapReduce Total cumulative CPU time: 16 seconds 300 msec
Ended Job = job_1550060164760_0028
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 16.3 sec HDFS Read: 5027 HDFS Write: 34 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 300 msec
OK
ename lowername
SMITH smith
ALLEN allen
WARD ward
Time taken: 51.599 seconds, Fetched: 3 row(s)
Note that a function created this way cannot be viewed with show functions, but it can be used normally.
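Unlike the temporary variant, a function created with CREATE FUNCTION persists across sessions; when it is no longer needed it should be dropped explicitly. A minimal sketch:

```sql
-- remove the permanent function registered from the HDFS jar
drop function if exists self_lower;
```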
For more big-data content, follow the WeChat official account: 大数据与人工智能初学者