1. Define your own UDF
package com.hihi.hive;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class HelloWord extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) {
            return null;
        }
        return new Text("HelloWord:" + s.toString().toLowerCase());
    }
}
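The Text-based evaluate() above needs a Hadoop classpath to compile. As a quick sanity check of the string logic on its own, here is a dependency-free sketch (HelloWordLogic is a hypothetical stand-in class, not part of the project):

```java
// Plain-Java version of the evaluate() logic, without the Hive/Hadoop types,
// so the transformation can be verified with nothing but a JDK.
public class HelloWordLogic {
    public static String evaluate(String s) {
        if (s == null) {
            return null;  // Hive passes NULL through unchanged
        }
        return "HelloWord:" + s.toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(evaluate("SMITH"));  // HelloWord:smith
    }
}
```

Note that the UDF lowercases its input, so an uppercase ename such as SMITH comes back as "HelloWord:smith".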
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>study-hadoop</groupId>
    <artifactId>hive</artifactId>
    <version>1.0</version>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
        <hive.version>1.1.0-cdh5.7.0</hive.version>
    </properties>
    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>${hive.version}</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.10</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>
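With this pom in place, the jar is built with a standard Maven invocation (the artifact name hive-1.0.jar comes from the artifactId and version above):

```shell
mvn clean package -DskipTests
# the jar is produced at target/hive-1.0.jar
```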
2. Package the code and upload it to the server
[root@hadoop001 jar]# rz
rz waiting to receive.
Starting zmodem transfer. Press Ctrl+C to cancel.
Transferring hive-1.0.jar...
100% 2 KB 2 KB/sec 00:00:01 0 Errors
[root@hadoop001 jar]# pwd
/home/hadoop/jar
3. Enter the hive CLI and run the following commands. This creates a temporary function, which is only valid for the current session.
hive> add jar /home/hadoop/jar/hive-1.0.jar;
Added [/home/hadoop/jar/hive-1.0.jar] to class path
Added resources: [/home/hadoop/jar/hive-1.0.jar]
hive> create temporary function my_hello as 'com.hihi.hive.HelloWord';
OK
Time taken: 0.016 seconds
hive> select ename, my_hello(ename) from emp_dept_partition limit 3;
OK
SMITH HelloWord:smith
JONES HelloWord:jones
SCOTT HelloWord:scott
Time taken: 0.124 seconds, Fetched: 3 row(s)
hive> list jars;
/home/hadoop/jar/hive-1.0.jar
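While still in the same session, the registration can be double-checked (a sketch; the exact output format varies by Hive version):

```sql
DESCRIBE FUNCTION my_hello;
DESCRIBE FUNCTION EXTENDED my_hello;  -- also shows the implementing class
```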
4. Querying the metastore (MySQL) shows no rows for the function
mysql> select * from funcs;
Empty set (0.00 sec)
5. Start a new session. Since the function created earlier was temporary, the call now fails:
hive> select ename, my_hello(ename) from emp_dept_partition limit 3;
FAILED: SemanticException [Error 10011]: Line 1:14 Invalid function 'my_hello'
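A temporary function exists only in the session that created it, so with this approach both steps have to be repeated in every new session:

```sql
ADD JAR /home/hadoop/jar/hive-1.0.jar;
CREATE TEMPORARY FUNCTION my_hello AS 'com.hihi.hive.HelloWord';
```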
6. Try creating a permanent function
hive> add jar /home/hadoop/jar/hive-1.0.jar;
Added [/home/hadoop/jar/hive-1.0.jar] to class path
Added resources: [/home/hadoop/jar/hive-1.0.jar]
hive> create function my_hello as 'com.hihi.hive.HelloWord';
OK
Time taken: 0.016 seconds
7. Query the metastore again: funcs now has a row for the function, but the func_ru table is still empty.
mysql> select * from funcs;
+---------+-------------------------+-------------+-------+-----------+-----------+------------+------------+
| FUNC_ID | CLASS_NAME | CREATE_TIME | DB_ID | FUNC_NAME | FUNC_TYPE | OWNER_NAME | OWNER_TYPE |
+---------+-------------------------+-------------+-------+-----------+-----------+------------+------------+
| 6 | com.hihi.hive.HelloWord | 1515675864 | 1 | my_hello | 1 | NULL | USER |
+---------+-------------------------+-------------+-------+-----------+-----------+------------+------------+
1 row in set (0.00 sec)
mysql> select * from func_ru;
Empty set (0.00 sec)
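The two metastore tables are linked by FUNC_ID; a quick join (a sketch against the standard metastore schema) makes the problem visible — the function has no registered resources:

```sql
SELECT f.FUNC_NAME, f.CLASS_NAME, r.RESOURCE_URI
FROM funcs f
LEFT JOIN func_ru r ON f.FUNC_ID = r.FUNC_ID;
-- RESOURCE_URI comes back NULL: no jar is associated with the function
```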
8. Re-enter the session: the call still fails. The earlier ADD JAR only affected the previous session's classpath, and since no resource was recorded for the function, a new session has no way to locate the class:
hive> select ename, my_hello(ename) from emp_dept_partition limit 3;
FAILED: SemanticException [Error 10011]: Line 1:14 Invalid function 'my_hello'
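Before re-registering the function with a resource URI, the broken definition has to be removed first, otherwise CREATE FUNCTION fails because the name already exists:

```sql
DROP FUNCTION IF EXISTS my_hello;
```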
9. Try registering the jar from HDFS instead
CREATE FUNCTION my_hello AS 'com.hihi.hive.HelloWord' USING JAR 'hdfs://hadoop001:9000/jar/hive-1.0.jar';
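This assumes the jar has already been copied to HDFS, e.g. (using the same paths as above):

```shell
hdfs dfs -mkdir -p /jar
hdfs dfs -put -f /home/hadoop/jar/hive-1.0.jar /jar/
hdfs dfs -ls /jar
```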
10. Check the metastore: func_ru now holds a row for my_hello. So does Hive, each time the function is called, read the metastore, reload the jar, and re-create the function?
mysql> select * from func_ru;
+---------+---------------+----------------------------------------+-------------+
| FUNC_ID | RESOURCE_TYPE | RESOURCE_URI | INTEGER_IDX |
+---------+---------------+----------------------------------------+-------------+
| 11 | 1 | hdfs://hadoop001:9000/jar/hive-1.0.jar | 0 |
+---------+---------------+----------------------------------------+-------------+
1 row in set (0.00 sec)
mysql> select * from funcs;
+---------+-------------------------+-------------+-------+-----------+-----------+------------+------------+
| FUNC_ID | CLASS_NAME | CREATE_TIME | DB_ID | FUNC_NAME | FUNC_TYPE | OWNER_NAME | OWNER_TYPE |
+---------+-------------------------+-------------+-------+-----------+-----------+------------+------------+
| 11 | com.hihi.hive.HelloWord | 1515676179 | 1 | my_hello | 1 | NULL | USER |
+---------+-------------------------+-------------+-------+-----------+-----------+------------+------------+
1 row in set (0.00 sec)
11. Log back in and check whether the jar is loaded before calling the function. You will see the jar is reloaded when the function is first called; where to load it from is recorded in the metastore's func_ru table.
hive> list jar;
hive> select ename, my_hello(ename) from emp_dept_partition limit 3;
converting to local hdfs://hadoop001:9000/jar/hive-1.0.jar
Added [/tmp/9da42cea-1284-46f1-9969-74dc80ed05fe_resources/hive-1.0.jar] to class path
Added resources: [hdfs://hadoop001:9000/jar/hive-1.0.jar]
OK
SMITH HelloWord:smith
JONES HelloWord:jones
SCOTT HelloWord:scott
Time taken: 1.252 seconds, Fetched: 3 row(s)
hive> list jar;
/tmp/9da42cea-1284-46f1-9969-74dc80ed05fe_resources/hive-1.0.jar
Drawbacks: the `show functions` command does not list this function, and having to reload the jar in every session is still rather inconvenient. A better solution will be explored in a follow-up.
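One caveat worth checking first (behavior is version-dependent): permanent functions are registered under a database, so they may appear db-qualified in the full listing rather than under the bare name:

```sql
SHOW FUNCTIONS;                        -- look for default.my_hello, not my_hello
DESCRIBE FUNCTION default.my_hello;
```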
[From @若泽大数据]