Background: A requirement came up at work. A Hive log table, orig_client, holds data that was cleaned and filtered by a single MapReduce program, mr1.jar. I rewrote that program (call it mr2.jar), changing only the generation rule for the id field; no other field was touched. The output of mr2.jar is loaded into orig_client_test, a table created with exactly the same DDL as orig_client. In theory, the two tables should now differ only in the value of id. Because the table has many columns, I concatenate every field except id into one very long string, hash it with an MD5 function into a 32-character string, and compare the hashes instead of the raw fields, which speeds up the comparison (the final comparison query is sketched at the end of this post).
Steps:
1. Create the Maven project and configure pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.qunar</groupId>
  <artifactId>hive_md5</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <build/>
  <properties>
    <hadoopVersion>2.5.2</hadoopVersion>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
    </dependency>
    <!-- Hadoop start -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoopVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoopVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>${hadoopVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoopVersion}</version>
    </dependency>
    <!-- Hadoop end -->
    <dependency>
      <groupId>jdk.tools</groupId>
      <artifactId>jdk.tools</artifactId>
      <version>1.6</version>
      <scope>system</scope>
      <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>0.8.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/javax.jdo/jdo2-api -->
    <dependency>
      <groupId>javax.jdo</groupId>
      <artifactId>jdo2-api</artifactId>
      <version>2.3-20090302111651</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.2</version>
    </dependency>
  </dependencies>
</project>
2. Java code
package com.qunar.hive_md5;

import java.security.MessageDigest;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;

public class MD5 extends UDF {

    // Hive calls evaluate() once per row. Returns the 32-character
    // lowercase hex MD5 digest of the input, or "" for blank input.
    public String evaluate(final String str) {
        if (StringUtils.isBlank(str)) {
            return "";
        }
        String digest = null;
        StringBuilder buffer = new StringBuilder();
        try {
            MessageDigest digester = MessageDigest.getInstance("MD5");
            byte[] digestArray = digester.digest(str.getBytes("UTF-8"));
            // Render each byte as two hex characters.
            for (int i = 0; i < digestArray.length; i++) {
                buffer.append(String.format("%02x", digestArray[i]));
            }
            digest = buffer.toString();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return digest;
    }

    // Quick local test
    public static void main(String[] args) {
        MD5 md5 = new MD5();
        System.out.println(md5.evaluate("dsfasfsafas"));
    }
}
3. Build the jar (for example with mvn clean package) and upload it to the Linux server
My jar path: /home/q/server/data/mobiledcs-script/logclean/client_rxtest/hive_md5.jar (we will need it below)
4. Register the UDF in Hive
(1) Temporary registration
Enter the Hive CLI, then:
Add the jar: add jar /home/q/server/data/mobiledcs-script/logclean/client_rxtest/hive_md5.jar;
Create the UDF: create temporary function md5 as 'com.qunar.hive_md5.MD5';
Test: select md5('123456');
Result: e10adc3949ba59abbe56e057f20f883e
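If you want to confirm that the registration took effect before running anything bigger, Hive's built-in introspection commands are enough; a quick check (md5 is the function name registered above):
-- List registered functions; the new md5 should appear
SHOW FUNCTIONS;
-- Show the function's documentation and, with EXTENDED, the backing class
DESCRIBE FUNCTION EXTENDED md5;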
(2) Permanent registration (supported since Hive 0.13)
The idea is the same; the only difference is that the jar is first uploaded from the local filesystem to HDFS:
hadoop fs -put hive_md5.jar /user/test/
Then, in Hive:
CREATE FUNCTION md5 AS 'com.qunar.hive_md5.MD5' USING JAR 'hdfs:///user/test/hive_md5.jar';
select md5('123456');
-- Result: e10adc3949ba59abbe56e057f20f883e
-- Drop a temporary function
DROP TEMPORARY FUNCTION [IF EXISTS] function_name;
-- Drop a permanent function
DROP FUNCTION [IF EXISTS] function_name;
-- Reload functions (supported since Hive 1.2.0)
RELOAD FUNCTION;
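Finally, back to the original requirement. With the md5 UDF registered, the comparison described in the background can be written as a single query. What follows is a minimal sketch, not the exact production query: col_a, col_b, col_c are hypothetical stand-ins for the real non-id columns, and '|' is an arbitrary separator. concat_ws is used rather than concat because concat returns NULL as soon as any single column is NULL.
-- col_a, col_b, col_c are placeholders: list every column except id.
-- A row whose signature exists in only one table surfaces with a NULL
-- on the other side of the full outer join; an empty result means the
-- two tables agree on everything except id.
SELECT a.sig, b.sig
FROM (SELECT md5(concat_ws('|', col_a, col_b, col_c)) AS sig
        FROM orig_client) a
FULL OUTER JOIN
     (SELECT md5(concat_ws('|', col_a, col_b, col_c)) AS sig
        FROM orig_client_test) b
ON a.sig = b.sig
WHERE a.sig IS NULL OR b.sig IS NULL;
One caveat: concat_ws silently skips NULL arguments, so ('a', NULL, 'b') and ('a', 'b', NULL) hash to the same signature; wrapping each column in nvl(col, '') removes that ambiguity if exact NULL positions matter.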