Writing and using Hive UDFs: using a Hive UDF from the Hive editor in Hue, from an Oozie workflow, and from the Hive command line

Development environment: JDK 1.7 + IntelliJ IDEA 16 + Hive 1.1.0

Production environment where the UDF is used: CDH 5.8.0 + Hive 1.1.0


1. Import all of the Hive-related JARs



Alternatively, pull in the CDH dependencies with Maven:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.enn</groupId>
    <artifactId>hive-udf</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <java.version>1.7</java.version>
        <cdh.version>1.1.0-cdh5.8.0</cdh.version>
        <!--add maven release-->
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
        <encoding>UTF-8</encoding>
    </properties>
    <!-- Repository used to resolve the CDH dependency JARs -->
    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>
    <dependencies>
        <!-- CDH Hive dependencies -->
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-accumulo-handler</artifactId>
            <version>${cdh.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-ant</artifactId>
            <version>${cdh.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-beeline</artifactId>
            <version>${cdh.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-cli</artifactId>
            <version>${cdh.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-common</artifactId>
            <version>${cdh.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-contrib</artifactId>
            <version>${cdh.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>${cdh.version}</version>
        </dependency>
    </dependencies>
    <build>
        <finalName>hive-udf</finalName>
        <plugins>
            <!-- Compiler plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <!-- Packaging -->
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass></mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
pom.xml


2. Write the WeekTransform.java class



Its content is as follows:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

/**
 * Document: Hive date UDF that returns the day of the week for a yyyy-MM-dd date string
 * (Sunday = 1, Monday = 2, ..., Saturday = 7).
 * User: yangjf
 * Date: 2016/9/1 8:8
 */
public class WeekTransform extends UDF {
    public Integer evaluate(final Text str) {
        Integer weekDay = null;
        try {
            Date nowDate = new SimpleDateFormat("yyyy-MM-dd").parse(str.toString());
            Calendar calendar = Calendar.getInstance();
            if (nowDate != null) {
                calendar.setTime(nowDate);
            }
            // Calendar.DAY_OF_WEEK is 1-based: Sunday = 1, Monday = 2, ..., Saturday = 7
            weekDay = calendar.get(Calendar.DAY_OF_WEEK);
        } catch (Exception e) {
            // Unparseable input: leave weekDay as null so the UDF returns NULL instead of throwing
        }
        return weekDay;
    }

    // public static void main(String[] args) {
    //     System.out.println(new WeekTransform().evaluate(new Text("2016-08-31")));
    // }
}
WeekTransform.java
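
Before packaging, it can be useful to sanity-check the class locally. The snippet below is a small, hypothetical check added for illustration (the class name WeekTransformLocalCheck is not part of the original project); it assumes WeekTransform and the Hadoop Text class are on the classpath and simply verifies the Sunday = 1 ... Saturday = 7 convention used later.

import org.apache.hadoop.io.Text;

// Hypothetical local check; not part of the UDF JAR.
public class WeekTransformLocalCheck {
    public static void main(String[] args) {
        WeekTransform udf = new WeekTransform();
        // 2016-08-31 was a Wednesday; with Sunday = 1 the expected result is 4
        System.out.println(udf.evaluate(new Text("2016-08-31"))); // prints 4
        // An unparseable value yields null rather than an exception
        System.out.println(udf.evaluate(new Text("not-a-date"))); // prints null
    }
}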

3. Compile and package the project in IDEA to produce hiveUdf.jar


Method 1: using the UDF from the Hive command line (Xshell)

1. Open Xshell and run hive to enter the Hive CLI.

2. Upload hiveUdf.jar to the current user's directory.

3. Put hiveUdf.jar from that Linux directory into the HDFS directory /user/hive/test/:

   hdfs dfs -put /user/e_lvbin/test/hiveUdf.jar /user/hive/test/

4. Register the JAR with the current Hive session:

   (1) Option 1: add jar hdfs://localhost:9000/user/hive/test/hiveUdf.jar;

   (2) Option 2: add jar hdfs://nameservice/user/hive/test/hiveUdf.jar;

Note: nameservice is the NameNode HA alias (the nameservice ID) defined in core-site.xml.

5. Create a temporary function named weektransform (any other name also works):

create temporary function weektransform as 'WeekTransform';

Note: 'WeekTransform' is the Java class written above; if the class is declared in a package (for example com.enn.udf), use the fully qualified name instead, e.g. 'com.enn.udf.WeekTransform'.

6. Once it has been created, you can see it with:

>show functions;

You should see default.weektransform in the list.

7. Use the function:

>select weektransform('2016-08-31');

It returns 4, which means Wednesday: Sunday is 1, and counting on from there, Wednesday is 4.
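
To make the 1-to-7 mapping concrete, the following sketch is an illustration added here (not from the original post); it uses only java.util.Calendar to print, for each day of that week, the value weektransform would return.

import java.text.SimpleDateFormat;
import java.util.Calendar;

// Illustration only: prints Calendar.DAY_OF_WEEK (Sunday = 1 ... Saturday = 7)
// for the week containing 2016-08-31, i.e. the values weektransform returns.
public class DayOfWeekMapping {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        String[] dates = {"2016-08-28", "2016-08-29", "2016-08-30", "2016-08-31",
                          "2016-09-01", "2016-09-02", "2016-09-03"};
        for (String d : dates) {
            Calendar cal = Calendar.getInstance();
            cal.setTime(fmt.parse(d));
            System.out.println(d + " -> " + cal.get(Calendar.DAY_OF_WEEK));
        }
        // Output runs from 2016-08-28 -> 1 (Sunday) through 2016-09-03 -> 7 (Saturday),
        // with 2016-08-31 -> 4 (Wednesday).
    }
}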



Method 2: using the custom Hive function from Hue

1. Open Hue and go to the Hive editor.

2. Click 'Session', i.e. the area marked with the red box in the screenshot.




3. Select the hiveUdf.jar package that was uploaded to HDFS, enter a name for the custom function (any other name also works), and fill in the fully qualified class name of the UDF class inside the JAR.





4. Use the weektransform function.




Method 3: using the UDF in an Oozie workflow

1. Upload the hiveUdf.jar package to the Oozie directory.

2. Create a workflow and add two Hive actions (one running the hiveudf.txt script, the other running ykt_modelToApp_predict.txt).

3. Write the hiveudf.txt script with the following content:

use origin_ennenergy_test;

add jar  hdfs://nameservice/user/hive/test/hiveUdf.jar;

 

4. Write the other script, ykt_modelToApp_predict.txt, with the following content:

use origin_ennenergy_test;

create table t_test_weektransform as select weektransform('2016-08-31');


5. Write a hive-site.xml file (see Note 1).

6. Add hiveudf.txt first, then ykt_modelToApp_predict.txt.

Notes:

(1) A hive-site.xml file must be written.

(2) Every Hive action needs hiveUdf.jar added to it.

(3) The .txt files should be encoded as UTF-8 without a BOM.

(4) The first time the UDF is used, the function must be created:

     create temporary function weektransform as 'WeekTransform';

     If the weektransform temporary function has already been created under the current user, this statement does not need to be added to the ykt_modelToApp_predict.txt script.

     If the cluster or Hive is restarted, then after

     add jar hdfs://nameservice/user/hive/test/hiveUdf.jar;

     you must run once more:

     create temporary function weektransform as 'WeekTransform';

    Reason: a temporary UDF is only valid for the current user's session!

Encoding settings screenshot:




Workflow diagram:



Note 1:

<?xml version="1.0" encoding="UTF-8"?>
<!--Autogenerated by Cloudera Manager-->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://host12.master.cluster.enn.cn:9083,thrift://host13.master.cluster.enn.cn:9083</value>
  </property>
  <property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>300</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>
  <!--'hive.enable.spark.execution.engine', originally set to 'false' (non-final), is overridden below by a safety valve-->
  <property>
    <name>hive.conf.restricted.list</name>
    <value>hive.enable.spark.execution.engine</value>
  </property>
  <property>
    <name>hive.auto.convert.join</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.auto.convert.join.noconditionaltask.size</name>
    <value>20971520</value>
  </property>
  <property>
    <name>hive.optimize.bucketmapjoin.sortedmerge</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.smbjoin.cache.rows</name>
    <value>10000</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>-1</value>
  </property>
  <property>
    <name>hive.exec.reducers.bytes.per.reducer</name>
    <value>67108864</value>
  </property>
  <property>
    <name>hive.exec.copyfile.maxsize</name>
    <value>33554432</value>
  </property>
  <!--'hive.exec.reducers.max', originally set to '1099' (non-final), is overridden below by a safety valve-->
  <property>
    <name>hive.vectorized.groupby.checkinterval</name>
    <value>4096</value>
  </property>
  <property>
    <name>hive.vectorized.groupby.flush.percent</name>
    <value>0.1</value>
  </property>
  <property>
    <name>hive.compute.query.using.stats</name>
    <value>false</value>
  </property>
  <!--
  When a table contains too many small files (under 10 MB), it not only puts heavy pressure on the NameNode
  but also slows down execution, and jobs may even fail to run. If a table accumulates many small files,
  add the following parameters:
  hive.mergejob.maponly = true;
  hive.merge.mapfiles = true;
  hive.merge.mapredfiles = true;
  mapred.max.split.size=256000000;
  mapred.min.split.size.per.node=256000000;
  mapred.min.split.size.per.rack=256000000;
  hive.merge.size.per.task = 256000000;
  hive.merge.smallfiles.avgsize=100000000;
  hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
  The exact merge rules and thresholds for small files should be discussed and tuned case by case.
  See also: http://blog.itpub.net/26086233/viewspace-1119964/
  http://www.linuxidc.com/Linux/2015-06/118391.htm
  -->
  <property>
    <name>hive.vectorized.execution.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.vectorized.execution.reduce.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.merge.mapfiles</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.merge.mapredfiles</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.cbo.enable</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.fetch.task.conversion</name>
    <value>minimal</value>
  </property>
  <property>
    <name>hive.fetch.task.conversion.threshold</name>
    <value>268435456</value>
  </property>
  <property>
    <name>hive.limit.pushdown.memory.usage</name>
    <value>0.1</value>
  </property>
  <property>
    <name>hive.merge.sparkfiles</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.merge.smallfiles.avgsize</name>
    <value>16777216</value>
  </property>
  <property>
    <name>hive.merge.size.per.task</name>
    <value>268435456</value>
  </property>
  <property>
    <name>hive.optimize.reducededuplication</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.optimize.reducededuplication.min.reducer</name>
    <value>4</value>
  </property>
  <property>
    <name>hive.map.aggr</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.map.aggr.hash.percentmemory</name>
    <value>0.5</value>
  </property>
  <property>
    <name>hive.optimize.sort.dynamic.partition</name>
    <value>false</value>
  </property>
  <property>
    <name>spark.executor.memory</name>
    <value>4294967296</value>
  </property>
  <property>
    <name>spark.driver.memory</name>
    <value>2147483648</value>
  </property>
  <property>
    <name>spark.executor.cores</name>
    <value>1</value>
  </property>
  <property>
    <name>spark.yarn.driver.memoryOverhead</name>
    <value>512</value>
  </property>
  <property>
    <name>spark.yarn.executor.memoryOverhead</name>
    <value>1024</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.initialExecutors</name>
    <value>1</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.minExecutors</name>
    <value>1</value>
  </property>
  <property>
    <name>spark.dynamicAllocation.maxExecutors</name>
    <value>2147483647</value>
  </property>
  <property>
    <name>hive.metastore.execute.setugi</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.support.concurrency</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.zookeeper.quorum</name>
    <value>host19.slave.cluster.enn.cn,host16.slave.cluster.enn.cn,host14.slave.cluster.enn.cn,host15.slave.cluster.enn.cn,host17.slave.cluster.enn.cn</value>
  </property>
  <property>
    <name>hive.zookeeper.client.port</name>
    <value>2181</value>
  </property>
  <property>
    <name>hive.zookeeper.namespace</name>
    <value>hive_zookeeper_namespace_hive</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>host19.slave.cluster.enn.cn,host16.slave.cluster.enn.cn,host14.slave.cluster.enn.cn,host15.slave.cluster.enn.cn,host17.slave.cluster.enn.cn</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>hive.cluster.delegation.token.store.class</name>
    <value>org.apache.hadoop.hive.thrift.MemoryTokenStore</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.server2.use.SSL</name>
    <value>false</value>
  </property>
  <property>
    <name>spark.shuffle.service.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.exec.reducers.max</name>
    <value>60</value>
  </property>
  <property>
    <name>hive.enable.spark.execution.engine</name>
    <value>true</value>
  </property>
</configuration>
hive-site.xml

     Everything above has been tested and works; adjust it to your own needs.

     If you have any questions, please leave a comment. Corrections to any shortcomings are welcome. Thanks!

Original article: http://blog.csdn.net/high2011/article/details/52425430

