My original goal was to build a web API for submitting Hive QL, so that I could add custom features at the API layer. While searching for information on Hive RESTful interfaces, I found that Hive ships with a WebHCat service: it can submit Hive QL, write the output to a specified HDFS directory, and even supports a callback that sends a message to a given URL when execution finishes. Looking at the source code, the principle is much like Oozie's: it launches a single-mapper job to manage the actual job, then spawns a child process that runs `hive -e`. Unfortunately, I couldn't figure out how to set the memory for the YARN container started by the child process's hive command, and the MapReduce jobs launched by the Hive QL ran in local mode, so as soon as the data volume grew the YARN container went OOM. Worse, all status information is reported by regex-scraping whatever gets printed to stderr. That approach felt really awkward; the final query result goes to stdout, sometimes mixed in with WARN messages, and handing that kind of stdout to clients for download simply wouldn't do. So I gave up on WebHCat. That said, the RESTful API that HCatalog provides for executing DDL statements is quite nice: internally it uses the MetaStoreClient directly to run DDL against the metastore.
Next I studied the hive-jdbc code. Hive JDBC talks to the HiveServer2 service, but the Statement it provides executes queries in blocking mode: although submission to the Thrift server is asynchronous, the implementation of execute() sits in a while loop polling the execution state and only exits when the query finishes. I felt I could lift the relevant code out and send the requests myself, but the changes would be substantial. Submitting a statement returns an OperationHandle that can be used to manage the job, cancel it, or query its status; the JDBC-style code, however, seems usable only from the current thread, so if I want another thread to fetch the status and expose an API showing execution progress to the frontend, there would still be quite a lot of code to change.
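The progress-reporting problem above can be decoupled from the blocking execute() with a generic polling worker: a background thread polls a status source and caches the latest state, which any other thread (e.g. a web API handler) can read. A minimal sketch in Python; `get_status` here is a stand-in for whatever actually queries the OperationHandle, not a real hive-jdbc API:

```python
import threading

class StatusPoller:
    """Polls a status callable in a background thread and caches the
    latest result so other threads (e.g. a web API) can read it."""

    TERMINAL = ("FINISHED", "ERROR", "CANCELED")

    def __init__(self, get_status, interval=1.0):
        self._get_status = get_status   # callable returning the current job state
        self._interval = interval
        self._lock = threading.Lock()
        self._latest = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            status = self._get_status()
            with self._lock:
                self._latest = status
            if status in self.TERMINAL:
                break                      # job reached a final state
            self._stop.wait(self._interval)  # sleep, but wake early on stop()

    def start(self):
        self._thread.start()
        return self

    def latest(self):
        """Thread-safe read of the most recently observed status."""
        with self._lock:
            return self._latest

    def join(self, timeout=None):
        self._thread.join(timeout)

# Usage: simulate a job that finishes after a few polls.
states = iter(["RUNNING", "RUNNING", "FINISHED"])
poller = StatusPoller(lambda: next(states), interval=0.01).start()
poller.join()
print(poller.latest())  # FINISHED
```

Any frontend-facing status endpoint would then just call `poller.latest()` instead of touching the JDBC thread.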
Then I went back to the WebHCat pit to figure out how to fix the local-mode execution problem, which kicked off another long round of searching. I dug up a post on Stack Overflow advising against submitting Hive queries through WebHCat: even though WebHCat works roughly the same way as Oozie, the recommendation was to submit through Oozie instead. So off I went to the Oozie Web Services API, and sure enough it offers an endpoint for running Hive QL directly:
https://oozie.apache.org/docs/4.1.0/WebServicesAPI.html#Proxy_Hive_Job_Submission
Proxy Hive Job Submission
You can submit a Workflow that contains a single Hive action without writing a workflow.xml. Any required JARs or other files must already exist in HDFS.
The following properties are required:
- fs.default.name : The NameNode
- mapred.job.tracker : The JobTracker
- user.name : The username of the user submitting the job
- oozie.hive.script : Contains the hive script you want to run (the actual script, not a file path)
- oozie.libpath : A directory in HDFS that contains necessary Jars for your job
- oozie.proxysubmission : Must be set to true
The following properties are optional:
- oozie.hive.script.params.size : The number of parameters you'll be passing to Hive
- oozie.hive.script.params.n : A parameter (variable definition for the script) in 'key=value' format, the 'n' should be an integer starting with 0 to indicate the parameter number
- oozie.hive.options.size : The number of options you'll be passing to Hive
- oozie.hive.options.n : An argument to pass to Hive, the 'n' should be an integer starting with 0 to indicate the option number
The oozie.hive.options.n parameters are sent directly to Hive without any modification unless they start with -D, in which case they are put into the <configuration> element of the action.
Request:
POST /oozie/v1/jobs?jobtype=hive
Content-Type: application/xml;charset=UTF-8

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
    </property>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:8021</value>
    </property>
    <property>
        <name>user.name</name>
        <value>rkanter</value>
    </property>
    <property>
        <name>oozie.hive.script</name>
        <value>
            CREATE EXTERNAL TABLE test (a INT) STORED AS TEXTFILE LOCATION '${INPUT}';
            INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM test;
        </value>
    </property>
    <property>
        <name>oozie.hive.script.params.size</name>
        <value>2</value>
    </property>
    <property>
        <name>oozie.hive.script.params.0</name>
        <value>OUTPUT=/user/rkanter/examples/output-data/hive</value>
    </property>
    <property>
        <name>oozie.hive.script.params.1</name>
        <value>INPUT=/user/rkanter/examples/input-data/table</value>
    </property>
    <property>
        <name>oozie.libpath</name>
        <value>hdfs://localhost:8020/user/rkanter/share/lib/hive</value>
    </property>
    <property>
        <name>oozie.proxysubmission</name>
        <value>true</value>
    </property>
</configuration>
Response:
HTTP/1.1 201 CREATED
Content-Type: application/json;charset=UTF-8

{
  id: "job-3"
}
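Mechanically, the request above is just a dictionary of Hadoop-style properties rendered as a `<configuration>` document and POSTed to the jobs endpoint. A rough Python sketch using only the standard library (the Oozie base URL and the property values are illustrative, taken from the doc example, not from a real cluster):

```python
from xml.sax.saxutils import escape
import urllib.request

def build_configuration(props):
    """Render a dict of Hadoop-style properties as a <configuration> XML document."""
    parts = ['<?xml version="1.0" encoding="UTF-8"?>', "<configuration>"]
    for name, value in props.items():
        parts.append(
            "  <property><name>%s</name><value>%s</value></property>"
            % (escape(name), escape(value))
        )
    parts.append("</configuration>")
    return "\n".join(parts)

props = {
    "fs.default.name": "hdfs://localhost:8020",
    "mapred.job.tracker": "localhost:8021",
    "user.name": "rkanter",
    "oozie.hive.script": "INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM test",
    "oozie.hive.script.params.size": "1",
    "oozie.hive.script.params.0": "OUTPUT=/user/rkanter/examples/output-data/hive",
    "oozie.libpath": "hdfs://localhost:8020/user/rkanter/share/lib/hive",
    "oozie.proxysubmission": "true",
}
payload = build_configuration(props).encode("utf-8")

def submit(oozie_url="http://localhost:11000/oozie"):
    # POST to /v1/jobs?jobtype=hive; a 201 response body carries the job id as JSON.
    req = urllib.request.Request(
        oozie_url + "/v1/jobs?jobtype=hive",
        data=payload,
        headers={"Content-Type": "application/xml;charset=UTF-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

print(payload.decode("utf-8").splitlines()[1])  # <configuration>
```

`submit()` is only a sketch of the HTTP call; it obviously needs a reachable Oozie server to actually run.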
So I created a test table `test` in Hive, threw in three random rows, and sent a request following the interface description. Damn it: `table not found: test`. Are you kidding me???
Checking the logs, I found that Hive wasn't connecting to the existing metastore at all; it had created an embedded Derby metastore instead, which made the missing table easy to understand. So I added a property to the request's configuration:
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://name85:9083</value>
</property>
to specify the remote metastore to connect to. I resent the request and, damn it all, the table still couldn't be found; it was still connecting to the embedded Derby metastore! (At this point my hopes of building a Hive web API on top of this were fading fast.)
Another long round of Googling followed... Some articles said the Oozie Hive action needs a <job-xml> element pointing to a hive-site.xml file on HDFS, so that the Hive properties can propagate to the child process:
Syntax:
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="[NODE-NAME]">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>[JOB-TRACKER]</job-tracker>
            <name-node>[NAME-NODE]</name-node>
            <prepare>
                <delete path="[PATH]"/>
                ...
                <mkdir path="[PATH]"/>
                ...
            </prepare>
            <job-xml>[HIVE SETTINGS FILE]</job-xml>
            <configuration>
                <property>
                    <name>[PROPERTY-NAME]</name>
                    <value>[PROPERTY-VALUE]</value>
                </property>
                ...
            </configuration>
            <script>[HIVE-SCRIPT]</script>
            <param>[PARAM-VALUE]</param>
            ...
            <param>[PARAM-VALUE]</param>
            <file>[FILE-PATH]</file>
            ...
            <archive>[FILE-PATH]</archive>
            ...
        </hive>
        <ok to="[NODE-NAME]"/>
        <error to="[NODE-NAME]"/>
    </action>
    ...
</workflow-app>

So I wrote a workflow.xml in Hue with <job-xml> pointing at a hive-site.xml, and it did find the remote metastore and ran successfully.
But damn it, this endpoint provides no web request parameter for specifying job-xml. So the only option left was to download the Oozie source code, see what was really going on, and look for some place where this property could be set.
I had also studied and used Oozie before (it's what we use for job scheduling now), and my recollection was that Oozie always requires generating a workflow.xml, putting it on HDFS, and specifying an app path to run it. BUT this endpoint needs neither a workflow.xml nor an app path. Curious how that could be, I downloaded the Oozie source (Oozie 4.1, matching CDH 5.7) to take a look.
I configured the Oozie server to allow remote debugging.
Stepping through with the remote debugger, I found that even though we pass a bare HQL statement, the endpoint internally generates a dummy.hive file containing the HQL and then generates the workflow definition from it. The oozie.hive.script.params.n request parameters substitute the variables inside the HQL. Each oozie.hive.options.n request parameter that starts with -D is added to the configuration of the hive action in the generated workflow definition; otherwise it is appended to the hive shell command line. And since job-xml only supplies default settings, with properties set in configuration overriding those from job-xml, Hive properties can be passed via the oozie.hive.options parameters.
So I added the following two parameters to the request, setting the metastore URI through them:
<property>
<name>oozie.hive.options.size</name>
<value>1</value>
</property>
<property>
<name>oozie.hive.options.0</name>
<value>-Dhive.metastore.uris=thrift://name85:9083</value>
</property>
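The dispatch rule for oozie.hive.options.n observed above, where options starting with -D become configuration properties and everything else goes onto the hive command line, can be sketched roughly like this (this mirrors the behavior seen in the debugger, not Oozie's actual code):

```python
def dispatch_hive_options(options):
    """Split oozie.hive.options.n values the way the proxy submission
    endpoint appears to: -Dkey=value pairs become <configuration>
    properties of the generated hive action; anything else is passed
    straight to the hive shell command line."""
    config_props = {}
    cmdline_args = []
    for opt in options:
        if opt.startswith("-D") and "=" in opt:
            key, _, value = opt[2:].partition("=")  # strip -D, split on first '='
            config_props[key] = value
        else:
            cmdline_args.append(opt)
    return config_props, cmdline_args

conf_props, args = dispatch_hive_options(
    ["-Dhive.metastore.uris=thrift://name85:9083", "--verbose"]
)
print(conf_props)  # {'hive.metastore.uris': 'thrift://name85:9083'}
print(args)        # ['--verbose']
```

This is why a plain -Dhive.metastore.uris option ends up in the action's configuration and finally overrides whatever the job-xml defaults say.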
I resubmitted the request and YES, it succeeded! The final raw request payload looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://name84:8020</value>
    </property>
    <property>
        <name>mapred.job.tracker</name>
        <value>name84:8032</value>
    </property>
    <property>
        <name>user.name</name>
        <value>hdfs</value>
    </property>
    <property>
        <name>oozie.hive.script</name>
        <value>
            INSERT OVERWRITE DIRECTORY '${OUTPUT}' select * from test
        </value>
    </property>
    <property>
        <name>oozie.hive.script.params.size</name>
        <value>1</value>
    </property>
    <property>
        <name>oozie.hive.script.params.0</name>
        <value>OUTPUT=/tmp/output-data3/hive</value>
    </property>
    <property>
        <name>oozie.libpath</name>
        <value>hdfs://name84:8020/user/oozie/share/lib/lib_20170725165530/hive</value>
    </property>
    <property>
        <name>oozie.proxysubmission</name>
        <value>true</value>
    </property>
    <property>
        <name>oozie.hive.options.size</name>
        <value>1</value>
    </property>
    <property>
        <name>oozie.hive.options.0</name>
        <value>-Dhive.metastore.uris=thrift://name85:9083</value>
    </property>
</configuration>
Our logs are all in JSON format, so our tables are created with a JSON SerDe in the row format (which depends on a third-party JAR). I pointed the Oozie endpoint at a production table and ran a query, and sure enough it failed: the serializer/deserializer classes from the third-party JAR could not be found. So I dropped that JAR into the Hive action's sharelib directory (the directory specified by oozie.libpath above), resubmitted the request, and OK, it worked. Then I tried a custom UDF, and wouldn't you know it, stepped right into another pit!
Since I had registered it as a permanent UDF, its JAR should be loaded automatically when the UDF is used, yet it threw this error:
2017-07-26 12:30:20,067 FATAL [IPC Server handler 1 on 42314] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1501042569430_0004_m_000000_0 - exited : java.lang.RuntimeException: Failed to load plan: hdfs://name84:8020/tmp/hive/hdfs/aea8e323-daa9-494b-8d56-e6f6071c1da0/hive_2017-07-26_12-30-07_109_6584100257170331154-1/-mr-10004/28cb7c1f-f35f-4230-b8a5-ff80cb07eadc/map.xml: org.apache.hive.com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: org.joda.time.format.DateTimeFormatterBuilder$PaddedNumber
Serialization trace:
iParsers (org.joda.time.format.DateTimeFormatterBuilder$Composite)
iParser (org.joda.time.format.DateTimeFormatter)
isoFormatter (brickhouse.udf.json.InspectorHandle$PrimitiveHandle)
mapValHandle (brickhouse.udf.json.InspectorHandle$MapHandle)
inspHandle (brickhouse.udf.json.FromJsonUDF)
genericUDF (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc)
colExprMap (org.apache.hadoop.hive.ql.exec.SelectOperator)
childOperators (org.apache.hadoop.hive.ql.exec.FilterOperator)
childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
	at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork(Utilities.java:451)
	at org.apache.hadoop.hive.ql.exec.Utilities.getMapWork(Utilities.java:296)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:268)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:502)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:477)
	at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:717)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hive.com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: org.joda.time.format.DateTimeFormatterBuilder$PaddedNumber
Serialization trace:
iParsers (org.joda.time.format.DateTimeFormatterBuilder$Composite)
iParser (org.joda.time.format.DateTimeFormatter)
isoFormatter (brickhouse.udf.json.InspectorHandle$PrimitiveHandle)
mapValHandle (brickhouse.udf.json.InspectorHandle$MapHandle)
inspHandle (brickhouse.udf.json.FromJsonUDF)
genericUDF (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc)
colExprMap (org.apache.hadoop.hive.ql.exec.SelectOperator)
childOperators (org.apache.hadoop.hive.ql.exec.FilterOperator)
childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:139)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:112)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:139)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:672)
	at org.apache.hadoop.hive.ql.exec.Utilities.deserializeObjectByKryo(Utilities.java:1070)
	at org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan(Utilities.java:961)
	at org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan(Utilities.java:975)
	at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork(Utilities.java:417)
	... 13 more
Caused by: java.lang.ArrayStoreException: org.joda.time.format.DateTimeFormatterBuilder$PaddedNumber
	at org.apache.hive.com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:367)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:276)
	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
	at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
	... 58 more
At first glance this looks like a Kryo serialization/deserialization problem, so another round of Googling... to no avail. Then I noticed the underlying ArrayStoreException, and my gut said a joda-time JAR conflict.
Sure enough: Oozie depends on joda-time 2.1, Hive depends on 1.6, and my UDF JAR bundled 2.9! Could that really be the cause? I tried unifying the joda-time versions first. Since hive-exec.jar packages the joda-time sources directly into itself, I removed the joda-time dependency from my UDF JAR so that it would use Hive's copy, then deleted joda-time-2.1.jar from the Oozie Hive sharelib directory and restarted the Oozie service. I resubmitted the request and, to my delight, it succeeded.
Having climbed out of the pit and dusted myself off, I can finally start designing my Hive execution engine. More pits will be documented here as I hit them...