说明:
在大数据开发中,由于数据量非常的大,有时候需要将关系型数据库(Oracle/Postgresql/Mysql等)中的表导入到hdfs中进行分布式存储。本文就oozie定时调用sqoop从oracle库中增量导入数据至hive表涉及到的问题做了详细的讨论,仅供参考。
1/ 定义sqoop job
要实现增量更新,首先需要创建相应的sqoop job,sqoop每次完成action之后都会将最新的last_value自动保存到sqoop的metastore中,用于替换下一次调用job时的last_value值,从而实现增量更新。
sqoop job \
--meta-connect jdbc:hsqldb:hsql://10.45.6.34:16000/sqoop \
--create cust_add_job \
--import --connect jdbc:oracle:thin:@10.45.6.34:1521:wsmp \
--username 123 \
--password 123 -m 1 \
--table CUST \
--hive-import --hive-table cust \
--incremental append --check-column CUST_ID --last-value 0
2/ 定义workflow.xml
应该说oozie使用自带的sqoop/hive包,而sqoop和hive安装目录下的配置文件oozie是读不到的,所以需要在workflow.xml指定相关的配置。
<workflow-app xmlns="uri:oozie:workflow:0.2" name="sqoop-to-hive">
<start to="sqoop2hive"/>
<action name="sqoop2hive">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>sqoop.metastore.client.record.password</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.local</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://10.45.6.34:9083</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://10.45.40.174:5432/dvsp_hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
</configuration>
<command>job --meta-connect jdbc:hsqldb:hsql://10.45.6.34:16000/sqoop --exec cust_add_job</command>
</sqoop>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
3/ 定义coordinator.xml
这里定义每6分钟执行一次sqoop2hive的增量更新。
<coordinator-app name="sqoop2hive-coord" frequency="${coord:minutes(6)}" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
<action>
<workflow>
<app-path>${workflowAppUri}</app-path>
<configuration>
<property>