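Overview
To survey the products in the Hadoop ecosystem, I looked specifically at tools for importing and exporting data between an RDBMS and HDFS, and also examined several comparable products. My conclusion: they are all either secondary development on top of Sqoop or a web UI wrapped around it, and underneath they still use Sqoop. Pentaho's PDI and Oracle's ODI are both examples. Beyond those, Hortonworks' Sandbox, the Hue web UI, and Cloudera's Cloudera Manager go further: they integrate nearly every product in the Hadoop stack, are not very complicated to deploy, and are quite powerful.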
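About Sqoop
Apache Sqoop currently comes in two product lines: Sqoop1 and Sqoop2. Sqoop1 is the more mature of the two and has fewer bugs, but its architecture is rather simplistic; its current stable release is 1.4.6. Sqoop2 builds on Sqoop1 with major improvements: the client and server are separated, and jobs and connections are managed centrally. It is much simpler to use than Sqoop1 but more complicated to deploy. Sqoop1 and Sqoop2 are not compatible with each other, so existing application scripts have to be almost entirely rewritten. Even so, the overall trend suggests that Sqoop2 will become the mainstream line.
Note: from 1.99.2 onward, Sqoop2 can no longer import data into HBase. This is planned to be fixed in the stable Sqoop 2.0.0 release.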
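Environment setup
Set up the environment by following the instructions on the official site. The points below deserve particular attention:
1. Changes needed in server/conf/sqoop.properties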
org.apache.sqoop.repository.jdbc.url=jdbc:derby:@BASEDIR@/repository/sqoop;create=true
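Here, sqoop is a database created in advance in MySQL, with privileges granted as follows (the URL shown above is still the default embedded Derby form):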
create database sqoop ;
create user sqoop identified by '123456';
grant all privileges on sqoop.* to sqoop;
flush privileges;
2. Also in sqoop.properties, set the location of Hadoop (its configuration directory)
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=your-hadoop-cluster-location
3. In server/conf/catalina.properties, append all the lib directories under hadoop/share to common.loader.
common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar,${catalina.home}/../lib/*.jar,your-hadoop-libs
4. [Important] Edit Hadoop's yarn-site.xml and add the following:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
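Steps 2 and 4 are easy to get wrong by a single character, so a small pre-flight check before starting the server can help. The following is a minimal sketch, not part of Sqoop itself: the function names (parse_properties, parse_hadoop_xml, preflight) are hypothetical, and it assumes the all-lowercase spelling of the submission engine property key. It only inspects text that you pass in and does not touch a running cluster.

```python
import xml.etree.ElementTree as ET

# Key from step 2 (assumed all-lowercase, as in the stock sqoop.properties).
CONF_DIR_KEY = (
    "org.apache.sqoop.submission.engine.mapreduce.configuration.directory"
)
# Property names and values from step 4.
AUX_SERVICES = "yarn.nodemanager.aux-services"
SHUFFLE_CLASS = "yarn.nodemanager.aux-services.mapreduce_shuffle.class"
SHUFFLE_HANDLER = "org.apache.hadoop.mapred.ShuffleHandler"


def parse_properties(text):
    """Parse simple key=value lines; blank and '#'/'!' comment lines are skipped."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def parse_hadoop_xml(xml_text):
    """Return {name: value} for every <property> element in a *-site.xml."""
    root = ET.fromstring(xml_text)
    result = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            result[name.strip()] = (prop.findtext("value") or "").strip()
    return result


def preflight(sqoop_properties_text, yarn_site_xml):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []

    props = parse_properties(sqoop_properties_text)
    if not props.get(CONF_DIR_KEY):
        problems.append("sqoop.properties: %s is not set" % CONF_DIR_KEY)

    yarn = parse_hadoop_xml(yarn_site_xml)
    # aux-services may be a comma-separated list, so test for membership.
    services = [s.strip() for s in yarn.get(AUX_SERVICES, "").split(",")]
    if "mapreduce_shuffle" not in services:
        problems.append("yarn-site.xml: %s lacks mapreduce_shuffle" % AUX_SERVICES)
    if yarn.get(SHUFFLE_CLASS) != SHUFFLE_HANDLER:
        problems.append("yarn-site.xml: %s is not %s" % (SHUFFLE_CLASS, SHUFFLE_HANDLER))

    return problems
```

To use it, read the contents of server/conf/sqoop.properties and of your cluster's yarn-site.xml and pass both to preflight; each returned string names one setting that still needs attention.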
Testing
Start the Hadoop and Sqoop environments.
1. Start Hadoop with the start-all.sh script
2. Start Sqoop
1. After the Sqoop server starts, it prints output like the following:
[root@sv001 sqoop-1.99.3-bin-hadoop200]# ./bin/sqoop.sh server run
Sqoop home directory: /home/project/sqoop-1.99.3-bin-hadoop200
Setting SQOOP_HTTP_PORT: 12000
Setting SQOOP_ADMIN_PORT: 12001
Using CATALINA_OPTS:
Adding to CATALINA_OPTS: -Dsqoop.http.port=12000 -Dsqoop.admin.port=12001
Using CATALINA_BASE: /home/project/sqoop-1.99.3-bin-hadoop200/server
Using CATALINA_HOME: /home/project/sqoop-1.99.3-bin-hadoop200/server
Using CATALINA_TMPDIR: /home/project/sqoop-1.99.3-bin-hadoop200/server/temp
Using JRE_HOME: /usr/java/jdk1.7.0_67
Using CLASSPATH: /home/project/sqoop-1.99.3-bin-hadoop200/server/bin/bootstrap.jar
May 11, 2016 6:56:00 PM org.apache.catalina.core.AprLifecycleListener init
INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
May 11, 2016 6:56:00 PM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-12000
May 11, 2016 6:56:00 PM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 634 ms
May 11, 2016 6:56:00 PM