I. Sqoop installation and configuration
1. Download and unpack the tarball, and set the environment variables (details omitted; a conventional sketch follows).
A warning up front: Sqoop2's configuration differs greatly from Sqoop1's, and most tutorials online cover Sqoop1. Worse still, 1.99.4 and 1.99.7 also differ a lot from each other.
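For reference, a typical environment setup (paths are examples, not taken from the original; adjust to your install) is to append these lines to ~/.bashrc and then run source ~/.bashrc:
export SQOOP_HOME=/usr/local/src/sqoop-1.99.4-bin-hadoop200
export PATH=$PATH:$SQOOP_HOME/bin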
2. Configuration
Edit sqoop/server/conf/catalina.properties (required for 1.99.4; 1.99.7 does not use this file).
Change common.loader to:
common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar,${catalina.home}/../lib/*.jar,/usr/local/src/hadoop-2.6.1/share/hadoop/common/*.jar,/usr/local/src/hadoop-2.6.1/share/hadoop/common/lib/*.jar,/usr/local/src/hadoop-2.6.1/share/hadoop/hdfs/*.jar,/usr/local/src/hadoop-2.6.1/share/hadoop/hdfs/lib/*.jar,/usr/local/src/hadoop-2.6.1/share/hadoop/mapreduce/*.jar,/usr/local/src/hadoop-2.6.1/share/hadoop/mapreduce/lib/*.jar,/usr/local/src/hadoop-2.6.1/share/hadoop/tools/*.jar,/usr/local/src/hadoop-2.6.1/share/hadoop/tools/lib/*.jar,/usr/local/src/hadoop-2.6.1/share/hadoop/yarn/*.jar,/usr/local/src/hadoop-2.6.1/share/hadoop/yarn/lib/*.jar,/usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib/*.jar
Edit sqoop/server/conf/sqoop.properties and point the MapReduce configuration directory at your Hadoop config directory:
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/usr/local/src/hadoop-2.6.1/etc/hadoop
Sqoop2 supports two Hadoop authentication mechanisms, simple and Kerberos. Enable simple authentication by adding:
org.apache.sqoop.security.authentication.type=SIMPLE
org.apache.sqoop.security.authentication.handler=org.apache.sqoop.security.authentication.SimpleAuthenticationHandler
org.apache.sqoop.security.authentication.anonymous=true
By default Sqoop writes its logs relative to whatever directory the server is launched from. To get a fixed log location, replace every @LOGDIR@ and @BASEDIR@ reference in sqoop.properties with an actual absolute path, for example as sketched below.
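Two representative lines before and after (property names taken from a stock 1.99.x sqoop.properties; check your own copy for the full list of @LOGDIR@/@BASEDIR@ references):
# before
org.apache.sqoop.log4j.appender.file.File=@LOGDIR@/sqoop.log
org.apache.sqoop.repository.jdbc.url=jdbc:derby:@BASEDIR@/repository/db;create=true
# after (absolute paths; adjust to your install)
org.apache.sqoop.log4j.appender.file.File=/usr/local/src/sqoop-1.99.4-bin-hadoop200/logs/sqoop.log
org.apache.sqoop.repository.jdbc.url=jdbc:derby:/usr/local/src/sqoop-1.99.4-bin-hadoop200/repository/db;create=true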
3. Create a hadoop_lib directory inside the Sqoop install directory: mkdir hadoop_lib
Copy the Hadoop dependency jars into it, along with sqoop/server/bin/*.jar and sqoop/server/lib/*.jar:
cp /usr/local/src/hadoop-2.6.1/share/hadoop/common/*.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib
cp /usr/local/src/hadoop-2.6.1/share/hadoop/common/lib/*.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib
cp /usr/local/src/hadoop-2.6.1/share/hadoop/hdfs/*.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib
cp /usr/local/src/hadoop-2.6.1/share/hadoop/hdfs/lib/*.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib
cp -rf /usr/local/src/hadoop-2.6.1/share/hadoop/mapreduce/*.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib
cp -rf /usr/local/src/hadoop-2.6.1/share/hadoop/mapreduce/lib/*.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib
cp -rf /usr/local/src/hadoop-2.6.1/share/hadoop/tools/*.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib
cp -rf /usr/local/src/hadoop-2.6.1/share/hadoop/tools/lib/*.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib
cp -rf /usr/local/src/hadoop-2.6.1/share/hadoop/yarn/*.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib
cp -rf /usr/local/src/hadoop-2.6.1/share/hadoop/yarn/lib/*.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib
cp -rf /usr/local/src/hadoop-2.6.1/share/hadoop/httpfs/tomcat/lib/*.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib
cp -rf /usr/local/src/sqoop-1.99.4-bin-hadoop200/server/bin/*.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib
cp -rf /usr/local/src/sqoop-1.99.4-bin-hadoop200/server/lib/*.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib
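The copies above can also be written as a short loop (same effect, assuming the same source and target paths):
for d in common common/lib hdfs hdfs/lib mapreduce mapreduce/lib tools tools/lib yarn yarn/lib httpfs/tomcat/lib; do
  cp /usr/local/src/hadoop-2.6.1/share/hadoop/$d/*.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib/
done
cp /usr/local/src/sqoop-1.99.4-bin-hadoop200/server/bin/*.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib/
cp /usr/local/src/sqoop-1.99.4-bin-hadoop200/server/lib/*.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/hadoop_lib/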
4. Grant permissions (I got away without this step on 1.99.7):
sudo chmod 777 -R /usr/local/src/sqoop-1.99.4-bin-hadoop200
5. Configure the Hadoop proxy user
The Sqoop server impersonates users to access HDFS and other resources inside and outside the cluster, so Hadoop must be explicitly told to allow this impersonation via its proxyuser mechanism. Add the following two properties to etc/hadoop/core-site.xml in the Hadoop directory; each value can be * or a concrete host/group list.
<property>
<name>hadoop.proxyuser.sqoop2.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.sqoop2.groups</name>
<value>*</value>
</property>
I got burned here: replace sqoop2 in the property names with the username the Sqoop server actually runs as (root, in my case; see the corrected snippet below), and restart Hadoop for the change to take effect.
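For example, with the server running as root (substitute your own username), the two properties become:
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>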
6. Verify
Run sqoop2-tool verify
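On a healthy setup the output should end with lines roughly like the following (wording from a typical 1.99.x run; yours may differ slightly):
Verification was successful.
Tool class org.apache.sqoop.tools.tool.VerifyTool has finished correctly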
7. Start the server
sqoop2-server start
You can also run the shell in batch mode: sqoop2-shell /path/script.sqoop
A batch script can look like this:
# Specify company server
set server --host sqoop2.company.net
# Executing given job
start job --name 1
Lines starting with # are comments.
II. Download the JDBC driver
1. Download sqljdbc from the Microsoft website
Put the jar file into the lib folder under the Sqoop install directory:
cp sqljdbc41.jar /usr/local/src/sqoop-1.99.4-bin-hadoop200/server/lib
2. SQL Server-Hadoop Connector: skip it
Starting with Sqoop2 this separate connector is no longer needed; the online tutorials that tell you to install it are all for Sqoop1, another trap.
III. Usage
Reference: the official docs at https://sqoop.apache.org/docs/1.99.4/Sqoop5MinutesDemo.html
1. Enter the Sqoop client
sqoop.sh client or sqoop2-shell
set server --host master --port 12000 --webapp sqoop (use your own server hostname for --host)
According to the official docs, the final --webapp argument names the Sqoop Jetty web application.
show version --all should show that both the client and the server are running.
2. Create links
List the available connectors: show connector
Create a link: create link -c 2 (the number is the connector id from show connector: here 1 is the first connector, hdfs, and 2 is the second, the generic JDBC connector; 1.99.7 dropped these numeric ids, so there you specify the connector by name with -n <name> instead)
Fill in the following:
Name: testlink
Link configuration
JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://192.168.116.10:3306/test
Username: root
Password: ******
JDBC Connection Properties:
There are currently 0 values in the map:
entry# protocol=tcp
There are currently 1 values in the map:
protocol = tcp
entry#
New link was successfully created with validation status OK and persistent
You can see the link was created successfully.
Next, create a link to Hadoop the same way; the URI to fill in is the value of the fs.defaultFS property from Hadoop's core-site.xml. After that, show link lists two links; a sketch of this second link follows.
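A sketch of the HDFS link creation (the name and URI below are examples; check show connector for the hdfs connector's id and use your own fs.defaultFS value):
create link -c 1
Name: hdfslink
Link configuration
HDFS URI: hdfs://master:9000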
3. Create a job (in 1.99.7 this is done by name instead of id)
create job -f 1 -t 2
-f specifies the FROM link and -t the TO link; the values are the link ids. The transcript:
create job -f 1 -t 2
Creating job for links with from id 1 and to id 2
Please fill following values to create new job object
Name: sqoop
From database configuration
Schema name: test
Table name: student
Table SQL statement:
Table column names:
Partition column name:
Null value allowed for the partition column:
Boundary query:
ToJob configuration
Output format:
0 : TEXT_FILE
1 : SEQUENCE_FILE
Choose: 0
Compression format:
0 : NONE
1 : DEFAULT
2 : DEFLATE
3 : GZIP
4 : BZIP2
5 : LZO
6 : LZ4
7 : SNAPPY
8 : CUSTOM
Choose: 0
Custom compression format:
Output directory: # this must be an existing, empty HDFS directory (see the command after this transcript)
Throttling resources
Extractors: 2
Loaders: 2
New job was successfully created with validation status OK and persistent id 1
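If the output directory does not exist yet, it can be created in advance with the standard HDFS CLI (the path below is an example):
hdfs dfs -mkdir -p /output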
4. Start the job
start job --jid 1
or, by name: start job -n sqoop
Check the job status: status job --jid 1
Delete jobs and links:
delete job --jid 1
delete link --lid 1
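In 1.99.7 the numeric ids are gone; the name-based equivalents should look like this (using the names chosen above; confirm the exact flags with the shell's help command):
status job -n sqoop
delete job -n sqoop
delete link -n testlink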
5. Troubleshooting
If you hit the error Exception has occurred during processing command
Exception: org.apache.sqoop.common.SqoopException Message: CLIENT_0001:Server has returned exception
turn on verbose output: set option --name verbose --value true
Rerun the command and the underlying cause will be printed.
If you hit GENERIC_HDFS_CONNECTOR_0007:Invalid input/output directory
first check whether the directory actually exists in HDFS. If it does and the error persists, the proxyuser settings in core-site.xml are probably wrong. That was my case: the property names must contain the user you actually run as; the official docs write sqoop2, I copied that, and it took me a long time to find the cause. After fixing it, restart Hadoop for the change to take effect.
If you hit MAPREDUCE_0003:Can't get RunningJob instance
the JobHistory server is not running.
Check whether Hadoop's mapred-site.xml contains:
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
</property>
If it is already set, start the service from Hadoop's sbin directory:
./mr-jobhistory-daemon.sh start historyserver
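Once started, jps should list a JobHistoryServer process alongside the other Hadoop daemons.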
If the job then fails with Exception: Job Failed with status:3,
check YARN at http://192.168.116.10:8088/cluster; sure enough, the job shows as failed there.
The detailed error can be looked up at http://192.168.116.10:19888/jobhistory/
In my case it was com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
That is a MySQL connection problem. Mine was caused by missing privileges; run the following two statements in MySQL to allow password access from any host:
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123456';
FLUSH PRIVILEGES;
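To confirm the grant took effect, the user table can be inspected (standard MySQL, pre-8.0 syntax as in the GRANT above):
SELECT host, user FROM mysql.user;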
A final word: use 1.99.7. 1.99.4 has too many inexplicable problems; I have given up on it and reinstalled 1.99.7.