1. Purpose of the Document
Like Sqoop1, Sqoop2 is a tool for moving data between Hadoop and relational databases. The difference is that Sqoop2 introduces a Sqoop server that centrally manages connectors, whereas Sqoop1 is purely a client-side tool.
The following diagram shows the Sqoop1 architecture:
Version: the latest Sqoop1 community release is 1.4.7
Architecture: jobs are submitted directly by the Sqoop client
Access: the sqoop1 command is run directly from a terminal
Security: the database, username, and password are specified directly in the command or script
The following diagram shows the Sqoop2 architecture:
Version: the latest Sqoop2 community release is 1.99.7
Architecture: a Sqoop server is introduced, and connectors are centrally managed on it
Access: command line or scripts, RESTful API, Java API, and a Web UI (see the curl sketch after this list)
Security: a role-based access control mechanism is introduced
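As a quick taste of the RESTful access method, the server's version endpoint can be queried with curl. A minimal sketch, assuming the server host and default port 12000 used later in this article; on a Kerberized cluster the endpoint may additionally require SPNEGO (curl --negotiate -u :):
# Query the Sqoop2 server's REST API for its version information
curl -s http://ip-172-31-16-68.ap-southeast-1.compute.internal:12000/sqoop/version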
Let's compare the functionality of Sqoop1 and Sqoop2:
Neither Cloudera nor the Apache project recommends running Sqoop2 in production, because its feature set is still incomplete.
In this article, Fayson walks through Sqoop2 hands-on, from installation to actual use.
- Overview
1. Installing Sqoop2
2. Creating a database and table in MySQL and inserting data
3. Importing data from MySQL into HDFS with Sqoop2
4. Common issues
- Test environment
1. Operating system: RedHat 7.3
2. CM and CDH version: 5.13.1
3. Kerberos is enabled on the cluster
2. Installing Sqoop2
1. From the CM home page, click "Add Service"
2. Select Sqoop2
3. Choose a node to host the Sqoop2 server
4. The default configuration can be left unchanged; continue to the next step. Note that for the Sqoop2 repository database, Cloudera's documentation recommends PostgreSQL.
5. Wait for the service to start successfully.
Installation is complete.
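As an optional sanity check that the server actually came up, confirm it is listening on its port on the node that hosts the Sqoop2 server role:
# The Sqoop2 server listens on port 12000 by default
ss -lnt | grep 12000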
3. Using Sqoop2
3.1 Create the database and table in MySQL and insert data
create database sqoop2 default character set utf8;
CREATE USER 'sqoop2'@'%' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON sqoop2.* TO 'sqoop2'@'%';
FLUSH PRIVILEGES;
use sqoop2;
create table widgets (
  id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
  widget_name VARCHAR(64) NOT NULL,
  price DECIMAL(10,2),
  design_date DATE,
  version INT,
  design_comment VARCHAR(100)
);
insert into widgets values (null,'sprocket',0.25,'2010-02-10',1,'connects two gizmos');
insert into widgets values (null,'gizmo',4.00,'2009-11-30',4,null);
insert into widgets values (null,'gadget',99.99,'1983-08-13',13,'our flagship product');
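A quick sanity check that the grants and rows are in place, assuming the mysql client can reach the database host over TCP (which is what the 'sqoop2'@'%' grant covers):
# Connect as the new sqoop2 user and count the inserted rows (expected: 3)
mysql -h ip-172-31-16-68.ap-southeast-1.compute.internal -usqoop2 -ppassword -e "select count(*) from sqoop2.widgets;"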
3.2 Create the MySQL and HDFS links
1. Configure the client to point at the Sqoop2 server
[root@ip-172-31-16-68 ~]# sqoop2
Sqoop home directory: /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/sqoop2
Feb 28, 2018 9:47:31 PM java.util.prefs.FileSystemPreferences$1 run
INFO: Created user preferences directory.
Sqoop Shell: Type 'help' or '\h' for help.
sqoop:000> set server --host ip-172-31-16-68.ap-southeast-1.compute.internal --port 12000
Server is set successfully
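To verify that the client really reaches the server (set server only records the address), show version prints both the client and server build versions:
sqoop:000> show version --all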
2. List the available connectors
sqoop:000> show connector
0 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
+----+------------------------+------------------+------------------------------------------------------+----------------------+
| Id | Name                   | Version          | Class                                                | Supported Directions |
+----+------------------------+------------------+------------------------------------------------------+----------------------+
| 1  | kafka-connector        | 1.99.5-cdh5.13.1 | org.apache.sqoop.connector.kafka.KafkaConnector      | TO                   |
| 2  | kite-connector         | 1.99.5-cdh5.13.1 | org.apache.sqoop.connector.kite.KiteConnector        | FROM/TO              |
| 3  | hdfs-connector         | 1.99.5-cdh5.13.1 | org.apache.sqoop.connector.hdfs.HdfsConnector        | FROM/TO              |
| 4  | generic-jdbc-connector | 1.99.5-cdh5.13.1 | org.apache.sqoop.connector.jdbc.GenericJdbcConnector | FROM/TO              |
+----+------------------------+------------------+------------------------------------------------------+----------------------+
3. Create the MySQL link (a link over the generic-jdbc-connector, id 4)
sqoop:000> create link -cid 4
Creating link for connector with id 4
Please fill following values to create new link object
Name: mysql link

Link configuration

JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://ip-172-31-16-68.ap-southeast-1.compute.internal/sqoop2
Username: sqoop2
Password: ********
JDBC Connection Properties:
There are currently 0 values in the map:
entry#
New link was successfully created with validation status OK and persistent id 2
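One assumption behind this step: the Sqoop2 server can only load com.mysql.jdbc.Driver if the MySQL JDBC driver jar is on its classpath. On CDH the conventional location is /var/lib/sqoop2; a sketch, with the source path depending on how the driver package was installed on your node:
# Make the MySQL JDBC driver visible to the Sqoop2 server, then restart the service
cp /usr/share/java/mysql-connector-java.jar /var/lib/sqoop2/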
4. Create the HDFS link (over the hdfs-connector, id 3)
sqoop:000> create link --cid 3
Creating link for connector with id 3
Please fill following values to create new link object
Name: hdfs link

Link configuration

HDFS URI: hdfs://ip-172-31-16-68.ap-southeast-1.compute.internal:8020/
New link was successfully created with validation status OK and persistent id 1
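With both links created, show link confirms they were persisted, along with the ids (2 and 1 here) that the job definition below refers to:
sqoop:000> show link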
3.3 Create the Sqoop2 job
Note in the transcript below that the first pass fails validation: Output directory was left empty, so the shell replays all the prompts with the error "Can't be null nor empty". On the second pass the output format is switched to TEXT_FILE and an output directory is supplied.
sqoop:000> create job -f 2 -t 1
Creating job for links with from id 2 and to id 1
Please fill following values to create new job object
Name: sqoop2

From database configuration

Schema name: sqoop2
Table name: widgets
Table SQL statement:
Table column names:
Partition column name:
Null value allowed for the partition column:
Boundary query:

ToJob configuration

Override null value:
Null value:
Output format:
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
Choose: 1
Compression format:
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
  8 : CUSTOM
Choose: 0
Custom compression format:
Output directory:

Throttling resources

Extractors:
Loaders:
There are issues with entered data, please revise your input:
Name: sqoop2

From database configuration

Schema name: sqoop2
Table name: widgets
Table SQL statement:
Table column names:
Partition column name:
Null value allowed for the partition column:
Boundary query:

ToJob configuration

Override null value:
Null value:
Output format:
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
Choose: 0
Compression format:
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
  8 : CUSTOM
Choose: 0
Custom compression format:
Error message: Can't be null nor empty
Output directory: hdfs://ip-172-31-16-68.ap-southeast-1.compute.internal:8020/fayson

Throttling resources

Extractors:
Loaders:
New job was successfully created with validation status OK and persistent id 1
sqoop:000> show job
+----+--------+----------------+--------------+---------+
| Id | Name   | From Connector | To Connector | Enabled |
+----+--------+----------------+--------------+---------+
| 1  | sqoop2 | 4              | 3            | true    |
+----+--------+----------------+--------------+---------+
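If a value was mis-entered in the wizard, the job does not have to be recreated; the shell can replay the same prompts with an update command. A sketch, noting that the exact job-id flag spelling varies across 1.99.x releases (the shell's help output lists the flags for your build):
sqoop:000> update job -j 1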
3.4 Start the Sqoop2 job
1. Start the Sqoop2 job directly from the command line
sqoop:000> start job -j 1 -s
Submission details
Job ID: 1
Server URL: http://ip-172-31-16-68.ap-southeast-1.compute.internal:12000/sqoop/
Created by: fayson
Creation date: 2018-02-28 22:33:00 EST
Lastly updated by: fayson
External ID: job_1519874824956_0002
        http://ip-172-31-16-68.ap-southeast-1.compute.internal:8088/proxy/application_1519874824956_0002/
2018-02-28 22:33:00 EST: BOOTING  - Progress is not available
2018-02-28 22:33:11 EST: RUNNING  - 0.00 %
2018-02-28 22:33:21 EST: SUCCEEDED
Counters:
        org.apache.hadoop.mapreduce.JobCounter
                SLOTS_MILLIS_MAPS: 8800
                MB_MILLIS_MAPS: 9011200
                TOTAL_LAUNCHED_MAPS: 2
                MILLIS_MAPS: 8800
                VCORES_MILLIS_MAPS: 8800
                SLOTS_MILLIS_REDUCES: 0
                OTHER_LOCAL_MAPS: 2
        org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
                BYTES_READ: 0
        org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
                BYTES_WRITTEN: 0
        org.apache.hadoop.mapreduce.TaskCounter
                MAP_INPUT_RECORDS: 0
                MERGED_MAP_OUTPUTS: 0
                PHYSICAL_MEMORY_BYTES: 406069248
                SPILLED_RECORDS: 0
                FAILED_SHUFFLE: 0
                CPU_MILLISECONDS: 1890
                COMMITTED_HEAP_BYTES: 619708416
                VIRTUAL_MEMORY_BYTES: 3162779648
                MAP_OUTPUT_RECORDS: 3
                SPLIT_RAW_BYTES: 219
                GC_TIME_MILLIS: 75
        org.apache.hadoop.mapreduce.FileSystemCounter
                FILE_READ_OPS: 0
                FILE_WRITE_OPS: 0
                FILE_BYTES_READ: 0
                FILE_LARGE_READ_OPS: 0
                HDFS_BYTES_READ: 219
                FILE_BYTES_WRITTEN: 336174
                HDFS_LARGE_READ_OPS: 0
                HDFS_BYTES_WRITTEN: 146
                HDFS_READ_OPS: 2
                HDFS_WRITE_OPS: 2
        org.apache.sqoop.submission.counter.SqoopCounters
                ROWS_WRITTEN: 3
                ROWS_READ: 3
Job executed successfully
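The -s flag makes start job synchronous, which is why the progress lines appear inline above. Without it the job is submitted asynchronously and progress can be polled from the shell; a sketch assuming the same session and job id:
sqoop:000> start job -j 1
sqoop:000> status job -j 1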
2. Check the job on the YARN web UI on port 8088
3. Inspect the files imported into HDFS
[root@ip-172-31-16-68 ~]# hadoop fs -ls /fayson
Found 4 items
-rw-r--r--   3 sqoop2 supergroup         55 2018-02-28 22:33 /fayson/9c013cbb-3f05-49cb-a3c1-fe70ac47241c.txt
-rw-r--r--   3 sqoop2 supergroup         91 2018-02-28 22:33 /fayson/f5d66a1b-4c7f-4dde-8549-7139e2dc7543.txt
drwxr-xr-x   - fayson supergroup          0 2018-02-22 21:14 /fayson/jars
-rw-r--r--   3 fayson supergroup        674 2018-02-04 09:35 /fayson/krb5.conf
[root@ip-172-31-16-68 ~]# hadoop fs -cat /fayson/9c013cbb-3f05-49cb-a3c1-fe70ac47241c.txt /fayson/f5d66a1b-4c7f-4dde-8549-7139e2dc7543.txt
1,'sprocket',0.25,'2010-02-10',1,'connects two gizmos'
2,'gizmo',4.00,'2009-11-30',4,NULL
3,'gadget',99.99,'1983-08-13',13,'our flagship product'
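A final cross-check against the counters above: ROWS_WRITTEN was 3, so the two output files should hold exactly three lines between them:
# Count the rows that landed in HDFS (expected: 3)
hadoop fs -cat /fayson/*.txt | wc -l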
This completes the walkthrough: Sqoop2 has been installed, configured, and used to import data from MySQL into HDFS.
4. Common Issues
1. Because Fayson's cluster has Kerberos enabled, Sqoop2's Kerberos authentication must be configured, otherwise commands fail with the following error:
sqoop:000> create link --cid 3
Creating link for connector with id 3
0 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception has occurred during processing command
Exception: org.apache.sqoop.common.SqoopException Message: CLIENT_0004:Unable to find valid Kerberos ticket cache (kinit)
The fix: add org.apache.sqoop.security.authentication.type=KERBEROS to /etc/sqoop2/conf/sqoop.properties, sync the file to all nodes, and then restart the Sqoop2 service.
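Note that the error message also points at the client side ("Unable to find valid Kerberos ticket cache (kinit)"): the sqoop2 shell reads the local ticket cache, so a valid ticket must exist before launching it. A minimal sketch, with an illustrative principal:
# Obtain a ticket for the user running the shell (principal name is illustrative)
kinit fayson@EXAMPLE.COM
# Confirm the ticket cache the Sqoop2 client will pick up
klist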
2. Two Jackson jars must be made visible in the Hadoop client library directory on the Sqoop2 server node, otherwise jobs will fail at runtime.
Link in the following two jars:
[root@ip-172-31-16-68 lib]# cd /opt/cloudera/parcels/CDH/lib/hadoop/client
[root@ip-172-31-16-68 client]# ln -s ../../hadoop-hdfs/lib/jackson-mapper-asl-1.8.8.jar .
[root@ip-172-31-16-68 client]# ln -s ../../hadoop-hdfs/lib/jackson-core-asl-1.8.8.jar .
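To confirm the links resolve to real files rather than dangling symlinks:
# Both symlinks should point at jars under ../../hadoop-hdfs/lib
ls -l /opt/cloudera/parcels/CDH/lib/hadoop/client/jackson-*-asl-1.8.8.jar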
Having said all that, the key takeaway is this: Sqoop2 will be dropped outright in CDH6, so you are better off not using Sqoop2 at all.