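1. Introduction to Sqoop
Apache Sqoop is a tool for transferring data between the Hadoop ecosystem and relational database management systems (RDBMS).
2. Installing and using Sqoop
Installing Sqoop
1. Upload and extract the archive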
tar -zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /usr/local/soft/
2. Rename the directory
cd /usr/local/soft/
mv sqoop-1.4.7.bin__hadoop-2.6.0/ sqoop-1.4.7
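3. Edit the configuration files
# Switch to the Sqoop configuration directory
cd /usr/local/soft/sqoop-1.4.7/conf
# Copy the template configuration file under a new name
cp sqoop-env-template.sh sqoop-env.sh
# Edit sqoop-env.sh with vim and add the following lines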
export HADOOP_COMMON_HOME=/usr/local/soft/hadoop-2.7.6
export HADOOP_MAPRED_HOME=/usr/local/soft/hadoop-2.7.6
export HBASE_HOME=/usr/local/soft/hbase-1.4.6
export HIVE_HOME=/usr/local/soft/hive-1.2.1
export ZOOCFGDIR=/usr/local/soft/zookeeper-3.4.6/conf
export ZOOKEEPER_HOME=/usr/local/soft/zookeeper-3.4.6
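# Switch to the bin directory
cd /usr/local/soft/sqoop-1.4.7/bin
# Edit configure-sqoop with vim and comment out the checks you do not need (this only suppresses warning messages)
4. Set the environment variables
vim /etc/profile
# Add the Sqoop directory to the environment variables
#SQOOP_HOME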
export SQOOP_HOME=/usr/local/soft/sqoop-1.4.7
export PATH=$PATH:$SQOOP_HOME/bin
source /etc/profile
5. Add the MySQL JDBC driver
# Copy the MySQL JDBC driver from Hive into $SQOOP_HOME/lib
cp /usr/local/soft/hive-1.2.1/lib/mysql-connector-java-5.1.49.jar /usr/local/soft/sqoop-1.4.7/lib/
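A quick check like the following can confirm that the driver jar actually landed in Sqoop's lib directory. This is a minimal sketch: has_mysql_driver is a hypothetical helper, not part of Sqoop, and the fallback path is the install location assumed in this guide.

```shell
# Hypothetical helper: succeed if DIR contains a MySQL Connector/J jar.
has_mysql_driver() {
    dir=$1
    for jar in "$dir"/mysql-connector-java-*.jar; do
        # If the glob matched nothing, the pattern stays literal and -f fails.
        [ -f "$jar" ] && return 0
    done
    return 1
}

# Fall back to the install path used in this guide when SQOOP_HOME is unset.
if has_mysql_driver "${SQOOP_HOME:-/usr/local/soft/sqoop-1.4.7}/lib"; then
    echo "MySQL driver found"
else
    echo "MySQL driver missing: copy it into \$SQOOP_HOME/lib"
fi
```

Without the jar, commands that open a JDBC connection to MySQL cannot load the driver class, so it is worth checking before the connectivity test.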
6. Test
# Print the Sqoop version
sqoop version
# Test connectivity to MySQL
sqoop list-databases --connect 'jdbc:mysql://master:3306?useSSL=false' --username root --password 123456
Output
22/04/18 19:46:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
22/04/18 19:46:49 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
22/04/18 19:46:50 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
hive
mysql
performance_schema
shujia
spider
sys
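The WARN line in the output above advises against passing the password on the command line. One alternative is Sqoop's --password-file option. The sketch below assumes a local file at $HOME/.sqoop-mysql.pw (a made-up name) and uses a file:// prefix so that the path is not resolved against HDFS; the sqoop call only runs where sqoop is installed.

```shell
# Assumed local path for the password file; adjust as needed.
pw_file="$HOME/.sqoop-mysql.pw"

# Remove any earlier copy (it is read-only) before rewriting it.
rm -f "$pw_file"

# printf writes no trailing newline; Sqoop uses the whole file content,
# including any trailing whitespace, as the password.
( umask 077; printf '%s' '123456' > "$pw_file" )
chmod 400 "$pw_file"

# Run the connectivity test only where sqoop is available.
if command -v sqoop >/dev/null 2>&1; then
    sqoop list-databases \
        --connect 'jdbc:mysql://master:3306?useSSL=false' \
        --username root \
        --password-file "file://$pw_file"
fi
```

The -P flag mentioned in the warning is another option: it prompts for the password interactively, which suits manual runs but not scheduled jobs.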
Preparing MySQL data
Log in to MySQL
mysql -u root -p123456
Create the student database
create database student;
Switch to the database and import the data
# Run in the MySQL shell
use student;
source /root/student.sql;
source /root/score.sql;
Another way to import the data
# Run in the Linux shell
mysql -u root -p123456 student</root/student.sql
mysql -u root -p123456 student</root/score.sql
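Running SQL files with Navicat
The data can also be imported through Navicat
Right-click student, choose Run SQL File, browse to the file, click Open, click Start, and finally refresh the tables.
Exporting a MySQL database
mysqldump -u root -p123456 database_name > output_file.sql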
import
Imports data from a traditional relational database into HDFS, Hive, HBase, ...
MySQLToHDFS
Write the script and save it as MySQLToHDFS.conf
import
--connect
jdbc:mysql://master:3306/student?useSSL=false
--username
root
--password
123456
--table
student
-m
2
--split-by
age
--target-dir
/sqoop/data/student1
--fields-terminated-by
','
Run the script
sqoop --options-file MySQLToHDFS.conf
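Because an options file holds one token per line, it can also be generated from the shell instead of typed by hand. The following is a minimal sketch that writes the same options as the script above into MySQLToHDFS.conf in the current directory; it only creates the file and does not run Sqoop.

```shell
# printf '%s\n' prints every argument on its own line, which is exactly
# the layout an options file needs (option name and value on separate lines).
printf '%s\n' \
    import \
    --connect 'jdbc:mysql://master:3306/student?useSSL=false' \
    --username root \
    --password 123456 \
    --table student \
    -m 2 \
    --split-by age \
    --target-dir /sqoop/data/student1 \
    --fields-terminated-by "','" \
    > MySQLToHDFS.conf

# Show the generated file.
cat MySQLToHDFS.conf
```

The delimiter is written as "','" so that the single quotes survive into the file, matching the hand-written version.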
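Notes:
1. -m sets the number of map tasks. More is not always better, because the MySQL server can only handle a limited load.
2. When more than one map task is requested, also pass --split-by to name the split column, which determines which slice of the data each map task reads. A numeric column is best, ideally the primary key or another evenly distributed column, so that the map tasks process similar amounts of data.
3. If the values of the split column are unevenly distributed, the import can suffer from data skew.
4. Prefer a split column with an integer type such as int or bigint.
5. In an options file, an option name and its value must be on separate lines. For example, for --username:
--username root    # wrong
# correct: split across two lines
--username
root
6. The job may log an InterruptedException warning like the one below. This is a known issue in Hadoop 2.7.6 and can be ignored.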
21/01/25 14:32:32 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:716)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:476)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:652)
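7. Sqoop reads MySQL data over JDBC, so throughput is limited when the data volume is large.
8. Sqoop runs imports and exports as MapReduce jobs that have only map tasks and no reduce tasks.
9. Each map task writes one output file.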
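To make the notes on -m and --split-by more concrete, the sketch below divides an integer range evenly among map tasks and prints the output file each task would write. It is an illustration only: split_ranges is a hypothetical helper, the age bounds 21 to 24 are made up, and Sqoop's own splitter may place the boundaries slightly differently.

```shell
# Hypothetical helper: print the slice of COL each of N map tasks would read
# when the column's values span MIN..MAX, together with its output file name.
split_ranges() {
    col=$1; min=$2; max=$3; n=$4
    step=$(( (max - min + n) / n ))    # ceil((max - min + 1) / n)
    i=0
    lo=$min
    while [ "$i" -lt "$n" ] && [ "$lo" -le "$max" ]; do
        hi=$(( lo + step - 1 ))
        [ "$hi" -gt "$max" ] && hi=$max
        printf 'part-m-%05d: %s <= %s <= %s\n' "$i" "$lo" "$col" "$hi"
        lo=$(( hi + 1 ))
        i=$(( i + 1 ))
    done
}

# Two map tasks over an assumed age range of 21..24.
split_ranges age 21 24 2
```

This prints `part-m-00000: 21 <= age <= 22` and `part-m-00001: 23 <= age <= 24`. The ranges are equal in width, not in row count: if most students were 22, the first task would do most of the work, which is the skew described in note 3.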
MySQLToHive
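Sqoop first exports the MySQL data to a temporary directory on HDFS, by default /user/<username>/<table name>.
It then loads the data into Hive and deletes the temporary directory once the load completes.
Write the script and save it as MySQLToHIVE.conf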
import
--connect
jdbc:mysql://master:3306/student?useSSL=false
--username
root
--password
123456
--table
score
--fields-terminated-by
"\t"
--lines-terminated-by
"\n"
-m
3
--split-by
student_id
--hive-import
--hive-overwrite
--create-hive-table
--hive-database
testsqoop
--hive-table
score
--delete-target-dir
Create the testsqoop database in Hive
hive> create database testsqoop;
If you encounter the following error
22/04/18 09:08:42 ERROR tool.ImportTool: Import failed: java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
at org.apache.sqoop.hive.HiveConfig.getHiveConf(HiveConfig.java:50)
at org.apache.sqoop.hive.HiveImport.getHiveArgs(HiveImport.java:392)
at org.apache.sqoop.hive.HiveImport.executeExternalHiveScript(HiveImport.java:379)
at org.apache.sqoop.hive.HiveImport.executeScript(HiveImport.java:337)
at org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:241)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:537)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.sqoop.hive.HiveConfig.getHiveConf(HiveConfig.java:44)
... 12 more
Solution (adjust the Hive path and version to match your installation):
cp /usr/local/soft/hive-3.1.2/lib/hive-common-3.1.2.jar /usr/local/soft/sqoop-1.4.7/lib/
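An alternative that is often used instead of copying the jar is to put Hive's lib directory on HADOOP_CLASSPATH before running sqoop. This is a sketch under assumptions: the default HIVE_HOME below is one of the install paths mentioned in this guide and may not match your Hive version, and whether this alone resolves the error can depend on the environment.

```shell
# Assumed Hive location; set HIVE_HOME beforehand to override it.
HIVE_HOME="${HIVE_HOME:-/usr/local/soft/hive-1.2.1}"

# Append Hive's jars to the classpath handed to the hadoop launcher,
# keeping whatever was already in HADOOP_CLASSPATH.
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$HIVE_HOME/lib/*"

echo "$HADOOP_CLASSPATH"
```

Adding the export to /etc/profile next to SQOOP_HOME would make it persistent across sessions.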
Run the script
sqoop --options-file MySQLToHIVE.conf
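--direct
Adding this option makes Sqoop read the data with mysqldump, MySQL's own export tool, which speeds up the import.
If mysqldump is not available on the worker nodes, the job fails with an error like this: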
22/04/18 09:42:33 INFO mapreduce.Job: Task Id : attempt_1650084984186_0048_m_000002_2, Status : FAILED
Error: java.io.IOException: Cannot run program "mysqldump": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at java.lang.Runtime.exec(Runtime.java:620)
at java.lang.Runtime.exec(Runtime.java:485)
at org.apache.sqoop.mapreduce.MySQLDumpMapper.map(MySQLDumpMapper.java:405)
at org.apache.sqoop.mapreduce.MySQLDumpMapper.map(MySQLDumpMapper.java:49)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1758)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 12 more
Solution:
Copy /usr/bin/mysqldump from master to the /usr/bin directory on node1 and node2
scp /usr/bin/mysqldump node1:/usr/bin/
scp /usr/bin/mysqldump node2:/usr/bin/
Using the -e option
Write the script and save it as MySQLToHive_SQL.conf
import
--connect
jdbc:mysql://master:3306/student
--username
root
--password
123456
--fields-terminated-by
"\t"
--lines-terminated-by
"\n"
-m
2
--split-by
student_id
-e
"select * from score where student_id=1500100011 and $CONDITIONS"
--target-dir
/sqoop/data/score3
--hive-import
--hive-overwrite
--create-hive-table
--hive-database
testsqoop
--hive-table
score3
Run:
sqoop --options-file MySQLToHive_SQL.conf
Check in Hive that the import succeeded:
hive (testsqoop)> select * from score3;
OK
score3.student_id score3.cource_id score3.score
1500100011 1000001 42
1500100011 1000002 117
1500100011 1000003 80
1500100011 1000007 13
1500100011 1000008 26
1500100011 1000009 4
Analyzing the import log:
22/04/18 10:24:34 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
22/04/18 10:24:34 INFO tool.CodeGenTool: Beginning code generation
Mon Apr 18 10:24:34 CST 2022 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
22/04/18 10:24:34 INFO manager.SqlManager: Executing SQL statement: select * from score where student_id=1500100011 and (1 = 0)
22/04/18 10:24:34 INFO manager.SqlManager: Executing SQL statement: select * from score where student_id=1500100011 and (1 = 0)
22/04/18 10:24:34 INFO manager.SqlManager: Executing SQL statement: select * from score where student_id=1500100011 and (1 = 0)
22/04/18 10:24:34 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/local/soft/hadoop-2.7.6
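Explanation:
The three identical SQL statements replace $CONDITIONS with (1 = 0), so they return no rows; Sqoop runs them to validate the query and read the column metadata.
Later in the log, Sqoop replaces $CONDITIONS with (1 = 1) to find the minimum and maximum of the --split-by column, which it uses to divide the rows among the map tasks:
BoundingValsQuery: SELECT MIN(student_id), MAX(student_id) FROM (select * from score where student_id=1500100011 and (1 = 1) ) AS t1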
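The role of the $CONDITIONS placeholder can be mimicked with plain text substitution. The sketch below is only an illustration of the idea: fill_conditions is a hypothetical helper, and the range bounds in the last call are made up rather than taken from the log.

```shell
# The free-form query from the options file. Single quotes keep the shell
# from expanding $CONDITIONS.
query='select * from score where student_id=1500100011 and $CONDITIONS'

# Hypothetical helper: substitute a predicate for the $CONDITIONS placeholder.
fill_conditions() {
    printf '%s\n' "$query" | sed 's/\$CONDITIONS/'"$1"'/'
}

fill_conditions '(1 = 0)'    # metadata probe: valid SQL that returns no rows
fill_conditions '(1 = 1)'    # form used inside the MIN/MAX bounding query
# One map task's slice of the split column (illustrative bounds).
fill_conditions 'student_id >= 1500100001 AND student_id < 1500100500'
```

This is why a free-form query must contain $CONDITIONS in its WHERE clause: it is the hook through which each map task receives its own range.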
MySQLToHBase
Write the script and save it as MySQLToHBase.conf
import
--connect
jdbc:mysql://master:3306/student?useSSL=false
--username
root
--password
123456
--table
student
--hbase-table
stu
--hbase-create-table
--hbase-row-key
id
-m
1
--column-family
cf1
Create the stu table in HBase
create 'stu','cf1'
Run the script
sqoop --options-file MySQLToHBase.conf
export
HDFSToMySQL
Write the script and save it as HDFSToMySQL.conf
export
--connect
jdbc:mysql://master:3306/student?useSSL=false
--username
root
--password
123456
--table
student
-m
1
--columns
id,name,age,gender,clazz
--export-dir
/sqoop/data/student1/
--fields-terminated-by
','
Clear the data in the MySQL student table first; otherwise the export fails with primary key conflicts
Run the script
sqoop --options-file HDFSToMySQL.conf
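Before running an export like this one, it can help to confirm that every line in the export directory has as many fields as the --columns list, since mismatched rows are a common cause of failed export tasks. The following is a minimal sketch; check_field_count is a hypothetical helper and the sample rows are made up.

```shell
# Hypothetical helper: read comma-delimited lines on stdin and report every
# line whose field count differs from the expected number of columns.
check_field_count() {
    expected=$1
    awk -F',' -v n="$expected" '
        NF != n { printf "line %d has %d fields, expected %d\n", NR, NF, n; bad = 1 }
        END     { exit bad }
    '
}

# Made-up sample rows; the second one is missing the clazz field.
printf '%s\n' '1,Alice,22,F,c101' '2,Bob,23,M' | check_field_count 5 \
    || echo "fix the data or the --columns list before exporting"
```

Against real data the input would come from HDFS instead, for example `hdfs dfs -cat /sqoop/data/student1/part-m-* | check_field_count 5`.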
Viewing Sqoop help
sqoop help
21/04/26 15:50:36 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
usage: sqoop COMMAND [ARGS]
Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information
See 'sqoop help COMMAND' for information on a specific command.
# Show detailed help for import
sqoop import --help
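--check-column
Specifies the column that an incremental import examines to decide which rows are new. It typically plays the role of an auto-increment column or a timestamp in the relational database.
Note: the column must not be a character type such as char or varchar. --check-column takes a single column.
--incremental
Specifies the incremental import mode: append or lastmodified.
--last-value
Specifies the maximum value of the check column from the previous import.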
Error during import: --merge-key or --append is required when using --incremental lastmodified and the output directory exists.
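When using --incremental lastmodified with a target directory that already exists, you must add either --append (append the new rows) or --merge-key (merge: after the import MapReduce job finishes, a second MapReduce job runs to deduplicate rows by key).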
create table student
(
id int(10) not null AUTO_INCREMENT,
name char(5),
age int,
gender char(2),
clazz char(4),
primary key (id),
last_mod timestamp DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
) DEFAULT CHARSET=utf8;
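Incremental import by the value range of a column
Suitable for tables that continually receive newly inserted rows
Imports rows whose id is greater than --last-value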
sqoop import \
--connect 'jdbc:mysql://master:3306/student?useSSL=false' \
--username root \
--password 123456 \
--table student \
--target-dir /sqoop/data/student5 \
--split-by id \
-m 2 \
--fields-terminated-by ',' \
--incremental append \
--check-column id \
--last-value 1500100990
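As a rough illustration of which rows an append-mode run picks up, the sketch below prints the kind of row filter involved: strictly above the previous --last-value and up to the column's current maximum. append_filter is a hypothetical helper, the maximum 1500101000 is a made-up value, and the exact SQL Sqoop generates may differ.

```shell
# Hypothetical helper: print the row filter for an append-mode run, given the
# check column, the previous --last-value and the column's current maximum.
append_filter() {
    printf '%s > %s AND %s <= %s\n' "$1" "$2" "$1" "$3"
}

# Previous last value from the command above; the upper bound is illustrative.
append_filter id 1500100990 1500101000
```

This prints `id > 1500100990 AND id <= 1500101000`. The upper bound of one run becomes the --last-value of the next, so consecutive runs neither overlap nor leave gaps. Sqoop's saved jobs (`sqoop job`) are designed to record this value between runs.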
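The columns given to --check-column and --split-by do not have to be the same
Imports rows whose last_mod value (the --check-column column) is greater than or equal to the --last-value "2022-04-19 14:56:30"
If --merge-key is added, rows are deduplicated by that key after the import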
sqoop import \
--connect 'jdbc:mysql://master:3306/student?useSSL=false' \
--username root \
--driver com.mysql.jdbc.Driver \
--password 123456 \
--table student \
--target-dir /sqoop/data/student6 \
--split-by id \
-m 2 \
--fields-terminated-by ',' \
--incremental lastmodified \
--check-column last_mod \
--last-value "2022-04-19 14:56:30" \
--merge-key id
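The effect of --merge-key can be pictured as "keep the newest row per key". The sketch below reproduces that idea with sort and awk on a simplified three-column layout (id, name, last_mod). It is a conceptual illustration with made-up rows, not how Sqoop implements the merge job; merge_by_key is a hypothetical helper.

```shell
# Hypothetical helper: from comma-separated rows "id,name,last_mod" on stdin,
# keep only the most recent row for each id.
merge_by_key() {
    # Sort by id, then by timestamp descending; awk keeps the first row per id.
    LC_ALL=C sort -t',' -k1,1 -k3,3r | awk -F',' '!seen[$1]++'
}

# Made-up rows: id 1 appears in both the existing data and the new import.
printf '%s\n' \
    '1,old_name,2022-04-19 10:00:00' \
    '2,other,2022-04-19 11:00:00' \
    '1,new_name,2022-04-19 15:00:00' | merge_by_key
```

The output keeps `1,new_name,2022-04-19 15:00:00` and `2,other,2022-04-19 11:00:00`. With --append instead of --merge-key, both versions of id 1 would remain in the target directory.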