大数据-ETL工具：Sqoop【关系型数据库(MySQL,Oracle...) ＜==（业务）数据==＞ Hive/HBase/HDFS】【Hadoop与关系数据库之间传送数据的工具】

u013250861

已于 2023-04-02 16:28:51 修改

阅读量1.5k

点赞数 1

分类专栏： # 大数据/数据采集(Flume/dataX) 文章标签：大数据 ETL Sqoop

于 2021-01-14 00:14:15 首次发布

本文链接：https://blog.csdn.net/u013250861/article/details/112597817

版权

大数据/数据采集(Flume/dataX) 专栏收录该内容

10 篇文章 1 订阅

订阅专栏

在这里插入图片描述

我们常用的 ETL 工具有Sqoop、Kettle、Nifi：

Kettle虽然功能较完善，但当处理大数据量的时候瓶颈问题比较突出；
NiFi的功能强大，且支持大数据量操作，但NiFi集群是独立于Hadoop集群的，需要独立的服务器来支撑，强大也就意味着有上手门槛，学习难度大，用人成本高；
Sqoop专为关系型数据库和Hadoop之间的ETL而生，支持海量数据，符合项目的需求，且操作简单门槛低。

Sqoop与Kettle比较
在这里插入图片描述

如果需求只是将关系型数据库数据（Oracle、MySQL）迁移至非关系型数据库（HDFS、Hbase、Hive），推荐使用Sqoop工具，足够满足需求
如果是将不同种类的关系型数据库（Oracle、MySQL、SQL server）整合到同一个关系型数据库中，如MySQL。推荐使用Kettle，有GUI界面操作简单。

一、Sqoop介绍

Sqoop是Apache下的顶级项目，用来将Hadoop和关系型数据库中的数据相互转移，可以将一个关系型数据库（例如：MySQL，Oracle，PostgreSQL等）中的数据导入到Hadoop的HDFS中，也可以将HDFS的数据导入到关系型数据库中。

导入: 从RDBMS（关系型数据库）到hadoop
导出:从Hadoop到RDBMS（关系型数据库）

目前在各个公司应用广泛，且发展前景比较乐观。其特点在于：

专门为Hadoop而生，随Hadoop版本更新支持程度好，且原本即是从CDH版本孵化出来的开源项目，支持CDH的各个版本号。
它支持多种关系型数据库，比如mysql、oracle、postgresql等。
可以高效、可控的利用资源。
可以自动的完成数据映射和转换。
大部分企业还在使用sqoop1版本，sqoop1能满足公司的基本需求。
自带的辅助工具比较丰富，如sqoop-import、sqoop-list-databases、sqoop-list-tables等。

二、sqoop的安装

1、解压 jar 包到任意节点（比如：hadoop102节点）

sqoop是一个客户端工具，想在哪个服务器上使用就在该服务器上安装，不需要分发配置集群。

[whx@hadoop102 soft]$ tar -zxf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt/module/

2、修改sqoop文件夹名称

[whx@hadoop102 module]$ ll
total 36
drwxrwxr-x.  9 whx whx 4096 Jan 31 14:45 flume
drwxr-xr-x. 11 whx whx 4096 Jan 31 10:43 hadoop-2.7.2
drwxrwxr-x.  8 whx whx 4096 Feb  2 10:48 hbase
drwxrwxr-x.  9 whx whx 4096 Jan 30 19:27 hive
drwxr-xr-x.  8 whx whx 4096 Dec 13  2016 jdk1.8.0_121
drwxr-xr-x.  8 whx whx 4096 Feb  1 16:32 kafka
drwxrwx---.  5 whx whx 4096 May 24  2019 phoenix
drwxr-xr-x.  9 whx whx 4096 Apr 27  2015 sqoop-1.4.6.bin__hadoop-2.0.4-alpha
drwxr-xr-x. 11 whx whx 4096 Jan 29 22:01 zookeeper-3.4.10
[whx@hadoop102 module]$ mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha/  sqoop
[whx@hadoop102 module]$ ll
total 36
drwxrwxr-x.  9 whx whx 4096 Jan 31 14:45 flume
drwxr-xr-x. 11 whx whx 4096 Jan 31 10:43 hadoop-2.7.2
drwxrwxr-x.  8 whx whx 4096 Feb  2 10:48 hbase
drwxrwxr-x.  9 whx whx 4096 Jan 30 19:27 hive
drwxr-xr-x.  8 whx whx 4096 Dec 13  2016 jdk1.8.0_121
drwxr-xr-x.  8 whx whx 4096 Feb  1 16:32 kafka
drwxrwx---.  5 whx whx 4096 May 24  2019 phoenix
drwxr-xr-x.  9 whx whx 4096 Apr 27  2015 sqoop
drwxr-xr-x. 11 whx whx 4096 Jan 29 22:01 zookeeper-3.4.10
[whx@hadoop102 module]$

3、修改配置文件

可以在/etc/profile中配置，导出为全局变量。或者在 /opt/module/sqoop/conf/sqoop-env-template.sh修改为：sqoop-env.sh。并配置 HADOOP_HOME,HIVE_HOME,HBASE_HOME,ZOOKEEPER_HOME

JAVA_HOME=/opt/module/jdk1.8.0_121
HADOOP_HOME=/opt/module/hadoop-2.7.2
ZOOKEEPER_HOME=/opt/module/hadoop-2.7.2
HIVE_HOME=/opt/module/hive
FLUME_HOME=/opt/module/flume
HBASE_HOME=/opt/module/hbase
PHOENIX_HOME=/opt/module/phoenix
PHOENIX_CLASSPATH=$PHOENIX_HOME
SQOOP_HOME=/opt/module/sqoop
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$ZOOKEEPER_HOME/bin:$HIVE_HOME/bin:$FLUME_HOME/bin:$HBASE_HOME/bin:$PHOENIX_HOME/bin:$SQOOP_HOME/bin

export PATH JAVA_HOME HADOOP_HOME ZOOKEEPER_HOME HIVE_HOME FLUME_HOME HBASE_HOME PHOENIX_HOME PHOENIX_CLASSPATH SQOOP_HOME

4、拷贝JDBC驱动

找到一个可用jdbc驱动，拷贝该jdbc驱动到sqoop的lib目录下

[whx@hadoop102 ~]$ cp /opt/module/hive/lib/mysql-connector-java-5.1.27-bin.jar  /opt/module/sqoop/lib/

5、验证Sqoop

我们可以通过某一个command来验证sqoop配置是否正确：

[whx@hadoop102 ~]$ sqoop help

出现一些Warning警告（警告信息已省略），并伴随着帮助命令的输出：

[whx@hadoop102 ~]$ sqoop help
Warning: /opt/module/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/module/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
21/02/04 14:12:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.
[whx@hadoop102 ~]$

6、测试Sqoop是否能够成功连接Mysql数据库

[whx@hadoop102 ~]$ sqoop list-databases --connect jdbc:mysql://hadoop102:3306/ --username root --password 123456
Warning: /opt/module/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/module/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
21/02/04 14:15:18 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
21/02/04 14:15:18 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
21/02/04 14:15:18 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/phoenix-4.14.2-HBase-1.3-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
information_schema
metastore
mysql
performance_schema
test
[whx@hadoop102 ~]$

出现如下输出：

information_schema
metastore
mysql
performance_schema
test

在这里插入图片描述

三、sqoop的使用

在Mysql中新建一张表并插入一些数据用于测试

$ mysql -uroot -p123456
mysql> create database company;
mysql> create table company.staff(id int(4) primary key not null auto_increment, name varchar(255), sex varchar(255));
mysql> insert into company.staff(name, sex) values('Thomas', 'Male');
mysql> insert into company.staff(name, sex) values('Catalina', 'FeMale');

在这里插入图片描述

1、开启Hadoop-Yarn

[whx@hadoop102 ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-whx-resourcemanager-hadoop102.out
hadoop103: starting nodemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-whx-nodemanager-hadoop103.out
hadoop101: starting nodemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-whx-nodemanager-hadoop101.out
hadoop102: starting nodemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-whx-nodemanager-hadoop102.out
[whx@hadoop102 ~]$ xcall.sh jps
要执行的命令是jps
----------------------------hadoop101----------------------------------
5411 HRegionServer
3476 DataNode
3591 QuorumPeerMain
6583 NodeManager
3354 NameNode
5547 HMaster
6718 Jps
----------------------------hadoop102----------------------------------
4336 QuorumPeerMain
18641 Jps
5123 SqlLine
14551 HRegionServer
4077 DataNode
18493 NodeManager
14381 HMaster
18190 ResourceManager
----------------------------hadoop103----------------------------------
3553 QuorumPeerMain
3474 SecondaryNameNode
5874 HRegionServer
3347 DataNode
7524 Jps
6009 HMaster
7387 NodeManager
[whx@hadoop102 ~]$

2、导入数据

在Sqoop中，“导入”概念指：从非大数据集群（RDBMS）向大数据集群（HDFS，HIVE，HBASE）中传输数据，叫做：导入，即使用import关键字。

2.1 Mysql到HDFS

查看mysql中所有的数据库

sqoop list-databases --connect jdbc:mysql://hadoop01:3306 --username root --password 123456

查看mysql中某个库下所有的表

sqoop list-tables --connect jdbc:mysql://hadoop01:3306/scm --username root --password 123456

注意:

当忘记某个属性的时候, 可以通过使用 --help方式查询

从mysql中将数据导入到HDFS(全量导入)

命令1:
sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root --password 123456 \
--table emp 

通过测试发现, 只需要指定输入端, 即可将数据导入HDFS中, 默认情况下, 将数据导入到所操作用户对应家目录下, 建立一个与导入表同名的目录, 同时发现表中有几条数据, sqoop默认就会启动几个map来读取数据, 默认分割符号为 逗号

如何减少map的数量呢?  只需要添加  -m的参数即可
sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root --password 123456 \
--table emp \
-m 1

如果想使用两个map呢? 建议添加--split-by 表示按照那个字段进行分割表数据
sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root --password 123456 \
--table emp \
--split-by id \
-m 2

想要修改其默认的分割符号: 更改为 空格
sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root --password 123456 \
--table emp \
--fields-terminated-by ' ' \
--split-by id \
-m 2

想要指定某个目的地:  --target-dir (指定目的地)  和   --delete-target-dir(目的地目录存在, 先删除)
sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root --password 123456 \
--table emp \
--fields-terminated-by ' ' \
--target-dir '/sqoop_emp' \
--delete-target-dir \
--split-by id \
-m 2

使用sqoop导入到hive中(全量导入)

第一步: 在hive中先建立目标表 
    create database sqooptohive;
    use sqooptohive;
    create table sqooptohive.emp_hive(
    	id int,
    	name string,
    	deg string,
    	salary int ,
    	dept string
    ) row format delimited fields terminated by '\t' stored as orc;

第二部: 执行数据导入操作:  HCataLog
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/test \
--username root \
--password 123456 \
--table emp \
--fields-terminated-by '\t' \
--hcatalog-database sqooptohive \
--hcatalog-table emp_hive \
-m 1



属性说明:
--hcatalog-database  指定数据库名称
--hcatalog-table   指定表的名称

注意:  使用此种方式, 在hive中建表的时候, 必须保证hive表字段和对应mysql表的字段名称保持一致

使用where条件的方式, 导入数据到HDFS(条件导入):

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/test \
--username root \
--password 123456 \
--table emp_add \
--target-dir /sqoop/emp_add \
--delete-target-dir \
-m 1  \
--where "city = 'sec-bad'"

使用SQL方式将数据导入到HDFS(条件导入)

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/test \
--username root \
--password 123456 \
--query 'select phno from emp_conn where 1=1 and  $CONDITIONS' \
--target-dir /sqoop/emp_conn \
--delete-target-dir \
-m 1 

注意:
   1)当采用SQL的方式来导入数据的时候, SQL的最后面必须添加 $CONDITIONS 关键词
   2) 整个SQL如果使用 "" 包裹的  $CONDITIONS 关键词前面需要使用\进行转义
   	"select phno from emp_conn where 1=1 and  \$CONDITIONS"

使用SQL方式将数据导入hive(条件导入) – 增量导入方式

sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/test \
--username root \
--password 123456 \
--query "select * from emp where id>1205 and  \$CONDITIONS" \
--fields-terminated-by '\t' \
--hcatalog-database sqooptohive \
--hcatalog-table emp_hive \
-m 1

2.1.1 全部导入

连接的url
–connect jdbc:mysql://hadoop102:3306/company \
用户名
–username root \
密码
–password 123456 \
要导出哪个表的数据
–table staff \
将数据导入到hdfs的哪个路径
–target-dir /company \
如果目标目录存在就删除
–delete-target-dir \
导入到hdfs上时，mysql中的字段使用\t作为分隔符
–fields-terminated-by “\t” \
设置几个MapTask来运行
–num-mappers 2 \
基于ID列，将数据切分为2片，只有在–num-mappers>1时才需要指定,选的列最好不要有null值，否则null是无法被导入的！尽量选取主键列，数字列
–split-by id

[whx@hadoop102 ~]$ sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--table staff \
--target-dir /company \
--delete-target-dir \
--fields-terminated-by "\t" \
--num-mappers 2 \
--split-by id
Warning: /opt/module/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/module/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
21/02/04 14:43:51 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
21/02/04 14:43:51 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
21/02/04 14:43:51 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
21/02/04 14:43:51 INFO tool.CodeGenTool: Beginning code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/phoenix-4.14.2-HBase-1.3-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/02/04 14:43:51 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff` AS t LIMIT 1
21/02/04 14:43:52 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff` AS t LIMIT 1
21/02/04 14:43:52 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/module/hadoop-2.7.2
Note: /tmp/sqoop-whx/compile/0438fe2a8b732eae5cb55256e845c9c4/staff.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
21/02/04 14:43:53 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-whx/compile/0438fe2a8b732eae5cb55256e845c9c4/staff.jar
21/02/04 14:43:54 INFO tool.ImportTool: Destination directory /company is not present, hence not deleting.
21/02/04 14:43:54 WARN manager.MySQLManager: It looks like you are importing from mysql.
21/02/04 14:43:54 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
21/02/04 14:43:54 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
21/02/04 14:43:54 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
21/02/04 14:43:54 INFO mapreduce.ImportJobBase: Beginning import of staff
21/02/04 14:43:54 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
21/02/04 14:43:54 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
21/02/04 14:43:54 INFO client.RMProxy: Connecting to ResourceManager at hadoop102/192.168.1.102:8032
21/02/04 14:43:56 INFO db.DBInputFormat: Using read commited transaction isolation
21/02/04 14:43:56 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `staff`
21/02/04 14:43:56 INFO mapreduce.JobSubmitter: number of splits:2
21/02/04 14:43:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1612420994275_0001
21/02/04 14:43:56 INFO impl.YarnClientImpl: Submitted application application_1612420994275_0001
21/02/04 14:43:56 INFO mapreduce.Job: The url to track the job: http://hadoop102:8088/proxy/application_1612420994275_0001/
21/02/04 14:43:56 INFO mapreduce.Job: Running job: job_1612420994275_0001
21/02/04 14:44:04 INFO mapreduce.Job: Job job_1612420994275_0001 running in uber mode : false
21/02/04 14:44:04 INFO mapreduce.Job:  map 0% reduce 0%
21/02/04 14:44:08 INFO mapreduce.Job:  map 50% reduce 0%
21/02/04 14:44:09 INFO mapreduce.Job:  map 100% reduce 0%
21/02/04 14:44:09 INFO mapreduce.Job: Job job_1612420994275_0001 completed successfully
21/02/04 14:44:09 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=270734
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=197
		HDFS: Number of bytes written=32
		HDFS: Number of read operations=8
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=4
	Job Counters 
		Launched map tasks=2
		Other local map tasks=2
		Total time spent by all maps in occupied slots (ms)=5689
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=5689
		Total vcore-milliseconds taken by all map tasks=5689
		Total megabyte-milliseconds taken by all map tasks=5825536
	Map-Reduce Framework
		Map input records=2
		Map output records=2
		Input split bytes=197
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=271
		CPU time spent (ms)=1370
		Physical memory (bytes) snapshot=341798912
		Virtual memory (bytes) snapshot=4193730560
		Total committed heap usage (bytes)=229638144
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=32
21/02/04 14:44:09 INFO mapreduce.ImportJobBase: Transferred 32 bytes in 15.7015 seconds (2.038 bytes/sec)
21/02/04 14:44:09 INFO mapreduce.ImportJobBase: Retrieved 2 records.
[whx@hadoop102 ~]$

在这里插入图片描述

2.1.2 导入指定列 --columns（提示：columns中如果涉及到多列，用逗号分隔，分隔时不要添加空格）

只导出id和name 列
–columns id,name

$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--table staff \
--columns id,name \
--target-dir /company \
--delete-target-dir \
--fields-terminated-by "\t" \
--num-mappers 1 \
--split-by id

2.1.3 导入指定行 --where

只导入复合过滤条件的行
–where ‘id >= 10 and id <= 20’ \

$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--table staff \
--where 'id >= 10 and id <= 20' \
--target-dir /company \
--delete-target-dir \
--fields-terminated-by "\t" \
--num-mappers 1 \
--split-by id

2.1.4 查询导入

提示：must contain ‘$CONDITIONS’ in WHERE clause.
如果query后使用的是双引号，则$CONDITIONS前必须加转移符，防止shell识别为自己的变量。
执行查询的SQL，讲查询的数据进行导入，如果使用了–query,不加再用–table,–where,–columns
只要使用–query ，必须添加$CONDITONS,这个条件会被Sqoop自动替换为一些表达式
–query “select * from staff where $CONDITIONS and id <= 25” \

$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--query "select * from staff where \$CONDITIONS and id <= 25" \
--target-dir /company \
--delete-target-dir \
--fields-terminated-by "\t" \
--num-mappers 2 \
--split-by id

2.2 Mysql到Hive

Sqoop导入到Hive：先将数据导入到HDFS，再将HDFS的数据Load到Hive表中。

导入到Hive
–hive-import \
导入到Hive表中字段的分隔符
–fields-terminated-by “\t” \
是否以insert overwrite方式覆盖导入数据
–hive-overwrite \
要导入的Hive表的名称，会自动帮助我们建表。建议还是在Hive中手动建表，需要注意和Mysql表的数据类型匹配
–hive-table staff_hive

$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--table staff \
--num-mappers 1 \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table staff_hive

提示：该过程分为两步，第一步将数据导入到HDFS，第二步将导入到HDFS的数据迁移到Hive仓库，第一步默认的临时目录是/user/atguigu/表名

3.3 Mysql到Hbase

sqoop1.4.6对应的是低版本的hbase，1.3.0的hbase版本相对来说有点高！
在执行导入时，sqoop是可以帮我们自动建表，如果sqoop使用sqoop1.4.6版本，hbase使用1.3.0hbase，则建表会失败。建议手动建表。
如果表不存在，hbase自动建表
–hbase-create-table \
导入的表名
–hbase-table “t_emp” \
mysql的哪一列作为rowkey
–hbase-row-key “id” \
导入的列族名
–column-family “info” \

$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--table staff \
--where 'id <= 5' \
--hbase-create-table \
--hbase-table "staff_hbase" \
--hbase-row-key "id" \
--column-family "info" \
--num-mappers 1 \
--split-by id

提示：sqoop1.4.6只支持HBase1.0.1之前的版本的自动创建HBase表的功能
解决方案：手动创建HBase表

hbase> create 'staff_hbase','info'

3、导出数据（ Hive/HDFS到Mysql）

在Sqoop中，“导出”概念指：从大数据集群（HDFS，HIVE）向非大数据集群（RDBMS）Mysql中传输数据，叫做：导出，即使用export关键字。
Mysql的表如果不存在，不会自动创建。所以需要提前创建。
不支持直接从HBase向非大数据集群（RDBMS）直接导出。
要导出的mysql的表名
–table staff2
–num-mappers 1 \
导出的数据在HDFS上的路径
–export-dir /company \
导出时，基于哪一列判断数据重复。如果要导入的表已经有数据了，此时可以指定–update-key参数，通过此参数，可以讲导入的数据，使用updata语句进行导入，此时，只会更新重复的数据，不重复的数据是无法导入的！
–update-key
如果希望遇到重复的数据，就更新，不重复的数据就新增导入，可以使用–update-key 结合 --update-mode(默认为updateonly)。
–update-mode=allowinsert
导出的数据的分隔符
–input-fields-terminated-by “\t”

[whx@hadoop102 ~]$ sqoop export \
--connect jdbc:mysql://hadoop102:3306/company?useUnicode=true&characterEncoding=utf-8' \
--username root \
--password 123456 \
--table staff2 \
--num-mappers 1 \
--export-dir /company \
--input-fields-terminated-by "\t"
Warning: /opt/module/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/module/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
21/02/04 15:25:37 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
21/02/04 15:25:37 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
21/02/04 15:25:37 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
21/02/04 15:25:37 INFO tool.CodeGenTool: Beginning code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/phoenix-4.14.2-HBase-1.3-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/02/04 15:25:37 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff2` AS t LIMIT 1
21/02/04 15:25:37 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff2` AS t LIMIT 1
21/02/04 15:25:37 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/module/hadoop-2.7.2
Note: /tmp/sqoop-whx/compile/3030961eef5236a3e482591dec69c120/staff2.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
21/02/04 15:25:38 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-whx/compile/3030961eef5236a3e482591dec69c120/staff2.jar
21/02/04 15:25:38 INFO mapreduce.ExportJobBase: Beginning export of staff2
21/02/04 15:25:38 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
21/02/04 15:25:39 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
21/02/04 15:25:39 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
21/02/04 15:25:39 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
21/02/04 15:25:39 INFO client.RMProxy: Connecting to ResourceManager at hadoop102/192.168.1.102:8032
21/02/04 15:25:40 INFO input.FileInputFormat: Total input paths to process : 2
21/02/04 15:25:40 INFO input.FileInputFormat: Total input paths to process : 2
21/02/04 15:25:40 INFO mapreduce.JobSubmitter: number of splits:1
21/02/04 15:25:40 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
21/02/04 15:25:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1612420994275_0002
21/02/04 15:25:41 INFO impl.YarnClientImpl: Submitted application application_1612420994275_0002
21/02/04 15:25:41 INFO mapreduce.Job: The url to track the job: http://hadoop102:8088/proxy/application_1612420994275_0002/
21/02/04 15:25:41 INFO mapreduce.Job: Running job: job_1612420994275_0002
21/02/04 15:25:46 INFO mapreduce.Job: Job job_1612420994275_0002 running in uber mode : false
21/02/04 15:25:46 INFO mapreduce.Job:  map 0% reduce 0%
21/02/04 15:25:51 INFO mapreduce.Job:  map 100% reduce 0%
21/02/04 15:25:51 INFO mapreduce.Job: Job job_1612420994275_0002 completed successfully
21/02/04 15:25:51 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=135055
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=220
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=7
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
	Job Counters 
		Launched map tasks=1
		Rack-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=2304
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=2304
		Total vcore-milliseconds taken by all map tasks=2304
		Total megabyte-milliseconds taken by all map tasks=2359296
	Map-Reduce Framework
		Map input records=2
		Map output records=2
		Input split bytes=182
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=177
		CPU time spent (ms)=490
		Physical memory (bytes) snapshot=168833024
		Virtual memory (bytes) snapshot=2093150208
		Total committed heap usage (bytes)=115343360
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=0
21/02/04 15:25:51 INFO mapreduce.ExportJobBase: Transferred 220 bytes in 12.3045 seconds (17.8796 bytes/sec)
21/02/04 15:25:51 INFO mapreduce.ExportJobBase: Exported 2 records.
[whx@hadoop102 ~]$

在这里插入图片描述

案例：

需求: 将hive中emp_hive表导出到mysql中(全量导出)

第一步: 需要在mysql中创建目标表 (必须操作)

CREATE TABLE `emp_out` (
  `id` INT(11) DEFAULT NULL,
  `name` VARCHAR(100) DEFAULT NULL,
  `deg` VARCHAR(100) DEFAULT NULL,
  `salary` INT(11) DEFAULT NULL,
  `dept` VARCHAR(10) DEFAULT NULL
) ENGINE=INNODB DEFAULT CHARSET=utf8;

第二步: 执行sqoop的导出操作:

sqoop export \
--connect jdbc:mysql://192.168.52.150:3306/test \
--username root \
--password 123456 \
--table emp_out \
--hcatalog-database sqooptohive \
--hcatalog-table emp_hive \
-m 1

sqoop的相关参数

参数	说明
–connect	连接关系型数据库的URL
–username	连接数据库的用户名
–password	连接数据库的密码
–driver	JDBC的driver class
–query或–e	将查询结果的数据导入，使用时必须伴随参–target-dir，–hcatalog-table，如果查询中有where条件，则条件后必须加上CONDITIONS关键字。如果使用双引号包含sql，则CONDITIONS前要加上\以完成转义：\$CONDITIONS
–hcatalog-database	指定HCatalog表的数据库名称。如果未指定，default则使用默认数据库名称。提供 --hcatalog-database不带选项–hcatalog-table是错误的。
–hcatalog-table	此选项的参数值为HCatalog表名。该–hcatalog-table选项的存在表示导入或导出作业是使用HCatalog表完成的，并且是HCatalog作业的必需选项。
–create-hcatalog-table	此选项指定在导入数据时是否应自动创建HCatalog表。表名将与转换为小写的数据库表名相同。
–hcatalog-storage-stanza ‘stored as orc tblproperties (“orc.compress”=“SNAPPY”)’ \	建表时追加存储格式到建表语句中，tblproperties修改表的属性，这里设置orc的压缩格式为SNAPPY
-m	指定并行处理的MapReduce任务数量。 -m不为1时，需要用split-by指定分片字段进行并行导入，尽量指定int型。
–split-by id	如果指定-split by, 必须使用$CONDITIONS关键字, 双引号的查询语句还要加\
–hcatalog-partition-keys --hcatalog-partition-values	keys和values必须同时存在，相当于指定静态分区。允许将多个键和值提供为静态分区键。多个选项值之间用，（逗号）分隔。比如： --hcatalog-partition-keys year,month,day --hcatalog-partition-values 1999,12,31
–null-string ‘\N’ --null-non-string ‘\N’	指定mysql数据为空值时用什么符号存储，null-string针对string类型的NULL值处理，–null-non-string针对非string类型的NULL值处理
–hive-drop-import-delims	设置无视字符串中的分割符（hcatalog默认开启）
–fields-terminated-by ‘\t’	设置字段分隔符

四、sqoop命令脚本打包

使用sqoop命令脚本文件，可以实现定时任务来进行导入导出。

sql的命令编写在脚本中，参数名和参数值之间，需要换行

创建一个.opt文件

$ mkdir opt
$ touch opt/job_mysql2hdfs.opt

脚本格式：

--connect
jdbc:mysql://hadoop102:3306/company?useUnicode=true&characterEncoding=utf-8'
--username
root
--password
123456
--table
staff
--target-dir
/company3
--delete-target-dir
--num-mappers
1
--fields-terminated-by
"\t"
--split-by
id

执行脚本：

sqoop --options-file opt/job_mysql2hdfs.opt

u013250861

关注

1
点赞
踩
11

收藏

觉得还不错? 一键收藏
0
评论
大数据-ETL工具：Sqoop【关系型数据库(MySQL,Oracle...) ＜==（业务）数据==＞ Hive/HBase/HDFS】【Hadoop与关系数据库之间传送数据的工具】

hive是基于Hadoop的一个数据仓库工具，用来进行数据提取、转化、加载，这是一种可以存储、查询和分析存储在Hadoop中的大规模数据的机制。hive数据仓库工具能将结构化的数据文件映射为一张数据库表，并提供SQL查询功能，能将SQL语句转变成MapReduce任务来执行。Hive的优点是学习成本低，可以通过类似SQL语句实现快速MapReduce统计，使MapReduce变得更加简单，而不必开发专门的MapReduce应用程序。hive十分适合对数据仓库进行统计分析。
复制链接

扫一扫

专栏目录