大数据-ETL工具:Sqoop【关系型数据库(MySQL,Oracle...) <==(业务)数据==> Hive/HBase/HDFS】【Hadoop与关系数据库之间传送数据的工具】

大数据/数据分析 专栏收录该内容
34 篇文章 0 订阅

在这里插入图片描述

我们常用的 ETL 工具有Sqoop、Kettle、Nifi:

  1. Kettle虽然功能较完善,但当处理大数据量的时候瓶颈问题比较突出;
  2. NiFi的功能强大,且支持大数据量操作,但NiFi集群是独立于Hadoop集群的,需要独立的服务器来支撑,强大也就意味着有上手门槛,学习难度大,用人成本高;
  3. Sqoop专为关系型数据库和Hadoop之间的ETL而生,支持海量数据,符合项目的需求,且操作简单门槛低。

Sqoop与Kettle比较
在这里插入图片描述

  • 如果需求只是将关系型数据库数据(Oracle、MySQL)迁移至非关系型数据库(HDFS、Hbase、Hive),推荐使用Sqoop工具,足够满足需求
  • 如果是将不同种类的关系型数据库(Oracle、MySQL、SQL server)整合到同一个关系型数据库中,如MySQL。推荐使用Kettle,有GUI界面操作简单。

一、Sqoop介绍

Sqoop是Apache下的顶级项目,用来将Hadoop和关系型数据库中的数据相互转移,可以将一个关系型数据库(例如:MySQL,Oracle,PostgreSQL等)中的数据导入到Hadoop的HDFS中,也可以将HDFS的数据导入到关系型数据库中。

  • 导入: 从RDBMS(关系型数据库)到hadoop
  • 导出:从Hadoop到RDBMS(关系型数据库)

目前在各个公司应用广泛,且发展前景比较乐观。其特点在于:

  1. 专门为Hadoop而生,随Hadoop版本更新支持程度好,且原本即是从CDH版本孵化出来的开源项目,支持CDH的各个版本号。
  2. 它支持多种关系型数据库,比如mysql、oracle、postgresql等。
  3. 可以高效、可控的利用资源。
  4. 可以自动的完成数据映射和转换。
  5. 大部分企业还在使用sqoop1版本,sqoop1能满足公司的基本需求。
  6. 自带的辅助工具比较丰富,如sqoop-import、sqoop-list-databases、sqoop-list-tables等。

二、sqoop的安装

1、解压 jar 包到任意节点(比如:hadoop102节点)

sqoop是一个客户端工具,想在哪个服务器上使用就在该服务器上安装,不需要分发配置集群。

[whx@hadoop102 soft]$ tar -zxf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt/module/

2、修改sqoop文件夹名称

[whx@hadoop102 module]$ ll
total 36
drwxrwxr-x.  9 whx whx 4096 Jan 31 14:45 flume
drwxr-xr-x. 11 whx whx 4096 Jan 31 10:43 hadoop-2.7.2
drwxrwxr-x.  8 whx whx 4096 Feb  2 10:48 hbase
drwxrwxr-x.  9 whx whx 4096 Jan 30 19:27 hive
drwxr-xr-x.  8 whx whx 4096 Dec 13  2016 jdk1.8.0_121
drwxr-xr-x.  8 whx whx 4096 Feb  1 16:32 kafka
drwxrwx---.  5 whx whx 4096 May 24  2019 phoenix
drwxr-xr-x.  9 whx whx 4096 Apr 27  2015 sqoop-1.4.6.bin__hadoop-2.0.4-alpha
drwxr-xr-x. 11 whx whx 4096 Jan 29 22:01 zookeeper-3.4.10
[whx@hadoop102 module]$ mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha/  sqoop
[whx@hadoop102 module]$ ll
total 36
drwxrwxr-x.  9 whx whx 4096 Jan 31 14:45 flume
drwxr-xr-x. 11 whx whx 4096 Jan 31 10:43 hadoop-2.7.2
drwxrwxr-x.  8 whx whx 4096 Feb  2 10:48 hbase
drwxrwxr-x.  9 whx whx 4096 Jan 30 19:27 hive
drwxr-xr-x.  8 whx whx 4096 Dec 13  2016 jdk1.8.0_121
drwxr-xr-x.  8 whx whx 4096 Feb  1 16:32 kafka
drwxrwx---.  5 whx whx 4096 May 24  2019 phoenix
drwxr-xr-x.  9 whx whx 4096 Apr 27  2015 sqoop
drwxr-xr-x. 11 whx whx 4096 Jan 29 22:01 zookeeper-3.4.10
[whx@hadoop102 module]$ 

3、修改配置文件

可以在/etc/profile中配置,导出为全局变量。或者在 /opt/module/sqoop/conf/sqoop-env-template.sh修改为:sqoop-env.sh。并配置 HADOOP_HOME,HIVE_HOME,HBASE_HOME,ZOOKEEPER_HOME

JAVA_HOME=/opt/module/jdk1.8.0_121
HADOOP_HOME=/opt/module/hadoop-2.7.2
ZOOKEEPER_HOME=/opt/module/hadoop-2.7.2
HIVE_HOME=/opt/module/hive
FLUME_HOME=/opt/module/flume
HBASE_HOME=/opt/module/hbase
PHOENIX_HOME=/opt/module/phoenix
PHOENIX_CLASSPATH=$PHOENIX_HOME
SQOOP_HOME=/opt/module/sqoop
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$ZOOKEEPER_HOME/bin:$HIVE_HOME/bin:$FLUME_HOME/bin:$HBASE_HOME/bin:$PHOENIX_HOME/bin:$SQOOP_HOME/bin

export PATH JAVA_HOME HADOOP_HOME ZOOKEEPER_HOME HIVE_HOME FLUME_HOME HBASE_HOME PHOENIX_HOME PHOENIX_CLASSPATH SQOOP_HOME

4、拷贝JDBC驱动

找到一个可用jdbc驱动,拷贝该jdbc驱动到sqoop的lib目录下

[whx@hadoop102 ~]$ cp /opt/module/hive/lib/mysql-connector-java-5.1.27-bin.jar  /opt/module/sqoop/lib/

5、验证Sqoop

我们可以通过某一个command来验证sqoop配置是否正确:

[whx@hadoop102 ~]$ sqoop help

出现一些Warning警告(警告信息已省略),并伴随着帮助命令的输出:

[whx@hadoop102 ~]$ sqoop help
Warning: /opt/module/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/module/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
21/02/04 14:12:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.
[whx@hadoop102 ~]$

6、测试Sqoop是否能够成功连接Mysql数据库

[whx@hadoop102 ~]$ sqoop list-databases --connect jdbc:mysql://hadoop102:3306/ --username root --password 123456
Warning: /opt/module/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/module/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
21/02/04 14:15:18 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
21/02/04 14:15:18 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
21/02/04 14:15:18 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/phoenix-4.14.2-HBase-1.3-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
information_schema
metastore
mysql
performance_schema
test
[whx@hadoop102 ~]$ 

出现如下输出:

information_schema
metastore
mysql
performance_schema
test

在这里插入图片描述

三、sqoop的使用

在Mysql中新建一张表并插入一些数据用于测试

$ mysql -uroot -p123456
mysql> create database company;
mysql> create table company.staff(id int(4) primary key not null auto_increment, name varchar(255), sex varchar(255));
mysql> insert into company.staff(name, sex) values('Thomas', 'Male');
mysql> insert into company.staff(name, sex) values('Catalina', 'FeMale');

在这里插入图片描述

1、开启Hadoop-Yarn

[whx@hadoop102 ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-whx-resourcemanager-hadoop102.out
hadoop103: starting nodemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-whx-nodemanager-hadoop103.out
hadoop101: starting nodemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-whx-nodemanager-hadoop101.out
hadoop102: starting nodemanager, logging to /opt/module/hadoop-2.7.2/logs/yarn-whx-nodemanager-hadoop102.out
[whx@hadoop102 ~]$ xcall.sh jps
要执行的命令是jps
----------------------------hadoop101----------------------------------
5411 HRegionServer
3476 DataNode
3591 QuorumPeerMain
6583 NodeManager
3354 NameNode
5547 HMaster
6718 Jps
----------------------------hadoop102----------------------------------
4336 QuorumPeerMain
18641 Jps
5123 SqlLine
14551 HRegionServer
4077 DataNode
18493 NodeManager
14381 HMaster
18190 ResourceManager
----------------------------hadoop103----------------------------------
3553 QuorumPeerMain
3474 SecondaryNameNode
5874 HRegionServer
3347 DataNode
7524 Jps
6009 HMaster
7387 NodeManager
[whx@hadoop102 ~]$ 

2、导入数据

在Sqoop中,“导入”概念指:从非大数据集群(RDBMS)向大数据集群(HDFS,HIVE,HBASE)中传输数据,叫做:导入,即使用import关键字。

2.1 Mysql到HDFS

  1. 查看mysql中所有的数据库
sqoop list-databases --connect jdbc:mysql://hadoop01:3306 --username root --password 123456
  1. 查看mysql中某个库下所有的表
sqoop list-tables --connect jdbc:mysql://hadoop01:3306/scm --username root --password 123456

注意:

​ 当忘记某个属性的时候, 可以通过使用 --help方式查询

  1. 从mysql中将数据导入到HDFS(全量导入)
命令1:
sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root --password 123456 \
--table emp 

通过测试发现, 只需要指定输入端, 即可将数据导入HDFS中, 默认情况下, 将数据导入到所操作用户对应家目录下, 建立一个与导入表同名的目录, 同时发现表中有几条数据, sqoop默认就会启动几个map来读取数据, 默认分割符号为 逗号

如何减少map的数量呢?  只需要添加  -m的参数即可
sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root --password 123456 \
--table emp \
-m 1

如果想使用两个map呢? 建议添加--split-by 表示按照那个字段进行分割表数据
sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root --password 123456 \
--table emp \
--split-by id \
-m 2

想要修改其默认的分割符号: 更改为 空格
sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root --password 123456 \
--table emp \
--fields-terminated-by ' ' \
--split-by id \
-m 2

想要指定某个目的地:  --target-dir (指定目的地)  和   --delete-target-dir(目的地目录存在, 先删除)
sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root --password 123456 \
--table emp \
--fields-terminated-by ' ' \
--target-dir '/sqoop_emp' \
--delete-target-dir \
--split-by id \
-m 2
  1. 使用sqoop导入到hive中(全量导入)
第一步: 在hive中先建立目标表 
    create database sqooptohive;
    use sqooptohive;
    create table sqooptohive.emp_hive(
    	id int,
    	name string,
    	deg string,
    	salary int ,
    	dept string
    ) row format delimited fields terminated by '\t' stored as orc;

第二部: 执行数据导入操作:  HCataLog
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/test \
--username root \
--password 123456 \
--table emp \
--fields-terminated-by '\t' \
--hcatalog-database sqooptohive \
--hcatalog-table emp_hive \
-m 1



属性说明:
--hcatalog-database  指定数据库名称
--hcatalog-table   指定表的名称

注意:  使用此种方式, 在hive中建表的时候, 必须保证hive表字段和对应mysql表的字段名称保持一致

  1. 使用where条件的方式, 导入数据到HDFS(条件导入):
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/test \
--username root \
--password 123456 \
--table emp_add \
--target-dir /sqoop/emp_add \
--delete-target-dir \
-m 1  \
--where "city = 'sec-bad'"
  1. 使用SQL方式将数据导入到HDFS(条件导入)
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/test \
--username root \
--password 123456 \
--query 'select phno from emp_conn where 1=1 and  $CONDITIONS' \
--target-dir /sqoop/emp_conn \
--delete-target-dir \
-m 1 

注意:
   1)当采用SQL的方式来导入数据的时候, SQL的最后面必须添加 $CONDITIONS 关键词
   2) 整个SQL如果使用 "" 包裹的  $CONDITIONS 关键词前面需要使用\进行转义
   	"select phno from emp_conn where 1=1 and  \$CONDITIONS"

  1. 使用SQL方式将数据导入hive(条件导入) – 增量导入方式
sqoop import \
--connect jdbc:mysql://192.168.52.150:3306/test \
--username root \
--password 123456 \
--query "select * from emp where id>1205 and  \$CONDITIONS" \
--fields-terminated-by '\t' \
--hcatalog-database sqooptohive \
--hcatalog-table emp_hive \
-m 1

2.1.1 全部导入

  • 连接的url
    –connect jdbc:mysql://hadoop102:3306/company \
  • 用户名
    –username root \
  • 密码
    –password 123456 \
  • 要导出哪个表的数据
    –table staff \
  • 将数据导入到hdfs的哪个路径
    –target-dir /company \
  • 如果目标目录存在就删除
    –delete-target-dir \
  • 导入到hdfs上时,mysql中的字段使用\t作为分隔符
    –fields-terminated-by “\t” \
  • 设置几个MapTask来运行
    –num-mappers 2 \
  • 基于ID列,将数据切分为2片,只有在–num-mappers>1时才需要指定,选的列最好不要有null值,否则null是无法被导入的!尽量选取主键列,数字列
    –split-by id
[whx@hadoop102 ~]$ sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--table staff \
--target-dir /company \
--delete-target-dir \
--fields-terminated-by "\t" \
--num-mappers 2 \
--split-by id
Warning: /opt/module/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/module/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
21/02/04 14:43:51 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
21/02/04 14:43:51 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
21/02/04 14:43:51 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
21/02/04 14:43:51 INFO tool.CodeGenTool: Beginning code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/phoenix-4.14.2-HBase-1.3-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/02/04 14:43:51 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff` AS t LIMIT 1
21/02/04 14:43:52 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff` AS t LIMIT 1
21/02/04 14:43:52 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/module/hadoop-2.7.2
Note: /tmp/sqoop-whx/compile/0438fe2a8b732eae5cb55256e845c9c4/staff.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
21/02/04 14:43:53 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-whx/compile/0438fe2a8b732eae5cb55256e845c9c4/staff.jar
21/02/04 14:43:54 INFO tool.ImportTool: Destination directory /company is not present, hence not deleting.
21/02/04 14:43:54 WARN manager.MySQLManager: It looks like you are importing from mysql.
21/02/04 14:43:54 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
21/02/04 14:43:54 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
21/02/04 14:43:54 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
21/02/04 14:43:54 INFO mapreduce.ImportJobBase: Beginning import of staff
21/02/04 14:43:54 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
21/02/04 14:43:54 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
21/02/04 14:43:54 INFO client.RMProxy: Connecting to ResourceManager at hadoop102/192.168.1.102:8032
21/02/04 14:43:56 INFO db.DBInputFormat: Using read commited transaction isolation
21/02/04 14:43:56 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `staff`
21/02/04 14:43:56 INFO mapreduce.JobSubmitter: number of splits:2
21/02/04 14:43:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1612420994275_0001
21/02/04 14:43:56 INFO impl.YarnClientImpl: Submitted application application_1612420994275_0001
21/02/04 14:43:56 INFO mapreduce.Job: The url to track the job: http://hadoop102:8088/proxy/application_1612420994275_0001/
21/02/04 14:43:56 INFO mapreduce.Job: Running job: job_1612420994275_0001
21/02/04 14:44:04 INFO mapreduce.Job: Job job_1612420994275_0001 running in uber mode : false
21/02/04 14:44:04 INFO mapreduce.Job:  map 0% reduce 0%
21/02/04 14:44:08 INFO mapreduce.Job:  map 50% reduce 0%
21/02/04 14:44:09 INFO mapreduce.Job:  map 100% reduce 0%
21/02/04 14:44:09 INFO mapreduce.Job: Job job_1612420994275_0001 completed successfully
21/02/04 14:44:09 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=270734
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=197
		HDFS: Number of bytes written=32
		HDFS: Number of read operations=8
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=4
	Job Counters 
		Launched map tasks=2
		Other local map tasks=2
		Total time spent by all maps in occupied slots (ms)=5689
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=5689
		Total vcore-milliseconds taken by all map tasks=5689
		Total megabyte-milliseconds taken by all map tasks=5825536
	Map-Reduce Framework
		Map input records=2
		Map output records=2
		Input split bytes=197
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=271
		CPU time spent (ms)=1370
		Physical memory (bytes) snapshot=341798912
		Virtual memory (bytes) snapshot=4193730560
		Total committed heap usage (bytes)=229638144
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=32
21/02/04 14:44:09 INFO mapreduce.ImportJobBase: Transferred 32 bytes in 15.7015 seconds (2.038 bytes/sec)
21/02/04 14:44:09 INFO mapreduce.ImportJobBase: Retrieved 2 records.
[whx@hadoop102 ~]$ 

在这里插入图片描述

2.1.2 导入指定列 --columns(提示:columns中如果涉及到多列,用逗号分隔,分隔时不要添加空格)

  • 只导出id和name 列
    –columns id,name
$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--table staff \
--columns id,name \
--target-dir /company \
--delete-target-dir \
--fields-terminated-by "\t" \
--num-mappers 1 \
--split-by id

2.1.3 导入指定行 --where

  • 只导入复合过滤条件的行
    –where ‘id >= 10 and id <= 20’ \
$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--table staff \
--where 'id >= 10 and id <= 20' \
--target-dir /company \
--delete-target-dir \
--fields-terminated-by "\t" \
--num-mappers 1 \
--split-by id

2.1.4 查询导入

  • 提示:must contain ‘$CONDITIONS’ in WHERE clause.
  • 如果query后使用的是双引号,则$CONDITIONS前必须加转移符,防止shell识别为自己的变量。
  • 执行查询的SQL,讲查询的数据进行导入,如果使用了–query,不加再用–table,–where,–columns
  • 只要使用–query ,必须添加$CONDITONS,这个条件会被Sqoop自动替换为一些表达式
    –query “select * from staff where $CONDITIONS and id <= 25” \
$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--query "select * from staff where \$CONDITIONS and id <= 25" \
--target-dir /company \
--delete-target-dir \
--fields-terminated-by "\t" \
--num-mappers 2 \
--split-by id

2.2 Mysql到Hive

Sqoop导入到Hive:先将数据导入到HDFS,再将HDFS的数据Load到Hive表中。

  • 导入到Hive
    –hive-import \
  • 导入到Hive表中字段的分隔符
    –fields-terminated-by “\t” \
  • 是否以insert overwrite方式覆盖导入数据
    –hive-overwrite \
  • 要导入的Hive表的名称,会自动帮助我们建表。建议还是在Hive中手动建表,需要注意和Mysql表的数据类型匹配
    –hive-table staff_hive
$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--table staff \
--num-mappers 1 \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table staff_hive

提示:该过程分为两步,第一步将数据导入到HDFS,第二步将导入到HDFS的数据迁移到Hive仓库,第一步默认的临时目录是/user/atguigu/表名

3.3 Mysql到Hbase

  • sqoop1.4.6对应的是低版本的hbase,1.3.0的hbase版本相对来说有点高!
  • 在执行导入时,sqoop是可以帮我们自动建表,如果sqoop使用sqoop1.4.6版本,hbase使用1.3.0hbase,则建表会失败。 建议手动建表。
  • 如果表不存在,hbase自动建表
    –hbase-create-table \
  • 导入的表名
    –hbase-table “t_emp” \
  • mysql的哪一列作为rowkey
    –hbase-row-key “id” \
  • 导入的列族名
    –column-family “info” \
$ bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/company \
--username root \
--password 123456 \
--table staff \
--where 'id <= 5' \
--hbase-create-table \
--hbase-table "staff_hbase" \
--hbase-row-key "id" \
--column-family "info" \
--num-mappers 1 \
--split-by id

提示:sqoop1.4.6只支持HBase1.0.1之前的版本的自动创建HBase表的功能
解决方案:手动创建HBase表

hbase> create 'staff_hbase','info'

3、导出数据( Hive/HDFS到Mysql)

  • 在Sqoop中,“导出”概念指:从大数据集群(HDFS,HIVE)向非大数据集群(RDBMS)Mysql中传输数据,叫做:导出,即使用export关键字。

  • Mysql的表如果不存在,不会自动创建。所以需要提前创建。

  • 不支持直接从HBase向非大数据集群(RDBMS)直接导出。

  • 要导出的mysql的表名
    –table staff2
    –num-mappers 1 \

  • 导出的数据在HDFS上的路径
    –export-dir /company \

  • 导出时,基于哪一列判断数据重复。如果要导入的表已经有数据了,此时可以指定–update-key参数,通过此参数,可以讲导入的数据,使用updata语句进行导入,此时,只会更新重复的数据,不重复的数据是无法导入的!
    –update-key

  • 如果希望遇到重复的数据,就更新,不重复的数据就新增导入,可以使用–update-key 结合 --update-mode(默认为updateonly)。
    –update-mode=allowinsert

  • 导出的数据的分隔符
    –input-fields-terminated-by “\t”

[whx@hadoop102 ~]$ sqoop export \
--connect jdbc:mysql://hadoop102:3306/company?useUnicode=true&characterEncoding=utf-8' \
--username root \
--password 123456 \
--table staff2 \
--num-mappers 1 \
--export-dir /company \
--input-fields-terminated-by "\t"
Warning: /opt/module/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/module/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
21/02/04 15:25:37 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
21/02/04 15:25:37 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
21/02/04 15:25:37 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
21/02/04 15:25:37 INFO tool.CodeGenTool: Beginning code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/phoenix-4.14.2-HBase-1.3-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hbase/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/02/04 15:25:37 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff2` AS t LIMIT 1
21/02/04 15:25:37 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `staff2` AS t LIMIT 1
21/02/04 15:25:37 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/module/hadoop-2.7.2
Note: /tmp/sqoop-whx/compile/3030961eef5236a3e482591dec69c120/staff2.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
21/02/04 15:25:38 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-whx/compile/3030961eef5236a3e482591dec69c120/staff2.jar
21/02/04 15:25:38 INFO mapreduce.ExportJobBase: Beginning export of staff2
21/02/04 15:25:38 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
21/02/04 15:25:39 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
21/02/04 15:25:39 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
21/02/04 15:25:39 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
21/02/04 15:25:39 INFO client.RMProxy: Connecting to ResourceManager at hadoop102/192.168.1.102:8032
21/02/04 15:25:40 INFO input.FileInputFormat: Total input paths to process : 2
21/02/04 15:25:40 INFO input.FileInputFormat: Total input paths to process : 2
21/02/04 15:25:40 INFO mapreduce.JobSubmitter: number of splits:1
21/02/04 15:25:40 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
21/02/04 15:25:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1612420994275_0002
21/02/04 15:25:41 INFO impl.YarnClientImpl: Submitted application application_1612420994275_0002
21/02/04 15:25:41 INFO mapreduce.Job: The url to track the job: http://hadoop102:8088/proxy/application_1612420994275_0002/
21/02/04 15:25:41 INFO mapreduce.Job: Running job: job_1612420994275_0002
21/02/04 15:25:46 INFO mapreduce.Job: Job job_1612420994275_0002 running in uber mode : false
21/02/04 15:25:46 INFO mapreduce.Job:  map 0% reduce 0%
21/02/04 15:25:51 INFO mapreduce.Job:  map 100% reduce 0%
21/02/04 15:25:51 INFO mapreduce.Job: Job job_1612420994275_0002 completed successfully
21/02/04 15:25:51 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=135055
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=220
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=7
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=0
	Job Counters 
		Launched map tasks=1
		Rack-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=2304
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=2304
		Total vcore-milliseconds taken by all map tasks=2304
		Total megabyte-milliseconds taken by all map tasks=2359296
	Map-Reduce Framework
		Map input records=2
		Map output records=2
		Input split bytes=182
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=177
		CPU time spent (ms)=490
		Physical memory (bytes) snapshot=168833024
		Virtual memory (bytes) snapshot=2093150208
		Total committed heap usage (bytes)=115343360
	File Input Format Counters 
		Bytes Read=0
	File Output Format Counters 
		Bytes Written=0
21/02/04 15:25:51 INFO mapreduce.ExportJobBase: Transferred 220 bytes in 12.3045 seconds (17.8796 bytes/sec)
21/02/04 15:25:51 INFO mapreduce.ExportJobBase: Exported 2 records.
[whx@hadoop102 ~]$ 

在这里插入图片描述

案例:

需求: 将hive中emp_hive表导出到mysql中(全量导出)

  • 第一步: 需要在mysql中创建目标表 (必须操作)
CREATE TABLE `emp_out` (
  `id` INT(11) DEFAULT NULL,
  `name` VARCHAR(100) DEFAULT NULL,
  `deg` VARCHAR(100) DEFAULT NULL,
  `salary` INT(11) DEFAULT NULL,
  `dept` VARCHAR(10) DEFAULT NULL
) ENGINE=INNODB DEFAULT CHARSET=utf8;

  • 第二步: 执行sqoop的导出操作:
sqoop export \
--connect jdbc:mysql://192.168.52.150:3306/test \
--username root \
--password 123456 \
--table emp_out \
--hcatalog-database sqooptohive \
--hcatalog-table emp_hive \
-m 1

sqoop的相关参数

参数说明
–connect连接关系型数据库的URL
–username连接数据库的用户名
–password连接数据库的密码
–driverJDBC的driver class
–query或–e 将查询结果的数据导入,使用时必须伴随参–target-dir,–hcatalog-table,如果查询中有where条件,则条件后必须加上CONDITIONS关键字。 如果使用双引号包含sql,则​CONDITIONS前要加上\以完成转义:\$CONDITIONS
–hcatalog-database指定HCatalog表的数据库名称。如果未指定,default则使用默认数据库名称。提供 --hcatalog-database不带选项–hcatalog-table是错误的。
–hcatalog-table此选项的参数值为HCatalog表名。该–hcatalog-table选项的存在表示导入或导出作业是使用HCatalog表完成的,并且是HCatalog作业的必需选项。
–create-hcatalog-table此选项指定在导入数据时是否应自动创建HCatalog表。表名将与转换为小写的数据库表名相同。
–hcatalog-storage-stanza ‘stored as orc tblproperties (“orc.compress”=“SNAPPY”)’ \建表时追加存储格式到建表语句中,tblproperties修改表的属性,这里设置orc的压缩格式为SNAPPY
-m指定并行处理的MapReduce任务数量。 -m不为1时,需要用split-by指定分片字段进行并行导入,尽量指定int型。
–split-by id如果指定-split by, 必须使用$CONDITIONS关键字, 双引号的查询语句还要加\
–hcatalog-partition-keys --hcatalog-partition-valueskeys和values必须同时存在,相当于指定静态分区。允许将多个键和值提供为静态分区键。多个选项值之间用,(逗号)分隔。比如: --hcatalog-partition-keys year,month,day --hcatalog-partition-values 1999,12,31
–null-string ‘\N’ --null-non-string ‘\N’指定mysql数据为空值时用什么符号存储,null-string针对string类型的NULL值处理,–null-non-string针对非string类型的NULL值处理
–hive-drop-import-delims设置无视字符串中的分割符(hcatalog默认开启)
–fields-terminated-by ‘\t’设置字段分隔符

四、sqoop命令脚本打包

使用sqoop命令脚本文件,可以实现定时任务来进行导入导出。

sql的命令编写在脚本中,参数名和参数值之间,需要换行

创建一个.opt文件

$ mkdir opt
$ touch opt/job_mysql2hdfs.opt

脚本格式:

--connect
jdbc:mysql://hadoop102:3306/company?useUnicode=true&characterEncoding=utf-8'
--username
root
--password
123456
--table
staff
--target-dir
/company3
--delete-target-dir
--num-mappers
1
--fields-terminated-by
"\t"
--split-by
id

执行脚本:

sqoop --options-file opt/job_mysql2hdfs.opt
  • 0
    点赞
  • 0
    评论
  • 1
    收藏
  • 一键三连
    一键三连
  • 扫一扫,分享海报

©️2021 CSDN 皮肤主题: 大白 设计师:CSDN官方博客 返回首页
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、C币套餐、付费专栏及课程。

余额充值