The Apache Sqoop Tool
Sqoop Overview
Apache Sqoop(TM) is a tool designed to efficiently transfer bulk data between Apache Hadoop and structured data stores such as relational databases. Through its embedded MapReduce jobs it imports and exports data between relational databases and HDFS, HBase, Hive, and so on.
Sqoop Installation
1. Visit the Sqoop website at http://sqoop.apache.org/ and choose a Sqoop release to download;
This walkthrough uses 1.4.7, downloaded from: https://mirrors.tuna.tsinghua.edu.cn/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
2. After downloading the tarball, extract Sqoop;
[root@CentOS ~]# tar -zxf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /usr/
[root@CentOS ~]# cd /usr/
[root@CentOS usr]# mv sqoop-1.4.7.bin__hadoop-2.6.0 sqoop-1.4.7
[root@CentOS ~]# cd /usr/sqoop-1.4.7/
3. Configure the SQOOP_HOME environment variable;
[root@CentOS sqoop-1.4.7]# vi ~/.bashrc
SQOOP_HOME=/usr/sqoop-1.4.7
HADOOP_HOME=/usr/hadoop-2.9.2
HIVE_HOME=/usr/apache-hive-1.2.2-bin
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$SQOOP_HOME/bin
CLASSPATH=.
export JAVA_HOME
export PATH
export HADOOP_HOME
export CLASSPATH
export HIVE_HOME
export SQOOP_HOME
[root@CentOS sqoop-1.4.7]# source ~/.bashrc
Because Sqoop transfers data through HDFS, Hive, and HBase, make sure those services are working properly before continuing;
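A quick way to sanity-check them (a sketch, assuming the daemons are started on this node and the hdfs/hive commands are on the PATH):
[root@CentOS ~]# jps
[root@CentOS ~]# hdfs dfs -ls /
[root@CentOS ~]# hive -e "show databases;"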
4. Rename and edit the sqoop-env-template.sh configuration file under conf;
[root@CentOS sqoop-1.4.7]# mv conf/sqoop-env-template.sh conf/sqoop-env.sh
[root@CentOS sqoop-1.4.7]# vi conf/sqoop-env.sh
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/usr/hadoop-2.9.2
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/usr/hadoop-2.9.2
#set the path to where bin/hbase is available
#export HBASE_HOME=
#Set the path to where bin/hive is available
export HIVE_HOME=/usr/apache-hive-1.2.2-bin
#Set the path for where zookeper config dir is
export ZOOCFGDIR=/usr/zookeeper-3.4.6/conf
5. Copy the MySQL driver jar into Sqoop's lib directory;
[root@CentOS ~]# cp /usr/apache-hive-1.2.2-bin/lib/mysql-connector-java-5.1.48.jar /usr/sqoop-1.4.7/lib/
This walkthrough imports data from and exports data to MySQL, so the MySQL JDBC driver jar is required;
6. Verify that Sqoop is installed correctly;
[root@CentOS ~]# sqoop version
Warning: /usr/sqoop-1.4.7/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /usr/sqoop-1.4.7/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hadoop-2.9.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hbase-1.2.4/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/02/06 21:01:10 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 2017
[root@CentOS ~]# sqoop list-tables --connect jdbc:mysql://10.15.0.1:3306/ayf --username root --password root
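Passing --password on the command line leaves the password in the shell history; Sqoop also accepts -P, which prompts for the password on the console instead, for example:
[root@CentOS ~]# sqoop list-tables --connect jdbc:mysql://10.15.0.1:3306/ayf --username root -P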
For more import/export details, see the official documentation: http://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html
sqoop-import
The import tool imports an individual table from an RDBMS into HDFS. Each row of the table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles;
- Command form:
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
Full-table import into HDFS
Parameter | Default | Meaning |
---|---|---|
--connect | | JDBC URL of the database to connect to |
--username | | Username for the database connection |
--password | | Password for the database connection |
--table | | Table whose data will be imported |
--target-dir | /user/&lt;username&gt;/&lt;table name&gt; | Directory in HDFS to import into |
--delete-target-dir | | If the target directory already exists in HDFS, delete it first, then import into it |
--num-mappers | 4 | Number of MapTasks to use, which determines how many files are generated in HDFS (the table's data is split across that many files) |
--fields-terminated-by | | Field delimiter for the records |
Example:
sqoop import \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://10.15.0.1:3306/ayf?characterEncoding=UTF-8 \
--username root \
--password root \
--table book \
--num-mappers 4 \
--fields-terminated-by '\t' \
--target-dir /mysql/test/books \
--delete-target-dir
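Once the job finishes, the imported data can be inspected directly in HDFS; with 4 map tasks the output is normally split into files part-m-00000 through part-m-00003 (a sketch):
[root@CentOS ~]# hdfs dfs -ls /mysql/test/books
[root@CentOS ~]# hdfs dfs -cat /mysql/test/books/part-m-00000 | head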
Importing selected columns into HDFS
Parameter | Meaning |
---|---|
--columns | Columns to select |
--where | Filter condition |
Example:
sqoop import \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://10.15.0.1:3306/ayf?characterEncoding=UTF-8 \
--username root \
--password root \
--table book \
--columns "book_name,book_price" \
--where "book_id > 2 or book_name like '%三国演义%'" \
--target-dir /mysql/test/books1 \
--delete-target-dir \
--num-mappers 4 \
--fields-terminated-by '\t'
Importing query results into HDFS
Parameter | Meaning |
---|---|
--query | SQL query to import |
--split-by | Column used to split the work among map tasks |
Note: if you want to import the results of a query in parallel, each map task needs to execute a copy of the query, with the results partitioned by bounding conditions inferred by Sqoop. Your query must contain the token $CONDITIONS (effectively a placeholder), which each Sqoop process replaces with a unique condition expression. You must also choose a splitting column with --split-by (the column used to fill in that placeholder).
Example:
sqoop import \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://10.15.0.1:3306/ayf?characterEncoding=UTF-8 \
--username root \
--password root \
--num-mappers 3 \
--fields-terminated-by '\t' \
--query 'select book_id, book_name,book_price,time from book where $CONDITIONS LIMIT 7' \
--split-by book_id \
--target-dir /mysql/test/books2 \
--delete-target-dir
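Note that the query above is wrapped in single quotes so the shell does not expand $CONDITIONS. If the query is wrapped in double quotes instead, the token must be escaped, i.e. only that line changes:
--query "select book_id, book_name,book_price,time from book where \$CONDITIONS LIMIT 7" \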
Full import into Hive
Parameter | Meaning |
---|---|
--hive-import | Import the data into Hive |
--hive-overwrite | If the table already exists, overwrite its existing data |
--hive-table | Hive table to import into |
Before importing, the required Hive jars must be copied into Sqoop's lib directory:
1. hive-common-xxx.jar  2. hive-exec-xxx.jar
[root@CentOS ~]# cp /usr/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar /usr/sqoop-1.4.7/lib/
[root@CentOS ~]# cp /usr/apache-hive-1.2.2-bin/lib/hive-exec-1.2.2.jar /usr/sqoop-1.4.7/lib/
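The --hive-table option below references the test database; Sqoop creates the table if needed but not the database, so it should exist beforehand (a minimal sketch, assuming hive is on the PATH):
[root@CentOS ~]# hive -e "create database if not exists test;"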
sqoop import \
--connect jdbc:mysql://10.15.0.1:3306/ayf?characterEncoding=UTF-8 \
--username root \
--password root \
--table book \
--num-mappers 3 \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table test.book
Partitioned import into Hive
Parameter | Meaning |
---|---|
--hive-partition-key | Partition column of the Hive table |
--hive-partition-value | Partition value to import into |
sqoop import \
--connect jdbc:mysql://10.15.0.1:3306/ayf?characterEncoding=UTF-8 \
--username root \
--password root \
--table t_files \
--num-mappers 1 \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table test.files \
--hive-partition-key types \
--hive-partition-value 'png'
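After the job completes, the partition can be checked from the Hive CLI (a sketch):
hive> show partitions test.files;
hive> select * from test.files where types='png' limit 5;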
Full import into HBase
Parameter | Meaning |
---|---|
--hbase-table | HBase table to write into |
--column-family | Column family to import into |
--hbase-create-table | Create the HBase table if it is missing |
--hbase-row-key | Column to use as the row key |
--hbase-bulkload | Enable HBase bulk loading |
sqoop import \
--connect jdbc:mysql://10.15.0.1:3306/ayf?characterEncoding=UTF-8 \
--username root \
--password root \
--table book \
--num-mappers 3 \
--hbase-table test:book \
--column-family cf1 \
--hbase-create-table \
--hbase-row-key book_id \
--hbase-bulkload
Start the HBase service and create the test namespace beforehand; the book table is created automatically (because of --hbase-create-table)!
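For example, from the HBase shell (a sketch; the namespace matches the test: prefix in --hbase-table above):
hbase(main):001:0> create_namespace 'test'
hbase(main):002:0> scan 'test:book', {LIMIT => 3}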
sqoop-export
The export tool exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.
HDFS -> MySQL
1. Prepare the data
0 zhangsan true 20 2020-01-11
1 lisi false 25 2020-01-10
3 wangwu true 36 2020-01-17
4 zhaoliu false 50 1990-02-08
5 win7 true 20 1991-02-08
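These records are tab-separated and must be uploaded to the --export-dir used below; for example (assuming the file was saved locally as /root/t_user):
[root@CentOS ~]# hdfs dfs -mkdir -p /demo/src/t_user
[root@CentOS ~]# hdfs dfs -put /root/t_user /demo/src/t_user/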
2. Create the MySQL table
create table t_user(
id int primary key auto_increment,
name VARCHAR(32),
sex boolean,
age int,
birthDay date
) CHARACTER SET=utf8;
3. Sqoop command
sqoop export \
--connect jdbc:mysql://10.15.0.1:3306/ayf?characterEncoding=UTF-8 \
--username root \
--password root \
--table t_user \
--update-key id \
--update-mode allowinsert \
--export-dir /demo/src/t_user \
--input-fields-terminated-by '\t'
Parameter | Meaning |
---|---|
--export-dir | HDFS directory (or file) containing the data to export |
--input-fields-terminated-by | Field delimiter of the input files |
--update-mode | Export mode, either updateonly or allowinsert; updateonly only updates records that already exist |
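The result can be verified from the MySQL client, for example (a sketch):
mysql> select * from ayf.t_user;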
HBASE -> RDBMS
Done in two steps: HBASE -> HIVE, then HIVE -> RDBMS (which is equivalent to HDFS -> RDBMS)
1. Prepare the test data
7369,SMITH,CLERK,7902,1980-12-17 00:00:00,800,\N,20
7499,ALLEN,SALESMAN,7698,1981-02-20 00:00:00,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22 00:00:00,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02 00:00:00,2975,\N,20
7654,MARTIN,SALESMAN,7698,1981-09-28 00:00:00,1250,1400,30
7698,BLAKE,MANAGER,7839,1981-05-01 00:00:00,2850,\N,30
7782,CLARK,MANAGER,7839,1981-06-09 00:00:00,2450,\N,10
7788,SCOTT,ANALYST,7566,1987-04-19 00:00:00,1500,\N,20
7839,KING,PRESIDENT,\N,1981-11-17 00:00:00,5000,\N,10
7844,TURNER,SALESMAN,7698,1981-09-08 00:00:00,1500,0,30
7876,ADAMS,CLERK,7788,1987-05-23 00:00:00,1100,\N,20
7900,JAMES,CLERK,7698,1981-12-03 00:00:00,950,\N,30
7902,FORD,ANALYST,7566,1981-12-03 00:00:00,3000,\N,20
7934,MILLER,CLERK,7782,1982-01-23 00:00:00,1300,\N,10
Create a table named test:t_employee in HBase;
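For example, from the HBase shell (a sketch; the cf1 column family matches the HBase mapping used in the Hive DDL below):
hbase(main):001:0> create_namespace 'test'
hbase(main):002:0> create 'test:t_employee', 'cf1'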
Execute the following SQL in Hive;
create database if not exists test;
use test;
drop table if exists t_employee;
CREATE TABLE t_employee(
empno INT,
ename STRING,
job STRING,
mgr INT,
hiredate TIMESTAMP,
sal DECIMAL(7,2),
comm DECIMAL(7,2),
deptno INT)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by '>'
lines terminated by '\n'
stored as textfile;
load data local inpath '/root/t_employee' overwrite into table t_employee;
drop table if exists t_employee_hbase;
create external table t_employee_hbase(
empno INT,
ename STRING,
job STRING,
mgr INT,
hiredate TIMESTAMP,
sal DECIMAL(7,2),
comm DECIMAL(7,2),
deptno INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES("hbase.columns.mapping"=
":key,cf1:ename,cf1:job,cf1:mgr,cf1:hiredate,cf1:sal,cf1:comm,cf1:deptno")
TBLPROPERTIES("hbase.table.name" = "test:t_employee");
insert overwrite table t_employee_hbase select empno,ename,job,mgr,hiredate,sal,comm,deptno from t_employee;
2. First, dump the data from the HBase-backed table into HDFS;
INSERT OVERWRITE DIRECTORY '/employee' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE select empno,ename,job,mgr,hiredate,sal,comm,deptno from t_employee_hbase;
3. Export the data from HDFS to the RDBMS
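The target t_employee table must already exist in MySQL; its DDL is not shown in this walkthrough, so the following is only one possible sketch, with columns mirroring the exported file:
create table t_employee(
 empno int primary key,
 ename varchar(32),
 job varchar(32),
 mgr int,
 hiredate datetime,
 sal decimal(7,2),
 comm decimal(7,2),
 deptno int
) CHARACTER SET=utf8;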
sqoop export \
--connect jdbc:mysql://10.15.0.1:3306/ayf?characterEncoding=UTF-8 \
--username root \
--password root \
--table t_employee \
--update-key empno \
--update-mode allowinsert \
--export-dir /employee/000000_0 \
--input-fields-terminated-by ',' \
--input-null-string '\\N' \
--input-null-non-string '\\N';