Sqoop Incremental Import

Sqoop supports two incremental import modes: append and lastmodified.

The main Sqoop parameters involved are:

--check-column: the column the incremental import keys on, usually an auto-increment primary key id or a timestamp

--incremental: the import mode (append or lastmodified)

--last-value: the maximum value from the previous import, i.e. the starting value for this run
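The boundary semantics of these parameters can be sketched in plain Python standing in for the SQL that Sqoop generates; the `WHERE check_column > last_value` shape shown in the comment is an assumption about the generated query, matching the open-interval behavior observed later in this walkthrough:

```python
# Hypothetical, simplified illustration of the append-mode boundary:
# Sqoop generates a query equivalent to "WHERE check_column > last_value",
# so the row whose id equals --last-value is not re-imported.
rows = [(i, "name%d" % i) for i in range(1, 9)]  # ids 1..8
last_value = 5
new_rows = [r for r in rows if r[0] > last_value]
print([r[0] for r in new_rows])  # ids picked up by the incremental run: [6, 7, 8]
```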

  • Append mode

1. Create a table with an auto-increment primary key:

create table test(
id int(20) primary key not null AUTO_INCREMENT,
name varchar(32)
)charset=utf8;

2. Insert data:

insert into test(id,name) values(1,'xiaozhao');
insert into test(id,name) values(2,'xiaozhang');
insert into test(id,name) values(3,'xiaosun');
insert into test(id,name) values(5,'xiaoli');
insert into test(id,name) values(6,'xiaozhou');
insert into test(id,name) values(4,'xiaowu');

3. Use Sqoop to import the table into HDFS under /test:

sqoop import --connect "jdbc:mysql://localhost:3306/wl?useUnicode=true&characterEncoding=utf-8" --username root --password 123456 --table test -m 1 --target-dir /test/test

4. Insert more data:

insert into test(id,name) values(7,'xiaozheng');
insert into test(id,name) values(8,'xiaowang');

5. Show the data:

mysql> select * from test;
+----+-----------+
| id | name      |
+----+-----------+
|  1 | xiaozhao  |
|  2 | xiaozhang |
|  3 | xiaosun   |
|  4 | xiaowu    |
|  5 | xiaoli    |
|  6 | xiaozhou  |
|  7 | xiaozheng |
|  8 | xiaowang  |
+----+-----------+
8 rows in set (0.00 sec)

6. Run the incremental import in append mode:

sqoop import --connect "jdbc:mysql://localhost:3306/wl?useUnicode=true&characterEncoding=utf-8" --username root --password 123456 --table test -m 1 --check-column id --incremental append --last-value 5 --target-dir /test/test

7. View the results:

hadoop:hadoop:/home/hadoop:>hadoop fs -ls test
18/01/28 12:12:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your pla
Found 3 items
-rw-r--r--   1 hadoop supergroup          0 2018-01-28 11:49 test/_SUCCESS
-rw-r--r--   1 hadoop supergroup         62 2018-01-28 11:49 test/part-m-00000
-rw-r--r--   1 hadoop supergroup         34 2018-01-28 12:11 test/part-m-00001
hadoop:hadoop:/home/hadoop:>hadoop fs -cat test/part-m-00001
18/01/28 12:12:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your pla
6,xiaozhou
7,xiaozheng
8,xiaowang
hadoop:hadoop:/home/hadoop:>hadoop fs -cat test/part-m-00000
18/01/28 12:13:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your pla
1,xiaozhao
2,xiaozhang
3,xiaosun
4,xiaowu
5,xiaoli
6,xiaozhou

PS: Notice that I set --last-value to 5, so Sqoop started importing from 6; in other words, the interval is open and the boundary value itself is excluded.

  • Lastmodified mode

This mode differs from append in that the check column can be a timestamp, and rows are imported in time order. It also lets you choose how the incremental data is stored on HDFS: --append behaves just like append mode, appending a new file each run, while --merge-key merges the results, so the final incremental output is a single file, part-r-00000.
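The lower-bound behavior of lastmodified mode can be sketched as follows; this is plain Python standing in for the generated SQL, and the inclusive `>=` comparison is inferred from the duplicated boundary row (id 6) that shows up in the output later in this walkthrough:

```python
from datetime import datetime

# Simplified sketch (assumed semantics): lastmodified mode selects rows
# with check_column >= last_value, i.e. the lower bound is inclusive,
# which is why a row whose timestamp equals --last-value is imported again.
rows = [
    (5, "xiaoli",    datetime(2018, 1, 28, 12, 24, 39)),
    (6, "xiaozhou",  datetime(2018, 1, 28, 12, 24, 40)),
    (7, "xiaozheng", datetime(2018, 1, 28, 12, 29, 28)),
]
last_value = datetime(2018, 1, 28, 12, 24, 40)
picked = [r[0] for r in rows if r[2] >= last_value]
print(picked)  # ids selected by this incremental run: [6, 7]
```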

1. Create a MySQL table with a timestamp column:

create table test2(
id int,
name varchar(32),
lasttime timestamp default CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
)charset=utf8;

2. Insert data:

insert into test2(id,name) values(1,'xiaozhao');
insert into test2(id,name) values(2,'xiaozhang');
insert into test2(id,name) values(3,'xiaosun');
insert into test2(id,name) values(4,'xiaowu');
insert into test2(id,name) values(5,'xiaoli');
insert into test2(id,name) values(6,'xiaozhou');

3. Import into HDFS:

sqoop import --connect "jdbc:mysql://localhost:3306/wl?useUnicode=true&characterEncoding=utf-8" --username root --password 123456 --table test2 -m 1 --target-dir /test/test2

4. Insert more data:

insert into test2(id,name) values(7,'xiaozheng');

5. Sqoop incremental import (--append):


sqoop import --connect "jdbc:mysql://localhost:3306/wl?useUnicode=true&characterEncoding=utf-8" --username root --password 123456 --table test2 -m 1 --target-dir /test/test2 --check-column lasttime --incremental lastmodified --last-value "2018-01-28 12:24:40" --append 

6. View the results:


hadoop:hadoop:/home/hadoop:>hadoop fs -cat /test/test2/part-m-00000

18/01/28 12:28:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1,xiaozhao,2018-01-28 12:23:58.0
2,xiaozhang,2018-01-28 12:23:58.0
3,xiaosun,2018-01-28 12:24:00.0
4,xiaowu,2018-01-28 12:24:29.0
5,xiaoli,2018-01-28 12:24:39.0
6,xiaozhou,2018-01-28 12:24:40.0

hadoop:hadoop:/home/hadoop:>hadoop fs -cat /test/test2/part-m-00001
18/01/28 12:35:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
6,xiaozhou,2018-01-28 12:24:40.0
7,xiaozheng,2018-01-28 12:29:28.0

There is a problem here: the --last-value I specified is the time of the first import's execution, yet with --append the lower bound of the time interval is closed (inclusive), which causes a small amount of duplicated data. On top of that, every incremental import produces another file, which is clearly not ideal. To address both problems, use the other storage option, --merge-key: it merges duplicate records by key, runs a full MapReduce job that produces a single part-r-00000, and also applies updates to existing rows.
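Conceptually, the merge step can be sketched like this; plain Python stands in for the actual MapReduce job, and the "newer record wins per key" rule is the behavior --merge-key id achieves:

```python
# Sketch of the --merge-key id semantics: combine the previously imported
# records with the incremental batch; for each key, the record from the
# newer batch replaces the old one, so updates (id 1) are applied and a
# duplicated boundary row (id 6) collapses to a single record.
old = {
    1: ("xiaozhao",  "2018-01-28 12:23:58"),
    6: ("xiaozhou",  "2018-01-28 12:24:40"),
}
new = {
    1: ("MARK",      "2018-01-28 12:44:37"),
    6: ("xiaozhou",  "2018-01-28 12:24:40"),
    7: ("xiaozheng", "2018-01-28 12:29:28"),
}
merged = dict(old)
merged.update(new)  # records in the incremental batch win by key
for key in sorted(merged):
    print(key, *merged[key])
```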

7. Modify data:

update test2 set name = 'MARK' where id = 1;

8. Check the data:

mysql> select * from test2;
+------+-----------+---------------------+
| id   | name      | lasttime            |
+------+-----------+---------------------+
|    1 | MARK      | 2018-01-28 12:44:37 |
|    2 | xiaozhang | 2018-01-28 12:23:58 |
|    3 | xiaosun   | 2018-01-28 12:24:00 |
|    4 | xiaowu    | 2018-01-28 12:24:29 |
|    5 | xiaoli    | 2018-01-28 12:24:39 |
|    6 | xiaozhou  | 2018-01-28 12:24:40 |
|    7 | xiaozheng | 2018-01-28 12:29:28 |
+------+-----------+---------------------+
7 rows in set (0.00 sec)

9. Sqoop incremental import (--merge-key):

sqoop import --connect "jdbc:mysql://localhost:3306/wl?useUnicode=true&characterEncoding=utf-8" --username root --password 123456 --table test2 -m 1 --target-dir /test/test2 --check-column lasttime --incremental lastmodified --last-value "2018-01-28 12:24:40" --merge-key id

10. View the results:

hadoop:hadoop:/home/hadoop:>hadoop fs -cat /test/test2/part-r-00000
18/01/28 12:52:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1,MARK,2018-01-28 12:44:37.0
2,xiaozhang,2018-01-28 12:23:58.0
3,xiaosun,2018-01-28 12:24:00.0
4,xiaowu,2018-01-28 12:24:29.0
5,xiaoli,2018-01-28 12:24:39.0
6,xiaozhou,2018-01-28 12:24:40.0
7,xiaozheng,2018-01-28 12:29:28.0

