Incremental data collection with Sqoop, and merging it with the full data set

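Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include the original source link and this notice.
Article link: https://blog.csdn.net/kx306_csdn/article/details/90213718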

1. Create the test table game_player in MySQL

CREATE TABLE `game_player` (
  `player_id` int(10) NOT NULL AUTO_INCREMENT,
  `player_name` varchar(64) DEFAULT NULL,
  `create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `update_time` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`player_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

2. Insert test data for T-2 (day T is 2019-05-16)

insert into game_player(player_name,create_time) values('hanxin','2019-05-14 10:10:10');
insert into game_player(player_name,create_time) values('dianwei','2019-05-14 10:10:20');

mysql> select * from game_player;
+-----------+-------------+---------------------+-------------+
| player_id | player_name | create_time         | update_time |
+-----------+-------------+---------------------+-------------+
|         1 | hanxin      | 2019-05-14 10:10:10 | NULL        |
|         2 | dianwei     | 2019-05-14 10:10:20 | NULL        |
+-----------+-------------+---------------------+-------------+

3. First import: load the full data set into the Hive table all_game_player

sqoop import \
--connect jdbc:mysql://192.168.0.1:3306/mydb \
--username root \
--password ********* \
--table game_player \
--target-dir /user/hive/external/king_player \
--fields-terminated-by '\t' \
--hive-import \
--hive-database default \
--hive-table all_game_player \
--hive-partition-key day \
--hive-partition-value "2019-05-14"

hive> select * from all_game_player;
OK
1	hanxin	2019-05-14 10:10:10.0	null	2019-05-14
2	dianwei	2019-05-14 10:10:20.0	null	2019-05-14

4. Insert and update test data in MySQL (insert rows for T-1 and T, update a row from T-2)

insert into game_player(player_name,create_time) values('houyi','2019-05-15 10:10:10');
insert into game_player(player_name,create_time) values('luban','2019-05-15 10:10:11');
insert into game_player(player_name,create_time) values('baiqi','2019-05-15 10:10:12');
insert into game_player(player_name,create_time) values('yase','2019-05-15 10:10:13');
insert into game_player(player_name,create_time) values('direnjie','2019-05-16 10:10:12');
insert into game_player(player_name,create_time) values('yuji','2019-05-16 10:10:10');
update game_player set player_name='libai' where player_id=1;

mysql> select * from game_player;
+-----------+-------------+---------------------+---------------------+
| player_id | player_name | create_time         | update_time         |
+-----------+-------------+---------------------+---------------------+
|         1 | libai       | 2019-05-14 10:10:10 | 2019-05-15 03:34:47 |
|         2 | dianwei     | 2019-05-14 10:10:20 | NULL                |
|         3 | houyi       | 2019-05-15 10:10:10 | NULL                |
|         4 | luban       | 2019-05-15 10:10:11 | NULL                |
|         5 | baiqi       | 2019-05-15 10:10:12 | NULL                |
|         6 | yase        | 2019-05-15 10:10:13 | NULL                |
|         7 | direnjie    | 2019-05-16 10:10:12 | NULL                |
|         8 | yuji        | 2019-05-16 10:10:10 | NULL                |
+-----------+-------------+---------------------+---------------------+

5. Incremental import into the Hive table inc_game_player (rows created on T-1 plus rows changed on T-1)

sqoop import \
--connect jdbc:mysql://192.168.0.1:3306/mydb \
--username root \
--password ********* \
--table game_player \
--where "(create_time>='2019-05-15 00:00:00' and create_time<'2019-05-16 00:00:00') or (update_time>='2019-05-15 00:00:00' and update_time<'2019-05-16 00:00:00')" \
--target-dir /user/hive/external/king_player \
--fields-terminated-by '\t' \
--hive-import \
--hive-database default \
--hive-table inc_game_player

hive> select * from inc_game_player;
OK
1	libai	2019-05-14 10:10:10.0	2019-05-15 03:34:47.0
3	houyi	2019-05-15 10:10:10.0	null
4	luban	2019-05-15 10:10:11.0	null
5	baiqi	2019-05-15 10:10:12.0	null
6	yase	2019-05-15 10:10:13.0	null
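
For a daily job, the date window in the --where clause would normally be computed rather than typed by hand. The following is a minimal sketch of that idea, not part of the original walkthrough: build_where is a hypothetical helper name, the script assumes GNU date (for the -d option), and it bounds update_time on both sides on the assumption that only rows changed on T-1 are wanted.

```shell
#!/usr/bin/env bash
# Sketch: build the --where clause for one day's increment.
# build_where is a hypothetical helper; GNU date is assumed for "-d".
set -eu

build_where() {
  local day="$1"   # T-1, for example 2019-05-15
  local next
  # Day T is the exclusive upper bound; -u avoids daylight-saving surprises.
  next="$(date -u -d "${day} +1 day" +%F)"
  local lo="${day} 00:00:00"
  local hi="${next} 00:00:00"
  printf "(create_time>='%s' and create_time<'%s') or (update_time>='%s' and update_time<'%s')" \
    "$lo" "$hi" "$lo" "$hi"
}

# Print the clause for the example date used in this article.
build_where "2019-05-15"
echo
```

The result can then be passed to the import as --where "$(build_where "$day")", with the remaining sqoop options unchanged.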

6. Merge the incremental data with the full data and write the result to a new partition of all_game_player

insert overwrite table all_game_player partition(day='2019-05-15')
select coalesce(a.player_id,i.player_id) as player_id,
coalesce(i.player_name,a.player_name) as player_name,
coalesce(i.create_time,a.create_time) as create_time,
coalesce(i.update_time,a.update_time) as update_time
from (select * from all_game_player where day='2019-05-14') a
full outer join inc_game_player i on a.player_id=i.player_id;

hive> select * from all_game_player where day='2019-05-15';
OK
1	libai	2019-05-14 10:10:10.0	2019-05-15 03:34:47.0	2019-05-15
2	dianwei	2019-05-14 10:10:20.0	null	2019-05-15
3	houyi	2019-05-15 10:10:10.0	null	2019-05-15
4	luban	2019-05-15 10:10:11.0	null	2019-05-15
5	baiqi	2019-05-15 10:10:12.0	null	2019-05-15
6	yase	2019-05-15 10:10:13.0	null	2019-05-15
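
To repeat the merge every day, the statement can be generated from the partition date. The sketch below is an illustration under stated assumptions: merge_sql is a hypothetical helper name, GNU date is assumed, and the previous day's partition is read explicitly so that older partitions of all_game_player do not enter the join. The script only prints the HiveQL.

```shell
#!/usr/bin/env bash
# Sketch: generate the merge statement for a given partition day.
# merge_sql is a hypothetical helper; GNU date is assumed for "-d".
set -eu

merge_sql() {
  local day="$1"   # new partition, for example 2019-05-15
  local prev
  # The previous day's partition holds the latest full snapshot.
  prev="$(date -u -d "${day} 1 day ago" +%F)"
  cat <<SQL
insert overwrite table all_game_player partition(day='${day}')
select coalesce(a.player_id,i.player_id) as player_id,
coalesce(i.player_name,a.player_name) as player_name,
coalesce(i.create_time,a.create_time) as create_time,
coalesce(i.update_time,a.update_time) as update_time
from (select * from all_game_player where day='${prev}') a
full outer join inc_game_player i on a.player_id=i.player_id;
SQL
}

# Print the statement for the example date used in this article.
merge_sql "2019-05-15"
```

To execute the statement instead of printing it, the output could be handed to the Hive CLI, for example hive -e "$(merge_sql "$day")".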

For more on merging incremental data with full data, see: https://blog.csdn.net/kx306_csdn/article/details/89508323

 
