Sqoop Incremental Sync (Automatically Updating the Last Value)

ztcheck

Last modified: 2022-03-25 11:19:16

Category: sqoop. Tags: spark, hadoop, hdfs

First published: 2021-07-29 12:22:30

Article link: https://blog.csdn.net/lifewujianqiang/article/details/119207164

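When we use Sqoop for incremental sync, we have to specify a last value. Since data sync usually runs unattended, something has to record the last value of the previous incremental run and feed it into the next one automatically.

Maintaining the last value by hand is tedious and error-prone. After checking the official Sqoop documentation, I found that the Sqoop job tool provides exactly this capability.

I am recording it here for future reference.

The official documentation covers this in detail; below are a few key points: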

1. Purpose

The job tool allows you to create and work with saved jobs. Saved jobs remember the parameters used to specify a job, so they can be re-executed by invoking the job by its handle.
The key point is that a saved job can be re-executed.
If a saved job is configured to perform an incremental import, state regarding the most recently imported rows is updated in the saved job to allow the job to continually import only the newest rows.
This is the recommended way to do incremental sync.

2. Usage

Command syntax:

$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]

Arguments:

Argument	Description
--create <job-id>	Define a new saved job with the specified job-id (name). A second Sqoop command-line, separated by a --, should be specified; this defines the saved job.
--delete <job-id>	Delete a saved job.
--exec <job-id>	Given a job defined with --create, run the saved job.
--show <job-id>	Show the parameters for a saved job.
--list	List all saved jobs.
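
The actions in the table can be combined into a small idempotent helper. The following is a minimal sketch, not an official recipe: the function names `job_exists` and `ensure_job` are made up for illustration, and it assumes that `sqoop job --list` prints each saved job name as a separate word in its output.

```shell
#!/usr/bin/env bash
# Sketch: create a saved job only if it is not defined yet.
# job_exists and ensure_job are hypothetical helper names.

job_exists() {
  # Assumption: --list prints the saved job names; match the name as a whole word.
  sqoop job --list 2>/dev/null | grep -qw -- "$1"
}

ensure_job() {
  # Usage: ensure_job <job-id> <subtool> <subtool-args...>
  local job_id="$1"; shift
  if job_exists "$job_id"; then
    echo "job $job_id already defined"
  else
    # Everything after the bare -- is the saved Sqoop command line.
    sqoop job --create "$job_id" -- "$@"
  fi
}

# Example call (placeholders, not run here):
#   ensure_job myjob import --connect jdbc:mysql://xxx:3306/xxx --table category
#   sqoop job --show myjob     # inspect the saved parameters
#   sqoop job --exec myjob     # run it
#   sqoop job --delete myjob   # remove it again
```

Because `--create` fails when the job name is already taken, checking `--list` first lets a scheduler call the helper on every run.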

Overriding arguments at execution time:

The exec action allows you to override arguments of the saved job by supplying
them after a --. For example, if the database were changed to require a username, 
we could specify the username and password with:

$ sqoop job --exec myjob -- --username someuser -P
Enter password:

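3. Metastore-related parameters

When a sqoop job is executed, Sqoop needs to know which metastore to connect to (--meta-connect), and by default it prompts for the password on every run. How can we avoid entering the password and the metastore address each time?
Sqoop provides two configuration properties for this: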

sqoop.metastore.client.enable.autoconnect=true
sqoop.metastore.client.record.password=true

With these set, you no longer have to enter the password on every run.
The password in question is the one for the database given in --connect.
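
The post does not say where these two properties are set. A minimal sketch, assuming they go into conf/sqoop-site.xml as ordinary Hadoop-style property entries inside the `<configuration>` element; the function name `print_metastore_properties` is only illustrative:

```shell
#!/usr/bin/env bash
# Print the two properties as Hadoop-style <property> entries.
# Assumption: they are pasted inside <configuration> in conf/sqoop-site.xml.
print_metastore_properties() {
  cat <<'EOF'
<property>
  <name>sqoop.metastore.client.enable.autoconnect</name>
  <value>true</value>
</property>
<property>
  <name>sqoop.metastore.client.record.password</name>
  <value>true</value>
</property>
EOF
}

print_metastore_properties
```

Keep in mind that recording the password means it is stored with the job definition in the metastore, so access to the metastore should be restricted accordingly.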

4. Notes on incremental imports

Incremental imports are performed by comparing the values in a check column against a
reference value for the most recent import. For example, if the --incremental append 
argument was specified, along with --check-column id and --last-value 100, all rows 
with id > 100 will be imported. If an incremental import is run from the command line,
the value which should be specified as --last-value in a subsequent incremental 
import will be printed to the screen for your reference. If an incremental import 
is run from a saved job, this value will be retained in the saved job. Subsequent 
runs of sqoop job --exec someIncrementalJob will continue to import only newer
rows than those previously imported.

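In short, when you run an incremental sync through a sqoop job, the Sqoop metastore automatically records the last value of the previous run, so subsequent executions pick up the newest data without you having to specify a last value. Very convenient!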
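
To make the bookkeeping concrete, here is a toy model of append mode written in plain shell and awk. It is not Sqoop code: the function name `incremental_pull`, the "id,payload" input format, and the state file that stands in for the saved job are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Toy model of append-mode bookkeeping; this is not how Sqoop is implemented.
# Rows arrive on stdin as "id,payload"; the state file plays the role of the saved job.

incremental_pull() {
  # Usage: incremental_pull <state-file> < rows.csv
  local state_file="$1"
  local last_value=0
  if [ -f "$state_file" ]; then
    last_value="$(cat "$state_file")"
  fi

  # Emit only rows with id > last_value, then store the largest id seen
  # so that the next call starts from there.
  awk -F, -v last="$last_value" -v state="$state_file" '
    BEGIN { max = last + 0 }
    ($1 + 0) > (last + 0) { print; if (($1 + 0) > max) max = $1 + 0 }
    END { print max > state }
  '
}

# Example call (not run here):
#   incremental_pull /tmp/category.state < rows.csv
```

Calling the function twice on a growing input returns only the rows added since the first call, which is the behaviour a saved job gives you without any state file of your own.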

5. Examples

sqoop job \
--create myjob1 \
-- import \
--connect jdbc:mysql://xxx:3306/xxx \
--username root \
--password root \
--table category \
--target-dir /user/survey/sqoop1 \
--incremental append \
--check-column updated_at \
--last-value "2019-06-12 12:39:43" \
-m 1 \
--fields-terminated-by "\001"

or

sqoop job \
--create t_goods_detail_info_job \
-- import \
--connect jdbc:mysql://***:30003/ali_api \
--username ali_api \
--password *** \
--table t_goods_detail_info \
--hive-import \
--hive-database ods \
--hive-table ods_t_goods_detail_info \
--hive-drop-import-delims \
--fields-terminated-by '\001' \
--null-string '\\N' \
--null-non-string '\\N' \
--map-column-hive datetime=String,timestamp=String,json=String \
--incremental append \
--check-column id \
--last-value 1 \
-m 1
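
After a job like the ones above has been executed, the stored last value can be checked without running another import. A minimal sketch: the helper name `stored_last_value` is made up, and it assumes that `sqoop job --show` prints the saved state as a line of the form `incremental.last.value = <value>`.

```shell
#!/usr/bin/env bash
# Sketch: read the last value that a saved job has recorded.
# Assumption: --show prints a line like "incremental.last.value = 42".

stored_last_value() {
  # Usage: stored_last_value <job-id>
  sqoop job --show "$1" 2>/dev/null |
    sed -n 's/^incremental\.last\.value *= *//p'
}

# Example call (not run here):
#   sqoop job --exec myjob1
#   stored_last_value myjob1
```

Unless the password is recorded in the metastore, `--show` may prompt for it just like `--exec` does.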

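A few details worth noting:
At this point (running in lastmodified mode with --append), a new file is generated under /user/survey/sqoop1 on HDFS. The file contains two records: one is the newly added row and the other is the modified row. The pre-modification version of that row is still in the file produced by the earlier full import.
Note: this --append is not the same thing as append mode (--incremental append). Here it selects how lastmodified mode writes its results, and the choices are --append or --merge-key.
If you replace --append with --merge-key id, no new file is added on HDFS. Instead, a reduce step produces one merged file that contains the updated old rows and the new rows, merged on id.
Update:

Sqoop's lastmodified mode and --append cannot be used together with a Hive table; doing so raises an error. Importing to a directory with --target-dir works fine.
merge-key in Sqoop does not seem to take effect for Hive, so updated rows are not merged.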
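
For reference, a saved job in lastmodified mode with --merge-key could be defined roughly as follows. This is a sketch under assumptions: the function name `create_lastmodified_job` is made up, the table, column, and directory are borrowed from the first example above, and -P (prompt for the password) replaces the inline password.

```shell
#!/usr/bin/env bash
# Sketch: define a saved job that uses lastmodified mode and merges on id.
# Table, column, and directory values are taken from the first example; adjust as needed.

create_lastmodified_job() {
  # Usage: create_lastmodified_job <job-id> <jdbc-url> <username>
  local job_id="$1" jdbc_url="$2" username="$3"
  sqoop job --create "$job_id" -- import \
    --connect "$jdbc_url" \
    --username "$username" -P \
    --table category \
    --target-dir /user/survey/sqoop1 \
    --incremental lastmodified \
    --check-column updated_at \
    --last-value "2019-06-12 12:39:43" \
    --merge-key id \
    -m 1
}

# Example call (not run here):
#   create_lastmodified_job myjob2 jdbc:mysql://xxx:3306/xxx root
```

The job writes to a plain HDFS directory rather than a Hive table, in line with the limitation noted above.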
