Upsert mode for data backflow (export)

走过冬季

Category: Offline Data Warehouse

Article link: https://blog.csdn.net/winterPassing/article/details/102506679

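Background

In the existing data backflow (export) scheme, a pre-step (delete/truncate) runs before each export to avoid duplicate data. Running it right before the export briefly affects queries, because the target table is empty or only partially loaded until the export finishes.

For this scenario we optimize the export to use upsert (update or insert).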

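Applicable scenarios

Near-real-time projects (data is exported frequently)
Cumulative calculations
Cases where previous results do not need to be deleted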

Requirements

The processed table contains a unique business field.

Implementation

Original export approach

# Clear the target table to avoid duplicate rows on re-export
# (note: truncate removes every row in the table, not only the current day's data)
delete_sql="truncate table $target_table;"
echo "delete_sql=$delete_sql"

# Run the delete, then export to MySQL only if the delete succeeded
sqoop eval --connect "jdbc:mysql://$HOSTNAME:$PORT/$DBNAME" \
  --username "$username" --password "$password" -e "$delete_sql" &&
sqoop export --connect "jdbc:mysql://$HOSTNAME:$PORT/$DBNAME?tinyInt1isBit=false" \
  --username "$username" --password "$password" --table "$target_table" \
  --fields-terminated-by '\001' --input-null-string '\\N' --input-null-non-string '\\N' \
  --export-dir "$hdfs_temp_dir"

After the change

# Export to MySQL in upsert mode (no truncate beforehand)
sqoop export -Dsqoop.export.statements.per.transaction=200 \
  --connect "jdbc:mysql://$HOSTNAME:$PORT/$DBNAME?tinyInt1isBit=false" \
  --username "$username" --password "$password" --table "$target_table" -m 1 \
  --fields-terminated-by '\001' --input-null-string '\\N' --input-null-non-string '\\N' \
  --export-dir "$hdfs_temp_dir" --update-key server_id --update-mode allowinsert

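Parameter notes:

--update-key: the column(s) used to match existing rows; multiple columns can be given, separated by commas

--update-mode: allowinsert inserts rows that have no match (the default, updateonly, only updates existing rows)

Note: the key needs a unique index on the MySQL table; if it consists of several columns, add a composite unique index.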
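The unique-index requirement can be prepared once before the first upsert export. The sketch below only builds the DDL statement; `build_unique_index_sql`, the table name `demo_table`, the second column `stat_date` and the `uk_` naming convention are all assumptions made for illustration. As far as I know, Sqoop's MySQL upsert is based on INSERT ... ON DUPLICATE KEY UPDATE, which would explain why the index, rather than --update-key alone, decides whether a row is updated or inserted.

```shell
#!/usr/bin/env bash
# Sketch: build the DDL for the unique index that --update-key relies on.
# build_unique_index_sql is a hypothetical helper, not part of Sqoop.
# Usage: build_unique_index_sql <table> <comma-separated key columns>
build_unique_index_sql() {
  local table="$1"
  local keys="$2"
  # Index name uk_<col1>_<col2>... (illustrative naming convention)
  local index_name
  index_name="uk_$(printf '%s' "$keys" | tr ',' '_')"
  printf 'ALTER TABLE %s ADD UNIQUE KEY %s (%s);\n' "$table" "$index_name" "$keys"
}

# Single-column key, matching the server_id key used in the export command
build_unique_index_sql "demo_table" "server_id"
# Composite key (stat_date is a made-up second column)
build_unique_index_sql "demo_table" "server_id,stat_date"
```

The printed statement can be reviewed and then run with the mysql client, or passed to sqoop eval -e in the same way the original flow passes its delete_sql.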

PS: -m 1 sets the export parallelism to 1, which resolves the deadlocks that occur during batch updates.

Exception message:

Caused by: java.sql.BatchUpdateException: Deadlock found when trying to get lock; try restarting transaction

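You can also set either of the following parameters to increase export throughput (QPS). As generic Hadoop options, they must come right after "sqoop export", before the tool-specific arguments.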

# Number of records written per INSERT statement, default 100
-Dsqoop.export.records.per.statement=100

# Number of INSERT statements per transaction, default 100
-Dsqoop.export.statements.per.transaction=100
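The two settings multiply, so a quick calculation shows roughly how many rows can be committed in one transaction. This is a sketch of the arithmetic only, assuming both settings apply as documented; `rows_per_transaction` is a hypothetical helper, not a Sqoop command.

```shell
#!/usr/bin/env bash
# Sketch: upper bound on rows committed per transaction.
# rows_per_transaction is a hypothetical helper for illustration.
rows_per_transaction() {
  local records_per_statement="${1:-100}"       # sqoop.export.records.per.statement
  local statements_per_transaction="${2:-100}"  # sqoop.export.statements.per.transaction
  echo $(( records_per_statement * statements_per_transaction ))
}

# Default records per statement (100) with 200 statements per transaction
rows_per_transaction 100 200   # prints 20000
```

Larger transactions mean fewer commits but hold locks longer, so it seems sensible to raise these values gradually, given the deadlock issue described above.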

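# Update 2021.06.29

When no columns are specified, columns such as the ID are updated as well. Use the following parameter to exclude columns that should not be updated:

--columns: lists the MySQL columns to write; leave out columns such as the auto-increment id and edit_time that should not be updated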
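A minimal sketch of an upsert export with an explicit column list follows. `run_upsert_export`, `print_args`, the `SQOOP_BIN` variable, the host, the paths and the column names `online_cnt` and `stat_date` are all hypothetical. It assumes that the update key column stays in the list and that the listed columns appear in the same order as the fields in the exported HDFS files.

```shell
#!/usr/bin/env bash
# Sketch: upsert export that writes only the listed columns.
# run_upsert_export and print_args are hypothetical helpers.
# Usage: run_upsert_export <jdbc_url> <table> <export_dir> <update_key> <columns> [extra sqoop args...]
run_upsert_export() {
  local jdbc_url="$1"
  local table="$2"
  local export_dir="$3"
  local update_key="$4"
  local columns="$5"
  shift 5
  # SQOOP_BIN can be overridden for a dry run; it defaults to the real sqoop CLI.
  "${SQOOP_BIN:-sqoop}" export \
    --connect "$jdbc_url" \
    --table "$table" \
    -m 1 \
    --fields-terminated-by '\001' \
    --input-null-string '\\N' \
    --input-null-non-string '\\N' \
    --export-dir "$export_dir" \
    --update-key "$update_key" \
    --update-mode allowinsert \
    --columns "$columns" \
    "$@"
}

# Dry-run stand-in that prints its arguments instead of calling sqoop.
print_args() {
  printf '%s ' "$@"
  printf '\n'
}

# Dry run with placeholder values; id and edit_time are left out of the column list.
SQOOP_BIN=print_args run_upsert_export \
  "jdbc:mysql://db.example.com:3306/demo?tinyInt1isBit=false" \
  "demo_table" "/tmp/demo_export" "server_id" "server_id,online_cnt,stat_date"
```

For a real run, drop the SQOOP_BIN override and pass the credentials as extra arguments, for example --username "$username" --password "$password".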