Three Sqoop Synchronization Strategies

Most recent recommended article published on 2023-06-09 10:12:42

ButterFly0612

Category: Collection Tools. Tags: sqoop, database, bash

Article link: https://blog.csdn.net/qq_64495672/article/details/127077102


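Full overwrite (full-load tables: typically used for dimension tables)
- No partitions are needed; delete first, then write, so the table is overwritten directly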

sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://hadoop01:3306/your_database?useUnicode=true&characterEncoding=UTF8&autoReconnect=true' \
--username your_username \
--password-file file:///export/data/sqoop.passwd \
--query "select * from your_table where 1=1 and \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_district \
-m 1

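Notes on the special syntax:
- -D: sets a property of the Sqoop program
- org.apache.sqoop.splitter.allow_text_splitter=true: must be enabled when the column passed to --split-by is a text type
- where 1=1: makes it easy to append further SQL conditions during development; it can be omitted here or simply ignored for now
- The Hive table is stored as ORC, so the sync has to go through HCatalog
- An HCatalog import can only append, so the table must be emptied before every sync: run truncate table your_table; first, then run the collection script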
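The truncate-then-import order from the notes above can be wrapped in a small script. This is a minimal sketch rather than the author's script: the `run` and `full_overwrite` helpers, the `DRY_RUN` switch, and the `JDBC_URL` and `DB_USER` variables are hypothetical names introduced for illustration, and it assumes the `hive` CLI is available on the PATH.

```shell
#!/bin/bash
# Sketch only: the helper names, DRY_RUN, JDBC_URL and DB_USER are assumptions,
# not part of Sqoop or of the original article.
JDBC_URL=${JDBC_URL:-'jdbc:mysql://hadoop01:3306/your_database'}
DB_USER=${DB_USER:-your_username}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Empty the Hive table, then reload it in full from the source table.
full_overwrite() {
    local hive_db=$1 table=$2
    # HCatalog imports append rows, so clear the previous load first.
    run hive -e "truncate table ${hive_db}.${table};" || return 1
    run sqoop import \
        --connect "$JDBC_URL" \
        --username "$DB_USER" \
        --password-file file:///export/data/sqoop.passwd \
        --query "select * from ${table} where 1=1 and \$CONDITIONS" \
        --hcatalog-database "$hive_db" \
        --hcatalog-table "$table" \
        -m 1
}
```

Setting `DRY_RUN=1` and calling `full_overwrite yp_ods t_district` prints the two commands without running them, which is a cheap way to check the generated query before touching the table.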

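New records only (incremental tables)
- Data is synced incrementally
- Rows are only ever inserted, never updated
- Use cases: login record tables, access log tables, and similar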

DROP TABLE if exists yp_ods.t_user_login;
CREATE TABLE yp_ods.t_user_login(
id string,
login_user string,
login_type string COMMENT 'login type (used at login)',
client_id string COMMENT 'push identifier id (used for login, third-party login, registration, payment callbacks, and pushing messages to users)',
login_time string,
login_ip string,
logout_time string
) COMMENT 'user login record table'
partitioned by (dt string)
row format delimited
fields terminated by '\t'
stored as orc
tblproperties ('orc.compress' = 'ZLIB');

# Fixed Sqoop sync command
# An alternative filter for yesterday's date: substr(login_time,1,10)='2021-12-19' (MySQL string positions start at 1)
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://hadoop01:3306/your_database?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password-file file:///export/data/sqoop.passwd \
--query "select *, '${yesterday}' as dt from t_user_login where 1=1 and (login_time between '${yesterday} 00:00:00' and '${yesterday} 23:59:59') and \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_user_login \
-m 1
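The one-day window in the query above is easy to get wrong by hand (missing spaces, unbalanced quotes). As a sketch, the query text could be generated by a helper; `build_incr_query` is a hypothetical function written for this illustration, not a Sqoop feature.

```shell
#!/bin/bash
# Sketch only: build_incr_query is a hypothetical helper, not a Sqoop feature.
# Builds the --query text for one day's rows of an insert-only table.
build_incr_query() {
    local table=$1 time_col=$2 day=$3
    # The escaped \$ keeps $CONDITIONS literal so Sqoop can substitute it later;
    # only the %s placeholders are filled in here.
    printf "select *, '%s' as dt from %s where 1=1 and (%s between '%s 00:00:00' and '%s 23:59:59') and \$CONDITIONS\n" \
        "$day" "$table" "$time_col" "$day" "$day"
}
```

It would then be passed as `--query "$(build_incr_query t_user_login login_time "$yesterday")"`.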

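New and updated records
- One date partition is added per day, holding that day's newly inserted and updated rows
- Examples: user tables, order tables, product tables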

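# your_table is a placeholder for the MySQL source table; ${yesterday} holds yesterday's date.
# --target-dir (for example /database/table/daystr=${yesterday}) belongs to plain HDFS imports;
# the Sqoop documentation lists it as unsupported together with the --hcatalog-* options.
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://hadoop01:3306/yipin?useUnicode=true&characterEncoding=UTF8&autoReconnect=true' \
--username root \
--password-file file:///export/data/sqoop.passwd \
--query "select * from your_table where 1=1 and (substr(create_time,1,10) = '${yesterday}' or substr(update_time,1,10) = '${yesterday}') and \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_store \
-m 1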
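In SQL, `and` binds more tightly than `or`, so the created-or-updated test has to be parenthesised before `$CONDITIONS` is appended; otherwise the split condition would apply to only one branch. A minimal sketch of a helper that always adds the parentheses follows; `build_changed_query` is a hypothetical name, and the `create_time` and `update_time` columns are taken from the command above.

```shell
#!/bin/bash
# Sketch only: build_changed_query is a hypothetical helper.
build_changed_query() {
    local table=$1 day=$2
    # Rows created or updated on the given day (MySQL substr positions start at 1).
    local changed="substr(create_time,1,10) = '${day}' or substr(update_time,1,10) = '${day}'"
    # The parentheses keep the OR together so $CONDITIONS applies to both branches.
    printf '%s\n' "select * from ${table} where 1=1 and (${changed}) and \$CONDITIONS"
}
```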

Get yesterday's date: yesterday=$(date -d '-1 day' +%Y-%m-%d)
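When a job has to be re-run for an earlier day, it helps to accept the date as an argument and fall back to yesterday only when none is given. The following is a sketch under that assumption; `resolve_day` is a hypothetical helper, and `date -d` requires GNU date.

```shell
#!/bin/bash
# Sketch only: resolve_day is a hypothetical helper; date -d requires GNU date.
resolve_day() {
    # Use the first argument if present, otherwise yesterday's date.
    local day=${1:-$(date -d '-1 day' +%Y-%m-%d)}
    # Accept only the YYYY-MM-DD shape used in the queries.
    case $day in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) printf '%s\n' "$day" ;;
        *) echo "invalid date: $day" >&2; return 1 ;;
    esac
}
```

A script could start with `yesterday=$(resolve_day "$1") || exit 1`, so a malformed date stops the run before any import is attempted.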

Running sh -x followed by the script path prints a detailed trace of the script's execution.
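To see what the trace looks like, the sketch below writes a throwaway two-line script and runs it with `sh -x`; the temporary file and its contents are made up for the demonstration.

```shell
#!/bin/bash
# Sketch only: writes a throwaway script and runs it with tracing enabled.
demo=$(mktemp)
cat > "$demo" <<'EOF'
day=2021-12-19
echo "importing ${day}"
EOF

# -x writes each command, after variable expansion, to stderr with a "+" prefix.
trace=$(sh -x "$demo" 2>&1 >/dev/null)
rm -f "$demo"
printf '%s\n' "$trace"
```

Because the trace shows commands after expansion, it reveals the actual date and table name that ended up in a Sqoop query.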

To import many tables:

#!/bin/bash
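# Set the date: use the first argument if given, otherwise yesterday
# (YYYY-MM-DD, to match substr(create_time,1,10) in the incremental query)
if [ $# -ne 0 ]
then
    pdt=$1
else
    pdt=$(date -d '-1 day' +%Y-%m-%d)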
fi

# With many tables, keep the table names for each collection method in a separate file

# Full-load table names: ods_full_tbnames.txt
while read -r full_tbname
do
    /usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
    --connect 'jdbc:mysql://xxx.xxx.xx.xxx:3306/your_database?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
    --username root \
    --password-file file:///export/data/sqoop.passwd \
    --query "select * from ${full_tbname} where 1=1 and \$CONDITIONS" \
    --hcatalog-database your_hive_database \
    --hcatalog-table "${full_tbname}" \
    -m 1
done < ods_full_tbnames.txt

# Incremental table names: ods_incr_new_tbnames.txt
# pdt must be in YYYY-MM-DD form to match substr(create_time,1,10)

while read -r tbname
do
    /usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
    --connect 'jdbc:mysql://xxx.xxx.xx.xxx:3306/your_database?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
    --username root \
    --password-file file:///export/data/sqoop.passwd \
    --query "select * from ${tbname} where 1=1 and substr(create_time,1,10)='${pdt}' and \$CONDITIONS" \
    --hcatalog-database your_hive_database \
    --hcatalog-table "${tbname}" \
    -m 1
done < ods_incr_new_tbnames.txt
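Table list files tend to pick up blank lines, comments, and Windows line endings, and with a plain loop a single failed import goes unnoticed. The sketch below shows one way to handle both; `list_tables` and `import_all` are hypothetical helpers, and the importer passed in is assumed to be a function or script that takes one table name.

```shell
#!/bin/bash
# Sketch only: list_tables and import_all are hypothetical helpers.

# Print table names from a list file, skipping blank lines and # comments.
list_tables() {
    local line
    while IFS= read -r line || [ -n "$line" ]; do
        # Drop carriage returns and surrounding whitespace.
        line=$(printf '%s' "$line" | tr -d '\r' | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
        case $line in
            ''|'#'*) continue ;;
        esac
        printf '%s\n' "$line"
    done < "$1"
}

# Run the given import command once per table and report the failures.
import_all() {
    local list=$1 importer=$2 table failed=''
    for table in $(list_tables "$list"); do
        "$importer" "$table" || failed="$failed $table"
    done
    if [ -n "$failed" ]; then
        echo "failed tables:$failed" >&2
        return 1
    fi
}
```

Iterating with `for` over the cleaned names, rather than reading the file inside the loop, also keeps the import command from consuming the rest of the list through standard input.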