Problem scenario:
I've recently been importing Hive data into ClickHouse. The full import worked fine, but the incremental import hit a problem, so I'm recording it here.
Hive source table:
DROP TABLE IF EXISTS dwd_test;
CREATE EXTERNAL TABLE dwd_test
(
id string COMMENT 'ID',
name string COMMENT '名字',
birthday string COMMENT '生日'
) COMMENT 'test'
PARTITIONED BY (`dt` string)
STORED AS PARQUET
LOCATION '/warehouse/cut/dwd/dwd_test/'
TBLPROPERTIES ("parquet.compression" = "lzo");
Temporary table:
DROP TABLE IF EXISTS dwd_test_tmp;
CREATE TABLE dwd_test_tmp
(
id string COMMENT 'ID',
name string COMMENT '名字',
birthday string COMMENT '生日'
) COMMENT 'test'
STORED AS PARQUET
LOCATION '/warehouse/cut/dwd/dwd_test_tmp/'
TBLPROPERTIES ("parquet.compression" = "lzo");
Insert test rows into the temporary table:
insert into table dwd_test_tmp
values ('1', 'abc', '2021-10-26 00:00:00'),
('2', 'def', '2021-10-27 00:00:00');
Insert the temporary table's data into dwd_test with dynamic partitioning:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_test partition (dt)
select id, name, birthday, date_format(birthday, 'yyyy-MM-dd') dt
from cut.dwd_test_tmp;
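The date_format(birthday, 'yyyy-MM-dd') expression above is what derives each row's dt partition value from its timestamp. As a quick sanity check of that mapping, the same transform in shell (a sketch; GNU date assumed):

```shell
#!/bin/bash
# Mirror Hive's date_format(birthday, 'yyyy-MM-dd'): drop the time part
# of the timestamp to get the dt partition value (GNU date assumed).
birthday='2021-10-26 00:00:00'
dt=$(date -d "$birthday" +%F)   # %F prints yyyy-mm-dd
echo "$dt"                      # 2021-10-26
```

So the two test rows land in partitions dt=2021-10-26 and dt=2021-10-27.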
ClickHouse table:
Create the table in the cut database:
drop table if exists dwd_test;
create table dwd_test
(
id String,
name String,
birthday DateTime
)
engine = MergeTree
partition by toYYYYMMDD(birthday)
ORDER BY (id);
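With this schema, each Hive dt partition lands in exactly one ClickHouse partition, because toYYYYMMDD truncates the DateTime to a numeric day key. A shell sketch of the same day key (GNU date assumed):

```shell
#!/bin/bash
# ClickHouse's toYYYYMMDD(DateTime) yields a number like 20211027;
# the equivalent day key in shell (GNU date assumed):
day_key=$(date -d '2021-10-27 00:00:00' +%Y%m%d)
echo "$day_key"   # 20211027
```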
Extraction script:
vim mytest1.sh
#!/bin/bash
# Environment variables
unset SPARK_HOME
export SPARK_HOME=$SPARK2_HOME
SEATUNNEL_HOME=/u/module/seatunnel-1.5.1
# Takes two arguments: which job to run, and the extraction date.
# If the first argument is "first" and no second argument is given, exit immediately.
if [[ "$1" = "first" ]]; then
  if [ -n "$2" ]; then
    do_date=$2
  else
    echo "Please pass a date parameter"
    exit 1
  fi
# If the first argument is "all" and no second argument is given, default to yesterday.
elif [[ "$1" = "all" ]]; then
  # If no date is passed, take yesterday's data by default; an explicit
  # date is mainly for manual backfills.
  if [ -n "$2" ]; then
    do_date=$2
  else
    do_date=$(date -d '-1 day' +%F)
  fi
else
  if [ -n "$2" ]; then
    do_date=$2
  else
    echo "Please pass a date parameter"
    exit 1
  fi
fi
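The branching above can be exercised on its own. A minimal sketch of the same date-defaulting logic, pulled into a function for testing (the function name is mine, not from the script; GNU date assumed):

```shell
#!/bin/bash
# Reproduce the script's date selection: use the given date when present;
# only the "all" mode defaults to yesterday, other modes require a date.
resolve_date() {
  local mode=$1 given=$2
  if [ -n "$given" ]; then
    echo "$given"
  elif [ "$mode" = "all" ]; then
    date -d '-1 day' +%F          # yesterday, yyyy-mm-dd
  else
    return 1                      # "first"/other modes: date is mandatory
  fi
}
resolve_date all 2021-10-27   # prints 2021-10-27
resolve_date all              # prints yesterday's date
```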
echo "date: $do_date"
import_conf(){
  # Write out the SeaTunnel job config, substituting the function arguments
  cat > $SEATUNNEL_HOME/jobs/hive2ck_test.conf <<!EOF
spark {
spark.sql.catalogImplementation = "hive"
spark.app.name = "hive2clickhouse"
spark.executor.instances = 4
spark.executor.cores = 4
spark.executor.memory = "4g"
}
input {
hive {
# pre_sql = "select id,name,'${do_date}' as birthday from default.student"
pre_sql = "$1"
table_name = "$2"
}
}
filter {}
output {
clickhouse {
host = "$3"
database = "$4"
table = "$5"
# fields = ["id","name","birthday"]
fields = $6
username = "default"
password = ""
}
}
!EOF
$SEATUNNEL_HOME/bin/start-waterdrop.sh --config $SEATUNNEL_HOME/jobs/hive2ck_test.conf -e client -m 'local[4]'
}
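import_conf works because the heredoc delimiter !EOF is unquoted: $1 through $6 are expanded with the function's arguments as the config file is written, so the generated file contains concrete values rather than placeholders. A minimal sketch of that pattern (the file path and keys here are made up for illustration):

```shell
#!/bin/bash
# Unquoted heredoc: $1/$2 expand to the function's arguments at write time,
# so the generated file holds the substituted values.
write_conf() {
  cat > /tmp/demo_hive2ck.conf <<EOF
pre_sql = "$1"
table   = "$2"
EOF
}
write_conf "select id from t where dt='2021-10-27'" "dwd_test"
cat /tmp/demo_hive2ck.conf
```

Quoting the delimiter (<<'EOF') would instead copy the text literally, leaving $1 and $2 unexpanded in the file.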
import_test(){
  import_conf "select id,name,birthday from cut.dwd_test where dt='${do_date}'" "dwd_test" "hadoop101:8123" "cut" "dwd_test" "[\"id\",\"name\",\"birthday\"]"
}
case $1 in
"all")
import_test
;;
esac
Run:
chmod u+x mytest1.sh
./mytest1.sh all 2021-10-27
Error message:
The output is long and I couldn't pinpoint the cause from it. The telling pattern: the job fails only when the dt filter is present; with any other filter condition in the pre_sql it runs fine, which suggests the problem is specific to querying the Hive partition column through SeaTunnel's hive input.