First, look at the Hive table DDL. The column types here directly determine the ClickHouse import configuration:
CREATE EXTERNAL TABLE ol_ck_int
(
user_id bigint comment 'member id'
,three_xx_name string comment 'level-3 xx name'
,xx_name string comment 'xx item'
,xx_value int comment 'xx item value'
)comment 'user tag table'
PARTITIONED BY(dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '...';
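Since the table is stored as TEXTFILE, each row is a plain line whose fields are joined by the \001 (SOH) control character from the DDL. A minimal sketch of what one row looks like and how it splits (the sample values are made up):

```python
# A Hive TEXTFILE row uses \x01 (the '\001' in the DDL) as the field separator.
SEP = "\x01"

# Hypothetical sample row matching the ol_ck_int schema:
# user_id, three_xx_name, xx_name, xx_value
row = SEP.join(["10001", "category-a", "tag-b", "3"])

fields = row.split(SEP)
print(fields)  # ['10001', 'category-a', 'tag-b', '3']
# Note: every field comes back as a string, which is why the numeric
# columns need type conversion before they reach ClickHouse.
```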
ClickHouse table DDL:
DROP TABLE IF EXISTS default.ol_ck_int;
CREATE TABLE default.ol_ck_int
(
user_id UInt64,
three_xx_name String,
xx_name String,
xx_value Int16
)
ENGINE = MergeTree
PARTITION BY three_xx_name
ORDER BY (three_xx_name,xx_name, xx_value)
SETTINGS index_granularity = 8192;
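A quick way to sanity-check these column types is to compare value ranges: Int16 only holds -32768..32767, so xx_value must stay within that, while user_id gets the full unsigned 64-bit range. A sketch using the standard ClickHouse integer ranges:

```python
# ClickHouse integer ranges relevant to this table.
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1   # Int16  -> xx_value
UINT64_MIN, UINT64_MAX = 0, 2**64 - 1      # UInt64 -> user_id

def fits_int16(v: int) -> bool:
    return INT16_MIN <= v <= INT16_MAX

def fits_uint64(v: int) -> bool:
    return UINT64_MIN <= v <= UINT64_MAX

print(fits_int16(3))        # True
print(fits_int16(40000))    # False: would overflow Int16
print(fits_uint64(10**18))  # True: user_id can be very large
```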
The Waterdrop .conf used to import into ClickHouse:
cat>label_int.conf<<EOF
spark {
spark.app.name = "WaterdropTest001"
#number of executors (increase for larger data volumes)
spark.executor.instances = 10
#cores per executor (parallelism; for large volumes raise it, but keep it under half the server's cores so ClickHouse is not starved)
spark.executor.cores = 1
#memory per executor (must not be below 512m)
spark.executor.memory = "3g"
}
input {
hdfs {
#HDFS path of the Hive table; the Hive table here is partitioned
path = "hdfs://hadoopcluster/ol/ol_ck_int/dt=${date}"
result_table_name = "ol_ck_int"
#the Hive table is stored as TEXTFILE; in testing, the orc format was not supported here
format = "text"
}
}
filter {
split {
#column names assigned after splitting on the delimiter
fields = ["user_id", "three_xx_name","xx_name","xx_value"]
#this must match the Hive field delimiter, otherwise rows cannot be split
delimiter = "\\\001"
}
convert {
#after splitting, every field is typed string; skipping the conversion causes an error
#convertible types: string, integer, long, float, double and boolean
source_field = "user_id"
new_type = "long"
}
convert {
source_field = "xx_value"
new_type = "long"
}
}
output {
clickhouse {
host = "100.18.31.117:8123"
database = "default"
table = "ol_ck_int"
fields = ["user_id","three_xx_name", "xx_name", "xx_value"]
username = "xx"
password = "xxx"
retry = 2
}
}
EOF
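With the conf saved, a Waterdrop 1.x job is typically launched with the bundled start script; the paths, master settings, and the date value below are assumptions for this cluster (the `-i` flag substitutes the `${date}` variable used in the hdfs path):

```shell
# Hypothetical launch command; adjust WATERDROP_HOME, deploy mode and the date.
cd $WATERDROP_HOME
./bin/start-waterdrop.sh \
  --master yarn \
  --deploy-mode client \
  --config ./label_int.conf \
  -i date=20200101
```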
Error on execution:
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
The likely cause is a wrong type in the convert configuration.
Change the configuration:
convert {
#after splitting, every field is typed string; skipping the conversion causes an error
#convertible types: string, integer, long, float, double and boolean
source_field = "user_id"
new_type = "integer"
}
convert {
source_field = "xx_value"
new_type = "integer"
}
Error on execution:
Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
Change the configuration again:
convert {
#after splitting, every field is typed string; skipping the conversion causes an error
#convertible types: string, integer, long, float, double and boolean
source_field = "user_id"
new_type = "long"
}
convert {
source_field = "xx_value"
new_type = "integer"
}
This run succeeds.
From this we can conclude:
1. After Waterdrop splits a Hive row, every field is typed string. Hive fields that are already string need no conversion, but any field that is not string in Hive (e.g. bigint, int) must get a convert block; leaving one out fails with java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer.
2. The new_type in each convert block must match the ClickHouse column type: UInt64 maps to long, and Int16 maps to integer.
This mapping was learned the hard way here; I could not find documentation covering it, so discussion and corrections are very welcome, thanks!
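The mapping can be written down as a small lookup; the sketch below encodes only the two pairs actually confirmed by the runs in this post (the helper function name is hypothetical):

```python
# Mapping observed in this post between ClickHouse column types and the
# Waterdrop convert new_type that the sink accepts without a cast error.
CK_TO_CONVERT = {
    "UInt64": "long",    # confirmed: "integer" raised Integer -> Long cast error
    "Int16": "integer",  # confirmed: "long" raised Long -> Integer cast error
}

def convert_type_for(ck_type: str) -> str:
    # Hypothetical helper: look up the new_type for a ClickHouse column type.
    return CK_TO_CONVERT[ck_type]

print(convert_type_for("UInt64"))  # long
print(convert_type_for("Int16"))   # integer
```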