When analyzing tag data on Druid, the tags are split by data type into two broad categories: enumerated and continuous numeric. Enumerated means all of a tag's values are strings; continuous numeric means all of a tag's values are numbers.
Ingesting 1 TB of data and building the Druid index takes roughly 40 minutes to an hour.
The Druid source table has the following structure:
CREATE TABLE `XXXX.XXXX_druid_enum/number_input`(
`tag_name` string COMMENT 'tag name',
`tag_value` string COMMENT 'tag value',
`id_str` string COMMENT 'user id',
`dt_time` string COMMENT 'date, yyyy-MM-dd 00:00:00')
PARTITIONED BY (
`dt` string COMMENT 'date, yyyyMMdd',
`id_type` string COMMENT 'index id type',
`portrait_proj` string COMMENT 'project name abbreviation')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
COLLECTION ITEMS TERMINATED BY '\u0002'
MAP KEYS TERMINATED BY '\u0003'
;
一、Enumerated tags
1. Non-long-tail enumerated tags
Suitable for tags with many values that all need to be displayed: every tag value is written into Druid and made queryable.
First, a Spark job filters the non-long-tail enumerated portion out of the full tag dataset and writes it to HDFS, deleting the _SUCCESS marker file (Druid ingestion fails if it is present).
// Spark job
import net.sf.json.{JSONArray, JSONObject}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ListBuffer

val sparkBuilder: SparkSession.Builder = SparkSession.builder()
  // Kryo classes must be registered before the session is created to take effect
  .config(new SparkConf().registerKryoClasses(Array(classOf[JSONArray], classOf[JSONObject])))
val sparkSession: SparkSession = sparkBuilder.getOrCreate()
val sc: SparkContext = sparkSession.sparkContext
// Suppress the _SUCCESS marker file, which breaks Druid ingestion
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.setLogLevel("warn")
//Filter the non-long-tail enumerated portion out of the raw tag data
val rowData = sc.textFile(inPutPath).flatMap(x => {
val buffer = new ListBuffer[(String, String, String)]
val arr = x.split(splitType, -1)
val DsId = arr(0)
val DsJson = JSONObject.fromObject(arr(1))
val tags = DsJson.keys()
while (tags.hasNext) {
val tagKey = tags.next()
//Keep only the qualifying tagKeys (those in the broadcast mysqlResultBc) and process each one
if (mysqlResultBc.value.contains(tagKey)) {
val tagValue = DsJson.get(tagKey)
val tagValueArr = JSONArray.fromObject(tagValue)
if (tagValueArr.size() > 0) {
for (i <- 0 until tagValueArr.size()) {
val tagValueSingle = tagValueArr.get(i)
buffer.+=((tagKey.toString, tagValueSingle.toString, DsId.toString))
}
}
}
}
buffer.toIterator
})
//Then write the filtered, flattened data to HDFS
val druidHdfsData = rowData.map { case (tagName, tagValue, idStr) =>
  // \u0001-delimited line matching the Hive table layout: tag_name, tag_value, id_str, dt_time
  tagName + "\u0001" + tagValue + "\u0001" + idStr + "\u0001" + sday + " 00:00:00"
}.filter(x => x.trim.nonEmpty)
druidHdfsData.saveAsTextFile(outPutPath)
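The flattening done by the Spark job above can be sketched in plain Python: each input record is a user id plus a JSON object of tag → value list, and every qualifying (tag, value) pair becomes one \u0001-delimited output line. The tag set and date below are hypothetical stand-ins for the broadcast variable `mysqlResultBc` and `sday` in the Scala code.

```python
import json

# Hypothetical stand-ins for the job's inputs: the broadcast set of
# qualifying tag names, and the partition date (sday in the Scala code).
QUALIFYING_TAGS = {"gender", "city"}
SDAY = "2023-01-01"

def flatten_record(line: str, sep: str = "\t") -> list:
    """Turn one 'id<sep>{tag: [values]}' record into \u0001-delimited rows."""
    ds_id, ds_json = line.split(sep, 1)
    rows = []
    for tag_name, values in json.loads(ds_json).items():
        if tag_name not in QUALIFYING_TAGS:
            continue  # same role as the mysqlResultBc.value.contains(tagKey) check
        for tag_value in values:
            # column order matches the Hive table: tag_name, tag_value, id_str, dt_time
            rows.append("\u0001".join(
                [tag_name, str(tag_value), ds_id, SDAY + " 00:00:00"]))
    return rows
```

One multi-valued tag therefore fans out into one row per value, which is what makes per-value UV counting possible downstream.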
2. Long-tail enumerated tags
Suitable for tags with many values where only the TopN values by descending UV are shown and the rest are lumped together as "other": only the TopN tag values are written to Druid for querying, and all remaining values are mapped to "other".
Long-tail enumerated tags get this special treatment because multidimensional analysis of user tags on Druid ran into cases with far too many enumerated or numeric values, which causes:
(1) histograms become hard to render on the web UI, with too many bars;
(2) too many dimension values put heavy compute and storage pressure on Druid;
(3) tag-value filters and pickers load a large amount of data and load slowly;
(4) the value distribution is messy, so frequently used values are not necessarily near the top and the web experience suffers. Ranking by UV descending and showing only the Top 20 columns fixes this; the remaining values can still be obtained by query.
Again, a Spark job first extracts this portion of the tag data;
then a Hive SQL job reduces the raw tag data to TopN plus "other".
SET mapred.job.name=dataToES_druid_enum_top_uv_rank_XXXX;
SET mapreduce.map.java.opts=-Xmx3072m -XX:+UseG1GC;
SET mapreduce.reduce.java.opts=-Xmx6144m -XX:+UseG1GC;
SET mapred.map.tasks=500;
SET mapred.reduce.tasks=500;
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=100000000;
SET hive.mapjoin.smalltable.filesize=200000000;
SET hive.exec.parallel=true;
INSERT overwrite TABLE XXXX.XXXX_druid_enum_top_uv_rank_step2 partition (dt='@sday_key', id_type='@id_type', portrait_proj='@project')
select m.tag_name
, nvl(n.tag_value, '其他') AS tag_value -- values outside the TopN fall back to '其他' ("other")
, m.id_str
, n.uv
, nvl(n.rank, @topn_plus_param) AS rank
, m.dt_time
from (
select tag_name, tag_value, id_str, dt_time from
XXXX.XXXX_druid_enum_top_step1
where dt='@sday_key' and id_type='@id_type' and portrait_proj='@project'
) m
left join (
select c.tag_name, c.tag_value, c.uv, c.rank from (
select b.tag_name, b.tag_value, b.uv
, row_number() over(partition by b.tag_name order by b.uv desc) AS rank
from (
select a.tag_name, a.tag_value, count(a.id_str) AS uv
from (
select tag_name, tag_value, id_str
from XXXX.XXXX_druid_enum_top_step1
where dt='@sday_key' and id_type='@id_type' and portrait_proj='@project'
group by tag_name, tag_value, id_str
) a
group by a.tag_name, a.tag_value
) b
) c
where c.rank<=@topn_param
) n
on m.tag_name = n.tag_name and m.tag_value = n.tag_value
;
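The core of this query, per tag: count distinct users (UV) per value, rank values by UV descending, keep the TopN, and rewrite everything else to '其他' ("other"). A minimal sketch of that bucketing logic, assuming rows of (tag_name, tag_value, id_str) tuples (not the production SQL):

```python
from collections import Counter

def bucket_long_tail(rows, topn):
    """Replace tag values outside the per-tag UV TopN with '其他' ("other")."""
    # UV per (tag_name, tag_value): distinct (tag, value, user) triples,
    # mirroring the inner GROUP BY tag_name, tag_value, id_str
    uv = Counter()
    for tag_name, tag_value, id_str in set(rows):
        uv[(tag_name, tag_value)] += 1
    # Per-tag TopN values by descending UV, mirroring
    # row_number() over(partition by tag_name order by uv desc)
    by_tag = {}
    for (tag_name, tag_value), n in uv.items():
        by_tag.setdefault(tag_name, []).append((n, tag_value))
    keep = set()
    for tag_name, vals in by_tag.items():
        for _, tag_value in sorted(vals, reverse=True)[:topn]:
            keep.add((tag_name, tag_value))
    # The left join + nvl: kept values pass through, the rest become '其他'
    return [(t, v if (t, v) in keep else "其他", i) for t, v, i in rows]
```

Note that, as in the SQL, the original rows are preserved one-for-one; only the tag_value column is rewritten, so per-user granularity survives for the "other" bucket too.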
Then union the non-long-tail enumerated data with the TopN long-tail enumerated data into one merged enumerated dataset, switch the unioned output to gzip compression, and write it all into Druid.
SET mapred.job.name=dataToES_druid_enum_union_2gz_@id_type_@project_@sday_key;
SET mapreduce.map.java.opts=-Xmx3072m -XX:+UseG1GC;
SET mapreduce.reduce.java.opts=-Xmx6144m -XX:+UseG1GC;
SET mapred.map.tasks=500;
SET mapred.reduce.tasks=500;
SET hive.exec.compress.intermediate=TRUE;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET hive.exec.parallel=true;
INSERT overwrite TABLE XXXX.XXXX_druid_enum_input partition (dt='@sday_key', id_type='@id_type', portrait_proj='@project')
select tag_name, tag_value, id_str, dt_time from (
select tag_name, tag_value, id_str, dt_time from XXXX.XXXX_druid_enum_small_large_step
where dt='@sday_key' and id_type='@id_type' and portrait_proj='@project'
UNION ALL
select tag_name, tag_value, id_str, dt_time from XXXX.XXXX_druid_enum_top_uv_rank_step2
where dt='@sday_key' and id_type='@id_type' and portrait_proj='@project'
) m
;
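The output of this step is gzip-compressed, \u0001-delimited text that Druid reads directly. A minimal sketch of that file format (a hypothetical helper for illustration; in production the compression happens inside Hive via the codec settings above):

```python
import gzip

def write_gz(path, rows):
    """Write \u0001-delimited rows as a gzip part file, the format the
    union step produces for Druid's hadoop ingestion to read."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for row in rows:
            f.write("\u0001".join(row) + "\n")
```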
3. The Druid ingestion JSON
{
"type" : "index_hadoop",
"spec" : {
"ioConfig" : {
"type" : "hadoop",
"inputSpec" : {
"type" : "granularity",
"dataGranularity" : "DAY",
"inputPath" : "/XXXX/hdfs_path/XXXXX",
"filePattern" : ".*",
"pathFormat" : "'dt='yyyyMMdd'/id_type=${id_type}/portrait_proj=${project}'"
}
},
"dataSchema" : {
"dataSource" : "da_wanxiang_tag_analysis_${id_type}_${project}_enum_metric",
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "DAY",
"queryGranularity" : "DAY",
"intervals" : ["${sday}T00:00:00.000+08:00/${eday}T00:00:00.000+08:00"]
},
"parser" : {
"type" : "hadoopyString",
"parseSpec" : {
"format" : "tsv",
"delimiter" : "\u0001",
"listDelimiter" : "\u0002",
"columns" : [
"tag_name"
, "tag_value"
, "id_str"
, "dt_time"
],
"dimensionsSpec" : {
"dimensions" : [
"tag_name",
"tag_value"
]
},
"timestampSpec" : {
"format" : "yyyy-MM-dd HH:mm:ss",
"column" : "dt_time"
}
}
},
"metricsSpec" : [
{
"name" : "count",
"type" : "count"
},
{
"name" : "uuid_unique",
"type" : "hyperUnique",
"fieldName" : "id_str"
}
]
},
"tuningConfig" : {
"type" : "hadoop",
"maxRowsInMemory" : 150000,
"ignoreInvalidRows":true,
"useCombiner" : true,
"buildV9Directly" : true,
"numBackgroundPersistThreads" : 1 ,
"partitionsSpec" : {
"type" : "hashed",
"targetPartitionSize" : 1750000,
"maxPartitionSize" : 2000000
},
"jobProperties" : {
"mapreduce.job.classloader": "true"
}
}
}
}
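The ${id_type}, ${project}, ${sday} and ${eday} placeholders in the spec above are substituted by the scheduler before submission. A minimal sketch of that substitution step on a pared-down spec (the helper and field subset here are illustrative, not the production tooling):

```python
import json
from string import Template

# A pared-down version of the ingestion spec; ${...} placeholders are
# filled in before the spec is POSTed to Druid.
SPEC_TEMPLATE = """{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "da_wanxiang_tag_analysis_${id_type}_${project}_enum_metric",
      "granularitySpec": {
        "intervals": ["${sday}T00:00:00.000+08:00/${eday}T00:00:00.000+08:00"]
      }
    }
  }
}"""

def render_spec(id_type, project, sday, eday):
    """Substitute the scheduler variables and parse the result as JSON."""
    filled = Template(SPEC_TEMPLATE).substitute(
        id_type=id_type, project=project, sday=sday, eday=eday)
    return json.loads(filled)
```

Parsing the filled template with json.loads also acts as a cheap sanity check that the substitution did not break the spec's syntax.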
二、Continuous numeric tags
First, a Spark job again filters out the continuous numeric tag data and flattens it;
then the continuous numeric tag data is ingested into Druid:
{
"type" : "index_hadoop",
"spec" : {
"ioConfig" : {
"type" : "hadoop",
"inputSpec" : {
"type" : "granularity",
"dataGranularity" : "DAY",
"inputPath" : "/XXXX/hdfs_path/XXXXX",
"filePattern" : ".*",
"pathFormat" : "'dt='yyyyMMdd'/id_type=${id_type}/portrait_proj=${project}'"
}
},
"dataSchema" : {
"dataSource" : "da_wanxiang_tag_analysis_${id_type}_${project}_number_metric",
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "DAY",
"queryGranularity" : "DAY",
"intervals" : ["${sday}T00:00:00.000+08:00/${eday}T00:00:00.000+08:00"]
},
"parser" : {
"type" : "hadoopyString",
"parseSpec" : {
"format" : "tsv",
"delimiter" : "\u0001",
"listDelimiter" : "\u0002",
"columns" : [
"tag_name"
, "tag_value"
, "id_str"
, "dt_time"
],
"dimensionsSpec" : {
"dimensions" : [
"tag_name"
]
},
"timestampSpec" : {
"format" : "yyyy-MM-dd HH:mm:ss",
"column" : "dt_time"
}
}
},
"metricsSpec" : [
{
"name" : "count",
"type" : "count"
},
{
"name" : "uuid_unique",
"type" : "hyperUnique",
"fieldName" : "id_str"
},
{
"name": "tag_value_sum",
"type": "longSum",
"fieldName": "tag_value"
},
{
"type" : "approxHistogramFold",
"name" : "hist_value",
"fieldName" : "tag_value",
"resolution" : 5000,
"numBuckets" : 5000
}
]
},
"tuningConfig" : {
"type" : "hadoop",
"maxRowsInMemory" : 150000,
"ignoreInvalidRows":true,
"useCombiner" : true,
"buildV9Directly" : true,
"numBackgroundPersistThreads" : 1 ,
"partitionsSpec" : {
"type" : "hashed",
"targetPartitionSize" : 1750000,
"maxPartitionSize" : 2000000
},
"jobProperties" : {
"mapreduce.job.classloader": "true"
}
}
}
}
三、Submitting the ingestion task to Druid
curl -X 'POST' -H 'Content-Type:application/json' -d @XXXX.json http://XXXX:XXXX/druid/indexer/v1/task
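The same submission can be sketched in Python with only the standard library (host, port and file name are placeholders, matching the curl command; Druid's /druid/indexer/v1/task endpoint responds with a JSON body containing the task id):

```python
import json
import urllib.request

DRUID_OVERLORD = "http://XXXX:XXXX"  # placeholder host:port, as in the curl command

def build_request(spec):
    """Build the POST equivalent to the curl command above."""
    return urllib.request.Request(
        DRUID_OVERLORD + "/druid/indexer/v1/task",
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def submit_task(spec):
    """POST the ingestion spec to the overlord and return the task id."""
    with urllib.request.urlopen(build_request(spec)) as resp:
        return json.loads(resp.read())["task"]

if __name__ == "__main__":
    with open("XXXX.json") as f:   # placeholder spec file, as in the curl command
        print(submit_task(json.load(f)))
```

The returned task id can then be used against the overlord's task-status endpoint to poll the indexing job until it succeeds or fails.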