Druid-based analysis of enum-type and continuous numeric tags: histograms / quantiles

When analyzing tag data on Druid, the tags are split by data type into two broad categories: enum-type and continuous numeric. Enum-type means the tag values are all strings; continuous numeric means the tag values are all numbers.

Ingesting 1 TB of data and building the Druid index takes roughly 40 minutes to 1 hour.

The Druid input tables (one for enum-type tags and one for numeric tags, hence the enum/number in the name) share the following structure:

CREATE TABLE `XXXX.XXXX_druid_enum/number_input`(
  `tag_name` string COMMENT 'tag name', 
  `tag_value` string COMMENT 'tag value', 
  `id_str` string COMMENT 'user id', 
  `dt_time` string COMMENT 'date, yyyy-MM-dd 00:00:00')
PARTITIONED BY ( 
  `dt` string COMMENT 'date, yyyyMMdd', 
  `id_type` string COMMENT 'indexed id type', 
  `portrait_proj` string COMMENT 'project name abbreviation')
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\u0001' 
  COLLECTION ITEMS TERMINATED BY '\u0002' 
  MAP KEYS TERMINATED BY '\u0003'
;

I. Enum-type tags

1. Enum-type tags without a long-tail distribution

This case covers tags that have many values, all of which need to be displayed, i.e. every tag value is written to Druid and is available for querying.

First, a Spark job filters the non-long-tail enum tag data out of the full tag data and writes it to HDFS, dropping the _SUCCESS marker file (Druid ingestion fails if that file is present in the input path).

// Spark setup
// JSONObject / JSONArray here are assumed to be json-lib (net.sf.json), which provides fromObject().
import scala.collection.mutable.ListBuffer
import net.sf.json.{JSONArray, JSONObject}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// inPutPath / outPutPath, splitType (the field delimiter), sday (the stat date) and the
// mysqlResultBc broadcast (the set of qualifying tag names) are defined elsewhere in the job.
val sparkBuilder: SparkSession.Builder = SparkSession.builder()
val sparkSession: SparkSession = sparkBuilder.getOrCreate()
val sc: SparkContext = sparkSession.sparkContext

// Suppress the _SUCCESS marker file, which would otherwise break the Druid ingestion
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.getConf.registerKryoClasses(Array(classOf[JSONArray], classOf[JSONObject]))
sc.setLogLevel("warn")

// Filter the non-long-tail enum-type tags out of the raw tag data
val rowData = sc.textFile(inPutPath).flatMap(x => {
  val buffer = new ListBuffer[(String, String, String)]

  val arr = x.split(splitType, -1)
  val DsId = arr(0)
  val DsJson = JSONObject.fromObject(arr(1))
  val tags = DsJson.keys()

  while (tags.hasNext) {
    val tagKey = tags.next()
    // Keep only the qualifying tags (tagKey) and process each one
    if (mysqlResultBc.value.contains(tagKey)) {
      val tagValue = DsJson.get(tagKey)
      val tagValueArr = JSONArray.fromObject(tagValue)
      if (tagValueArr.size() > 0) {
        // One output record per (tag, value, id) triple
        for (i <- 0 until tagValueArr.size()) {
          val tagValueSingle = tagValueArr.get(i)
          buffer += ((tagKey.toString, tagValueSingle.toString, DsId.toString))
        }
      }
    }
  }
  buffer.toIterator
})

// Write the filtered, flattened data to HDFS as \u0001-delimited text
val druidHdfsData = rowData.map(x => {
  val tag_name = x._1
  val tag_value = x._2
  val id_str = x._3
  tag_name + "\u0001" + tag_value + "\u0001" + id_str + "\u0001" + sday.toString + " 00:00:00"
}).filter(x => !"".equals(x.trim))

druidHdfsData.saveAsTextFile(outPutPath)
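For reference, the flatMap above expects each raw input line to contain a user id and a JSON object that maps tag name to an array of tag values, separated by splitType. A made-up illustration of that JSON payload (the tag names and values below are purely hypothetical) might be:

{
  "city" : ["beijing", "shanghai"],
  "age_range" : ["25-30"]
}

The job flattens this into one (tag_name, tag_value, id_str) triple per value, producing the \u0001-delimited lines that the Druid TSV parser further down consumes.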

2. Enum-type tags with a long-tail distribution

This case covers tags that have many values but where only the top N values by UV (descending) are displayed, with everything else lumped into "其他" (Other). In other words, only the top-N tag values are written to Druid individually; all remaining values are rolled up into the "Other" bucket.

The reason for handling long-tail enum tags specially: when doing multi-dimensional tag analysis on Druid, we ran into tags whose enum (or numeric) value sets are far too large, which causes the following problems:

(1) the web histogram becomes unreadable because there are too many bars;

(2) the huge number of dimension values puts heavy compute and storage pressure on Druid;

(3) loading the value list for filtering and selection is heavy and slow;

(4) the value ordering is messy, so frequently used values are not necessarily shown first and the web experience suffers. Ranking values by UV in descending order and displaying only the top 20 solves this; the remaining values can still be retrieved by query.

As before, a Spark job first extracts this portion of the tag data; a Hive SQL job then reshapes it into the top-N plus "其他" (Other) form.

SET mapred.job.name=dataToES_druid_enum_top_uv_rank_XXXX;
SET mapreduce.map.java.opts=-Xmx3072m -XX:+UseG1GC;
SET mapreduce.reduce.java.opts=-Xmx6144m -XX:+UseG1GC;
SET mapred.map.tasks=500;
SET mapred.reduce.tasks=500;
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=100000000;
SET hive.mapjoin.smalltable.filesize=200000000;
SET hive.exec.parallel=true;

INSERT overwrite TABLE XXXX.XXXX_druid_enum_top_uv_rank_step2 partition (dt='@sday_key', id_type='@id_type', portrait_proj='@project')
select m.tag_name
     , nvl(n.tag_value, '其他') AS tag_value   -- values outside the top N collapse to "其他" (Other)
     , m.id_str
     , n.uv
     , nvl(n.rank, @topn_plus_param) AS rank   -- fallback rank for the "Other" bucket
     , m.dt_time
from (
    select tag_name, tag_value, id_str, dt_time
    from XXXX.XXXX_druid_enum_top_step1
    where dt='@sday_key' and id_type='@id_type' and portrait_proj='@project'
) m
left join (
    select c.tag_name, c.tag_value, c.uv, c.rank
    from (
        select b.tag_name, b.tag_value, b.uv
             , row_number() over(partition by b.tag_name order by b.uv desc) AS rank
        from (
            -- uv = number of distinct users per (tag_name, tag_value)
            select a.tag_name, a.tag_value, count(a.id_str) AS uv
            from (
                select tag_name, tag_value, id_str
                from XXXX.XXXX_druid_enum_top_step1
                where dt='@sday_key' and id_type='@id_type' and portrait_proj='@project'
                group by tag_name, tag_value, id_str
            ) a
            group by a.tag_name, a.tag_value
        ) b
    ) c
    where c.rank<=@topn_param
) n
on m.tag_name = n.tag_name and m.tag_value = n.tag_value
;

Next, the non-long-tail enum tag data and the top-N long-tail enum tag data are UNIONed into one combined enum dataset, the union output is written with gzip compression, and the result is ingested into Druid in a single pass.

SET mapred.job.name=dataToES_druid_enum_union_2gz_@id_type_@project_@sday_key;
SET mapreduce.map.java.opts=-Xmx3072m -XX:+UseG1GC;
SET mapreduce.reduce.java.opts=-Xmx6144m -XX:+UseG1GC;
SET mapred.map.tasks=500;
SET mapred.reduce.tasks=500;
SET hive.exec.compress.intermediate=TRUE;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET hive.exec.parallel=true;

INSERT overwrite TABLE XXXX.XXXX_druid_enum_input partition (dt='@sday_key', id_type='@id_type', portrait_proj='@project')
select tag_name, tag_value, id_str, dt_time from (
select tag_name, tag_value, id_str, dt_time from XXXX.XXXX_druid_enum_small_large_step
where dt='@sday_key' and id_type='@id_type' and portrait_proj='@project' 
UNION ALL
select tag_name, tag_value, id_str, dt_time from XXXX.XXXX_druid_enum_top_uv_rank_step2
where dt='@sday_key' and id_type='@id_type' and portrait_proj='@project' 
) m
;

3. The Druid ingestion JSON spec

{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "granularity",
        "dataGranularity" : "DAY",
        "inputPath" : "/XXXX/hdfs路径/XXXXX",
        "filePattern" : ".*",
        "pathFormat" : "'dt='yyyyMMdd'/id_type=${id_type}/portrait_proj=${project}'"
      }
    },
    "dataSchema" : {
      "dataSource" : "da_wanxiang_tag_analysis_${id_type}_${project}_enum_metric",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "DAY",
        "intervals" : ["${sday}T00:00:00.000+08:00/${eday}T00:00:00.000+08:00"]
      },
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "tsv",
          "delimiter" : "\u0001",
          "listDelimiter" : "\u0002",
          "columns" : [
           "tag_name"
          , "tag_value"
          , "id_str"
          , "dt_time"
          ],
          "dimensionsSpec" : {
            "dimensions" : [
            "tag_name",
            "tag_value"
            ]
          },
          "timestampSpec" : {
            "format" : "yyyy-MM-dd HH:mm:ss",
            "column" : "dt_time"
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        },
        {
          "name" : "uuid_unique",
          "type" : "hyperUnique",
          "fieldName" : "id_str"
        }
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "maxRowsInMemory" : 150000,
      "ignoreInvalidRows":true,
      "useCombiner" : true,
      "buildV9Directly" : true,
      "numBackgroundPersistThreads" : 1 ,
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 1750000,
        "maxPartitionSize" : 2000000
      },
      "jobProperties" : {
        "mapreduce.job.classloader": "true"
      }
    }
  }
}
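Once the segments are built, the enum datasource can be queried with ordinary Druid native queries; this is where the web histogram for enum tags comes from. The following is only a sketch (the datasource name, interval, threshold and the filtered tag_name value are placeholders): a topN query that returns the 20 most common values of a single tag, ranked by the hyperUnique user count defined above, which would be POSTed to a Broker at /druid/v2.

{
  "queryType" : "topN",
  "dataSource" : "da_wanxiang_tag_analysis_xxx_xxx_enum_metric",
  "granularity" : "all",
  "dimension" : "tag_value",
  "metric" : "uuid_unique",
  "threshold" : 20,
  "filter" : { "type" : "selector", "dimension" : "tag_name", "value" : "some_enum_tag" },
  "aggregations" : [
    { "type" : "hyperUnique", "name" : "uuid_unique", "fieldName" : "uuid_unique" }
  ],
  "intervals" : ["2019-01-01T00:00:00.000+08:00/2019-01-02T00:00:00.000+08:00"]
}

For a long-tail tag, the "其他" (Other) bucket simply shows up as one more tag_value in the result.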

II. Continuous numeric tags

First, as before, a Spark job filters out the continuous numeric tag data and flattens it into the same \u0001-delimited layout as the enum case.

The numeric tag data is then ingested into Druid with the spec below. Besides count and hyperUnique, it adds a longSum of tag_value and an approxHistogramFold metric (from Druid's druid-histogram extension, which must be loaded on the cluster) so that histograms and quantiles can be computed at query time.

{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "granularity",
        "dataGranularity" : "DAY",
        "inputPath" : "/XXXX/hdfs路径/XXXXX",
        "filePattern" : ".*",
        "pathFormat" : "'dt='yyyyMMdd'/id_type=${id_type}/portrait_proj=${project}'"
      }
    },
    "dataSchema" : {
      "dataSource" : "da_wanxiang_tag_analysis_${id_type}_${project}_number_metric",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "DAY",
        "intervals" : ["${sday}T00:00:00.000+08:00/${eday}T00:00:00.000+08:00"]
      },
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "tsv",
          "delimiter" : "\u0001",
          "listDelimiter" : "\u0002",
          "columns" : [
           "tag_name"
          , "tag_value"
          , "id_str"
          , "dt_time"
          ],
          "dimensionsSpec" : {
            "dimensions" : [
            "tag_name"
            ]
          },
          "timestampSpec" : {
            "format" : "yyyy-MM-dd HH:mm:ss",
            "column" : "dt_time"
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        },
        {
          "name" : "uuid_unique",
          "type" : "hyperUnique",
          "fieldName" : "id_str"
        },
        {
          "name": "tag_value_sum",
          "type": "longSum",
          "fieldName": "tag_value"
        },
        {
          "type" : "approxHistogramFold",
          "name" : "hist_value",
          "fieldName" : "tag_value",
          "resolution" : 5000,
          "numBuckets" : 5000
        }
      ]
    },
    "tuningConfig" : {
      "type" : "hadoop",
      "maxRowsInMemory" : 150000,
      "ignoreInvalidRows":true,
      "useCombiner" : true,
      "buildV9Directly" : true,
      "numBackgroundPersistThreads" : 1 ,
      "partitionsSpec" : {
        "type" : "hashed",
        "targetPartitionSize" : 1750000,
        "maxPartitionSize" : 2000000
      },
      "jobProperties" : {
        "mapreduce.job.classloader": "true"
      }
    }
  }
}
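The approxHistogramFold metric is what enables the histograms and quantiles mentioned in the title: at query time it can be folded again and turned into quantiles or buckets with the post-aggregators that ship with the same druid-histogram extension. Below is only a sketch (the datasource name, interval, tag_name value, probability list and bucket count are placeholders):

{
  "queryType" : "timeseries",
  "dataSource" : "da_wanxiang_tag_analysis_xxx_xxx_number_metric",
  "granularity" : "all",
  "filter" : { "type" : "selector", "dimension" : "tag_name", "value" : "some_numeric_tag" },
  "aggregations" : [
    { "type" : "hyperUnique", "name" : "uuid_unique", "fieldName" : "uuid_unique" },
    { "type" : "approxHistogramFold", "name" : "hist_value", "fieldName" : "hist_value" }
  ],
  "postAggregations" : [
    { "type" : "quantiles", "name" : "tag_value_quantiles", "fieldName" : "hist_value", "probabilities" : [0.25, 0.5, 0.75, 0.95] },
    { "type" : "equalBuckets", "name" : "tag_value_histogram", "fieldName" : "hist_value", "numBuckets" : 20 }
  ],
  "intervals" : ["2019-01-01T00:00:00.000+08:00/2019-01-02T00:00:00.000+08:00"]
}

The quantiles give approximate percentile values of the numeric tag, and the bucket breakdown can be rendered directly as the web histogram.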

III. Submitting the ingestion task

The ingestion spec JSON is submitted to the Druid Overlord with the following command:

curl -X 'POST' -H 'Content-Type:application/json' -d @XXXX.json http://XXXX:XXXX/druid/indexer/v1/task
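If the spec is accepted, the Overlord replies with the task id as a small JSON object (the id below is just a placeholder), and the task can then be tracked via GET /druid/indexer/v1/task/<taskId>/status or in the Overlord console:

{"task" : "index_hadoop_XXXX_XXXX"}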
