Our company's technical architecture changed recently, and the OLAP layer now uses Druid. It had been working fine, but a new requirement called for filtering on a multi-value field. In Hive we can handle this with the array_contains(Array<T>, value) function, but after pushing the data into Druid we weren't sure whether multi-value filtering was possible. It turns out the Druid documentation covers exactly this case:
Multi-value dimensions
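For comparison, this is the Hive-side filter we were replacing (a minimal sketch against the levin.array_test table created below):
SELECT sno, faults
FROM levin.array_test
WHERE array_contains(faults, 4170L);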
Here are the steps I took.
Create the Hive table
DROP TABLE IF EXISTS levin.array_test;
CREATE EXTERNAL TABLE levin.array_test(
`timestamp` bigint,
sno string,
faults array<bigint>, -- multi-value field, stored as a Hive array
order_num bigint
)
STORED AS PARQUET
LOCATION 'hdfs://hadoop/user/s_info_quality/warehouse/levin/array_test';
Sample data
timestamp | sno | faults | order_num |
---|---|---|---|
1451577600000 | 1712093900515 | [4170,4221] | 1 |
1451577600000 | 1710113901653 | [4083] | 1 |
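Rows like these can be produced as follows (a hedged sketch: Hive's VALUES clause cannot build complex types, so array() is constructed in a FROM-less SELECT, which needs Hive 0.13+):
INSERT INTO TABLE levin.array_test
SELECT 1451577600000L, '1712093900515', array(4170L, 4221L), 1L;

INSERT INTO TABLE levin.array_test
SELECT 1451577600000L, '1710113901653', array(4083L), 1L;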
array_test.json
{
"type":"index_hadoop",
"spec":{
"ioConfig":{
"inputSpec":{
"paths":"/user/s_info_quality/warehouse/levin/array_test",
"type":"static",
"inputFormat":"io.druid.data.input.parquet.DruidParquetInputFormat"
},
"type":"hadoop"
},
"dataSchema":{
"dataSource":"array_test",
"granularitySpec":{
"type":"uniform",
"segmentGranularity": {
"type": "period",
"period": "P1Y",
"timeZone": "Asia/Shanghai"
},
"queryGranularity":"hour",
"interval":[
"2016-01-01T00:00:00+0800/2019-03-11T00:00:00+0800"
]
},
"parser":{
"type":"parquet",
"parseSpec":{
"format":"timeAndDims",
"dimensionsSpec":{
"dimensions":[
"sno",
"faults"
]
},
"timestampSpec":{
"column":"timestamp",
"format":"auto"
}
}
},
"metricsSpec":[
{
"fieldName":"order_num",
"name":"order_num",
"type":"longSum"
}
],
"transformSpec": {
"filter": {
"type": "interval",
"dimension": "__time",
"intervals": [
"2000-01-01T00:00:00+08:00/2025-01-01T00:00:00+08:00"
]
}
}
},
"tuningConfig":{
"type":"hadoop",
"partitionsSpec":{
"type":"hashed",
"targetPartitionSize":6250000
},
"jobProperties":{
"mapreduce.job.queuename":"root.production.test",
"mapreduce.job.user.classpath.first":"true",
"mapreduce.reduce.memory.mb": "6144",
"mapreduce.reduce.java.opts":"-server -Xmx2560m -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
"mapreduce.map.memory.mb":"4096",
"mapreduce.map.java.opts":"-server -Xmx2458m -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
},
"useCombiner": true,
"ignoreInvalidRows": true,
"overwriteFiles": true
}
}
}
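Submit the spec to the Overlord's task endpoint (host and port are placeholders; 8090 is the default Overlord port):
curl -X POST -H 'Content-Type: application/json' \
  -d @array_test.json \
  http://overlord-host:8090/druid/indexer/v1/task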
Querying
The query can be sent from Postman with the header Content-Type: application/json. Note that a selector filter on a multi-value dimension matches a row if any of that row's values equals the given value, which makes it the Druid counterpart of Hive's array_contains.
{
"queryType": "search",
"dataSource": "array_test",
"granularity": "day",
"searchDimensions": [
"sno",
"d_service_type",
"sn",
"faults"
],
"filter": {
"type": "or",
"fields": [
{
"type": "selector",
"dimension": "sno",
"value": "1712093900515 "
},
{
"type": "selector",
"dimension": "faults",
"value": "4170"
}
]
},
"limit":10,
"sort" : {
"type": "lexicographic"
},
"intervals": [
"2010-01-01T00:00:00.000/2055-01-03T00:00:00.000"
]
}
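The same query can also be POSTed to a Broker with curl (host and port are placeholders; 8082 is the default Broker port), assuming the body above is saved as search_query.json:
curl -X POST -H 'Content-Type: application/json' \
  -d @search_query.json \
  'http://broker-host:8082/druid/v2/?pretty'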
With the setup above, each faults value in the query results comes back wrapped as {"element": 4221}. Consulting the documentation again showed the fix: add
parquet.avro.add-list-element-records: false
to tuningConfig.jobProperties and the arrays come through as plain values. See:
Data changes during ingestion of Array using druid-parquet-extensions
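Concretely, the property goes alongside the other Hadoop settings in the spec (an abridged fragment of the jobProperties block above; only the first entry is new):
"jobProperties": {
"parquet.avro.add-list-element-records": "false",
"mapreduce.job.queuename": "root.production.test",
"mapreduce.job.user.classpath.first": "true"
}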
References
http://druid.io/docs/latest/development/extensions-contrib/parquet.html