Ingesting Hadoop Parquet multi-value fields into Druid (the {"element"} problem)

Our tech architecture changed recently and the OLAP layer now runs on Druid. It had worked fine until a new requirement came up: filtering on a multi-value field. In Hive we can handle this with array_contains(Array<T>, value), but after pushing the data into Druid it was not obvious whether a multi-value filter was possible. It turns out the Druid documentation covers exactly this:
Multi-value dimensions
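
As a quick preview of the full query shown later: filtering a multi-value dimension in Druid is roughly the equivalent of Hive's array_contains. A plain selector filter matches a row if any of the row's values for that dimension equals the target. A minimal sketch against the faults dimension created below:

{
  "type": "selector",
  "dimension": "faults",
  "value": "4170"
}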
Here are the steps I took.

Create the Hive table

DROP TABLE IF EXISTS levin.array_test;
CREATE EXTERNAL TABLE levin.array_test(
  timestamp bigint,
  sno string,
  faults array<bigint>,    -- multi-value field, stored as an array
  order_num bigint
)
STORED AS PARQUET
LOCATION 'hdfs://hadoop/user/s_info_quality/warehouse/levin/array_test';

Sample data

timestamp        sno              faults         order_num
1451577600000    1712093900515    [4170,4221]    1
1451577600000    1710113901653    4083           1

array_test.json

{
  "type":"index_hadoop",
  "spec":{
    "ioConfig":{
      "inputSpec":{
        "paths":"/user/s_info_quality/warehouse/levin/array_test",
        "type":"static",
        "inputFormat":"io.druid.data.input.parquet.DruidParquetInputFormat"
      },
      "type":"hadoop"
    },
    "dataSchema":{
      "dataSource":"array_test",
      "granularitySpec":{
        "type":"uniform",
        "segmentGranularity": {
          "type": "period",
          "period": "P1Y",
          "timeZone": "Asia/Shanghai"
        },
        "queryGranularity":"hour",
        "interval":[
          "2016-01-01T00:00:00+0800/2019-03-11T00:00:00+0800"
        ]
      },
      "parser":{
        "type":"parquet",
        "parseSpec":{
          "format":"timeAndDims",
          "dimensionsSpec":{
            "dimensions":[
              "sno",
              "faults"
            ]
          },
          "timestampSpec":{
            "column":"timestamp",
            "format":"auto"
          }
        }
      },
      "metricsSpec":[
        {
          "fieldName":"order_num",
          "name":"order_num",
          "type":"longSum"
        }
      ],
      "transformSpec": {
        "filter": {
          "type": "interval",
          "dimension": "__time",
          "intervals": [
            "2000-01-01T00:00:00+08:00/2025-01-01T00:00:00+08:00"
          ]
        }
      }
    },
    "tuningConfig":{
      "type":"hadoop",
      "partitionsSpec":{
        "type":"hashed",
        "targetPartitionSize":6250000
      },
      "jobProperties":{
        "mapreduce.job.queuename":"root.production.test",
        "mapreduce.job.user.classpath.first":"true",
        "mapreduce.reduce.memory.mb": "6144",
        "mapreduce.reduce.java.opts":"-server -Xmx2560m -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
        "mapreduce.map.memory.mb":"4096",
        "mapreduce.map.java.opts":"-server -Xmx2458m -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
      },
      "useCombiner": true,
      "ignoreInvalidRows": true,
      "overwriteFiles": true
    }
  }
}

Querying

POST the query with Postman, using the header Content-Type: application/json.

{
  "queryType": "search",
  "dataSource": "array_test",
  "granularity": "day",
  "searchDimensions": [
    "sno",
    "d_service_type",
    "sn",
    "faults"
  ],
  "filter": {
  "type": "or",
  "fields": [
    {
      "type": "selector",
      "dimension": "sno",
      "value": "1712093900515  "
    },
    {
      "type": "selector",
      "dimension": "faults",
      "value": "4170"
    }
  ]
},
  
  "limit":10,
  "sort" : {
    "type": "lexicographic"
  },
  "intervals": [
    "2010-01-01T00:00:00.000/2055-01-03T00:00:00.000"
  ]
}
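
A search query returns, per granularity bucket, the dimension values that matched the filter. Purely for illustration (this is the standard search response shape, not actual output; the values are taken from the sample rows above), a result entry looks roughly like:

[
  {
    "timestamp": "2016-01-01T00:00:00.000+08:00",
    "result": [
      { "dimension": "sno", "value": "1712093900515", "count": 1 },
      { "dimension": "faults", "value": "4170", "count": 1 }
    ]
  }
]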

With the setup above, however, the faults values in the results come back in the form {"element": 4221}. After some more digging through the docs, the fix is to add parquet.avro.add-list-element-records: false to tuningConfig.jobProperties (see the sketch below). Related issue: Data changes during ingestion of Array using druid-parquet-extensions.
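
For reference, a sketch of the jobProperties block with the extra property added; the MapReduce settings are unchanged from the spec above:

"jobProperties": {
  "parquet.avro.add-list-element-records": "false",
  "mapreduce.job.queuename": "root.production.test",
  "mapreduce.job.user.classpath.first": "true",
  "mapreduce.reduce.memory.mb": "6144",
  "mapreduce.reduce.java.opts": "-server -Xmx2560m -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
  "mapreduce.map.memory.mb": "4096",
  "mapreduce.map.java.opts": "-server -Xmx2458m -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
}

After reingesting with this property set, faults behaves as a normal multi-value dimension and the selector filter above returns plain values instead of {"element": ...} records.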

References
http://druid.io/docs/latest/development/extensions-contrib/parquet.html
