Notes on Optimizing Apache Druid Query Speed

徴心

This is an original article by the author; do not repost without the author's permission.

Article link: https://blog.csdn.net/q2365921/article/details/96435846

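A Druid query in our product was recently taking about 5 s to respond and needed optimizing. This post records the reasoning behind the optimization and the concrete fix.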

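Background

The table format is parquet, with 14 million+ rows (our data is extracted offline, so it is already aggregated). The timestamp is a fixed value, identical for all rows of a country.
segmentGranularity.period: P1D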

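Analysis

With our row count alone, a query should not be this slow; other teams query more than a billion rows in Druid and still get fast responses.
So I wondered whether our timestamp design was the cause, because we currently hard-code the timestamp for each country's data,
like this:

country	timestamp
cn	20160101
in	20160102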

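The jobs for the two countries finish at different times, and we cannot wait for all of them to succeed before merging and pushing the data together, so using the timestamps inside the data would cause a problem.
Suppose the cn job finishes and its data contains rows for 20160101; a little later the in job finishes, and its data also contains rows for 20160101. The second push then overwrites the first, because with a period of P1D all data for one day goes into the same block (what the Druid documentation calls a time chunk).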
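The overwrite scenario above can be sketched as a toy model in Python. This is not Druid code: `day_chunk` and `batch_overwrite` are hypothetical helpers, and the model only assumes what the paragraph describes, namely that a batch push replaces whatever was previously stored for the same P1D block.

```python
from datetime import datetime, timezone


def day_chunk(ts: datetime) -> str:
    """Return the P1D block (as a UTC date string) that a timestamp falls into."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def batch_overwrite(store: dict, rows: list) -> None:
    """Toy batch push: every block touched by the new batch is replaced
    wholesale; nothing is merged with what was stored before."""
    new_chunks = {}
    for country, ts in rows:
        new_chunks.setdefault(day_chunk(ts), []).append(country)
    store.update(new_chunks)


store = {}
jan1 = datetime(2016, 1, 1, 8, tzinfo=timezone.utc)

batch_overwrite(store, [("cn", jan1)])  # the cn job finishes first
batch_overwrite(store, [("in", jan1)])  # the in job finishes later, same day

print(store)  # {'2016-01-01': ['in']}
```

In this model the cn rows for 2016-01-01 are gone after the second push, which is the situation the hard-coded per-country timestamps were meant to avoid.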

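Hypothesis

Based on the analysis above, I suspected we were slow because all the data for one country ends up in a single block, and a single block is probably stored on one machine, which would create a query hotspot.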

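Verification

To verify the hypothesis, I replaced the hard-coded timestamp with the timestamp inside the data and changed segmentGranularity.period to P3M. In a trial run, the same SQL query dropped to about 2 s, which suggests the hypothesis was correct. After that, it was a matter of tuning the parameters repeatedly.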
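As a rough illustration of the granularity change, the snippet below builds the relevant granularitySpec fragment. It assumes a `uniform` granularitySpec with a period-type segmentGranularity; the rest of the ingestion spec (including the timestampSpec that points at the real timestamp column) is omitted because the actual spec is not shown in this post. `granularity_spec` is a hypothetical helper.

```python
import json


def granularity_spec(period: str) -> dict:
    """Build a granularitySpec fragment with a period-based segmentGranularity.

    Assumption: the ingestion task uses a "uniform" granularitySpec.
    """
    return {
        "type": "uniform",
        "segmentGranularity": {"type": "period", "period": period},
    }


before = granularity_spec("P1D")  # one block per day
after = granularity_spec("P3M")   # one block per three months

print(json.dumps(after, indent=2))
```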

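Optimization reference

A table in Druid is partitioned by time; at the next level it is divided into blocks, and within a block the data is split into one or more segments. Each segment is a single file, typically holding a few million rows. A segment also has a version: when a batch overwrite hits an interval that already exists, the cluster replaces the old version with the new one (note: the new data is not visible immediately; it becomes visible only once all of it has finished loading).
Tables in Druid are stored in a distributed fashion and the data is partitioned; the partitioning is configured via segmentGranularity.

The above is my own summary of how Druid works. Below are the recommendations stated explicitly in the official design documentation, quoted from the source: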

Apache Druid (incubating) stores its index in segment files, which are partitioned by time. In a basic setup, one segment file is created for each time interval, where the time interval is configurable in the segmentGranularity parameter of the granularitySpec, which is documented here. For Druid to operate well under heavy query load, it is important for the segment file size to be within the recommended range of 300mb-700mb. If your segment files are larger than this range, then consider either changing the granularity of the time interval or partitioning your data and tweaking the targetPartitionSize in your partitionsSpec (a good starting point for this parameter is 5 million rows). See the sharding section below and the ‘Partitioning specification’ section of the Batch ingestion documentation for more information.

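Segment size should ideally be in the 300mb-700mb range
partitionsSpec.targetPartitionSize should ideally be around 5 million rows
If the segment size is outside the recommended range, adjust granularitySpec.segmentGranularity to change it
Query the segment metadata to find out the figures above
POST the following JSON as a query: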

{
  "queryType":"segmentMetadata",
  "dataSource":"info_sample_datasource",
  "intervals":["2013-01-01/2014-01-01"]
}
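A minimal sketch of sending this query from Python with only the standard library. It assumes the query goes to the Broker's native query endpoint, `/druid/v2/`; the host and port are assumptions (8082 is the Broker's default port), and `build_request` and `run_query` are hypothetical helper names.

```python
import json
import urllib.request

QUERY = {
    "queryType": "segmentMetadata",
    "dataSource": "info_sample_datasource",
    "intervals": ["2013-01-01/2014-01-01"],
}


def build_request(broker_url: str, query: dict) -> urllib.request.Request:
    """Build a POST request for the Broker's native query endpoint."""
    return urllib.request.Request(
        url=broker_url.rstrip("/") + "/druid/v2/?pretty",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(broker_url: str, query: dict) -> list:
    """Send the query and decode the JSON array of per-segment results."""
    request = build_request(broker_url, query)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a Broker reachable at, say, `http://localhost:8082`, calling `run_query("http://localhost:8082", QUERY)` should return a list with one entry per segment in the requested interval.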

Below is the metadata returned. For confidentiality reasons, the values in the result have been altered.

[ {
 //Judging from the returned structure, the id appears to be composed of datasource-segmentBeginDate-segmentEndDate-pushDataDate; one id can be understood as one segment
  "id" : "info_sample_datasource_2016-03-24T16:00:00.000Z_2016-03-25T16:00:00.000Z_2019-07-18T10:06:23.248Z",
  "intervals" : [ "2013-05-13T00:00:00.000Z/2013-05-14T00:00:00.000Z" ],
  "columns" : {
    "__time" : { "type" : "LONG", "hasMultipleValues" : false, "size" : 407240380, "cardinality" : null, "errorMessage" : null },
    "dim1" : { "type" : "STRING", "hasMultipleValues" : false, "size" : 100000, "cardinality" : 1944, "errorMessage" : null },
    "dim2" : { "type" : "STRING", "hasMultipleValues" : true, "size" : 100000, "cardinality" : 1504, "errorMessage" : null },
    "metric1" : { "type" : "FLOAT", "hasMultipleValues" : false, "size" : 100000, "cardinality" : null, "errorMessage" : null }
  },
  "aggregators" : {
    "metric1" : { "type" : "longSum", "name" : "metric1", "fieldName" : "metric1" }
  },
  "queryGranularity" : {
    "type": "none"
  },
  //segment
  "size" : 300000,
  "numRows" : 5000000
} ]
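To compare a response like this against the recommendations, something along these lines may help. It is only a sketch: `summarize_segments` and the sample entries are hypothetical, and as far as I know the `size` field in a segmentMetadata result is an estimate of the data's size in a flat format rather than the size of the segment file on disk, so the row count is the more reliable signal here.

```python
TARGET_ROWS = 5_000_000  # starting point suggested in the quoted Druid docs


def summarize_segments(segments: list) -> list:
    """Reduce a segmentMetadata response to the fields used for sizing."""
    report = []
    for seg in segments:
        report.append({
            "id": seg["id"],
            "numRows": seg["numRows"],
            "estimatedBytes": seg["size"],
            # 1.0 means the segment holds exactly the suggested row count
            "rowsVsTarget": round(seg["numRows"] / TARGET_ROWS, 2),
        })
    # Largest segments first, since those are the candidates for splitting
    return sorted(report, key=lambda r: r["numRows"], reverse=True)


# Hypothetical sample entries, not real cluster output
sample = [
    {"id": "example_segment_a", "size": 300000, "numRows": 5000000},
    {"id": "example_segment_b", "size": 900000, "numRows": 14000000},
]

for row in summarize_segments(sample):
    print(row)
```

A segment with a `rowsVsTarget` well above 1 is a candidate for a finer segmentGranularity or a smaller targetPartitionSize; one far below 1 suggests the opposite.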

Documentation

https://druid.apache.org/docs/latest/design/segments.html Segment design
https://druid.apache.org/docs/latest/ingestion/index.html Data ingestion
https://druid.apache.org/docs/latest/querying/segmentmetadataquery.html Segment metadata queries
