Comparing Druid's compact task and index task

Mr_小白不白

Category: druid. Tags: druid, ingest, task

Article link: https://blog.csdn.net/xiaobai51509660/article/details/88656466

Druid provides several kinds of ingest tasks, including the compact task and the index task. This post compares the two in terms of use cases, strengths, and weaknesses.

(1) compact task

It merges all segments within a specified interval. The task spec looks like this:

{
    "type": "compact",
    "id": <task_id>,
    "dataSource": <task_datasource>,
    "interval": <interval to specify segments to be merged>,
    "dimensions": <custom dimensionsSpec>,
    "tuningConfig": <index task tuningConfig>,
    "context": <task context>
}

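Its main purpose is to merge small segments: all segments in the specified interval are merged, and the number of resulting segments can be controlled through targetPartitionSize in tuningConfig. We mainly use it to periodically merge historical segments one day at a time, which reduces the number of segments and the storage they occupy and improves query performance.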
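The daily merge described here can be sketched in Python as a small helper that builds the task payload. This is a minimal sketch, not the author's actual job: the helper name `build_compact_task` and the sample values are hypothetical, and `id` and `context` are left out on the assumption that Druid fills in defaults for them.

```python
import json
from datetime import date, timedelta


def build_compact_task(datasource, day, target_partition_size):
    """Build a compact task payload covering one calendar day (hypothetical helper)."""
    # Druid intervals are written as "start/end"; here the end is the next day.
    interval = f"{day.isoformat()}/{(day + timedelta(days=1)).isoformat()}"
    return {
        "type": "compact",
        "dataSource": datasource,
        "interval": interval,
        # Per the template above, tuningConfig uses the index task format.
        "tuningConfig": {
            "type": "index",
            "targetPartitionSize": target_partition_size,
        },
    }


# Sample values borrowed from the wikipedia example used later in this post.
task = build_compact_task("wikipedia", date(2013, 8, 31), 5000000)
print(json.dumps(task, indent=2))
```

The printed JSON would then be submitted to the cluster like any other task, for example by POSTing it to the Overlord's task endpoint; the host and scheduling are deployment-specific and not covered here.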

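When a compact task runs, it is converted internally into an index task. In my tests the compact task's dimensions setting had no effect: the task inherits the datasource's dimensionsSpec and metricsSpec, so dimensions and metrics cannot be configured flexibly. In addition, rollup is applied only if every segment in the interval is already rolled up, and the rollup granularity cannot be changed. Segment metadata can be retrieved with a segmentMetadata query.

To work around these limitations of the compact task, you can use the index task instead.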
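As a rough illustration of the rollup rule and the segmentMetadata lookup, the Python sketch below builds such a query and checks whether every segment in the result reports rollup. The helper names and the sample results are hypothetical; the sketch assumes the query is run unmerged (one result per segment) with the `rollup` analysis type requested.

```python
def build_segment_metadata_query(datasource, interval):
    """Build a segmentMetadata query for one interval (hypothetical helper)."""
    return {
        "queryType": "segmentMetadata",
        "dataSource": datasource,
        "intervals": [interval],
        # Ask only for the fields relevant to the rollup question.
        "analysisTypes": ["aggregators", "queryGranularity", "rollup"],
        # Keep one result per segment instead of a single merged result.
        "merge": False,
    }


def all_segments_rolled_up(results):
    """Return True only if there is at least one segment and each reports rollup == true."""
    return bool(results) and all(r.get("rollup") is True for r in results)


query = build_segment_metadata_query("wikipedia", "2013-08-31/2013-09-01")

# Hypothetical per-segment results, trimmed to the field checked here.
sample_results = [{"id": "segment_a", "rollup": True},
                  {"id": "segment_b", "rollup": False}]
print(all_segments_rolled_up(sample_results))
```

With the sample above the check is false because one segment was not rolled up, which is the situation where, per the behaviour described in this post, the compacted output would not be rolled up either.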

(2) index task

The index task is a simplified version of the index hadoop task. It is also mainly used for processing historical data and is suited to smaller data sets. An example:

{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "auto"
          },
          "dimensionsSpec" : {
            "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [
        {
          "type" : "count",
          "name" : "count"
        },
        {
          "type" : "doubleSum",
          "name" : "added",
          "fieldName" : "added"
        },
        {
          "type" : "doubleSum",
          "name" : "deleted",
          "fieldName" : "deleted"
        },
        {
          "type" : "doubleSum",
          "name" : "delta",
          "fieldName" : "delta"
        }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
        "intervals" : [ "2013-08-31/2013-09-01" ]
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "examples/indexing/",
        "filter" : "wikipedia_data.json"
      }
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 5000000,
      "maxRowsInMemory" : 75000
    }
  }
}

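Its advantages are:

1: dimensionsSpec can be specified freely, so you can choose the dimensions and drop unneeded ones.

2: metricsSpec can be specified freely, so you can choose how metrics are aggregated.

3: Data can be re-aggregated (rolled up again) with a different queryGranularity.

4: The segmentGranularity period can be set.

5: Segment size can also be controlled through targetPartitionSize, which merges small segments.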
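To make these five knobs concrete, here is a minimal Python sketch that assembles an index task with each of them exposed as a parameter. The helper name `build_index_task` and the particular values are hypothetical; the layout of the payload mirrors the wikipedia example above.

```python
import json


def build_index_task(datasource, interval, dimensions, metrics_spec,
                     query_granularity, segment_granularity,
                     target_partition_size, firehose):
    """Assemble an index task payload (hypothetical helper, same layout as the example)."""
    return {
        "type": "index",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "parser": {
                    "type": "string",
                    "parseSpec": {
                        "format": "json",
                        "timestampSpec": {"column": "timestamp", "format": "auto"},
                        # (1) only the listed dimensions are kept
                        "dimensionsSpec": {"dimensions": dimensions},
                    },
                },
                # (2) metrics are chosen by the caller
                "metricsSpec": metrics_spec,
                "granularitySpec": {
                    "type": "uniform",
                    # (4) segment period
                    "segmentGranularity": segment_granularity,
                    # (3) granularity used for re-aggregation
                    "queryGranularity": query_granularity,
                    "intervals": [interval],
                },
            },
            "ioConfig": {"type": "index", "firehose": firehose},
            # (5) segment sizing
            "tuningConfig": {"type": "index",
                             "targetPartitionSize": target_partition_size},
        },
    }


index_task = build_index_task(
    datasource="wikipedia",
    interval="2013-08-31/2013-09-01",
    dimensions=["page", "language", "country"],  # drops the other dimensions
    metrics_spec=[{"type": "count", "name": "count"},
                  {"type": "doubleSum", "name": "added", "fieldName": "added"}],
    query_granularity="HOUR",
    segment_granularity="DAY",
    target_partition_size=5000000,
    firehose={"type": "local", "baseDir": "examples/indexing/",
              "filter": "wikipedia_data.json"},
)
print(json.dumps(index_task, indent=2))
```

The firehose is passed in rather than hard-coded because the source of the data is independent of the other settings. The example reuses the local firehose from this post; reindexing segments that already live in Druid would presumably use Druid's `ingestSegment` firehose instead, which the original post does not show, so treat that as an assumption.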

Based on this comparison, the index task is more flexible than the compact task, so I recommend using the index task.