7-Druid-Querying

Original:

http://druid.io/docs/0.10.1/querying/querying.html

Related:

Transforming Dimension Values

The following JSON fields can be used in a query to operate on dimension values.

http://druid.io/docs/0.10.1/querying/dimensionspecs.html

Query Context

The query context is used for various query configuration parameters. The following parameters apply to all queries.

http://druid.io/docs/0.10.1/querying/query-context.html

Multi-value dimensions

http://druid.io/docs/0.10.1/querying/multi-value-dimensions.html

1. Querying

Queries are made as HTTP REST requests to nodes (Broker, Historical, or Realtime). The request body is JSON describing the query:

curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -d @<query_json_file>

Third-party client libraries offer other ways to query:

http://druid.io/libraries.html

Calcite adapter:

http://calcite.apache.org/docs/druid_adapter.html

SQL Support for Druid

http://druid.io/docs/0.10.1/querying/sql.html


Query categories

Aggregation Queries

Metadata Queries

Search Queries

Recommendations

Prefer Timeseries and TopN queries over GroupBy where possible: GroupBy is the most flexible query type but also has the poorest performance. For grouping and sorting over a single dimension, topN queries are much more optimized than groupBys.

Query cancellation

 DELETE /druid/v2/{queryId}

For example, to cancel a query whose ID is abc123, send a DELETE request to the appropriate host and port:

 curl -X DELETE "http://host:port/druid/v2/abc123"
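
Cancellation requires knowing the query's ID, which can be set explicitly through the query context (queryId is a standard context parameter). A minimal sketch, where the dataSource and interval are placeholders:

 {
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "all",
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ],
  "aggregations": [ { "type": "count", "name": "rows" } ],
  "context": { "queryId": "abc123", "timeout": 60000 }
 }

queryId and timeout (in milliseconds) are both standard context parameters; see the Query Context link above.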

Query Errors

If a query fails, an HTTP 500 response is returned containing a JSON object like the following:

 {
  "error" : "Query timeout",
  "errorMessage" : "Timeout waiting for task.",
  "errorClass" : "java.util.concurrent.TimeoutException",
  "host" : "druid1.example.com:8083"
}
field | description
error | a well-defined error code
errorMessage | a more detailed error message
errorClass | the class of the exception that caused the error
host | the host on which the error occurred


code | description
Query timeout | the query timed out
Query interrupted | the query was interrupted, possibly due to JVM shutdown
Query cancelled | the query was cancelled through the cancellation API
Resource limit exceeded | the query exceeded a configured resource limit (e.g. groupBy maxResults)
Unknown exception | some other exception occurred

2. Aggregation Queries: Timeseries

The request body contains a timeseries query object; the response is a JSON array in which each JSON object represents one result value.

 An example timeseries query object:

 {
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "descending": "true",
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
      { "type": "or",
        "fields": [
          { "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
          { "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
        ]
      }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
    { "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
  ],
  "postAggregations": [
    { "type": "arithmetic",
      "name": "sample_divide",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
        { "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
      ]
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
}

The main parts of a timeseries query:

property | description | required?
queryType | "timeseries" | yes
dataSource | the data source to query | yes
descending | whether to make a descending ordered result; default is false (ascending) | no
intervals | ISO-8601 intervals defining the time ranges to run the query over | yes
granularity | the granularity used to bucket query results (see Granularities) | yes
filter | see Filters | no
aggregations | see Aggregations | no
postAggregations | see Post Aggregations | no
context | see Context | no

The query above returns 2 data points, one per day between 2012-01-01 and 2012-01-03. Each data point contains the (long) sum of sample_fieldName1, the (double) sum of sample_fieldName2, and the (double) result of dividing sample_fieldName1 by sample_fieldName2:

 [ { "timestamp": "2012-01-01T00:00:00.000Z", "result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> } }, { "timestamp": "2012-01-02T00:00:00.000Z","result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> } }
]

Zero-filling

Buckets within the queried interval that contain no data are zero-filled. For example, with the interval 2012-01-01/2012-01-04 and no data on 2012-01-02, the result is:

 [
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": { "sample_name1": <some_value> }
  },
  {
   "timestamp": "2012-01-02T00:00:00.000Z",
   "result": { "sample_name1": 0 }
  },
  {
    "timestamp": "2012-01-03T00:00:00.000Z",
    "result": { "sample_name1": <some_value> }
  }
]

Zero-filling can be disabled, so that empty buckets are omitted from the results, by setting skipEmptyBuckets = true in the query context, as in the following query:

 {
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "aggregations": [
    { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-04T00:00:00.000" ],
  "context" : {
    "skipEmptyBuckets": "true"
  }
}
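
With this context, the empty 2012-01-02 bucket is simply omitted; an illustrative result (values are placeholders) would be:

 [
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": { "sample_name1": <some_value> }
  },
  {
    "timestamp": "2012-01-03T00:00:00.000Z",
    "result": { "sample_name1": <some_value> }
  }
]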


3. Aggregation Queries: TopN

TopN queries are approximate: each node ranks its top K results and returns only those to the broker, which merges them into the final topN result. K is max(1000, threshold), and the 1000 is configurable.

 An example topN query object:

 {
  "queryType": "topN",
  "dataSource": "sample_data",
  "dimension": "sample_dim",
  "threshold": 5,
  "metric": "count",
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "dim1",
        "value": "some_value"
      },
      {
        "type": "selector",
        "dimension": "dim2",
        "value": "some_other_val"
      }
    ]
  },
  "aggregations": [
    {
      "type": "longSum",
      "name": "count",
      "fieldName": "count"
    },
    {
      "type": "doubleSum",
      "name": "some_metric",
      "fieldName": "some_metric"
    }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "average",
      "fn": "/",
      "fields": [
        {
          "type": "fieldAccess",
          "name": "some_metric",
          "fieldName": "some_metric"
        },
        {
          "type": "fieldAccess",
          "name": "count",
          "fieldName": "count"
        }
      ]
    }
  ],
  "intervals": [
    "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"
  ]
}

Parts of the query:

property | description | required?
queryType | "topN" | yes
dataSource | the data source to query | yes
intervals | ISO-8601 intervals of the query range | yes
granularity | the granularity used to bucket query results | yes
filter | see Filters | no
aggregations | see Aggregations | no
postAggregations | see Post Aggregations | no
dimension | the dimension to take the top over | yes
threshold | the N in the topN: how many results to return | yes
metric | the metric to sort the top list by | yes
context | see Context | no

The results would look like so:

 [
  {
    "timestamp": "2013-08-31T00:00:00.000Z",
    "result": [
      {
        "dim1": "dim1_val",
        "count": 111,
        "some_metrics": 10669,
        "average": 96.11711711711712
      },
      {
        "dim1": "another_dim1_val",
        "count": 88,
        "some_metrics": 28344,
        "average": 322.09090909090907
      },
      {
        "dim1": "dim1_val3",
        "count": 70,
        "some_metrics": 871,
        "average": 12.442857142857143
      },
      {
        "dim1": "dim1_val4",
        "count": 62,
        "some_metrics": 815,
        "average": 13.14516129032258
      },
      {
        "dim1": "dim1_val5",
        "count": 60,
        "some_metrics": 2787,
        "average": 46.45
      }
    ]
  }
]

Behavior on multi-value dimensions

topN queries can group on multi-value dimensions. When grouping on a multi-value dimension, all values from matching rows will be used to generate one group per value. It's possible for a query to return more groups than there are rows. For example, a topN on the dimension tags with filter "t1" AND "t3" would match only row1, and generate a result with three groups: t1, t2, and t3. If you only need to include values that match your filter, you can use a filtered dimensionSpec. This can also improve performance.

See Multi-value dimensions for more details.
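
A filtered dimensionSpec for the example above might look like the following sketch, using a listFiltered dimensionSpec so that only "t1" and "t3" generate groups:

 {
  "type": "listFiltered",
  "delegate": { "type": "default", "dimension": "tags", "outputName": "tags" },
  "values": ["t1", "t3"]
 }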

Aliasing

The current TopN algorithm is approximate.

The threshold below which the approximation kicks in can be tuned with the server parameter druid.query.topN.minTopNThreshold, but changing it requires restarting the servers.

Exact results require a groupBy followed by a sort.

If approximate ranking of the top values is acceptable but a dimension has more than 1000 distinct values and you need exact aggregates, you can issue two queries: the first retrieves the approximate topN dimension values, and the second queries only those topN values for their exact results.

Example first query:
 {
    "aggregations": [
             {
                 "fieldName": "L_QUANTITY_longSum",
                 "name": "L_QUANTITY_",
                 "type": "longSum"
             }
    ],
    "dataSource": "tpch_year",
    "dimension":"l_orderkey",
    "granularity": "all",
    "intervals": [
        "1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z"
    ],
    "metric": "L_QUANTITY_",
    "queryType": "topN",
    "threshold": 2
}
Example second query:
 {
    "aggregations": [
             {
                 "fieldName": "L_TAX_doubleSum",
                 "name": "L_TAX_",
                 "type": "doubleSum"
             },
             {
                 "fieldName": "L_DISCOUNT_doubleSum",
                 "name": "L_DISCOUNT_",
                 "type": "doubleSum"
             },
             {
                 "fieldName": "L_EXTENDEDPRICE_doubleSum",
                 "name": "L_EXTENDEDPRICE_",
                 "type": "doubleSum"
             },
             {
                 "fieldName": "L_QUANTITY_longSum",
                 "name": "L_QUANTITY_",
                 "type": "longSum"
             },
             {
                 "name": "count",
                 "type": "count"
             }
    ],
    "dataSource": "tpch_year",
    "dimension":"l_orderkey",
    "filter": {
        "fields": [
            {
                "dimension": "l_orderkey",
                "type": "selector",
                "value": "103136"
            },
            {
                "dimension": "l_orderkey",
                "type": "selector",
                "value": "1648672"
            }
        ],
        "type": "or"
    },
    "granularity": "all",
    "intervals": [
        "1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z"
    ],
    "metric": "L_QUANTITY_",
    "queryType": "topN",
    "threshold": 2
}

4. Aggregation Queries: GroupBy

An example groupBy query:

 {
  "queryType": "groupBy",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "dimensions": ["country", "device"],
  "limitSpec": { "type": "default", "limit": 5000, "columns": ["country", "data_transfer"] },
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "carrier", "value": "AT&T" },
      { "type": "or", 
        "fields": [
          { "type": "selector", "dimension": "make", "value": "Apple" },
          { "type": "selector", "dimension": "make", "value": "Samsung" }
        ]
      }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "total_usage", "fieldName": "user_count" },
    { "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" }
  ],
  "postAggregations": [
    { "type": "arithmetic",
      "name": "avg_usage",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "fieldName": "data_transfer" },
        { "type": "fieldAccess", "fieldName": "total_usage" }
      ]
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ],
  "having": {
    "type": "greaterThan",
    "aggregation": "total_usage",
    "value": 100
  }
}
property | description | required?
queryType | "groupBy" | yes
dataSource | the data source to query | yes
dimensions | the dimensions to group by | yes
limitSpec | see LimitSpec | no
having | see Having | no
granularity | the granularity used to bucket query results | yes
filter | see Filters | no
aggregations | see Aggregations | no
postAggregations | see Post Aggregations | no
intervals | ISO-8601 intervals of the query range | yes
context | see Context | no


The query above groups results by day over ["country", "device"], limits output to 5000 rows, and keeps only groups with total_usage > 100. The results look like:

 [ 
  {
    "version" : "v1",
    "timestamp" : "2012-01-01T00:00:00.000Z",
    "event" : {
      "country" : <some_dim_value_one>,
      "device" : <some_dim_value_two>,
      "total_usage" : <some_value_one>,
      "data_transfer" :<some_value_two>,
      "avg_usage" : <some_avg_usage_value>
    }
  }, 
  {
    "version" : "v1",
    "timestamp" : "2012-01-01T00:00:12.000Z",
    "event" : {
      "dim1" : <some_other_dim_value_one>,
      "dim2" : <some_other_dim_value_two>,
      "sample_name1" : <some_other_value_one>,
      "sample_name2" :<some_other_value_two>,
      "avg_usage" : <some_other_avg_usage_value>
    }
  },
...
]

5. Metadata Queries: Time Boundary

A time boundary query returns the earliest and latest data points of a data set:

 {
    "queryType" : "timeBoundary",
    "dataSource": "sample_datasource",
    "bound"     : < "maxTime" | "minTime" > # optional, defaults to returning both timestamps if not set 
    "filter"    : { "type": "and", "fields": [<filter>, <filter>, ...] } # optional
}
property | description | required?
queryType | "timeBoundary" | yes
dataSource | the data source to query | yes
bound | set to "minTime" or "maxTime" to return only that timestamp; defaults to returning both | no
filter | see Filters | no
context | see Context | no

Example result:

 [ {
  "timestamp" : "2013-05-09T18:24:00.000Z",
  "result" : {
    "minTime" : "2013-05-09T18:24:00.000Z",
    "maxTime" : "2013-05-09T18:37:00.000Z"
  }
} ]
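
Setting bound restricts the result to a single timestamp; a sketch querying only the latest time:

 {
    "queryType" : "timeBoundary",
    "dataSource": "sample_datasource",
    "bound"     : "maxTime"
 }

This would return only the maxTime entry in the result.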

6. Metadata Queries: Data Source Metadata

Returns the timestamp of the latest ingested event for the data source; rollup is not taken into account.

 {
    "queryType" : "dataSourceMetadata",
    "dataSource": "sample_datasource"
}
property | description | required?
queryType | "dataSourceMetadata" | yes
dataSource | the data source to query | yes
context | see Context | no

Result:

 [ {
  "timestamp" : "2013-05-09T18:24:00.000Z",
  "result" : {
    "maxIngestedEventTime" : "2013-05-09T18:24:09.007Z",
  }
} ]

7. Search Queries

A search query returns the dimension values that match the search spec.

The following returns values of dim1 and dim2 that contain "Ke":

 {
  "queryType": "search",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "searchDimensions": [
    "dim1",
    "dim2"
  ],
  "query": {
    "type": "insensitive_contains",
    "value": "Ke"
  },
  "sort" : {
    "type": "lexicographic"
  },
  "intervals": [
    "2013-01-01T00:00:00.000/2013-01-03T00:00:00.000"
  ]
}

There are several main parts to a search query:

property | description | required?
queryType | "search" | yes
dataSource | the data source to query | yes
granularity | the granularity used to bucket query results | yes
filter | see Filters | no
limit | the maximum number of search results to return per node | no (default to 1000)
intervals | ISO-8601 intervals of the query range | yes
searchDimensions | the dimensions to run the search over | no
query | see SearchQuerySpec | yes
sort | how the search results should be sorted: "lexicographic" (the default), "alphanumeric", "strlen", or "numeric"; see Sorting Orders for details | no
context | see Context | no


 [
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": [
      {
        "dimension": "dim1",
        "value": "Ke$ha",
        "count": 3
      },
      {
        "dimension": "dim2",
        "value": "Ke$haForPresident",
        "count": 1
      }
    ]
  },
  {
    "timestamp": "2012-01-02T00:00:00.000Z",
    "result": [
      {
        "dimension": "dim1",
        "value": "SomethingThatContainsKe",
        "count": 1
      },
      {
        "dimension": "dim2",
        "value": "SomethingElseThatContainsKe",
        "count": 2
      }
    ]
  }
]

 

8. Select Queries

 

Select queries return raw Druid rows and support pagination.

 
  {
   "queryType": "select",
   "dataSource": "wikipedia",
   "descending": "false",
   "dimensions":[],
   "metrics":[],
   "granularity": "all",
   "intervals": [
     "2013-01-01/2013-01-02"
   ],
   "pagingSpec":{"pagingIdentifiers": {}, "threshold":5}
 }
 
 
property | description | required?
queryType | "select" | yes
dataSource | the data source to query | yes
intervals | ISO-8601 intervals of the query range | yes
descending | whether to sort results descending | no
filter | see Filters | no
dimensions | the dimensions to return; if empty, all dimensions are returned | no
metrics | the metrics to return; if empty, all metrics are returned | no
granularity | the granularity used to bucket results (see Granularities) | yes
pagingSpec | offsets into the scanned segments; the returned pagingIdentifiers are used for pagination in subsequent queries | yes
context | see Context | no
 

The format of the result is:

 
  [{
  "timestamp" : "2013-01-01T00:00:00.000Z",
  "result" : {
    "pagingIdentifiers" : {
      "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9" : 4
    },
    "events" : [ {
      "segmentId" : "wikipedia_editstream_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
      "offset" : 0,
      "event" : {
        "timestamp" : "2013-01-01T00:00:00.000Z",
        "robot" : "1",
        "namespace" : "article",
        "anonymous" : "0",
        "unpatrolled" : "0",
        "page" : "11._korpus_(NOVJ)",
        "language" : "sl",
        "newpage" : "0",
        "user" : "EmausBot",
        "count" : 1.0,
        "added" : 39.0,
        "delta" : 39.0,
        "variation" : 39.0,
        "deleted" : 0.0
      }
    }, {
      "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
      "offset" : 1,
      "event" : {
        "timestamp" : "2013-01-01T00:00:00.000Z",
        "robot" : "0",
        "namespace" : "article",
        "anonymous" : "0",
        "unpatrolled" : "0",
        "page" : "112_U.S._580",
        "language" : "en",
        "newpage" : "1",
        "user" : "MZMcBride",
        "count" : 1.0,
        "added" : 70.0,
        "delta" : 70.0,
        "variation" : 70.0,
        "deleted" : 0.0
      }
    }, {
      "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
      "offset" : 2,
      "event" : {
        "timestamp" : "2013-01-01T00:00:00.000Z",
        "robot" : "0",
        "namespace" : "article",
        "anonymous" : "0",
        "unpatrolled" : "0",
        "page" : "113_U.S._243",
        "language" : "en",
        "newpage" : "1",
        "user" : "MZMcBride",
        "count" : 1.0,
        "added" : 77.0,
        "delta" : 77.0,
        "variation" : 77.0,
        "deleted" : 0.0
      }
    }, {
      "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
      "offset" : 3,
      "event" : {
        "timestamp" : "2013-01-01T00:00:00.000Z",
        "robot" : "0",
        "namespace" : "article",
        "anonymous" : "0",
        "unpatrolled" : "0",
        "page" : "113_U.S._73",
        "language" : "en",
        "newpage" : "1",
        "user" : "MZMcBride",
        "count" : 1.0,
        "added" : 70.0,
        "delta" : 70.0,
        "variation" : 70.0,
        "deleted" : 0.0
      }
    }, {
      "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
      "offset" : 4,
      "event" : {
        "timestamp" : "2013-01-01T00:00:00.000Z",
        "robot" : "0",
        "namespace" : "article",
        "anonymous" : "0",
        "unpatrolled" : "0",
        "page" : "113_U.S._756",
        "language" : "en",
        "newpage" : "1",
        "user" : "MZMcBride",
        "count" : 1.0,
        "added" : 68.0,
        "delta" : 68.0,
        "variation" : 68.0,
        "deleted" : 0.0
      }
    } ]
  }
} ]

The threshold determines how many hits are returned, with each hit indexed by an offset within its segment. When the query is descending, offsets are negative.

Result Pagination

 

The pagingSpec controls pagination: threshold sets the number of rows per page, and the initial query must supply an empty pagingIdentifiers map:

 
 "pagingSpec":{"pagingIdentifiers": {}, "threshold":5}
 

In the response, pagingIdentifiers is non-empty and identifies where this page of results ended. To fetch the next page, set pagingIdentifiers to the value returned by the previous result.
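
A sketch of the follow-up query's pagingSpec, reusing the identifier returned in the result above (the offset value here assumes the next page should start after offset 4):

 "pagingSpec": {
   "pagingIdentifiers": {
     "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9": 5
   },
   "threshold": 5
 }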

9. Components: Datasources

The data sources that queries run against:

Table Data Source

A single table as the source:

 

 {
    "type": "table",
    "name": "<string_value>"
}

 

Union Data Source

 

A union of several tables (which must have the same schema) as the source:

 

 {
       "type": "union",
       "dataSources": ["<string_value1>", "<string_value2>", "<string_value3>", ... ]
}

 

Query Data Source

 

This is used for nested groupBys and is only currently supported for groupBys.

 

 {
    "type": "query",
    "query": {
        "type": "groupBy",
        ...
    }
}
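
A sketch of a nested groupBy using a query data source (all dimension and metric names are assumed): the inner query groups by country and device, and the outer query re-aggregates those rows by country alone.

 {
  "queryType": "groupBy",
  "dataSource": {
    "type": "query",
    "query": {
      "queryType": "groupBy",
      "dataSource": "sample_datasource",
      "granularity": "all",
      "dimensions": ["country", "device"],
      "aggregations": [
        { "type": "longSum", "name": "total_usage", "fieldName": "user_count" }
      ],
      "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
    }
  },
  "granularity": "all",
  "dimensions": ["country"],
  "aggregations": [
    { "type": "longMax", "name": "max_device_usage", "fieldName": "total_usage" }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
 }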

10. Components: Query Filters

Filters are the equivalent of the SQL WHERE clause.

Selector filter

The selector filter matches a specific dimension against a specific value. Selector filters can be combined with other filters into complex Boolean expressions, and support extraction functions (see Filtering with Extraction Functions).

The grammar is:

 "filter": { "type": "selector", "dimension": <dimension_string>, "value": <dimension_value_string> }
This is equivalent to WHERE <dimension_string> = '<dimension_value_string>'.

Column Comparison filter

The column comparison filter compares dimensions to each other:

 "filter": { "type": "columnComparison", "dimensions": [<dimension_a>, <dimension_b>] }

This is equivalent to WHERE <dimension_a> = <dimension_b>.

Regular expression filter

Matches a dimension against a Java regular expression pattern. Extraction functions are supported (see Filtering with Extraction Functions):

 "filter": { "type": "regex", "dimension": <dimension_string>, "pattern": <pattern_string> }

Logical expression filters

AND
 "filter": { "type": "and", "fields": [<filter>, <filter>, ...] }
OR
 "filter": { "type": "or", "fields": [<filter>, <filter>, ...] }
NOT

The grammar for a NOT filter is as follows:

 "filter": { "type": "not", "field": <filter> }

JavaScript filter

The JavaScript filter matches a dimension against the specified JavaScript function predicate. The filter matches values for which the function returns true.

The function takes a single argument, the dimension value, and returns either true or false.

 {
  "type" : "javascript",
  "dimension" : <dimension_string>,
  "function" : "function(value) { <...> }"
}

Example: the following matches any values of the dimension name between 'bar' and 'foo':

 {
  "type" : "javascript",
  "dimension" : "name",
  "function" : "function(x) { return(x >= 'bar' && x <= 'foo') }"
}

The JavaScript filter supports the use of extraction functions, see Filtering with Extraction Functions for details.

JavaScript-based functionality is disabled by default. Please refer to the Druid  JavaScript programming guide for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.

Search filter

The search filter matches on partial string matches:

 {
    "filter": {
        "type": "search",
        "dimension": "product",
        "query": {
          "type": "insensitive_contains",
          "value": "foo" 
        }        
    }
}
property | description | required?
type | "search" | yes
dimension | the dimension to filter on | yes
query | a SearchQuerySpec (see below) | yes
extractionFn | extraction function to apply to the dimension | no

Search Query Spec

Contains:

property | description | required?
type | "contains" | yes
value | the String value to match | yes
caseSensitive | whether the comparison is case sensitive | no (default == false)

Insensitive Contains:

property | description | required?
type | "insensitive_contains" | yes
value | the String value to match | yes

An "insensitive_contains" search is equivalent to a "contains" search with "caseSensitive": false (or not provided).

Fragment:

property | description | required?
type | "fragment" | yes
values | a JSON array of String values to run the search over | yes
caseSensitive | whether strings should be compared as case sensitive; default: false (insensitive) | no
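
A minimal fragment sketch (the dimension and values are assumed); it matches dimension values that contain every string in the array:

 {
    "filter": {
        "type": "search",
        "dimension": "product",
        "query": {
          "type": "fragment",
          "values": ["foo", "bar"]
        }
    }
}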

In filter

The IN filter is similar to the following SQL query:

  SELECT COUNT(*) AS 'Count' FROM `table` WHERE `outlaw` IN ('Good', 'Bad', 'Ugly')

The grammar for an IN filter is as follows:

 {
    "type": "in",
    "dimension": "outlaw",
    "values": ["Good", "Bad", "Ugly"]
}

The IN filter supports the use of extraction functions, see Filtering with Extraction Functions for details.

Like filter

The like filter behaves like SQL LIKE: "%" matches any number of characters and "_" matches exactly one character.

property | type | description | required?
type | String | "like" | yes
dimension | String | the dimension to filter on | yes
pattern | String | the LIKE pattern, such as "foo%" or "___bar" | yes
escape | String | an escape character that can be used to escape special characters | no
extractionFn | Extraction function | extraction function to apply to the dimension | no

Like filters support the use of extraction functions, see Filtering with Extraction Functions for details.

The following matches last_name LIKE "D%" (i.e. last_name starts with "D"):

 {
    "type": "like",
    "dimension": "last_name",
    "pattern": "D%"
}

Bound filter

Bound filters restrict dimension values to a range. They support comparisons: greater than, less than, greater than or equal to, less than or equal to, and "between" (if both "lower" and "upper" are set).

property | type | description | required?
type | String | "bound" | yes
dimension | String | the dimension to filter on | yes
lower | String | the lower bound for the filter | no
upper | String | the upper bound for the filter | no
lowerStrict | Boolean | perform strict comparison on the lower bound ("<" instead of "<=") | no, default: false
upperStrict | Boolean | perform strict comparison on the upper bound (">" instead of ">=") | no, default: false
ordering | String | the sorting order used when comparing values against the bounds; see Sorting Orders | no, default: "lexicographic"
extractionFn | Extraction function | extraction function to apply to the dimension | no

(1) Equivalent to 21 <= age <= 31:

{ "type": "bound", "dimension": "age", "lower": "21", "upper": "31" , "ordering": "numeric" }

(2) Equivalent to 'foo' <= name <= 'hoo', using the default lexicographic sorting order:

{ "type": "bound", "dimension": "name", "lower": "foo", "upper": "hoo" }

(3) Equivalent to 21 < age < 31:

{ "type": "bound", "dimension": "age", "lower": "21", "lowerStrict": true, "upper": "31" , "upperStrict": true, "ordering": "numeric" }

(4) Equivalent to age < 31:

 { "type": "bound", "dimension": "age", "upper": "31" , "upperStrict": true, "ordering": "numeric" }

(5) Equivalent to age >= 18:

 { "type": "bound", "dimension": "age", "lower": "18" , "ordering": "numeric" }

Interval Filter

The interval filter performs range filtering on columns containing long millisecond values. It is best suited for the __time column, long metric columns, and dimensions whose values are long milliseconds. The filter takes ISO 8601 intervals and performs left-closed, right-open matching (i.e., start <= time < end).

property | type | description | required?
type | String | "interval" | yes
dimension | String | the dimension to filter on | yes
intervals | Array | a JSON array of ISO 8601 interval strings defining the time ranges to filter on | yes
extractionFn | Extraction function | extraction function to apply to the dimension | no


The following filters on the intervals October 1-7, 2014 and November 15-16, 2014:

 { "type" : "interval", "dimension" : "__time", "intervals" : [ "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z", "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z" ] }

This is equivalent to the following OR of two bound filters:

 {
    "type": "or",
    "fields": [
      {
        "type": "bound",
        "dimension": "__time",
        "lower": "1412121600000",
        "lowerStrict": false,
        "upper": "1412640000000" ,
        "upperStrict": true,
        "ordering": "numeric"
      },
      {
         "type": "bound",
         "dimension": "__time",
         "lower": "1416009600000",
         "lowerStrict": false,
         "upper": "1416096000000" ,
         "upperStrict": true,
         "ordering": "numeric"
      }
    ]
}

Column types

Druid supports filtering on timestamp, string, long, and float columns.

Filtering on numeric columns

Selecting rows where myFloatColumn = 10.1:

 "filter": {
  "type": "selector",
  "dimension": "myFloatColumn",
  "value": "10.1"
}

Selecting rows where 10 <= myFloatColumn < 20:

 "filter": {
  "type": "bound",
  "dimension": "myFloatColumn",
  "ordering": "numeric",
  "lowerBound": "10",
  "lowerStrict": false,
  "upperBound": "20",
  "upperStrict": true
}

Filtering on the Timestamp Column

Filtering on a long timestamp value:

 "filter": {
  "type": "selector",
  "dimension": "__time",
  "value": "124457387532"
}

Filtering on day of week:

 "filter": {
  "type": "selector",
  "dimension": "__time",
  "value": "Friday",
  "extractionFn": {
    "type": "timeFormat",
    "format": "EEEE",
    "timeZone": "America/New_York",
    "locale": "en"
  }
}

Filtering on a set of ISO 8601 intervals:

 {
    "type" : "interval",
    "dimension" : "__time",
    "intervals" : [
      "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z",
      "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z"
    ]
}

 

11. Components: Aggregations

Aggregations can be provided at ingestion time, as part of the ingestion spec, to summarize data before it enters Druid, and at query time as part of a query.

Count aggregator

 

 { "type" : "count", "name" : <output_name> }

The count aggregator counts the number of Druid rows, which does not always equal the number of raw events ingested: roll-up may combine multiple events into one Druid row at ingestion time. To count the number of ingested rows, include a count aggregator at ingestion time and use a longSum aggregator over that column at query time.
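
For example, if the ingestion spec defines { "type" : "count", "name" : "count" } in its metricsSpec, then a query-time sum over that column (a sketch; the output name is arbitrary) yields the raw ingested row count:

 { "type" : "longSum", "name" : "ingested_rows", "fieldName" : "count" }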

 

Sum aggregators

 

longSum aggregator

 

 { "type" : "longSum", "name" : <output_name>, "fieldName" : <metric_name> }

 

Computes the sum of values as a 64-bit signed integer. fieldName names the column to sum over; the result is output as output_name.

 

doubleSum aggregator

 

Computes the sum of values as a 64-bit floating point value:

 

 { "type" : "doubleSum", "name" : <output_name>, "fieldName" : <metric_name> }

 

Min / Max aggregators

 

doubleMin aggregator

 

doubleMin computes the minimum of all metric values and Double.POSITIVE_INFINITY:

 

 { "type" : "doubleMin", "name" : <output_name>, "fieldName" : <metric_name> }

 

doubleMax aggregator

 

doubleMax computes the maximum of all metric values and Double.NEGATIVE_INFINITY:

 

 { "type" : "doubleMax", "name" : <output_name>, "fieldName" : <metric_name> }

 

longMin aggregator

 

longMin computes the minimum of all metric values and Long.MAX_VALUE:

 

 { "type" : "longMin", "name" : <output_name>, "fieldName" : <metric_name> }

 

longMax aggregator

 

longMax computes the maximum of all metric values and Long.MIN_VALUE:

 

 { "type" : "longMax", "name" : <output_name>, "fieldName" : <metric_name> }

 

First / Last aggregator

 

First and Last aggregators cannot be used in an ingestion spec; they should only be specified as part of queries.

 

Note that queries with first/last aggregators on a segment created with rollup enabled will return the rolled up value, and not the last value within the raw ingested data.

 

doubleFirst aggregator

 

doubleFirst computes the metric value with the minimum timestamp, or 0 if no rows exist:

 

 {
  "type" : "doubleFirst",
  "name" : <output_name>,
  "fieldName" : <metric_name>
}

 

doubleLast aggregator

 

doubleLast computes the metric value with the maximum timestamp, or 0 if no rows exist:

 

 {
  "type" : "doubleLast",
  "name" : <output_name>,
  "fieldName" : <metric_name>
}

 

longFirst aggregator

 

longFirst computes the metric value with the minimum timestamp, or 0 if no rows exist:

 

 {
  "type" : "longFirst",
  "name" : <output_name>,
  "fieldName" : <metric_name>
}

 

longLast aggregator

 

longLast computes the metric value with the maximum timestamp, or 0 if no rows exist:

 

 {
  "type" : "longLast",
  "name" : <output_name>,
  "fieldName" : <metric_name>
}

 

JavaScript aggregator

 

Computes an arbitrary JavaScript function over a set of columns (both metrics and dimensions are allowed). Your JavaScript functions are expected to return floating-point values.

 

 { "type": "javascript",
  "name": "<output_name>",
  "fieldNames"  : [ <column1>, <column2>, ... ],
  "fnAggregate" : "function(current, column1, column2, ...) {
 <updates partial aggregate (current) based on the current row values>
 return <updated partial aggregate>
 }",
  "fnCombine"   : "function(partialA, partialB) { return <combined partial results>; }",
  "fnReset"     : "function() { return <initial value>; }"
}

 

Example

 

 {
  "type": "javascript",
  "name": "sum(log(x)*y) + 10",
  "fieldNames": ["x", "y"],
  "fnAggregate" : "function(current, a, b) { return current + (Math.log(a) * b); }",
  "fnCombine"   : "function(partialA, partialB) { return partialA + partialB; }",
  "fnReset"     : "function() { return 10; }"
}

 

JavaScript-based functionality is disabled by default. Please refer to the Druid  JavaScript programming guide for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.

 

Approximate Aggregations

 

Cardinality aggregator

 

The cardinality aggregator uses HyperLogLog to estimate the cardinality of one or more dimensions at query time. It is slower than the hyperUnique aggregator; if you only care about the cardinality of a single dimension, the hyperUnique aggregator is recommended instead:

 

 {
  "type": "cardinality",
  "name": "<output_name>",
  "fields": [ <dimension1>, <dimension2>, ... ],
  "byRow": <false | true> # (optional, defaults to false)
}
Cardinality by value

With byRow = false (the default), the cardinality is computed over the union of all dimension values for the given dimensions:

 

  • For a single dimension, this is equivalent to:

 

 SELECT COUNT(DISTINCT(dimension)) FROM <datasource>

 

  • For multiple dimensions, this is equivalent to:

 

 SELECT COUNT(DISTINCT(value)) FROM (
  SELECT dim_1 as value FROM <datasource>
  UNION
  SELECT dim_2 as value FROM <datasource>
  UNION
  SELECT dim_3 as value FROM <datasource>
)

 

Cardinality by row

With byRow = true, the cardinality is computed over distinct combinations of dimension values per row, equivalent to:

 SELECT COUNT(*) FROM ( SELECT DIM1, DIM2, DIM3 FROM <datasource> GROUP BY DIM1, DIM2, DIM3 )

 

Example

Determine the number of distinct countries people live in or have come from:

 

 {
  "type": "cardinality",
  "name": "distinct_countries",
  "fields": [ "country_of_origin", "country_of_residence" ]
}

 

Determine the number of distinct people (i.e. combinations of first and last name).

 

 {
  "type": "cardinality",
  "name": "distinct_people",
  "fields": [ "first_name", "last_name" ],
  "byRow" : true
}

 

Determine the number of distinct starting characters of last names

 

 {
  "type": "cardinality",
  "name": "distinct_last_name_first_char",
  "fields": [
    {
     "type" : "extraction",
     "dimension" : "last_name",
     "outputName" :  "last_name_first_char",
     "extractionFn" : { "type" : "substring", "index" : 0, "length" : 1 }
    }
  ],
  "byRow" : true
}

 

HyperUnique aggregator

 

Uses HyperLogLog to estimate the unique count of a dimension that was aggregated as a "hyperUnique" metric at ingestion time:

 

 { 
  "type" : "hyperUnique",
  "name" : <output_name>,
  "fieldName" : <metric_name>,
  "isInputHyperUnique" : false
}
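
At ingestion time, the corresponding metric would be defined in the metricsSpec; a sketch assuming a raw user_id column:

 { "type" : "hyperUnique", "name" : "unique_users", "fieldName" : "user_id" }

The query-time aggregator above would then reference "unique_users" as its fieldName.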

 

12. Components: Post-Aggregations

Post-aggregations are secondary computations applied to aggregation results before they are returned.

Arithmetic post-aggregator

The arithmetic post-aggregator applies a provided function (+, -, *, /, or quotient) to the given fields from left to right.

Note:

  • / division always returns 0 if dividing by 0, regardless of the numerator.
  • quotient division behaves like regular floating point division.

An arithmetic post-aggregator may also specify an ordering, which defines the order of resulting values when sorting results:

  • If no ordering (or null) is specified, the default floating point ordering is used.
  • numericFirst ordering always returns finite values first, followed by NaN, with infinite values last.

The grammar for an arithmetic post-aggregator is:
 postAggregation : {
  "type"  : "arithmetic",
  "name"  : <output_name>,
  "fn"    : <arithmetic_function>,
  "fields": [<post_aggregator>, <post_aggregator>, ...],
  "ordering" : <null (default), or "numericFirst">
}

Field accessor post-aggregator

A field accessor returns the value produced by an aggregator.

fieldName refers to the output name of the aggregator given in the aggregations portion of the query.

 { "type" : "fieldAccess", "name": <output_name>, "fieldName" : <aggregator_name> }

Constant post-aggregator

The constant post-aggregator always returns the specified value:

 { "type"  : "constant", "name"  : <output_name>, "value" : <numerical_value> }

Greatest / Least post-aggregators

doubleGreatest and longGreatest computes the maximum of all fields and Double.NEGATIVE_INFINITY. doubleLeast and longLeast computes the minimum of all fields and Double.POSITIVE_INFINITY.

The difference between the doubleMax aggregator and the doubleGreatest post-aggregator is that doubleMax returns the highest value of all rows for one specific column while doubleGreatest returns the highest value of multiple columns in one row. 

This is analogous to the difference between SQL MAX and GREATEST.

 {
  "type"  : "doubleGreatest",
  "name"  : <output_name>,
  "fields": [<post_aggregator>, <post_aggregator>, ...]
}
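
For instance, a sketch that takes the larger of two aggregated values per result row (both field names are assumed to be aggregator outputs in the same query):

 {
  "type"  : "doubleGreatest",
  "name"  : "max_of_sums",
  "fields": [
    { "type" : "fieldAccess", "fieldName" : "total_usage" },
    { "type" : "fieldAccess", "fieldName" : "data_transfer" }
  ]
 }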

HyperUnique Cardinality post-aggregator

The hyperUniqueCardinality post aggregator is used to wrap a hyperUnique object such that it can be used in post aggregations.

 { "type"  : "hyperUniqueCardinality", "name": <output name>, "fieldName"  : <the name field value of the hyperUnique aggregator>}

It can be used in a sample calculation like so:

   "aggregations" : [{
    {"type" : "count", "name" : "rows"},
    {"type" : "hyperUnique", "name" : "unique_users", "fieldName" : "uniques"}
  }],
  "postAggregations" : [{
    "type"   : "arithmetic",
    "name"   : "average_users_per_row",
    "fn"     : "/",
    "fields" : [
      { "type" : "hyperUniqueCardinality", "fieldName" : "unique_users" },
      { "type" : "fieldAccess", "name" : "rows", "fieldName" : "rows" }
    ]
  }]

Example

Computing a percentage:

 {
  ...
  "aggregations" : [
    { "type" : "count", "name" : "rows" },
    { "type" : "doubleSum", "name" : "tot", "fieldName" : "total" }
  ],
  "postAggregations" : [{
    "type"   : "arithmetic",
    "name"   : "average",
    "fn"     : "*",
    "fields" : [
       { "type"   : "arithmetic",
         "name"   : "div",
         "fn"     : "/",
         "fields" : [
           { "type" : "fieldAccess", "name" : "tot", "fieldName" : "tot" },
           { "type" : "fieldAccess", "name" : "rows", "fieldName" : "rows" }
         ]
       },
       { "type" : "constant", "name": "const", "value" : 100 }
    ]
  }]
  ...
}


13. Components: Aggregation Granularity

Granularity determines how data is bucketed across the time dimension (hour, day, minute, etc.). Granularities can be simple strings or arbitrary duration/period specifications.

Simple Granularities

The supported simple granularity strings are: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter and year.

  • all buckets everything into a single bucket
  • none does not bucket: results keep the granularity of the ingested data (millisecond at minimum)

Suppose the following data is ingested at millisecond granularity:

{"timestamp": "2013-08-31T01:02:33Z", "page": "AAA", "language" : "en"}
{"timestamp": "2013-09-01T01:02:33Z", "page": "BBB", "language" : "en"}
{"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"}
{"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"}

Submitting a groupBy query at hour granularity:

 {
   "queryType":"groupBy",
   "dataSource":"my_dataSource",
   "granularity":"hour",
   "dimensions":[
      "language"
   ],
   "aggregations":[
      {
         "type":"count",
         "name":"count"
      }
   ],
   "intervals":[
      "2000-01-01T00:00Z/3000-01-01T00:00Z"
   ]
}

you will get

 [ {
  "version" : "v1",
  "timestamp" : "2013-08-31T01:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-01T01:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-02T23:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-03T03:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
} ]


Aggregating by day instead, you will get:

 [ {
  "version" : "v1",
  "timestamp" : "2013-08-31T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-01T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-02T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-03T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
} ]

With none granularity (the same as the ingestion granularity), you will get:

 [ {
  "version" : "v1",
  "timestamp" : "2013-08-31T01:02:33.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-01T01:02:33.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-02T23:32:45.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-03T03:32:45.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
} ]


With all granularity, everything is bucketed into a single result:

 [ {
  "version" : "v1",
  "timestamp" : "2000-01-01T00:00:00.000Z",
  "event" : {
    "count" : 4,
    "language" : "en"
  }
} ]

Duration Granularities

Duration granularities are specified as an exact duration in milliseconds and support an optional origin.

This chunks up every hour, with buckets starting on the half-hour:

 {"type": "duration", "duration": 3600000, "origin": "2012-01-01T00:30:00Z"}
Example:

Reusing the data in the previous example, after submitting a groupBy query with a 24-hour duration,

 {
   "queryType":"groupBy",
   "dataSource":"my_dataSource",
   "granularity":{"type": "duration", "duration": "86400000"},
   "dimensions":[
      "language"
   ],
   "aggregations":[
      {
         "type":"count",
         "name":"count"
      }
   ],
   "intervals":[
      "2000-01-01T00:00Z/3000-01-01T00:00Z"
   ]
}

you will get

 [ {
  "version" : "v1",
  "timestamp" : "2013-08-31T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-01T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-02T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-03T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
} ]

if you set the origin for the granularity to 2012-01-01T00:30:00Z,

    "granularity":{"type": "duration", "duration": "86400000", "origin":"2012-01-01T00:30:00Z"}

you will get

 [ {
  "version" : "v1",
  "timestamp" : "2013-08-31T00:30:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-01T00:30:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-02T00:30:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-03T00:30:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
} ]

Note that the timestamp for each bucket starts at the 30th minute.

Period Granularities

Period granularities are specified as arbitrary period combinations of years, months, weeks, hours, minutes and seconds (e.g. P2W, P3M, PT1H30M, PT0.750S) in ISO8601 format. They support specifying a time zone which determines where period boundaries start as well as the timezone of the returned timestamps. By default, years start on the first of January, months start on the first of the month and weeks start on Mondays unless an origin is specified.

Time zone is optional (defaults to UTC). Origin is optional (defaults to 1970-01-01T00:00:00 in the given time zone).

 {"type": "period", "period": "P2D", "timeZone": "America/Los_Angeles"}

This will bucket by two-day chunks in the Pacific timezone.

 {"type": "period", "period": "P3M", "timeZone": "America/Los_Angeles", "origin": "2012-02-01T00:00:00-08:00"}

This will bucket by 3-month chunks in the Pacific timezone where the three-month quarters are defined as starting from February.

Example

Reusing the data in the previous example, if you submit a groupBy query with 1 day period in Pacific timezone,

 {
   "queryType":"groupBy",
   "dataSource":"my_dataSource",
   "granularity":{"type": "period", "period": "P1D", "timeZone": "America/Los_Angeles"},
   "dimensions":[
      "language"
   ],
   "aggregations":[
      {
         "type":"count",
         "name":"count"
      }
   ],
   "intervals":[
      "1999-12-31T16:00:00.000-08:00/2999-12-31T16:00:00.000-08:00"
   ]
}

you will get

 [ {
  "version" : "v1",
  "timestamp" : "2013-08-30T00:00:00.000-07:00",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-08-31T00:00:00.000-07:00",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-02T00:00:00.000-07:00",
  "event" : {
    "count" : 2,
    "language" : "en"
  }
} ]

Note that the timestamp for each bucket has been converted to Pacific time. Rows {"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"} and {"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"} are put in the same bucket because they fall on the same day in Pacific time.

Also note that the intervals in a groupBy query are not converted to the specified timezone; the timezone specified in the granularity is only applied to the query results.

If you set the origin for the granularity to 1970-01-01T20:30:00-08:00,

    "granularity":{"type": "period", "period": "P1D", "timeZone": "America/Los_Angeles", "origin": "1970-01-01T20:30:00-08:00"}

you will get

 [ {
  "version" : "v1",
  "timestamp" : "2013-08-29T20:30:00.000-07:00",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-08-30T20:30:00.000-07:00",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-01T20:30:00.000-07:00",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-02T20:30:00.000-07:00",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
} ]

Note that the origin you specified has nothing to do with the timezone, it only serves as a starting point for locating the very first granularity bucket. In this case, Row {"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"} and {"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"} are not in the same bucket.
