7-Druid-Querying

Original:

http://druid.io/docs/0.10.1/querying/querying.html

Related:

Transforming Dimension Values

The following JSON fields can be used in a query to operate on dimension values.

http://druid.io/docs/0.10.1/querying/dimensionspecs.html

Query Context

The query context is used for various query configuration parameters. The following parameters apply to all queries.

http://druid.io/docs/0.10.1/querying/query-context.html

Multi-value dimensions

http://druid.io/docs/0.10.1/querying/multi-value-dimensions.html

1. Querying

Queries are made as HTTP REST requests to nodes (Broker, Historical, or Realtime). The request body is JSON describing the query:

curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -d @<query_json_file>

Third-party client libraries offer other ways to query:

http://druid.io/libraries.html

Calcite adapter:

http://calcite.apache.org/docs/druid_adapter.html

SQL Support for Druid

http://druid.io/docs/0.10.1/querying/sql.html


Query categories

Aggregation Queries

Metadata Queries

Search Queries

Recommendations

Prefer Timeseries and TopN queries over GroupBy where possible: GroupBy is the most flexible query type but also has the poorest performance. For grouping and sorting over a single dimension, topN queries are much more optimized than groupBys.

Query cancellation

 DELETE /druid/v2/{queryId}

For example, to cancel a query whose ID is abc123, send a DELETE request to the appropriate host and port:

 curl -X DELETE "http://host:port/druid/v2/abc123"
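
Cancellation requires knowing the query's ID, which can be set explicitly through the query context (queryId is a standard context parameter). A minimal sketch, where the dataSource and interval are placeholders:

 {
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "all",
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ],
  "aggregations": [ { "type": "count", "name": "rows" } ],
  "context": { "queryId": "abc123", "timeout": 60000 }
 }

queryId and timeout (in milliseconds) are both standard context parameters; see the Query Context link above.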

Query Errors

If a query fails, an HTTP 500 response is returned containing a JSON object like the following:

 {
  "error" : "Query timeout",
  "errorMessage" : "Timeout waiting for task.",
  "errorClass" : "java.util.concurrent.TimeoutException",
  "host" : "druid1.example.com:8083"
}
field | description
error | a well-defined error code
errorMessage | a more detailed error message
errorClass | the class of the exception that caused the error
host | the host on which the error occurred


code | description
Query timeout | the query timed out
Query interrupted | the query was interrupted, possibly due to JVM shutdown
Query cancelled | the query was cancelled through the cancellation API
Resource limit exceeded | the query exceeded a configured resource limit (e.g. groupBy maxResults)
Unknown exception | some other exception occurred

2. Aggregation Queries: Timeseries

The request body contains a timeseries query object; the response is a JSON array in which each JSON object represents one result value.

 An example timeseries query object:

 {
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "descending": "true",
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
      { "type": "or",
        "fields": [
          { "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
          { "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
        ]
      }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
    { "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
  ],
  "postAggregations": [
    { "type": "arithmetic",
      "name": "sample_divide",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
        { "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
      ]
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
}

The main parts of a timeseries query:

property | description | required?
queryType | "timeseries" | yes
dataSource | the data source to query | yes
descending | whether to make a descending ordered result; default is false (ascending) | no
intervals | ISO-8601 intervals defining the time ranges to run the query over | yes
granularity | the granularity used to bucket query results (see Granularities) | yes
filter | see Filters | no
aggregations | see Aggregations | no
postAggregations | see Post Aggregations | no
context | see Context | no

The query above returns 2 data points, one per day between 2012-01-01 and 2012-01-03. Each data point contains the (long) sum of sample_fieldName1, the (double) sum of sample_fieldName2, and the (double) result of dividing sample_fieldName1 by sample_fieldName2:

 [ { "timestamp": "2012-01-01T00:00:00.000Z", "result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> } }, { "timestamp": "2012-01-02T00:00:00.000Z","result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> } }
]

Zero-filling

Buckets within the queried interval that contain no data are zero-filled. For example, with the interval 2012-01-01/2012-01-04 and no data on 2012-01-02, the result is:

 [
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": { "sample_name1": <some_value> }
  },
  {
   "timestamp": "2012-01-02T00:00:00.000Z",
   "result": { "sample_name1": 0 }
  },
  {
    "timestamp": "2012-01-03T00:00:00.000Z",
    "result": { "sample_name1": <some_value> }
  }
]

Zero-filling can be disabled, so that empty buckets are omitted from the results, by setting skipEmptyBuckets = true in the query context, as in the following query:

 {
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "aggregations": [
    { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-04T00:00:00.000" ],
  "context" : {
    "skipEmptyBuckets": "true"
  }
}
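
With this context, the empty 2012-01-02 bucket is simply omitted; an illustrative result (values are placeholders) would be:

 [
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": { "sample_name1": <some_value> }
  },
  {
    "timestamp": "2012-01-03T00:00:00.000Z",
    "result": { "sample_name1": <some_value> }
  }
]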


3. Aggregation Queries: TopN

TopN queries are approximate: each node ranks its top K results and returns only those to the broker, which merges them into the final topN result. K is max(1000, threshold), and the 1000 is configurable.

 An example topN query object:

 {
  "queryType": "topN",
  "dataSource": "sample_data",
  "dimension": "sample_dim",
  "threshold": 5,
  "metric": "count",
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "dim1",
        "value": "some_value"
      },
      {
        "type": "selector",
        "dimension": "dim2",
        "value": "some_other_val"
      }
    ]
  },
  "aggregations": [
    {
      "type": "longSum",
      "name": "count",
      "fieldName": "count"
    },
    {
      "type": "doubleSum",
      "name": "some_metric",
      "fieldName": "some_metric"
    }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "average",
      "fn": "/",
      "fields": [
        {
          "type": "fieldAccess",
          "name": "some_metric",
          "fieldName": "some_metric"
        },
        {
          "type": "fieldAccess",
          "name": "count",
          "fieldName": "count"
        }
      ]
    }
  ],
  "intervals": [
    "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"
  ]
}

Parts of the query:

property | description | required?
queryType | "topN" | yes
dataSource | the data source to query | yes
intervals | ISO-8601 intervals of the query range | yes
granularity | the granularity used to bucket query results | yes
filter | see Filters | no
aggregations | see Aggregations | no
postAggregations | see Post Aggregations | no
dimension | the dimension to take the top over | yes
threshold | the N in the topN: how many results to return | yes
metric | the metric to sort the top list by | yes
context | see Context | no

The results would look like so:

 [
  {
    "timestamp": "2013-08-31T00:00:00.000Z",
    "result": [
      {
        "dim1": "dim1_val",
        "count": 111,
        "some_metrics": 10669,
        "average": 96.11711711711712
      },
      {
        "dim1": "another_dim1_val",
        "count": 88,
        "some_metrics": 28344,
        "average": 322.09090909090907
      },
      {
        "dim1": "dim1_val3",
        "count": 70,
        "some_metrics": 871,
        "average": 12.442857142857143
      },
      {
        "dim1": "dim1_val4",
        "count": 62,
        "some_metrics": 815,
        "average": 13.14516129032258
      },
      {
        "dim1": "dim1_val5",
        "count": 60,
        "some_metrics": 2787,
        "average": 46.45
      }
    ]
  }
]

Behavior on multi-value dimensions

topN queries can group on multi-value dimensions. When grouping on a multi-value dimension, all values from matching rows will be used to generate one group per value. It's possible for a query to return more groups than there are rows. For example, a topN on the dimension tags with filter "t1" AND "t3" would match only row1, and generate a result with three groups: t1, t2, and t3. If you only need to include values that match your filter, you can use a filtered dimensionSpec. This can also improve performance.

See Multi-value dimensions for more details.
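
A filtered dimensionSpec for the example above might look like the following sketch, using a listFiltered dimensionSpec so that only "t1" and "t3" generate groups:

 {
  "type": "listFiltered",
  "delegate": { "type": "default", "dimension": "tags", "outputName": "tags" },
  "values": ["t1", "t3"]
 }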

Aliasing

The current TopN algorithm is approximate.

The threshold below which the approximation kicks in can be tuned with the server parameter druid.query.topN.minTopNThreshold, but changing it requires restarting the servers.

Exact results require a groupBy followed by a sort.

If approximate ranking of the top values is acceptable but a dimension has more than 1000 distinct values and you need exact aggregates, you can issue two queries: the first retrieves the approximate topN dimension values, and the second queries only those topN values for their exact results.

Example first query:
 {
    "aggregations": [
             {
                 "fieldName": "L_QUANTITY_longSum",
                 "name": "L_QUANTITY_",
                 "type": "longSum"
             }
    ],
    "dataSource": "tpch_year",
    "dimension":"l_orderkey",
    "granularity": "all",
    "intervals": [
        "1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z"
    ],
    "metric": "L_QUANTITY_",
    "queryType": "topN",
    "threshold": 2
}
Example second query:
 {
    "aggregations": [
             {
                 "fieldName": "L_TAX_doubleSum",
                 "name": "L_TAX_",
                 "type": "doubleSum"
             },
             {
                 "fieldName": "L_DISCOUNT_doubleSum",
                 "name": "L_DISCOUNT_",
                 "type": "doubleSum"
             },
             {
                 "fieldName": "L_EXTENDEDPRICE_doubleSum",
                 "name": "L_EXTENDEDPRICE_",
                 "type": "doubleSum"
             },
             {
                 "fieldName": "L_QUANTITY_longSum",
                 "name": "L_QUANTITY_",
                 "type": "longSum"
             },
             {
                 "name": "count",
                 "type": "count"
             }
    ],
    "dataSource": "tpch_year",
    "dimension":"l_orderkey",
    "filter": {
        "fields": [
            {
                "dimension": "l_orderkey",
                "type": "selector",
                "value": "103136"
            },
            {
                "dimension": "l_orderkey",
                "type": "selector",
                "value": "1648672"
            }
        ],
        "type": "or"
    },
    "granularity": "all",
    "intervals": [
        "1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z"
    ],
    "metric": "L_QUANTITY_",
    "queryType": "topN",
    "threshold": 2
}

4. Aggregation Queries: GroupBy

An example groupBy query:

 {
  "queryType": "groupBy",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "dimensions": ["country", "device"],
  "limitSpec": { "type": "default", "limit": 5000, "columns": ["country", "data_transfer"] },
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "carrier", "value": "AT&T" },
      { "type": "or", 
        "fields": [
          { "type": "selector", "dimension": "make", "value": "Apple" },
          { "type": "selector", "dimension": "make", "value": "Samsung" }
        ]
      }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "total_usage", "fieldName": "user_count" },
    { "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" }
  ],
  "postAggregations": [
    { "type": "arithmetic",
      "name": "avg_usage",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "fieldName": "data_transfer" },
        { "type": "fieldAccess", "fieldName": "total_usage" }
      ]
    }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ],
  "having": {
    "type": "greaterThan",
    "aggregation": "total_usage",
    "value": 100
  }
}
property | description | required?
queryType | "groupBy" | yes
dataSource | the data source to query | yes
dimensions | the dimensions to group by | yes
limitSpec | see LimitSpec | no
having | see Having | no
granularity | the granularity used to bucket query results | yes
filter | see Filters | no
aggregations | see Aggregations | no
postAggregations | see Post Aggregations | no
intervals | ISO-8601 intervals of the query range | yes
context | see Context | no


The query above groups results by day over ["country", "device"], limits output to 5000 rows, and keeps only groups with total_usage > 100. The results look like:

 [ 
  {
    "version" : "v1",
    "timestamp" : "2012-01-01T00:00:00.000Z",
    "event" : {
      "country" : <some_dim_value_one>,
      "device" : <some_dim_value_two>,
      "total_usage" : <some_value_one>,
      "data_transfer" :<some_value_two>,
      "avg_usage" : <some_avg_usage_value>
    }
  }, 
  {
    "version" : "v1",
    "timestamp" : "2012-01-01T00:00:12.000Z",
    "event" : {
      "dim1" : <some_other_dim_value_one>,
      "dim2" : <some_other_dim_value_two>,
      "sample_name1" : <some_other_value_one>,
      "sample_name2" :<some_other_value_two>,
      "avg_usage" : <some_other_avg_usage_value>
    }
  },
...
]

5. Metadata Queries: Time Boundary

A time boundary query returns the earliest and latest data points of a data set:

 {
    "queryType" : "timeBoundary",
    "dataSource": "sample_datasource",
    "bound"     : < "maxTime" | "minTime" > # optional, defaults to returning both timestamps if not set 
    "filter"    : { "type": "and", "fields": [<filter>, <filter>, ...] } # optional
}
property | description | required?
queryType | "timeBoundary" | yes
dataSource | the data source to query | yes
bound | set to "minTime" or "maxTime" to return only that timestamp; defaults to returning both | no
filter | see Filters | no
context | see Context | no

Example result:

 [ {
  "timestamp" : "2013-05-09T18:24:00.000Z",
  "result" : {
    "minTime" : "2013-05-09T18:24:00.000Z",
    "maxTime" : "2013-05-09T18:37:00.000Z"
  }
} ]
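
Setting bound restricts the result to a single timestamp; a sketch querying only the latest time:

 {
    "queryType" : "timeBoundary",
    "dataSource": "sample_datasource",
    "bound"     : "maxTime"
 }

This would return only the maxTime entry in the result.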

6. Metadata Queries: Data Source Metadata

Returns the timestamp of the latest ingested event for the data source; rollup is not taken into account.

 {
    "queryType" : "dataSourceMetadata",
    "dataSource": "sample_datasource"
}
property | description | required?
queryType | "dataSourceMetadata" | yes
dataSource | the data source to query | yes
context | see Context | no

Result:

 [ {
  "timestamp" : "2013-05-09T18:24:00.000Z",
  "result" : {
    "maxIngestedEventTime" : "2013-05-09T18:24:09.007Z",
  }
} ]

7. Search Queries

A search query returns the dimension values that match the search spec.

The following returns values of dim1 and dim2 that contain "Ke":

 {
  "queryType": "search",
  "dataSource": "sample_datasource",
  "granularity": "day",
  "searchDimensions": [
    "dim1",
    "dim2"
  ],
  "query": {
    "type": "insensitive_contains",
    "value": "Ke"
  },
  "sort" : {
    "type": "lexicographic"
  },
  "intervals": [
    "2013-01-01T00:00:00.000/2013-01-03T00:00:00.000"
  ]
}

There are several main parts to a search query:

property | description | required?
queryType | "search" | yes
dataSource | the data source to query | yes
granularity | the granularity used to bucket query results | yes
filter | see Filters | no
limit | the maximum number of search results to return per node | no (default to 1000)
intervals | ISO-8601 intervals of the query range | yes
searchDimensions | the dimensions to run the search over | no
query | see SearchQuerySpec | yes
sort | how the search results should be sorted: "lexicographic" (the default), "alphanumeric", "strlen", or "numeric"; see Sorting Orders for details | no
context | see Context | no


 [
  {
    "timestamp": "2012-01-01T00:00:00.000Z",
    "result": [
      {
        "dimension": "dim1",
        "value": "Ke$ha",
        "count": 3
      },
      {
        "dimension": "dim2",
        "value": "Ke$haForPresident",
        "count": 1
      }
    ]
  },
  {
    "timestamp": "2012-01-02T00:00:00.000Z",
    "result": [
      {
        "dimension": "dim1",
        "value": "SomethingThatContainsKe",
        "count": 1
      },
      {
        "dimension": "dim2",
        "value": "SomethingElseThatContainsKe",
        "count": 2
      }
    ]
  }
]

 

8. Select Queries

 

Select queries return raw Druid rows and support pagination.

 
  {
   "queryType": "select",
   "dataSource": "wikipedia",
   "descending": "false",
   "dimensions":[],
   "metrics":[],
   "granularity": "all",
   "intervals": [
     "2013-01-01/2013-01-02"
   ],
   "pagingSpec":{"pagingIdentifiers": {}, "threshold":5}
 }
 
 
property | description | required?
queryType | "select" | yes
dataSource | the data source to query | yes
intervals | ISO-8601 intervals of the query range | yes
descending | whether to sort results descending | no
filter | see Filters | no
dimensions | the dimensions to return; if empty, all dimensions are returned | no
metrics | the metrics to return; if empty, all metrics are returned | no
granularity | the granularity used to bucket results (see Granularities) | yes
pagingSpec | offsets into the scanned segments; the returned pagingIdentifiers are used for pagination in subsequent queries | yes
context | see Context | no
 

The format of the result is:

 
  [{
  "timestamp" : "2013-01-01T00:00:00.000Z",
  "result" : {
    "pagingIdentifiers" : {
      "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9" : 4
    },
    "events" : [ {
      "segmentId" : "wikipedia_editstream_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
      "offset" : 0,
      "event" : {
        "timestamp" : "2013-01-01T00:00:00.000Z",
        "robot" : "1",
        "namespace" : "article",
        "anonymous" : "0",
        "unpatrolled" : "0",
        "page" : "11._korpus_(NOVJ)",
        "language" : "sl",
        "newpage" : "0",
        "user" : "EmausBot",
        "count" : 1.0,
        "added" : 39.0,
        "delta" : 39.0,
        "variation" : 39.0,
        "deleted" : 0.0
      }
    }, {
      "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
      "offset" : 1,
      "event" : {
        "timestamp" : "2013-01-01T00:00:00.000Z",
        "robot" : "0",
        "namespace" : "article",
        "anonymous" : "0",
        "unpatrolled" : "0",
        "page" : "112_U.S._580",
        "language" : "en",
        "newpage" : "1",
        "user" : "MZMcBride",
        "count" : 1.0,
        "added" : 70.0,
        "delta" : 70.0,
        "variation" : 70.0,
        "deleted" : 0.0
      }
    }, {
      "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
      "offset" : 2,
      "event" : {
        "timestamp" : "2013-01-01T00:00:00.000Z",
        "robot" : "0",
        "namespace" : "article",
        "anonymous" : "0",
        "unpatrolled" : "0",
        "page" : "113_U.S._243",
        "language" : "en",
        "newpage" : "1",
        "user" : "MZMcBride",
        "count" : 1.0,
        "added" : 77.0,
        "delta" : 77.0,
        "variation" : 77.0,
        "deleted" : 0.0
      }
    }, {
      "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
      "offset" : 3,
      "event" : {
        "timestamp" : "2013-01-01T00:00:00.000Z",
        "robot" : "0",
        "namespace" : "article",
        "anonymous" : "0",
        "unpatrolled" : "0",
        "page" : "113_U.S._73",
        "language" : "en",
        "newpage" : "1",
        "user" : "MZMcBride",
        "count" : 1.0,
        "added" : 70.0,
        "delta" : 70.0,
        "variation" : 70.0,
        "deleted" : 0.0
      }
    }, {
      "segmentId" : "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9",
      "offset" : 4,
      "event" : {
        "timestamp" : "2013-01-01T00:00:00.000Z",
        "robot" : "0",
        "namespace" : "article",
        "anonymous" : "0",
        "unpatrolled" : "0",
        "page" : "113_U.S._756",
        "language" : "en",
        "newpage" : "1",
        "user" : "MZMcBride",
        "count" : 1.0,
        "added" : 68.0,
        "delta" : 68.0,
        "variation" : 68.0,
        "deleted" : 0.0
      }
    } ]
  }
} ]

The threshold determines how many hits are returned, with each hit indexed by an offset within its segment. When the query is descending, offsets are negative.

Result Pagination

 

The pagingSpec controls pagination: threshold sets the number of rows per page, and the initial query must supply an empty pagingIdentifiers map:

 
 "pagingSpec":{"pagingIdentifiers": {}, "threshold":5}
 

In the response, pagingIdentifiers is non-empty and identifies where this page of results ended. To fetch the next page, set pagingIdentifiers to the value returned by the previous result.
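
A sketch of the follow-up query's pagingSpec, reusing the identifier returned in the result above (the offset value here assumes the next page should start after offset 4):

 "pagingSpec": {
   "pagingIdentifiers": {
     "wikipedia_2012-12-29T00:00:00.000Z_2013-01-10T08:00:00.000Z_2013-01-10T08:13:47.830Z_v9": 5
   },
   "threshold": 5
 }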

9. Components: Datasources

The data sources that queries run against:

Table Data Source

A single table as the source:

 

 {
    "type": "table",
    "name": "<string_value>"
}

 

Union Data Source

 

A union of several tables (which must have the same schema) as the source:

 

 {
       "type": "union",
       "dataSources": ["<string_value1>", "<string_value2>", "<string_value3>", ... ]
}

 

Query Data Source

 

This is used for nested groupBys and is only currently supported for groupBys.

 

 {
    "type": "query",
    "query": {
        "type": "groupBy",
        ...
    }
}
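
A sketch of a nested groupBy using a query data source (all dimension and metric names are assumed): the inner query groups by country and device, and the outer query re-aggregates those rows by country alone.

 {
  "queryType": "groupBy",
  "dataSource": {
    "type": "query",
    "query": {
      "queryType": "groupBy",
      "dataSource": "sample_datasource",
      "granularity": "all",
      "dimensions": ["country", "device"],
      "aggregations": [
        { "type": "longSum", "name": "total_usage", "fieldName": "user_count" }
      ],
      "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
    }
  },
  "granularity": "all",
  "dimensions": ["country"],
  "aggregations": [
    { "type": "longMax", "name": "max_device_usage", "fieldName": "total_usage" }
  ],
  "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
 }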

10. Components: Query Filters

Filters are the equivalent of the SQL WHERE clause.

Selector filter

The selector filter matches a specific dimension against a specific value. Selector filters can be combined with other filters into complex Boolean expressions, and support extraction functions (see Filtering with Extraction Functions).

The grammar is:

 "filter": { "type": "selector", "dimension": <dimension_string>, "value": <dimension_value_string> }
This is equivalent to WHERE <dimension_string> = '<dimension_value_string>'.

Column Comparison filter

The column comparison filter compares dimensions to each other:

 "filter": { "type": "columnComparison", "dimensions": [<dimension_a>, <dimension_b>] }

This is equivalent to WHERE <dimension_a> = <dimension_b>.

Regular expression filter

Matches a dimension against a Java regular expression pattern. Extraction functions are supported (see Filtering with Extraction Functions):

 "filter": { "type": "regex", "dimension": <dimension_string>, "pattern": <pattern_string> }

Logical expression filters

AND
 "filter": { "type": "and", "fields": [<filter>, <filter>, ...] }
OR
 "filter": { "type": "or", "fields": [<filter>, <filter>, ...] }
NOT

The grammar for a NOT filter is as follows:

 "filter": { "type": "not", "field": <filter> }

JavaScript filter

The JavaScript filter matches a dimension against the specified JavaScript function predicate. The filter matches values for which the function returns true.

The function takes a single argument, the dimension value, and returns either true or false.

 {
  "type" : "javascript",
  "dimension" : <dimension_string>,
  "function" : "function(value) { <...> }"
}

Example: the following matches any values of the dimension name between 'bar' and 'foo':

 {
  "type" : "javascript",
  "dimension" : "name",
  "function" : "function(x) { return(x >= 'bar' && x <= 'foo') }"
}

The JavaScript filter supports the use of extraction functions, see Filtering with Extraction Functions for details.

JavaScript-based functionality is disabled by default. Please refer to the Druid  JavaScript programming guide for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.

Search filter

The search filter matches on partial string matches:

 {
    "filter": {
        "type": "search",
        "dimension": "product",
        "query": {
          "type": "insensitive_contains",
          "value": "foo" 
        }        
    }
}
property | description | required?
type | "search" | yes
dimension | the dimension to filter on | yes
query | a SearchQuerySpec (see below) | yes
extractionFn | extraction function to apply to the dimension | no

Search Query Spec

Contains:

property | description | required?
type | "contains" | yes
value | the String value to match | yes
caseSensitive | whether the comparison is case sensitive | no (default == false)

Insensitive Contains:

property | description | required?
type | "insensitive_contains" | yes
value | the String value to match | yes

An "insensitive_contains" search is equivalent to a "contains" search with "caseSensitive": false (or not provided).

Fragment:

property | description | required?
type | "fragment" | yes
values | a JSON array of String values to run the search over | yes
caseSensitive | whether strings should be compared as case sensitive; default: false (insensitive) | no
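
A minimal fragment sketch (the dimension and values are assumed); it matches dimension values that contain every string in the array:

 {
    "filter": {
        "type": "search",
        "dimension": "product",
        "query": {
          "type": "fragment",
          "values": ["foo", "bar"]
        }
    }
}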

In filter

The IN filter is similar to the following SQL query:

  SELECT COUNT(*) AS 'Count' FROM `table` WHERE `outlaw` IN ('Good', 'Bad', 'Ugly')

The grammar for an IN filter is as follows:

 {
    "type": "in",
    "dimension": "outlaw",
    "values": ["Good", "Bad", "Ugly"]
}

The IN filter supports the use of extraction functions, see Filtering with Extraction Functions for details.

Like filter

The like filter behaves like SQL LIKE: "%" matches any number of characters and "_" matches exactly one character.

property | type | description | required?
type | String | "like" | yes
dimension | String | the dimension to filter on | yes
pattern | String | the LIKE pattern, such as "foo%" or "___bar" | yes
escape | String | an escape character that can be used to escape special characters | no
extractionFn | Extraction function | extraction function to apply to the dimension | no

Like filters support the use of extraction functions, see Filtering with Extraction Functions for details.

The following matches last_name LIKE "D%" (i.e. last_name starts with "D"):

 {
    "type": "like",
    "dimension": "last_name",
    "pattern": "D%"
}

Bound filter

Bound filters restrict dimension values to a range. They support comparisons: greater than, less than, greater than or equal to, less than or equal to, and "between" (if both "lower" and "upper" are set).

property | type | description | required?
type | String | "bound" | yes
dimension | String | the dimension to filter on | yes
lower | String | the lower bound for the filter | no
upper | String | the upper bound for the filter | no
lowerStrict | Boolean | perform strict comparison on the lower bound ("<" instead of "<=") | no, default: false
upperStrict | Boolean | perform strict comparison on the upper bound (">" instead of ">=") | no, default: false
ordering | String | the sorting order used when comparing values against the bounds; see Sorting Orders | no, default: "lexicographic"
extractionFn | Extraction function | extraction function to apply to the dimension | no

(1) Equivalent to 21 <= age <= 31:

{ "type": "bound", "dimension": "age", "lower": "21", "upper": "31" , "ordering": "numeric" }

(2) Equivalent to 'foo' <= name <= 'hoo', using the default lexicographic sorting order:

{ "type": "bound", "dimension": "name", "lower": "foo", "upper": "hoo" }

(3) Equivalent to 21 < age < 31:

{ "type": "bound", "dimension": "age", "lower": "21", "lowerStrict": true, "upper": "31" , "upperStrict": true, "ordering": "numeric" }

(4) Equivalent to age < 31:

 { "type": "bound", "dimension": "age", "upper": "31" , "upperStrict": true, "ordering": "numeric" }

(5) Equivalent to age >= 18:

 { "type": "bound", "dimension": "age", "lower": "18" , "ordering": "numeric" }

Interval Filter

The interval filter performs range filtering on columns containing long millisecond values. It is best suited for the __time column, long metric columns, and dimensions whose values are long milliseconds. The filter takes ISO 8601 intervals and performs left-closed, right-open matching (i.e., start <= time < end).

property | type | description | required?
type | String | "interval" | yes
dimension | String | the dimension to filter on | yes
intervals | Array | a JSON array of ISO 8601 interval strings defining the time ranges to filter on | yes
extractionFn | Extraction function | extraction function to apply to the dimension | no


The following filters on the intervals October 1-7, 2014 and November 15-16, 2014:

 { "type" : "interval", "dimension" : "__time", "intervals" : [ "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z", "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z" ] }

This is equivalent to the following OR of two bound filters:

 {
    "type": "or",
    "fields": [
      {
        "type": "bound",
        "dimension": "__time",
        "lower": "1412121600000",
        "lowerStrict": false,
        "upper": "1412640000000" ,
        "upperStrict": true,
        "ordering": "numeric"
      },
      {
         "type": "bound",
         "dimension": "__time",
         "lower": "1416009600000",
         "lowerStrict": false,
         "upper": "1416096000000" ,
         "upperStrict": true,
         "ordering": "numeric"
      }
    ]
}

Column types

Druid supports filtering on timestamp, string, long, and float columns.

Filtering on numeric columns

Selecting rows where myFloatColumn = 10.1:

 "filter": {
  "type": "selector",
  "dimension": "myFloatColumn",
  "value": "10.1"
}

Selecting rows where 10 <= myFloatColumn < 20:

 "filter": {
  "type": "bound",
  "dimension": "myFloatColumn",
  "ordering": "numeric",
  "lowerBound": "10",
  "lowerStrict": false,
  "upperBound": "20",
  "upperStrict": true
}

Filtering on the Timestamp Column

Filtering on a long timestamp value:

 "filter": {
  "type": "selector",
  "dimension": "__time",
  "value": "124457387532"
}

Filtering on day of week:

 "filter": {
  "type": "selector",
  "dimension": "__time",
  "value": "Friday",
  "extractionFn": {
    "type": "timeFormat",
    "format": "EEEE",
    "timeZone": "America/New_York",
    "locale": "en"
  }
}

Filtering on a set of ISO 8601 intervals:

 {
    "type" : "interval",
    "dimension" : "__time",
    "intervals" : [
      "2014-10-01T00:00:00.000Z/2014-10-07T00:00:00.000Z",
      "2014-11-15T00:00:00.000Z/2014-11-16T00:00:00.000Z"
    ]
}

 

11. Components: Aggregations

Aggregations can be provided at ingestion time, as part of the ingestion spec, to summarize data before it enters Druid, and at query time as part of a query.

Count aggregator

 

 { "type" : "count", "name" : <output_name> }

The count aggregator counts the number of Druid rows, which does not always equal the number of raw events ingested: roll-up may combine multiple events into one Druid row at ingestion time. To count the number of ingested rows, include a count aggregator at ingestion time and use a longSum aggregator over that column at query time.
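
For example, if the ingestion spec defines { "type" : "count", "name" : "count" } in its metricsSpec, then a query-time sum over that column (a sketch; the output name is arbitrary) yields the raw ingested row count:

 { "type" : "longSum", "name" : "ingested_rows", "fieldName" : "count" }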

 

Sum aggregators

 

longSum aggregator

 

 { "type" : "longSum", "name" : <output_name>, "fieldName" : <metric_name> }

 

Computes the sum of values as a 64-bit signed integer. fieldName names the column to sum over; the result is output as output_name.

 

doubleSum aggregator

 

Computes the sum of values as a 64-bit floating point value:

 

 { "type" : "doubleSum", "name" : <output_name>, "fieldName" : <metric_name> }

 

Min / Max aggregators

 

doubleMin aggregator

 

doubleMin computes the minimum of all metric values and Double.POSITIVE_INFINITY:

 

 { "type" : "doubleMin", "name" : <output_name>, "fieldName" : <metric_name> }

 

doubleMax aggregator

 

doubleMax computes the maximum of all metric values and Double.NEGATIVE_INFINITY:

 

 { "type" : "doubleMax", "name" : <output_name>, "fieldName" : <metric_name> }

 

longMin aggregator

 

longMin computes the minimum of all metric values and Long.MAX_VALUE:

 

 { "type" : "longMin", "name" : <output_name>, "fieldName" : <metric_name> }

 

longMax aggregator

 

longMax computes the maximum of all metric values and Long.MIN_VALUE:

 

 { "type" : "longMax", "name" : <output_name>, "fieldName" : <metric_name> }

 

First / Last aggregator

 

First and Last aggregators cannot be used in an ingestion spec; they should only be specified as part of queries.

 

Note that queries with first/last aggregators on a segment created with rollup enabled will return the rolled up value, and not the last value within the raw ingested data.

 

doubleFirst aggregator

 

doubleFirst computes the metric value with the minimum timestamp, or 0 if no rows exist:

 

 {
  "type" : "doubleFirst",
  "name" : <output_name>,
  "fieldName" : <metric_name>
}

 

doubleLast aggregator

 

doubleLast computes the metric value with the maximum timestamp, or 0 if no rows exist:

 

 {
  "type" : "doubleLast",
  "name" : <output_name>,
  "fieldName" : <metric_name>
}

 

longFirst aggregator

 

longFirst computes the metric value with the minimum timestamp, or 0 if no rows exist:

 

 {
  "type" : "longFirst",
  "name" : <output_name>,
  "fieldName" : <metric_name>
}

 

longLast aggregator

 

longLast computes the metric value with the maximum timestamp, or 0 if no rows exist:

 

 {
  "type" : "longLast",
  "name" : <output_name>,
  "fieldName" : <metric_name>
}

 

JavaScript aggregator

 

Computes an arbitrary JavaScript function over a set of columns (both metrics and dimensions are allowed). Your JavaScript functions are expected to return floating-point values.

 

 { "type": "javascript",
  "name": "<output_name>",
  "fieldNames"  : [ <column1>, <column2>, ... ],
  "fnAggregate" : "function(current, column1, column2, ...) {
 <updates partial aggregate (current) based on the current row values>
 return <updated partial aggregate>
 }",
  "fnCombine"   : "function(partialA, partialB) { return <combined partial results>; }",
  "fnReset"     : "function() { return <initial value>; }"
}

 

Example

 

 {
  "type": "javascript",
  "name": "sum(log(x)*y) + 10",
  "fieldNames": ["x", "y"],
  "fnAggregate" : "function(current, a, b) { return current + (Math.log(a) * b); }",
  "fnCombine"   : "function(partialA, partialB) { return partialA + partialB; }",
  "fnReset"     : "function() { return 10; }"
}

 

JavaScript-based functionality is disabled by default. Please refer to the Druid  JavaScript programming guide for guidelines about using Druid's JavaScript functionality, including instructions on how to enable it.

 

Approximate Aggregations

 

Cardinality aggregator

 

The cardinality aggregator uses HyperLogLog to estimate the cardinality of one or more dimensions at query time. It is slower than the hyperUnique aggregator; if you only care about the cardinality of a single dimension, the hyperUnique aggregator is recommended instead:

 

 {
  "type": "cardinality",
  "name": "<output_name>",
  "fields": [ <dimension1>, <dimension2>, ... ],
  "byRow": <false | true> # (optional, defaults to false)
}
Cardinality by value

With byRow = false (the default), the cardinality is computed over the union of all dimension values for the given dimensions:

 

  • For a single dimension, this is equivalent to:

 

 SELECT COUNT(DISTINCT(dimension)) FROM <datasource>

 

  • For multiple dimensions, this is equivalent to:

 

 SELECT COUNT(DISTINCT(value)) FROM (
  SELECT dim_1 as value FROM <datasource>
  UNION
  SELECT dim_2 as value FROM <datasource>
  UNION
  SELECT dim_3 as value FROM <datasource>
)

 

Cardinality by row

With byRow = true, the cardinality is computed over distinct combinations of dimension values per row, equivalent to:

 SELECT COUNT(*) FROM ( SELECT DIM1, DIM2, DIM3 FROM <datasource> GROUP BY DIM1, DIM2, DIM3 )

 

Example

Determine the number of distinct countries people live in or have come from:

 

 {
  "type": "cardinality",
  "name": "distinct_countries",
  "fields": [ "country_of_origin", "country_of_residence" ]
}

 

Determine the number of distinct people (i.e. combinations of first and last name).

 

 {
  "type": "cardinality",
  "name": "distinct_people",
  "fields": [ "first_name", "last_name" ],
  "byRow" : true
}

 

Determine the number of distinct starting characters of last names

 

 {
  "type": "cardinality",
  "name": "distinct_last_name_first_char",
  "fields": [
    {
     "type" : "extraction",
     "dimension" : "last_name",
     "outputName" :  "last_name_first_char",
     "extractionFn" : { "type" : "substring", "index" : 0, "length" : 1 }
    }
  ],
  "byRow" : true
}

 

HyperUnique aggregator

 

Uses HyperLogLog to estimate the unique count of a dimension that was aggregated as a "hyperUnique" metric at ingestion time:

 

 { 
  "type" : "hyperUnique",
  "name" : <output_name>,
  "fieldName" : <metric_name>,
  "isInputHyperUnique" : false
}
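
At ingestion time, the corresponding metric would be defined in the metricsSpec; a sketch assuming a raw user_id column:

 { "type" : "hyperUnique", "name" : "unique_users", "fieldName" : "user_id" }

The query-time aggregator above would then reference "unique_users" as its fieldName.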

 

12. Components: Post-Aggregations

Post-aggregations are secondary computations applied to aggregation results before they are returned.

Arithmetic post-aggregator

The arithmetic post-aggregator applies a provided function (+, -, *, /, or quotient) to the given fields from left to right.

Note:

  • / division always returns 0 if dividing by 0, regardless of the numerator.
  • quotient division behaves like regular floating point division.

An arithmetic post-aggregator may also specify an ordering, which defines the order of resulting values when sorting results:

  • If no ordering (or null) is specified, the default floating point ordering is used.
  • numericFirst ordering always returns finite values first, followed by NaN, with infinite values last.

The grammar for an arithmetic post-aggregator is:
 postAggregation : {
  "type"  : "arithmetic",
  "name"  : <output_name>,
  "fn"    : <arithmetic_function>,
  "fields": [<post_aggregator>, <post_aggregator>, ...],
  "ordering" : <null (default), or "numericFirst">
}

Field accessor post-aggregator

A field accessor returns the value produced by an aggregator.

fieldName refers to the output name of the aggregator given in the aggregations portion of the query.

 { "type" : "fieldAccess", "name": <output_name>, "fieldName" : <aggregator_name> }

Constant post-aggregator

The constant post-aggregator always returns the specified value:

 { "type"  : "constant", "name"  : <output_name>, "value" : <numerical_value> }

Greatest / Least post-aggregators

doubleGreatest and longGreatest computes the maximum of all fields and Double.NEGATIVE_INFINITY. doubleLeast and longLeast computes the minimum of all fields and Double.POSITIVE_INFINITY.

The difference between the doubleMax aggregator and the doubleGreatest post-aggregator is that doubleMax returns the highest value of all rows for one specific column while doubleGreatest returns the highest value of multiple columns in one row. 

This is analogous to the difference between SQL MAX and GREATEST.

 {
  "type"  : "doubleGreatest",
  "name"  : <output_name>,
  "fields": [<post_aggregator>, <post_aggregator>, ...]
}
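
For instance, a sketch that takes the larger of two aggregated values per result row (both field names are assumed to be aggregator outputs in the same query):

 {
  "type"  : "doubleGreatest",
  "name"  : "max_of_sums",
  "fields": [
    { "type" : "fieldAccess", "fieldName" : "total_usage" },
    { "type" : "fieldAccess", "fieldName" : "data_transfer" }
  ]
 }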

HyperUnique Cardinality post-aggregator

The hyperUniqueCardinality post aggregator is used to wrap a hyperUnique object such that it can be used in post aggregations.

 { "type"  : "hyperUniqueCardinality", "name": <output name>, "fieldName"  : <the name field value of the hyperUnique aggregator>}

It can be used in a sample calculation like so:

   "aggregations" : [{
    {"type" : "count", "name" : "rows"},
    {"type" : "hyperUnique", "name" : "unique_users", "fieldName" : "uniques"}
  }],
  "postAggregations" : [{
    "type"   : "arithmetic",
    "name"   : "average_users_per_row",
    "fn"     : "/",
    "fields" : [
      { "type" : "hyperUniqueCardinality", "fieldName" : "unique_users" },
      { "type" : "fieldAccess", "name" : "rows", "fieldName" : "rows" }
    ]
  }]

Example

Computing a percentage:

 {
  ...
  "aggregations" : [
    { "type" : "count", "name" : "rows" },
    { "type" : "doubleSum", "name" : "tot", "fieldName" : "total" }
  ],
  "postAggregations" : [{
    "type"   : "arithmetic",
    "name"   : "average",
    "fn"     : "*",
    "fields" : [
       { "type"   : "arithmetic",
         "name"   : "div",
         "fn"     : "/",
         "fields" : [
           { "type" : "fieldAccess", "name" : "tot", "fieldName" : "tot" },
           { "type" : "fieldAccess", "name" : "rows", "fieldName" : "rows" }
         ]
       },
       { "type" : "constant", "name": "const", "value" : 100 }
    ]
  }]
  ...
}


13. Components: Aggregation Granularity

Granularity determines how data is bucketed across the time dimension (hour, day, minute, etc.). Granularities can be simple strings or arbitrary duration/period specifications.

Simple Granularities

The supported simple granularity strings are: all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter and year.

  • all buckets everything into a single bucket
  • none does not bucket: results keep the granularity of the ingested data (millisecond at minimum)

Suppose the following data is ingested at millisecond granularity:

{"timestamp": "2013-08-31T01:02:33Z", "page": "AAA", "language" : "en"}
{"timestamp": "2013-09-01T01:02:33Z", "page": "BBB", "language" : "en"}
{"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"}
{"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"}

Submitting a groupBy query at hour granularity:

 {
   "queryType":"groupBy",
   "dataSource":"my_dataSource",
   "granularity":"hour",
   "dimensions":[
      "language"
   ],
   "aggregations":[
      {
         "type":"count",
         "name":"count"
      }
   ],
   "intervals":[
      "2000-01-01T00:00Z/3000-01-01T00:00Z"
   ]
}

you will get

 [ {
  "version" : "v1",
  "timestamp" : "2013-08-31T01:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-01T01:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-02T23:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-03T03:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
} ]


Aggregating by day instead, you will get:

 [ {
  "version" : "v1",
  "timestamp" : "2013-08-31T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-01T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-02T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-03T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
} ]

With none granularity (the same as the ingestion granularity), you will get:

 [ {
  "version" : "v1",
  "timestamp" : "2013-08-31T01:02:33.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-01T01:02:33.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-02T23:32:45.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-03T03:32:45.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
} ]


With all granularity, everything is bucketed into a single result:

 [ {
  "version" : "v1",
  "timestamp" : "2000-01-01T00:00:00.000Z",
  "event" : {
    "count" : 4,
    "language" : "en"
  }
} ]

Duration Granularities

Duration granularities are specified as an exact duration in milliseconds and support an optional origin.

This chunks up every hour, with buckets starting on the half-hour:

 {"type": "duration", "duration": 3600000, "origin": "2012-01-01T00:30:00Z"}
Example:

Reusing the data in the previous example, after submitting a groupBy query with a 24-hour duration,

 {
   "queryType":"groupBy",
   "dataSource":"my_dataSource",
   "granularity":{"type": "duration", "duration": "86400000"},
   "dimensions":[
      "language"
   ],
   "aggregations":[
      {
         "type":"count",
         "name":"count"
      }
   ],
   "intervals":[
      "2000-01-01T00:00Z/3000-01-01T00:00Z"
   ]
}

you will get

 [ {
  "version" : "v1",
  "timestamp" : "2013-08-31T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-01T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-02T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-03T00:00:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
} ]

if you set the origin for the granularity to 2012-01-01T00:30:00Z,

    "granularity":{"type": "duration", "duration": "86400000", "origin":"2012-01-01T00:30:00Z"}

you will get

 [ {
  "version" : "v1",
  "timestamp" : "2013-08-31T00:30:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-01T00:30:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-02T00:30:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-03T00:30:00.000Z",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
} ]

Note that the timestamp for each bucket starts at the 30th minute.

Period Granularities

Period granularities are specified as arbitrary period combinations of years, months, weeks, hours, minutes and seconds (e.g. P2W, P3M, PT1H30M, PT0.750S) in ISO8601 format. They support specifying a time zone which determines where period boundaries start as well as the timezone of the returned timestamps. By default, years start on the first of January, months start on the first of the month and weeks start on Mondays unless an origin is specified.

Time zone is optional (defaults to UTC). Origin is optional (defaults to 1970-01-01T00:00:00 in the given time zone).

 {"type": "period", "period": "P2D", "timeZone": "America/Los_Angeles"}

This will bucket by two-day chunks in the Pacific timezone.

 {"type": "period", "period": "P3M", "timeZone": "America/Los_Angeles", "origin": "2012-02-01T00:00:00-08:00"}

This will bucket by 3-month chunks in the Pacific timezone where the three-month quarters are defined as starting from February.

Example

Reusing the data in the previous example, if you submit a groupBy query with 1 day period in Pacific timezone,

 {
   "queryType":"groupBy",
   "dataSource":"my_dataSource",
   "granularity":{"type": "period", "period": "P1D", "timeZone": "America/Los_Angeles"},
   "dimensions":[
      "language"
   ],
   "aggregations":[
      {
         "type":"count",
         "name":"count"
      }
   ],
   "intervals":[
      "1999-12-31T16:00:00.000-08:00/2999-12-31T16:00:00.000-08:00"
   ]
}

you will get

 [ {
  "version" : "v1",
  "timestamp" : "2013-08-30T00:00:00.000-07:00",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-08-31T00:00:00.000-07:00",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-02T00:00:00.000-07:00",
  "event" : {
    "count" : 2,
    "language" : "en"
  }
} ]

Note that the timestamp for each bucket has been converted to Pacific time. Rows {"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"} and {"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"} are put in the same bucket because they fall on the same day in Pacific time.

Also note that the intervals in a groupBy query are not converted to the specified timezone; the timezone specified in the granularity is only applied to the query results.

If you set the origin for the granularity to 1970-01-01T20:30:00-08:00,

    "granularity":{"type": "period", "period": "P1D", "timeZone": "America/Los_Angeles", "origin": "1970-01-01T20:30:00-08:00"}

you will get

 [ {
  "version" : "v1",
  "timestamp" : "2013-08-29T20:30:00.000-07:00",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-08-30T20:30:00.000-07:00",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-01T20:30:00.000-07:00",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
}, {
  "version" : "v1",
  "timestamp" : "2013-09-02T20:30:00.000-07:00",
  "event" : {
    "count" : 1,
    "language" : "en"
  }
} ]

Note that the origin you specified has nothing to do with the timezone, it only serves as a starting point for locating the very first granularity bucket. In this case, Row {"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"} and {"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"} are not in the same bucket.
