groupBy Queries
These types of queries take a groupBy query object and return an array of JSON objects where each object represents a grouping asked for by the query.
Note: If you are doing aggregations with time as your only grouping, or an ordered groupBy over a single dimension, consider Timeseries and TopN queries as well as groupBy. Their performance may be better in some cases. See Alternatives below for more details.
这些类型的查询接受groupBy查询对象并返回JSON对象数组,其中每个对象表示查询所请求的分组。
注意:如果将时间作为唯一的分组进行聚合,或者在单个维度上进行有序的groupBy查询,请考虑Timeseries和TopN查询。在某些情况下,他们的表现可能更好。有关详细信息,请参阅下面的备选方案。
An example groupBy query object is shown below:
{
"queryType": "groupBy",
"dataSource": "sample_datasource",
"granularity": "day",
"dimensions": ["country", "device"],
"limitSpec": { "type": "default", "limit": 5000, "columns": ["country", "data_transfer"] },
"filter": {
"type": "and",
"fields": [
{ "type": "selector", "dimension": "carrier", "value": "AT&T" },
{ "type": "or",
"fields": [
{ "type": "selector", "dimension": "make", "value": "Apple" },
{ "type": "selector", "dimension": "make", "value": "Samsung" }
]
}
]
},
"aggregations": [
{ "type": "longSum", "name": "total_usage", "fieldName": "user_count" },
{ "type": "doubleSum", "name": "data_transfer", "fieldName": "data_transfer" }
],
"postAggregations": [
{ "type": "arithmetic",
"name": "avg_usage",
"fn": "/",
"fields": [
{ "type": "fieldAccess", "fieldName": "data_transfer" },
{ "type": "fieldAccess", "fieldName": "total_usage" }
]
}
],
"intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ],
"having": {
"type": "greaterThan",
"aggregation": "total_usage",
"value": 100
}
}
Following are main parts to a groupBy query:propertydescriptionrequired?queryTypeThis String should always be “groupBy”; this is the first thing Druid looks at to figure out how to interpret the queryyes
dataSourceA String or Object defining the data source to query, very similar to a table in a relational database. See DataSource for more information.yes
dimensionsA JSON list of dimensions to do the groupBy over; or see DimensionSpec for ways to extract dimensions.yes
limitSpecno
havingno
granularityDefines the granularity of the query. See Granularitiesyes
filterno
aggregationsno
postAggregationsno
intervalsA JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.yes
subtotalsSpecA JSON array of arrays to return additional result sets for groupings of subsets of top level dimensions. It is described later in more detail.no
contextAn additional JSON Object which can be used to specify certain flags.no
To pull it all together, the above query would return n*m data points, up to a maximum of 5000 points, where n is the cardinality of the country dimension, m is the cardinality of the devicedimension, each day between 2012-01-01 and 2012-01-03, from the sample_datasource table. Each data point contains the (long) sum of total_usage if the value of the data point is greater than 100, the (double) sum of data_transfer and the (double) result of total_usage divided by data_transfer for the filter set for a particular grouping of country and device.
The output looks like this:
[
{
"version" : "v1",
"timestamp" : "2012-01-01T00:00:00.000Z",
"event" : {
"country" : ,
"device" : ,
"total_usage" : ,
"data_transfer" :,
"avg_usage" :
}
},
{
"version" : "v1",
"timestamp" : "2012-01-01T00:00:12.000Z",
"event" : {
"dim1" : ,
"dim2" : ,
"sample_name1" : ,
"sample_name2" :,
"avg_usage" :
}
},
...
]
Behavior on multi-value dimensions
groupBy queries can group on multi-value dimensions. When grouping on a multi-value dimension, all values from matching rows will be used to generate one group per value. It’s possible for a query to return more groups than there are rows. For example, a groupBy on the dimension tags with filter "t1" AND "t3" would match only row1, and generate a result with three groups: t1, t2, and t3. If you only need to include values that match your filter, you can use a filtered dimensionSp