Data ingestion
Druid uses the Kafka indexing service to read data directly from Kafka, index it, and ingest it into Druid.
Developers only need to write the data in JSON format to the corresponding Kafka topic.
--JSON format demo
{
"TIMESTAMP": 1512791296,
"TAG1": "tag1-2404",
"TAG2": "tag2-936",
"TAG3": "tag3-119",
"TAG4": "tag4-4534",
....
"BYTES": 42841,
"LATENCYMS": 8293,
"ACKS": 3153,
"ERROR_COUNT": 16149,
"CLICK": 243,
"POINT": 77,
"PRICE": 765
}
Notes:
1. To avoid time-zone issues, TIMESTAMP should be a 10-digit integer Unix time (seconds).
2. Field names are case-sensitive.
3. Dimension columns should avoid NULL values as much as possible; in other words, keep dimension columns non-null.
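The rules above can be sketched in Python: build one event with a 10-digit Unix timestamp (seconds) and serialize it to JSON before producing it to the Kafka topic. The producer call itself is omitted, and the tag/metric values are placeholders in the style of the demo.

```python
import json
import time

def build_event(tags, metrics):
    """Assemble one ingestion event as a JSON string.

    TIMESTAMP is a 10-digit Unix time in seconds (not milliseconds).
    Field names are case-sensitive, so keys stay upper-case as in the demo.
    Dimension (tag) values should be non-null strings.
    """
    event = {"TIMESTAMP": int(time.time())}
    event.update(tags)
    event.update(metrics)
    return json.dumps(event)

payload = build_event(
    {"TAG1": "tag1-2404", "TAG2": "tag2-936"},
    {"BYTES": 42841, "LATENCYMS": 8293},
)
```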
Defining a datasource
Once the data is in Kafka, you need to define a datasource spec file that tells Druid how to ingest it; it declares the timestamp format, the dimension columns, the metric columns, the pre-aggregation granularity, and so on.
--datasource spec file demo
{
"type": "kafka",
"dataSchema": {
"dataSource": "metrics-min",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "TIMESTAMP",
"format": "posix"
},
"dimensionsSpec": {
"dimensions": [
"TAG1",
"TAG2",
...
"TAG30",
"TAG31"
]
}
}
},
"metricsSpec": [
{
"name": "counter",
"type": "count"
},
{
"name": "BYTES",
"fieldName": "BYTES",
"type": "doubleSum"
},
{
"name": "LATENCYMS",
"fieldName": "LATENCYMS",
"type": "doubleSum"
}
],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "HOUR",
"queryGranularity": "MINUTE"
}
},
"tuningConfig": {
"type": "kafka",
"maxRowsPerSegment": 500000000
},
"ioConfig": {
"topic": "metrics",
"consumerProperties": {
"bootstrap.servers": "192.168.1.115:9092"
},
"taskCount": 2,
"taskDuration": "PT72M",
"replicas": 1
}
}
Notes:
1. dataSource declares the name of the datasource.
2. dimensionsSpec declares the dimension columns.
3. metricsSpec declares the metric columns.
4. ioConfig declares the address of the Kafka cluster the data is consumed from and the name of the topic.
5. taskCount declares how many consumer tasks are started to consume the data.
6. queryGranularity declares the pre-aggregation granularity.
To start ingestion, send a POST request to the overlord with the datasource spec file as the payload.
curl -X POST -H 'Content-Type: application/json' -d @metrics-min-kafka-supervisor.json http://192.168.1.112:8090/druid/indexer/v1/supervisor
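The same submission can also be made from Python with only the standard library. A minimal sketch; build_supervisor_request is a hypothetical helper, and the spec dict stands in for the supervisor spec file above.

```python
import json
import urllib.request

def build_supervisor_request(overlord_url, spec):
    """Build a POST request that submits a supervisor spec to the overlord.

    Pass the returned request to urllib.request.urlopen() to actually send it.
    """
    body = json.dumps(spec).encode("utf-8")
    req = urllib.request.Request(
        overlord_url + "/druid/indexer/v1/supervisor",
        data=body,
        method="POST",
    )
    req.add_header("Content-Type", "application/json")
    return req

req = build_supervisor_request("http://192.168.1.112:8090", {"type": "kafka"})
```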
Data queries
Select detail queries
--first-page query
demo:
{
"queryType": "select",
"dataSource": "metrics-min",
"descending": false,
"dimensions":[],
"metrics":[],
"granularity": "all",
"intervals": [
"2017-12-09T15:10:00/2017-12-09T15:11:00"
],
"pagingSpec":{"pagingIdentifiers": {}, "threshold":5}
}
Equivalent MySQL:
select
*
from
`metrics-min`
where
timestamp>='2017-12-09 15:10:00' and timestamp<'2017-12-09 15:11:00'
limit 0, 5;
To run a query, POST the prepared JSON file to a broker node:
curl -X POST '192.168.1.112:8082/druid/v2/?pretty' -H 'Content-Type:application/json' -d @query.json
output:
[ {
"timestamp" : "2017-12-09T07:10:00.000Z",
"result" : {
"pagingIdentifiers" : {
"metrics-min_2017-12-09T15:10:00.000+08:00_2017-12-09T15:11:00.000+08:00_2017-12-09T14:58:26.460+08:00" : 4
},
"dimensions" : [ "TAG26", "TAG27", "TAG24", "TAG25", "TAG22", "TAG23", "TAG20", "TAG21", "TAG1", "TAG9", "TAG8", "TAG7", "TAG6", "TAG5", "TAG19", "TAG4", "TAG3", "TAG17", "TAG2", "TAG18", "TAG15", "TAG16", "TAG13", "TAG14", "TAG11", "TAG12", "TAG31", "TAG10", "TAG30", "TAG28", "TAG29" ],
"metrics" : [ "BYTES", "ACKS", "PRICE", "LATENCYMS", "ERROR_COUNT", "counter", "CLICK", "POINT" ],
"events" : [ {
"segmentId" : "metrics-min_2017-12-09T15:10:00.000+08:00_2017-12-09T15:11:00.000+08:00_2017-12-09T14:58:26.460+08:00",
"offset" : 0,
"event" : {
"timestamp" : "2017-12-09T15:10:00.000+08:00",
"TAG1" : "tag1-0",
"TAG2" : "tag2-1384",
"TAG3" : "tag3-191",
...
...
--second-page query
demo:
{
"queryType": "select",
"dataSource": "metrics-min",
"descending": false,
"dimensions":[],
"metrics":[],
"granularity": "all",
"intervals": [
"2017-12-09T15:10:00/2017-12-09T15:11:00"
],
"pagingSpec":{"pagingIdentifiers": {"metrics-min_2017-12-09T15:10:00.000+08:00_2017-12-09T15:11:00.000+08:00_2017-12-09T14:58:26.460+08:00" : 5}, "threshold":5}
}
Equivalent MySQL:
select
*
from
`metrics-min`
where
timestamp>='2017-12-09 15:10:00' and timestamp<'2017-12-09 15:11:00'
limit 5, 5;
Notes:
1. The dimensions and metrics options select the dimension and metric columns to return; leaving them empty returns all columns. Avoid the equivalent of select *; return only the fields you actually need.
2. pagingSpec is the pagination option. Every query result returns the pagingIdentifiers of that query; for the next page, use the previous pagingIdentifiers with each value incremented by 1.
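Note 2 can be captured in a small helper: take the pagingIdentifiers from the previous response and increment each offset by 1 to build the next pagingSpec. A sketch; the segment id is shortened here for readability.

```python
def next_paging_spec(prev_identifiers, threshold):
    """Build the pagingSpec for the next page.

    Each offset in pagingIdentifiers is the offset of the last row read
    from that segment, so the next page starts at offset + 1.
    """
    return {
        "pagingIdentifiers": {seg: off + 1 for seg, off in prev_identifiers.items()},
        "threshold": threshold,
    }

# The first page returned offset 4 for the segment, so page two starts at 5.
spec = next_paging_spec({"metrics-min_seg": 4}, 5)
```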
Single-dimension aggregation: topN
A topN query is a group-by aggregation over a single dimension, but it is more efficient than groupBy.
demo:
{
"queryType": "topN",
"dataSource": "metrics-min",
"dimension": "TAG12",
"threshold": 2,
"metric": "ack",
"granularity": "minute",
"filter": {
"type": "and",
"fields": [
{
"type": "selector",
"dimension": "TAG18",
"value": "tag18-9227"
},
{
"type": "selector",
"dimension": "TAG14",
"value": "tag14-2028"
}
]
},
"aggregations": [
{
"type": "longSum",
"name": "ack",
"fieldName": "ACKS"
},
{
"type": "doubleSum",
"name": "latencyms",
"fieldName": "LATENCYMS"
}
],
"postAggregations": [
{
"type": "arithmetic",
"name": "ack_div_latency",
"fn": "/",
"fields": [
{
"type": "fieldAccess",
"name": "ack",
"fieldName": "ack"
},
{
"type": "fieldAccess",
"name": "latencyms",
"fieldName": "latencyms"
}
]
}
],
"intervals": [
"2017-11-27T15:00:00.000/2017-11-28T15:00:00.000"
]
}
Equivalent MySQL:
select
TAG12,
sum(ACKS) as ack,
sum(LATENCYMS) as latencyms,
sum(ACKS)/sum(LATENCYMS) as ack_div_latency
from
`metrics-min`
where
timestamp>='2017-11-27 15:00:00' and timestamp<'2017-11-28 15:00:00' and TAG14='tag14-2028' and TAG18='tag18-9227'
group by
TAG12
order by
ack
desc limit 2;
output:
[ {
"timestamp" : "2017-11-27T15:00:00.000+08:00",
"result" : [ {
"ack" : 90300896,
"ack_div_latency" : 1.003882596928461,
"TAG12" : "tag12-0",
"latencyms" : 8.995165E7
}, {
"ack" : 331693,
"ack_div_latency" : 1.1468773987427994,
"TAG12" : "tag12-1680",
"latencyms" : 289214.0
} ]
}, {
"timestamp" : "2017-11-27T16:00:00.000+08:00",
"result" : [ {
"ack" : 90986653,
"ack_div_latency" : 1.0018361726305511,
"TAG12" : "tag12-0",
"latencyms" : 9.0819892E7
}, {
"ack" : 307253,
"ack_div_latency" : 0.9880629652854823,
"TAG12" : "tag12-2016",
"latencyms" : 310965.0
} ]
}
...
...
]
Notes:
1. granularity declares the time granularity of the aggregation. Currently, under MySQL, reports re-aggregate the results with a home-grown timePeriod implementation.
2. threshold declares the N in topN.
3. metric declares which metric the results are sorted by, in descending order.
4. dimension declares the dimension to group by; topN supports only a single dimension.
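The arithmetic postAggregation can be checked against the first result row of the output above: ack_div_latency is just ack / latencyms, computed per row after the sums.

```python
# Values taken from the first result row of the topN output above.
ack = 90300896          # longSum over ACKS
latencyms = 8.995165E7  # doubleSum over LATENCYMS

# The "/" arithmetic postAggregation divides the two aggregated fields.
ack_div_latency = ack / latencyms
```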
Multi-dimension aggregation: groupBy
demo:
{
"queryType": "groupBy",
"dataSource": "metrics-min",
"dimensions": ["TAG12", "TAG13", "TAG14"],
"granularity": "hour",
"filter": {
"type": "and",
"fields": [
{
"type": "selector",
"dimension": "TAG18",
"value": "tag18-9227"
}
]
},
"aggregations": [
{
"type": "longSum",
"name": "ack",
"fieldName": "ACKS"
},
{
"type": "doubleSum",
"name": "latencyms",
"fieldName": "LATENCYMS"
}
],
"postAggregations": [
{
"type": "arithmetic",
"name": "ack_div_latency",
"fn": "/",
"fields": [
{
"type": "fieldAccess",
"name": "ack",
"fieldName": "ack"
},
{
"type": "fieldAccess",
"name": "latencyms",
"fieldName": "latencyms"
}
]
}
],
"intervals": [
"2017-11-27T15:00:00.000/P1D"
],
"limitSpec": {
"columns": [
{
"dimension": "ack",
"direction": "descending",
"dimensionOrder": "numeric"
}
],
"limit": 500,
"type": "default"
}
}
Equivalent MySQL:
select
TAG12,
TAG13,
TAG14,
sum(ACKS) as ack,
sum(LATENCYMS) as latencyms,
sum(ACKS)/sum(LATENCYMS) as ack_div_latency
from
`metrics-min`
where
timestamp>='2017-11-27 15:00:00' and timestamp<'2017-11-28 15:00:00' and TAG18='tag18-9227'
group by
TAG12,TAG13,TAG14
order by
ack
desc limit 500;
output:
[ {
"version" : "v1",
"timestamp" : "2017-11-27T15:00:00.000+08:00",
"event" : {
"TAG13" : "tag13-0",
"TAG14" : "tag14-0",
"TAG12" : "tag12-0",
"ack" : 90300896,
"ack_div_latency" : 1.003882596928461,
"latencyms" : 8.995165E7
}
}, {
"version" : "v1",
"timestamp" : "2017-11-27T15:00:00.000+08:00",
"event" : {
"TAG13" : "tag13-1680",
"TAG14" : "tag14-1680",
"TAG12" : "tag12-1680",
"ack" : 331693,
"ack_div_latency" : 1.1468773987427994,
"latencyms" : 289214.0
}
}
...
...
]
Notes:
1. granularity declares the aggregation granularity of the returned results, i.e. the granularity of each point on a report chart. Currently, under MySQL, reports re-aggregate the results with a home-grown timePeriod implementation.
2. limitSpec declares the order-by columns, the sort direction, and the number of rows returned by limit.
3. dimensions declares the list of dimensions to group by; any number of dimensions is supported.
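The re-aggregation mentioned in note 1 (collapsing fine-grained buckets into the chart's period) can be sketched like this; reaggregate is an illustrative stand-in for the report's timePeriod logic, not the actual implementation.

```python
from collections import defaultdict

def reaggregate(rows, period_hours):
    """Collapse hourly rows into coarser buckets by summing the metric.

    rows: (hour_index, dimension_value, metric_value) tuples.
    period_hours: width of the target bucket in hours.
    """
    buckets = defaultdict(float)
    for hour, dim, value in rows:
        buckets[(hour // period_hours, dim)] += value
    return dict(buckets)

rows = [(0, "tag12-0", 10.0), (1, "tag12-0", 5.0), (3, "tag12-0", 1.0)]
out = reaggregate(rows, 2)  # collapse hourly rows into 2-hour buckets
```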
group by timestamp
--group by the timestamp only
demo:
{
"queryType": "timeseries",
"dataSource": "metrics-min",
"granularity": "hour",
"descending": true,
"filter": {
"type": "and",
"fields": [
{
"type": "selector",
"dimension": "TAG18",
"value": "tag18-9227"
}
]
},
"aggregations": [
{
"type": "longSum",
"name": "ack",
"fieldName": "ACKS"
},
{
"type": "doubleSum",
"name": "latencyms",
"fieldName": "LATENCYMS"
}
],
"postAggregations": [
{
"type": "arithmetic",
"name": "ack_div_latency",
"fn": "/",
"fields": [
{
"type": "fieldAccess",
"name": "ack",
"fieldName": "ack"
},
{
"type": "fieldAccess",
"name": "latencyms",
"fieldName": "latencyms"
}
]
}
],
"intervals": [
"2017-12-09T15:00:00.000/P1D"
]
}
Equivalent MySQL:
select
timestamp,
sum(ACKS) as ack,
sum(LATENCYMS) as latencyms,
sum(ACKS)/sum(LATENCYMS) as ack_div_latency
from
`metrics-min`
where
timestamp>='2017-12-09 15:00:00' and timestamp<'2017-12-10 15:00:00' and TAG18='tag18-9227'
group by
timestamp
order by
timestamp
desc;
output:
[ {
"timestamp" : "2017-12-10T10:43:00.000+08:00",
"result" : {
"ack" : 90290998,
"latencyms" : 9.0457349E7,
"ack_div_latency" : 0.9981610007164813
}
}, {
"timestamp" : "2017-12-10T10:42:00.000+08:00",
"result" : {
"ack" : 89444860,
"latencyms" : 9.0373426E7,
"ack_div_latency" : 0.9897252318397225
}
...
...
]
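The intervals in these demos use ISO-8601 start/duration notation (a start instant plus a period such as P1D for one day). A minimal sketch of expanding such an interval, handling only whole-day durations; real Druid (Joda-Time) intervals accept the full ISO-8601 period grammar.

```python
from datetime import datetime, timedelta

def expand_interval(interval):
    """Expand a 'start/PnD' interval string into (start, end) datetimes.

    Only whole-day durations (PnD) are handled in this sketch.
    """
    start_str, duration = interval.split("/")
    start = datetime.strptime(start_str, "%Y-%m-%dT%H:%M:%S.%f")
    days = int(duration.upper().lstrip("P").rstrip("D"))
    return start, start + timedelta(days=days)

start, end = expand_interval("2017-12-09T15:00:00.000/P1D")
```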
PV/UV computation (distinct)
--DataSketches approximation
demo:
{
"queryType": "groupBy",
"intervals": "2017-11-28T00:00:00.000+08:00/2017-11-30T00:00:00.000+08:00",
"dimensions": [],
"aggregations": [
{
"type": "thetaSketch",
"fieldName": "TAG12",
"name": "uniq_tag12",
"size": 131072
}
],
"granularity": "all",
"dataSource": "metrics-min",
"filter": { "type": "selector", "dimension": "TAG18", "value": "tag18-0" }
}
Equivalent MySQL:
select
count(distinct TAG12)
from
`metrics-min`
where
timestamp>='2017-11-28 00:00:00' and timestamp<'2017-11-30 00:00:00' and TAG18='tag18-0';
output:
[ {
"version" : "v1",
"timestamp" : "2017-11-28T00:00:00.000+08:00",
"event" : {
"uniq_tag12" : 1.0
}
} ]
Notes:
1. The size option in aggregations controls the accuracy of the DataSketch; larger values are more accurate.
2. uniq_tag12 in the output is the value of count distinct TAG12, which is 1 here.
3. A DataSketch computation only yields the final distinct count; it cannot return the individual dimension values.
4. The DataSketch error rate can be kept below 1%.
--exact count via group by
{
"queryType": "groupBy",
"intervals": "2017-11-28T00:00:00.000+08:00/2017-11-30T00:00:00.000+08:00",
"dimensions": ["TAG12"],
"aggregations": [
{
"type": "longSum",
"name": "sum_acks",
"fieldName": "ACKS"
}
],
"granularity": "all",
"dataSource": "metrics-min",
"filter": { "type": "selector", "dimension": "TAG18", "value": "tag18-0" }
}
Equivalent MySQL:
select
TAG12,
sum(ACKS) as sum_acks
from
`metrics-min`
where
timestamp>='2017-11-28 00:00:00' and timestamp<'2017-11-30 00:00:00' and TAG18='tag18-0' group by TAG12;
output:
[ {
"version" : "v1",
"timestamp" : "2017-11-28T00:00:00.000+08:00",
"event" : {
"TAG12" : "tag12-0",
"sum_acks" : 4331118453
}
} ]
Notes:
1. group by yields the exact count distinct TAG12, namely the total number of rows returned by the query (1).
2. group by also returns the concrete dimension values, namely tag12-0 for TAG12 in the output.
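Both notes can be read straight off the groupBy response: the exact distinct count is the number of event rows, and the dimension values come from each row. A sketch over the sample output above:

```python
def distinct_from_groupby(response):
    """Exact distinct count and dimension values from a groupBy response.

    response: the JSON-decoded list of {"event": {...}} rows.
    """
    values = [row["event"]["TAG12"] for row in response]
    return len(values), values

response = [{"version": "v1",
             "timestamp": "2017-11-28T00:00:00.000+08:00",
             "event": {"TAG12": "tag12-0", "sum_acks": 4331118453}}]
count, values = distinct_from_groupby(response)
```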
Druid SQL
http://druid.io/docs/latest/querying/sql.html