Querying Data with Druid

Data Ingestion
Druid uses the Kafka indexing service to read data directly from Kafka, index it, and ingest it into Druid.
Developers only need to write the data, in JSON format, to the corresponding Kafka topic (see the producer sketch after the notes below).

-- JSON format demo
{
    "TIMESTAMP": 1512791296,
    "TAG1": "tag1-2404",
    "TAG2": "tag2-936",
    "TAG3": "tag3-119",
    "TAG4": "tag4-4534",
    ....
    "BYTES": 42841,
    "LATENCYMS": 8293,
    "ACKS": 3153,
    "ERROR_COUNT": 16149,
    "CLICK": 243,
    "POINT": 77,
    "PRICE": 765
}
Notes:
    1. To avoid time-zone issues, TIMESTAMP should be a 10-digit integer Unix timestamp (in seconds).
    2. Field names are case-sensitive.
    3. Dimension columns should avoid NULL values; in other words, keep dimension columns non-null wherever possible.
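A minimal producer sketch in Python, assuming the kafka-python library; the broker address and topic name are taken from the ioConfig example further below:

-- Kafka producer sketch (Python)
import json
import time

from kafka import KafkaProducer

# Broker address and topic match the ioConfig demo below.
producer = KafkaProducer(
    bootstrap_servers="192.168.1.115:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "TIMESTAMP": int(time.time()),  # 10-digit Unix time in seconds
    "TAG1": "tag1-2404",
    "TAG2": "tag2-936",
    "BYTES": 42841,
    "LATENCYMS": 8293,
    "ACKS": 3153,
}

producer.send("metrics", event)
producer.flush()
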
Defining a Datasource
Once the data is in Kafka, you need to define a datasource spec file that tells Druid how to ingest it; it declares the timestamp format, dimension columns, metric columns, pre-aggregation granularity, and so on.

-- datasource spec file demo
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "metrics-min",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "TIMESTAMP",
          "format": "posix"
        },
        "dimensionsSpec": {
          "dimensions": [
            "TAG1",
            "TAG2",
            ...
            "TAG30",
            "TAG31"
          ]
        }
      }
    },
    "metricsSpec": [
      {
    "name": "counter",
    "type": "count"
      },
      {
        "name": "BYTES",
        "fieldName": "BYTES",
        "type": "doubleSum"
      },
      {
        "name": "LATENCYMS",
        "fieldName": "LATENCYMS",
        "type": "doubleSum"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "MINUTE"
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "maxRowsPerSegment": 500000000
  },
  "ioConfig": {
    "topic": "metrics",
    "consumerProperties": {
      "bootstrap.servers": "192.168.1.115:9092"
    },
    "taskCount": 2,
    "taskDuration": "PT72M",
    "replicas": 1
  }
}

Notes:
    1. dataSource declares the name of the datasource.
    2. dimensionsSpec declares the dimension columns.
    3. metricsSpec declares the metric columns.
    4. ioConfig declares the address of the Kafka cluster to consume from and the topic name.
    5. taskCount declares how many consumer processes are started to consume the data.
    6. queryGranularity declares the pre-aggregation granularity.

To start ingestion, simply send a POST request to the overlord with the datasource spec file as the payload.
curl -X POST -H 'Content-Type: application/json' -d @metrics-min-kafka-supervisor.json http://192.168.1.112:8090/druid/indexer/v1/supervisor
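Once submitted, the supervisor can be checked through the overlord's HTTP API. A minimal sketch in Python, assuming the requests library and that the supervisor ID equals the dataSource name (metrics-min):

-- supervisor status check sketch (Python)
import requests

OVERLORD = "http://192.168.1.112:8090"

# List all running supervisors on the overlord.
resp = requests.get(OVERLORD + "/druid/indexer/v1/supervisor")
resp.raise_for_status()
print("supervisors:", resp.json())

# Check the status of the supervisor for this datasource.
status = requests.get(OVERLORD + "/druid/indexer/v1/supervisor/metrics-min/status")
status.raise_for_status()
print("state:", status.json()["payload"]["state"])
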
Querying Data

Select detail queries

Select detail query (first page)
demo:
{
   "queryType": "select",
   "dataSource": "metrics-min",
   "descending": "false",
   "dimensions":[],
   "metrics":[],
   "granularity": "all",
   "intervals": [
     "2017-12-09T15:10:00/2017-12-09T15:11:00"
   ],
   "pagingSpec":{"pagingIdentifiers": {}, "threshold":5}
 }
Equivalent MySQL:
    select 
     * 
    from 
        metrics-min 
    where 
        timestamp>='2017-12-09 15:10:00' and timestamp<'2017-12-09 15:11:00' 
    limit 0, 5;

    To run the query, POST the prepared JSON file to a broker node:
    curl -X POST '192.168.1.112:8082/druid/v2/?pretty' -H 'Content-Type:application/json' -d @query.json

output:
[ {
  "timestamp" : "2017-12-09T07:10:00.000Z",
  "result" : {
    "pagingIdentifiers" : {
      "metrics-min_2017-12-09T15:10:00.000+08:00_2017-12-09T15:11:00.000+08:00_2017-12-09T14:58:26.460+08:00" : 4
    },
    "dimensions" : [ "TAG26", "TAG27", "TAG24", "TAG25", "TAG22", "TAG23", "TAG20", "TAG21", "TAG1", "TAG9", "TAG8", "TAG7", "TAG6", "TAG5", "TAG19", "TAG4", "TAG3", "TAG17", "TAG2", "TAG18", "TAG15", "TAG16", "TAG13", "TAG14", "TAG11", "TAG12", "TAG31", "TAG10", "TAG30", "TAG28", "TAG29" ],
    "metrics" : [ "BYTES", "ACKS", "PRICE", "LATENCYMS", "ERROR_COUNT", "counter", "CLICK", "POINT" ],
    "events" : [ {
      "segmentId" : "metrics-min_2017-12-09T15:10:00.000+08:00_2017-12-09T15:11:00.000+08:00_2017-12-09T14:58:26.460+08:00",
      "offset" : 0,
      "event" : {
        "timestamp" : "2017-12-09T15:10:00.000+08:00",
        "TAG1" : "tag1-0",
        "TAG2" : "tag2-1384",
        "TAG3" : "tag3-191",
    ...
    ...

-- second-page query
demo: 
 {
   "queryType": "select",
   "dataSource": "metrics-min",
   "descending": "false",
   "dimensions":[],
   "metrics":[],
   "granularity": "all",
   "intervals": [
     "2017-12-09T15:10:00/2017-12-09T15:11:00"
   ],
   "pagingSpec":{"pagingIdentifiers": {"metrics-min_2017-12-09T15:10:00.000+08:00_2017-12-09T15:11:00.000+08:00_2017-12-09T14:58:26.460+08:00" : 5}, "threshold":5}
 }
Equivalent MySQL:
select 
    * 
from 
    metrics-min 
where 
    timestamp>='2017-12-09 15:10:00' and timestamp<'2017-12-09 15:11:00' 
limit 5, 5;

Notes:
    1. The dimensions and metrics options specify which dimension and metric columns to return; leaving them empty returns all columns. Avoid the equivalent of select * and return only the fields you need.
    2. pagingSpec controls pagination. Every query result returns the pagingIdentifiers of that query; for the next page, pass the previous pagingIdentifiers with each value incremented by 1 (see the paging sketch below).
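A minimal paging sketch in Python, assuming the requests library and the broker address from the curl example above; it applies the +1 rule from note 2 to walk through all pages:

-- select paging sketch (Python)
import requests

BROKER = "http://192.168.1.112:8082/druid/v2/"

query = {
    "queryType": "select",
    "dataSource": "metrics-min",
    "descending": False,
    "dimensions": [],
    "metrics": [],
    "granularity": "all",
    "intervals": ["2017-12-09T15:10:00/2017-12-09T15:11:00"],
    "pagingSpec": {"pagingIdentifiers": {}, "threshold": 5},
}

while True:
    resp = requests.post(BROKER, json=query)
    resp.raise_for_status()
    results = resp.json()
    if not results or not results[0]["result"]["events"]:
        break  # no more rows
    result = results[0]["result"]
    for e in result["events"]:
        print(e["event"])
    # Next page: reuse the returned pagingIdentifiers, each offset + 1.
    query["pagingSpec"]["pagingIdentifiers"] = {
        seg: off + 1 for seg, off in result["pagingIdentifiers"].items()
    }
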

Single-dimension aggregation: topN

A topN query performs a group-by aggregation on a single dimension, but is more efficient than groupBy.
demo:
{
  "queryType": "topN",
  "dataSource": "metrics-min",
  "dimension": "TAG12",
  "threshold": 2,
  "metric": "ack",
  "granularity": "minute",
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "TAG18",
        "value": "tag18-9227"
      },
      {
        "type": "selector",
        "dimension": "TAG14",
        "value": "tag14-2028"
      }
    ]
  },
  "aggregations": [
    {
      "type": "longSum",
      "name": "ack",
      "fieldName": "ACKS"
    },
    {
      "type": "doubleSum",
      "name": "latencyms",
      "fieldName": "LATENCYMS"
    }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "ack_div_latency",
      "fn": "/",
      "fields": [
        {
          "type": "fieldAccess",
          "name": "ack",
          "fieldName": "ack"
        },
        {
          "type": "fieldAccess",
          "name": "latencyms",
          "fieldName": "latencyms"
        }
      ]
    }
  ],
  "intervals": [
    "2017-11-27T15:00:00.000/2017-11-28T15:00:00.000"
  ]
}
Equivalent MySQL:
select 
    TAG12,
    sum(ACKS) as ack,
    sum(LATENCYMS) as latencyms,
    sum(ACKS)/sum(LATENCYMS) as ack_div_latency 
from 
    metrics-min 
where 
    timestamp>='2017-11-27 15:00:00' and timestamp<'2017-11-28 15:00:00' and TAG14='tag14-2028' and TAG18='tag18-9227' 
group by 
    TAG12 
order by 
    ack 
desc limit 2;

output:
[ {
  "timestamp" : "2017-11-27T15:00:00.000+08:00",
  "result" : [ {
    "ack" : 90300896,
    "ack_div_latency" : 1.003882596928461,
    "TAG12" : "tag12-0",
    "latencyms" : 8.995165E7
  }, {
    "ack" : 331693,
    "ack_div_latency" : 1.1468773987427994,
    "TAG12" : "tag12-1680",
    "latencyms" : 289214.0
  } ]
}, {
  "timestamp" : "2017-11-27T16:00:00.000+08:00",
  "result" : [ {
    "ack" : 90986653,
    "ack_div_latency" : 1.0018361726305511,
    "TAG12" : "tag12-0",
    "latencyms" : 9.0819892E7
  }, {
    "ack" : 307253,
    "ack_div_latency" : 0.9880629652854823,
    "TAG12" : "tag12-2016",
    "latencyms" : 310965.0
  } ]
}
    ...
    ...
]

Notes:
    1. granularity declares the time granularity of the aggregation. Currently, under MySQL, reports re-aggregate results with a self-implemented timePeriod.
    2. threshold declares the N in topN.
    3. metric declares which metric the results are sorted by, in descending order.
    4. dimension declares the dimension to group by; topN supports only a single dimension. A sketch for flattening the nested result follows.
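The topN output above is nested per time bucket. A small, hypothetical helper to flatten it into plain rows (pure Python; no Druid-specific API assumed):

-- topN result flattening sketch (Python)
def flatten_topn(results):
    """Flatten Druid topN output into one dict per (bucket, dimension value)."""
    rows = []
    for bucket in results:
        for row in bucket["result"]:
            rows.append({"timestamp": bucket["timestamp"], **row})
    return rows

# With the sample output above this yields rows such as:
#   {"timestamp": "2017-11-27T15:00:00.000+08:00",
#    "TAG12": "tag12-0", "ack": 90300896, ...}
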

Multi-dimension aggregation: groupBy

demo:
{
  "queryType": "groupBy",
  "dataSource": "metrics-min",
  "dimensions": ["TAG12", "TAG13", "TAG14"],
  "granularity": "hour",
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "TAG18",
        "value": "tag18-9227"
      }
    ]
  },
  "aggregations": [
    {
      "type": "longSum",
      "name": "ack",
      "fieldName": "ACKS"
    },
    {
      "type": "doubleSum",
      "name": "latencyms",
      "fieldName": "LATENCYMS"
    }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "ack_div_latency",
      "fn": "/",
      "fields": [
        {
          "type": "fieldAccess",
          "name": "ack",
          "fieldName": "ack"
        },
        {
          "type": "fieldAccess",
          "name": "latencyms",
          "fieldName": "latencyms"
        }
      ]
    }
  ],
  "intervals": [
    "2017-11-27T15:00:00.000/p1d"
  ],
  "limitSpec": {
      "columns": [
          {   
              "dimension": "ack",
              "direction": "descending",
              "dimensionOrder": "numeric"
          }
      ],
      "limit": 500,
      "type": "default"
  }
}
Equivalent MySQL:
select 
    TAG12,
    TAG13,
    TAG14,
    sum(ACKS) as ack,
    sum(LATENCYMS) as latencyms,
    sum(ACKS)/sum(LATENCYMS) as ack_div_latency 
from 
    metrics-min 
where 
    timestamp>='2017-11-27 15:00:00' and timestamp<'2017-11-28 15:00:00' and TAG18='tag18-9227' 
group by 
    TAG12,TAG13,TAG14 
order by 
    ack 
desc limit 500;

output:
[ {
  "version" : "v1",
  "timestamp" : "2017-11-27T15:00:00.000+08:00",
  "event" : {
    "TAG13" : "tag13-0",
    "TAG14" : "tag14-0",
    "TAG12" : "tag12-0",
    "ack" : 90300896,
    "ack_div_latency" : 1.003882596928461,
    "latencyms" : 8.995165E7
  }
}, {
  "version" : "v1",
  "timestamp" : "2017-11-27T15:00:00.000+08:00",
  "event" : {
    "TAG13" : "tag13-1680",
    "TAG14" : "tag14-1680",
    "TAG12" : "tag12-1680",
    "ack" : 331693,
    "ack_div_latency" : 1.1468773987427994,
    "latencyms" : 289214.0
  }
}
    ...
    ...
]

Notes:
    1. granularity declares the aggregation granularity of the returned results, corresponding to the granularity of each point on a report chart. Currently, under MySQL, reports re-aggregate results with a self-implemented timePeriod; a re-aggregation sketch follows this list.
    2. limitSpec declares the order-by column, the sort direction, and the number of rows to return (limit).
    3. dimensions declares the list of dimensions to group by; any number of dimensions is supported.
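A minimal sketch of the client-side time re-aggregation mentioned in note 1, coarsening the hourly groupBy rows above into larger buckets; the 4-hour bucket size and the single TAG12 grouping key are assumptions for illustration:

-- client-side time re-aggregation sketch (Python)
from collections import defaultdict
from datetime import datetime, timedelta

def reaggregate(rows, hours=4):
    """Merge hourly groupBy rows into `hours`-sized buckets, summing
    the additive metrics."""
    buckets = defaultdict(lambda: {"ack": 0, "latencyms": 0.0})
    for row in rows:
        ts = datetime.fromisoformat(row["timestamp"])
        # Align the timestamp down to its bucket boundary.
        start = ts - timedelta(hours=ts.hour % hours, minutes=ts.minute,
                               seconds=ts.second, microseconds=ts.microsecond)
        key = (start, row["event"]["TAG12"])
        buckets[key]["ack"] += row["event"]["ack"]
        buckets[key]["latencyms"] += row["event"]["latencyms"]
    # Post-aggregations such as ack_div_latency must be recomputed
    # after merging; they cannot simply be summed.
    return {k: {**v, "ack_div_latency": v["ack"] / v["latencyms"]}
            for k, v in buckets.items()}
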

group by timestamp

-- group by timestamp only
demo:
{
  "queryType": "timeseries",
  "dataSource": "metrics-min",
  "granularity": "hour",
  "descending": true,
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "TAG18",
        "value": "tag18-9227"
      }
    ]
  },
  "aggregations": [
    {
      "type": "longSum",
      "name": "ack",
      "fieldName": "ACKS"
    },
    {
      "type": "doubleSum",
      "name": "latencyms",
      "fieldName": "LATENCYMS"
    }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "ack_div_latency",
      "fn": "/",
      "fields": [
        {
          "type": "fieldAccess",
          "name": "ack",
          "fieldName": "ack"
        },
        {
          "type": "fieldAccess",
          "name": "latencyms",
          "fieldName": "latencyms"
        }
      ]
    }
  ],
  "intervals": [
    "2017-12-09T15:00:00.000/p1d"
  ]
}
Equivalent MySQL:
select 
    timestamp,
    sum(ACKS) as ack,
    sum(LATENCYMS) as latencyms,
    sum(ACKS)/sum(LATENCYMS) as ack_div_latency 
from 
    metrics-min 
where 
    timestamp>='2017-12-09 15:00:00' and timestamp<'2017-12-10 15:00:00' and TAG18='tag18-9227' 
group by 
    timestamp 
order by 
    timestamp 
desc;

output:
[ {
  "timestamp" : "2017-12-10T10:43:00.000+08:00",
  "result" : {
    "ack" : 90290998,
    "latencyms" : 9.0457349E7,
    "ack_div_latency" : 0.9981610007164813
  }
}, {
  "timestamp" : "2017-12-10T10:42:00.000+08:00",
  "result" : {
    "ack" : 89444860,
    "latencyms" : 9.0373426E7,
    "ack_div_latency" : 0.9897252318397225
  }
  ...
  ...
]

PV/UV calculation (distinct)

-- DataSketches estimation
demo:
{
    "queryType": "groupBy",
    "intervals": "2017-11-28T00:00:00.000+08:00/2017-11-30T00:00:00.000+08:00",
    "dimensions": [],
    "aggregations": [
        {
            "type": "thetaSketch",
            "fieldName": "TAG12",
            "name": "uniq_tag12",
            "size": 131072
        }
    ],
    "granularity": "all",
    "dataSource": "metrics-min",
    "filter": { "type": "selector", "dimension": "TAG18", "value": "tag18-0" }
}
Equivalent MySQL:
select 
    count(distinct TAG12) 
from 
    metrics-min 
where 
    timestamp>='2017-11-28 00:00:00' and timestamp<'2017-11-30 00:00:00' and TAG18='tag18-0';

output:
[ {
  "version" : "v1",
  "timestamp" : "2017-11-28T00:00:00.000+08:00",
  "event" : {
    "uniq_tag12" : 1.0
  }
} ]

Notes:
    1. The size option in aggregations controls the accuracy of the DataSketch; the larger the value, the more accurate the estimate.
    2. uniq_tag12 in the output is the value of count(distinct TAG12), here 1.
    3. A DataSketch computation yields only the final distinct count; it cannot return the concrete dimension values.
    4. The DataSketch error rate can be kept below 1%.

-- exact count with group by
{
    "queryType": "groupBy",
    "intervals": "2017-11-28T00:00:00.000+08:00/2017-11-30T00:00:00.000+08:00",
    "dimensions": ["TAG12"],
    "aggregations": [
        {
            "type": "longSum",
            "name": "sum_acks",
            "fieldName": "ACKS"
        }
    ],
    "granularity": "all",
    "dataSource": "metrics-min",
    "filter": { "type": "selector", "dimension": "TAG18", "value": "tag18-0" }
}
Equivalent MySQL:
select 
    TAG12,
    sum(ACKS) as sum_acks
from 
    metrics-min 
where 
    timestamp>='2017-11-28 00:00:00' and timestamp<'2017-11-30 00:00:00' and TAG18='tag18-0' group by TAG12;

output:
[ {
  "version" : "v1",
  "timestamp" : "2017-11-28T00:00:00.000+08:00",
  "event" : {
    "TAG12" : "tag12-0",
    "sum_acks" : 4331118453
  }
} ]

Notes:
    1. Using group by gives the exact value of count(distinct TAG12): it is the total number of rows returned by the query, here 1.
    2. Using group by also yields the concrete dimension values, i.e. the value tag12-0 for TAG12 in the output.

Druid SQL

http://druid.io/docs/latest/querying/sql.html
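
Druid also exposes a SQL endpoint on the broker (/druid/v2/sql). A minimal sketch, assuming SQL support is enabled on this cluster and reusing the broker address from the examples above; __time is Druid's built-in time column:

-- Druid SQL over HTTP sketch (Python)
import requests

BROKER_SQL = "http://192.168.1.112:8082/druid/v2/sql"

sql = """
SELECT TAG12, SUM(ACKS) AS ack
FROM "metrics-min"
WHERE TAG18 = 'tag18-9227'
  AND __time >= TIMESTAMP '2017-11-27 15:00:00'
  AND __time <  TIMESTAMP '2017-11-28 15:00:00'
GROUP BY TAG12
ORDER BY ack DESC
LIMIT 2
"""

resp = requests.post(BROKER_SQL, json={"query": sql})
resp.raise_for_status()
for row in resp.json():
    print(row)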