Apache Druid in Practice: Case Studies
Data Ingestion
Druid supports both streaming and batch ingestion. Based on how the data is produced, Druid divides external data sources into two categories:
① Streaming data sources (streaming)
Sources that continuously produce data, e.g. message queues, logs, and files being appended to.
② Static data sources (batch)
Sources whose data has already been fully produced and will not grow, e.g. files in a file system.
Static data can be ingested in two ways: from local files and from remote files (e.g. HDFS).
Static data source ingestion (local file)
Case study: load data from a local file into Druid, then run SQL queries on it.
① Ingest from a local file; first create the file
mkdir -p /opt/server/druid/
# Create the local file
vim /opt/server/druid/stock-trade.json
# Add the following content to the file
{"timestamp":"2019-10-01T01:03:00Z","sec_code":"浦发银行","platform":"pc","trade_vol":"20"}
{"timestamp":"2019-10-01T01:01:00Z","sec_code":"白云机场","platform":"pc","trade_vol":"10"}
{"timestamp":"2019-10-01T01:03:00Z","sec_code":"浦发银行","platform":"pc","trade_vol":"5"}
{"timestamp":"2019-10-01T05:01:00Z","sec_code":"浦发银行","platform":"pc","trade_vol":"1"}
{"timestamp":"2019-10-01T01:03:00Z","sec_code":"白云机场","platform":"pc","trade_vol":"8"}
{"timestamp":"2019-10-01T01:01:00Z","sec_code":"浦发银行","platform":"mobile","trade_vol":"7"}
{"timestamp":"2019-10-01T01:03:00Z","sec_code":"浦发银行","platform":"pc","trade_vol":"6"}
{"timestamp":"2019-10-01T03:01:00Z","sec_code":"东风汽车","platform":"pc","trade_vol":"5"}
{"timestamp":"2019-10-01T01:03:00Z","sec_code":"浦发银行","platform":"pc","trade_vol":"7"}
{"timestamp":"2019-10-01T01:01:00Z","sec_code":"白云机场","platform":"pc","trade_vol":"8"}
{"timestamp":"2019-10-01T05:03:00Z","sec_code":"浦发银行","platform":"mobile","trade_vol":"9"}
{"timestamp":"2019-10-01T01:01:00Z","sec_code":"东风汽车","platform":"pc","trade_vol":"23"}
{"timestamp":"2019-10-01T01:03:00Z","sec_code":"浦发银行","platform":"pc","trade_vol":"5"}
{"timestamp":"2019-10-01T01:01:00Z","sec_code":"白云机场","platform":"pc","trade_vol":"1"}
{"timestamp":"2019-10-01T03:03:00Z","sec_code":"浦发银行","platform":"mobile","trade_vol":"1"}
{"timestamp":"2019-10-01T01:01:00Z","sec_code":"东风汽车","platform":"pc","trade_vol":"2"}
{"timestamp":"2019-10-01T01:03:00Z","sec_code":"浦发银行","platform":"pc","trade_vol":"3"}
{"timestamp":"2019-10-01T05:01:00Z","sec_code":"浦发银行","platform":"pc","trade_vol":"5"}
{"timestamp":"2019-10-01T01:03:00Z","sec_code":"浦发银行","platform":"mobile","trade_vol":"6"}
{"timestamp":"2019-10-01T01:01:00Z","sec_code":"东风汽车","platform":"pc","trade_vol":"3"}
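Before submitting the ingestion task, it is worth verifying that every line of the file is valid, newline-delimited JSON, because a single malformed line can produce parse exceptions at ingest time. A minimal sanity check (assuming `python3` is on the PATH; shown here on two sample records written to a temporary file — in practice, point it at /opt/server/druid/stock-trade.json):

```shell
# Write two sample records to a temp file (stand-in for the real data file)
cat > /tmp/stock-trade-sample.json <<'EOF'
{"timestamp":"2019-10-01T01:03:00Z","sec_code":"浦发银行","platform":"pc","trade_vol":"20"}
{"timestamp":"2019-10-01T01:01:00Z","sec_code":"白云机场","platform":"pc","trade_vol":"10"}
EOF
# Fail loudly on the first malformed line
python3 - <<'EOF'
import json
with open('/tmp/stock-trade-sample.json') as f:
    for n, line in enumerate(f, 1):
        try:
            json.loads(line)
        except ValueError:
            raise SystemExit("line %d is not valid JSON" % n)
print("all lines are valid JSON")
EOF
```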
② Distribute stock-trade.json to the same directory on the other Druid nodes
Note: the data file must be present on every node; otherwise the task may report success but no DataSource will appear.
# Run on node1
scp /opt/server/druid/stock-trade.json node2:$PWD
scp /opt/server/druid/stock-trade.json node3:$PWD
③ Send the request that creates the DataSource
{
  "spec": {
    "dataSchema": {
      "dataSource": "stock_trade",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "dimensionsSpec": {
            "dimensions": [
              "sec_code",
              "platform"
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          }
        }
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "type": "longSum",
          "name": "trade_vol",
          "fieldName": "trade_vol",
          "expression": null
        }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "rollup": true,
        "intervals": [
          "2019-10-01T00:00:00.000Z/2019-10-03T00:00:00.000Z"
        ]
      },
      "transformSpec": {
        "filter": null,
        "transforms": []
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "/opt/server/druid",
        "filter": "stock-trade.json",
        "parser": null
      },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index",
      "maxRowsPerSegment": null,
      "maxRowsInMemory": 1000000,
      "maxBytesInMemory": 0,
      "maxTotalRows": null,
      "numShards": null,
      "partitionDimensions": [],
      "indexSpec": {
        "bitmap": {
          "type": "concise"
        },
        "dimensionCompression": "lz4",
        "metricCompression": "lz4",
        "longEncoding": "longs"
      },
      "maxPendingPersists": 0,
      "buildV9Directly": true,
      "forceGuaranteedRollup": false,
      "reportParseExceptions": false,
      "pushTimeout": 0,
      "segmentWriteOutMediumFactory": null,
      "logParseExceptions": false,
      "maxParseExceptions": 2147483647,
      "maxSavedParseExceptions": 0
    }
  },
  "type": "index"
}
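The spec above can be saved to a file and POSTed to the Overlord's task endpoint. A sketch, assuming the spec was saved as /opt/server/druid/index-stock-trade.json (a hypothetical file name) and the Overlord listens on node1:8090 — adjust host and port to your deployment:

```shell
# Submit the native batch ingestion task to the Overlord (host/port are assumptions)
curl -X POST \
  -H 'Content-Type: application/json' \
  -d @/opt/server/druid/index-stock-trade.json \
  http://node1:8090/druid/indexer/v1/task
```

On success the Overlord responds with the task id as JSON; the task's progress can then be followed in the web console under Tasks.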
④ View the DataSource
⑤ SQL query
-- Total trade volume of each stock on each platform
SELECT
sec_code,
platform,
SUM(trade_vol) AS total_trade_vol
FROM "stock_trade"
WHERE TIME_FORMAT("__time", 'yyyyMMdd') = '20191001'
GROUP BY sec_code, platform
Static data source ingestion (HDFS file)
Case study: load data stored on HDFS into Druid, then run SQL queries on it.
# How it works
1. Druid uses HadoopDruidIndexer to load the batch data and generate segment files, which are stored on HDFS
2. The segment files are downloaded from HDFS to local storage on the nodes
3. The data can then be queried from Druid
① Create the file wiki-visit-log.json and upload it to HDFS
vim wiki-visit-log.json
# Add the content below to the file, then upload it to HDFS
hdfs dfs -put wiki-visit-log.json /tmp/logs
{"time":"2015-09-12T00:47:00.496Z","channel":"#ca.wikipedia","cityName":null,"comment":"Robot inserta {{Commonscat}} que enllaça amb [[commons:category:Rallicula]]","countryIsoCode":null,"countryName":null,"isAnonymous":false,"isMinor":true,"isNew":false,"isRobot":true,"isUnpatrolled":false,"metroCode":null,"namespace":"Main","page":"Rallicula","regionIsoCode":null,"regionName":null,"user":"PereBot","delta":17,"added":17,"deleted":0}
{"time":"2015-09-12T00:47:05.474Z","channel":"#en.wikipedia","cityName":"Auburn","comment":"/* Status of peremptory norms under international law */ fixed spelling of 'Wimbledon'","countryIsoCode":"AU","countryName":"Australia","isAnonymous":true,"isMinor":false,"isNew":false,"isRobot":false,"isUnpatrolled":false,"metroCode":null,"namespace":"Main","page":"Peremptory norm","regionIsoCode":"NSW","regionName":"New South Wales","user":"60.225.66.142","delta":0,"added":0,"deleted":0}
{"time":"2015-09-12T00:47:08.770Z","channel":"#vi.wikipedia","cityName":null,"comment":"fix Lỗi CS1: ngày tháng","countryIsoCode":null,"countryName":null,"isAnonymous":false,"isMinor":true,"isNew":false,"isRobot":true,"isUnpatrolled":false,"metroCode":null,"namespace":"Main","page":"Apamea abruzzorum","regionIsoCode":null,"regionName":null,"user":"Cheers!-bot","delta":18,"added":18,"deleted":0}
{"time":"2015-09-12T00:47:11.862Z","channel":"#vi.wikipedia","cityName":null,"comment":"clean up using [[Project:AWB|AWB]]","countryIsoCode":null,"countryName":null,"isAnonymous":false,"isMinor":false,"isNew":false,"isRobot":true,"isUnpatrolled":false,"metroCode":null,"namespace":"Main","page":"Atractus flammigerus","regionIsoCode":null,"regionName":null,"user":"ThitxongkhoiAWB","delta":18,"added":18,"deleted":0}
{"time":"2015-09-12T00:47:13.987Z","channel":"#vi.wikipedia","cityName":null,"comment":"clean up using [[Project:AWB|AWB]]","countryIsoCode":null,"countryName":null,"isAnonymous":false,"isMinor":false,"isNew":false,"isRobot":true,"isUnpatrolled":false,"metroCode":null,"namespace":"Main","page":"Agama mossambica","regionIsoCode":null,"regionName":null,"user":"ThitxongkhoiAWB","delta":18,"added":18,"deleted":0}
{"time":"2015-09-12T00:47:17.009Z","channel":"#ca.wikipedia","cityName":null,"comment":"/* Imperi Austrohongarès */","countryIsoCode":null,"countryName":null,"isAnonymous":false,"isMinor":false,"isNew":false,"isRobot":false,"isUnpatrolled":false,"metroCode":null,"namespace":"Main","page":"Campanya dels Balcans (1914-1918)","regionIsoCode":null,"regionName":null,"user":"Jaumellecha","delta":-20,"added":0,"deleted":20}
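The ingestion spec for this case is not shown in the original notes. Below is a minimal sketch of a Hadoop-based index task in the same legacy parser style as the local-file example, under these assumptions: the datasource is named `wiki_visit` (a name chosen here for illustration), the file is read from /tmp/logs/wiki-visit-log.json on the cluster's default HDFS filesystem, and only a subset of the record's fields is kept as dimensions:

```json
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "wiki_visit",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "dimensionsSpec": {
            "dimensions": ["channel", "namespace", "page", "user", "isRobot"]
          },
          "timestampSpec": { "column": "time", "format": "auto" }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "longSum", "name": "added", "fieldName": "added" },
        { "type": "longSum", "name": "deleted", "fieldName": "deleted" },
        { "type": "longSum", "name": "delta", "fieldName": "delta" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "intervals": ["2015-09-12/2015-09-13"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "/tmp/logs/wiki-visit-log.json"
      }
    },
    "tuningConfig": { "type": "hadoop" }
  }
}
```

Like the local-file spec, it is submitted to the Overlord's task endpoint; the `index_hadoop` type is what makes Druid run the job through HadoopDruidIndexer.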
Streaming data source ingestion (Kafka)
① Create the topic
cd /opt/server/kafka_2.12-2.4.1
bin/kafka-topics.sh --create --zookeeper node1:2181 --partitions 1 --replication-factor 1 --topic metrics
② Submit the indexing (supervisor) task
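The supervisor spec for step ② is not included in the original notes. A minimal sketch in the same legacy parser style as the local-file example, assuming the datasource is named `metrics`, the records look like the sample shown in step ④ (`time`, `url`, `user`, `latencyMs`), and a broker is reachable at node1:9092; it would be POSTed to `/druid/indexer/v1/supervisor` on the Overlord (host and port are deployment-specific):

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "metrics",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "dimensionsSpec": { "dimensions": ["url", "user"] },
        "timestampSpec": { "column": "time", "format": "auto" }
      }
    },
    "metricsSpec": [
      { "type": "count", "name": "count" },
      { "type": "longSum", "name": "latency_ms", "fieldName": "latencyMs" }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "MINUTE",
      "rollup": true
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "maxRowsInMemory": 1000000
  },
  "ioConfig": {
    "topic": "metrics",
    "consumerProperties": { "bootstrap.servers": "node1:9092" },
    "taskCount": 1,
    "replicas": 1,
    "taskDuration": "PT1H",
    "useEarliestOffset": true
  }
}
```

Unlike a batch task, a supervisor keeps running: it spawns and replaces indexing tasks as Kafka offsets advance, so new records become queryable shortly after they are produced.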
③ Alternatively, create the task through the web UI
④ Start a Kafka producer
# Start the console producer (broker address assumed to be node1:9092, the Kafka default port)
bin/kafka-console-producer.sh --broker-list node1:9092 --topic metrics
# Send one record
{"time":"2019-07-23T17:57:58Z","url":"/foo/bar","user":"alice","latencyMs":32}
⑤ Configure the task
Connect
Parse Data
ParseTime
Transform
Filter
Configure Schema
Partition
Tune
⑥ Check the task after it is submitted
⑦ SQL query
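As an illustration of this last step, a query in the same style as the stock_trade example — assuming the Kafka-fed datasource is named `metrics` and the column names follow the sample record above (adjust them to the schema actually configured in the wizard):

```sql
-- Request count and total latency per user
SELECT
  "user",
  COUNT(*) AS request_cnt,
  SUM("latencyMs") AS total_latency_ms
FROM "metrics"
GROUP BY "user"
```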