CC00005.druid——|Hadoop&OLAP_Druid.V05|——|Druid.v05|Getting Started|Loading Streaming Data from Kafka.V1|

1. Loading Streaming Data from Kafka
### --- Loading streaming data from Kafka

~~~     Data and requirements: in a typical Druid architecture, complex data transformation and cleansing is not handled inside Druid
### --- Assume the following network traffic data:

~~~     ts: timestamp
~~~     srcip: source IP address
~~~     srcport: source port
~~~     dstip: destination IP address
~~~     dstPort: destination port
~~~     protocol: protocol
~~~     packets: number of packets transferred
~~~     bytes: number of bytes transferred
~~~     cost: time taken by the transfer
~~~     # The data is in JSON format and is delivered through Kafka

~~~     Each row contains: a timestamp (ts), dimension columns, and metric columns
~~~     Dimensions: srcip, srcport, dstip, dstPort, protocol
~~~     Metrics: packets, bytes, cost
~~~     # Metrics to compute (a metricsSpec sketch follows this list):
~~~     Number of records: count
~~~     packets: max
~~~     bytes: min
~~~     cost: sum
~~~     # Rollup granularity: one minute
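~~~     # These requirements map onto Druid aggregators roughly as shown below. A minimal metricsSpec sketch: the output names (max_packets, min_bytes, sum_cost) are illustrative and are actually chosen later in the Configure schema step, so they need not match the column names used elsewhere in this walkthrough
"metricsSpec": [
  { "type": "count",     "name": "count" },
  { "type": "longMax",   "name": "max_packets", "fieldName": "packets" },
  { "type": "longMin",   "name": "min_bytes",   "fieldName": "bytes" },
  { "type": "doubleSum", "name": "sum_cost",    "fieldName": "cost" }
]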
2. Prepare Test Data
### --- Test data:

{"ts":"2021-10-01T00:01:35Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":1, "bytes":1000, "cost": 0.1}
{"ts":"2021-10-01T00:01:36Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":2, "bytes":2000, "cost": 0.1}
{"ts":"2021-10-01T00:01:37Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":3, "bytes":3000, "cost": 0.1}
{"ts":"2021-10-01T00:01:38Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":4, "bytes":4000, "cost": 0.1}

{"ts":"2021-10-01T00:02:08Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":5, "bytes":5000, "cost": 0.2}
{"ts":"2021-10-01T00:02:09Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":6, "bytes":6000, "cost": 0.2}
{"ts":"2021-10-01T00:02:10Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":7, "bytes":7000, "cost": 0.2}
{"ts":"2021-10-01T00:02:11Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":8, "bytes":8000, "cost": 0.2}
{"ts":"2021-10-01T00:02:12Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":9, "bytes":9000, "cost": 0.2}
### --- Finally, run the query:

~~~     # Query the data
select * from tab;
~~~     # Output: two rows are returned; the values below are for illustration only
{"ts":"2021-11-01T00:01","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":5, "bytes":1000, "cost": 0.4, "count":4}
{"ts":"2021-11-01T00:02","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":9, "bytes":5000, "cost": 1.0, "count":5}
~~~     # Other queries

select dstPort, min(packets), max(bytes), sum(count), min(count)
from tab
group by dstPort;
3. Create a Topic and Send Messages
### --- Start the Kafka cluster and create a topic named "yanqidruid1":

~~~     # Start Kafka
[root@hadoop01 ~]# kafka-server-start.sh -daemon /opt/yanqi/servers/kafka_2.12/config/server.properties
### --- Create the topic
~~~     # Create the topic
~~~     # In --zookeeper hadoop01:2181,hadoop02:2181/myKafka, /myKafka is a chroot namespace; include or omit it to match your own Kafka installation

[root@hadoop01 ~]# kafka-topics.sh --create --zookeeper hadoop01:2181,hadoop02:2181/myKafka --replication-factor 2 --partitions 6 --topic yanqidruid1
### --- Start a producer

~~~     # Start a console producer
[root@hadoop01 ~]# kafka-console-producer.sh --broker-list hadoop01:9092,hadoop02:9092 --topic yanqidruid1
~~~     # Send a couple of sample records
{"ts":"2021-10-01T00:01:35Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":1, "bytes":1000, "cost": 0.1}
{"ts":"2021-10-01T00:02:08Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":5, "bytes":5000, "cost": 0.2}
4. Ingest Data from Kafka
### --- Ingest data from Kafka

~~~     # In a browser, open hadoop03:8888 and click Load data in the console
~~~     # Start: select Apache Kafka and click Connect data
### --- Connect

~~~     Enter hadoop01:9092,hadoop02:9092 in Bootstrap servers
~~~     Enter yanqidruid1 in Topic
~~~     Click Preview and make sure the data shown is correct
~~~     Then click "Next: Parse data" to move on to the next step (a sketch of the resulting ioConfig follows)
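~~~     # In the generated ingestion spec, the Connect step corresponds roughly to the ioConfig section of the Kafka supervisor spec. A minimal sketch, assuming you want to read from the earliest offsets (useEarliestOffset is an assumption, not a value taken from this walkthrough)
"ioConfig": {
  "type": "kafka",
  "consumerProperties": { "bootstrap.servers": "hadoop01:9092,hadoop02:9092" },
  "topic": "yanqidruid1",
  "useEarliestOffset": true
}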
### --- Parse data

~~~     The data loader tries to determine the correct parser for the data automatically; several parsers are available
~~~     The JSON parser is used here (an inputFormat sketch follows)
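~~~     # In recent Druid versions this choice becomes the inputFormat inside ioConfig; a minimal sketch
"inputFormat": { "type": "json" }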
### --- Parse time

~~~     Define the primary time column of the data (a timestampSpec sketch follows)
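~~~     # For this dataset the primary time column is ts, which is in ISO-8601 format; a minimal timestampSpec sketch
"timestampSpec": { "column": "ts", "format": "iso" }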
### --- Transform

~~~     Complex data transformations are not recommended inside Druid; consider handling them in a preprocessing stage instead
~~~     No transforms are defined here
### --- Filter

~~~     Complex data filtering is likewise not recommended inside Druid; consider handling it in a preprocessing stage instead
~~~     No filters are defined here
### --- Configure schema

~~~     Define the metric and dimension columns
~~~     Define how the metric columns are aggregated
~~~     Define whether data is rolled up at ingestion time (Rollup) and at what granularity (a sketch follows)
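~~~     # For this dataset, the schema step roughly produces the dimensionsSpec and rollup settings sketched below; the dimension list and minute-level rollup follow the requirements in section 1
"dimensionsSpec": {
  "dimensions": ["srcip", "srcport", "dstip", "dstPort", "protocol"]
},
"granularitySpec": {
  "rollup": true,
  "queryGranularity": "minute"
}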
### --- Partition
~~~     Define how the data is partitioned (a sketch follows this list)

~~~     # Primary partitioning supports two modes
~~~     uniform: groups data into segments by a fixed time interval; this is the recommended mode. Here each day's data forms one partition
~~~     arbitrary: tries to keep segments roughly the same size, with no fixed time interval

~~~     # Secondary partitioning
~~~     Max rows per segment: the maximum number of rows in each segment
~~~     Max total rows: the maximum number of rows, across segments, waiting to be published
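~~~     # In the generated spec, primary partitioning ends up as segmentGranularity and secondary partitioning as tuning limits. A minimal sketch; the row limits shown are illustrative values, not values taken from this walkthrough
"granularitySpec": {
  "segmentGranularity": "day"
},
"tuningConfig": {
  "type": "kafka",
  "maxRowsPerSegment": 5000000,
  "maxTotalRows": 20000000
}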
### --- Tune

~~~     Define parameters that control task execution and performance tuning
### --- Publish

~~~     Define the name of the Datasource
~~~     Define what to do when rows fail to parse (a sketch follows)
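~~~     # The datasource name used by the queries later in this walkthrough is yanqitable1; parse-failure handling maps onto tuningConfig options such as logParseExceptions and maxParseExceptions (the values below are assumptions)
"dataSchema": { "dataSource": "yanqitable1" },
"tuningConfig": {
  "type": "kafka",
  "logParseExceptions": true,
  "maxParseExceptions": 100
}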
### --- Edit spec

~~~     The JSON shown here is the full ingestion spec. You can go back to earlier steps to make changes,
~~~     or edit the spec directly and see the changes reflected in the earlier steps
~~~     Once the ingestion spec is complete, click Submit to create the data ingestion task (a condensed example of the assembled spec follows)
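~~~     # Putting the earlier steps together, the submitted Kafka supervisor spec looks roughly like the sketch below. The metric names (max_packets, min_bytes, sum_cost) and the tuning value are illustrative; the spec generated by your console, and the column names used in the section 5 queries, may differ, and in older Druid versions dataSchema/ioConfig/tuningConfig sit at the top level instead of under "spec"
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": { "bootstrap.servers": "hadoop01:9092,hadoop02:9092" },
      "topic": "yanqidruid1",
      "inputFormat": { "type": "json" },
      "useEarliestOffset": true
    },
    "dataSchema": {
      "dataSource": "yanqitable1",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": ["srcip", "srcport", "dstip", "dstPort", "protocol"]
      },
      "metricsSpec": [
        { "type": "count",     "name": "count" },
        { "type": "longMax",   "name": "max_packets", "fieldName": "packets" },
        { "type": "longMin",   "name": "min_bytes",   "fieldName": "bytes" },
        { "type": "doubleSum", "name": "sum_cost",    "fieldName": "cost" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "rollup": true,
        "queryGranularity": "minute",
        "segmentGranularity": "day"
      }
    },
    "tuningConfig": {
      "type": "kafka",
      "maxRowsPerSegment": 5000000
    }
  }
}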
5. Data Query
### --- Data query
### --- Running the queries

~~~     After the ingestion spec is published, a Supervisor is created
~~~     The Supervisor starts a Task that ingests data from Kafka
~~~     After a short wait the Datasource is created and can then be queried
### --- Write data into Kafka

~~~     # Start a console producer
[root@hadoop01 ~]# kafka-console-producer.sh --broker-list hadoop01:9092,hadoop02:9092 --topic yanqidruid1
~~~     # Send the test data
{"ts":"2021-10-01T00:01:35Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":1, "bytes":1000, "cost": 0.1}
{"ts":"2021-10-01T00:01:36Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":2, "bytes":2000, "cost": 0.1}
{"ts":"2021-10-01T00:01:37Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":3, "bytes":3000, "cost": 0.1}
{"ts":"2021-10-01T00:01:38Z","srcip":"6.6.6.6", "dstip":"8.8.8.8", "srcport":6666, "dstPort":8888, "protocol": "tcp", "packets":4, "bytes":4000, "cost": 0.1}

{"ts":"2021-10-01T00:02:08Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":5, "bytes":5000, "cost": 0.2}
{"ts":"2021-10-01T00:02:09Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":6, "bytes":6000, "cost": 0.2}
{"ts":"2021-10-01T00:02:10Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":7, "bytes":7000, "cost": 0.2}
{"ts":"2021-10-01T00:02:11Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":8, "bytes":8000, "cost": 0.2}
{"ts":"2021-10-01T00:02:12Z","srcip":"1.1.1.1", "dstip":"2.2.2.2", "srcport":6666, "dstPort":8888, "protocol": "udp", "packets":9, "bytes":9000, "cost": 0.2}
### --- Query the data

~~~     # View all of the data  -- note: rows with identical dimensions have been rolled up
select * from "yanqitable1"
~~~     # Other queries  -- "count" is quoted so it is treated as a column name (the quotes escape it; otherwise count would be parsed as a function and the query would fail)
select dstPort, min(sum_packets), max(min_bytes), sum("count"), min("count") from "yanqitable1" group by dstPort