Waterdrop Installation, Deployment, and Data Input/Output
0. Preparation
- The server runs CentOS 7.6 (1810).
- Every component is installed and deployed with Docker, so install Docker first; the installation itself is not covered here. Note: change Docker's data directory and enable the Docker service on boot (see the commands after this list).
- The JDK is OpenJDK 1.8.0_292.
- Python is the CentOS-bundled 2.7.5; versions 3.7.x and above cause compatibility problems.
- Stop and disable the firewall (see the commands after this list).
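The notes above can be applied with commands like the following; this is only a sketch, /data/docker is an example path, and the data-root key assumes a reasonably recent Docker release.
# Move Docker's data directory to a larger disk (example path) and start Docker on boot
mkdir -p /data/docker /etc/docker
cat > /etc/docker/daemon.json <<'EOF'
{
  "data-root": "/data/docker"
}
EOF
systemctl enable --now docker
# Stop and disable the firewall
systemctl stop firewalld
systemctl disable firewalld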
1. Install Elasticsearch
1. Download Elasticsearch
Pull the 7.13.2 release of Elasticsearch from the official Docker registry:
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.13.2
2. Start Elasticsearch
A three-node ES cluster is set up here.
1. Install the docker-compose command
wget https://github.com/docker/compose/releases/download/1.23.0-rc3/docker-compose-Linux-x86_64
mv docker-compose-Linux-x86_64 /usr/local/bin/docker-compose
chmod 755 /usr/local/bin/docker-compose
docker-compose --version
2. Create the docker-compose.yml file
vi docker-compose.yml
Add the following content:
version: '2.2'
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.13.2
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data01:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - elastic
    privileged: true
    restart: always
  es02:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.13.2
    container_name: es02
    environment:
      - node.name=es02
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data02:/usr/share/elasticsearch/data
    networks:
      - elastic
    privileged: true
    restart: always
  es03:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.13.2
    container_name: es03
    environment:
      - node.name=es03
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es02
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data03:/usr/share/elasticsearch/data
    networks:
      - elastic
    privileged: true
    restart: always
volumes:
  data01:
    driver: local
  data02:
    driver: local
  data03:
    driver: local
networks:
  elastic:
    driver: bridge
3. Start the cluster
Adjust the system configuration:
vi /etc/sysctl.conf
Add the following line:
vm.max_map_count=655360
Save and exit, then run:
sysctl -p
Start the Elasticsearch cluster:
docker-compose up -d
4. Check the startup status
Run:
curl -X GET "localhost:9200/_cat/nodes?v=true&pretty"
If the output looks like the figure, the cluster has started normally.
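If the figure is unavailable, the same check can be done with the cluster health API; number_of_nodes should be 3 and the status green or yellow:
curl -X GET "localhost:9200/_cluster/health?pretty"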
2. Install MariaDB
1. Download MariaDB
docker pull docker.io/mariadb
2. Start MariaDB
docker run -itd --privileged=true --restart=always --network=host -e MYSQL_ROOT_PASSWORD=Abc123456 -v /home/mariadb/data:/var/lib/mysql -v /etc/localtime:/etc/localtime --name=$container_name mariadb
Replace $container_name with a container name of your choice, e.g. mariadb.
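To confirm that MariaDB is up, connect with the client shipped inside the image (assuming the container was named mariadb):
docker exec -it mariadb mysql -uroot -pAbc123456 -e "SELECT VERSION();"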
3. Install Kafka
1. Download Kafka
The Kafka version used here is kafka_2.13-2.7.0:
docker pull docker.io/wurstmeister/kafka
2. Download ZooKeeper
docker pull docker.io/wurstmeister/zookeeper
3. Download kafka-manager
docker pull docker.io/sheepkiller/kafka-manager
4. Create the docker-compose.yml file
vi docker-compose.yml
Add the following content:
version: '2'
services:
  zookeeper:
    container_name: srv-zookeeper
    image: wurstmeister/zookeeper
    volumes:
      - ./data:/data
    ports:
      - "2181:2181"
    privileged: true
    restart: always
  kafka:
    container_name: srv-kafka
    image: wurstmeister/kafka
    ports:
      - "9092:9092"
    environment:
      KAFKA_ADVERTISED_HOST_NAME: localhost
      KAFKA_MESSAGE_MAX_BYTES: 2000000
      KAFKA_CREATE_TOPICS: "Topic1:1:3,Topic2:1:1:compact"
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    volumes:
      - ./kafka-logs:/kafka
      - /var/run/docker.sock:/var/run/docker.sock
    privileged: true
    restart: always
  kafka-manager:
    container_name: srv-kafka-manager
    image: sheepkiller/kafka-manager
    ports:
      - 9020:9000
    environment:
      ZK_HOSTS: zookeeper:2181
    privileged: true
    restart: always
5. Start
docker-compose up -d
6. Check the startup status
Open the corresponding address in a browser:
http://ip:9020
If the page shown in the figure appears, the startup succeeded.
7. Message sending test
Open two terminal windows and enter the container in both:
docker exec -it srv-kafka bash
Change to the following directory:
cd /opt/kafka
Create a topic:
bin/kafka-topics.sh --create --topic quickstart-events --bootstrap-server localhost:9092
Write events to the topic; once the producer starts, type some arbitrary text:
bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092
In the other window, read the events that were written; the previously written events will appear there:
bin/kafka-console-consumer.sh --topic quickstart-events --from-beginning --bootstrap-server localhost:9092
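As an extra check, the topic list and details can be queried from the same directory inside the container:
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
bin/kafka-topics.sh --describe --topic quickstart-events --bootstrap-server localhost:9092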
4. Install Spark
1. Download Spark
Download link (version 2.4.8 is used here):
https://archive.apache.org/dist/spark/
2. Install Spark
Extract the downloaded archive:
tar -zxvf spark-2.4.8-bin-hadoop2.7.tgz
Run:
./bin/spark-shell
If output like the figure appears, Spark started successfully.
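If the interactive shell is not needed, a quick smoke test can also be done with the bundled example job, which prints an approximation of Pi:
./bin/run-example SparkPi 10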
5. Install Waterdrop
1. Download Waterdrop
Download link (the [Stable] v1.5.1 release is used here):
https://github.com/InterestingLab/waterdrop/releases
2. Configure Waterdrop
Extract the downloaded archive:
unzip waterdrop-1.5.1.zip
Set the SPARK_HOME path:
vi config/waterdrop-env.sh
Set it to the directory where Spark was extracted in the previous section, in this case:
SPARK_HOME=/home/spark/spark-2.4.8-bin-hadoop2.7
Create the config/application.conf file; it defines how data is read, processed, and written once Waterdrop starts.
spark {
  # Waterdrop defined streaming batch duration in seconds
  spark.streaming.batchDuration = 5
  spark.app.name = "Waterdrop"
  spark.ui.port = 13000
}
input {
  socketStream {}
}
filter {
  split {
    fields = ["msg", "name"]
    delimiter = ","
  }
}
output {
  stdout {}
}
3. Start Waterdrop
Open a new terminal window and install the nc command:
yum install nc
Run:
nc -lk 9999
Return to the previous window and start Waterdrop:
./bin/start-waterdrop.sh --master local[4] --deploy-mode client --config ./config/application.conf
If output like the figure appears, the startup succeeded.
Type the following into the nc window to test:
Hello World, Waterdrop
If Waterdrop prints a log like the figure, the test succeeded.
Background start:
nohup ./bin/start-waterdrop.sh --master local[4] --deploy-mode client --config ./config/application.conf &
The log is written to a nohup.out file in the current directory.
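The log can then be followed with:
tail -f nohup.out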
7. Sending messages with Kafka
Open ip:9020 in a browser to reach the Kafka management UI and create a Cluster as shown in the figure.
Click into the Cluster and create a Topic, as shown in the figure.
Run producer.py and consumer.py in two separate windows; remember to adjust the IP address and topic name accordingly.
producer.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from pykafka import KafkaClient
import json
import uuid
import time

kafkaServers = "192.168.1.167:9092"
producerName = "msgProducer"
topic = "quickstart-events"
msgKey = ""
dataClass = ""
dataType = "json"

def buildKafkaMsg():
    # Build a test message carrying a unique id and the send time
    nowStr = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time()))
    # data = {"name": "A", "sex": "B", "age": "18", "time": nowStr}
    kafkaMsg = {"producer": producerName, "msgId": str(uuid.uuid1()), "sentTime": nowStr,
                "topic": topic, "dataClass": dataClass, "dataType": dataType}
    # Serialize as JSON so the downstream json filter can parse it
    return json.dumps(kafkaMsg)

def sendKafkaMsg(msg):
    # Send one message synchronously to the configured topic
    client = KafkaClient(hosts=kafkaServers)
    topicdocu = client.topics[topic]
    producer = topicdocu.get_sync_producer()
    producer.produce(msg, partition_key=msgKey)
    producer.stop()
    print time.ctime(), ", send msg len=", len(msg)

if __name__ == "__main__":
    # Send 100 messages, one every 5 seconds
    for i in range(100):
        msg = buildKafkaMsg()
        sendKafkaMsg(msg)
        time.sleep(5)
consumer.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from kafka import KafkaConsumer

kafkaServers = ['192.168.1.167:9092']
msgTopic = "quickstart-events"
msgKey = ""

# Example of the kind of JSON record that may arrive on the topic:
# {"cardFaceNumber":"0000030020425802","transDate":"20170629","transTime":"154512",
#  "transType":"88","transAmount":"90","lineID":"000012","posNo":"01000733"}

def startConsumer():
    # Subscribe to the topic and print every record that arrives
    consumer = KafkaConsumer(msgTopic, bootstrap_servers=kafkaServers)
    print "consumer starting for topic=", msgTopic
    for rec_msg in consumer:
        print rec_msg

if __name__ == "__main__":
    startConsumer()
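Both scripts depend on third-party Kafka clients (pykafka for the producer, kafka-python for the consumer). Assuming pip is available for the system Python 2.7, they can be installed and the scripts run as follows:
pip install pykafka kafka-python
python producer.py
python consumer.py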
Run results:
8. Receiving Kafka messages with Waterdrop
Edit Waterdrop's config/application.conf and add the Kafka configuration inside input{}:
kafkaStream {
  topics = "quickstart-events"
  consumer.bootstrap.servers = "localhost:9092"
  consumer.group.id = "waterdrop_group"
}
Run producer.py, and in the Waterdrop directory run:
./bin/start-waterdrop.sh --master local[4] --deploy-mode client --config ./config/application.conf
As shown in the figure, Waterdrop has received the messages sent to Kafka.
9. Writing messages to ES and MariaDB with Waterdrop
Edit Waterdrop's config/application.conf and add the json plugin configuration inside filter{}:
json {
  source_field = "raw_message"
}
Edit config/application.conf again and add the mysql and elasticsearch configurations inside output{}. Also create the corresponding database and table in the database (a sketch of the DDL follows the save_mode notes below); the table's column names are the field titles circled in the figure in section 8 and must match the figure exactly.
mysql {
  url = "jdbc:mysql://192.168.1.167:3306/waterdrop"
  table = "waterdrop"
  user = "root"
  password = "newsys"
  save_mode = "overwrite"
}
elasticsearch {
  hosts = ["192.168.1.167:9200"]
  index = "waterdrop-${now}"
  es.batch.size.entries = 100000
  index_time_format = "yyyy.MM.dd"
}
The available save_mode values are:
- error (default): when saving, an exception is thrown if the data already exists.
- append: when saving, if the data/table already exists, the contents of the DataFrame are appended to the existing data.
- overwrite: when saving, if the data/table already exists, the existing data is replaced by the contents of the DataFrame, i.e. the previous data is cleared and rewritten.
- ignore: when saving, if the data already exists, the save operation writes nothing and leaves the existing data unchanged.
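A minimal sketch of the corresponding DDL, assuming the MariaDB container from section 2 was named mariadb and the columns correspond to the fields sent by producer.py; the real column names and credentials must match the figure in section 8 and the mysql output settings above.
docker exec -it mariadb mysql -uroot -pAbc123456 -e "
CREATE DATABASE IF NOT EXISTS waterdrop;
CREATE TABLE IF NOT EXISTS waterdrop.waterdrop (
  producer  VARCHAR(64),
  msgId     VARCHAR(64),
  sentTime  VARCHAR(32),
  topic     VARCHAR(64),
  dataClass VARCHAR(64),
  dataType  VARCHAR(32)
);"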
The final application.conf file is shown in the figure.
Run producer.py, and in the Waterdrop directory run:
./bin/start-waterdrop.sh --master local[4] --deploy-mode client --config ./config/application.conf
Connect to the database with Navicat and check the corresponding table: Waterdrop has written the data into it.
Install the Elasticvue browser extension, open it, add the corresponding Elasticsearch connection, and test the connection.
Click INDICES, then click the corresponding index name.
You can see that Waterdrop has also written the data into the index.
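Besides Elasticvue, the data can also be verified directly against the Elasticsearch REST API:
curl "192.168.1.167:9200/_cat/indices/waterdrop-*?v"
curl "192.168.1.167:9200/waterdrop-*/_search?pretty&size=3"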
10. Storing data from multiple topics in multiple tables
Add the corresponding settings to the configuration (the second topic must also exist; see the command after the configuration):
spark {
  # Waterdrop defined streaming batch duration in seconds
  spark.streaming.batchDuration = 5
  spark.app.name = "Waterdrop"
  spark.ui.port = 13000
}
input {
  #socketStream {}
  # Consume multiple topics
  kafkaStream {
    topics = "quickstart-events,quickstart-events1"
    consumer.bootstrap.servers = "localhost:9092"
    consumer.group.id = "waterdrop_group"
    result_table_name = "table_input_0"
  }
}
filter {
  json {
    source_table_name = "table_input_0"
    source_field = "raw_message"
    result_table_name = "table_filter_json"
  }
  # Use sql to split the different topics into separate tables
  sql {
    sql = "select * from table_filter_json where topic = 'quickstart-events'"
    result_table_name = "table_filter_sql_0"
  }
  sql {
    sql = "select * from table_filter_json where topic = 'quickstart-events1'"
    result_table_name = "table_filter_sql_1"
  }
}
output {
  #stdout {
  #}
  # Write each table to its own database table
  mysql {
    source_table_name = "table_filter_sql_0"
    url = "jdbc:mysql://192.168.1.167:3306/waterdrop"
    table = "waterdrop"
    user = "root"
    password = "newsys"
    save_mode = "append"
  }
  mysql {
    source_table_name = "table_filter_sql_1"
    url = "jdbc:mysql://192.168.1.167:3306/waterdrop"
    table = "waterdrop1"
    user = "root"
    password = "newsys"
    save_mode = "append"
  }
  # Write each table to its own index
  elasticsearch {
    source_table_name = "table_filter_sql_0"
    hosts = ["192.168.1.167:9200"]
    index = "waterdrop-${now}"
    es.batch.size.entries = 100000
    index_time_format = "yyyy.MM.dd"
  }
  elasticsearch {
    source_table_name = "table_filter_sql_1"
    hosts = ["192.168.1.167:9200"]
    index = "waterdrop1-${now}"
    es.batch.size.entries = 100000
    index_time_format = "yyyy.MM.dd"
  }
}
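Note that the second topic referenced above, quickstart-events1, must exist before starting Waterdrop; it can be created the same way as in section 3, for example:
docker exec -it srv-kafka bash -c "cd /opt/kafka && bin/kafka-topics.sh --create --topic quickstart-events1 --bootstrap-server localhost:9092"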
11. Storing data from one topic in multiple tables based on a field
spark {
  # Waterdrop defined streaming batch duration in seconds
  spark.streaming.batchDuration = 5
  spark.app.name = "Waterdrop"
  spark.ui.port = 13000
}
input {
  #socketStream {}
  # Consume the topic
  kafkaStream {
    topics = "quickstart-events"
    consumer.bootstrap.servers = "localhost:9092"
    consumer.group.id = "waterdrop_group"
    result_table_name = "table_input_0"
  }
}
filter {
  json {
    source_table_name = "table_input_0"
    source_field = "raw_message"
    result_table_name = "table_filter_json"
  }
  # Use sql to split the data into tables based on a field in the data, e.g. table.
  # Make sure the field exists in the data, otherwise an error is raised.
  sql {
    sql = "select * from table_filter_json where table = 'quickstart-events'"
    result_table_name = "table_filter_sql_0"
  }
  sql {
    sql = "select * from table_filter_json where table = 'quickstart-events1'"
    result_table_name = "table_filter_sql_1"
  }
}
output {
  stdout {
  }
  # Write each table to its own database table
  mysql {
    source_table_name = "table_filter_sql_0"
    url = "jdbc:mysql://192.168.1.167:3306/waterdrop"
    table = "waterdrop"
    user = "root"
    password = "newsys"
    save_mode = "append"
  }
  mysql {
    source_table_name = "table_filter_sql_1"
    url = "jdbc:mysql://192.168.1.167:3306/waterdrop"
    table = "waterdrop1"
    user = "root"
    password = "newsys"
    save_mode = "append"
  }
  # Write each table to its own index
  elasticsearch {
    source_table_name = "table_filter_sql_0"
    hosts = ["192.168.1.167:9200"]
    index = "waterdrop-${now}"
    es.batch.size.entries = 100000
    index_time_format = "yyyy.MM.dd"
  }
  elasticsearch {
    source_table_name = "table_filter_sql_1"
    hosts = ["192.168.1.167:9200"]
    index = "waterdrop1-${now}"
    es.batch.size.entries = 100000
    index_time_format = "yyyy.MM.dd"
  }
}
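For a quick end-to-end test of the field-based routing, a hand-written JSON record containing the table field (a hypothetical test message, not produced by the scripts above) can be pushed with the console producer:
docker exec -it srv-kafka bash -c "cd /opt/kafka && echo '{\"table\":\"quickstart-events\",\"msg\":\"hello\"}' | bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092"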
Dynamically configuring the where condition
filter {
  sql {
    table_name = "user_view"
    sql = "select * from user_view where city ='"${city}"' and dt = '"${date}"'"
  }
}
Pass the corresponding parameters at startup:
./bin/start-waterdrop.sh -c ./config/your_app.conf -e client -m local[2] -i city=shanghai -i date=20190319
When multiple sql filters are used, if one of the filter conditions does not take effect or returns zero rows, the pipeline can no longer save to ES or the database. If anyone has a solution to this problem, comments are welcome.
If you find mistakes in this article or need help, feel free to message me, leave a comment, or email me at imu-machao@qq.com.