Using Kafka Connect to Monitor MySQL Data and Sync It to Elasticsearch - Liu Yu

Author: Liu Yu
CSDN blog: https://blog.csdn.net/liuyu973971883
Parts of this article reference other material; if anything infringes, please contact me for removal. If anything is incorrect, corrections are welcome. Thank you.

Prerequisite: a Java runtime environment must be installed. I am using JDK 1.8; the installation steps are not covered here.

I. Install ZooKeeper

Here we set up a ZooKeeper cluster.

1. Extract the ZooKeeper tarball

cd /software
tar -xzvf zookeeper-3.4.14.tar.gz

2. Create the directories ZooKeeper will use

#Enter the extracted ZooKeeper directory
cd /software/zookeeper-3.4.14
#Create the directory ZooKeeper uses to store snapshots
mkdir dataDir
#Create the directory ZooKeeper uses to store logs
mkdir dataDirLog

3. Modify the ZooKeeper configuration file

  • Copy the original configuration file
#Enter the conf directory inside the ZooKeeper folder
cd /software/zookeeper-3.4.14/conf
#Copy the configuration file
cp zoo_sample.cfg zoo.cfg
  • Edit the configuration file
#Edit zoo.cfg
vi zoo.cfg

Add the following configuration entries:

#Paths point to the directories we just created
dataDir=/software/zookeeper-3.4.14/dataDir
dataLogDir=/software/zookeeper-3.4.14/dataDirLog
#ZooKeeper cluster: add one server entry per ZooKeeper node
server.1=192.168.40.101:2888:3888
server.2=192.168.40.102:2888:3888
server.3=192.168.40.103:2888:3888

4. Add the ZooKeeper unique identifier

  • Enter the snapshot directory we just created
cd /software/zookeeper-3.4.14/dataDir
  • Create the unique identifier file myid. The number must match this node's server.x index in zoo.cfg, and every ZooKeeper node needs its own myid file.
echo "1" > myid

5. Start ZooKeeper

#Enter the bin directory inside the ZooKeeper folder
cd /software/zookeeper-3.4.14/bin
#Start ZooKeeper
./zkServer.sh start
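
After starting ZooKeeper on all three nodes, you can optionally check each node's role; one node should report leader and the others follower:

#Check the status of this ZooKeeper node
./zkServer.sh status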

II. Install Kafka

Here we set up a single Kafka broker; I installed it on the 103 Linux host (192.168.40.103).

1. Extract

#Enter the software directory
cd /software
#Extract
tar -xzvf kafka_2.11-2.2.1.tgz
#Rename the directory
mv kafka_2.11-2.2.1 kafka

2. Modify the configuration file

  • Enter Kafka's config directory and edit the configuration file
cd /software/kafka/config
vi server.properties
  • Overview of the default configuration file (based on this reference: https://www.cnblogs.com/toutou/p/linux_install_kafka.html)
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

#  A broker is a single Kafka instance. In a Kafka cluster, every broker must have a broker.id
#  that is unique within the cluster, and the id must be an integer.
broker.id=0

############################# Socket Server Settings #############################

# The address the socket server listens on. It will get the value returned from 
# java.net.InetAddress.getCanonicalHostName() if not configured.
#   FORMAT:
#     listeners = security_protocol://host_name:port
#   EXAMPLE:
#     listeners = PLAINTEXT://your.host.name:9092
#listeners=PLAINTEXT://:9092

# Hostname and port the broker will advertise to producers and consumers. If not set, 
# it uses the value for "listeners" if configured.  Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
#advertised.listeners=PLAINTEXT://your.host.name:9092

#The number of threads handling network requests
# Number of threads handling network requests; default is 3
num.network.threads=3

# The number of threads doing disk I/O
# Number of threads performing disk I/O; default is 8
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
# Send buffer size used by the socket server; default 100 KB
socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server
# Receive buffer size used by the socket server; default 100 KB
socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
# Maximum size of a single request the socket server will accept, protecting against OOM (out of memory); default 100 MB
socket.request.max.bytes=104857600

############################# Log Basics (data settings; Kafka calls its data "logs") #############################

# A comma-separated list of directories under which to store log files
# i.e. a comma-separated list of directories where Kafka stores the data it receives
log.dirs=/home/uplooking/data/kafka

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
# Number of log partitions per topic; default 1. More partitions allow greater consumer
# parallelism, but also result in more files spread across the brokers.
# (A partition splits a topic's data into chunks that can be stored and consumed separately.)
num.partitions=1

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
# Number of threads per data directory used to recover data at startup and flush it at shutdown.
# If the Kafka data directories are on a RAID array, consider increasing this value.
num.recovery.threads.per.data.dir=1

############################# Internal Topic Settings  #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 (such as 3) is recommended to ensure availability.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1


############################# Log Flush Policy (data flush policy) #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
#    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.
# Kafka's flush policy is based only on message count and time interval; there is no size-based option.
# Either one or both of the two settings below can be configured.

# The number of messages to accept before forcing a flush of data to disk
# Number of messages that forces a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
# Maximum time a message can sit in the log before a flush to disk is forced
#log.flush.interval.ms=1000

############################# Log Retention Policy (data retention policy) #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.
# In short, a segment is removed as soon as either the time-based or the size-based policy is satisfied.

# The minimum age of a log file to be eligible for deletion
# Time-based policy: how long log data is kept before deletion; default 7 days (168 hours)
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
# segments don't drop below log.retention.bytes. 1G
# Size-based policy; 1 GB
#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
# Segment size policy: a new segment is created when this size is reached (1 GB)
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies (every 5 minutes)
# i.e. how often Kafka checks whether log segments meet the deletion criteria
log.retention.check.interval.ms=300000

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=localhost:2181

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000

############################# Group Coordinator Settings #############################

# The following configuration specifies the time, in milliseconds, that the GroupCoordinator will delay the initial consumer rebalance.
# The rebalance will be further delayed by the value of group.initial.rebalance.delay.ms as new members join the group, up to a maximum of max.poll.interval.ms.
# The default value for this is 3 seconds.
# We override this to 0 here as it makes for a better out-of-the-box experience for development and testing.
# However, in production environments the default value of 3 seconds is more suitable as this will help to avoid unnecessary, and potentially expensive, rebalances during application startup.
group.initial.rebalance.delay.ms=0
  • Modify the following entries in the configuration file
broker.id=1
#Listener for a Kafka broker on an internal network; it tells clients which host name and port to connect to. For a broker exposed on an external network, use advertised.listeners instead.
listeners=PLAINTEXT://192.168.40.103:9092
#Addresses of the ZooKeeper cluster
zookeeper.connect=192.168.40.101:2181,192.168.40.102:2181,192.168.40.103:2181

3. Start Kafka in the background

#Enter the Kafka directory
cd /software/kafka
#Start in the background
nohup bin/kafka-server-start.sh config/server.properties &
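
To confirm the broker is reachable, you can create and list a test topic. This is just a sanity check, assuming the listener address configured above; the topic name test is arbitrary:

#Create a test topic (Kafka 2.2+ accepts --bootstrap-server)
bin/kafka-topics.sh --bootstrap-server 192.168.40.103:9092 --create --topic test --partitions 1 --replication-factor 1
#List topics to verify the broker responds
bin/kafka-topics.sh --bootstrap-server 192.168.40.103:9092 --list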

III. Install Elasticsearch

Here we set up a single Elasticsearch node; I installed it on the 103 Linux host as well.

1. Extract Elasticsearch

#Enter the software directory
cd /software
#Extract
tar -zxvf elasticsearch-5.6.8.tar.gz

2. Modify the configuration file

  • Edit the configuration file
vi /software/elasticsearch-5.6.8/config/elasticsearch.yml
  • Modify the following settings
cluster.name: my-application
node.name: node-1
path.data: /software/elasticsearch-5.6.8/data
path.logs: /software/elasticsearch-5.6.8/logs
network.host: 0.0.0.0
http.port: 9200

3. Create the data and logs directories

#Enter the Elasticsearch directory
cd /software/elasticsearch-5.6.8
#Create the data directory
mkdir data
#Create the logs directory
mkdir logs

4. Create a user to run Elasticsearch

Because Elasticsearch cannot be started as root, we create a dedicated user and group to run it.

  • Create the user and group
#Create the user group
groupadd elsearch
#Create the user and add it to the group
useradd -r -g elsearch elsearch
passwd elsearch
  • Give this user and group ownership of the Elasticsearch directory
chown -R elsearch:elsearch /software/elasticsearch-5.6.8

5. Start Elasticsearch

  • Start
#Switch to the startup user
su elsearch
#Enter the Elasticsearch bin directory
cd /software/elasticsearch-5.6.8/bin
#Start in the background
nohup ./elasticsearch &
#Watch the nohup log for errors; limits such as the maximum thread count are often too low
tail -f nohup.out
  • Check whether startup succeeded
curl  http://192.168.40.103:9200
#Startup succeeded if information like the following appears
{
  "name" : "node-1",
  "cluster_name" : "my-application",
  "cluster_uuid" : "2UlrJ43PQDKbrqvcTG9IyA",
  "version" : {
    "number" : "5.6.8",
    "build_hash" : "688ecce",
    "build_date" : "2018-02-16T16:46:30.010Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}

6. Troubleshooting

6.1. Error 1

  • max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]

This happens because the maximum number of open file descriptors is too low. Switch to the root user and modify /etc/security/limits.conf.

  • Add the following settings, then restart Linux
*                soft    nofile          65536
*                hard    nofile          65536
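
After logging in again as the elsearch user, you can verify that the new limit took effect:

#Should now print 65536
ulimit -n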

6.2. Error 2

  • max number of threads [3818] for user [es] is too low, increase to at least [4096]

This happens because the maximum number of threads allowed for the user is too low. Switch to the root user and modify /etc/security/limits.conf.

  • Add the following settings, then restart Linux
*                soft    nproc           4096
*                hard    nproc           4096
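
The new thread limit can be checked the same way after logging in again:

#Should now print 4096 (or higher)
ulimit -u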

6.3. Error 3

  • max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

This happens because the limit on the number of VMAs (virtual memory areas) a process may own is too low. Switch to the root user and modify /etc/sysctl.conf.

  • Add the following setting
vm.max_map_count=262144
  • Apply it immediately
sysctl -p
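
To confirm the kernel picked up the new value:

#Should print vm.max_map_count = 262144
sysctl vm.max_map_count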

6.4. Error 4

  • system call filters failed to install; check the logs and fix your configuration or disable system call filters at your own risk

This happens because CentOS 6 does not support seccomp, while Elasticsearch (since 5.2.1) defaults bootstrap.system_call_filter to true and checks for it; the check fails and Elasticsearch cannot start.

  • Solution:
    Set bootstrap.system_call_filter to false in elasticsearch.yml (note that it must go under the Memory section), then restart Elasticsearch.
bootstrap.memory_lock: false
bootstrap.system_call_filter: false

IV. Configure Kafka Connect to sync MySQL data to Elasticsearch

1. Preparation

1.1. Required JAR packages

  • All JAR packages in one bundle:
    Download link: click to download
  • Individual JAR packages:
    • kafka-connect-jdbc-4.1.1
      Download link: click to download
    • mysql-connector-java-5.1.40.jar
      Download link: click to download
    • kafka-connect-elasticsearch-5.4.1.jar
      Download link: click to download
    • commons-codec-1.11.jar, commons-logging-1.2.jar, httpclient-4.5.12.jar, httpcore-4.4.13.jar
      Download link: click to download
    • common-utils-5.4.1.jar
      Download link: click to download
    • httpasyncclient-4.1.3.jar
      Download link: click to download
    • httpcore-nio-4.4.6.jar
      Download link: click to download
    • jest-6.3.1.jar, jest-common-6.3.1.jar
      Download link: click to download
    • gson-2.8.5.jar
      Download link: click to download
    • slf4j-api-1.7.26.jar
      Download link: click to download
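
Kafka Connect has to be able to load these classes. A simple approach, assuming the JARs were downloaded into a folder such as /software/connect-jars (a hypothetical path), is to copy them into Kafka's libs directory so the standalone Connect worker picks them up:

#Copy the downloaded connector JARs onto Kafka's classpath
cp /software/connect-jars/*.jar /software/kafka/libs/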

1.2. Create the database table to be synced

create database test1;
use test1;
create table user(id int PRIMARY KEY AUTO_INCREMENT,username varchar(50),password varchar(50));
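
To have some data to sync later, you can insert a couple of test rows. A minimal example, assuming the root/root account that the connector URL below uses; the usernames are arbitrary:

#Insert two test rows into the user table
mysql -h 192.168.40.102 -uroot -proot -e "INSERT INTO test1.user(username, password) VALUES ('zhangsan', '123456'), ('lisi', '123456');"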

1.3. Configure the MySQL-to-Kafka connector in Kafka's config directory

  • Create mysql-test1.properties
# Connector name
name=mysql_test1
# Connector class to use
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
# Maximum number of tasks
tasks.max=1
# MySQL connection URL
connection.url=jdbc:mysql://192.168.40.102:3306/test1?user=root&password=root&useUnicode=true&characterEncoding=utf-8&useSSL=false&serverTimezone=GMT&autoReconnect=true
# Capture mode: one of incrementing, timestamp, or timestamp+incrementing
mode=incrementing
# Column to monitor for new rows
incrementing.column.name=id
# Topic prefix
topic.prefix=mysql_test1_
# Poll every 10 seconds
poll.interval.ms=10000

1.4. Configure the Kafka-to-Elasticsearch connector in Kafka's config directory

  • Create es-mysql-test1.properties
# Connector name
name=es_mysql_test1
# Connector class to use
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
# Maximum number of tasks
tasks.max=1
# Topic name, usually the topic prefix + table name
topics=mysql_test1_user
# Use Kafka topic name + partition id + offset as the key of each record written to ES
key.ignore=true
# Elasticsearch address
connection.url=http://192.168.40.103:9200
# Elasticsearch index type
type.name=test1_user
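
With the default settings, this sink writes records into an Elasticsearch index named after the Kafka topic, so the data should end up in an index called mysql_test1_user. Once everything is running, it can be inspected with a query like this (assuming the default topic-to-index mapping):

#Search the index populated by the sink connector
curl http://192.168.40.103:9200/mysql_test1_user/_search?pretty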

1.5. Modify connect-standalone.properties in Kafka's config directory

Since we have a single Kafka broker, we start the connectors in standalone mode.

  • Modify bootstrap.servers so that it matches the IP set in listeners in Kafka's server.properties
bootstrap.servers=192.168.40.103:9092

2. Run Connect

We use Connect in standalone mode here.

#Enter Kafka's bin directory
cd /software/kafka/bin
#Start Connect in the background, passing the two connector configuration files
nohup ./connect-standalone.sh ../config/connect-standalone.properties ../config/es-mysql-test1.properties ../config/mysql-test1.properties &
#Check the nohup log for errors, or use the Connect REST API to check each connector's status
tail -f nohup.out
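
Once Connect is running, each new row inserted into the user table should appear on the source connector's topic within the 10-second poll interval. A quick way to watch it, assuming the broker address configured earlier:

#Consume the topic written by the JDBC source connector from the beginning
./kafka-console-consumer.sh --bootstrap-server 192.168.40.103:9092 --topic mysql_test1_user --from-beginning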

3. The Connector REST API

curl -X GET http://ip:8083/connector-plugins
GET /connectors – Returns the names of all running connectors.
POST /connectors – Creates a new connector; the request body must be JSON containing a name field and a config field. name is the connector's name and config is a JSON object with the connector's configuration.
GET /connectors/{name} – Gets information about the specified connector.
GET /connectors/{name}/config – Gets the configuration of the specified connector.
PUT /connectors/{name}/config – Updates the configuration of the specified connector.
GET /connectors/{name}/status – Gets the status of the specified connector, including whether it is running, paused, or failed; if an error occurred, the error details are listed as well.
GET /connectors/{name}/tasks – Gets the tasks currently running for the specified connector.
GET /connectors/{name}/tasks/{taskid}/status – Gets the status of a task of the specified connector.
PUT /connectors/{name}/pause – Pauses the connector and its tasks; data processing stops until it is resumed.
PUT /connectors/{name}/resume – Resumes a paused connector.
POST /connectors/{name}/restart – Restarts a connector; mostly used when a connector has failed.
POST /connectors/{name}/tasks/{taskId}/restart – Restarts a task, usually because it has failed.
DELETE /connectors/{name} – Deletes a connector, stopping all its tasks and removing its configuration.
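
For example, with the two connectors defined above, their status can be checked and the sink restarted after a failure (standalone Connect listens on port 8083 by default; replace the host with wherever Connect runs, here 192.168.40.103):

#Check the status of the source and sink connectors
curl -X GET http://192.168.40.103:8083/connectors/mysql_test1/status
curl -X GET http://192.168.40.103:8083/connectors/es_mysql_test1/status
#Restart the sink connector after a failure
curl -X POST http://192.168.40.103:8083/connectors/es_mysql_test1/restart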