Implementing the Official kafka-connect Example FileStream, Step by Step

我一拳打弯你A柱

Category: kafka. Tags: kafka connect

Article link: https://blog.csdn.net/Alian_W/article/details/114034298

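Completing the Official Kafka Connect Example FileStreamConnector, Step by Step

Hi everyone, I'm One-Punch Man, the one who can smash a Passat's A-pillar with a single punch.

I previously read up on the design of the Kafka Connect component and now have a rough picture of its structure. Connect is a high-level abstraction on top of which you can build your own connectors for many kinds of data sources. Today I'm going to follow the Connector Developer Guide and implement its example connector step by step. The post has three parts: 1. an introduction to FileStreamConnector, 2. writing the code, 3. packaging, deployment, and testing.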

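1. Introduction to FileStreamConnector

This connector is the example from the Connector Developer Guide. Its source half (the SourceConnector) reads a local file and sends the contents to a Kafka topic; its sink half (the SinkConnector) takes data from a Kafka topic and writes it to a specified destination. My own use case only needs the SourceConnector, so the example below covers only the source side. According to the developer guide, the sink side is very similar to the source side, so it should not be hard to write either.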

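In my earlier post "Some Thoughts on kafka-connect" I mentioned that Kafka Connect has the following core components:

Source: writes external data into Kafka topics.
Sink: reads data from Kafka and delivers it where you need it, such as HDFS or HBase; it can receive schema information as well as data.
Connectors: a high-level abstraction that coordinates data flow by managing tasks.
Tasks: the concrete implementation of writing data into Kafka and reading data out of it; both sources and sinks need a Task.
Workers: the processes that run connectors and tasks.
Converters: a mechanism for converting data between the internal data types used by Kafka Connect and data represented in Avro, Protobuf, or JSON Schema form.
Transforms: a lightweight tool for adjusting individual records.

This example only reads a local file, so it does not need many of these components: only the Source and Task parts are involved.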

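A SourceConnector needs to implement the following methods:

public ConfigDef config(): exposes the configuration properties that the connector cares about.
public void start(Map<String, String> props): the first method called in the connector's lifecycle, and the place where the developer reads the connector's configuration.
public Class<? extends Task> taskClass(): returns the Task class that this connector runs.
public List<Map<String, String>> taskConfigs(int maxTasks): defines how the work is split across tasks and the configuration each task receives.
public void stop(): called when the program shuts down or fails; the cleanup method that ends the lifecycle.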

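A SourceTask needs to implement the following methods:

public void start(Map<String, String> props): as with the connector, the task reads its configuration in start, the first method of its lifecycle.
public List<SourceRecord> poll() throws InterruptedException: the method that does the real work; the Connect framework calls it in a loop as fast as it can. Here the developer connects to the external system, pulls data, and returns it to the framework as a list of SourceRecord objects.
public void stop(): as with the connector, the cleanup method that ends the lifecycle.

Those are the components to develop, the methods each one has to implement, and what each method does.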
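
To make the lifecycle order concrete, here is a minimal sketch that uses only the JDK. `ToyTask`, `ListTask`, and `run` are hypothetical names invented for illustration; they are not part of the Kafka Connect API, and the real worker does considerably more (offset commits, error handling, threading).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class LifecycleSketch {

    /** Hypothetical stand-in for SourceTask; this is not the Kafka Connect API. */
    interface ToyTask {
        void start(Map<String, String> props);
        List<String> poll();
        void stop();
    }

    /** Toy task: emits the comma-separated values of the "lines" property, two per poll. */
    static class ListTask implements ToyTask {
        private List<String> pending = new ArrayList<>();

        @Override
        public void start(Map<String, String> props) {
            pending = new ArrayList<>(Arrays.asList(props.get("lines").split(",")));
        }

        @Override
        public List<String> poll() {
            if (pending.isEmpty())
                return null; // nothing available right now
            int n = Math.min(2, pending.size());
            List<String> batch = new ArrayList<>(pending.subList(0, n));
            pending.subList(0, n).clear();
            return batch;
        }

        @Override
        public void stop() {
            pending.clear(); // release whatever start() acquired
        }
    }

    /** Drives a task in lifecycle order: start once, poll repeatedly, stop once. */
    static List<String> run(ToyTask task, Map<String, String> props, int polls) {
        List<String> delivered = new ArrayList<>();
        task.start(props);
        try {
            for (int i = 0; i < polls; i++) {
                List<String> batch = task.poll();
                if (batch != null)
                    delivered.addAll(batch);
            }
        } finally {
            task.stop();
        }
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println(run(new ListTask(), Collections.singletonMap("lines", "a,b,c"), 5));
    }
}
```

The point of the sketch is only the shape: configuration arrives once in start, poll may return null when there is nothing new, and stop always runs last.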

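2. Writing the Code

Only the source side needs to be implemented, and its source code is available in the open-source Kafka project, so what follows is mostly a reading of that code.

2.1 SourceConnector

Explaining it line by line would be hard to read, so I reproduced the whole file and annotated it with comments: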

import org.apache.kafka.common.config.AbstractConfig;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.common.config.ConfigException;
import org.apache.kafka.common.utils.AppInfoParser;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;


public class FileStreamSourceConnector extends SourceConnector {

    // Configuration keys, i.e. the part before "=" in the .properties file
    public static final String TOPIC_CONFIG = "topic";
    public static final String FILE_CONFIG = "file";
    public static final String TASK_BATCH_SIZE_CONFIG = "batch.size";

    // Default batch size: at most 2000 lines per batch
    public static final int DEFUALT_TASK_BATCH_SIZE = 2000;

    // The ConfigDef declares the expected settings so that they can be validated and extracted from the supplied configuration
    // In kafka-connect-jdbc this config definition lives in a separate file
    private static final ConfigDef CONFIG_DEF = new ConfigDef()
        .define(FILE_CONFIG, ConfigDef.Type.STRING, null, ConfigDef.Importance.HIGH, "Source filename. If not specified, the standard input will be used")
        .define(TOPIC_CONFIG, ConfigDef.Type.LIST, ConfigDef.Importance.HIGH, "The topic to publish data to")
        .define(TASK_BATCH_SIZE_CONFIG, ConfigDef.Type.INT, DEFUALT_TASK_BATCH_SIZE, ConfigDef.Importance.LOW, "The maximum number of records the Source task can read from file one time");

    // Values parsed from the configuration are kept in these connector fields
    private String filename;
    private String topic;
    private int batchsize;


    @Override
    public void start(Map<String, String> props) {
        /**
         * props is presumably passed in by the Kafka Connect component that parses the configuration file.
         *
         * As the first method in the lifecycle, start sets up the parameters the connector needs.
         */
        AbstractConfig parsedConfig = new AbstractConfig(CONFIG_DEF, props); // parse and validate the properties
        filename = parsedConfig.getString(FILE_CONFIG); // assign the private field filename
        List<String> topics = parsedConfig.getList(TOPIC_CONFIG); // topic
        if (topics.size() != 1) { // validation: FileStreamConnector publishes to a single topic, so exactly one topic must be given
            throw new ConfigException("'topic' in FileStreamSourceConnector configuration requires definition of a single topic");
        }
        topic = topics.get(0); // take the only topic
        batchsize = parsedConfig.getInt(TASK_BATCH_SIZE_CONFIG); // set batchsize
    }

    @Override
    public Class<? extends Task> taskClass() {
        return FileStreamSourceTask.class; // the Task class that this SourceConnector runs
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        /**
         * All properties the Task needs are assembled here
         */
        ArrayList<Map<String, String>> configs = new ArrayList<>();
        Map<String, String> config = new HashMap<>();
        if (filename != null)
            config.put(FILE_CONFIG, filename);
        config.put(TOPIC_CONFIG, topic);
        config.put(TASK_BATCH_SIZE_CONFIG, String.valueOf(batchsize));
        configs.add(config);

        return configs;
    }

    @Override
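    public void stop() {
        /**
         * Last method in the lifecycle, used for cleanup such as closing a JDBC connection.
         * This example reads a local file and the connector itself holds no resources, so there is nothing to release.
         */
    }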

    @Override
    public ConfigDef config() {
        // Return the ConfigDef so the framework can parse and validate the configuration
        return CONFIG_DEF;
    }

    @Override
    public String version() {
        return AppInfoParser.getVersion();
    }
}

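Reading through the Connector gives a rough idea of what each of its methods does and of its lifecycle. The object is clearly not the entry point; it is called by the framework above it. That can feel unfamiliar at first, which is normal. I think the key is to understand the object's lifecycle and the responsibility of each method, without worrying too much about the exact internal call order; answering that kind of question requires digging through the Kafka Connect source.

From writing, or rather reading, the Connector we learn that it mainly deals with configuration. In this example the SourceConnector declares the parameters it needs and hands that definition to the framework's parser. In start, it stores the parsed values for later use.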
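
The validation that start and taskConfigs perform can be tried without a Kafka dependency. The following is a rough sketch under stated assumptions: `ConfigSketch`, `parseList`, and `taskConfig` are hypothetical helpers that only imitate the behaviour seen in the connector above (a comma-separated topic list that must contain exactly one entry, and a default batch size of 2000). It is not how ConfigDef is implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConfigSketch {
    static final int DEFAULT_BATCH_SIZE = 2000; // same default as the connector above

    /** Splits a comma-separated value into trimmed, non-empty entries. */
    static List<String> parseList(String raw) {
        List<String> out = new ArrayList<>();
        if (raw == null)
            return out;
        for (String part : raw.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty())
                out.add(trimmed);
        }
        return out;
    }

    /** Returns the settings a single task would receive, or throws on invalid input. */
    static Map<String, String> taskConfig(Map<String, String> props) {
        List<String> topics = parseList(props.get("topic"));
        if (topics.size() != 1)
            throw new IllegalArgumentException("'topic' requires exactly one topic, got " + topics);

        String batch = props.get("batch.size");
        int batchSize = (batch == null) ? DEFAULT_BATCH_SIZE : Integer.parseInt(batch.trim());

        Map<String, String> config = new HashMap<>();
        if (props.get("file") != null) // no file means "read standard input"
            config.put("file", props.get("file"));
        config.put("topic", topics.get(0));
        config.put("batch.size", String.valueOf(batchSize));
        return config;
    }
}
```

Passing a map with topic=connect-test and no batch.size yields a task config whose batch.size is "2000"; passing topic=a,b is rejected, which mirrors the ConfigException thrown in start.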

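2.2 SourceTask

The Task performs the actual data operations. Its most important method is poll.

The Task involves more methods and objects, but don't worry: I commented every line. I suggest reading the code in the order of the Task's lifecycle: start first, then poll. Within each method, first follow the overall flow of the data, then trace the three objects buffer, stream, and reader back to where they come from.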

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class FileStreamSourceTask extends SourceTask {

    private static final Logger log = LoggerFactory.getLogger(FileStreamSourceTask.class); // logger
    public static final String FILENAME_FIELD = "filename";
    public static final String POSITION_FIELD = "position";
    private static final Schema VALUE_SCHEMA = Schema.STRING_SCHEMA;

    private String filename; // name of the file to read
    private InputStream stream; // input stream
    private BufferedReader reader = null; // reader wrapped around the stream
    private char[] buffer; // characters read but not yet split into lines
    private int offset = 0; // number of valid characters currently in buffer, maintained by the Task
    private String topic = null; // topic to publish to
    private int batchsize = FileStreamSourceConnector.DEFUALT_TASK_BATCH_SIZE; // batch size

    private Long streamOffset; // current position in the input stream

    public FileStreamSourceTask() {
        this(1024);
    }

    FileStreamSourceTask(int initialBufferSize) { // sets the initial buffer size
        buffer = new char[initialBufferSize];
    }

    @Override
    public String version() {
        return new FileStreamSourceConnector().version();
    }

    @Override
    public void start(Map<String, String> props) { // props is the task configuration map built by the connector's taskConfigs
        /**
         * start applies the configuration
         */
        filename = props.get(FileStreamSourceConnector.FILE_CONFIG); // the Connector already prepared the properties; read them from props
        if (filename == null || filename.isEmpty()) { // no file name configured: read from standard input
            stream = System.in;
            streamOffset = null; // tracking an offset makes no sense for stdin
            reader = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        }

        topic = props.get(FileStreamSourceConnector.TOPIC_CONFIG); // read the topic
        batchsize = Integer.parseInt(props.get(FileStreamSourceConnector.TASK_BATCH_SIZE_CONFIG)); // read batchsize
    }

    private Map<String, String> offsetKey(String filename) {
        return Collections.singletonMap(FILENAME_FIELD, filename);
    }

    private Map<String, Long> offsetValue(Long pos) {
        return Collections.singletonMap(POSITION_FIELD, pos);
    }

    private String logFilename() {
        return filename == null ? "stdin" : filename;
    }

    int bufferSize() {
        return buffer.length;
    }

    private String extractLine() {
        /**
         * This method splits the buffered data into lines.
         * FileStreamSource only reads a local file or standard input, so plain file I/O streams are used.
         */
        int until = -1, newStart = -1;
        for (int i = 0; i < offset; i++) {
            if (buffer[i] == '\n') { // newline \n: the line ends at i (until = i) and the next one starts at i + 1
                until = i;
                newStart = i + 1;
                break;
            } else if (buffer[i] == '\r') { // carriage return \r: if it is the last buffered character we cannot yet tell \r from \r\n, so return null and wait for more data
                if (i + 1 >= offset)
                    return null;

                until = i; // otherwise the line ends at i
                newStart = (buffer[i + 1] == '\n') ? i + 2 : i + 1; // skip the \n of a \r\n pair; otherwise the next line starts right after \r
                break;
            }
        }
        // In short, the loop scans the first `offset` characters for one line and stops at the first \n, \r, or \r\n

        if (until != -1) { // a complete line was found
            String result = new String(buffer, 0, until); // build the string from buffer[0, until), excluding the terminator
            System.arraycopy(buffer, newStart, buffer, 0, buffer.length - newStart); // shift the unread data to the front of the buffer, overwriting the extracted line
            offset = offset - newStart; // fewer characters remain buffered
            if (streamOffset != null) // advance the stream offset by newStart
                streamOffset += newStart;
            return result;
        } else {
            return null;
        }
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        /**
         * poll does the actual work of pulling data and assembling it into a List<SourceRecord>.
         * Sending the records to Kafka is presumably done by the Connect framework; the developer only has to return the list.
         */
        if (stream == null) {
            try {
                stream = Files.newInputStream(Paths.get(filename)); // open the input file stream

                /**
                 * context is the SourceTaskContext, which lets the Task access parts of the Connect framework at runtime.
                 * offset() takes a Map<String, String> that identifies a source partition and returns the
                 * Map<String, Object> offset stored for that partition.
                 *
                 * In this example the partition is Map<"filename", "/path/filename">; a database source would need something like Map<table, partition>.
                 *
                 * Two cases follow:
                 * 1. First run for this file: the offset file maintained by Kafka Connect has no entry, so offset is null.
                 * 2. Later runs: the offset file (usually under .../offsets/) has an entry, which is read back,
                 *    and the input stream is moved to the recorded position.
                 */
                Map<String, Object> offset = context.offsetStorageReader().offset(Collections.singletonMap(FILENAME_FIELD, filename));
                if (offset != null) { // a stored offset exists
                    Object lastRecordedOffset = offset.get(POSITION_FIELD); // last recorded position
                    if (lastRecordedOffset != null && !(lastRecordedOffset instanceof Long)) // a stored position must be a Long
                        throw new ConnectException("Offset position is the incorrect type");
                    if (lastRecordedOffset != null) { // move the stream to the last recorded position
                        log.debug("Found previous offset, trying to skip to file offset {}", lastRecordedOffset);
                        long skipLeft = (Long) lastRecordedOffset; // number of bytes still to skip
                        while (skipLeft > 0) { // loop because skip() may skip fewer bytes than requested
                            try {
                                long skipped = stream.skip(skipLeft);
                                skipLeft -= skipped;
                            } catch (IOException e) {
                                log.error("Error while trying to seek to previous offset in file {}:", filename, e);
                                throw new ConnectException(e);
                            }
                        }
                        log.debug("Skipped to offset {}", lastRecordedOffset);
                    }
                    streamOffset = (lastRecordedOffset != null) ? (Long) lastRecordedOffset : 0L; // position of the input stream, as a Long
                } else { // no stored offset: start from position 0
                    streamOffset = 0L;
                }
                reader = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8)); // create the reader
                log.debug("Opened {} for reading", logFilename());

            } catch (NoSuchFileException e) {
                log.warn("Couldn't find file {} for FileStreamSourceTask, sleeping to wait for it to be created", logFilename());
                synchronized (this) {
                    this.wait(1000);
                }
                return null;
            } catch (IOException e) {
                log.error("Error while trying to open file {}:", filename, e);
                throw new ConnectException(e);
            }
        }

        try {
            final BufferedReader readerCopy;
            synchronized (this) { // presumably because stop() can be called from another thread; take a consistent snapshot of reader
                readerCopy = reader;
            }
            if (readerCopy == null)
                return null;
            ArrayList<SourceRecord> records = null; // the records collected in this call
            int nread = 0; // number of characters read
            while (readerCopy.ready()) {
                /**
                 * The next line matters. Recall that reader was created by:
                 * reader = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
                 * and that stream was created by:
                 * stream = Files.newInputStream(Paths.get(filename));
                 * So the file path yields stream, and stream yields reader.
                 * reader.read() then stores up to the requested number of characters in buffer,
                 * extractLine splits buffer into lines,
                 * and each line is returned to poll as a String and wrapped in a SourceRecord.
                 */
                nread = readerCopy.read(buffer, offset, buffer.length - offset); // read up to buffer.length - offset characters into buffer starting at index offset; returns the number read
                log.trace("Read {} bytes from {}", nread, logFilename());

                if (nread > 0) {
                    offset += nread; // more characters are now buffered
                    String line;
                    boolean foundOneLine = false;
                    do {
                        line = extractLine(); // try to extract one line
                        if (line != null) { // a line was extracted
                            foundOneLine = true;
                            log.trace("Read a line from {}", logFilename());
                            if (records == null)
                                records = new ArrayList<>();
                            // create the SourceRecord
                            records.add(new SourceRecord(offsetKey(filename), offsetValue(streamOffset), topic, null, null, null, VALUE_SCHEMA, line, System.currentTimeMillis()));

                            if (records.size() >= batchsize) { // return as soon as the batch is full
                                return records;
                            }

                        }
                    } while (line != null); // stop once no further complete line can be extracted

                    if (!foundOneLine && offset == buffer.length) { // no line found and the buffer is full, i.e. the line is longer than the buffer
                        char[] newbuf = new char[buffer.length * 2]; // allocate a buffer twice as large
                        System.arraycopy(buffer, 0, newbuf, 0, buffer.length); // copy the existing data into it
                        log.info("Increased buffer from {} to {} ", buffer.length, newbuf.length);
                        buffer = newbuf;
                    }
                    // In short, this branch handles lines longer than the buffer by doubling the buffer size

                }
            }
            if (nread <= 0) // nothing was read: wait up to one second
                synchronized (this) {
                    this.wait(1000);
                }

            return records;

        } catch (IOException e) {
            // The underlying stream was probably closed by stop(); fall through and return null
            // so that the framework can handle any shutdown.
        }
        return null;
    }

    @Override
    public void stop() {
        /**
         * stop does the cleanup
         */
        log.trace("Stopping");
        synchronized (this) {
            try {
                if (stream != null && stream != System.in) {
                    stream.close();
                    log.trace("Closed input stream");
                }
            } catch (IOException e) {
                log.error("Failed to close FileStreamSourceTask stream:", e);
            }
            this.notify();
        }

    }
}
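
The line-splitting logic of extractLine, together with the buffer doubling in poll, can be exercised on its own. Below is a simplified sketch using only the JDK; `LineSplitSketch`, `append`, and `split` are hypothetical names, and the class deliberately leaves out streams, offsets, and SourceRecord so that only the handling of \n, \r, and \r\n remains.

```java
import java.util.ArrayList;
import java.util.List;

public class LineSplitSketch {
    private char[] buffer;
    private int filled = 0; // number of valid characters in buffer

    /** initialSize must be at least 1, otherwise doubling could never grow the buffer. */
    LineSplitSketch(int initialSize) {
        buffer = new char[initialSize];
    }

    /** Appends a chunk, doubling the buffer while it is too small. */
    void append(String chunk) {
        while (filled + chunk.length() > buffer.length) {
            char[] bigger = new char[buffer.length * 2];
            System.arraycopy(buffer, 0, bigger, 0, filled);
            buffer = bigger;
        }
        chunk.getChars(0, chunk.length(), buffer, filled);
        filled += chunk.length();
    }

    /** Returns the next complete line without its terminator, or null if none is buffered yet. */
    String extractLine() {
        for (int i = 0; i < filled; i++) {
            int next;
            if (buffer[i] == '\n') {
                next = i + 1;
            } else if (buffer[i] == '\r') {
                if (i + 1 >= filled)
                    return null; // cannot yet tell "\r" from "\r\n"; wait for more data
                next = (buffer[i + 1] == '\n') ? i + 2 : i + 1;
            } else {
                continue;
            }
            String line = new String(buffer, 0, i);
            System.arraycopy(buffer, next, buffer, 0, filled - next); // shift the rest to the front
            filled -= next;
            return line;
        }
        return null;
    }

    /** Feeds chunks one at a time and collects every line that becomes complete. */
    static List<String> split(int bufferSize, String... chunks) {
        LineSplitSketch sketch = new LineSplitSketch(bufferSize);
        List<String> lines = new ArrayList<>();
        for (String chunk : chunks) {
            sketch.append(chunk);
            for (String line = sketch.extractLine(); line != null; line = sketch.extractLine())
                lines.add(line);
        }
        return lines;
    }
}
```

Feeding the chunks "hel", "lo\r", "\nwor", "ld\n", "tail" with a buffer of 4 yields the lines hello and world. The trailing "tail" stays buffered because it has no terminator yet, which is also why the real task does not emit the last line of a file until a newline is appended.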

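The Task's workflow should now be reasonably clear. This example works on a local file; incremental queries against a database would be considerably more complex. Next, the program is packaged, deployed, and tested.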
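
One detail of the offset recovery in poll is worth isolating: InputStream.skip may skip fewer bytes than requested, possibly zero, so a bare `while (skipLeft > 0)` loop relies on skip always making progress. The sketch below is a defensive variant, not the code Kafka ships; `SkipSketch`, `skipFully`, and `remainderAfterSkip` are hypothetical names. Note also that the task advances streamOffset in characters while skip counts bytes, so the two agree only for single-byte text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class SkipSketch {

    /**
     * Skips up to n bytes, tolerating partial skips.
     * Returns the number of bytes actually skipped, which is less than n only at end of stream.
     */
    static long skipFully(InputStream in, long n) throws IOException {
        long left = n;
        while (left > 0) {
            long skipped = in.skip(left);
            if (skipped > 0) {
                left -= skipped;
            } else if (in.read() == -1) {
                break;  // end of stream: the data is shorter than the stored offset
            } else {
                left--; // read() consumed one byte
            }
        }
        return n - left;
    }

    /** Convenience for experiments: returns what is left of text after skipping n bytes. */
    static String remainderAfterSkip(String text, long n) {
        try (InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8))) {
            skipFully(in, n);
            ByteArrayOutputStream rest = new ByteArrayOutputStream();
            for (int b = in.read(); b != -1; b = in.read())
                rest.write(b);
            return new String(rest.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the text "line1\nline2\n" and a stored position of 6, reading resumes at line2, which is the behaviour the task relies on after a restart.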

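3. Packaging, Deployment, and Testing

The code above was copied in full from the Kafka project, so there is no need to package it again. We go straight to the configuration files and testing. Before this step, make sure your Kafka version is 0.9.0.0 or later and is configured correctly. Another post, "Kafka Connect: Introduction and Usage", also has a complete configuration, covering cluster mode as well as standalone; I recommend reading it. Here I also use standalone mode.

3.1 connect-file-source.properties

First make sure Kafka is running: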

[root@spark-04 apps]# jps
31172 Jps
21532 Kafka
21183 QuorumPeerMain

Go to kafka/config:

[root@spark-04 apps]# cd kafka_2.13-2.7.0/config/

ls shows a number of configuration files:

[root@spark-04 config]# ls
connect-console-sink.properties    connect-file-source.properties  consumer.properties              tools-log4j.properties
connect-console-source.properties  connect-log4j.properties        log4j.properties                 producer.properties
trogdor.conf                       connect-distributed.properties  connect-mirror-maker.properties  server.properties
zookeeper.properties               connect-file-sink.properties    connect-standalone.properties    source_jdbc_dm.properties

Make a copy of connect-file-source.properties:

[root@spark-04 config]# cp connect-file-source.properties my-connect-file-source.properties

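Edit it with vi. Everything else can go; only the following parameters need to stay:

# name of the connector
name=local-file-source
# connector class to start
connector.class=FileStreamSource
# maximum number of tasks
tasks.max=1
# full path of the file to read
file=/root/apps/kafka_2.13-2.7.0/1.txt
# topic to publish to
topic=connect-test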
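
One thing to keep in mind when annotating such a file: in Java's .properties format a # starts a comment only at the beginning of a line. Anything after the value on the same line becomes part of the value. The sketch below shows this with java.util.Properties; that Kafka's standalone script loads the file through the same mechanism is my assumption, and `PropertiesCommentSketch` is a hypothetical helper.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PropertiesCommentSketch {

    /** Loads .properties text and returns the value stored under key. */
    static String valueOf(String text, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // Comment on the same line: it ends up inside the value.
        System.out.println("[" + valueOf("tasks.max=1 # max tasks\n", "tasks.max") + "]");
        // Comment on its own line: the value is clean.
        System.out.println("[" + valueOf("# max tasks\ntasks.max=1\n", "tasks.max") + "]");
    }
}
```

A value such as "1 # max tasks" would not parse as an integer, so comments are safest on their own lines.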

3.2 connect-standalone.properties

Save the file. The second file is the properties file for standalone mode; make a copy of connect-standalone.properties:

[root@spark-04 config]# cp connect-standalone.properties my-connect-standalone.properties

Edit it with vi; again only the following parameters need to stay:

# broker address, fixed for a single local machine
bootstrap.servers=localhost:9092
# converters for record keys and values
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
# full path of the file that stores offsets
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000

3.3 1.txt

Create 1.txt at the configured path; you can also write to it with echo:

[root@spark-04 kafka_2.13-2.7.0]# echo shit >> 1.txt

3.4 Deployment

To see the result, first open another terminal window and start a consumer:

[root@spark-04 bin]# sh kafka-console-consumer.sh --topic connect-test --bootstrap-server localhost:9092

Then start the Connect process with the following command, passing the two files created above:

[root@spark-04 bin]# sh connect-standalone.sh /root/apps/kafka_2.13-2.7.0/config/my-connect-standalone.properties /root/apps/kafka_2.13-2.7.0/config/my-connect-file-source.properties

Once it starts successfully, the consumer process shows the following:

[Screenshot: consumer output]

Open another window and append a line with echo:

[root@spark-04 kafka_2.13-2.7.0]# echo hi >> 1.txt

In the consumer process:

[Screenshot: consumer output]

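OK, that is the whole deployment and testing process. The real focus is the FileStreamConnector source code; once you understand the source, the configuration mostly falls into place.

Summary

Kafka Connect is an excellent open-source framework whose high level of abstraction lets developers build custom connectors for all kinds of data sources on top of it. Reading the example that ships with Connect only shows the framework from the outside; I have not yet gone deep into its internals. I hope to get a chance to study its internal design later.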
