[Flink] Writing a custom flink-socket-connector

海色风铃

Last modified: 2023-09-09 16:08:16

Tags: flink, big data

First published: 2023-09-09 16:02:33

Article link: https://blog.csdn.net/weixin_50573352/article/details/132778522

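User-defined Sources & Sinks

Overview

The solid arrows in the figure show how objects are converted into other objects from one stage to the next during the translation process.
[Figure: stages of the translation process]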

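Metadata

Both the Table API and SQL are declarative APIs, and this includes the declaration of tables. Executing a CREATE TABLE statement therefore updates the metadata in the target catalog.

For most catalog implementations, the physical data in the external system is not modified by such an operation. Connector-specific dependencies do not have to be on the classpath yet, and the options declared in the WITH clause are neither validated nor interpreted.

The metadata of a dynamic table (created via DDL or provided by the catalog) is represented as an instance of CatalogTable. When necessary, a table name is resolved into a CatalogTable internally.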

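Planning

When a table program is planned and optimized, a CatalogTable has to be resolved into a DynamicTableSource (for reading in a SELECT query) and a DynamicTableSink (for writing in an INSERT INTO statement).

DynamicTableSourceFactory and DynamicTableSinkFactory provide the connector-specific logic that translates the metadata of a CatalogTable into instances of DynamicTableSource and DynamicTableSink. In most cases, a factory validates the options (such as 'port' = '9999' in the SocketExample at the end of this article), configures encoding and decoding formats if required, and creates a parameterized instance of the table connector.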
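
To make the "validate, then parameterize" role of a factory concrete, here is a minimal sketch in plain Java with no Flink dependency. The class SocketOptionsSketch and its method fromOptions are hypothetical names invented for illustration; the real work is done by FactoryUtil.TableFactoryHelper, which also accepts format-prefixed options that this sketch would reject.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Simplified, Flink-free model of what a table factory does with WITH-clause options. */
final class SocketOptionsSketch {

    static final Set<String> REQUIRED = new HashSet<>(Arrays.asList("hostname", "port", "format"));
    static final Set<String> OPTIONAL = new HashSet<>(Arrays.asList("byte-delimiter"));

    final String hostname;
    final int port;
    final byte byteDelimiter;

    private SocketOptionsSketch(String hostname, int port, byte byteDelimiter) {
        this.hostname = hostname;
        this.port = port;
        this.byteDelimiter = byteDelimiter;
    }

    /** Validates the raw string options and returns a parameterized instance. */
    static SocketOptionsSketch fromOptions(Map<String, String> options) {
        for (String key : REQUIRED) {
            if (!options.containsKey(key)) {
                throw new IllegalArgumentException("Missing required option: " + key);
            }
        }
        for (String key : options.keySet()) {
            boolean known =
                    "connector".equals(key) || REQUIRED.contains(key) || OPTIONAL.contains(key);
            if (!known) {
                throw new IllegalArgumentException("Unsupported option: " + key);
            }
        }
        int port = Integer.parseInt(options.get("port"));
        // 10 corresponds to '\n', mirroring the default of the byte-delimiter option
        int delimiter = Integer.parseInt(options.getOrDefault("byte-delimiter", "10"));
        return new SocketOptionsSketch(options.get("hostname"), port, (byte) delimiter);
    }

    public static void main(String[] args) {
        Map<String, String> with = new HashMap<>();
        with.put("connector", "socket");
        with.put("hostname", "localhost");
        with.put("port", "9999");
        with.put("format", "csv");
        SocketOptionsSketch parsed = fromOptions(with);
        System.out.println(parsed.hostname + ":" + parsed.port);
    }
}
```

Failing fast on a missing or unknown key is the point of the sketch: a typo in the WITH clause surfaces when the table is planned rather than when the first record arrives.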

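By default, instances of DynamicTableSourceFactory and DynamicTableSinkFactory are discovered using Java's Service Provider Interfaces (SPI) (https://docs.oracle.com/javase/tutorial/sound/SPI-intro.html). The connector option (for example 'connector' = 'custom'; the code in this article uses 'connector' = 'socket') must correspond to a valid factory identifier.

Although the class names do not make it obvious, DynamicTableSource and DynamicTableSink can also be seen as stateful factories that eventually produce the concrete runtime implementations for reading and writing the actual data.

The planner uses the source and sink instances to perform connector-specific bidirectional communication until an optimal logical plan has been found. Depending on the optional ability interfaces a connector declares (for example SupportsProjectionPushDown or SupportsOverwrite), the planner may apply changes to an instance and thereby change the runtime implementation it produces.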
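
The idea that the planner mutates a source instance before asking it for a runtime implementation can be illustrated with a small Flink-free model. ProjectableSourceSketch and its methods are hypothetical; the real SupportsProjectionPushDown interface works with nested field index paths and Flink data types, which this sketch flattens to a plain int[].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Flink-free model: a planning-time source object that is mutated by an ability call. */
final class ProjectableSourceSketch {

    private final List<String[]> rows; // stands in for the external system
    private int[] projectedFields;     // null means "read all columns"

    ProjectableSourceSketch(List<String[]> rows) {
        this.rows = rows;
    }

    /** Called by the (imaginary) planner when a query needs only some columns. */
    void applyProjection(int[] fieldIndexes) {
        this.projectedFields = fieldIndexes.clone();
    }

    /** Produces the "runtime" result; its shape depends on what the planner applied. */
    List<String[]> scan() {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            if (projectedFields == null) {
                out.add(row.clone());
            } else {
                String[] projected = new String[projectedFields.length];
                for (int i = 0; i < projectedFields.length; i++) {
                    projected[i] = row[projectedFields[i]];
                }
                out.add(projected);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ProjectableSourceSketch source = new ProjectableSourceSketch(
                Arrays.asList(new String[] {"Alice", "12"}, new String[] {"Bob", "5"}));
        source.applyProjection(new int[] {1}); // e.g. a query that selects only the score column
        for (String[] row : source.scan()) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The same object answers scan() differently before and after applyProjection, which is what "stateful factory" means in practice.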

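Runtime

Once logical planning is complete, the planner obtains the runtime implementation from the table connector. Runtime logic is implemented in Flink's core connector interfaces such as InputFormat or SourceFunction.

These interfaces are grouped by another level of abstraction as subclasses of ScanRuntimeProvider, LookupRuntimeProvider, and SinkRuntimeProvider.

For example, both OutputFormatProvider (providing org.apache.flink.api.common.io.OutputFormat) and SinkFunctionProvider (providing org.apache.flink.streaming.api.functions.sink.SinkFunction) are concrete instances of SinkRuntimeProvider that the planner can handle.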

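Dynamic Table Factories

Dynamic table factories are used to configure a dynamic table connector for an external storage system from catalog and session information.

Implement org.apache.flink.table.factories.DynamicTableSourceFactory to construct a DynamicTableSource.

Implement org.apache.flink.table.factories.DynamicTableSinkFactory to construct a DynamicTableSink.

By default, factories are discovered through Java's SPI mechanism, with the value of the connector option serving as the factory identifier.

In a JAR file, the fully qualified class name of each factory implementation must be listed in the following service file: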

META-INF/services/org.apache.flink.table.factories.Factory

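Flink checks the discovered factories for exactly one match, identified by its factory identifier and the requested base class (for example DynamicTableSourceFactory).

If necessary, a catalog implementation can bypass this SPI discovery by returning a factory instance directly from org.apache.flink.table.catalog.Catalog#getFactory.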
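
The "exactly one match" rule can be sketched without Flink as follows. SketchFactory, named, and discover are hypothetical names; in a real setup the candidate list would come from the JDK's java.util.ServiceLoader reading the service file, whereas here it is simply passed in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Flink-free model of matching the 'connector' option against factory identifiers. */
final class FactoryLookupSketch {

    /** Hypothetical stand-in for a table factory; only the identifier matters here. */
    interface SketchFactory {
        String factoryIdentifier();
    }

    static SketchFactory named(final String id) {
        return new SketchFactory() {
            @Override
            public String factoryIdentifier() {
                return id;
            }
        };
    }

    /** Returns the single factory whose identifier equals the connector option. */
    static SketchFactory discover(List<SketchFactory> candidates, String connector) {
        List<SketchFactory> matches = new ArrayList<>();
        for (SketchFactory factory : candidates) {
            if (factory.factoryIdentifier().equals(connector)) {
                matches.add(factory);
            }
        }
        if (matches.isEmpty()) {
            throw new IllegalStateException("No factory for identifier '" + connector + "'");
        }
        if (matches.size() > 1) {
            throw new IllegalStateException("Ambiguous factories for '" + connector + "'");
        }
        return matches.get(0);
    }

    public static void main(String[] args) {
        List<SketchFactory> onClasspath = Arrays.asList(named("socket"), named("kafka"));
        System.out.println(discover(onClasspath, "socket").factoryIdentifier());
    }
}
```

Two JARs that both register a factory with the identifier "socket" would hit the ambiguity branch, which is why the identifier has to be unique.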

SocketDynamicTableFactory

import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;
import org.apache.flink.configuration.ReadableConfig;
import org.apache.flink.table.connector.format.DecodingFormat;
import org.apache.flink.table.connector.source.DynamicTableSource;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.factories.DeserializationFormatFactory;
import org.apache.flink.table.factories.DynamicTableSourceFactory;
import org.apache.flink.table.factories.FactoryUtil;
import org.apache.flink.table.types.DataType;

import java.util.HashSet;
import java.util.Set;

/**
 * The {@link SocketDynamicTableFactory} translates the catalog table to a table source.
 *
 * <p>Because the table source requires a decoding format, we are discovering the format using the
 * provided {@link FactoryUtil} for convenience.
 */
public final class SocketDynamicTableFactory implements DynamicTableSourceFactory {

    // define all options statically
    public static final ConfigOption<String> HOSTNAME =
            ConfigOptions.key("hostname").stringType().noDefaultValue();

    public static final ConfigOption<Integer> PORT =
            ConfigOptions.key("port").intType().noDefaultValue();

    public static final ConfigOption<Integer> BYTE_DELIMITER =
            ConfigOptions.key("byte-delimiter").intType().defaultValue(10); // corresponds to '\n'

    @Override
    public String factoryIdentifier() {
        return "socket"; // used for matching to `connector = '...'`
    }

    @Override
    public Set<ConfigOption<?>> requiredOptions() {
        final Set<ConfigOption<?>> options = new HashSet<>();
        options.add(HOSTNAME);
        options.add(PORT);
        options.add(FactoryUtil.FORMAT); // use pre-defined option for format
        return options;
    }

    @Override
    public Set<ConfigOption<?>> optionalOptions() {
        final Set<ConfigOption<?>> options = new HashSet<>();
        options.add(BYTE_DELIMITER);
        return options;
    }

    @Override
    public DynamicTableSource createDynamicTableSource(Context context) {
        // either implement your custom validation logic here ...
        // or use the provided helper utility
        final FactoryUtil.TableFactoryHelper helper =
                FactoryUtil.createTableFactoryHelper(this, context);

        // discover a suitable decoding format
        final DecodingFormat<DeserializationSchema<RowData>> decodingFormat =
                helper.discoverDecodingFormat(
                        DeserializationFormatFactory.class, FactoryUtil.FORMAT);

        // validate all options
        helper.validate();

        // get the validated options
        final ReadableConfig options = helper.getOptions();
        final String hostname = options.get(HOSTNAME);
        final int port = options.get(PORT);
        final byte byteDelimiter = (byte) (int) options.get(BYTE_DELIMITER);

        // derive the produced data type (excluding computed columns) from the catalog table
        final DataType producedDataType =
                context.getCatalogTable().getResolvedSchema().toPhysicalRowDataType();

        // create and return dynamic table source
        return new SocketDynamicTableSource(
                hostname, port, byteDelimiter, decodingFormat, producedDataType);
    }
}

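Dynamic Table Source

By definition, a dynamic table can change over time.

When a dynamic table is read, its content can be treated as one of the following:

A changelog (bounded or unbounded) whose changes are consumed continuously until the changelog is exhausted. This is represented by the ScanTableSource interface.
A continuously changing or very large external table whose content is usually never read in full but is queried for individual values when necessary. This is represented by the LookupTableSource interface.

A class can implement both interfaces at the same time. The planner decides which one to use depending on the query.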

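Scan Table Source
At runtime, a ScanTableSource scans all rows of an external storage system.

The scanned rows are not limited to insertions; they can also be updates and deletions, so the table source can be used to read a bounded or unbounded changelog. The changelog mode returned by the source tells the planner which kinds of changes to expect at runtime.

In regular batch scenarios, the source emits a bounded stream of insert-only rows.

In regular streaming scenarios, the source emits an unbounded stream of insert-only rows.

In change data capture (CDC) scenarios, the source emits a bounded or unbounded stream of insert, update, and delete rows.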
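
To show what a changelog with insert, update, and delete rows means for the table it describes, the following Flink-free sketch replays a short changelog into a map. The enum mirrors the four row kinds Flink uses (INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE); the Change class, the materialize method, and the assumption that name acts as the key are all made up for this illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Flink-free model: replaying a changelog yields the current content of a dynamic table. */
final class ChangelogSketch {

    /** Mirrors the four row kinds used in Flink changelogs. */
    enum Kind { INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE }

    /** Hypothetical change record for a (name, score) table keyed by name. */
    static final class Change {
        final Kind kind;
        final String name;
        final int score;

        Change(Kind kind, String name, int score) {
            this.kind = kind;
            this.name = name;
            this.score = score;
        }
    }

    /** Applies the changes in order and returns the resulting table. */
    static Map<String, Integer> materialize(List<Change> changelog) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (Change change : changelog) {
            switch (change.kind) {
                case INSERT:
                case UPDATE_AFTER:
                    table.put(change.name, change.score);
                    break;
                case UPDATE_BEFORE:
                case DELETE:
                    table.remove(change.name);
                    break;
                default:
                    throw new IllegalStateException("Unknown kind: " + change.kind);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<Change> changelog = Arrays.asList(
                new Change(Kind.INSERT, "Alice", 12),
                new Change(Kind.INSERT, "Bob", 5),
                new Change(Kind.UPDATE_BEFORE, "Alice", 12),
                new Change(Kind.UPDATE_AFTER, "Alice", 19),
                new Change(Kind.DELETE, "Bob", 5));
        System.out.println(materialize(changelog)); // prints {Alice=19}
    }
}
```

An insert-only source would only ever produce the first kind, which is why batch and plain streaming sources are the simpler cases.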

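A ScanTableSource can implement further ability interfaces to optimize reading. For example, with SupportsProjectionPushDown the source reads only the columns a query actually needs. All ability interfaces live in the org.apache.flink.table.connector.source.abilities package and are listed in the source abilities table.

The runtime implementation of a ScanTableSource must produce Flink's internal data structures, so every record is emitted as org.apache.flink.table.data.RowData. Flink provides runtime converters so that a source can still work with common data structures and convert them at the end.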

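Lookup Table Source
At runtime, a LookupTableSource looks up rows of an external storage system by one or more keys.

Compared to ScanTableSource, a LookupTableSource does not read the whole table. It lazily fetches individual values from the external table (whose content may keep changing) only when they are needed.

Also unlike ScanTableSource, a LookupTableSource currently supports only insert-only changes.

Further ability interfaces are not supported at the moment. See the documentation of org.apache.flink.table.connector.source.LookupTableSource for details.

The runtime implementation of a LookupTableSource is a TableFunction or an AsyncTableFunction. At runtime, Flink calls this function with the values of the lookup keys.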
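
The contrast with a full scan can be sketched in a few lines of plain Java. LookupSketch is a hypothetical model, not Flink's TableFunction API: it touches the external table only when a key is requested and therefore always sees the latest value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Flink-free model of key-based, on-demand lookups against an external table. */
final class LookupSketch {

    private final Map<String, Integer> externalTable; // stands in for the external system
    private int requests;                             // how often the external system was hit

    LookupSketch(Map<String, Integer> externalTable) {
        this.externalTable = externalTable;
    }

    /** Fetches a single value by key; nothing is read until a key is requested. */
    Optional<Integer> lookup(String key) {
        requests++;
        return Optional.ofNullable(externalTable.get(key));
    }

    int requests() {
        return requests;
    }

    public static void main(String[] args) {
        Map<String, Integer> external = new HashMap<>();
        external.put("Alice", 12);
        LookupSketch source = new LookupSketch(external);

        System.out.println(source.lookup("Alice"));
        external.put("Alice", 19); // the external table changes between lookups
        System.out.println(source.lookup("Alice"));
        System.out.println(source.lookup("Carol"));
        System.out.println("requests: " + source.requests());
    }
}
```

One request per key is the trade-off: no full read up front, but a round trip to the external system for every lookup unless a cache is added.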

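Source Abilities

Interface | Description
SupportsFilterPushDown | Enables pushing filters down into the DynamicTableSource. For efficiency, the source can apply the filters as close to the actual data retrieval as possible.
SupportsLimitPushDown | Enables pushing a limit (the expected maximum number of produced records) down into the DynamicTableSource.
SupportsPartitionPushDown | Enables passing the available partitions to the planner and pushing partitions down into the DynamicTableSource. At runtime, the source reads only from the given partition list for efficiency.
SupportsProjectionPushDown | Enables pushing a (possibly nested) projection down into the DynamicTableSource. For efficiency, the source can apply the projection as close to the actual data retrieval as possible. If the source also implements SupportsReadingMetadata, it also reads only the required metadata.
SupportsReadingMetadata | Enables reading metadata columns from a DynamicTableSource. The source appends the required metadata at the end of each produced row, including metadata that comes from the contained format.
SupportsWatermarkPushDown | Enables pushing a watermark strategy down into the DynamicTableSource. The watermark strategy is a builder or factory for timestamp extraction and watermark generation. At runtime, the watermark generator inside the source produces watermarks per partition.
SupportsSourceWatermark | Enables using the watermark strategy provided by the ScanTableSource itself. A CREATE TABLE DDL can use SOURCE_WATERMARK() to tell the planner to call the watermark strategy method of this interface.
SupportsRowLevelModificationScan | Enables passing the scan context RowLevelModificationScanContext from the ScanTableSource to a sink that implements SupportsRowLevelDelete or SupportsRowLevelUpdate.

Note: the interfaces above currently apply only to ScanTableSource, not to LookupTableSource.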
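
As one illustration of how such an ability is negotiated, the sketch below models filter push-down: the source reports which filters it takes over and which ones remain for the planner to evaluate. Everything here is hypothetical and Flink-free. Flink's SupportsFilterPushDown works on resolved expressions rather than the simplified Filter class used below, and the "equality only" capability is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Flink-free model of the accepted/remaining negotiation behind filter push-down. */
final class FilterPushDownSketch {

    /** Hypothetical, simplified filter: column, operator, literal. */
    static final class Filter {
        final String column;
        final String op;
        final String literal;

        Filter(String column, String op, String literal) {
            this.column = column;
            this.op = op;
            this.literal = literal;
        }

        @Override
        public String toString() {
            return column + " " + op + " " + literal;
        }
    }

    /** What the source takes over, and what the planner must still evaluate itself. */
    static final class Result {
        final List<Filter> accepted = new ArrayList<>();
        final List<Filter> remaining = new ArrayList<>();
    }

    /** This imaginary source can evaluate equality filters only. */
    static Result applyFilters(List<Filter> filters) {
        Result result = new Result();
        for (Filter filter : filters) {
            if ("=".equals(filter.op)) {
                result.accepted.add(filter);
            } else {
                result.remaining.add(filter);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result result = applyFilters(Arrays.asList(
                new Filter("name", "=", "'Alice'"),
                new Filter("score", ">", "10")));
        System.out.println("accepted:  " + result.accepted);
        System.out.println("remaining: " + result.remaining);
    }
}
```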

SocketDynamicTableSource

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.connector.source.Source;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.connector.ChangelogMode;
import org.apache.flink.table.connector.ProviderContext;
import org.apache.flink.table.connector.format.DecodingFormat;
import org.apache.flink.table.connector.source.DataStreamScanProvider;
import org.apache.flink.table.connector.source.DynamicTableSource;
import org.apache.flink.table.connector.source.ScanTableSource;
import org.apache.flink.table.connector.source.abilities.SupportsFilterPushDown;
import org.apache.flink.table.connector.source.abilities.SupportsProjectionPushDown;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.DataType;

/**
 * The {@link SocketDynamicTableSource} is used during planning.
 *
 * <p>In our example, we don't implement any of the available ability interfaces such as {@link
 * SupportsFilterPushDown} or {@link SupportsProjectionPushDown}. Therefore, the main logic can be
 * found in {@link #getScanRuntimeProvider(ScanContext)} where we instantiate the required {@link
 * Source} and its {@link DeserializationSchema} for runtime. Both instances are parameterized to
 * return internal data structures (i.e. {@link RowData}).
 *
 * <p>Note: This is only an example and should not be used in production. The source is not
 * fault-tolerant and can only work with a parallelism of 1.
 */
public final class SocketDynamicTableSource implements ScanTableSource {

    private final String hostname;
    private final int port;
    private final byte byteDelimiter;
    private final DecodingFormat<DeserializationSchema<RowData>> decodingFormat;
    private final DataType producedDataType;

    public SocketDynamicTableSource(
            String hostname,
            int port,
            byte byteDelimiter,
            DecodingFormat<DeserializationSchema<RowData>> decodingFormat,
            DataType producedDataType) {
        this.hostname = hostname;
        this.port = port;
        this.byteDelimiter = byteDelimiter;
        this.decodingFormat = decodingFormat;
        this.producedDataType = producedDataType;
    }

    @Override
    public ChangelogMode getChangelogMode() {
        // in our example the format decides about the changelog mode
        // but it could also be the source itself
        return decodingFormat.getChangelogMode();
    }

    @Override
    public ScanRuntimeProvider getScanRuntimeProvider(ScanContext runtimeProviderContext) {
        return new DataStreamScanProvider() {
            @Override
            public DataStream<RowData> produceDataStream(
                    ProviderContext providerContext, StreamExecutionEnvironment execEnv) {
                final DeserializationSchema<RowData> deserializer =
                        decodingFormat.createRuntimeDecoder(
                                runtimeProviderContext, producedDataType);

                final SocketSource socketSource =
                        new SocketSource(hostname, port, byteDelimiter, deserializer);

                return execEnv.fromSource(
                                socketSource, WatermarkStrategy.noWatermarks(), "SocketSource")
                        // SocketSource can only work with a parallelism of 1.
                        .setParallelism(1);
            }

            @Override
            public boolean isBounded() {
                return false;
            }
        };
    }

    @Override
    public DynamicTableSource copy() {
        return new SocketDynamicTableSource(
                hostname, port, byteDelimiter, decodingFormat, producedDataType);
    }

    @Override
    public String asSummaryString() {
        return "Socket Table Source";
    }
}

SocketSource

import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.connector.source.Boundedness;
import org.apache.flink.api.connector.source.ReaderOutput;
import org.apache.flink.api.connector.source.Source;
import org.apache.flink.api.connector.source.SourceReader;
import org.apache.flink.api.connector.source.SourceReaderContext;
import org.apache.flink.api.connector.source.SourceSplit;
import org.apache.flink.api.connector.source.SplitEnumerator;
import org.apache.flink.api.connector.source.SplitEnumeratorContext;
import org.apache.flink.api.java.typeutils.ResultTypeQueryable;
import org.apache.flink.core.io.InputStatus;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.table.data.RowData;
import org.apache.flink.util.Preconditions;
import org.apache.flink.util.UserCodeClassLoader;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/**
 * The {@link SocketSource} opens a socket and consumes bytes.
 *
 * <p>It splits records by the given byte delimiter (`\n` by default) and delegates the decoding to
 * a pluggable {@link DeserializationSchema}.
 *
 * <p>Note: This is only an example and should not be used in production. The source is not
 * fault-tolerant and can only work with a parallelism of 1.
 */
public final class SocketSource
        implements Source<RowData, SocketSource.DummySplit, SocketSource.DummyCheckpoint>, ResultTypeQueryable<RowData> {

    private final String hostname;
    private final int port;
    private final byte byteDelimiter;
    private final DeserializationSchema<RowData> deserializer;

    public SocketSource(
            String hostname,
            int port,
            byte byteDelimiter,
            DeserializationSchema<RowData> deserializer) {
        this.hostname = hostname;
        this.port = port;
        this.byteDelimiter = byteDelimiter;
        this.deserializer = deserializer;
    }

    @Override
    public TypeInformation<RowData> getProducedType() {
        return deserializer.getProducedType();
    }

    @Override
    public Boundedness getBoundedness() {
        return Boundedness.CONTINUOUS_UNBOUNDED;
    }

    @Override
    public SplitEnumerator<DummySplit, DummyCheckpoint> createEnumerator(
            SplitEnumeratorContext<DummySplit> enumContext) throws Exception {
        // The socket itself implicitly represents the only split and the enumerator is not
        // utilized.
        return null;
    }

    @Override
    public SplitEnumerator<DummySplit, DummyCheckpoint> restoreEnumerator(
            SplitEnumeratorContext<DummySplit> enumContext, DummyCheckpoint checkpoint)
            throws Exception {
        // This source is not fault-tolerant.
        return null;
    }

    @Override
    public SimpleVersionedSerializer<DummySplit> getSplitSerializer() {
        return new NoOpDummySplitSerializer();
    }

    @Override
    public SimpleVersionedSerializer<DummyCheckpoint> getEnumeratorCheckpointSerializer() {
        // This source is not fault-tolerant.
        return null;
    }

    @Override
    public SourceReader<RowData, DummySplit> createReader(SourceReaderContext readerContext)
            throws Exception {
        Preconditions.checkState(
                readerContext.currentParallelism() == 1,
                "SocketSource can only work with a parallelism of 1.");
        deserializer.open(
                new DeserializationSchema.InitializationContext() {
                    @Override
                    public MetricGroup getMetricGroup() {
                        return readerContext.metricGroup().addGroup("deserializer");
                    }

                    @Override
                    public UserCodeClassLoader getUserCodeClassLoader() {
                        return readerContext.getUserCodeClassLoader();
                    }
                });
        return new SocketReader();
    }

    /**
     * Placeholder because the socket itself implicitly represents the only split and does not
     * require an actual split object.
     */
    public static class DummySplit implements SourceSplit {
        @Override
        public String splitId() {
            return "dummy";
        }
    }

    /**
     * Placeholder because the SocketSource does not support fault-tolerance and thus does not
     * require actual checkpointing.
     */
    public static class DummyCheckpoint {}

    private class SocketReader implements SourceReader<RowData, DummySplit> {

        private Socket socket;
        private ByteArrayOutputStream buffer;
        private InputStream stream;
        int b;

        @Override
        public void start() {
            while (socket == null) {
                try {
                    socket = new Socket();
                    socket.connect(new InetSocketAddress(hostname, port), 0);
                    buffer = new ByteArrayOutputStream();
                    stream = socket.getInputStream();
                } catch (Throwable t) {
                    socket = null;
                    try {
                        System.err.printf(
                                "Cannot connect to %s:%d. Retrying in 5 seconds...%n",
                                hostname, port);
                        Thread.sleep(5000);
                    } catch (InterruptedException e) {
                        throw new RuntimeException(e);
                    }
                }
            }
        }

        @Override
        public InputStatus pollNext(ReaderOutput<RowData> output) throws Exception {
            while ((b = stream.read()) >= 0) {
                // buffer until delimiter
                if (b != byteDelimiter) {
                    buffer.write(b);
                }
                // decode and emit record
                else {
                    try {
                        output.collect(deserializer.deserialize(buffer.toByteArray()));
                    } catch (Exception e) {
                        System.err.printf(
                                "Malformed data row: %s. A valid sample for the csv format: Alice,12%n",
                                buffer.toString());
                    }
                    buffer.reset();
                    return InputStatus.MORE_AVAILABLE;
                }
            }
            return InputStatus.END_OF_INPUT;
        }

        @Override
        public List<DummySplit> snapshotState(long checkpointId) {
            // This source is not fault-tolerant.
            return Collections.emptyList();
        }

        @Override
        public CompletableFuture<Void> isAvailable() {
            // Not used. The socket is read in a blocking manner until it is closed.
            return null;
        }

        @Override
        public void addSplits(List<DummySplit> splits) {
            // Ignored. The socket itself implicitly represents the only split.
        }

        @Override
        public void notifyNoMoreSplits() {
            // Ignored. The socket itself implicitly represents the only split.
        }

        @Override
        public void close() throws Exception {
            try {
                buffer.close();
            } catch (Throwable t) {
                // ignore
            }

            try {
                stream.close();
            } catch (Throwable t) {
                // ignore
            }

            try {
                socket.close();
            } catch (Throwable t) {
                // ignore
            }
        }
    }

    /**
     * Not used - only required to avoid NullPointerException. The split is not transferred from the
     * enumerator, it is implicitly represented by the socket.
     */
    private static class NoOpDummySplitSerializer implements SimpleVersionedSerializer<DummySplit> {
        @Override
        public int getVersion() {
            return 0;
        }

        @Override
        public byte[] serialize(DummySplit split) throws IOException {
            return new byte[0];
        }

        @Override
        public DummySplit deserialize(int version, byte[] serialized) throws IOException {
            return new DummySplit();
        }
    }
}
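
The record-splitting loop in pollNext can be tried in isolation. The sketch below is plain Java with no Flink dependency and a hypothetical class name; it applies the same "buffer until the delimiter byte" logic to an in-memory stream and decodes each record as a UTF-8 string instead of handing it to a DeserializationSchema.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Flink-free model of how SocketReader cuts a byte stream into records. */
final class DelimiterSplitSketch {

    /** Reads the stream byte by byte and emits a record at every delimiter byte. */
    static List<String> splitRecords(InputStream stream, byte delimiter) {
        List<String> records = new ArrayList<>();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            int b;
            while ((b = stream.read()) >= 0) {
                if (b != delimiter) {
                    buffer.write(b); // keep buffering until the delimiter shows up
                } else {
                    records.add(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
                    buffer.reset();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // bytes after the last delimiter form an incomplete record and are dropped
        return records;
    }

    public static void main(String[] args) {
        byte[] input = "Alice,12\nBob,5\nCar".getBytes(StandardCharsets.UTF_8);
        List<String> records = splitRecords(new ByteArrayInputStream(input), (byte) 10);
        System.out.println(records); // prints [Alice,12, Bob,5]
    }
}
```

Note that a trailing record without a closing delimiter is never emitted, here or in SocketReader, so every line sent to the socket should end with the delimiter byte.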

SocketExample

import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class SocketExample {
    public static void main(String[] args) throws Exception {
        final ParameterTool params = ParameterTool.fromArgs(args);
        final String hostname = params.get("hostname", "192.168.96.200");
        final String port = params.get("port", "9999");

        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); // source only supports parallelism of 1

        final StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // register a table in the catalog
        tEnv.executeSql(
                "CREATE TABLE UserScores (name STRING, score INT)\n"
                        + "WITH (\n"
                        + "  'connector' = 'socket',\n"
                        + "  'hostname' = '"
                        + hostname
                        + "',\n"
                        + "  'port' = '"
                        + port
                        + "',\n"
                        + "  'byte-delimiter' = '10',\n"
                        + "  'format' = 'csv'\n"
                        + ")");

        // define a dynamic aggregating query
        final Table result = tEnv.sqlQuery("SELECT name, SUM(score) FROM UserScores GROUP BY name");

        // print the result to the console
        tEnv.toChangelogStream(result).print();

        env.execute();
    }
}
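
To try SocketExample, something has to listen on the configured port and send rows, because SocketSource connects as a client. A tool such as netcat works; alternatively, the following plain-Java feeder is a minimal sketch of the same idea. CsvRowFeeder, its sample rows, and the convention of passing the port as the first argument are assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Tiny test-data server: waits for one client and sends newline-delimited CSV rows. */
final class CsvRowFeeder {

    /** Encodes rows as lines terminated by byte 10, matching 'byte-delimiter' = '10'. */
    static byte[] encode(List<String> rows) {
        StringBuilder sb = new StringBuilder();
        for (String row : rows) {
            sb.append(row).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<String> rows = Arrays.asList("Alice,12", "Bob,5", "Alice,7");
        if (args.length == 0) {
            // Without a port argument, only show what would be sent.
            System.out.print(new String(encode(rows), StandardCharsets.UTF_8));
            return;
        }
        int port = Integer.parseInt(args[0]);
        try (ServerSocket server = new ServerSocket(port);
                Socket client = server.accept(); // blocks until the Flink job connects
                OutputStream out = client.getOutputStream()) {
            out.write(encode(rows));
            out.flush();
        }
    }
}
```

Start the feeder with a port (for example 9999), then run SocketExample with --hostname localhost --port 9999. The 'format' = 'csv' option presumably also requires Flink's CSV format module on the classpath. Once the feeder closes the connection, the reader sees the end of the stream and the job finishes.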
