Flink SQL upsert-kafka: understanding the ChangelogNormalize state


Original article: https://www.jianshu.com/p/5ffe5aa0dc59

One point worth calling out:

  • In Flink SQL, upsert-kafka deduplication is not implemented inside the Kafka connector itself; it is done by the keyed ValueState in the DeduplicateFunctionBase parent class, which deduplicates state per key. That is also why upsert-kafka requires the Kafka messages to carry a key (see the DDL sketch below).
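As a reference point, here is a minimal, hypothetical upsert-kafka DDL (table, topic, and column names are made up): the PRIMARY KEY declaration determines both the Kafka message key and the key under which the deduplication state is kept.

```sql
-- Hypothetical example; names and addresses are placeholders.
-- The PRIMARY KEY columns become the Kafka message key and the keyBy key
-- for the deduplication (ChangelogNormalize) state.
CREATE TABLE user_balance (
  user_id STRING,
  balance DECIMAL(18, 2),
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'user_balance',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);
```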


```java
/**
 * Base class for deduplicate function.
 *
 * @param <T> Type of the value in the state.
 * @param <K> Type of the key.
 * @param <IN> Type of the input elements.
 * @param <OUT> Type of the returned elements.
 */
abstract class DeduplicateFunctionBase<T, K, IN, OUT> extends KeyedProcessFunction<K, IN, OUT> {

    private static final long serialVersionUID = 1L;

    // the TypeInformation of the values in the state.
    protected final TypeInformation<T> typeInfo;
    protected final long stateRetentionTime;
    protected final TypeSerializer<OUT> serializer;
    // state stores previous message under the key.
    protected ValueState<T> state;

    public DeduplicateFunctionBase(
            TypeInformation<T> typeInfo, TypeSerializer<OUT> serializer, long stateRetentionTime) {
        this.typeInfo = typeInfo;
        this.stateRetentionTime = stateRetentionTime;
        this.serializer = serializer;
    }

    @Override
    public void open(Configuration configure) throws Exception {
        super.open(configure);
        ValueStateDescriptor<T> stateDesc =
                new ValueStateDescriptor<>("deduplicate-state", typeInfo);
        StateTtlConfig ttlConfig = createTtlConfig(stateRetentionTime);
        if (ttlConfig.isEnabled()) {
            stateDesc.enableTimeToLive(ttlConfig);
        }
        state = getRuntimeContext().getState(stateDesc);
    }
}
```
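The stateRetentionTime that drives this TTL is taken from the table-level idle state retention configuration. A minimal sketch of capping it from the SQL client, assuming the `table.exec.state.ttl` option:

```sql
-- Sketch: limit how long the per-key deduplication state is retained.
-- The default of 0 means the state is never cleaned up.
SET 'table.exec.state.ttl' = '1 h';
```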

The concrete deduplication over this state is implemented in:
org.apache.flink.table.runtime.operators.deduplicate.DeduplicateFunctionHelper

```java
    /**
     * Processes element to deduplicate on keys with process time semantic, sends current element as
     * last row, retracts previous element if needed.
     *
     * @param currentRow latest row received by deduplicate function
     * @param generateUpdateBefore whether need to send UPDATE_BEFORE message for updates
     * @param state state of function, null if generateUpdateBefore is false
     * @param out underlying collector
     */
    static void processLastRowOnProcTime(
            RowData currentRow,
            boolean generateUpdateBefore,
            boolean generateInsert,
            ValueState<RowData> state,
            Collector<RowData> out)
            throws Exception {

        checkInsertOnly(currentRow);
        if (generateUpdateBefore || generateInsert) {
            // use state to keep the previous row content if we need to generate UPDATE_BEFORE
            // or use to distinguish the first row, if we need to generate INSERT
            RowData preRow = state.value();
            state.update(currentRow);
            if (preRow == null) {
                // the first row, send INSERT message
                currentRow.setRowKind(RowKind.INSERT);
                out.collect(currentRow);
            } else {
                if (generateUpdateBefore) {
                    preRow.setRowKind(RowKind.UPDATE_BEFORE);
                    out.collect(preRow);
                }
                currentRow.setRowKind(RowKind.UPDATE_AFTER);
                out.collect(currentRow);
            }
        } else {
            // always send UPDATE_AFTER if INSERT is not needed
            currentRow.setRowKind(RowKind.UPDATE_AFTER);
            out.collect(currentRow);
        }
    }
```
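Putting the two snippets together: the first message seen under a key is emitted as an INSERT (+I); every later message under the same key is emitted as UPDATE_AFTER (+U), preceded by an UPDATE_BEFORE (-U) for the previously stored row when generateUpdateBefore is true. To see where this runs in a job, one option is to EXPLAIN a query over the upsert-kafka table; using the hypothetical user_balance table from above, the optimized plan is expected to contain a ChangelogNormalize node, which is the operator that holds this per-key state.

```sql
-- Sketch: the physical plan for a query over an upsert-kafka source
-- should show a ChangelogNormalize(key=[user_id]) node.
EXPLAIN SELECT user_id, balance FROM user_balance;
```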

Flink SQL is Apache Flink's query language for real-time data processing and analysis on Flink. To land data from Kafka into HDFS, you can use Flink SQL as follows.

First, set the Kafka and HDFS connection information in Flink's configuration. In conf/flink-conf.yaml, configure the following properties:

```
state.backend: filesystem
state.checkpoints.dir: hdfs://<HDFS_HOST>:<HDFS_PORT>/checkpoints
state.savepoints.dir: hdfs://<HDFS_HOST>:<HDFS_PORT>/savepoints
```

Here, <HDFS_HOST> is the HDFS host address and <HDFS_PORT> is the HDFS port. With this configuration, Flink stores its checkpoints and savepoints in HDFS.

Next, create a table in Flink SQL to read the data from Kafka and another table to write it to HDFS. This can be done with the following SQL statements:

```sql
CREATE TABLE kafka_source (
  key STRING,
  value STRING
) WITH (
  'connector' = 'kafka',
  'topic' = '<KAFKA_TOPIC>',
  'properties.bootstrap.servers' = '<KAFKA_BOOTSTRAP_SERVERS>',
  'properties.group.id' = '<KAFKA_GROUP_ID>',
  'format' = 'json'
);

CREATE TABLE hdfs_sink (
  key STRING,
  value STRING
) WITH (
  'connector' = 'filesystem',
  'path' = 'hdfs://<HDFS_HOST>:<HDFS_PORT>/output',
  'format' = 'csv',
  'csv.field-delimiter' = ','
);

INSERT INTO hdfs_sink
SELECT key, value FROM kafka_source;
```

Here, '<KAFKA_TOPIC>' is the Kafka topic name, '<KAFKA_BOOTSTRAP_SERVERS>' is the Kafka bootstrap server address, and '<KAFKA_GROUP_ID>' is the Kafka consumer group ID. 'json' and 'csv' are the data formats and can be adjusted as needed.

The statements above create an input table named kafka_source bound to the Kafka topic, and an output table named hdfs_sink that writes to HDFS. Finally, the INSERT INTO statement copies the data from kafka_source into hdfs_sink.

With this configuration and these statements, Flink SQL can land data from Kafka into HDFS.
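One practical note on the streaming case: as far as I know, the filesystem sink only commits part files on checkpoints, so checkpointing needs to be enabled for the INSERT INTO job above. A minimal sketch, assuming the standard execution.checkpointing.interval option:

```sql
-- Sketch: enable periodic checkpoints so the filesystem sink can commit files.
SET 'execution.checkpointing.interval' = '1 min';
```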