Custom Serializers in Flink

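Flink ships its own serializers alongside a Kryo serializer. When a type does not meet the requirements of Flink's type system, Flink falls back to Kryo, which is usually much slower than Flink's own serializers.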

That said, for the cases where Flink falls back to Kryo, it lets you register a custom serializer for your own types. See: https://nightlies.apache.org/flink/flink-docs-release-2.1/zh/docs/dev/datastream/fault-tolerance/serialization/third_party_serializers/

This requires implementing Kryo's Serializer class, whose fully qualified name is com.esotericsoftware.kryo.Serializer. Here is an example implementation:

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.apache.fory.Fory;
import org.apache.fory.ThreadSafeFory;
import org.apache.fory.config.Language;

public class ForySerializer <T> extends Serializer<T> {
	public static final ThreadSafeFory fory = Fory.builder()
			.withLanguage(Language.JAVA)
			.withRefTracking(false)
			.requireClassRegistration(false)
			.buildThreadSafeFory();

	@Override
	public void write(Kryo kryo, Output output, T object) {
		byte[] bytes = fory.serialize(object);
		output.writeInt(bytes.length);
		output.writeBytes(bytes);
	}

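	@Override
	@SuppressWarnings("unchecked") // the payload was written by write(), so the cast to T is expected to hold
	public T read(Kryo kryo, Input input, Class<? extends T> type) {
		// Read the length prefix first, then exactly that many payload bytes
		int length = input.readInt();
		byte[] bytes = input.readBytes(length);
		return (T) fory.deserialize(bytes);
	}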


	@Override
	public boolean isImmutable() {
		return false;
	}

}
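The serializer above frames each record as a 4-byte length followed by the payload, because the Fory bytes are opaque to Kryo and the reader has to know where one record ends. The sketch below shows the same framing with plain `java.io` streams standing in for Kryo's `Output` and `Input`; `LengthPrefixedFraming` and its methods are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Kryo, Fory, or Flink.
final class LengthPrefixedFraming {

    /** Writes a 4-byte length followed by the payload bytes. */
    static void writeFrame(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    /** Reads the 4-byte length, then exactly that many payload bytes. */
    static byte[] readFrame(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] payload = new byte[length];
        in.readFully(payload);
        return payload;
    }

    /** Frames two strings back to back and reads them again in order. */
    static String[] roundTrip(String first, String second) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            writeFrame(out, first.getBytes(StandardCharsets.UTF_8));
            writeFrame(out, second.getBytes(StandardCharsets.UTF_8));
            out.flush();

            DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
            String a = new String(readFrame(in), StandardCharsets.UTF_8);
            String b = new String(readFrame(in), StandardCharsets.UTF_8);
            return new String[] {a, b};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Without the length prefix, the second record could not be separated from the first, since both are just byte runs in the same stream.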

Although Flink provides the mechanism above for custom serializers, it does not take effect for a type like the following:

@NoArgsConstructor
@AllArgsConstructor
@Data
public class CustomDataType {

    private String status;

    private RawValue rawValue;

    @NoArgsConstructor
    @AllArgsConstructor
    @Data
    public static class RawValue {
        private String key;

        private Object value;
    }

}
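A plausible explanation for this case (an assumption, not something stated in this post): Flink's documented POJO rules accept a public class with a public no-argument constructor whose fields are reachable through getters and setters, so CustomDataType is likely analyzed as a POJO rather than handed to Kryo as a whole, and only the field declared as Object ends up as a generic type. The sketch below uses plain reflection to locate such fields. `GenericFieldFinder`, `Sample`, and `Nested` are hypothetical names, Lombok is left out, and the traversal is far simpler than what Flink's TypeExtractor does.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Flink.
final class GenericFieldFinder {

    // Stand-ins shaped like CustomDataType and RawValue (Lombok omitted).
    public static class Sample {
        private String status;
        private Nested rawValue;
    }

    public static class Nested {
        private String key;
        private Object value;
    }

    /** Returns dotted paths of all instance fields declared as java.lang.Object. */
    static List<String> findObjectFields(Class<?> type) {
        List<String> hits = new ArrayList<>();
        collect(type, type.getSimpleName(), hits, 0);
        return hits;
    }

    private static void collect(Class<?> type, String path, List<String> hits, int depth) {
        if (depth > 8) {
            return; // crude guard against self-referencing types
        }
        for (Field field : type.getDeclaredFields()) {
            int modifiers = field.getModifiers();
            if (Modifier.isStatic(modifiers) || Modifier.isTransient(modifiers)
                    || field.isSynthetic()) {
                continue;
            }
            Class<?> fieldType = field.getType();
            String fieldPath = path + "." + field.getName();
            if (fieldType == Object.class) {
                hits.add(fieldPath);
            } else if (isUserClass(fieldType)) {
                collect(fieldType, fieldPath, hits, depth + 1);
            }
        }
    }

    // Only descend into application classes, not JDK types, primitives, arrays, or enums.
    private static boolean isUserClass(Class<?> type) {
        return !type.isPrimitive() && !type.isArray() && !type.isEnum()
                && !type.getName().startsWith("java.");
    }
}
```

Calling `GenericFieldFinder.findObjectFields(GenericFieldFinder.Sample.class)` reports the nested `value` field, which is the part of the type that a Kryo registration for the outer class would presumably not cover.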

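Digging further into the official documentation, we found that defining a TypeInfoFactory lets a type that cannot use Flink's built-in serializers use custom type information, and with it a custom serializer, instead: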

https://nightlies.apache.org/flink/flink-docs-release-2.1/zh/docs/dev/datastream/fault-tolerance/serialization/types_serialization/#defining-type-information-using-a-factory

Below is the detailed implementation of the TypeInfoFactory approach:

public class CustomTypeInfoFactory extends TypeInfoFactory<CustomDataType> {

    @Override
    public TypeInformation<CustomDataType> createTypeInfo(Type t, Map<String, TypeInformation<?>> genericParameters) {
        return new CustomTypeInformation();
    }
}

public class CustomTypeInformation extends TypeInformation<CustomDataType> {

    private static final long serialVersionUID = 1L;

    @Override
    public boolean isBasicType() {
        return false;
    }

    @Override
    public boolean isTupleType() {
        return false;
    }

    @Override
    public int getArity() {
        return 1;
    }

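    @Override
    public int getTotalFields() {
        // The type is treated as a single atomic field. getFields() only returns
        // public fields, so it would yield 0 here, while Flink expects at least 1.
        return 1;
    }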

    @Override
    public Class<CustomDataType> getTypeClass() {
        return CustomDataType.class;
    }

    @Override
    public boolean isKeyType() {
        return false;
    }

    @Override
    public TypeSerializer<CustomDataType> createSerializer(SerializerConfig config) {
        return new CustomTypeSerializer();
    }

    @Override
    public String toString() {
        return "CustomTypeInformation<CustomDataType>";
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof CustomTypeInformation;
    }

    @Override
    public int hashCode() {
        return CustomTypeInformation.class.hashCode();
    }

    @Override
    public boolean canEqual(Object obj) {
        return obj instanceof CustomTypeInformation;
    }
}

This is the core part, where the serializer itself is defined. It uses Apache Fory to serialize the data:

public class CustomTypeSerializer extends TypeSerializer<CustomDataType> {

    private static final long serialVersionUID = 1L;

    public static final ThreadSafeFory fory = Fory.builder()
            .withLanguage(Language.JAVA)
            .withRefTracking(false)
            .requireClassRegistration(false)
            .buildThreadSafeFory();

    public CustomTypeSerializer() {
        
    }

    @Override
    public boolean isImmutableType() {
        return false;
    }

    @Override
    public TypeSerializer<CustomDataType> duplicate() {
        return new CustomTypeSerializer();
    }

    @Override
    public CustomDataType createInstance() {
        return new CustomDataType();
    }

    @Override
    public CustomDataType copy(CustomDataType from) {
        // Copy the nested RawValue so the result does not share it with the source;
        // the Object payload itself is still shared (shallow copy).
        CustomDataType.RawValue source = from.getRawValue();
        CustomDataType.RawValue rawValue = null;
        if (source != null) {
            rawValue = new CustomDataType.RawValue(source.getKey(), source.getValue());
        }
        return new CustomDataType(from.getStatus(), rawValue);
    }

    @Override
    public CustomDataType copy(CustomDataType from, CustomDataType reuse) {
        // Delegate to copy(from) so the result never shares a RawValue with the source.
        return copy(from);
    }

    @Override
    public int getLength() {
        return -1; // variable length
    }

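    @Override
    public void serialize(CustomDataType record, DataOutputView target) throws IOException {
        // Length-prefixed framing: payload size first, then the Fory-encoded bytes
        byte[] bytes = fory.serialize(record);
        target.writeInt(bytes.length);
        target.write(bytes);
    }

    @Override
    public CustomDataType deserialize(DataInputView source) throws IOException {
        // Read the length prefix, then exactly that many bytes
        int length = source.readInt();
        byte[] bytes = new byte[length];
        source.readFully(bytes); // read(bytes) may return before the buffer is full
        return (CustomDataType) fory.deserialize(bytes);
    }

    @Override
    public CustomDataType deserialize(CustomDataType reuse, DataInputView source) throws IOException {
        // The record must always be consumed from source; returning reuse untouched
        // would hand back stale data and leave the stream positioned mid-record.
        return deserialize(source);
    }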

    @Override
    public void copy(DataInputView source, DataOutputView target) throws IOException {
        serialize(deserialize(source), target);
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof CustomTypeSerializer;
    }

    @Override
    public int hashCode() {
        return CustomTypeSerializer.class.hashCode();
    }

    @Override
    public TypeSerializerSnapshot<CustomDataType> snapshotConfiguration() {
        return new CustomTypeSerializerSnapshot();
    }

    // Serializer snapshot (used for state compatibility)
    public static final class CustomTypeSerializerSnapshot extends SimpleTypeSerializerSnapshot<CustomDataType> {
        public CustomTypeSerializerSnapshot() {
            super(CustomTypeSerializer::new);
        }
    }
}
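When a serializer reads a length-prefixed payload, it matters whether the read call is guaranteed to fill the buffer. The sketch below illustrates the difference between `read(byte[])` and `readFully(byte[])` with a plain `java.io.DataInputStream` standing in for Flink's DataInputView. `PartialReadDemo` and `TricklingStream` are hypothetical names, and the one-byte-per-call stream is an artificial worst case chosen to make the effect visible.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical demo for illustration; not part of Flink.
final class PartialReadDemo {

    /** An input stream that hands out at most one byte per bulk read call. */
    static final class TricklingStream extends InputStream {
        private final InputStream delegate;

        TricklingStream(byte[] data) {
            this.delegate = new ByteArrayInputStream(data);
        }

        @Override
        public int read() throws IOException {
            return delegate.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return delegate.read(b, off, Math.min(len, 1));
        }
    }

    /** Fills a buffer with a single read(byte[]) call, ignoring the returned count. */
    static byte[] fillWithRead(byte[] data) {
        byte[] buffer = new byte[data.length];
        try (DataInputStream in = new DataInputStream(new TricklingStream(data))) {
            in.read(buffer); // may stop early; the rest of the buffer stays zeroed
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer;
    }

    /** Fills a buffer with readFully, which loops until the buffer is complete. */
    static byte[] fillWithReadFully(byte[] data) {
        byte[] buffer = new byte[data.length];
        try (DataInputStream in = new DataInputStream(new TricklingStream(data))) {
            in.readFully(buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer;
    }
}
```

With the input `{1, 2, 3, 4}`, `fillWithRead` leaves the buffer as `{1, 0, 0, 0}`, while `fillWithReadFully` returns all four bytes. A truncated buffer handed to a deserializer would either fail or produce a corrupted record.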

Finally, annotate the data type with @TypeInfo(CustomTypeInfoFactory.class).

Here is an example job that exercises the custom serializer end to end:

public class SerializerDemo {

    public static void main(String[] args) throws Exception {

        Configuration configuration = new Configuration();
        configuration.set(RestOptions.PORT, 8081);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(configuration);
        env.setParallelism(2);
        env.disableOperatorChaining();

        env.addSource(new SourceFunction<String>() {

            private volatile boolean flag = true;

            private Random rand = new Random();

            @Override
            public void run(SourceContext<String> ctx) throws Exception {
                while (flag) {
                    Thread.sleep(1000);
                    ctx.collect(String.format("%s@%s", rand.nextInt(100), LocalDateTime.now()));
                }
            }

            @Override
            public void cancel() {
                flag = false;
            }
        }).map(new MapFunction<String, CustomDataType>() {
            @Override
            public CustomDataType map(String value) throws Exception {
                String[] split = value.split("@");
                CustomDataType customDataType = new CustomDataType();
                customDataType.setStatus("alive");
                CustomDataType.RawValue rawValue = new CustomDataType.RawValue();
                rawValue.setKey(split[0]);
                rawValue.setValue(split[1]);
                customDataType.setRawValue(rawValue);
                return customDataType;
            }
        }).addSink(new SinkFunction<CustomDataType>() {
            @Override
            public void invoke(CustomDataType value) throws Exception {
                System.out.println("Result: " + value);
            }
        });

        env.execute();
    }

}

Finally, here is the call stack I copied while debugging, to make the overall code flow easier to follow:

createTypeInfo:13, CustomTypeInfoFactory (com.bonree.serializers)
createTypeInfoFromFactory:1385, TypeExtractor (org.apache.flink.api.java.typeutils)
createTypeInfoFromFactory:1353, TypeExtractor (org.apache.flink.api.java.typeutils)
getTypeInfoFactory:1730, TypeExtractor (org.apache.flink.api.java.typeutils)          // this is where the @TypeInfo annotation on the POJO class is read to obtain the corresponding TypeInfoFactory
getClosestFactory:1790, TypeExtractor (org.apache.flink.api.java.typeutils)
createTypeInfoFromFactory:1340, TypeExtractor (org.apache.flink.api.java.typeutils)
createTypeInfoWithTypeHierarchy:882, TypeExtractor (org.apache.flink.api.java.typeutils)
privateCreateTypeInfo:861, TypeExtractor (org.apache.flink.api.java.typeutils)
getUnaryOperatorReturnType:608, TypeExtractor (org.apache.flink.api.java.typeutils)
getMapReturnTypes:184, TypeExtractor (org.apache.flink.api.java.typeutils)
map:425, DataStream (org.apache.flink.streaming.api.datastream)
main:41, SerializerDemo (com.bonree.serializers)
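The stack above suggests that the type extractor looks for an annotation on the class (or on something in its hierarchy) and then asks the referenced factory for type information. The following is a simplified imitation of that lookup using only reflection. `UseFactory`, `InfoFactory`, and the other names are hypothetical stand-ins for `@TypeInfo` and `TypeInfoFactory`, and the real TypeExtractor logic is considerably more involved.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical sketch for illustration; not Flink's actual implementation.
final class FactoryLookupSketch {

    /** Stand-in for a type-info factory. */
    interface InfoFactory {
        String describe(Class<?> type);
    }

    /** Stand-in for the annotation that points a type at its factory. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface UseFactory {
        Class<? extends InfoFactory> value();
    }

    static final class CustomFactory implements InfoFactory {
        public CustomFactory() {}

        @Override
        public String describe(Class<?> type) {
            return "custom:" + type.getSimpleName();
        }
    }

    @UseFactory(CustomFactory.class)
    static class Annotated {}

    static class Child extends Annotated {}

    static class Plain {}

    /** Walks up the class hierarchy and uses the closest annotated factory, if any. */
    static String resolve(Class<?> type) {
        for (Class<?> c = type; c != null && c != Object.class; c = c.getSuperclass()) {
            UseFactory annotation = c.getDeclaredAnnotation(UseFactory.class);
            if (annotation != null) {
                try {
                    InfoFactory factory =
                            annotation.value().getDeclaredConstructor().newInstance();
                    return factory.describe(type);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Cannot instantiate factory", e);
                }
            }
        }
        // No annotation found: this is where default type analysis would take over.
        return "fallback:" + type.getSimpleName();
    }
}
```

In this sketch, `resolve(FactoryLookupSketch.Child.class)` goes through the factory declared on its superclass, while `resolve(FactoryLookupSketch.Plain.class)` takes the fallback path.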

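### Guide to Implementing a Custom Serializer for Flink CDC 2.4

#### Creating the custom serializer class

To create a custom serializer, extend the `TypeSerializer` class in Java or Scala and override the corresponding methods. This lets the developer control the serialization logic for a data type.

```java
public class CustomRowDataSerializer extends TypeSerializer<RowData> {
    @Override
    public boolean isImmutableType() { /* ... */ }

    @Override
    public TypeSerializer<RowData> duplicate() { /* ... */ }

    @Override
    public TypeSerializerSnapshot<RowData> snapshotConfiguration() { /* ... */ }

    @Override
    public RowData createInstance() { /* ... */ }

    @Override
    public RowData copy(RowData from) { /* ... */ }

    @Override
    public RowData copy(RowData from, RowData reuse) { /* ... */ }

    @Override
    public int getLength() { /* ... */ }

    @Override
    public void serialize(RowData record, DataOutputView target) throws IOException { /* ... */ }

    @Override
    public RowData deserialize(DataInputView source) throws IOException { /* ... */ }

    @Override
    public RowData deserialize(RowData reuse, DataInputView source) throws IOException { /* ... */ }

    @Override
    public void copy(DataInputView source, DataOutputView target) throws IOException { /* ... */ }

    @Override
    public boolean equals(Object obj) { /* ... */ }

    @Override
    public int hashCode() { /* ... */ }
}
```

This snippet shows how to build a new serializer for a specific data structure, such as `RowData` objects[^1].

#### Registering the custom serializer

Once the serializer is implemented, it also has to be registered with the Flink runtime so the framework can recognize and use it. This is usually done through a configuration file or programmatically:

```java
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.getConfig().registerTypeWithKryoSerializer(CustomClass.class, CustomSerializer.class);
```

Note that this call registers a Kryo `Serializer` for the class, not a Flink `TypeSerializer`. For streaming applications, use StreamExecutionEnvironment instead of ExecutionEnvironment[^2].

#### Use cases for the serializer

Flink CDC supports many databases as data sources and can forward captured change events to downstream systems. When this involves complex objects or data in special formats, a customized serialization mechanism is needed to keep the transfer efficient and accurate[^3].

#### Compilation notes

During development you may run into compile errors, especially when you try to modify components inside an officially released jar. To avoid these problems, install all required dependencies following the standard process before debugging. For example, the following Maven command skips the test phase to speed up iteration:

```bash
mvn clean compile install -DskipTests=true
```

This avoids unnecessary trouble and shortens the build cycle[^4].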