[Flink] Data format conversion, and reading from and writing to Kafka, Redis, MySQL, and ClickHouse

嘚嘚嘚嘚嘚嘚哒

Last modified: 2023-08-22 16:26:41

Category: Flink. Tags: flink, kafka, redis, mysql, clickhouse, database, java

First published: 2023-08-18 11:04:04

Article link: https://blog.csdn.net/qq_47450919/article/details/132290168

Previous article: Flink basics, reading data, and running with multiple parallel subtasks

I. Adding dependencies

Remember to add the core Flink dependencies yourself.

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-jdbc_2.11</artifactId>
        <version>1.12.7</version>
    </dependency>

    <dependency>
        <groupId>ru.yandex.clickhouse</groupId>
        <artifactId>clickhouse-jdbc</artifactId>
        <version>0.3.2</version>
    </dependency>

II. Applying transformations with Flink

Convert data from one format to another.

Key API: streamSource.process(new ProcessFunction<...>() { ... })

1. Converting strings to numbers

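import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

import java.util.Arrays;

public class FlinkTransform01 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<String> streamSource = environment.fromCollection(Arrays.asList("151", "162", "134", "131"));

        SingleOutputStreamOperator<Integer> streamOperator = streamSource.process(new ProcessFunction<String, Integer>() {

            @Override
            public void processElement(String s, ProcessFunction<String, Integer>.Context context, Collector<Integer> collector) throws Exception {
                // Parse the string and emit the integer downstream
                Integer data = Integer.valueOf(s);
                collector.collect(data);
            }
        }).setParallelism(10);

        // Print the converted stream, not the original string source
        streamOperator.print();

        environment.execute();
    }
}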

Output:
5> 162
6> 134
4> 151
7> 131

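This code creates a data source from a collection of strings, uses process to convert each string to an integer, emits the converted integers downstream, and prints the result. Note the .setParallelism(10) on streamOperator: it sets the parallelism of this operator to 10, so Flink runs it as 10 parallel subtasks. The N> prefix on each output line is the number of the print subtask that wrote it, so the prefixes and the line order can differ between runs.

The code above touches on the following main points:

StreamExecutionEnvironment: the execution environment of a Flink program. Data streams are created from it, and program-wide parameters and properties are set on it.
Data sources: environment.fromCollection turns a collection into a data stream. It is a commonly used source for tests and demos.
ProcessFunction: a core Flink abstraction for processing a stream element by element. Here it converts each string to an integer and emits the result downstream.
Parallelism: .setParallelism(10) sets the parallelism of the operator, which determines how many parallel subtasks execute it at the same time.
SingleOutputStreamOperator: a representation of a data stream to which further operations can be applied. Here streamOperator is the stream produced by the process operation.
print(): streamOperator.print() prints the result of the operation. It is a simple sink that makes it easy to inspect the contents of a stream.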
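
One point the example does not cover: Integer.valueOf throws NumberFormatException on non-numeric input, and an exception that escapes processElement fails the task. Below is a minimal plain-Java sketch of a defensive parse that could be called from processElement. The class and method names (SafeParse, parseInt) are hypothetical helpers for illustration and are not part of Flink.

```java
import java.util.Optional;

// Hypothetical helper, not part of Flink: parse defensively so that one bad
// record does not throw out of processElement.
final class SafeParse {
    private SafeParse() {}

    // Returns the parsed value, or Optional.empty() for null or non-numeric input.
    static Optional<Integer> parseInt(String s) {
        if (s == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(Integer.valueOf(s.trim()));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseInt("151")); // a present value
        System.out.println(parseInt("abc")); // an empty Optional
    }
}
```

Inside processElement this could be used as SafeParse.parseInt(s).ifPresent(collector::collect), which silently drops bad records. Whether dropping, logging, or routing them elsewhere is right depends on the job.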

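2. Converting JSON strings to Java objects

Key API: addSource(new FlinkKafkaConsumer<>(...))

Here we read the data stored earlier in the Kafka topic named "alert_log". First make sure the corresponding entity class AlertLog exists. Its code is as follows: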

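import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@NoArgsConstructor
@AllArgsConstructor
public class AlertLog {
    // Device (switch) serial number: distinguishes the data collected by each switch
    private String deviceNum;

    // Five-tuple
    private String srcIp;
    private Integer srcPort;
    private String dstIp;
    private Integer dstPort;
    // Protocol
    private String protocol;

    // Five-tuple + 2 MAC addresses = seven-tuple
    // Source MAC address
    private String srcMac;

    // Destination MAC address
    private String dstMac;

    // Time at which this packet was observed (timestamp)
    private long createTime;
}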

In a Flink ProcessFunction, use an ObjectMapper to deserialize each JSON string into an AlertLog object:

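import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.consumer.ConsumerConfig;

import java.util.Properties;

class FlinkTransform02 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties properties = new Properties();
        properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "<server-ip>:9092,<server-ip>:9093");
        properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "flink_kafka");

        DataStreamSource<String> streamSource = environment.addSource(new FlinkKafkaConsumer<>("alert_log", new SimpleStringSchema(), properties));

        SingleOutputStreamOperator<AlertLog> streamOperator = streamSource.process(new ProcessFunction<String, AlertLog>() {
            // An anonymous inner class cannot declare static fields, so the mapper is an instance field
            private final ObjectMapper objectMapper = new ObjectMapper();

            @Override
            public void processElement(String s, ProcessFunction<String, AlertLog>.Context context, Collector<AlertLog> collector) throws Exception {
                // Deserialize the JSON string into an AlertLog and emit it downstream
                AlertLog alertLog = objectMapper.readValue(s, AlertLog.class);
                collector.collect(alertLog);
            }
        }).setParallelism(10);

        streamOperator.print();

        environment.execute();
    }
}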

Overall, this code shows how to create a data stream with Flink, apply a transformation, set the parallelism, and do simple processing and output.

III. Flink with Kafka

1. Writing to Kafka with the official component, new FlinkKafkaProducer

Key API: addSink(new FlinkKafkaProducer<>(...))

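import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Arrays;
import java.util.Properties;

public class FlinkSink01 {
    // Write to Kafka
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<String> streamSource = environment.fromCollection(Arrays.asList("8541", "7"));

        Properties properties = new Properties();
        properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "<server-ip>:9092,<server-ip>:9093");

        // Option 2: use the official component, new FlinkKafkaProducer
        streamSource.addSink(new FlinkKafkaProducer<>("test_flink_1", new SimpleStringSchema(), properties));

        // Option 1: a hand-written sink (kept commented out, details elided)
//        streamSource.addSink(new SinkFunction<String>() {
//            KafkaProducer kafkaProducer = new KafkaProducer<>("fill in here ...");
//            @Override
//            public void invoke(String value, Context context) throws Exception {
//                kafkaProducer.send(new ProducerRecord("","111"));
//
//            }
//        });

        environment.execute();
    }
}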

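2. Reading from Kafka

You can consume the topic directly with Kafka itself; see the article "Kafka basics and using Kafka from Java".

You can also read Kafka data directly with Flink. The official connector wraps the Kafka consumer client and reads the topic's partitions in parallel, so it is both convenient and fast.

Key API: addSource(new FlinkKafkaConsumer<>(...))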

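import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;

import java.util.Properties;

/*
 * Read Kafka data with Flink using the official component, new FlinkKafkaConsumer:
 * 1. It is simple to use.
 * 2. It wraps the Kafka consumer client and reads partitions in parallel.
 */
class FlinkSourceDemo03 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties properties = new Properties();
        properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "<server-ip>:9092,<server-ip>:9093");
        properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "kafka_reader_1");

        DataStreamSource<String> streamSource = environment.addSource(new FlinkKafkaConsumer<>("test_flink_1", new SimpleStringSchema(), properties));

        streamSource.print();

        environment.execute();
    }
}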
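
The ConsumerConfig constants used above are only names for plain string keys. As a rough sketch, the same consumer settings can be built with java.util.Properties and the literal keys. The class name KafkaConsumerProps and the auto.offset.reset setting are assumptions added for illustration and do not appear in the original code.

```java
import java.util.Properties;

// Hypothetical helper: builds the consumer properties with literal Kafka keys.
final class KafkaConsumerProps {
    private KafkaConsumerProps() {}

    static Properties build(String bootstrapServers, String groupId) {
        Properties p = new Properties();
        // Same key as ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG
        p.setProperty("bootstrap.servers", bootstrapServers);
        // Same key as ConsumerConfig.GROUP_ID_CONFIG
        p.setProperty("group.id", groupId);
        // Assumption, not in the original: where to start reading when the
        // consumer group has no committed offset yet
        p.setProperty("auto.offset.reset", "earliest");
        return p;
    }
}
```

The resulting Properties object can be passed to the FlinkKafkaConsumer constructor in place of the one built inline.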

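IV. Flink with Redis

1. Writing to Redis

Note: when a sink function needs a third-party connection object, such as a Jedis client, Flink places special requirements on that object. The function instance is serialized and shipped to the workers, and it is created again when a job recovers from a checkpoint, so a live connection cannot simply be initialized as an ordinary member field of the sink function. Flink provides a way to initialize such members through the Rich interfaces: create them in open() and release them in close().

This is done with a rich function: new RichSinkFunction

Background: with checkpointing enabled, Flink periodically saves the running state of a job to durable storage. If the system goes down, the job is restored from the most recent checkpoint after it restarts.

Key API: addSink(new RichSinkFunction<...>() { ... })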

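import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import redis.clients.jedis.Jedis;

import java.util.Arrays;

class FlinkSink02 {
    /*
     * Flink with Redis: write each element as a key
     */
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<String> streamSource = environment.fromCollection(Arrays.asList("morning", "afternoon"));

        streamSource.addSink(new RichSinkFunction<String>() {
            private Jedis jedis;

            @Override
            public void open(Configuration parameters) throws Exception {
                // Create the connection here, once per parallel subtask
                jedis = new Jedis("<server-ip>", 6379);
                jedis.auth("<redis-password>");
            }

            @Override
            public void close() throws Exception {
                // open() may have failed before the client was created
                if (jedis != null) {
                    jedis.close();
                }
            }

            @Override
            public void invoke(String value, Context context) throws Exception {
                jedis.set(value, "ohhhhhh!!!");
            }
        });

        environment.execute();
    }
}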

Inspecting the Redis database with the QuickRedis tool confirms that the keys were inserted.
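
To make the reason for open() more concrete, here is a small plain-Java sketch that does not use Flink or Jedis. It assumes the function object is shipped using Java serialization and uses hypothetical stand-in classes (FakeClient, FakeRichSink) to show that a connection-like field does not survive the trip and has to be created afterwards.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Stand-in for a client such as Jedis: deliberately NOT Serializable.
final class FakeClient {
    final String host;
    FakeClient(String host) { this.host = host; }
}

// Stand-in for a rich sink: serializable configuration, client created in open().
final class FakeRichSink implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String host;           // plain configuration travels with the object
    private transient FakeClient client; // skipped by serialization, null until open()

    FakeRichSink(String host) { this.host = host; }

    void open() { client = new FakeClient(host); }

    boolean isOpen() { return client != null; }

    String clientHost() { return client.host; }

    // Serialize and deserialize the sink, roughly what happens when a
    // function object is shipped from the client to a worker.
    static FakeRichSink roundTrip(FakeRichSink sink) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(sink);
            }
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (FakeRichSink) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

After roundTrip the copy still knows its host but has no client until open() is called, which mirrors why the Jedis connection in the sink above is created in open() rather than in a field initializer.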

2. Reading from Redis

Key API: addSource(new RichParallelSourceFunction<...>() { ... })

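import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

class FlinkSink03 {
    /*
     * Read from Redis: new RichParallelSourceFunction
     */
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<Object> streamSource = environment.addSource(new RichParallelSourceFunction<Object>() {

            @Override
            public void run(SourceContext<Object> sourceContext) throws Exception {
                //...... (the Redis read logic is elided in the original)
            }

            @Override
            public void cancel() {
            }
        });

        streamSource.print();

        environment.execute();
    }
}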

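3. Writing to Redis with a third-party connector

Flink itself does not ship an official Redis connector. Third-party teams maintain a Flink-to-Redis sink dependency (the one used here is published under org.apache.bahir), but it does not always work reliably.

The Scala version suffix of the connector (here 2.12) must match the one used by your other Flink dependencies.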

<!-- third-party dependency -->
        <dependency>
            <groupId>org.apache.bahir</groupId>
            <artifactId>flink-connector-redis_2.12</artifactId>
            <version>1.1.0</version>
        </dependency>

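import org.apache.flink.streaming.api.datastream.DataStreamSink;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.redis.RedisSink;
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommand;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommandDescription;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisMapper;

import java.util.Arrays;

class FlinkSink04 {
    /*
     * Write to Redis with the third-party dependency flink-connector-redis_2.12
     * (configured through new FlinkJedisPoolConfig.Builder())
     */
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<String> streamSource = environment.fromCollection(Arrays.asList("fighting", "oh My"));

        // Connection settings
        FlinkJedisPoolConfig jedisPoolConfig = new FlinkJedisPoolConfig.Builder()
                .setHost("<server-ip>").setPort(6379)
                .setPassword("<redis-password>").build();

        // Describe how each element of the Flink stream is stored in Redis
        DataStreamSink<String> streamSink = streamSource.addSink(new RedisSink<>(jedisPoolConfig, new RedisMapper<String>() {
            // Redis command to execute
            @Override
            public RedisCommandDescription getCommandDescription() {
                return new RedisCommandDescription(RedisCommand.SET);
            }

            // Key: the stream element itself
            @Override
            public String getKeyFromData(String s) {
                return s;
            }

            // Value: a fixed string
            @Override
            public String getValueFromData(String s) {
                return "three";
            }
        }));

        environment.execute();
    }
}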

V. Flink with MySQL

0. Basic setup

Create the table

CREATE TABLE `alert_log` (
  `id` varchar(100) NOT NULL,
  `src_ip` varchar(100) DEFAULT NULL,
  `dst_ip` varchar(100) DEFAULT NULL,
  `src_port` int(11) DEFAULT NULL,
  `create_time` datetime DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

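Add an id field and a matching constructor to AlertLog

    // The constructor below assigns id, so AlertLog also needs this field
    private String id;

    public AlertLog(String id, String srcIp, Integer srcPort, String dstIp, long createTime) {
        this.id = id;
        this.srcIp = srcIp;
        this.srcPort = srcPort;
        this.dstIp = dstIp;
        this.createTime = createTime;
    }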

Add the dependency

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.32</version>
        </dependency>

1. The official flink-jdbc approach: JdbcSink.sink

The JdbcSink class provides the static method sink() to create a SinkFunction that writes data to a database. Its overloads accept different parameters that control how the data is written.

sink(String sql, JdbcStatementBuilder<T> statementBuilder, JdbcConnectionOptions connectionOptions): creates a SinkFunction from an SQL statement, a JdbcStatementBuilder, and a JdbcConnectionOptions object.
The JdbcStatementBuilder fills in the parameters of the PreparedStatement for each record, and JdbcConnectionOptions holds the JDBC connection settings. The JdbcExecutionOptions are created with JdbcExecutionOptions.defaults().
sink(String sql, JdbcStatementBuilder<T> statementBuilder, JdbcExecutionOptions executionOptions, JdbcConnectionOptions connectionOptions): like the previous method, but accepts a custom JdbcExecutionOptions object.
(For concrete usage, see the ClickHouse section below.)

import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.Arrays;
import java.util.Date;

public class FlinkSink05 {
    /*
     * Write to MySQL with Flink
     */
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<AlertLog> streamSource = environment.fromCollection(Arrays.asList(
                new AlertLog("id1", "192.168.3.110", 1234, "192.168.3.122", System.currentTimeMillis()),
                new AlertLog("id2", "192.168.3.110", 5678, "192.168.3.122", System.currentTimeMillis())
        ));

        // The official flink-jdbc approach
        streamSource.addSink(JdbcSink.sink(
                // Parameters follow the table's column order: id, src_ip, dst_ip, src_port, create_time
                "insert into alert_log values(?,?,?,?,?)",
                (preparedStatement, alertLog) -> {
                    preparedStatement.setObject(1, alertLog.getId());
                    preparedStatement.setObject(2, alertLog.getSrcIp());
                    preparedStatement.setObject(3, alertLog.getDstIp());
                    preparedStatement.setObject(4, alertLog.getSrcPort());
                    // createTime is in epoch milliseconds; convert it for the datetime column
                    preparedStatement.setObject(5, new Date(alertLog.getCreateTime()));
                },
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withDriverName("com.mysql.cj.jdbc.Driver")
                        .withUrl("jdbc:mysql://<server-ip>:3306/flink23?useUnicode=true&characterEncoding=utf8")
                        .withUsername("root")
                        .withPassword("<mysql-password>")
                        .build()
        ));

        environment.execute();
    }
}

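Points to note:
new Date(alertLog.getCreateTime()): converts the millisecond timestamp into a date value for the datetime column.
characterEncoding=utf8: sets the character encoding of the connection.
.build(): the options object is created with the builder pattern.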
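
The statement "insert into alert_log values(?,?,?,?,?)" relies on the parameters being bound in the table's column order. A small sketch of a helper that spells the column list out, so the statement no longer depends on that order, is shown below. InsertSql and build are hypothetical names, not part of flink-jdbc, and the helper only concatenates trusted constant names.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper: builds an INSERT with an explicit column list and one
// "?" placeholder per column. Only pass trusted, constant table and column names.
final class InsertSql {
    private InsertSql() {}

    static String build(String table, List<String> columns) {
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("at least one column is required");
        }
        String names = String.join(", ", columns);
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "insert into " + table + " (" + names + ") values (" + marks + ")";
    }
}
```

For the alert_log table created earlier, InsertSql.build("alert_log", Arrays.asList("id", "src_ip", "dst_ip", "src_port", "create_time")) yields a statement whose placeholders line up with the five setObject calls in the sink.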

2. Plain JDBC

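// Plain JDBC sink (needs java.sql.Connection, java.sql.DriverManager, java.sql.PreparedStatement)
        streamSource.addSink(new RichSinkFunction<AlertLog>() {
            private Connection connection;
            private PreparedStatement ps;
            private int count = 0;

            @Override
            public void open(Configuration parameters) throws Exception {
                Class.forName("com.mysql.cj.jdbc.Driver");

                // Batch support must be enabled in the JDBC URL (for MySQL Connector/J: rewriteBatchedStatements=true).
                // URL, user, password, and SQL are left blank in the original; fill in your own.
                connection = DriverManager.getConnection("", "", "");
                ps = connection.prepareStatement("");
            }

            @Override
            public void close() throws Exception {
                // Send the rows still waiting in the batch, then release the resources
                if (ps != null) {
                    ps.executeBatch();
                    ps.close();
                }
                if (connection != null) {
                    connection.close();
                }
                super.close();
            }

            @Override
            public void invoke(AlertLog value, Context context) throws Exception {
                ps.setString(1, value.getId());
                ps.setString(2, value.getSrcIp());

                // Every row joins the batch; send the batch once 4000 rows have accumulated
                ps.addBatch();
                if (++count >= 4000) {
                    ps.executeBatch();
                    ps.clearBatch();
                    count = 0;
                }
            }
        });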
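
The batching idea in this sink is independent of JDBC: buffer every row, flush when the buffer is full, and flush the remainder on shutdown. A minimal plain-Java sketch of that pattern follows. BatchBuffer is a hypothetical class for illustration, and the flusher callback stands in for the call that would execute the batch against the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical count-based batch buffer, independent of JDBC and Flink.
final class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> flusher; // stands in for executing the batch
    private final List<T> buffer = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> flusher) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    // Plays the role of invoke(): buffer the row, then flush once the batch is full.
    void add(T row) {
        buffer.add(row);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Plays the role of close(): write whatever is still buffered.
    void close() {
        flush();
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flusher.accept(new ArrayList<>(buffer)); // hand over a copy of the batch
        buffer.clear();
    }
}
```

With a batch size of 4 and 10 rows, the callback receives two batches of 4 while rows arrive and one batch of 2 when close() is called, so no row is lost at shutdown.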

VI. Flink with ClickHouse

Create the ClickHouse table

CREATE TABLE test1.alert_log (
	id VARCHAR(100),
	src_ip VARCHAR(100),
	dst_ip VARCHAR (100),
	src_port Int32,
	create_time DateTime
)ENGINE = MergeTree
ORDER BY id;

Data is written to ClickHouse with Flink's JDBC connector. The code is essentially the same as for MySQL.

// The official flink-jdbc approach
        streamSource.addSink(JdbcSink.sink(
                "insert into alert_log values(?,?,?,?,?)",
                (preparedStatement, alertLog) -> {
                    preparedStatement.setObject(1, alertLog.getId());
                    preparedStatement.setObject(2, alertLog.getSrcIp());
                    preparedStatement.setObject(3, alertLog.getDstIp());
                    preparedStatement.setObject(4, alertLog.getSrcPort());
                    // Milliseconds to seconds for the DateTime column
                    preparedStatement.setObject(5, alertLog.getCreateTime() / 1000);
                },
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        // ClickHouse JDBC driver class
                        .withDriverName(ClickHouseDriver.class.getName())
                        // Database test1, with the time zone given in the URL
                        .withUrl("jdbc:clickhouse://<server-ip>:8123/test1?serverTimezone=Asia/Shanghai")
                        .withUsername("root")
                        .withPassword("<clickhouse-password>")
                        .build()
        ));

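The parameters used above and their meanings:

serverTimezone=Asia/Shanghai: a parameter in the ClickHouse JDBC connection URL that specifies the server time zone. Here it is set to Asia/Shanghai so that date-time values are interpreted correctly and match the time zone used by the application.
alertLog.getCreateTime()/1000: converts createTime in alertLog from a millisecond timestamp to a second-level timestamp, matching the ClickHouse DateTime column, which has second precision.
withDriverName(ClickHouseDriver.class.getName()): specifies the JDBC driver class name so that Flink uses the ClickHouse driver for the database connection.
withUrl("jdbc:clickhouse://<server-ip>:8123/test1?serverTimezone=Asia/Shanghai"): the connection URL of the ClickHouse database. Replace the server IP placeholder with the actual IP address of your ClickHouse server; test1 is the database name.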
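
The timestamp handling described above can be checked in isolation with java.time. This is a minimal sketch: the class name EpochConvert and the sample timestamp are assumptions chosen for illustration, and the code only shows the arithmetic, not what the ClickHouse driver does internally.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hypothetical helper illustrating the millisecond-to-second conversion
// and how a time zone changes the wall-clock reading of the same instant.
final class EpochConvert {
    private EpochConvert() {}

    // Same arithmetic as alertLog.getCreateTime() / 1000 in the sink.
    static long toEpochSeconds(long epochMillis) {
        return epochMillis / 1000;
    }

    // Wall-clock time of an epoch-second value in the given zone.
    static LocalDateTime toLocal(long epochSeconds, String zoneId) {
        return LocalDateTime.ofInstant(Instant.ofEpochSecond(epochSeconds), ZoneId.of(zoneId));
    }

    public static void main(String[] args) {
        long seconds = toEpochSeconds(1692327844123L);
        System.out.println(seconds);                           // 1692327844
        System.out.println(toLocal(seconds, "Asia/Shanghai")); // 2023-08-18T11:04:04
        System.out.println(toLocal(seconds, "UTC"));           // 2023-08-18T03:04:04
    }
}
```

The same instant reads eight hours apart in the two zones, which is why the time zone of the connection and of the application should agree.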

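Passing a custom `JdbcExecutionOptions` object

The JdbcExecutionOptions.Builder class provides methods for setting execution options. They can be chained to combine several options.

withBatchSize(int size): sets the batch size for batched writes. The default is 5000.
withBatchIntervalMs(long intervalMs): sets the batch interval, that is, how many milliseconds may pass before buffered data is written to the database. The default is 0, which means no timed flush.
withMaxRetries(int maxRetries): sets the maximum number of retries. The default is 3.
build(): builds a JdbcExecutionOptions object with the options set so far.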

        streamSource.addSink(JdbcSink.sink(
                "insert into alert_log values(?,?,?,?,?)",
                (preparedStatement, alertLog) -> {
                    preparedStatement.setObject(1, alertLog.getId());
                    preparedStatement.setObject(2, alertLog.getSrcIp());
                    preparedStatement.setObject(3, alertLog.getDstIp());
                    preparedStatement.setObject(4, alertLog.getSrcPort());
                    preparedStatement.setObject(5, alertLog.getCreateTime() / 1000);
                },
                new JdbcExecutionOptions.Builder()
                        .withBatchIntervalMs(10000)
                        .build(),
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withDriverName(ClickHouseDriver.class.getName())
                        .withUrl("jdbc:clickhouse://<server-ip>:8123/test1?serverTimezone=Asia/Shanghai")
                        .withUsername("root")
                        .withPassword("<clickhouse-password>")
                        .build()
        ));

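.withBatchIntervalMs(10000) sets the batch interval to 10000 milliseconds, so buffered rows are written to the database at least every 10 seconds. Because the batch size keeps its default of 5000, a batch is also written as soon as 5000 rows have accumulated.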
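
How batch size and batch interval combine can be sketched as a simple decision rule. This is a simplified illustration, not Flink's actual implementation; FlushPolicy and shouldFlush are hypothetical names. The rule: flush when the buffer holds at least batchSize rows, or when a positive interval has elapsed and something is buffered.

```java
// Hypothetical sketch of a size-or-interval flush rule; not Flink's own code.
final class FlushPolicy {
    private final int batchSize;
    private final long intervalMs; // 0 disables the timed flush

    FlushPolicy(int batchSize, long intervalMs) {
        if (batchSize <= 0 || intervalMs < 0) {
            throw new IllegalArgumentException("batchSize must be > 0 and intervalMs >= 0");
        }
        this.batchSize = batchSize;
        this.intervalMs = intervalMs;
    }

    // Decide whether the buffered rows should be written now.
    boolean shouldFlush(int bufferedRows, long msSinceLastFlush) {
        if (bufferedRows == 0) {
            return false; // nothing to write
        }
        if (bufferedRows >= batchSize) {
            return true;  // size threshold reached
        }
        return intervalMs > 0 && msSinceLastFlush >= intervalMs; // timed flush
    }
}
```

With new FlushPolicy(5000, 10000), ten buffered rows are held back at 9999 ms and written at 10000 ms, while 5000 buffered rows are written immediately. With an interval of 0, only the size threshold triggers a write.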

VII. Reading Kafka data with Flink and storing it in ClickHouse

Key API: environment.addSource(new FlinkKafkaConsumer<String>(...))

Key API: streamOperator.addSink(JdbcSink.sink(...))

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import ru.yandex.clickhouse.ClickHouseDriver;

import java.util.Properties;

/*
 * Receive data from Kafka and send it to ClickHouse
 */
public class Kafka2CkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties properties = new Properties();
        properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "<server-ip>:9092,<server-ip>:9093");
        properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "kafka2ckjob");

        DataStreamSource<String> streamSource = environment.addSource(new FlinkKafkaConsumer<String>("alert_log", new SimpleStringSchema(), properties));

        ObjectMapper objectMapper = new ObjectMapper();

        // Deserialize each JSON string into an AlertLog
        SingleOutputStreamOperator<AlertLog> streamOperator = streamSource.map(str -> objectMapper.readValue(str, AlertLog.class));

        // Table columns: (id, src_ip, dst_ip, src_port, dst_port, protocol, device_num, src_mac, dst_mac, create_time)
        streamOperator.addSink(JdbcSink.sink(
                "insert into alert_log values(?,?,?,?,?, ?,?,?,?,?)",
                (preparedStatement, alertLog) -> {
                    preparedStatement.setObject(1, alertLog.getId());
                    preparedStatement.setObject(2, alertLog.getSrcIp());
                    preparedStatement.setObject(3, alertLog.getDstIp());
                    preparedStatement.setObject(4, alertLog.getSrcPort());
                    preparedStatement.setObject(5, alertLog.getDstPort());
                    preparedStatement.setObject(6, alertLog.getProtocol());
                    preparedStatement.setObject(7, alertLog.getDeviceNum());
                    preparedStatement.setObject(8, alertLog.getSrcMac());
                    preparedStatement.setObject(9, alertLog.getDstMac());
                    // If createTime is in milliseconds, divide by 1000 first, as in the ClickHouse example above
                    preparedStatement.setObject(10, alertLog.getCreateTime());
                },
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withDriverName(ClickHouseDriver.class.getName())
                        .withUrl("jdbc:clickhouse://<server-ip>:8123/test1?serverTimezone=Asia/Shanghai")
                        .withUsername("root")
                        .withPassword("<clickhouse-password>")
                        .build()
        ));

        environment.execute();
    }
}
