Reading and Writing Hudi with Flink DataStream

1. POM Dependencies

The POM used for the test cases is shown below; trim the dependencies as needed.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.test</groupId>
    <artifactId>Examples</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <encoding>UTF-8</encoding>
        <scala.version>2.11.8</scala.version>
        <scala.binary.version>2.11</scala.binary.version>
        <hadoop.version>2.6.0</hadoop.version>
        <flink.version>1.14.5</flink.version>
        <kafka.version>2.0.0</kafka.version>
        <hbase.version>1.2.0</hbase.version>
        <hudi.version>0.12.0</hudi.version>
    </properties>
    <dependencies>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-statebackend-rocksdb_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-runtime-web_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-api-java-bridge_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hudi</groupId>
            <artifactId>hudi-flink1.14-bundle</artifactId>
            <version>${hudi.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <classifier>core</classifier>
            <version>2.3.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>slf4j-log4j12</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>log4j</artifactId>
                    <groupId>log4j</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>slf4j-log4j12</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>log4j</artifactId>
                    <groupId>log4j</groupId>
                </exclusion>
            </exclusions>
        </dependency>

        <!-- Logging dependencies below; optional -->
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-slf4j-impl</artifactId>
            <scope>provided</scope>
            <version>2.17.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-api</artifactId>
            <scope>provided</scope>
            <version>2.17.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <scope>provided</scope>
            <version>2.17.1</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/com.alibaba.fastjson2/fastjson2 -->
        <dependency>
            <groupId>com.alibaba.fastjson2</groupId>
            <artifactId>fastjson2</artifactId>
            <version>2.0.16</version>
        </dependency>

        <!-- RestTemplate starter -->
        <!-- https://mvnrepository.com/artifact/org.springframework.boot/spring-boot-starter-data-rest -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-rest</artifactId>
            <version>2.7.0</version>
            <exclusions>
                <exclusion>
                    <groupId>org.springframework.boot</groupId>
                    <artifactId>spring-boot-starter-logging</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <!-- Configuration management -->
        <dependency>
            <groupId>com.typesafe</groupId>
            <artifactId>config</artifactId>
            <version>1.2.1</version>
        </dependency>
    </dependencies>

    <build>
        <finalName>${project.artifactId}-${project.version}</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.3</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>

            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.test.main.Examples</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>

        <resources>
            <resource>
                <directory>src/main/resources</directory>
                <excludes>
                    <exclude>environment/dev/*</exclude>
                    <exclude>environment/test/*</exclude>
                    <exclude>environment/smoke/*</exclude>
                    <exclude>environment/pre/*</exclude>
                    <exclude>environment/online/*</exclude>
                    <exclude>application.properties</exclude>
                </excludes>
            </resource>
            <resource>
                <directory>src/main/resources/environment/${environment}</directory>
                <targetPath>.</targetPath>
            </resource>
        </resources>
    </build>

    <profiles>
        <profile>
            <!-- Development environment -->
            <id>dev</id>
            <properties>
                <environment>dev</environment>
            </properties>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
        </profile>
        <profile>
            <!-- Test environment -->
            <id>test</id>
            <properties>
                <environment>test</environment>
            </properties>
        </profile>
        <profile>
            <!-- Smoke-test environment -->
            <id>smoke</id>
            <properties>
                <environment>smoke</environment>
            </properties>
        </profile>
        <profile>
            <!-- Production environment -->
            <id>online</id>
            <properties>
                <environment>online</environment>
            </properties>
        </profile>
    </profiles>
</project>

Official Hudi documentation:

Flink Guide | Apache Hudi

2. Reading and Writing Hudi with the DataStream API

2.1 Writing to Hudi

package com.test.hudi;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.runtime.state.StateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;
import org.apache.hudi.common.model.HoodieTableType;
import org.apache.hudi.configuration.FlinkOptions;
import org.apache.hudi.util.HoodiePipeline;

import java.util.HashMap;
import java.util.Map;

public class FlinkDataStreamWrite2HudiTest {
    public static void main(String[] args) throws Exception {
        // 1. Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Checkpointing must be enabled: by default, data appears under the Hudi path only after 5 checkpoints; otherwise only the .hoodie directory is created
        String checkPointPath = "hdfs://hw-cdh-test02:8020/flinkinfo/meta/savepoints/FlinkDataStreamWrite2HudiTest";
        StateBackend backend = new EmbeddedRocksDBStateBackend(true);
        env.setStateBackend(backend);
        CheckpointConfig conf = env.getCheckpointConfig();
        // Retain checkpoints when the job is cancelled or fails
        conf.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        conf.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        conf.setCheckpointInterval(1000);// milliseconds
        conf.setCheckpointTimeout(10 * 60 * 1000);// milliseconds
        conf.setMinPauseBetweenCheckpoints(2 * 1000);// minimum pause between two consecutive checkpoints, in milliseconds
        conf.setCheckpointStorage(checkPointPath);

        // 3. Prepare the data
        DataStreamSource<Student> studentDS = env.fromElements(
                new Student(101L, "Johnson", 17L, "swimming"),
                new Student(102L, "Lin", 15L, "shopping"),
                new Student(103L, "Tom", 5L, "play"));

        // 4. Create the Hudi pipeline
        // 4.1 Hudi table name and path
        String studentHudiTable = "ods_student_table";
        String studentHudiTablePath = "hdfs://hw-cdh-test02:8020/user/hive/warehouse/lake/" + studentHudiTable;
        Map<String, String> studentOptions = new HashMap<>();
        studentOptions.put(FlinkOptions.PATH.key(), studentHudiTablePath);
        studentOptions.put(FlinkOptions.TABLE_TYPE.key(), HoodieTableType.MERGE_ON_READ.name());

        HoodiePipeline.Builder studentBuilder = HoodiePipeline.builder(studentHudiTable)
                .column("id BIGINT")
                .column("name STRING")
                .column("age BIGINT")
                .column("hobby STRING")
                .pk("id")
//                .pk("id,age")// 可以设置联合主键,用逗号分隔
                .options(studentOptions);

        // 5. Convert to a RowData stream
        DataStream<RowData> studentRowDataDS = studentDS.map(new MapFunction<Student, RowData>() {
            @Override
            public RowData map(Student value) throws Exception {
                try {
                    Long id = value.id;
                    String name = value.name;
                    Long age = value.age;
                    String hobby = value.hobby;

                    GenericRowData row = new GenericRowData(4);
                    row.setField(0, id);
                    row.setField(1, StringData.fromString(name));
                    row.setField(2, age);
                    row.setField(3, StringData.fromString(hobby));

                    return row;
                } catch (Exception e) {
                    e.printStackTrace();
                    return null;
                }
            }
        });

        // The second argument indicates whether the input stream is bounded
        studentBuilder.sink(studentRowDataDS, false);

        env.execute("FlinkDataStreamWrite2HudiTest");
    }

    public static class Student{
        public Long id;
        public String name;
        public Long age;
        public String hobby;

        public Student() {
        }

        public Student(Long id, String name, Long age, String hobby) {
            this.id = id;
            this.name = name;
            this.age = age;
            this.hobby = hobby;
        }

        public Long getId() {
            return id;
        }

        public void setId(Long id) {
            this.id = id;
        }

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public Long getAge() {
            return age;
        }

        public void setAge(Long age) {
            this.age = age;
        }

        public String getHobby() {
            return hobby;
        }

        public void setHobby(String hobby) {
            this.hobby = hobby;
        }

        @Override
        public String toString() {
            return "Student{" +
                    "id=" + id +
                    ", name='" + name + '\'' +
                    ", age=" + age +
                    ", hobby='" + hobby + '\'' +
                    '}';
        }
    }
}

In this example, three records created with env.fromElements are written to Hudi; querying the table afterwards confirms that all three records were written successfully.

In real development you will typically switch the source, e.g. read from Kafka and write to Hudi: replace the source above and implement the conversion to RowData (see the sketch below). Be sure to enable checkpointing, otherwise only a .hoodie directory is created. I was bitten by this and spent an afternoon debugging a job that never wrote any data and left only a .hoodie directory, until I found that checkpointing had to be configured. In this particular example the job processes three hard-coded records and then finishes, so the data lands in the Hudi table even without checkpointing; but in a genuine streaming job, e.g. reading from Kafka and writing to Hudi, data will never be committed to the Hudi table unless checkpointing is enabled.
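A minimal sketch of such a source swap follows, assuming a hypothetical Kafka topic student_topic with JSON-encoded Student messages on broker hw-cdh-test02:9092 (topic, group id, and port are illustrative). It reuses the env and studentBuilder objects built above and parses JSON with fastjson2, which is already in the POM:

import com.alibaba.fastjson2.JSON;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

        // Hypothetical broker/topic/group names -- replace with your own
        KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
                .setBootstrapServers("hw-cdh-test02:9092")
                .setTopics("student_topic")
                .setGroupId("hudi-writer")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Parse each JSON message into a Student, then convert to RowData as in step 5 above
        DataStream<RowData> kafkaRowDataDS = env
                .fromSource(kafkaSource, WatermarkStrategy.noWatermarks(), "kafka-student-source")
                .map(new MapFunction<String, RowData>() {
                    @Override
                    public RowData map(String line) throws Exception {
                        Student s = JSON.parseObject(line, Student.class);
                        GenericRowData row = new GenericRowData(4);
                        row.setField(0, s.id);
                        row.setField(1, StringData.fromString(s.name));
                        row.setField(2, s.age);
                        row.setField(3, StringData.fromString(s.hobby));
                        return row;
                    }
                });

        // Same Hudi sink as before; with checkpointing enabled, commits happen on checkpoints
        studentBuilder.sink(kafkaRowDataDS, false);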

2.2 Reading from Hudi

package com.test.hudi;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.hudi.common.model.HoodieTableType;
import org.apache.hudi.configuration.FlinkOptions;
import org.apache.hudi.util.HoodiePipeline;

import java.util.HashMap;
import java.util.Map;

public class FlinkDataStreamReadFromHudiTest {
    public static void main(String[] args) throws Exception {
        // 1. Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Create the Hudi source
        String studentHudiTable = "ods_student_table";
        String studentHudiTablePath = "hdfs://hw-cdh-test02:8020/user/hive/warehouse/lake/" + studentHudiTable;
        Map<String, String> studentOptions = new HashMap<>();
        studentOptions.put(FlinkOptions.PATH.key(), studentHudiTablePath);
        studentOptions.put(FlinkOptions.TABLE_TYPE.key(), HoodieTableType.MERGE_ON_READ.name());
        studentOptions.put(FlinkOptions.READ_AS_STREAMING.key(), "true");// this option enables streaming read
        studentOptions.put(FlinkOptions.READ_START_COMMIT.key(), "16811748000000");// specifies the start commit instant time (yyyyMMddHHmmss format)
        studentOptions.put(FlinkOptions.READ_STREAMING_CHECK_INTERVAL.key(), "4");// check interval for new commits, in seconds
        studentOptions.put(FlinkOptions.CHANGELOG_ENABLED.key(), "true");// keep changelog rows (+I/-U/+U/-D) instead of only merged results
        HoodiePipeline.Builder studentBuilder = HoodiePipeline.builder(studentHudiTable)
                .column("id BIGINT")
                .column("name STRING")
                .column("age BIGINT")
                .column("hobby STRING")
                .pk("id")
                .options(studentOptions);
        DataStream<RowData> studentRowDataDS = studentBuilder.source(env);

        // 3. Transform and print the data
        DataStream<Student> studentDS = studentRowDataDS.map(new MapFunction<RowData, Student>() {
            @Override
            public Student map(RowData value) throws Exception {
                try {
                    String rowKind = value.getRowKind().name();
                    Long id = value.getLong(0);
                    String name = value.getString(1).toString();
                    Long age = value.getLong(2);
                    String hobby = value.getString(3).toString();

                    Student student = new Student(id, name, age, hobby, rowKind);
                    return student;
                } catch (Exception e) {
                    e.printStackTrace();
                    return null;
                }
            }
        });
        studentDS.print();

        env.execute("FlinkDataStreamReadFromHudiTest");
    }

    public static class Student{
        public Long id;
        public String name;
        public Long age;
        public String hobby;
        public String rowKind;

        public Student() {
        }

        public Student(Long id, String name, Long age, String hobby, String rowKind) {
            this.id = id;
            this.name = name;
            this.age = age;
            this.hobby = hobby;
            this.rowKind = rowKind;
        }

        public Long getId() {
            return id;
        }

        public void setId(Long id) {
            this.id = id;
        }

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public Long getAge() {
            return age;
        }

        public void setAge(Long age) {
            this.age = age;
        }

        public String getHobby() {
            return hobby;
        }

        public void setHobby(String hobby) {
            this.hobby = hobby;
        }

        public String getRowKind() {
            return rowKind;
        }

        public void setRowKind(String rowKind) {
            this.rowKind = rowKind;
        }

        @Override
        public String toString() {
            return "Student{" +
                    "id=" + id +
                    ", name='" + name + '\'' +
                    ", age=" + age +
                    ", hobby='" + hobby + '\'' +
                    ", rowKind='" + rowKind + '\'' +
                    '}';
        }
    }
}

Output:

Here rowKind describes the row's change type: INSERT, UPDATE_BEFORE, UPDATE_AFTER, and DELETE, corresponding to the op codes +I, -U, +U, and -D, i.e. insert, pre-update image, post-update image, and delete.
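As an illustration (not part of the original example), the row kind can also be inspected directly on the RowData stream, e.g. to keep only inserts and post-update images:

import org.apache.flink.types.RowKind;

        // Drop DELETE and UPDATE_BEFORE rows, keeping the current image of each record
        DataStream<RowData> upserts = studentRowDataDS.filter(row ->
                row.getRowKind() == RowKind.INSERT || row.getRowKind() == RowKind.UPDATE_AFTER);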

3. Reading and Writing Hudi with the Table API

3.1 Writing to Hudi

3.1.1 Data from a DataStream

package com.test.hudi;

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.runtime.state.StateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class FlinkDataStreamSqlWrite2HudiTest {
    public static void main(String[] args) throws Exception {
        // 1. Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tabEnv = StreamTableEnvironment.create(env);

        // 2. Checkpointing must be enabled: by default, data appears under the Hudi path only after 5 checkpoints; otherwise only the .hoodie directory is created
        String checkPointPath = "hdfs://hw-cdh-test02:8020/flinkinfo/meta/savepoints/FlinkDataStreamWrite2HudiTest";
        StateBackend backend = new EmbeddedRocksDBStateBackend(true);
        env.setStateBackend(backend);
        CheckpointConfig conf = env.getCheckpointConfig();
        // Retain checkpoints when the job is cancelled or fails
        conf.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        conf.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        conf.setCheckpointInterval(1000);// milliseconds
        conf.setCheckpointTimeout(10 * 60 * 1000);// milliseconds
        conf.setMinPauseBetweenCheckpoints(2 * 1000);// minimum pause between two consecutive checkpoints, in milliseconds
        conf.setCheckpointStorage(checkPointPath);

        // 3. Prepare the data; in a real environment this can be replaced with a Kafka source
        DataStreamSource<Student> studentDS = env.fromElements(
                new Student(201L, "zhangsan", 117L, "eat"),
                new Student(202L, "lisi", 115L, "drink"),
                new Student(203L, "wangwu", 105L, "sleep"));
        // Since no execution operator is attached to the DataStream afterwards, this may throw:
        // Exception in thread "main" java.lang.IllegalStateException: No operators defined in streaming topology. Cannot execute.
        // This does not prevent the data from being written to Hudi.
        // Alternatively, attach a DataStream operator such as print:
//        studentDS.print("DataStream: ");

        // 4. Create a table from the DataStream
        // 4.1 First argument: table name; second: the DataStream; optional third: a field list mapping stream fields to column names, e.g. "userId as user_id, name, age, hobby"
        //     (registerDataStream is deprecated in newer Flink versions; createTemporaryView is the current equivalent)
        tabEnv.registerDataStream("tmp_student_table", studentDS, "id, name, age, hobby");

        // 5. Define the Hudi sink table and write the data into it
        tabEnv.executeSql("" +
                "CREATE TABLE out_ods_student_table(\n" +
                "    id BIGINT COMMENT '学号',\n" +
                "    name STRING\t COMMENT '姓名',\n" +
                "    age BIGINT  COMMENT '年龄',\n" +
                "    hobby STRING    COMMENT '爱好',\n" +
                "    PRIMARY KEY (id) NOT ENFORCED\n" +
                ")\n" +
                "WITH(\n" +
                "    'connector' = 'hudi',\n" +
                "    'path' = 'hdfs://hw-cdh-test02:8020/user/hive/warehouse/lake/ods_student_table',\n" +
                "    'table.type' = 'MERGE_ON_READ',\n" +
                "    'compaction.async.enabled' = 'true',\n" +
                "    'compaction.tasks' = '1',\n" +
                "    'compaction.trigger.strategy' = 'num_commits',\n" +
                "    'compaction.delta_commits' = '3',\n" +
                "    'hoodie.cleaner.policy'='KEEP_LATEST_COMMITS',\n" +
                "    'hoodie.cleaner.commits.retained'='30',\n" +
                "    'hoodie.keep.min.commits'='35' ,\n" +
                "    'hoodie.keep.max.commits'='40'\n" +
                ")");
        tabEnv.executeSql("insert into out_ods_student_table select id,name,age,hobby from tmp_student_table");


        env.execute("FlinkDataStreamSqlWrite2HudiTest");
    }

    public static class Student{
        public Long id;
        public String name;
        public Long age;
        public String hobby;

        public Student() {
        }

        public Student(Long id, String name, Long age, String hobby) {
            this.id = id;
            this.name = name;
            this.age = age;
            this.hobby = hobby;
        }

        public Long getId() {
            return id;
        }

        public void setId(Long id) {
            this.id = id;
        }

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public Long getAge() {
            return age;
        }

        public void setAge(Long age) {
            this.age = age;
        }

        public String getHobby() {
            return hobby;
        }

        public void setHobby(String hobby) {
            this.hobby = hobby;
        }

        @Override
        public String toString() {
            return "Student{" +
                    "id=" + id +
                    ", name='" + name + '\'' +
                    ", age=" + age +
                    ", hobby='" + hobby + '\'' +
                    '}';
        }
    }
}

Querying the Hudi table confirms that the three records were written successfully.

3.1.2 Data from a Table

package com.test.hudi;

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.runtime.state.StateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class FlinkValuesSqlWrite2HudiTest {
    public static void main(String[] args) throws Exception {
        // 1. Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tabEnv = StreamTableEnvironment.create(env);

        // 2. Checkpointing must be enabled: by default, data appears under the Hudi path only after 5 checkpoints; otherwise only the .hoodie directory is created
        String checkPointPath = "hdfs://hw-cdh-test02:8020/flinkinfo/meta/savepoints/FlinkDataStreamWrite2HudiTest";
        StateBackend backend = new EmbeddedRocksDBStateBackend(true);
        env.setStateBackend(backend);
        CheckpointConfig conf = env.getCheckpointConfig();
        // Retain checkpoints when the job is cancelled or fails
        conf.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        conf.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        conf.setCheckpointInterval(1000);// milliseconds
        conf.setCheckpointTimeout(10 * 60 * 1000);// milliseconds
        conf.setMinPauseBetweenCheckpoints(2 * 1000);// minimum pause between two consecutive checkpoints, in milliseconds
        conf.setCheckpointStorage(checkPointPath);

        // 3. Define the Hudi sink table and write the data into it
        tabEnv.executeSql("" +
                "CREATE TABLE out_ods_student_table(\n" +
                "    id BIGINT COMMENT '学号',\n" +
                "    name STRING\t COMMENT '姓名',\n" +
                "    age BIGINT  COMMENT '年龄',\n" +
                "    hobby STRING    COMMENT '爱好',\n" +
                "    PRIMARY KEY (id) NOT ENFORCED\n" +
                ")\n" +
                "WITH(\n" +
                "    'connector' = 'hudi',\n" +
                "    'path' = 'hdfs://hw-cdh-test02:8020/user/hive/warehouse/lake/ods_student_table',\n" +
                "    'table.type' = 'MERGE_ON_READ',\n" +
                "    'compaction.async.enabled' = 'true',\n" +
                "    'compaction.tasks' = '1',\n" +
                "    'compaction.trigger.strategy' = 'num_commits',\n" +
                "    'compaction.delta_commits' = '3',\n" +
                "    'hoodie.cleaner.policy'='KEEP_LATEST_COMMITS',\n" +
                "    'hoodie.cleaner.commits.retained'='30',\n" +
                "    'hoodie.keep.min.commits'='35' ,\n" +
                "    'hoodie.keep.max.commits'='40'\n" +
                ")");
        tabEnv.executeSql("" +
                "insert into out_ods_student_table values\n" +
                "    (301, 'xiaoming', 201, 'read'),\n" +
                "    (302, 'xiaohong', 202, 'write'),\n" +
                "    (303, 'xiaogang', 203, 'sing')");

        env.execute("FlinkValuesSqlWrite2HudiTest");
    }
}

Querying the Hudi table confirms that the three records were written successfully.

3.2 Reading from Hudi

package com.test.hudi;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class FlinkSqlReadFromHudiTest {
    public static void main(String[] args) throws Exception {
        // 1. Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tabEnv = StreamTableEnvironment.create(env);

        // 2. Define the Hudi table and read data from it
        tabEnv.executeSql("" +
                "CREATE TABLE out_ods_student_table(\n" +
                "    id BIGINT COMMENT '学号',\n" +
                "    name STRING\t COMMENT '姓名',\n" +
                "    age BIGINT  COMMENT '年龄',\n" +
                "    hobby STRING    COMMENT '爱好',\n" +
                "    PRIMARY KEY (id) NOT ENFORCED\n" +
                ")\n" +
                "WITH(\n" +
                "    'connector' = 'hudi',\n" +
                "    'path' = 'hdfs://hw-cdh-test02:8020/user/hive/warehouse/lake/ods_student_table',\n" +
                "    'table.type' = 'MERGE_ON_READ',\n" +
                "    'compaction.async.enabled' = 'true',\n" +
                "    'compaction.tasks' = '1',\n" +
                "    'compaction.trigger.strategy' = 'num_commits',\n" +
                "    'compaction.delta_commits' = '3',\n" +
                "    'hoodie.cleaner.policy'='KEEP_LATEST_COMMITS',\n" +
                "    'hoodie.cleaner.commits.retained'='30',\n" +
                "    'hoodie.keep.min.commits'='35' ,\n" +
                "    'hoodie.keep.max.commits'='40'\n" +
                ")");
        tabEnv.executeSql("select id,name,age,hobby from out_ods_student_table").print();

        env.execute("FlinkSqlReadFromHudiTest");
    }
}

Output:

4. Additional Notes

4.1 Composite Primary Key

When operating on Hudi through the Flink Table API, a composite primary key may be needed; it can be declared directly in the SQL. For example:

tabEnv.executeSql("" +
        "CREATE TABLE out_ods_userinfo_table_test(\n" +
        "    province_id BIGINT COMMENT '省份编号',\n" +
        "    user_id BIGINT COMMENT '用户编号',\n" +
        "    name STRING\t COMMENT '姓名',\n" +
        "    age BIGINT COMMENT '年龄',\n" +
        "    hobby STRING COMMENT '爱好',\n" +
        "    PRIMARY KEY (province_id,user_id) NOT ENFORCED\n" +
        ")\n" +
        "WITH(\n" +
        "    'connector' = 'hudi',\n" +
        "    'path' = 'hdfs://hw-cdh-test02:8020/user/hive/warehouse/lake/ods_userinfo_table_test',\n" +
        "    'table.type' = 'MERGE_ON_READ',\n" +
        "    'hoodie.datasource.write.keygenerator.class'='org.apache.hudi.keygen.ComplexKeyGenerator',\n" +
        "    'hoodie.datasource.write.recordkey.field'= 'province_id,user_id',\n" +
        "    'compaction.async.enabled' = 'true',\n" +
        "    'compaction.tasks' = '1',\n" +
        "    'compaction.trigger.strategy' = 'num_commits',\n" +
        "    'compaction.delta_commits' = '3',\n" +
        "    'hoodie.cleaner.policy'='KEEP_LATEST_COMMITS',\n" +
        "    'hoodie.cleaner.commits.retained'='30',\n" +
        "    'hoodie.keep.min.commits'='35' ,\n" +
        "    'hoodie.keep.max.commits'='40'\n" +
        ")");

4.2 Reading Data After a Specified Commit Time

According to the official documentation, you can read only the data committed after a specified instant. For example, given the instant 20230318130057, only records whose commit time is later than 2023-03-18 13:00:57 are read; earlier data is not returned.

4.2.1 DataStream API

options.put(FlinkOptions.READ_START_COMMIT.key(), "20230318130057"); // specifies the start commit instant time (note: no extra inner quotes)

4.2.2 Table API

'read.start-commit' = '20230318130057', -- specifies the start commit instant time
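Put together, a sketch of a complete streaming-read table definition using this option (same table and path as the earlier examples; the start commit is the illustrative instant from above):

tabEnv.executeSql("" +
        "CREATE TABLE read_ods_student_table(\n" +
        "    id BIGINT,\n" +
        "    name STRING,\n" +
        "    age BIGINT,\n" +
        "    hobby STRING,\n" +
        "    PRIMARY KEY (id) NOT ENFORCED\n" +
        ")\n" +
        "WITH(\n" +
        "    'connector' = 'hudi',\n" +
        "    'path' = 'hdfs://hw-cdh-test02:8020/user/hive/warehouse/lake/ods_student_table',\n" +
        "    'table.type' = 'MERGE_ON_READ',\n" +
        "    'read.streaming.enabled' = 'true',\n" +
        "    'read.start-commit' = '20230318130057',\n" +
        "    'read.streaming.check-interval' = '4'\n" +
        ")");
tabEnv.executeSql("select id,name,age,hobby from read_ods_student_table").print();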
