Flink DataStream Connectors: DataStream Formats

京河小蚁

Last modified: 2022-06-19 08:22:51

First published: 2022-06-19 08:21:53

Article link: https://blog.csdn.net/u010772882/article/details/125352568

DataStream Formats

Available Formats

Avro
Azure Table
CSV
Hadoop
Parquet
Text files

Avro

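Flink has built-in support for Apache Avro, which makes it easy to read and write Avro data based on an Avro schema. Flink's serialization framework can also handle classes generated from Avro schemas. To use the Avro format, add the following dependency to your project in your build tool (for example Maven or SBT).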

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-avro</artifactId>
  <version>1.15.0</version>
</dependency>

To read data from an Avro file, you must specify an AvroInputFormat.

Example:

AvroInputFormat<User> users = new AvroInputFormat<User>(in, User.class);
DataStream<User> usersDS = env.createInput(users);

Note that User is a POJO class generated from an Avro schema. Flink also lets you select keys on such POJOs by field name, given as a string. For example:

usersDS.keyBy("name");

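Note that the GenericData.Record type can be used with Flink, but it is not recommended. Each record carries the full schema, which makes it very data intensive and therefore probably slow to use.

Flink's POJO field selection also works with POJOs generated from Avro schemas. However, it only works if the field types are written correctly to the generated class. If a field is of type Object, you cannot use it as a join or grouping key. Specifying a field in Avro as {"name": "type_double_test", "type": "double"} works fine, but specifying it as a union type with only one member ({"name": "type_double_test", "type": ["double"]}) generates a field of type Object. Note that nullable types such as {"name": "type_double_test", "type": ["null", "double"]} can be specified.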

Azure Table Storage

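This example uses the HadoopInputFormat wrapper to access Azure's Table Storage through an existing Hadoop input format implementation.

Download and compile the azure-tables-hadoop project. The input format developed by that project is not yet available on Maven Central, so we have to build it ourselves. Run the following commands: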

git clone https://github.com/mooso/azure-tables-hadoop.git
cd azure-tables-hadoop
mvn clean install

Create a new Flink project using the quickstarts:

curl https://flink.apache.org/q/quickstart.sh | bash

Add the following dependencies to the <dependencies> section of your pom.xml file:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-hadoop-compatibility_2.12</artifactId>
    <version>1.15.0</version>
</dependency>
<dependency>
    <groupId>com.microsoft.hadoop</groupId>
    <artifactId>microsoft-hadoop-azure</artifactId>
    <version>0.0.5</version>
</dependency>

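flink-hadoop-compatibility is a Flink package that provides the Hadoop input format wrappers. microsoft-hadoop-azure adds the project we built earlier to our project.

The project is now ready for coding. We recommend importing it into an IDE such as IntelliJ, as a Maven project. Open the file Job.java, which is an empty skeleton for a Flink job.

Paste the following code: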

import java.util.Map;
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import com.microsoft.hadoop.azure.AzureTableConfiguration;
import com.microsoft.hadoop.azure.AzureTableInputFormat;
import com.microsoft.hadoop.azure.WritableEntity;
import com.microsoft.windowsazure.storage.table.EntityProperty;

public class AzureTableExample {

  public static void main(String[] args) throws Exception {
    // Set up the execution environment
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    env.setRuntimeMode(RuntimeExecutionMode.BATCH);
    // Create an AzureTableInputFormat, using the Hadoop input format wrapper
    HadoopInputFormat<Text, WritableEntity> hdIf = new HadoopInputFormat<Text, WritableEntity>(new AzureTableInputFormat(), Text.class, WritableEntity.class, new Job());

    // Set the account URI, something like https://apacheflink.table.core.windows.net
    hdIf.getConfiguration().set(AzureTableConfiguration.Keys.ACCOUNT_URI.getKey(), "TODO");
    // Set the secret storage key here
    hdIf.getConfiguration().set(AzureTableConfiguration.Keys.STORAGE_KEY.getKey(), "TODO");
    // Set the table name here
    hdIf.getConfiguration().set(AzureTableConfiguration.Keys.TABLE_NAME.getKey(), "TODO");

    DataStream<Tuple2<Text, WritableEntity>> input = env.createInput(hdIf);
    // A simple example of how to use the data in a map function.
    DataStream<String> fin = input.map(new MapFunction<Tuple2<Text,WritableEntity>, String>() {
      @Override
      public String map(Tuple2<Text, WritableEntity> arg0) throws Exception {
        System.err.println("--------------------------------\nKey = "+arg0.f0);
        WritableEntity we = arg0.f1;

        for(Map.Entry<String, EntityProperty> prop : we.getProperties().entrySet()) {
          System.err.println("key="+prop.getKey() + " ; value (asString)="+prop.getValue().getValueAsString());
        }

        return arg0.f0.toString();
      }
    });

    // Emit the result (this only works locally)
    fin.print();

    // Execute the program
    env.execute("Azure Example");
  }
}

The example shows how to access an Azure table and turn its data into a Flink DataStream (more specifically, a DataStream<Tuple2<Text, WritableEntity>>). You can apply all known transformations to the DataStream.

CSV

To use the CSV format, add the flink-csv dependency to your project:

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-csv</artifactId>
	<version>1.15.0</version>
</dependency>

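Flink supports reading CSV files using CsvReaderFormat. The reader uses the Jackson library and lets you pass the corresponding configuration for the CSV schema and parsing options.

CsvReaderFormat can be initialized and used like this: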

CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forPojo(SomePojo.class);
FileSource<SomePojo> source = 
        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();

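In this case, the schema for CSV parsing is derived automatically from the fields of the SomePojo class using the Jackson library.

Note: you may need to add the @JsonPropertyOrder({field1, field2, ...}) annotation to your class definition, with the field order exactly matching the column order of the CSV file.

Advanced configuration

If you need more fine-grained control over the CSV schema or the parsing options, use the lower-level forSchema static factory method of CsvReaderFormat: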

CsvReaderFormat<T> forSchema(CsvMapper mapper, 
                             CsvSchema schema, 
                             TypeInformation<T> typeInformation)

Below is an example of reading a POJO with a custom column separator:

// Has to match the exact order of columns in the CSV file
@JsonPropertyOrder({"city","lat","lng","country","iso2",
                    "adminName","capital","population"})
public static class CityPojo {
    public String city;
    public BigDecimal lat;
    public BigDecimal lng;
    public String country;
    public String iso2;
    public String adminName;
    public String capital;
    public long population;
}

CsvMapper mapper = new CsvMapper();
CsvSchema schema =
        mapper.schemaFor(CityPojo.class).withoutQuoteChar().withColumnSeparator('|');

CsvReaderFormat<CityPojo> csvFormat =
        CsvReaderFormat.forSchema(mapper, schema, TypeInformation.of(CityPojo.class));

FileSource<CityPojo> source =
        FileSource.forRecordStreamFormat(csvFormat,Path.fromLocalFile(...)).build();

The corresponding CSV file:

Berlin|52.5167|13.3833|Germany|DE|Berlin|primary|3644826
San Francisco|37.7562|-122.443|United States|US|California||3592294
Beijing|39.905|116.3914|China|CN|Beijing|primary|19433000
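
To make the role of the column order concrete, here is a minimal plain-Java sketch that splits one of the lines above on '|' and pairs each value with a column name by position. It does not use Flink or Jackson and is not how CsvReaderFormat works internally. The class PipeSeparatedCityDemo and its parseLine helper are hypothetical names introduced only for illustration.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only: hypothetical helper, not part of Flink or Jackson.
final class PipeSeparatedCityDemo {

    // Column order, mirroring the @JsonPropertyOrder annotation on CityPojo.
    static final List<String> COLUMNS = Arrays.asList(
            "city", "lat", "lng", "country", "iso2", "adminName", "capital", "population");

    // Splits one line on '|' and pairs each value with the column name at the same position.
    static Map<String, String> parseLine(String line) {
        // The -1 limit keeps empty fields (such as a missing "capital") instead of dropping them.
        String[] values = line.split("\\|", -1);
        if (values.length != COLUMNS.size()) {
            throw new IllegalArgumentException(
                    "Expected " + COLUMNS.size() + " columns but got " + values.length);
        }
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            row.put(COLUMNS.get(i), values[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> row =
                parseLine("San Francisco|37.7562|-122.443|United States|US|California||3592294");
        // Convert the textual values to the field types used by CityPojo.
        BigDecimal lat = new BigDecimal(row.get("lat"));
        long population = Long.parseLong(row.get("population"));
        System.out.println(row.get("city") + " lat=" + lat + " population=" + population);
    }
}
```

If the declared column order did not match the file, values would silently land under the wrong names, which is why the annotation order has to follow the file exactly.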

It is also possible to read more complex data types using fine-grained Jackson settings:

public static class ComplexPojo {
    private long id;
    private int[] array;
}

CsvReaderFormat<ComplexPojo> csvFormat =
        CsvReaderFormat.forSchema(
                new CsvMapper(),
                CsvSchema.builder()
                        .addColumn(
                                new CsvSchema.Column(0, "id", CsvSchema.ColumnType.NUMBER))
                        .addColumn(
                                new CsvSchema.Column(4, "array", CsvSchema.ColumnType.ARRAY)
                                        .withArrayElementSeparator("#"))
                        .build(),
                TypeInformation.of(ComplexPojo.class));

The corresponding CSV file:

0,1#2#3
1,
2,1
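
As a rough illustration of what an array element separator means for the three lines above, the following plain-Java sketch reads the first column as the id and splits the second column on '#'. It does not use Flink or Jackson. ArrayColumnDemo and its methods are hypothetical, and treating an empty column as an empty array is an assumption of this sketch rather than documented Jackson behavior.

```java
import java.util.Arrays;

// Plain-Java illustration only: hypothetical helper, not part of Flink or Jackson.
final class ArrayColumnDemo {

    // Reads the first column of a line such as "0,1#2#3" as the numeric id.
    static long parseId(String line) {
        return Long.parseLong(line.split(",", -1)[0]);
    }

    // Reads the second column as an int[] whose elements are separated by '#'.
    // Assumption of this sketch: an empty column becomes an empty array.
    static int[] parseArray(String line) {
        String[] columns = line.split(",", -1);
        if (columns.length < 2 || columns[1].isEmpty()) {
            return new int[0];
        }
        return Arrays.stream(columns[1].split("#")).mapToInt(Integer::parseInt).toArray();
    }

    public static void main(String[] args) {
        for (String line : new String[] {"0,1#2#3", "1,", "2,1"}) {
            System.out.println(parseId(line) + " -> " + Arrays.toString(parseArray(line)));
        }
    }
}
```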

Similarly to TextLineInputFormat, CsvReaderFormat can be used in both continuous and batch modes (see the TextLineInputFormat examples).

Hadoop

To use Hadoop formats, add the flink-hadoop-compatibility dependency to your project:

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-hadoop-compatibility_2.12</artifactId>
	<version>1.15.0</version>
</dependency>

If you want to run your Flink application locally (for example from your IDE), you also need to add a hadoop-client dependency to pom.xml, such as:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.8.5</version>
    <scope>provided</scope>
</dependency>

Using Hadoop InputFormats

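To use Hadoop InputFormats with Flink, the format must first be wrapped using either readHadoopFile or createHadoopInput of the HadoopInputs utility class. The former is used for input formats derived from FileInputFormat, while the latter has to be used for general-purpose input formats. The resulting InputFormat can be used to create a data source with StreamExecutionEnvironment#createInput.

The resulting DataStream contains 2-tuples, where the first field is the key and the second field is the value retrieved from the Hadoop InputFormat.

The following example shows how to use Hadoop's TextInputFormat.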

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<Tuple2<LongWritable, Text>> input =
    env.createInput(HadoopInputs.readHadoopFile(new TextInputFormat(),
                        LongWritable.class, Text.class, textPath));

// Do something with the data.
[...]
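
For Hadoop's TextInputFormat, the key of each 2-tuple is the byte offset at which a line starts in the file and the value is the line itself. The sketch below reproduces that shape in plain Java for a small in-memory string, without using Hadoop or Flink. OffsetLinePairsDemo is a hypothetical name, and the sketch assumes UTF-8 content with '\n' line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only: hypothetical helper, not part of Hadoop or Flink.
final class OffsetLinePairsDemo {

    // Returns (byte offset of the line start, line text) pairs for '\n'-separated UTF-8 content.
    static List<Map.Entry<Long, String>> toPairs(String content) {
        List<Map.Entry<Long, String>> pairs = new ArrayList<>();
        long offset = 0L;
        for (String line : content.split("\n")) {
            pairs.add(new SimpleEntry<Long, String>(offset, line));
            // Advance by the encoded length of the line plus one byte for the '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> pair : toPairs("hello\nflink\n")) {
            System.out.println("key=" + pair.getKey() + " value=" + pair.getValue());
        }
    }
}
```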

Using Hadoop OutputFormats

Flink provides a compatibility wrapper for Hadoop OutputFormats. Any class that implements org.apache.hadoop.mapred.OutputFormat or extends org.apache.hadoop.mapreduce.OutputFormat is supported. The OutputFormat wrapper expects its input to be a DataStream of 2-tuples containing a key and a value, which are then processed by the Hadoop OutputFormat.

The following example shows how to use Hadoop's TextOutputFormat.

// Obtain the result we want to emit
DataStream<Tuple2<Text, IntWritable>> hadoopResult = [...];

// Create the Hadoop job that carries the output configuration.
Job job = Job.getInstance();

// Set up the Hadoop TextOutputFormat.
HadoopOutputFormat<Text, IntWritable> hadoopOF =
  // Create the Flink wrapper.
  new HadoopOutputFormat<Text, IntWritable>(
    // Set the Hadoop OutputFormat and specify the job.
    new TextOutputFormat<Text, IntWritable>(), job
  );
hadoopOF.getConfiguration().set("mapreduce.output.textoutputformat.separator", " ");
TextOutputFormat.setOutputPath(job, new Path(outputPath));

// Emit data using the Hadoop TextOutputFormat.
hadoopResult.writeUsingOutputFormat(hadoopOF);

Parquet

Flink supports reading Parquet files and producing Flink RowData and Avro records. To use the Parquet format, add the flink-parquet dependency to your project:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-parquet</artifactId>
    <version>1.15.0</version>
</dependency>

To read Avro records, you also need to add the parquet-avro dependency:

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.12.2</version>
    <optional>true</optional>
    <exclusions>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </exclusion>
        <exclusion>
            <groupId>it.unimi.dsi</groupId>
            <artifactId>fastutil</artifactId>
        </exclusion>
    </exclusions>
</dependency>

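This format is compatible with the new Source and can be used in both batch and streaming modes. You can therefore use it for two kinds of data:

Bounded data: lists all files and reads them all.
Unbounded data: monitors a directory for new files as they appear.

When you start a File Source, it is configured for bounded data by default. To configure the File Source for unbounded data, you must additionally call AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration).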
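
The difference between the two modes can be sketched in plain Java: a bounded read corresponds to a single directory scan, while a continuous read repeats the scan at the configured interval and only reports files it has not seen before. The code below is a conceptual sketch and not Flink's implementation. DirectoryMonitorDemo, discoverNewFiles and runDemo are hypothetical names, and the scans are triggered by hand instead of by a timer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only: hypothetical class, not part of Flink.
final class DirectoryMonitorDemo {

    private final Path directory;
    private final Set<Path> alreadySeen = new HashSet<>();

    DirectoryMonitorDemo(Path directory) {
        this.directory = directory;
    }

    // One scan of the directory: returns only the regular files not reported by an earlier scan.
    List<Path> discoverNewFiles() {
        List<Path> fresh = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(directory)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry) && alreadySeen.add(entry)) {
                    fresh.add(entry);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fresh;
    }

    // Creates a temporary directory, adds files between scans, and returns the number of
    // newly discovered files for each of three scans.
    static int[] runDemo() {
        try {
            Path dir = Files.createTempDirectory("monitor-demo");
            DirectoryMonitorDemo monitor = new DirectoryMonitorDemo(dir);
            Files.createFile(dir.resolve("part-0"));
            int first = monitor.discoverNewFiles().size();  // a bounded read would stop here
            Files.createFile(dir.resolve("part-1"));
            int second = monitor.discoverNewFiles().size(); // a later scan picks up the new file
            int third = monitor.discoverNewFiles().size();  // nothing was added since the last scan
            return new int[] {first, second, third};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(runDemo()));
    }
}
```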

Vectorized reader

// Parquet rows are decoded in batches
FileSource.forBulkFileFormat(BulkFormat,Path...)
// Monitor the Paths to read data as unbounded data
FileSource.forBulkFileFormat(BulkFormat,Path...)
.monitorContinuously(Duration.ofMillis(5L))
.build();

Avro Parquet reader

// Parquet rows are decoded in batches
FileSource.forRecordStreamFormat(StreamFormat,Path...)
// Monitor the Paths to read data as unbounded data
FileSource.forRecordStreamFormat(StreamFormat,Path...)
        .monitorContinuously(Duration.ofMillis(5L))
        .build();

The following examples are all configured for bounded data. To configure the File Source for unbounded data, you must additionally call AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration).

Flink RowData

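In this example, you create a DataStream containing Parquet records as Flink RowData. The schema is projected so that only the specified fields ("f7", "f4" and "f99") are read. Flink reads records in batches of 500. The first boolean parameter specifies whether timestamp columns are interpreted as UTC. The second boolean parameter specifies whether the projected Parquet field names are matched case-sensitively. No watermark strategy is needed, because the records do not contain event timestamps.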

final LogicalType[] fieldTypes =
        new LogicalType[] {
        new DoubleType(), new IntType(), new VarCharType()
        };

final ParquetColumnarRowInputFormat<FileSourceSplit> format =
        new ParquetColumnarRowInputFormat<>(
        new Configuration(),
        RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
        500,
        false,
        true);
final FileSource<RowData> source =
        FileSource.forBulkFileFormat(format,  /* Flink Path */)
        .build();
final DataStream<RowData> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");

Avro Records

Flink supports producing three types of Avro records from Parquet files:

Generic record
Specific record
Reflect record

Generic record

Avro schemas are defined using JSON. You can find more information about Avro schemas and types in the Avro specification. This example uses an Avro schema similar to the one described in the official Avro tutorial:

{"namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favoriteNumber",  "type": ["int", "null"]},
    {"name": "favoriteColor", "type": ["string", "null"]}
  ]
}

This schema defines a user record with three fields: name, favoriteNumber and favoriteColor. See the record specification for more details on how to define an Avro schema.

In this example, you create a DataStream containing Parquet records as Avro Generic records. Flink parses the Avro schema from a JSON string. There are many other ways to parse a schema, for example from a java.io.File or a java.io.InputStream. See Avro Schema for details. After that, you create an AvroParquetRecordFormat for Avro Generic records via AvroParquetReaders.

// Parse the Avro schema
final Schema schema =
        new Schema.Parser()
        .parse(
        "{\"type\": \"record\", "
        + "\"name\": \"User\", "
        + "\"fields\": [\n"
        + "        {\"name\": \"name\", \"type\": \"string\" },\n"
        + "        {\"name\": \"favoriteNumber\",  \"type\": [\"int\", \"null\"] },\n"
        + "        {\"name\": \"favoriteColor\", \"type\": [\"string\", \"null\"] }\n"
        + "    ]\n"
        + "    }");

final FileSource<GenericRecord> source =
        FileSource.forRecordStreamFormat(
        AvroParquetReaders.forGenericRecord(schema), /* Flink Path */)
        .build();

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(10L);

final DataStream<GenericRecord> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");

Specific record

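Based on a previously defined schema, you can generate classes by using Avro code generation. Once the classes have been generated, there is no need to use the schema directly in your programs. You can either use avro-tools.jar to generate the code manually, or use the Avro Maven plugin to run code generation on any .avsc files in the configured source directory. See Avro Getting Started for more information. The following example uses this schema: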

[
  {"namespace": "org.apache.flink.formats.parquet.generated",
    "type": "record",
    "name": "Address",
    "fields": [
      {"name": "num", "type": "int"},
      {"name": "street", "type": "string"},
      {"name": "city", "type": "string"},
      {"name": "state", "type": "string"},
      {"name": "zip", "type": "string"}
    ]
  }
]

You can use the Avro Maven plugin to generate the Address Java class:

@org.apache.avro.specific.AvroGenerated
public class Address extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
    // generated code...
}

You create an AvroParquetRecordFormat for Avro Specific records via AvroParquetReaders, and then create a DataStream containing Parquet records as Avro Specific records.

final FileSource<Address> source =
        FileSource.forRecordStreamFormat(
                AvroParquetReaders.forSpecificRecord(Address.class), /* Flink Path */)
        .build();

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(10L);

final DataStream<Address> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");

Reflect record

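Beyond Avro Generic and Specific records, which require a predefined Avro schema, Flink also supports creating a DataStream from Parquet files based on existing Java POJO classes. In this case, Avro uses Java reflection to generate schemas and protocols for these POJO classes. See the Avro reflect documentation for details on how Java types are mapped to Avro schemas.

This example uses a simple Java POJO class, Datum: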

public class Datum implements Serializable {

    public String a;
    public int b;

    public Datum() {}

    public Datum(String a, int b) {
        this.a = a;
        this.b = b;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }

        Datum datum = (Datum) o;
        return b == datum.b && (a != null ? a.equals(datum.a) : datum.a == null);
    }

    @Override
    public int hashCode() {
        int result = a != null ? a.hashCode() : 0;
        result = 31 * result + b;
        return result;
    }
}
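
To give a feel for what reflection can recover from such a POJO, the following plain-Java sketch lists the field names and Java types of a class shaped like Datum. It only uses java.lang.reflect and is not Avro's reflect implementation, whose type mapping is considerably richer. ReflectFieldsDemo, its nested Datum copy and the describe helper are hypothetical names used for illustration.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration only: hypothetical helper, not part of Avro or Flink.
final class ReflectFieldsDemo {

    // Minimal stand-in with the same two fields as the Datum POJO.
    static class Datum {
        public String a;
        public int b;
    }

    // Maps each instance field name to the simple name of its Java type.
    static Map<String, String> describe(Class<?> type) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Field field : type.getDeclaredFields()) {
            // Skip static and compiler-generated fields; they are not part of the record.
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue;
            }
            fields.put(field.getName(), field.getType().getSimpleName());
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(describe(Datum.class));
    }
}
```

A schema generator built on reflection would go one step further and translate each Java type into a schema type, for example String to an Avro string and int to an Avro int.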

You create an AvroParquetRecordFormat for Avro Reflect records via AvroParquetReaders, and then create a DataStream containing Parquet records as Avro Reflect records.

final FileSource<Datum> source =
        FileSource.forRecordStreamFormat(
                AvroParquetReaders.forReflectRecord(Datum.class), /* Flink Path */)
        .build();

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(10L);

final DataStream<Datum> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");

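Prerequisites for Parquet files

To support reading Avro Reflect records, the Parquet file must contain specific meta information. The Avro schema used to create the Parquet data must contain a namespace, which the program uses to identify the concrete Java class during reflection.

The following example shows the schema of the User record used earlier. This time, however, it contains a namespace pointing to the location (in this case the package) where the User class can be found for reflection.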

// avro schema with namespace
final String schema = 
                    "{\"type\": \"record\", "
                        + "\"name\": \"User\", "
                        + "\"namespace\": \"org.apache.flink.formats.parquet.avro\", "
                        + "\"fields\": [\n"
                        + "        {\"name\": \"name\", \"type\": \"string\" },\n"
                        + "        {\"name\": \"favoriteNumber\",  \"type\": [\"int\", \"null\"] },\n"
                        + "        {\"name\": \"favoriteColor\", \"type\": [\"string\", \"null\"] }\n"
                        + "    ]\n"
                        + "    }";

Parquet files created with this schema contain meta information like the following:

creator:        parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
extra:          parquet.avro.schema =
{"type":"record","name":"User","namespace":"org.apache.flink.formats.parquet.avro","fields":[{"name":"name","type":"string"},{"name":"favoriteNumber","type":["int","null"]},{"name":"favoriteColor","type":["string","null"]}]}
extra:          writer.model.name = avro

file schema:    org.apache.flink.formats.parquet.avro.User
--------------------------------------------------------------------------------
name:           REQUIRED BINARY L:STRING R:0 D:0
favoriteNumber: OPTIONAL INT32 R:0 D:1
favoriteColor:  OPTIONAL BINARY L:STRING R:0 D:1

row group 1:    RC:3 TS:143 OFFSET:4
--------------------------------------------------------------------------------
name:            BINARY UNCOMPRESSED DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: Jack, max: Tom, num_nulls: 0]
favoriteNumber:  INT32 UNCOMPRESSED DO:0 FPO:51 SZ:41/41/1.00 VC:3 ENC:RLE,PLAIN,BIT_PACKED ST:[min: 1, max: 3, num_nulls: 0]
favoriteColor:   BINARY UNCOMPRESSED DO:0 FPO:92 SZ:55/55/1.00 VC:3 ENC:RLE,PLAIN,BIT_PACKED ST:[min: green, max: yellow, num_nulls: 0]
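
Why the namespace matters can be sketched in plain Java, under the assumption that the class is located by its fully qualified name, that is the namespace followed by the record name. This is a simplified illustration and not Avro's actual resolution logic. NamespaceLookupDemo, fullName and resolvable are hypothetical names.

```java
// Plain-Java illustration only: hypothetical helper, not part of Avro or Flink.
final class NamespaceLookupDemo {

    // Joins namespace and record name into a fully qualified class name.
    static String fullName(String namespace, String name) {
        return (namespace == null || namespace.isEmpty()) ? name : namespace + "." + name;
    }

    // Returns true if a class with that fully qualified name can be loaded.
    static boolean resolvable(String namespace, String name) {
        try {
            Class.forName(fullName(namespace, name));
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The schema above yields this class name for the User record.
        System.out.println(fullName("org.apache.flink.formats.parquet.avro", "User"));
        // A JDK class stands in for User here, since User is not on the classpath of this sketch.
        System.out.println(resolvable("java.util", "ArrayList"));
        System.out.println(resolvable("no.such.pkg", "ArrayList"));
    }
}
```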

With the User class defined in the package org.apache.flink.formats.parquet.avro:

public class User {
    private String name;
    private Integer favoriteNumber;
    private String favoriteColor;

    public User() {}

    public User(String name, Integer favoriteNumber, String favoriteColor) {
        this.name = name;
        this.favoriteNumber = favoriteNumber;
        this.favoriteColor = favoriteColor;
    }

    public String getName() {
        return name;
    }

    public Integer getFavoriteNumber() {
        return favoriteNumber;
    }

    public String getFavoriteColor() {
        return favoriteColor;
    }
}

you can use the following program to read Avro Reflect records of type User:

final FileSource<User> source =
        FileSource.forRecordStreamFormat(
        AvroParquetReaders.forReflectRecord(User.class), /* Flink Path */)
        .build();
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(10L);

final DataStream<User> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");

Text files format

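Flink supports reading text lines from files using TextLineInputFormat. This format uses Java's built-in InputStreamReader to decode the byte stream with one of the supported charset encodings. To use the format, add the Flink Connector Files dependency to your project: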

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-connector-files</artifactId>
	<version>1.15.0</version>
</dependency>
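
The decoding step mentioned above can be sketched with the JDK alone: wrap the byte stream in an InputStreamReader with an explicit charset and read it line by line. This is a minimal sketch of the idea and not the source code of TextLineInputFormat. LineDecodingDemo and readLines are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only: hypothetical helper, not part of Flink.
final class LineDecodingDemo {

    // Decodes the byte stream with the given charset and returns its lines without terminators.
    static List<String> readLines(InputStream in, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // "h\u00e9llo" contains a non-ASCII character, so the charset used for decoding matters.
        byte[] bytes = "h\u00e9llo\nflink\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readLines(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1));
    }
}
```

Decoding the same bytes with a different charset than the one used for writing would produce garbled characters, which is why the charset has to match the files being read.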

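This format is compatible with the new Source and can be used in both batch and streaming modes. You can therefore use it in two ways:

Bounded read for batch mode
Continuous read for streaming mode: monitors a directory for new files as they appear

Bounded read example:

In this example, we create a DataStream containing the lines of a text file as Strings. No watermark strategy is needed, because the records do not contain event timestamps.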

final FileSource<String> source =
  FileSource.forRecordStreamFormat(new TextLineInputFormat(), /* Flink Path */)
  .build();
final DataStream<String> stream =
  env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");

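Continuous read example: In this example, we create a DataStream containing the lines of text files as Strings. The stream grows indefinitely as new files are added to the directory. We check for new files every second. No watermark strategy is needed, because the records do not contain event timestamps.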

final FileSource<String> source =
  FileSource.forRecordStreamFormat(new TextLineInputFormat(), /* Flink Path */)
  .monitorContinuously(Duration.ofSeconds(1L))
  .build();
final DataStream<String> stream =
  env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
