Consuming Kafka data with Flink and writing it to HDFS as Parquet files in event-time-based directories

Requirements

Consume messages from Kafka and write them into HDFS partitioned by hour according to each record's event time (for example, a record from 2020-05-28 00:09:12 goes into the ${root_path}/20200528/00 directory). Storing the data as plain text files would consume a lot of space and make queries slow, so the data is compressed into Parquet format to save space and speed up downstream processing.

Data format: event timestamp\tlog data
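
To make the bucketing rule concrete, here is a minimal sketch (plain Java, no Flink involved) that maps a record's event timestamp to its target directory. The root path and sample log line are made up for illustration, and the printed directory depends on the JVM time zone; the example assumes UTC+8, matching the timestamps used in this post.

import java.text.SimpleDateFormat;
import java.util.Date;

public class BucketPathDemo {
    public static void main(String[] args) {
        String rootPath = "hdfs://localhost:9000/flink";        // hypothetical root path
        String record = "1590595752000\tsome log line";         // event timestamp \t log data

        long eventTime = Long.parseLong(record.split("\t")[0]); // 2020-05-28 00:09:12 in UTC+8
        String dayHour = new SimpleDateFormat("yyyyMMdd/HH").format(new Date(eventTime));

        // Prints hdfs://localhost:9000/flink/20200528/00 when the JVM time zone is UTC+8
        System.out.println(rootPath + "/" + dayHour);
    }
}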

Environment

Kafka: 0.10
HDFS: 2.4
Flink: 1.8.1

Overview

Implementation based on BucketingSink for HDFS versions below 2.7

Flink offers two frameworks for sinking to HDFS: StreamingFileSink and BucketingSink. There are already plenty of articles on implementing this with StreamingFileSink; if your HDFS version is 2.7 or newer, follow one of those and you can stop reading here. If it is older, follow the steps below and forget about StreamingFileSink, because at runtime you are guaranteed to hit the following error:

2020-05-21 11:08:27,639 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Custom Source -> Filter -> Sink: Unnamed (2/10) (61155cec36f558ad35136b726ca67fee) switched from RUNNING to FAILED.
java.lang.UnsupportedOperationException: Recoverable writers on Hadoop are only supported for HDFS and for Hadoop version 2.7 or newer
	at org.apache.flink.runtime.fs.hdfs.HadoopRecoverableWriter.<init>(HadoopRecoverableWriter.java:57)
	at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.createRecoverableWriter(HadoopFileSystem.java:202)
	at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.createRecoverableWriter(SafetyNetWrapperFileSystem.java:69)
	at org.apache.flink.streaming.api.functions.sink.filesystem.Buckets.<init>(Buckets.java:112)
	at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink$RowFormatBuilder.createBuckets(StreamingFileSink.java:242)
	at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink.initializeState(StreamingFileSink.java:327)
	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:278)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
	at java.lang.Thread.run(Thread.java:748)

Implementation

  • Override org.apache.flink.streaming.connectors.fs.Writer to write Parquet files with AvroParquetWriter
  • Override org.apache.flink.streaming.connectors.fs.bucketing.BasePathBucketer to bucket records into hourly directories based on their event time

Converting the data to GenericData.Record

AvroParquetWriter operates on GenericData.Record from the org.apache.avro.generic package, so it is best to use GenericData.Record as the type parameter of the BucketingSink as well; that avoids unnecessary conversions between object types. Consequently, the last operator in the Flink pipeline needs to turn each message into a GenericData.Record.

    public static String schema = "{\"name\" :\"UserGroup\"" +
            ",\"type\" :\"record\"" +
            ",\"fields\" :" +
            "     [ {\"name\" :\"data\",\"type\" :\"string\" }] " +
            "} ";

    // Parse the schema once instead of on every record
    private static final Schema AVRO_SCHEMA = new Schema.Parser().parse(schema);

    /**
     * Wrap a raw Kafka message in a GenericData.Record.
     */
    public static GenericData.Record transformData(String msg) {
        GenericData.Record record = new GenericData.Record(AVRO_SCHEMA);
        record.put("data", msg);
        return record;
    }

Overriding Writer to write Parquet files

import org.apache.flink.streaming.connectors.fs.Writer;
import org.apache.flink.util.Preconditions;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

import java.io.IOException;

/**
 * A Flink {@link Writer} that writes Avro records to Parquet files via AvroParquetWriter.
 *
 * @param <T> the record type, which must extend {@link GenericRecord}
 */
public class ParquetSinkWriter<T extends GenericRecord> implements Writer<T> {

    private static final long serialVersionUID = -97530255651203388L;

    private final CompressionCodecName compressionCodecName = CompressionCodecName.SNAPPY;
    private final int pageSize = 64 * 1024;

    private final String schemaRepresentation;

    private transient Schema schema;
    private transient ParquetWriter<T> writer;
    private transient Path path;

    private int position;

    public ParquetSinkWriter(String schemaRepresentation) {
        this.schemaRepresentation = Preconditions.checkNotNull(schemaRepresentation);
    }

    @Override
    public void open(FileSystem fs, Path path) throws IOException {
        this.position = 0;
        this.path = path;

        if (writer != null) {
            writer.close();
        }

        writer = createWriter();
    }

    @Override
    public long flush() throws IOException {
        Preconditions.checkNotNull(writer);
        // ParquetWriter can only persist its buffers by closing the file (which writes the footer),
        // so flush() records the size written so far, closes the current writer, and opens a new one on the same path.
        position += writer.getDataSize();
        writer.close();
        writer = createWriter();

        return position;
    }

    @Override
    public long getPos() throws IOException {
        Preconditions.checkNotNull(writer);
        return position + writer.getDataSize();
    }

    @Override
    public void close() throws IOException {
        if (writer != null) {
            writer.close();
            writer = null;
        }
    }

    @Override
    public void write(T element) throws IOException {
        Preconditions.checkNotNull(writer);
        writer.write(element);
    }

    @Override
    public Writer<T> duplicate() {
        return new ParquetSinkWriter<>(schemaRepresentation);
    }

    private ParquetWriter<T> createWriter() throws IOException {
        if (schema == null) {
            schema = new Schema.Parser().parse(schemaRepresentation);
        }

        return AvroParquetWriter.<T>builder(path)
                .withSchema(schema)
                .withDataModel(GenericData.get())
                .withCompressionCodec(compressionCodecName)
                .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
                .withPageSize(pageSize)
                .build();
    }
}
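
Before wiring the writer into a BucketingSink, it can be exercised in isolation. The following is a minimal smoke-test sketch (not part of the project above) that writes a single record to the local file system; the output path and sample payload are made up, and it assumes no Hadoop configuration on the classpath redirects the default file system to HDFS.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParquetSinkWriterSmokeTest {
    public static void main(String[] args) throws Exception {
        String schemaJson = "{\"name\":\"UserGroup\",\"type\":\"record\","
                + "\"fields\":[{\"name\":\"data\",\"type\":\"string\"}]}";

        // Open the writer against the local file system instead of HDFS
        FileSystem fs = FileSystem.getLocal(new Configuration());
        ParquetSinkWriter<GenericData.Record> writer = new ParquetSinkWriter<>(schemaJson);
        writer.open(fs, new Path("/tmp/parquet-writer-smoke-test.parquet"));

        // Write one record shaped like the Kafka payload: "eventTimestamp\tlogData"
        GenericData.Record record = new GenericData.Record(new Schema.Parser().parse(schemaJson));
        record.put("data", "1590595752000\tsome log line");
        writer.write(record);
        writer.close();
    }
}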

Overriding BasePathBucketer to bucket by day and hour

import org.apache.avro.generic.GenericData;
import org.apache.flink.streaming.connectors.fs.Clock;
import org.apache.flink.streaming.connectors.fs.bucketing.BasePathBucketer;
import org.apache.hadoop.fs.Path;

import java.text.SimpleDateFormat;
import java.util.Date;

public class DayHourPathBucketer extends BasePathBucketer<GenericData.Record> {

    // HDFS paths always use '/', regardless of the local OS
    private final SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd/HH");

    @Override
    public Path getBucketPath(Clock clock, Path basePath, GenericData.Record element) {
        // The payload is "eventTimestamp\tlogData"; the leading timestamp decides the bucket
        String data = element.get("data").toString();
        long eventTime = Long.parseLong(data.split("\t")[0]);
        String dayHour = sdf.format(new Date(eventTime));
        return new Path(basePath, dayHour);
    }
}
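
A quick way to check the bucketer outside of Flink is to call getBucketPath directly with a hand-built record. This is just a sketch: the base path and log line are made up, the sample timestamp corresponds to 2020-05-28 00:09:12 in UTC+8, and the resulting directory depends on the JVM time zone.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.fs.Path;

public class DayHourPathBucketerDemo {
    public static void main(String[] args) {
        String schemaJson = "{\"name\":\"UserGroup\",\"type\":\"record\","
                + "\"fields\":[{\"name\":\"data\",\"type\":\"string\"}]}";
        GenericData.Record record = new GenericData.Record(new Schema.Parser().parse(schemaJson));
        record.put("data", "1590595752000\tsome log line");

        DayHourPathBucketer bucketer = new DayHourPathBucketer();
        // The clock argument is not used by DayHourPathBucketer, so null is fine here
        Path bucket = bucketer.getBucketPath(null, new Path("hdfs://localhost:9000/flink"), record);

        // Prints hdfs://localhost:9000/flink/20200528/00 when the JVM time zone is UTC+8
        System.out.println(bucket);
    }
}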

Main program

import com.utils.DayHourPathBucketer;
import com.utils.ParquetSinkWriter;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;

import java.util.Properties;

public class KafkaSink2HDFSasParquet {

    public static String brokers = "localhost:9092";

    public static String baseDir = "hdfs://localhost:9000/flink";

    public static String topic = "mytopic";

   
    public static String schema = "{\"name\" :\"UserGroup\"" +
            ",\"type\" :\"record\"" +
            ",\"fields\" :" +
            "     [ {\"name\" :\"data\",\"type\" :\"string\" }] " +
            "} ";

    // Parse the schema once instead of on every record
    private static final Schema AVRO_SCHEMA = new Schema.Parser().parse(schema);

    public static GenericData.Record transformData(String msg) {
        GenericData.Record record = new GenericData.Record(AVRO_SCHEMA);
        record.put("data", msg);
        return record;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(20 * 60 * 1000); // checkpoint every 20 minutes
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE); // at-least-once is enough; this job does not need EXACTLY_ONCE
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", brokers);
        props.setProperty("group.id", "flink.xxx.xxx");


        FlinkKafkaConsumer010<String> consumer = new FlinkKafkaConsumer010<String>(topic,
                new SimpleStringSchema(), props);
        consumer.setStartFromLatest();
        DataStreamSource<String> stream = env.addSource(consumer);

        SingleOutputStreamOperator<GenericData.Record> output = stream.map(KafkaSink2HDFSasParquet::transformData);


        BucketingSink<GenericData.Record> sink = new BucketingSink<>(baseDir);
        sink.setBucketer(new DayHourPathBucketer());
        sink.setBatchSize(1024 * 1024 * 400); // roll a new part file every 400 MB
        ParquetSinkWriter<GenericData.Record> writer = new ParquetSinkWriter<>(schema);
        sink.setWriter(writer);


        output.addSink(sink);
        env.execute("KafkaSink2HDFSasParquet");
    }
}
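
One practical detail with event-time hourly buckets: once an hour has passed, its bucket stops receiving data, and BucketingSink only moves the in-progress part file to the pending state after the bucket has been inactive for a while (pending files are then finalized by the next successful checkpoint). If finished hours should be closed sooner than the defaults allow, the inactive-bucket settings can be tightened. A minimal sketch follows, assuming the sink built in main() above; the values are illustrative, not recommendations.

import org.apache.avro.generic.GenericData;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;

public class BucketingSinkTuning {

    // Apply to the sink built in main() above, right after sink.setBatchSize(...)
    static void applyRolloverTuning(BucketingSink<GenericData.Record> sink) {
        sink.setInactiveBucketCheckInterval(5 * 60 * 1000);  // scan for inactive buckets every 5 minutes
        sink.setInactiveBucketThreshold(30 * 60 * 1000);     // close buckets that received no data for 30 minutes
    }
}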

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>KafkaSink2HDFSasParquet</artifactId>
    <version>1.0</version>
    <properties>
        <flink.version>1.8.1</flink.version>
        <kafka.version>0.10.2.1</kafka.version>
        <flink-extends.version>1.0.2</flink-extends.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>



    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <!--scala-->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.11</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.11</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <!--Adding Connector and Library Dependencies-->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka-0.10_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-filesystem_2.11</artifactId>
            <version>1.8.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-avro</artifactId>
            <version>1.8.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-hadoop</artifactId>
            <version>1.8.1</version>
        </dependency>

    </dependencies>


    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.0.0</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <artifactSet>
                                <excludes>
                                    <exclude>com.google.code.findbugs:jsr305</exclude>
                                    <exclude>org.slf4j:*</exclude>
                                    <exclude>log4j:*</exclude>
                                </excludes>
                            </artifactSet>
                            <filters>
                                <filter>
                                    <!-- Do not copy the signatures in the META-INF folder.
                                    Otherwise, this might cause SecurityExceptions when using the JAR. -->
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>my.programs.main.clazz</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>


</project>

Project GitHub repository

https://github.com/qiyongbo/KafkaSink2HDFSasParquet.git

