Flink: consume Kafka data and write it to HDFS as Parquet files, bucketed into per-hour directories by event time
Requirements
Consume messages from Kafka and, based on each record's event time, store the data in HDFS bucketed by hour (for example, data with time 2020-05-28 00:09:12 goes into the ${root_path}/20200528/00 directory). Storing the data as plain text files would consume a lot of space and be slow to query, so it is compressed into Parquet format to save space and speed up downstream processing.
Data format: event timestamp\tlog data
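For example, a single message and the directory it should land in look like this (the log payload below is made up for illustration; assuming the timestamp is epoch milliseconds, 1590595752000 is 2020-05-28 00:09:12 in UTC+8):

1590595752000\t{"uid":"u_001","action":"click"}    ->    ${root_path}/20200528/00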
Environment
Kafka: 0.10
HDFS: 2.4
Flink: 1.8.1
Overview
Implementation based on BucketingSink, for HDFS versions below 2.7
Flink offers two sink frameworks for writing to HDFS: StreamingFileSink and BucketingSink. There are already plenty of articles on implementing this with StreamingFileSink; if your HDFS version is 2.7 or newer, follow their code and stop reading here. If it is not, follow the steps below and do not bother with StreamingFileSink, otherwise at runtime you are guaranteed to see the following error:
2020-05-21 11:08:27,639 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> Filter -> Sink: Unnamed (2/10) (61155cec36f558ad35136b726ca67fee) switched from RUNNING to FAILED.
java.lang.UnsupportedOperationException: Recoverable writers on Hadoop are only supported for HDFS and for Hadoop version 2.7 or newer
at org.apache.flink.runtime.fs.hdfs.HadoopRecoverableWriter.<init>(HadoopRecoverableWriter.java:57)
at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.createRecoverableWriter(HadoopFileSystem.java:202)
at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.createRecoverableWriter(SafetyNetWrapperFileSystem.java:69)
at org.apache.flink.streaming.api.functions.sink.filesystem.Buckets.<init>(Buckets.java:112)
at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink$RowFormatBuilder.createBuckets(StreamingFileSink.java:242)
at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink.initializeState(StreamingFileSink.java:327)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:278)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:748)
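For reference only: on HDFS 2.7 or newer, the StreamingFileSink route looks roughly like the sketch below (not used in this article). It needs the flink-parquet format dependency, which the pom at the end of this article does not include, and EventTimeBucketAssigner is a hypothetical helper that mirrors the DayHourPathBucketer implemented later.

// Sketch only: assumes HDFS >= 2.7 and the flink-parquet dependency.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.core.fs.Path;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;
import java.text.SimpleDateFormat;
import java.util.Date;

public class EventTimeBucketAssigner implements BucketAssigner<GenericRecord, String> {
    @Override
    public String getBucketId(GenericRecord element, Context context) {
        // "data" holds "event timestamp \t log data"; bucket by yyyyMMdd/HH of the event time.
        long eventTime = Long.parseLong(element.get("data").toString().split("\t")[0]);
        return new SimpleDateFormat("yyyyMMdd/HH").format(new Date(eventTime));
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}

// Wiring inside the job, assuming a DataStream<GenericRecord> named records and the same
// schema string and baseDir used in the main class later in this article:
StreamingFileSink<GenericRecord> sink = StreamingFileSink
        .forBulkFormat(new Path(baseDir), ParquetAvroWriters.forGenericRecord(new Schema.Parser().parse(schema)))
        .withBucketAssigner(new EventTimeBucketAssigner())
        .build();
records.addSink(sink);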
Implementation
- Override org.apache.flink.streaming.connectors.fs.Writer to write Parquet files with AvroParquetWriter
- Override org.apache.flink.streaming.connectors.fs.bucketing.BasePathBucketer to bucket the data into hourly directories by event time
Converting the data to GenericData.Record
AvroParquetWriter operates on GenericData.Record objects from the org.apache.avro.generic package, so the type parameter of the BucketingSink should also be GenericData.Record; this avoids unnecessary conversions between object types. The last operator in the Flink pipeline therefore needs to turn each message into a GenericData.Record.
public static String schema = "{\"name\" :\"UserGroup\"" +
        ",\"type\" :\"record\"" +
        ",\"fields\" :" +
        " [ {\"name\" :\"data\",\"type\" :\"string\" }] " +
        "} ";

/*
 * Convert a message into a GenericData.Record object.
 */
public static GenericData.Record transformData(String msg) {
    GenericData.Record record = new GenericData.Record(new Schema.Parser().parse(schema));
    record.put("data", msg);
    return record;
}
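Wrapping a raw message (the payload here is made up) then looks like this; the whole line, event timestamp included, stays inside the single data field and is only parsed later by the bucketer:

GenericData.Record record = transformData("1590595752000\tsome-log-payload");
// record.get("data") returns the original line unchanged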
Overriding Writer to write Parquet files
import org.apache.flink.streaming.connectors.fs.Writer;
import org.apache.flink.util.Preconditions;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import java.io.IOException;
/**
* Parquet writer.
*
* @param <T>
*/
public class ParquetSinkWriter<T extends GenericRecord> implements Writer<T> {

    private static final long serialVersionUID = -97530255651203388L;

    private final CompressionCodecName compressionCodecName = CompressionCodecName.SNAPPY;
    private final int pageSize = 64 * 1024;
    private final String schemaRepresentation;

    private transient Schema schema;
    private transient ParquetWriter<T> writer;
    private transient Path path;
    private int position;

    public ParquetSinkWriter(String schemaRepresentation) {
        this.schemaRepresentation = Preconditions.checkNotNull(schemaRepresentation);
    }

    @Override
    public void open(FileSystem fs, Path path) throws IOException {
        this.position = 0;
        this.path = path;
        if (writer != null) {
            writer.close();
        }
        writer = createWriter();
    }

    // ParquetWriter cannot flush without closing the file, so flush() closes the current
    // writer, adds its size to the running position, and opens a fresh writer for the
    // records that follow.
    @Override
    public long flush() throws IOException {
        Preconditions.checkNotNull(writer);
        position += writer.getDataSize();
        writer.close();
        writer = createWriter();
        return position;
    }

    @Override
    public long getPos() throws IOException {
        Preconditions.checkNotNull(writer);
        return position + writer.getDataSize();
    }

    @Override
    public void close() throws IOException {
        if (writer != null) {
            writer.close();
            writer = null;
        }
    }

    @Override
    public void write(T element) throws IOException {
        Preconditions.checkNotNull(writer);
        writer.write(element);
    }

    @Override
    public Writer<T> duplicate() {
        return new ParquetSinkWriter<>(schemaRepresentation);
    }

    private ParquetWriter<T> createWriter() throws IOException {
        if (schema == null) {
            schema = new Schema.Parser().parse(schemaRepresentation);
        }
        return AvroParquetWriter.<T>builder(path)
                .withSchema(schema)
                .withDataModel(GenericData.get())
                .withCompressionCodec(compressionCodecName)
                .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
                .withPageSize(pageSize)
                .build();
    }
}
Overriding BasePathBucketer to bucket by hour
import org.apache.avro.generic.GenericData;
import org.apache.flink.streaming.connectors.fs.Clock;
import org.apache.flink.streaming.connectors.fs.bucketing.BasePathBucketer;
import org.apache.hadoop.fs.Path;
import java.text.SimpleDateFormat;
import java.util.Date;

public class DayHourPathBucketer extends BasePathBucketer<GenericData.Record> {

    // Bucket layout: yyyyMMdd/HH, e.g. 20200528/00. HDFS paths use "/" regardless of the
    // local OS, and SimpleDateFormat formats with the JVM's default time zone.
    private final SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd/HH");

    @Override
    public Path getBucketPath(Clock clock, Path basePath, GenericData.Record element) {
        // The "data" field holds the raw message "event timestamp \t log data",
        // with the timestamp in milliseconds.
        String data = (String) element.get("data");
        long eventTime = Long.parseLong(data.split("\t")[0]);
        String dayHour = sdf.format(new Date(eventTime));
        return new Path(basePath + "/" + dayHour);
    }
}
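A quick local sanity check of the bucketing logic, as a minimal sketch: it assumes the event timestamp is in milliseconds and the JVM time zone is UTC+8 (1590595752000 is 2020-05-28 00:09:12 in that zone), and the log payload is made up.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.fs.Path;

public class DayHourPathBucketerCheck {
    public static void main(String[] args) {
        String schema = "{\"name\":\"UserGroup\",\"type\":\"record\"," +
                "\"fields\":[{\"name\":\"data\",\"type\":\"string\"}]}";
        GenericData.Record record = new GenericData.Record(new Schema.Parser().parse(schema));
        record.put("data", "1590595752000\tsome-log-payload");
        // The clock argument is unused by DayHourPathBucketer, so null is fine here.
        Path bucket = new DayHourPathBucketer().getBucketPath(null, new Path("/flink"), record);
        System.out.println(bucket); // prints /flink/20200528/00 when the JVM time zone is UTC+8
    }
}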
The main class
import com.utils.DayHourPathBucketer;
import com.utils.ParquetSinkWriter;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import java.util.Properties;
public class KafkaSink2HDFSasParquet {

    public static String brokers = "localhost:9092";
    public static String baseDir = "hdfs://localhost:9000/flink";
    public static String topic = "mytopic";
    public static String schema = "{\"name\" :\"UserGroup\"" +
            ",\"type\" :\"record\"" +
            ",\"fields\" :" +
            " [ {\"name\" :\"data\",\"type\" :\"string\" }] " +
            "} ";

    public static GenericData.Record transformData(String msg) {
        GenericData.Record record = new GenericData.Record(new Schema.Parser().parse(schema));
        record.put("data", msg);
        return record;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(20 * 60 * 1000); // checkpoint every 20 minutes
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE); // at-least-once is enough; the business logic does not need EXACTLY_ONCE

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", brokers);
        props.setProperty("group.id", "flink.xxx.xxx");

        FlinkKafkaConsumer010<String> consumer = new FlinkKafkaConsumer010<String>(topic,
                new SimpleStringSchema(), props);
        consumer.setStartFromLatest();

        DataStreamSource<String> stream = env.addSource(consumer);
        SingleOutputStreamOperator<GenericData.Record> output = stream.map(KafkaSink2HDFSasParquet::transformData);

        BucketingSink<GenericData.Record> sink = new BucketingSink<GenericData.Record>(baseDir);
        sink.setBucketer(new DayHourPathBucketer());
        sink.setBatchSize(1024 * 1024 * 400); // roll part files at 400 MB
        ParquetSinkWriter<GenericData.Record> writer = new ParquetSinkWriter<>(schema);
        sink.setWriter(writer);
        output.addSink(sink);

        env.execute("KafkaSink2HDFSasParquet");
    }
}
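Once the job is running, the layout on HDFS should look roughly like the following (part file names are BucketingSink defaults; in-progress and pending files are finalized after a successful checkpoint):

hdfs://localhost:9000/flink/20200528/00/part-0-0
hdfs://localhost:9000/flink/20200528/00/part-1-0
hdfs://localhost:9000/flink/20200528/01/_part-0-1.in-progress
...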
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>KafkaSink2HDFSasParquet</artifactId>
<version>1.0</version>
<properties>
<flink.version>1.8.1</flink.version>
<kafka.version>0.10.2.1</kafka.version>
<flink-extends.version>1.0.2</flink-extends.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<!--scala-->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_2.11</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.11</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<!--Adding Connector and Library Dependencies-->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.10_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.3</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-filesystem_2.11</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-avro</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-hadoop</artifactId>
<version>1.8.1</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.0.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<artifactSet>
<excludes>
<exclude>com.google.code.findbugs:jsr305</exclude>
<exclude>org.slf4j:*</exclude>
<exclude>log4j:*</exclude>
</excludes>
</artifactSet>
<filters>
<filter>
<!-- Do not copy the signatures in the META-INF folder.
Otherwise, this might cause SecurityExceptions when using the JAR. -->
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>KafkaSink2HDFSasParquet</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
</plugins>
</build>
</project>