Flink: Syncing Kafka Data, Compressing It, and Storing It in HDFS

This post is just a working note. Syncing the consumed data to HDFS is fairly simple; the official documentation has a complete example, the core of which is as follows:

DataStream<Tuple2<IntWritable,Text>> input = ...;

BucketingSink<Tuple2<IntWritable,Text>> sink = new BucketingSink<Tuple2<IntWritable,Text>>("/base/path");
sink.setBucketer(new DateTimeBucketer<>("yyyy-MM-dd--HHmm", ZoneId.of("America/Los_Angeles")));
sink.setWriter(new SequenceFileWriter<IntWritable, Text>());
sink.setBatchSize(1024 * 1024 * 400); // this is 400 MB,
sink.setBatchRolloverInterval(20 * 60 * 1000); // this is 20 mins

input.addSink(sink);

According to the documentation, SequenceFile output supports compression. The default writer set by sink.setWriter is String-based; to use SequenceFileWriter, the stream type must be <IntWritable, Text>:

sink.setWriter(new SequenceFileWriter<IntWritable, Text>("com.hadoop.compression.lzo.LzoCodec",
				SequenceFile.CompressionType.BLOCK));

So the records cannot simply be passed straight through as Strings. I use a FlatMap to convert each record into an <IntWritable, Text> pair: the Text holds the data, and the IntWritable does not matter, any number will do.


import java.time.ZoneId;
import java.util.Properties;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.SequenceFileWriter;
import org.apache.flink.streaming.connectors.fs.StringWriter;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.util.Collector;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/**
 * Consumes records from Kafka, wraps them as Tuple2<IntWritable, Text>,
 * and writes LZO-compressed SequenceFiles to HDFS via BucketingSink.
 */
public class App {
	public static void main(String[] args) throws Exception {

		// load the native LZO library up front; if this fails, the job cannot write LZO files
		System.loadLibrary("gplcompression");

		String topic = args[0];
		String group_id = args[1];
		String hdfs_path = args[2];
		String app_name = args[3];

		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
		// checkpoint every 5 seconds and keep externalized checkpoints on cancellation
		env.enableCheckpointing(5000);
		CheckpointConfig config = env.getCheckpointConfig();
		config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
		// Kafka consumer properties
		Properties prop = new Properties();
		prop.setProperty("bootstrap.servers",
				"10.200.10.108:9092,10.200.10.109:9092,10.200.10.110:9092,10.200.81.108:9092,10.200.81.109:9092");
		prop.setProperty("group.id", group_id);

		// consume the topic as raw strings
		DataStream<String> stream = env.addSource(new FlinkKafkaConsumer010<>(topic, new SimpleStringSchema(), prop));

		// wrap each record as Tuple2<IntWritable, Text> so SequenceFileWriter can handle it
		DataStream<Tuple2<IntWritable, Text>> line = stream.flatMap(new LineSplitter());

		// bucket output by day and write block-compressed LZO SequenceFiles
		BucketingSink<Tuple2<IntWritable, Text>> sink = new BucketingSink<Tuple2<IntWritable, Text>>(hdfs_path);
		sink.setBucketer(new DateTimeBucketer<Tuple2<IntWritable, Text>>("yyyy-MM-dd", ZoneId.of("Asia/Shanghai")));
		sink.setWriter(new SequenceFileWriter<IntWritable, Text>("com.hadoop.compression.lzo.LzoCodec",
				SequenceFile.CompressionType.BLOCK));
		// sink.setWriter(new StringWriter()); // uncompressed alternative
		sink.setBatchSize(1024 * 1024 * 1024L); // roll the part file at 1 GB
		sink.setBatchRolloverInterval(60 * 60 * 1000L); // or after 60 minutes
		sink.setPendingPrefix("");
		sink.setPendingSuffix("");
		sink.setInProgressPrefix(".");
		line.addSink(sink);
		env.execute(app_name);

	}

	/**
	 * Drops lines containing "INFO" and wraps every other line in a
	 * Tuple2<IntWritable, Text>; the IntWritable key is a constant 1 and carries no meaning.
	 */
	public static class LineSplitter implements FlatMapFunction<String, Tuple2<IntWritable, Text>> {
		@Override
		public void flatMap(String line, Collector<Tuple2<IntWritable, Text>> out) {
			if (!line.contains("INFO")) {
				out.collect(new Tuple2<IntWritable, Text>(new IntWritable(1), new Text(line)));
			}
		}
	}

}

LineSplitter does next to nothing: it drops lines containing "INFO" and simply wraps every other line in a Tuple2.
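To see that the IntWritable key really carries no information, here is a minimal read-back sketch using the plain Hadoop SequenceFile API. The ReadBack class and its path argument are illustrative and are not part of the job; for decompression to work, the LZO codec and native library must also be available locally, as described below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ReadBack {
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		// pass the path of one of the files produced by the sink above
		Path file = new Path(args[0]);
		try (SequenceFile.Reader reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(file))) {
			IntWritable key = new IntWritable();
			Text value = new Text();
			while (reader.next(key, value)) {
				// the key is always 1; only the Text value is the original Kafka record
				System.out.println(value);
			}
		}
	}
}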

One thing worth pointing out: when I submitted the job, it reported a native-lzo library error, roughly meaning the LZO .so file could not be found. This error cost me several hours; the fix I finally found is as follows.

Add the following parameters to the Flink configuration file (flink-conf.yaml):


yarn.application-master.env.LD_LIBRARY_PATH: /opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native:$LD_LIBRARY_PATH
yarn.taskmanager.env.LD_LIBRARY_PATH: /opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native:$LD_LIBRARY_PATH

The parameters above take care of the job once it is running on YARN, but on the machine you submit from you also need to set:

export LD_LIBRARY_PATH=/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native:$LD_LIBRARY_PATH

These two settings resolve the missing LZO library. The very first line of main() loads the LZO native library; if that load fails, submitting the job will basically fail as well.
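If you want that failure to be more obvious, a small fail-fast check at the top of main() could look like the sketch below. The error message wording is my own; com.hadoop.compression.lzo.LzoCodec is the same codec class the writer above already uses.

		// fail fast with an explicit message instead of a late native-lzo error
		try {
			System.loadLibrary("gplcompression");                  // native LZO bindings
			Class.forName("com.hadoop.compression.lzo.LzoCodec");  // codec class must be on the classpath
		} catch (UnsatisfiedLinkError | ClassNotFoundException e) {
			throw new RuntimeException(
					"LZO is not available: check LD_LIBRARY_PATH and the GPLEXTRAS parcel", e);
		}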

Add the following dependencies to pom.xml:

		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-core</artifactId>
			<version>1.8.1</version>
		</dependency>

		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-java</artifactId>
			<version>1.8.1</version>
			<scope>provided</scope>
		</dependency>
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-streaming-java_2.11</artifactId>
			<version>1.8.1</version>
			<scope>provided</scope>
		</dependency>

		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-connector-kafka-0.10_2.11</artifactId>
			<version>1.8.1</version>
		</dependency>

		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-table-planner_2.11</artifactId>
			<version>1.8.1</version>
		</dependency>

		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-table-api-java-bridge_2.11</artifactId>
			<version>1.8.1</version>
		</dependency>

		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-streaming-scala_2.11</artifactId>
			<version>1.8.1</version>
		</dependency>

		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-table-common</artifactId>
			<version>1.8.1</version>
		</dependency>

		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-connector-filesystem_2.11</artifactId>
			<version>1.8.1</version>
		</dependency>

		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-client</artifactId>
			<version>3.0.0-cdh6.2.0</version>
		</dependency>

		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-hadoop-compatibility_2.11</artifactId>
			<version>1.8.1</version>
		</dependency>

		<!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
		<dependency>
			<groupId>com.alibaba</groupId>
			<artifactId>fastjson</artifactId>
			<version>1.2.59</version>
		</dependency>

 
