Solving duplicate consumption of message-queue data in Spark Streaming
Problem: when running Spark Streaming on E-MapReduce to consume data from Alibaba Cloud LogService (which can be used much like Kafka as a message queue for producing and consuming data), every batch re-consumes all of the data seen so far.
As shown in the figure: after 16 records were sent to LogService, every batch consumed all of them.
The code is as follows:
import java.util.HashMap;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
// LoghubUtils and LogHubCursorPosition come from the Alibaba Cloud E-MapReduce
// LogService connector (emr-logservice SDK); package names depend on the SDK version.

public class javaStreamingDirect {
    public static void main(String[] args) throws InterruptedException {
        String logServiceProject = "teststreaming04";
        String logStoreName = "teststreaming4logstore";
        String loghubConsumerGroupName = "filter_info_count";
        //String loghubEndpoint = "teststreaming04.cn-hangzhou.log.aliyuncs.com";
        String loghubEndpoint = "teststreaming04.cn-hangzhou-intranet.log.aliyuncs.com";
        String accessKeyId = "";
        String accessKeySecret = "";
        Duration batchInterval = Durations.seconds(5);

        SparkConf conf = new SparkConf().setAppName("javaStreamingDirect");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, batchInterval);
        ssc.checkpoint("oss://test/checkpoint/javaStreamingDirect");
        //ssc.checkpoint("D:/SparkData/streamingCheckPoint/directStream");

        HashMap<String, String> zkParam = new HashMap<>();
        zkParam.put("zookeeper.connect", "emr-worker-1,emr-header-2,emr-header-3:2181");
        //zkParam.put("zookeeper.connect", "192.168.96.119,192.168.96.118,192.168.96.117:2181");
        zkParam.put("enable.auto.commit", "false");   // offsets will NOT be committed automatically

        // Direct stream against LogService, starting from the end cursor
        JavaInputDStream<String> javaLoghubstream = LoghubUtils.createDirectStream(
                ssc,
                logServiceProject,
                logStoreName,
                loghubConsumerGroupName,
                accessKeyId,
                accessKeySecret,
                loghubEndpoint,
                zkParam,
                LogHubCursorPosition.END_CURSOR);

        javaLoghubstream.print();
        ssc.start();
        ssc.awaitTermination();
    }
}
Looking through the official documentation, I found the following API note in the Spark Streaming Kafka Direct section:
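(The screenshot of that documentation is not reproduced here. The passage being referred to is the manual offset-commit example from the Spark Streaming + Kafka 0.10 integration guide; the sketch below paraphrases that documented pattern, with the broker, group id, and topic as placeholders rather than values from the original post. `ssc` is a JavaStreamingContext as in the code above.)

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.*;

// Kafka direct-stream pattern from the Spark docs: with enable.auto.commit=false,
// offsets are committed back to Kafka manually via commitAsync() after each
// batch's output has completed.
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "localhost:9092");          // placeholder broker
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "example_group");                    // placeholder group id
kafkaParams.put("enable.auto.commit", false);                    // same setting as in the code above

JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
                ssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                        Arrays.asList("exampleTopic"), kafkaParams));   // placeholder topic

stream.foreachRDD(rdd -> {
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    // ... process and output the batch first ...
    ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
});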
Roughly, it says that Kafka commits offsets automatically, but because the code above set enable.auto.commit to false, offsets are never committed automatically and have to be committed manually. Following the API, the code can be modified as follows:
import java.util.HashMap;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
// LoghubUtils, LogHubCursorPosition and CanCommitOffsets come from the Alibaba Cloud
// LogService connector (emr-logservice SDK); package names depend on the SDK version.

public class javaStreamingDirect {
    public static void main(String[] args) throws InterruptedException {
        String logServiceProject = "teststreaming04";
        String logStoreName = "teststreaming4logstore";
        String loghubConsumerGroupName = "filter_info_count";
        //String loghubEndpoint = "teststreaming04.cn-hangzhou.log.aliyuncs.com";
        String loghubEndpoint = "teststreaming04.cn-hangzhou-intranet.log.aliyuncs.com";
        String accessKeyId = "";
        String accessKeySecret = "";
        Duration batchInterval = Durations.seconds(5);

        SparkConf conf = new SparkConf().setAppName("javaStreamingDirect");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, batchInterval);
        ssc.checkpoint("oss://test/checkpoint/javaStreamingDirect");
        //ssc.checkpoint("D:/SparkData/streamingCheckPoint/directStream");

        HashMap<String, String> zkParam = new HashMap<>();
        zkParam.put("zookeeper.connect", "emr-worker-1,emr-header-2,emr-header-3:2181");
        //zkParam.put("zookeeper.connect", "192.168.96.119,192.168.96.118,192.168.96.117:2181");
        zkParam.put("enable.auto.commit", "false");

        JavaInputDStream<String> javaLoghubstream = LoghubUtils.createDirectStream(
                ssc,
                logServiceProject,
                logStoreName,
                loghubConsumerGroupName,
                accessKeyId,
                accessKeySecret,
                loghubEndpoint,
                zkParam,
                LogHubCursorPosition.END_CURSOR);

        // Commit the consumed offsets once per batch, so the next batch starts
        // where this one left off instead of re-reading everything.
        javaLoghubstream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
            @Override
            public void call(JavaRDD<String> stringJavaRDD) throws Exception {
                ((CanCommitOffsets) javaLoghubstream.inputDStream()).commitAsync();
            }
        });

        javaLoghubstream.print();
        ssc.start();
        ssc.awaitTermination();
    }
}
With this change, once each batch has finished consuming its data, the offsets of the consumed data are committed, so the next batch no longer re-consumes the same data.
Summary: besides committing offsets, another way to avoid duplicate consumption is to use the receiver-based approach to pull the stream; to make that highly available, however, the WAL mechanism must be enabled so that received data is written to HDFS, which costs some performance. Alternatively, if you do not want to commit offsets manually, you can set enable.auto.commit to true and let Kafka commit them automatically; the risk is that data may be marked as consumed before the Spark job has had a chance to output it, so consistency cannot be guaranteed. If you use the direct approach, committing offsets manually is therefore the better choice. A minimal sketch of the receiver-plus-WAL option follows.
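For reference, here is a minimal sketch of enabling the WAL for a receiver-based job. The checkpoint path is a placeholder, and a plain socket receiver stands in for the receiver-based LogService stream so the example stays self-contained; in the LogService case the stream would instead come from the connector's receiver-based API.

import org.apache.spark.SparkConf;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class ReceiverWithWal {
    public static void main(String[] args) throws InterruptedException {
        // Turn on the write-ahead log so data buffered by the receiver is also
        // persisted to fault-tolerant storage before being processed.
        SparkConf conf = new SparkConf()
                .setAppName("ReceiverWithWal")
                .set("spark.streaming.receiver.writeAheadLog.enable", "true");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));
        // The WAL is written under the checkpoint directory, so it must live on
        // fault-tolerant storage such as HDFS or OSS.
        ssc.checkpoint("hdfs:///checkpoint/receiverWithWal");   // placeholder path
        // Any receiver-based source works the same way; with the WAL enabled,
        // a non-replicated serialized storage level avoids keeping the data twice.
        JavaDStream<String> lines =
                ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER());
        lines.print();
        ssc.start();
        ssc.awaitTermination();
    }
}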