How to commit offsets
Offsets are committed after the batch has finished running.
To commit an offset we first have to answer two questions: what is the current offset, and where is it stored?
val messages = km.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
The messages stream returned here does carry the offset information, but as soon as you apply any transformation to messages, that offset information is no longer carried along. When you later try to commit offsets there is nothing to commit, and data can be lost.
So what can we do? Call messages.foreachRDD(rdd => {…}) and put all the business logic inside foreachRDD; the rdd inside still carries the offset information, and the offsets are committed at the end of the batch.
There is a catch, though: the messages object the program starts with is a DStream (Spark Streaming programming), and inside foreachRDD we have dropped down to the RDD level (Spark Core programming). Scenarios that need the window operator and the like are therefore no longer supported, even though the DStream itself supports window operations. So this scheme works in most situations, but not in all of them.
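To make that trade-off concrete, here is a minimal sketch of the foreachRDD-based approach, assuming the spark-streaming-kafka-0-8 direct API, where the underlying KafkaRDD implements HasOffsetRanges; how the offsets are actually persisted (for example via KafkaCluster.setConsumerOffsets, as the listener later in this article does) is only indicated by a comment:

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// The offset ranges must be read from the original stream's RDDs, before any
// transformation, because only the KafkaRDD implements HasOffsetRanges.
messages.foreachRDD { rdd =>
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // All business logic happens here, at the RDD (Spark Core) level,
  // so DStream-only operators such as window are not available.
  rdd.map(_._2).foreach(line => println(line))

  // Once the work for this batch has succeeded, persist offsetRanges
  // (e.g. to ZooKeeper via KafkaCluster.setConsumerOffsets).
}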
In the end, the approach chosen was to add a listener to manage the offsets.
Managing offsets with a listener
Code implementation - Scala (word count):
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[3]").setAppName("test")
    conf.set("spark.streaming.kafka.maxRatePerPartition", "5")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val ssc = new StreamingContext(conf, Seconds(50))

    val brokers = "xxx:9092"
    val topics = "xxx_infologic"
    val groupId = "xxx_test" // note: this is the name of our consumer group
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers,
      "group.id" -> groupId,
      "client.id" -> "test",
      "enable.auto.commit" -> "false" // auto-commit of offsets must be disabled
    )

    // Key step 1: register the listener that commits the offsets for us.
    ssc.addStreamingListener(new MyListener(kafkaParams))
    // Key step 2: create a KafkaManager, which reads the last committed offsets and creates the stream from them.
    val km = new KafkaManager(kafkaParams)
    val messages = km.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    // Implement your business logic here.
    messages // as soon as messages is transformed, the offset information is no longer carried along
      .map(_._2)
      .foreachRDD(rdd => {
        rdd.foreach(line => {
          print(line)
          print("-============================================")
        })
      })

    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }
}
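Because the listener commits the offsets when each batch completes, regardless of which transformations were applied to the stream, the code above can stay at the DStream level, and operators such as window remain usable. As a rough sketch, the business-logic part could instead look like the following (assuming the same messages stream as above; the window and slide durations are arbitrary examples and only need to be multiples of the 50-second batch interval):

messages
  .map(_._2)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  // windowed word count over the last 3 batches, sliding forward one batch at a time
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(150), Seconds(50))
  .print()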
Code implementation - the listener; its purpose is to make development more convenient for engineers.
import kafka.common.TopicAndPartition;
import org.apache.spark.streaming.kafka.KafkaCluster;
import org.apache.spark.streaming.kafka.OffsetRange;
import org.apache.spark.streaming.scheduler.*;
import scala.Option;
import scala.collection.JavaConversions;
import scala.collection.immutable.List;

import java.util.HashMap;
import java.util.Map;

public class MyListener implements StreamingListener {

    private KafkaCluster kc;
    public scala.collection.immutable.Map<String, String> kafkaParams;

    public MyListener(scala.collection.immutable.Map<String, String> kafkaParams) {
        this.kafkaParams = kafkaParams;
        kc = new KafkaCluster(kafkaParams);
    }

    @Override
    public void onReceiverStarted(StreamingListenerReceiverStarted receiverStarted) {
    }

    @Override
    public void onReceiverError(StreamingListenerReceiverError receiverError) {
    }

    @Override
    public void onReceiverStopped(StreamingListenerReceiverStopped receiverStopped) {
    }

    @Override
    public void onBatchSubmitted(StreamingListenerBatchSubmitted batchSubmitted) {
    }

    @Override
    public void onBatchStarted(StreamingListenerBatchStarted batchStarted) {
    }

    /**
     * Called when a batch completes.
     * Once a Spark Streaming batch has finished running, this method is triggered,
     * and it is here that the offsets are committed.
     * @param batchCompleted
     */
    @Override
    public void onBatchCompleted(StreamingListenerBatchCompleted batchCompleted) {
        /**
         * A batch contains multiple tasks; in general there are as many tasks as partitions.
         * Suppose there are 10 tasks and 8 succeed while 2 fail: if the offsets were committed
         * anyway, data would be lost. So we check one thing: only if every task in the batch
         * succeeded do we commit the offsets.
         *
         * For example, with 10 tasks where 5 succeed and 5 fail, the offsets must not be committed.
         * Not committing can cause a small amount of duplicate data, which is acceptable in roughly
         * 95% of production scenarios. At our company, every real-time job tolerates a small amount
         * of duplicate data, but losing data is never allowed.
         *
         * For a task that ran successfully there is no failure reason (the failureReason field is empty).
         * For a task that failed, failureReason will contain a value explaining why it failed.
         */
        // If any task in this batch failed, abort the offset commit.
        scala.collection.immutable.Map<Object, OutputOperationInfo> opsMap = batchCompleted.batchInfo().outputOperationInfos();
        Map<Object, OutputOperationInfo> javaOpsMap = JavaConversions.mapAsJavaMap(opsMap);
        for (Map.Entry<Object, OutputOperationInfo> entry : javaOpsMap.entrySet()) {
            // A failureReason other than None (Scala's None) means there was an exception: do not save the offsets.
            if (!"None".equalsIgnoreCase(entry.getValue().failureReason().toString())) {
                return;
            }
        }
        long batchTime = batchCompleted.batchInfo().batchTime().milliseconds();
        // topic -> (partition -> offset)
        Map<String, Map<Integer, Long>> offset = getOffset(batchCompleted);
        for (Map.Entry<String, Map<Integer, Long>> entry : offset.entrySet()) {
            String topic = entry.getKey();
            Map<Integer, Long> partitionToOffset = entry.getValue();
            // All that remains is to write the offset information to ZooKeeper.
            for (Map.Entry<Integer, Long> p2o : partitionToOffset.entrySet()) {
                Map<TopicAndPartition, Object> map = new HashMap<TopicAndPartition, Object>();
                TopicAndPartition topicAndPartition =
                        new TopicAndPartition(topic, p2o.getKey());
                map.put(topicAndPartition, p2o.getValue());
                scala.collection.immutable.Map<TopicAndPartition, Object>
                        topicAndPartitionObjectMap = TypeHelper.toScalaImmutableMap(map);
                kc.setConsumerOffsets(kafkaParams.get("group.id").get(), topicAndPartitionObjectMap);
            }
        }
    }

    @Override
    public void onOutputOperationStarted(StreamingListenerOutputOperationStarted outputOperationStarted) {
    }

    @Override
    public void onOutputOperationCompleted(StreamingListenerOutputOperationCompleted outputOperationCompleted) {
    }

    private Map<String, Map<Integer, Long>> getOffset(StreamingListenerBatchCompleted batchCompleted) {
        Map<String, Map<Integer, Long>> map = new HashMap<>();
        scala.collection.immutable.Map<Object, StreamInputInfo> inputInfoMap = batchCompleted.batchInfo().streamIdToInputInfo();
        Map<Object, StreamInputInfo> infos = JavaConversions.mapAsJavaMap(inputInfoMap);
        infos.forEach((k, v) -> {
            Option<Object> optOffsets = v.metadata().get("offsets");
            if (!optOffsets.isEmpty()) {
                Object objOffsets = optOffsets.get();
                if (List.class.isAssignableFrom(objOffsets.getClass())) {
                    List<OffsetRange> scalaRanges = (List<OffsetRange>) objOffsets;
                    Iterable<OffsetRange> ranges = JavaConversions.asJavaIterable(scalaRanges);
                    for (OffsetRange range : ranges) {
                        if (!map.containsKey(range.topic())) {
                            map.put(range.topic(), new HashMap<>());
                        }
                        map.get(range.topic()).put(range.partition(), range.untilOffset());
                    }
                }
            }
        });
        return map;
    }
}
A type-conversion helper, so the API can be used from multiple languages:
import scala.Tuple2;

public class TypeHelper {

    @SuppressWarnings("unchecked")
    public static <K, V> scala.collection.immutable.Map<K, V> toScalaImmutableMap(java.util.Map<K, V> javaMap) {
        final java.util.List<Tuple2<K, V>> list = new java.util.ArrayList<>(javaMap.size());
        for (final java.util.Map.Entry<K, V> entry : javaMap.entrySet()) {
            list.add(Tuple2.apply(entry.getKey(), entry.getValue()));
        }
        final scala.collection.Seq<Tuple2<K, V>> seq =
                scala.collection.JavaConverters.asScalaBufferConverter(list).asScala().toSeq();
        return (scala.collection.immutable.Map<K, V>) scala.collection.immutable.Map$.MODULE$.apply(seq);
    }
}
Code implementation - Java (word count):
import kafka.serializer.StringDecoder;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

import java.util.*;

public class JavaWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("test_kafka_offset_monitor").setMaster("local[4]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        String topics = "xxx_arbitrationlogic"; // topic(s)
        String groupId = "xxx_test_consumer";   // your consumer group name
        String brokers = "xxx:9092";            // brokers
        Set<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));
        Map<String, String> kafkaParams = new HashMap<>(); // Kafka parameters
        kafkaParams.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
        kafkaParams.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);

        // Key step 1: add the listener, which automatically commits the offsets when each batch completes.
        ssc.addStreamingListener(new MyListener(TypeHelper.toScalaImmutableMap(kafkaParams)));
        // Key step 2: use the KafkaManager provided by the data platform to read data starting from the stored offsets.
        // From Java, call createDirectStream like this:
        final KafkaManager kafkaManager = new KafkaManager(TypeHelper.toScalaImmutableMap(kafkaParams));
        JavaPairInputDStream<String, String> myDStream = kafkaManager.createDirectStream(
                ssc,
                String.class,
                String.class,
                StringDecoder.class,
                StringDecoder.class,
                kafkaParams,
                topicsSet
        );

        myDStream.map(new Function<Tuple2<String, String>, String>() {
            @Override
            public String call(Tuple2<String, String> tuple) throws Exception {
                return tuple._2;
            }
        }).flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String line) throws Exception {
                return Arrays.asList(line.split("_")).iterator();
            }
        }).mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<>(word, 1);
            }
        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer a, Integer b) throws Exception {
                return a + b;
            }
        }).foreachRDD(new VoidFunction<JavaPairRDD<String, Integer>>() {
            @Override
            public void call(JavaPairRDD<String, Integer> rdd) throws Exception {
                rdd.foreach(new VoidFunction<Tuple2<String, Integer>>() {
                    @Override
                    public void call(Tuple2<String, Integer> wordCount) throws Exception {
                        System.out.println("word: " + wordCount._1 + "  count: " + wordCount._2);
                    }
                });
            }
        });

        ssc.start();
        try {
            ssc.awaitTermination();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        ssc.stop();
    }
}