Spark Streaming is usually used to consume data from Kafka and write the results to some other store. In this project, however, I needed to read from a Kafka topic, process the data with Spark Streaming, and write the output back to a different Kafka topic; this post records the implementation.
Environment:
spark: 1.6.1
streaming-kafka: spark-streaming-kafka_2.10, 1.6.1
In this example, each executor holds a singleton Kafka producer, and messages are written out locally and in parallel on each executor. If that throughput is still not enough, you can go further and maintain a Kafka producer pool on each executor, giving you parallelism across executors plus concurrent output within each executor.
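The per-executor singleton rests on a standard trick: ship a serializable wrapper whose producer field is transient and created lazily on first use, so the producer itself is never serialized with the closure and each executor builds exactly one copy after deserialization. Below is a minimal sketch of that pattern in plain Java; LazySink and SerSupplier are hypothetical names, and a StringBuilder stands in for the real Kafka producer so the sketch is self-contained and runnable:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Supplier;

// Hypothetical wrapper: holds a serializable factory, not the producer itself.
// The transient field is skipped by Java serialization, so each deserialized
// copy (i.e. each executor) lazily creates its own single instance.
class LazySink<T> implements Serializable {
    private final Supplier<T> factory;  // concrete object must be Serializable
    private transient T instance;       // not shipped with the closure

    LazySink(Supplier<T> factory) { this.factory = factory; }

    synchronized T get() {
        if (instance == null) {
            instance = factory.get();   // first call on this JVM
        }
        return instance;
    }
}

public class LazySinkDemo {
    // A Supplier that is also Serializable, so the lambda survives shipping
    interface SerSupplier<T> extends Supplier<T>, Serializable {}

    public static void main(String[] args) throws Exception {
        SerSupplier<StringBuilder> factory = StringBuilder::new;
        LazySink<StringBuilder> sink = new LazySink<>(factory);
        sink.get().append("driver");

        // Simulate shipping the wrapper to an executor via serialization
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(sink);
        oos.flush();
        @SuppressWarnings("unchecked")
        LazySink<StringBuilder> onExecutor = (LazySink<StringBuilder>)
            new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray())).readObject();

        // The transient instance was not serialized: the "executor" copy
        // lazily creates a fresh one, independent of the driver's.
        System.out.println(sink.get());                      // driver
        System.out.println(onExecutor.get().length());       // 0
        System.out.println(sink.get() == onExecutor.get());  // false
    }
}
```

In the real job the factory would build the Kafka producer from broadcast configuration, and every partition processed on an executor would call `get()` and reuse the same producer.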
Core code
package org.frey.example;
import com.google.common.collect.Lists;
import kafka.producer.KeyedMessage;
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
/**
* KafkaOutput
*
* @author FREY
* @date 2016/4/28
*/
public