Integrating Spark with Kafka requires the spark-streaming-kafka jar, which comes in two branches depending on the Kafka version: spark-streaming-kafka-0-8 and spark-streaming-kafka-0-10.
Branch selection rule: for Kafka versions >= 0.8.2.1 and < 0.10.0, use spark-streaming-kafka-0-8; for Kafka versions >= 0.10.0, use spark-streaming-kafka-0-10.
The Kafka releases from 0.8.2.1 onward are: 0.8.2.1 (released March 11, 2015), 0.8.2.2 (October 2, 2015), 0.9.x, 0.10.x (0.10.0.0 released May 22, 2016), 0.11.x, 1.0.x (1.0.0 released November 1, 2017), 1.1.x, and 2.0.x (2.0.0 released July 30, 2018).
This walkthrough uses Kafka 1.0.0, so we need the spark-streaming-kafka-0-10 jar:
```xml
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.2.1</version>
</dependency>
```
Note: as the groupId shows, this jar is developed by the Spark project itself.
Simple example 1 (verified on Spark 2.4.0 with Scala 2.12 and Kafka 2.2.0 with Scala 2.12):
```java
import org.apache.commons.collections.MapUtils;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;
import com.alibaba.fastjson.JSON;

import java.util.*;

public class SparkConsumerTest {

    public static void main(String[] args) throws Exception {
        // Needed on Windows only: point Hadoop at a local winutils install.
        System.setProperty("hadoop.home.dir",
                "C:/Users/lenovo/Downloads/winutils-master/winutils-master/hadoop-2.7.1");

        SparkConf conf = new SparkConf().setAppName("heihei").setMaster("local[*]");
        conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Kafka consumer configuration; the 0-10 API expects a Map<String, Object>.
        Map<String, Object> kafkaParams = new HashMap<>(8);
        kafkaParams.put("bootstrap.servers", "192.168.56.100:9092");
        kafkaParams.put("group.id", "my-test-consumer-group");
        kafkaParams.put("enable.auto.commit", "true");
        kafkaParams.put("auto.commit.interval.ms", "1000");
        kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        JavaInputDStream<ConsumerRecord<String, String>> javaInputDStream =
                KafkaUtils.createDirectStream(jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(Arrays.asList("test"), kafkaParams));
        javaInputDStream.persist(StorageLevel.MEMORY_AND_DISK_SER());

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        javaInputDStream.foreachRDD(rdd -> {
            // Parse each record's JSON value and map it onto a DemoBean.
            Dataset<Row> df = spark.createDataFrame(rdd.map(consumerRecord -> {
                Map testMap = JSON.parseObject(consumerRecord.value(), Map.class);
                return new DemoBean(MapUtils.getString(testMap, "id"),
                        MapUtils.getString(testMap, "name"),
                        MapUtils.getIntValue(testMap, "age"));
            }), DemoBean.class);

            // Append each micro-batch to PostgreSQL over JDBC.
            String url = "jdbc:postgresql://192.168.56.100/postgres";
            String table = "test";
            Properties connectionProperties = new Properties();
            connectionProperties.put("user", "postgres");
            connectionProperties.put("password", "abc123");
            connectionProperties.put("driver", "org.postgresql.Driver");
            connectionProperties.put("batchsize", "3000");
            df.write().mode(SaveMode.Append).jdbc(url, table, connectionProperties);
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
```
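The foreachRDD mapper above assumes each Kafka message value is a JSON object carrying `id`, `name`, and `age`. A minimal sketch of what such a payload looks like (the topic name `test` and the field names are taken from the example; `PayloadFormatDemo` and `samplePayload` are hypothetical names used only for illustration):

```java
public class PayloadFormatDemo {

    // Build a sample JSON payload with the fields the streaming job reads
    // (id, name, age). Strings of this shape are what a producer would need
    // to send to the "test" topic for the mapping code to work.
    static String samplePayload(String id, String name, int age) {
        return String.format("{\"id\":\"%s\",\"name\":\"%s\",\"age\":%d}", id, name, age);
    }

    public static void main(String[] args) {
        System.out.println(samplePayload("1", "tom", 20));
    }
}
```

You can send such a message with any Kafka producer (for example, the console producer shipped with Kafka) to see a row land in the `test` table.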
DemoBean is a separate entity class.
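The original post does not show DemoBean. A minimal sketch consistent with the mapping code above (the fields `id`, `name`, `age` are inferred from the `MapUtils` calls; the real class may differ) could look like this. Public getters matter here, because `createDataFrame(rdd, DemoBean.class)` derives the DataFrame schema from bean properties via reflection:

```java
import java.io.Serializable;

// Hypothetical sketch of the DemoBean entity class; field names are inferred
// from the mapping code in the example and may not match the real class.
public class DemoBean implements Serializable {

    private String id;
    private String name;
    private int age;

    // No-arg constructor for reflection-based frameworks.
    public DemoBean() {
    }

    public DemoBean(String id, String name, int age) {
        this.id = id;
        this.name = name;
        this.age = age;
    }

    // Bean-style getters/setters: Spark reads these to build the schema
    // (columns id, name, age).
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}
```

Implementing `Serializable` keeps the bean usable inside Spark closures that are shipped to executors.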
The corresponding pom.xml (truncated in the original post):
```xml
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.12</artifactId>
        <version>2.2.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-streams -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-streams</artifactId>
        <version>2.2.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.9</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.6</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
    <dependency>
        <groupId
```