Since Spark 2.x introduced Structured Streaming, data processing is gradually moving to the DataFrame/Dataset API. The following describes how to integrate Spark 2.x with Kafka 0.10+.
Structured Streaming + Kafka
Adding the dependency
```
groupId = org.apache.spark
artifactId = spark-sql-kafka-0-10_2.11
version = 2.1.1
```
To show the dependencies more clearly, here is my project's sbt build file:
name := "spark-test" version := "1.0" scalaVersion := "2.11.7" // https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.1" % "provided" // https://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.11 libraryDependencies += "org.apache.spark" % "spark-mllib_2.11" % "2.1.1" % "provided" // https://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.11 libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.1.1" % "provided" // https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.3" // https://mvnrepository.com/artifact/mysql/mysql-connector-java libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.38" // https://mvnrepository.com/artifact/org.apache.kafka/kafka_2.11 libraryDependencies += "org.apache.kafka" % "kafka_2.11" % "0.10.2.1" //libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.1.1" libraryDependencies += "org.apache.spark" % "spark-sql-kafka-0-10_2.11" % "2.1.1"
Connecting to Kafka from Structured Streaming
```scala
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder()
    .appName("Spark structured streaming Kafka example")
    // .master("local[2]")
    .getOrCreate()

  // Subscribe to the "testss" topic; each record arrives as a row with
  // binary key/value columns plus Kafka metadata columns.
  val inputstream = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "127.0.0.1:9092")
    .option("subscribe", "testss")
    .load()

  import spark.implicits._
  // Concatenate key and value and print every micro-batch to the console.
  val query = inputstream.select($"key", $"value")
    .as[(String, String)].map(kv => kv._1 + " " + kv._2).as[String]
    .writeStream
    .outputMode("append")
    .format("console")
    .start()

  query.awaitTermination()
}
```
The schema of the stream is as follows:
Column | Type |
---|---|
key | binary |
value | binary |
topic | string |
partition | int |
offset | long |
timestamp | long |
timestampType | int |
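Since key and value arrive as binary columns, they are usually cast before use. A minimal sketch, reusing the inputstream from the example above (the column names come from the schema table; the aliases and the variable name decoded are just illustrative):

```scala
// Cast the binary payload to strings and keep the Kafka metadata columns.
val decoded = inputstream.selectExpr(
  "CAST(key AS STRING) AS key",
  "CAST(value AS STRING) AS value",
  "topic", "partition", "offset", "timestamp")
```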
Configurable options
Option | value | meaning |
---|---|---|
assign | json string {"topicA":[0,1],"topicB":[2,4]} | Specifies the exact TopicPartitions to consume. assign, subscribe and subscribePattern are the three subscription modes; only one of them may be set at a time. |
subscribe | A comma-separated list of topics | Specifies the topics to consume. |
subscribePattern | Java regex string | Matches the topics to consume with a regular expression (see the sketch below). |
kafka.bootstrap.servers | A comma-separated list of host:port | The Kafka brokers to connect to. |
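As an illustration of the subscription modes in the table above, here is a minimal sketch that subscribes by regular expression instead of a fixed topic list (the pattern "test.*" and the broker address are assumptions carried over from the earlier example):

```scala
// Same Kafka reader as before, but selecting topics by regex.
// Only one of assign / subscribe / subscribePattern may be set at a time.
val patternStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "127.0.0.1:9092")
  .option("subscribePattern", "test.*")
  .load()
```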
Options that cannot be configured
- group.id: Kafka automatically creates a unique group for each query.
- auto.offset.reset: Use the startingOffsets option instead (see the sketch after this list); Structured Streaming maintains the offsets of the stream itself in order to guarantee the promised exactly-once semantics.
- key.deserializer: Handled on the DataFrame; keys are always deserialized with ByteArrayDeserializer.
- value.deserializer: Handled on the DataFrame; values are always deserialized with ByteArrayDeserializer.
- enable.auto.commit: The Kafka source does not commit any offsets.
- interceptor.classes: Keys and values are always read as byte arrays; it is not safe to use a ConsumerInterceptor, as it may break the query.
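A minimal sketch of the startingOffsets option mentioned above (the topic name and offsets are illustrative; "earliest", "latest", or a per-partition JSON string are accepted):

```scala
// Replaces auto.offset.reset: start from explicit per-partition offsets,
// given as a JSON string ("earliest" and "latest" are also accepted;
// -2 means earliest and -1 means latest for an individual partition).
val fromSpecificOffsets = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "127.0.0.1:9092")
  .option("subscribe", "testss")
  .option("startingOffsets", """{"testss":{"0":23}}""")
  .load()
```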
Spark Streaming (DStream) + Kafka
Consuming from the latest offsets
```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.TaskContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

def main(args: Array[String]): Unit = {
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "localhost:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "use_a_separate_group_id_for_each_stream",
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )

  // OpContext.sc is the author's own holder for the SparkContext.
  val ssc = new StreamingContext(OpContext.sc, Seconds(2))
  val topics = Array("test")
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    PreferConsistent,
    Subscribe[String, String](topics, kafkaParams)
  )

  stream.foreachRDD(rdd => {
    // Offset ranges are only available on the RDDs produced by the direct stream.
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd.foreachPartition(iter => {
      val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
      println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
    })
    // Commit the offsets back to Kafka after the batch has been processed.
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  })
  // stream.map(record => (record.key, record.value)).print(1)
  ssc.start()
  ssc.awaitTermination()
}
```
Consuming from specified offsets
```scala
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign

def main(args: Array[String]): Unit = {
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "localhost:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "use_a_separate_group_id_for_each_stream",
    // "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )

  val ssc = new StreamingContext(OpContext.sc, Seconds(2))
  // Start partition 0 of topic "test" at the given offset.
  val fromOffsets = Map(new TopicPartition("test", 0) -> 1100449855L)
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    PreferConsistent,
    Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
  )

  stream.foreachRDD(rdd => {
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    for (o <- offsetRanges) {
      println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
    }
    // Commit the processed offsets back to Kafka asynchronously.
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  })
  // stream.map(record => (record.key, record.value)).print(1)
  ssc.start()
  ssc.awaitTermination()
}
```
Reposted from: https://blog.51cto.com/13064681/1943431