I. Functionality
Spark Streaming reads messages from Kafka. Different topics may carry logs with different structures, so each topic has to be routed to its own processing logic.
II. Environment
1. kafka_2.11-0.10.0.1
Special note: kafka_2.11-0.10.2.1 appears to be broken in this setup. The direct stream created by Streaming could not fetch any messages and kept throwing errors, which cost me two days, so avoid that version if you can; after switching to kafka_2.11-0.10.0.1 everything worked.
2. JDK 1.8
3. Scala 2.11.8
4. ZooKeeper 3.4.5-cdh5.7.0
5. CDH 5.7.0
III. Creating the Kafka topics
1. Kafka setup and startup
See: https://blog.csdn.net/u010886217/article/details/82973573
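The linked post covers the full setup. For quick reference, a single broker can be started with the stock script that ships with Kafka (the default config path is assumed here):
bin/kafka-server-start.sh -daemon config/server.properties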
2. Create three topics:
bin/kafka-topics.sh --create --zookeeper hadoop:2181/kafka10_01 --replication-factor 1 --partitions 1 --topic hello_topic
bin/kafka-topics.sh --create --zookeeper hadoop:2181/kafka10_01 --replication-factor 1 --partitions 1 --topic hello_topic2
bin/kafka-topics.sh --create --zookeeper hadoop:2181/kafka10_01 --replication-factor 1 --partitions 1 --topic hello_topic3
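To double-check that the three topics exist before starting the Streaming job, list them against the same ZooKeeper chroot:
bin/kafka-topics.sh --list --zookeeper hadoop:2181/kafka10_01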
IV. Code
1. Dependencies
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>0.10.0.1</version>
        <exclusions>
            <exclusion>
                <groupId>javax.servlet</groupId>
                <artifactId>servlet-api</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <!-- Spark SQL -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>
    <!-- Spark Hive -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.zookeeper/zookeeper -->
    <dependency>
        <groupId>org.apache.zookeeper</groupId>
        <artifactId>zookeeper</artifactId>
        <version>3.4.5-cdh5.7.0</version>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
</dependencies>
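One caveat: the scala-library dependency references ${scala.version}, which is not defined in the snippet above and is presumably set elsewhere in the full pom. For the environment in section II it would be:
<properties>
    <scala.version>2.11.8</scala.version>
</properties>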
2. Implementation
import java.io.File
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, HasOffsetRanges, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

/**
 * Created by Administrator on 2019/12/7.
 */
object StreamingKafkaMutiTopics {
  def main(args: Array[String]): Unit = {
    // Quiet the noisy framework loggers so the batch output stays readable
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.WARN)
    Logger.getLogger("org.apache.kafka.clients.consumer").setLevel(Level.WARN)

    // Optional Hive-enabled SparkSession, kept for reference but not used here:
    // val warehouseLocation = new File("hdfs://cluster/hive/warehouse").getAbsolutePath
    // @transient
    // val spark = SparkSession
    //   .builder()
    //   .appName("Spark SQL To Hive")
    //   .config("spark.sql.warehouse.dir", warehouseLocation)
    //   .enableHiveSupport()
    //   .getOrCreate()
    // spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sparkConfig = new SparkConf()
      .setAppName("mutiTopics")
      .setMaster("local[2]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    @transient
    val sc = new SparkContext(sparkConfig)
    val scc = new StreamingContext(sc, Seconds(1)) // 1-second batches

    val kafkaParams = Map[String, Object](
      "auto.offset.reset" -> "latest", // or "earliest"
      "value.deserializer" -> classOf[StringDeserializer],
      "key.deserializer" -> classOf[StringDeserializer],
      "bootstrap.servers" -> "hadoop01:9092",
      "group.id" -> "test_jason",
      "enable.auto.commit" -> (false: java.lang.Boolean) // offsets are managed manually
    )

    // One direct stream subscribed to all three topics at once
    val topics = Array("hello_topic", "hello_topic2", "hello_topic3")
    val stream: InputDStream[ConsumerRecord[String, String]] =
      KafkaUtils.createDirectStream[String, String](
        scc,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
      )

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // Each RDD partition maps 1:1 to a Kafka topic-partition, so the
        // offset range at the same index tells us which topic this partition
        // belongs to, and we can branch to the matching processing logic.
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd.foreachPartition { partition =>
          val o = offsetRanges(TaskContext.get.partitionId)
          if (o.topic == "hello_topic") {
            // processing logic for hello_topic
            println("hello_topic logic:" + s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
          }
          if (o.topic == "hello_topic2") {
            // processing logic for hello_topic2
            println("hello_topic2 logic:" + s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
          }
          if (o.topic == "hello_topic3") {
            // processing logic for hello_topic3
            println("hello_topic3 logic:" + s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
          }
        }
      }
    }

    stream.map(record => (record.key, record.value)).print()
    scc.start()
    scc.awaitTermination()
  }
}
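A note on offsets: with enable.auto.commit set to false, the code reads the offset ranges but never commits them, so the consumer group's progress is not stored anywhere and a restart with auto.offset.reset=latest simply resumes from the newest messages. If you want Kafka to track progress, the kafka-0-10 integration lets you commit the ranges yourself. A minimal sketch, to be placed at the end of the foreachRDD block above (plus the extra import):

import org.apache.spark.streaming.kafka010.CanCommitOffsets

// Commit the ranges consumed in this batch back to Kafka, asynchronously.
// The cast works because the DStream was created by createDirectStream.
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)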
3. Start the console producers
bin/kafka-console-producer.sh --broker-list hadoop:9092 --topic hello_topic
bin/kafka-console-producer.sh --broker-list hadoop:9092 --topic hello_topic2
bin/kafka-console-producer.sh --broker-list hadoop:9092 --topic hello_topic3
Type a few test lines into each producer.
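If the Streaming job prints nothing, it helps to first rule out the producer side by attaching a console consumer to one of the topics (the old-consumer form, which works on 0.10.0.1):
bin/kafka-console-consumer.sh --zookeeper hadoop:2181/kafka10_01 --topic hello_topic --from-beginning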
4. Results
-------------------------------------------
Time: 1576397344000 ms
-------------------------------------------
hello_topic3 logic:hello_topic3 0 10 10
hello_topic2 logic:hello_topic2 0 26 27
hello_topic logic:hello_topic 0 2 2
-------------------------------------------
Time: 1576397345000 ms
-------------------------------------------
(null,sdf sdf sdf sdfwesdf sdf sdf sdfwe)
-------------------------------------------
Time: 1576397346000 ms
-------------------------------------------
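Two details worth noting in this output. A range such as "hello_topic3 0 10 10", where fromOffset equals untilOffset, means that partition contributed no new records in that batch; the branch still fires because the direct stream creates one RDD partition per Kafka topic-partition whether or not it holds data. And the (null,...) line comes from stream.map(record => (record.key, record.value)).print(): console-producer messages carry no key, so the key prints as null.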
V. References
1. https://blog.csdn.net/xianpanjia4616/article/details/90081537