Method 1: Listening on a port. This approach requires first starting an `nc -lk <port>` service on Linux; Spark Streaming can then pull data from that port and process it in real time. The code is as follows:
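The netcat listener mentioned above can be started like this (port 8888 here is the same port the code connects to; adjust it to your environment):

```shell
# -l: listen mode; -k: keep listening after a client disconnects.
# Each line typed into this terminal becomes one record in the stream.
nc -lk 8888
```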
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[*]")
    val sc: SparkContext = new SparkContext(conf)
    // Batch interval: a new micro-batch is produced every 5 seconds
    val ssc: StreamingContext = new StreamingContext(sc, Seconds(5))
    // Receive lines of text from the netcat server on linux01:8888
    val dStream: ReceiverInputDStream[String] = ssc.socketTextStream("linux01", 8888)
    val dStream2: DStream[String] = dStream.flatMap(_.split(" "))
    val dStream3: DStream[(String, Int)] = dStream2.map((_, 1))
    val reduced: DStream[(String, Int)] = dStream3.reduceByKey(_ + _)
    reduced.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Method 2: Integrating Spark Streaming with Kafka. This approach requires a running Kafka cluster. The code is as follows:
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object StreamingKafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[*]")
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(5))
    // The console floods with INFO-level messages, most of which are irrelevant,
    // so raise the log level to WARN and show only the results
    ssc.sparkContext.setLogLevel("WARN")
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "test01:9092,test02:9092,test03:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "1",
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (true: java.lang.Boolean)
    )
    val topics = Array("wordcount")
    /*
    createDirectStream signature:
      ssc: StreamingContext,
      locationStrategy: LocationStrategy,
      consumerStrategy: ConsumerStrategy[K, V]
    */
    // Create the original DStream by calling createDirectStream on the KafkaUtils object.
    // This uses Kafka's low-level API: consumers connect directly to the partition leaders, which is more efficient
    val value: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
      // Argument 1: the StreamingContext
      ssc,
      // Argument 2: location strategy. PreferConsistent distributes the Kafka partitions evenly across the available executors
      LocationStrategies.PreferConsistent,
      // Argument 3: consumer strategy. It takes two arguments: the Kafka topics to consume, and the Kafka configuration parameters
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    )
    // A map is required here; otherwise the job fails with an "object not serializable" exception.
    // The InputDStream holds ConsumerRecord objects, which are not serializable,
    // so the key or value must be extracted before printing
    val value1: DStream[String] = value.map(cr => {
      cr.value()
    })
    value1.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
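Note that despite the class name, the Kafka example above only prints the raw record values. A minimal sketch of turning it into an actual word count, reusing the same flatMap / map / reduceByKey chain as in method 1 (this fragment assumes the `value1` DStream from the code above and replaces the `value1.print()` call):

```scala
// Split each Kafka record's value into words, pair each word with 1,
// and sum the counts within each 5-second batch
val words: DStream[String] = value1.flatMap(_.split(" "))
val pairs: DStream[(String, Int)] = words.map((_, 1))
val counts: DStream[(String, Int)] = pairs.reduceByKey(_ + _)
counts.print()
```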