1. Create a Maven project
First, create a Maven project. For the detailed steps, see this article:
Creating a Maven Project
2. Edit the pom file
We are using Spark 2.4.5 with Scala 2.12, so we need to pick spark-streaming-kafka-0-10_2.12.
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>2.4.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.12</artifactId>
        <version>2.4.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
        <version>2.4.5</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <!-- This plugin compiles the Scala code into class files -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <!-- Bind the goals to Maven's compile/test-compile phases -->
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
That is everything the pom file needs.
3. Start Kafka
Once the dependencies have downloaded we can start writing code, but first start Kafka on the Linux machine.
Start ZooKeeper:
zkServer.sh start
# then start Kafka
kafka-server-start.sh config/server.properties
When starting Kafka, pay attention to the path of the configuration file.
Create a topic
Create the topic (a topic named test, with one replica and one partition):
kafka-topics.sh --create --zookeeper master:2181 --replication-factor 1 --partitions 1 --topic test
Create a producer
kafka-console-producer.sh --broker-list master:9092 --topic test
Note your own virtual machine's IP address or hostname; mine is master.
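If you prefer to send test data from code instead of the console producer, here is a minimal Scala sketch of a producer. It assumes the kafka-clients classes pulled in transitively by the spark-streaming-kafka-0-10 dependency, and the broker address master:9092 and topic test from above; the object name TestMessageProducer is just an illustrative placeholder.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object TestMessageProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // broker address used earlier in this article; adjust to your own host
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "master:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

    val producer = new KafkaProducer[String, String](props)
    // send a few space-separated lines so the word count below has something to aggregate
    Seq("hello spark", "hello kafka", "spark streaming kafka").foreach { line =>
      producer.send(new ProducerRecord[String, String]("test", line))
    }
    producer.flush()
    producer.close()
  }
}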
4. Create the Spark Streaming consumer
OK, next we create a Scala file in IDEA to act as the consumer.
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaSparkStreamingConsumer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    // "master" is the Linux hostname. If you use the hostname and run directly in IDEA,
    // add a hostname-to-IP mapping in C:\Windows\System32\drivers\etc\hosts,
    // or simply use the Linux machine's IP address instead.
    val brokers = "master:9092"
    val topics = Array("test")
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> brokers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "group1",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val dstream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils
      .createDirectStream[String, String](ssc,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
      )
    dstream
      .map((record: ConsumerRecord[String, String]) => {
        println(record.value())
        record.value()
      })
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
KafkaUtils.createDirectStream
This creates the DStream using the direct approach. The direct approach was originally introduced for the Kafka 0.8 integration; we are now using the 0.10 integration, whose API has changed again, so here is a short explanation of the method:
LocationStrategies.PreferConsistent:
Distributes the topic's partitions evenly across the executors, i.e. each executor consumes roughly the same number of topic partitions.
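For reference, the 0-10 integration ships with other location strategies besides PreferConsistent. The following sketch only lists them; the host name "worker1" in PreferFixed is a placeholder for one of your own worker hosts.

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

// distribute partitions evenly across available executors (the usual choice)
val consistent = LocationStrategies.PreferConsistent
// prefer executors running on the same hosts as the Kafka brokers
// (only useful when executors and brokers are co-located)
val preferBrokers = LocationStrategies.PreferBrokers
// pin specific partitions to specific hosts; "worker1" is a placeholder host name
val fixed = LocationStrategies.PreferFixed(
  Map(new TopicPartition("test", 0) -> "worker1"))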
ConsumerStrategies.Subscribe[String, String](topics, kafkaParams):
Specifies the topics the consumer subscribes to, together with the consumer configuration parameters.
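One more note: because the kafkaParams above set enable.auto.commit to false, offsets are not committed automatically. A minimal sketch of committing them manually after each batch, using the HasOffsetRanges and CanCommitOffsets interfaces of the 0-10 integration, could look like the following; it would sit inside the main method above, working on the dstream defined there.

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

dstream.foreachRDD { rdd =>
  // read the offset ranges covered by this batch
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch here ...
  // asynchronously commit the offsets back to Kafka once the batch is handled
  dstream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}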
Run it, and you will see the streaming processing results.