Kafka+SparkStreaming+MySQL
Overview: Spark Streaming consumes the messages that producers write into a Kafka cluster, processes them with Spark Streaming operators, and writes the results to a MySQL database, implementing a word-count application.
Preparation
- Start ZooKeeper
- Start the Kafka cluster
- Create the topic spark
[root@HadoopNode01 kafka_2.11-2.2.0]# bin/kafka-topics.sh --bootstrap-server HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092 --create --topic spark --partitions 3 --replication-factor 3
- Start a console producer
[root@HadoopNode01 kafka_2.11-2.2.0]# bin/kafka-console-producer.sh --topic spark --broker-list HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092
- Start MySQL and create a table t_word with columns word (varchar) and count (int), for example:
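A minimal DDL sketch for this table (assumption: the database is named test, to match the JDBC URL used in the code below):
CREATE DATABASE IF NOT EXISTS test;
USE test;
-- column names and types follow the description above
CREATE TABLE t_word (
    word  VARCHAR(255),
    count INT
);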
Coding (in Scala)
1. Add dependencies
pom.xml
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.4</version>
    </dependency>
    <!-- jars required for Spark Streaming development -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.4.4</version>
        <!-- scope controls the dependency's visibility; provided means the container supplies it -->
        <!-- keep it commented out for local runs, uncomment it when submitting to the cluster -->
        <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.9.2</version>
    </dependency>
    <!-- Spark Streaming integration with Kafka -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>2.4.4</version>
    </dependency>
    <!-- MySQL JDBC driver -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.38</version>
    </dependency>
</dependencies>
2. Core code
package com.baizhi.test

import java.sql.DriverManager

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Reads data from Kafka and stores the word counts in a MySQL database.
 */
object InputFromKafkaAndOutputToMysql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("input from kafka and output to mysql").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // set the log level
    ssc.sparkContext.setLogLevel("ERROR")

    // build the DStream from Kafka
    // Kafka consumer configuration
    val kafkaParams = Map[String, Object](
      // Kafka broker address list
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092",
      // consumer group
      ConsumerConfig.GROUP_ID_CONFIG -> "g2",
      // key and value deserializers
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
    )
    val message = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](List("spark"), kafkaParams))

    message
      // split each record's value on spaces
      .flatMap(_.value.split(" "))
      // map each word to a pair -> (hello, 1)
      .map((_, 1))
      // sum the counts; a single result partition means one DB connection per batch
      .reduceByKey((v1: Int, v2: Int) => v1 + v2, 1)
      .foreachRDD(rdd => {
        // process each partition
        rdd.foreachPartition(iter => {
          // register the MySQL JDBC driver
          Class.forName("com.mysql.jdbc.Driver")
          // MySQL connection settings
          val connection = DriverManager.getConnection("jdbc:mysql://192.168.170.1:3306/test", "root", "960619")
          // SQL statements
          val selectSql = "select * from t_word where word = ?"
          val updateSql = "update t_word set count = ? where word = ?"
          val insertSql = "insert into t_word values(?,?)"
          iter.foreach(record => { // record is a pair -> (word, count)
            // look the word up first: if it already exists, add to its count; otherwise insert a new row
            val queryStatement = connection.prepareStatement(selectSql)
            queryStatement.setString(1, record._1)
            val rs = queryStatement.executeQuery()
            if (rs.next()) {
              // the word already exists
              val count = rs.getInt("count")
              val updateStatement = connection.prepareStatement(updateSql)
              updateStatement.setInt(1, count + record._2)
              updateStatement.setString(2, record._1)
              updateStatement.executeUpdate()
            } else {
              val insertStatement = connection.prepareStatement(insertSql)
              insertStatement.setString(1, record._1)
              insertStatement.setInt(2, record._2)
              insertStatement.executeUpdate()
            }
          })
          // close the connection
          connection.close()
        })
      })

    // start the application
    ssc.start()
    // block until the streaming context is stopped
    ssc.awaitTermination()
  }
}
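As a design note, the select-then-update-or-insert logic above could be replaced by a single upsert per record, which saves one round trip to MySQL; a minimal sketch, assuming word is declared as the table's primary key (the DDL above does not declare one):
-- assumes: ALTER TABLE t_word ADD PRIMARY KEY (word);
INSERT INTO t_word (word, count) VALUES (?, ?)
ON DUPLICATE KEY UPDATE count = count + VALUES(count);
To verify the pipeline end to end, type a few space-separated words into the console producer started earlier and check the table with SELECT * FROM t_word; the counts should grow with every 5-second batch that contains those words.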
