A big data starter project with Kafka + Spark Streaming + MySQL (word count)



Description: Spark Streaming consumes the data produced into a Kafka cluster, processes it with Spark Streaming operators, and writes the result to a MySQL database, implementing a word count.

Preparation

  1. Start Zookeeper
  2. Start the Kafka cluster
  3. Create the topic spark
[root@HadoopNode01 kafka_2.11-2.2.0]# bin/kafka-topics.sh --bootstrap-server HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092 --create --topic spark --partitions 3 --replication-factor 3
  4. Start a console producer
[root@HadoopNode01 kafka_2.11-2.2.0]# bin/kafka-console-producer.sh --topic spark --broker-list HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092
  5. Start MySQL and create the table t_word with the columns word (varchar) and count (int); the code below connects to a database named test, so the table should live there (see the DDL sketch after this list)
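The streaming job below expects this table to already exist. A minimal DDL sketch, assuming the database name test from the JDBC URL used in the code; declaring word as the primary key is an extra assumption that the lookup logic below does not strictly require, but it prevents duplicate rows and enables the upsert variant shown at the end of this post:

create database if not exists test;
use test;
create table if not exists t_word(
    word varchar(100) primary key,
    count int
);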

Coding (in Scala)

1. Add the dependencies

pom.xml

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.4</version>
    </dependency>
    <!-- Jars required for Spark Streaming development -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.4.4</version>
        <!-- scope controls where the dependency is available; provided means the runtime container supplies it -->
        <!-- keep it commented out for local runs, enable it for cluster deployment -->
        <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.9.2</version>
    </dependency>
    <!-- Spark Streaming integration with Kafka -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>2.4.4</version>
    </dependency>
    <!-- MySQL JDBC driver -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.38</version>
    </dependency>
</dependencies>
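Note that the list above only covers the runtime dependencies. To compile the .scala sources with Maven you typically also need the Scala library and a Scala compiler plugin; a minimal sketch, assuming Scala 2.11.12 (to match the _2.11 artifacts) and the scala-maven-plugin (the exact versions here are assumptions):

<!-- add inside <dependencies>: the Scala standard library -->
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.12</version>
</dependency>

<!-- add next to <dependencies>: compiles src/main/scala during the build -->
<build>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>4.3.1</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>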

2. Core code

package com.baizhi.test

import java.sql.DriverManager

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Reads data from Kafka and writes the word counts to a MySQL database.
  */
object InputFromKafkaAndOutputToMysql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("input from kafka and output to mysql").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // Set the log level to keep the console output readable
    ssc.sparkContext.setLogLevel("ERROR")

    // Kafka consumer configuration used to create the DStream
    val kafkaParams = Map[String, Object](
      // Kafka bootstrap server list
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092",
      // Consumer group
      ConsumerConfig.GROUP_ID_CONFIG -> "g2",
      // Key and value deserializers for the consumed records
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
    )
    val message = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](List("spark"), kafkaParams))
    message
      // Split each record value on spaces
      .flatMap(_.value.split(" "))
      // Map every word to a pair, e.g. (hello, 1)
      .map((_, 1))
      .reduceByKey((v1: Int, v2: Int) => v1 + v2, 1)
      .foreachRDD(rdd => {
        // Process each partition with its own JDBC connection
        rdd.foreachPartition(iter => {
          // Load the MySQL JDBC driver (5.1.x also registers itself automatically)
          Class.forName("com.mysql.jdbc.Driver")
          // MySQL connection settings
          val connection = DriverManager.getConnection("jdbc:mysql://192.168.170.1:3306/test", "root", "960619")
          // SQL statements
          val selectSql = "select * from t_word where word = ?"
          val updateSql = "update t_word set count = ? where word = ?"
          val insertSql = "insert into t_word values(?,?)"
          iter.foreach(record => { // record is a tuple -> (word, count)
            // Look the word up first: if it already exists, update its count,
            // otherwise insert it as a new row
            val queryStatement = connection.prepareStatement(selectSql)
            queryStatement.setString(1, record._1)
            val rs = queryStatement.executeQuery()
            if (rs.next()) {
              // The word exists: add this batch's count to the stored value
              val count = rs.getInt("count")
              val updateStatement = connection.prepareStatement(updateSql)
              updateStatement.setInt(1, count + record._2)
              updateStatement.setString(2, record._1)
              updateStatement.executeUpdate()
              updateStatement.close()
            } else {
              // The word does not exist yet: insert it with this batch's count
              val insertStatement = connection.prepareStatement(insertSql)
              insertStatement.setString(1, record._1)
              insertStatement.setInt(2, record._2)
              insertStatement.executeUpdate()
              insertStatement.close()
            }
            rs.close()
            queryStatement.close()
          })
          // Close the connection once the partition has been processed
          connection.close()
        })
      })

    // Start the application
    ssc.start()
    // Block until the streaming context is terminated
    ssc.awaitTermination()
  }
}
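The select-then-update/insert branch above runs up to three statements per word in every micro-batch. If word is declared as a primary (or unique) key, as in the DDL sketch from the preparation section, the same write can be expressed as a single MySQL upsert and batched per partition. A sketch of an alternative sink under that assumption (the object and method names are illustrative, not part of the original code):

import java.sql.DriverManager

// Alternative sink: upsert an entire partition of (word, count) pairs in one JDBC batch.
// Assumes t_word.word is declared PRIMARY KEY or UNIQUE (see the DDL sketch above).
object MysqlUpsertSink {
  def upsertPartition(iter: Iterator[(String, Int)]): Unit = {
    val connection = DriverManager.getConnection("jdbc:mysql://192.168.170.1:3306/test", "root", "960619")
    // A single statement covers both the "new word" and the "existing word" cases
    val upsertSql = "insert into t_word(word, count) values(?, ?) " +
      "on duplicate key update count = count + values(count)"
    val statement = connection.prepareStatement(upsertSql)
    try {
      iter.foreach { case (word, count) =>
        statement.setString(1, word)
        statement.setInt(2, count)
        statement.addBatch() // send the whole partition in one round trip
      }
      statement.executeBatch()
    } finally {
      statement.close()
      connection.close()
    }
  }
}

// Usage inside the streaming job, replacing the select/update/insert logic above:
// .foreachRDD(rdd => rdd.foreachPartition(MysqlUpsertSink.upsertPartition))

To verify either version, type lines of space-separated words into the console producer started in the preparation step; after each 5-second batch the counts stored in t_word should increase accordingly.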