Scala_Spark E-commerce Platform Offline Analysis Project: Data Generation and Consumption Testing before Requirement 7

Module 4: Real-Time Ad Traffic Statistics

Tech stack: Spark Streaming, Kafka cluster

kafka.broker.list=node01:9092,node02:9092,node03:9092
kafka.topics=AdRealTimeLog0308
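The two keys above are read through the project's ConfigurationManager. Below is a minimal sketch of the constant names the streaming code assumes; the real definitions live in the project's commons module, and the mapping here is inferred from the properties shown above.

object Constants {
  val KAFKA_BROKERS = "kafka.broker.list" // resolves to node01:9092,node02:9092,node03:9092
  val KAFKA_TOPICS  = "kafka.topics"      // resolves to AdRealTimeLog0308
}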

(I) Execution Steps

1) Test whether locally produced data can be sent to Kafka

  1. Start the ZooKeeper cluster

    [Start ZooKeeper on all three servers] Run the following on all three machines to start ZooKeeper:
    cd /export/servers/zookeeper-3.4.5-cdh5.14.0
    bin/zkServer.sh start
    The QuorumPeerMain process should now be running on each node

  2. Start the Kafka cluster

    [Start the Kafka cluster] Default port 9092
    Start the Kafka service on all three machines
    [root@node01 servers]# cd /export/servers/kafka_2.11-1.0.0/
    Foreground start: bin/kafka-server-start.sh config/server.properties
    Background start: nohup bin/kafka-server-start.sh config/server.properties > /dev/null 2>&1 &

  3. Start a consumer on node01

    With all configuration in place:

    [root@node01 ~]# kafka-console-consumer.sh --zookeeper node01:2181 --topic AdRealTimeLog0308
    
    Using the ConsoleConsumer with old consumer is deprecated and will be removed in a future major release. Consider using the new consumer by passing [bootstrap-server] instead of [zookeeper].
    
    // waiting for messages to consume
    
    
  4. Run MockRealTimeData.scala locally to simulate real-time data production
    // The records below appear in node01's console consumer; successful consumption on the cluster means the produced data reached Kafka
    1573145438221 3 3 93 5
    1573145438221 6 6 87 17
    1573145438221 0 0 10 16
    1573145438221 7 7 11 15
    1573145438221 0 0 8 18
    1573145438221 0 0 97 1
    

2) Test whether data in the Kafka cluster can be consumed locally from IDEA

  1. With the previous test passing, run AdverStat.scala in IDEA

    MockRealTimeData.scala keeps producing data to Kafka

    // Local console output; successful local consumption means reading from Kafka in IDEA works
    
    0000-00-00 00:00:01,200   WARN --- [                                              main]  org.apache.spark.streaming.kafka010.KafkaUtils                                  (line:   66)  :  overriding receive.buffer.bytes to 65536 see KAFKA-3135
    1573150923605 1 1 93 2
    1573150923605 7 7 19 0
    1573150923605 2 2 72 0
    1573150923605 1 1 3 3
    1573150923605 1 1 44 0
    1573150923605 7 7 65 15
    1573150923605 6 6 64 11
    1573150923605 9 9 20 2
    

(II) IDEA Code

1) Implementation

1. Main program: AdverStat.scala
import commons.conf.ConfigurationManager
import commons.constant.Constants
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * AdverStat.scala
 *
 * Module 4: Real-time ad traffic statistics
 * Data generation and consumption testing before Requirement 7
 *
 * Tech stack: Spark Streaming, Kafka
 */
object AdverStat {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setAppName("adverstat").setMaster("local[*]")
    val sparkSession = SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate()

    // The standard approach would be: val streamingContext = StreamingContext.getActiveOrCreate(checkpointDir, func)
    val streamingContext = new StreamingContext(sparkSession.sparkContext,Seconds(5))
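    // A hedged alternative with checkpoint-based recovery (the checkpoint path below is an assumption):
    // val streamingContext = StreamingContext.getActiveOrCreate(
    //   "./spark-streaming-checkpoint",
    //   () => new StreamingContext(sparkSession.sparkContext, Seconds(5))
    // )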

    // Kafka connection settings
    val kafka_brokers = ConfigurationManager.config.getString(Constants.KAFKA_BROKERS) //node01:9092,node02:9092,node03:9092
    val kafka_topics = ConfigurationManager.config.getString(Constants.KAFKA_TOPICS) //kafka.topics=AdRealTimeLog0308

    // Kafka consumer parameters
    val kafkaParam = Map(
      "bootstrap.servers" -> kafka_brokers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "group1",
      // auto.offset.reset
      // latest: use the committed offset if one exists; otherwise start consuming from the latest data
      // earliest: use the committed offset if one exists; otherwise start consuming from the earliest data
      // none: use the committed offset if one exists; otherwise throw an error
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false:java.lang.Boolean)
    )

    // Create the DStream
    // Each record consumed from Kafka is a message containing a key-value pair
    val adRealTimeDStream = KafkaUtils.createDirectStream[String,String](
      streamingContext,
      // Distribute Kafka partitions evenly across executors (one of three location strategies)
      LocationStrategies.PreferConsistent,
      // Consumer subscription
      ConsumerStrategies.Subscribe[String,String](Array(kafka_topics),kafkaParam)
    )

    // Extract the value of each record in the DStream
    // adReadTimeValueDStream: DStream of RDD[String]
    // String format: timestamp province city userid adid
    val adReadTimeValueDStream = adRealTimeDStream.map(item => item.value())

    // adRealTimeFilterDStream: all real-time records whose user is not on the blacklist
    val adRealTimeFilterDStream = adReadTimeValueDStream.transform {
      logRDD =>
        // blackListArray: Array[AdBlacklist], where AdBlacklist wraps a userId
        val blackListArray = AdBlacklistDAO.findAll() // queries the database through the MySQL connection pool

        // userIdArray: Array[Long] = [userId1, userId2, ...]
        val userIdArray = blackListArray.map(item => item.userid)

        // Filter out records whose user is already on the blacklist
        logRDD.filter{
          // log: timestamp province city userid adid
          case log =>
            val logSplit = log.split(" ")
            val userId = logSplit(3).toLong
            ! userIdArray.contains(userId)
        }
    }

    adRealTimeFilterDStream.foreachRDD(rdd=>rdd.foreach(println(_)))

    streamingContext.start()
    streamingContext.awaitTermination()



  }

}
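The streaming job above calls AdBlacklistDAO.findAll(), which belongs to the project's commons module and is not shown in this post. Below is a minimal sketch of the shape that call assumes; the JDBC URL, credentials, and the ad_blacklist table name are assumptions, and a plain JDBC connection stands in for the project's MySQL connection pool.

import java.sql.DriverManager
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the commons-module DAO used by AdverStat.scala
case class AdBlacklist(userid: Long)

object AdBlacklistDAO {
  def findAll(): Array[AdBlacklist] = {
    // Assumed connection details; the real project reads these from its configuration
    val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/commerce", "root", "password")
    try {
      val rs = conn.createStatement().executeQuery("SELECT userid FROM ad_blacklist")
      val buffer = ArrayBuffer[AdBlacklist]()
      while (rs.next()) {
        buffer += AdBlacklist(rs.getLong("userid"))
      }
      buffer.toArray
    } finally {
      conn.close()
    }
  }
}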

2. Simulated real-time data producer: MockRealTimeData.scala
/*
 * MockRealTimeData.scala
 */

import java.util.Properties

import commons.conf.ConfigurationManager
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object MockRealTimeData {

  /**
    * Mock data
    * timestamp: current time in milliseconds
    * userId: 0 - 99
    * province and city share the same ID: 0 - 9
    * adid: 0 - 19
    * ((0L,"北京","北京"),(1L,"上海","上海"),(2L,"南京","江苏省"),(3L,"广州","广东省"),(4L,"三亚","海南省"),(5L,"武汉","湖北省"),(6L,"长沙","湖南省"),(7L,"西安","陕西省"),(8L,"成都","四川省"),(9L,"哈尔滨","东北省"))
    * Format: timestamp province city userid adid
    * i.e. at some point in time, in some province and city, some user clicked some ad
    */
  def generateMockData(): Array[String] = {
    val array = ArrayBuffer[String]()
    val random = new Random()
    // Simulated real-time records:
    // timestamp province city userid adid
    for (i <- 0 to 50) {

      val timestamp = System.currentTimeMillis()
      val province = random.nextInt(10)
      val city = province
      val adid = random.nextInt(20)
      val userid = random.nextInt(100)

      // Assemble one record: timestamp province city userid adid
      array += timestamp + " " + province + " " + city + " " + userid + " " + adid
    }
    array.toArray
  }

  def createKafkaProducer(broker: String): KafkaProducer[String, String] = {

    // Create the producer configuration object
    val prop = new Properties()
    // Add configuration entries
    prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, broker)
    prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

    // Create and return the producer with the configuration above
    new KafkaProducer[String, String](prop)
  }
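  // A minimal sketch of the driver method; it assumes the broker list and topic are
  // read from the same configuration keys used by AdverStat.scala (kafka.broker.list /
  // kafka.topics), and the 5-second send interval is an assumption.
  def main(args: Array[String]): Unit = {
    val broker = ConfigurationManager.config.getString("kafka.broker.list")
    val topic = ConfigurationManager.config.getString("kafka.topics")

    // Create the producer once and reuse it for every batch
    val kafkaProducer = createKafkaProducer(broker)

    while (true) {
      // Send one batch of mock log lines to the topic, then pause before the next batch
      for (line <- generateMockData()) {
        kafkaProducer.send(new ProducerRecord[String, String](topic, line))
      }
      Thread.sleep(5000)
    }
  }
}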