Scala_Spark - E-Commerce Platform Offline Analysis Project - Data Generation and Consumption Testing Before Requirement 7
Module 4: Real-Time Ad Traffic Statistics
Tech stack: Spark Streaming, Kafka cluster
kafka.broker.list=node01:9092,node02:9092,node03:9092
kafka.topics=AdRealTimeLog0308
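The code in section (2) reads these keys through ConfigurationManager and Constants from the project's commons module, which are not shown in this section. A minimal stand-in sketch, assuming the constants simply name the two property keys above (the properties file name here is an assumption):

import java.util.Properties

// Assumed mapping of code-side constant names to the property keys above
object Constants {
  val KAFKA_BROKERS = "kafka.broker.list"
  val KAFKA_TOPICS = "kafka.topics"
}

// Hypothetical minimal ConfigurationManager: loads a properties file from the
// classpath and exposes getString, which is all the code below relies on
object ConfigurationManager {
  object config {
    private val props = new Properties()
    props.load(getClass.getClassLoader.getResourceAsStream("commerce.properties")) // file name assumed
    def getString(key: String): String = props.getProperty(key)
  }
}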
(1) Execution Steps
1) Test that locally produced data can be sent to Kafka
- Start the ZooKeeper cluster
  Run the following on each of the three servers:
  cd /export/servers/zookeeper-3.4.5-cdh5.14.0
  bin/zkServer.sh start
  A QuorumPeerMain process on each machine confirms ZooKeeper is up.
- Start the Kafka cluster (default port 9092)
  Start the Kafka service on all three machines:
  [root@node01 servers]# cd /export/servers/kafka_2.11-1.0.0/
  Foreground (from the bin directory): ./kafka-server-start.sh ../config/server.properties
  Background: nohup bin/kafka-server-start.sh config/server.properties > /dev/null 2>&1 &
- Start a console consumer on node01 (with all configuration in place):
  [root@node01 ~]# kafka-console-consumer.sh --zookeeper node01:2181 --topic AdRealTimeLog0308
  Using the ConsoleConsumer with old consumer is deprecated and will be removed in a future major release. Consider using the new consumer by passing [bootstrap-server] instead of [zookeeper].
  // waiting to consume
- Run MockRealTimeData.scala locally to produce mock real-time data.
  Records like the following appearing in the node01 consumer mean producing into Kafka works:
  1573145438221 3 3 93 5
  1573145438221 6 6 87 17
  1573145438221 0 0 10 16
  1573145438221 7 7 11 15
  1573145438221 0 0 8 18
  1573145438221 0 0 97 1
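Each record is a space-separated line of the form timestamp province city userid adid. For reference, an illustrative parser for this format (the case class and function are ours, not part of the project):

import scala.util.Try

// Illustrative only: turn one mock record into typed fields
case class AdLog(timestamp: Long, province: Int, city: Int, userId: Long, adId: Long)

def parseAdLog(line: String): Option[AdLog] = Try {
  val f = line.split(" ")
  AdLog(f(0).toLong, f(1).toInt, f(2).toInt, f(3).toLong, f(4).toLong)
}.toOption

// parseAdLog("1573145438221 3 3 93 5") => Some(AdLog(1573145438221,3,3,93,5))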
2) Test that IDEA can consume data from the Kafka cluster locally
- With the test above passing, run AdverStat.scala in IDEA, then run MockRealTimeData.scala to produce data into Kafka.
  Output like the following on the local console means consuming from Kafka works:
  0000-00-00 00:00:01,200 WARN --- [ main] org.apache.spark.streaming.kafka010.KafkaUtils (line: 66) : overriding receive.buffer.bytes to 65536 see KAFKA-3135
  1573150923605 1 1 93 2
  1573150923605 7 7 19 0
  1573150923605 2 2 72 0
  1573150923605 1 1 3 3
  1573150923605 1 1 44 0
  1573150923605 7 7 65 15
  1573150923605 6 6 64 11
  1573150923605 9 9 20 2
(2) IDEA Code
1) Implementation
1. Main program: AdverStat.scala
import commons.conf.ConfigurationManager
import commons.constant.Constants
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
 * AdverStat.scala
 *
 * Module 4: real-time ad traffic statistics
 * Data generation and consumption testing before requirement 7
 *
 * Tech stack: Spark Streaming, Kafka
 */
object AdverStat {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setAppName("adverstat").setMaster("local[*]")
val sparkSession = SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate()
// In a production job this would normally be val streamingContext = StreamingContext.getActiveOrCreate(checkpointDir, func), so the context can be recovered from a checkpoint
val streamingContext = new StreamingContext(sparkSession.sparkContext,Seconds(5))
// Read Kafka connection info from the project config
val kafka_brokers = ConfigurationManager.config.getString(Constants.KAFKA_BROKERS) //node01:9092,node02:9092,node03:9092
val kafka_topics = ConfigurationManager.config.getString(Constants.KAFKA_TOPICS) //kafka.topics=AdRealTimeLog0308
// Kafka consumer parameters
val kafkaParam = Map(
"bootstrap.servers" -> kafka_brokers,
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "group1",
// auto.offset.reset (with the new consumer API, committed offsets are stored in Kafka itself, not ZooKeeper)
// latest: resume from the committed offset if one exists; otherwise start from the newest data
// earliest: resume from the committed offset if one exists; otherwise start from the oldest data
// none: resume from the committed offset if one exists; otherwise throw an error
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false:java.lang.Boolean)
)
// Create the DStream: each record consumed from Kafka is a ConsumerRecord holding a key and a value
val adRealTimeDStream = KafkaUtils.createDirectStream[String,String](
streamingContext,
// Distribute Kafka partitions evenly across executors (one of three location strategies)
LocationStrategies.PreferConsistent,
// Subscribe the consumer to the topic
ConsumerStrategies.Subscribe[String,String](Array(kafka_topics),kafkaParam)
)
// Extract the value of each record in the DStream
// adReadTimeValueDStream: DStream[RDD, RDD, ...], each an RDD[String]
// String: timestamp province city userid adid
val adReadTimeValueDStream = adRealTimeDStream.map(item => item.value())
// adRealTimeFilterDStream: all real-time records whose user is not on the blacklist
val adRealTimeFilterDStream =adReadTimeValueDStream.transform{
logRDD =>
// blackListArray: Array[AdBlacklist], where AdBlacklist wraps a userid
val blackListArray = AdBlacklistDAO.findAll() // hits the database via the MySQL connection pool
// userIdArray: Array[Long] = [userId1, userId2, ...]
val userIdArray = blackListArray.map(item => item.userid)
// filter out records whose user is already blacklisted
logRDD.filter{
// log: timestamp province city userid adid
case log =>
val logSplit = log.split(" ")
val userId = logSplit(3).toLong
!userIdArray.contains(userId)
}
}
adRealTimeFilterDStream.foreachRDD(rdd => rdd.foreach(println(_)))
streamingContext.start()
streamingContext.awaitTermination()
}
}
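AdverStat references AdBlacklist and AdBlacklistDAO, which live elsewhere in the project and are not shown in this section. A sketch of the shapes the code above relies on, inferred purely from usage (the real DAO queries MySQL through a connection pool):

// Inferred from the calls above: item.userid and AdBlacklistDAO.findAll()
case class AdBlacklist(userid: Long)

object AdBlacklistDAO {
  // Placeholder body: the real implementation runs a query such as
  // "SELECT userid FROM ad_blacklist" against MySQL and maps each row to AdBlacklist
  def findAll(): Array[AdBlacklist] = Array.empty[AdBlacklist]
}

The comment in main also notes that a production job would normally use StreamingContext.getActiveOrCreate; a hedged sketch of that pattern (the checkpoint path is an assumption):

val checkpointDir = "./spark-streaming-checkpoint" // assumed path
val creatingFunc = () => {
  val ssc = new StreamingContext(sparkSession.sparkContext, Seconds(5))
  ssc.checkpoint(checkpointDir)
  ssc
}
val streamingContext = StreamingContext.getActiveOrCreate(checkpointDir, creatingFunc)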
2. Mock real-time data producer: MockRealTimeData.scala
/*
* MockRealTimeData.scala
*/
import java.util.Properties
import commons.conf.ConfigurationManager
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import scala.collection.mutable.ArrayBuffer
import scala.util.Random
object MockRealTimeData {
/**
 * Mock data:
 * timestamp: current time in milliseconds
 * userid: 0 - 99
 * province / city (same ID): 0 - 9
 * adid: 0 - 19
 * ((0L,"北京","北京"),(1L,"上海","上海"),(2L,"南京","江苏省"),(3L,"广州","广东省"),(4L,"三亚","海南省"),(5L,"武汉","湖北省"),(6L,"长沙","湖南省"),(7L,"西安","陕西省"),(8L,"成都","四川省"),(9L,"哈尔滨","东北省"))
 * Format: timestamp province city userid adid
 * i.e. at some time, in some province and city, some user saw some ad
 */
def generateMockData(): Array[String] = {
val array = ArrayBuffer[String]()
val random = new Random()
// Generate mock real-time records:
// timestamp province city userid adid
for (i <- 0 to 50) {
val timestamp = System.currentTimeMillis()
val province = random.nextInt(10)
val city = province
val adid = random.nextInt(20)
val userid = random.nextInt(100)
// Assemble one record
array += timestamp + " " + province + " " + city + " " + userid + " " + adid
}
array.toArray
}
def createKafkaProducer(broker: String): KafkaProducer[String, String] = {
// Create the producer config object
val prop = new Properties()
// Set producer properties
prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, broker)
prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
// Build and return the producer from the config above
new KafkaProducer[String, String](prop)
}
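// Hedged sketch (not in the original post): a driver loop tying generateMockData
// and createKafkaProducer together. The property keys match the config at the top
// of this section; the 5-second pacing between batches is an assumption.
def main(args: Array[String]): Unit = {
  val brokers = ConfigurationManager.config.getString("kafka.broker.list")
  val topic = ConfigurationManager.config.getString("kafka.topics")
  val producer = createKafkaProducer(brokers)
  while (true) {
    // send one batch of mock records, then pause before the next batch
    for (record <- generateMockData()) {
      producer.send(new ProducerRecord[String, String](topic, record))
    }
    Thread.sleep(5000) // assumed interval
  }
}
}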