What counts as a daily active user (DAU):
- Common definition: any user who opens the app counts as active, regardless of how the app is used. A device that opens the app several times in one day is counted as one active user, so only the first launch of the day needs to be recorded.
- Game definition: the number of users who open/log into the game each day (how DAU is defined for games).
We adopt the first definition. The DAU computation works as follows:
- Read user startup logs from Kafka
- Keep only each user's first launch of the day and filter out the other launch records, with the help of Redis
- Save the first-launch records to HBase so other applications can query them
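For reference, a single startup log consumed from Kafka might look like the JSON below; the field names follow the StartupLog case class defined later, and the values are made-up examples:
{"mid":"mid_001","uid":"u_001","appId":"gmall1015","area":"beijing","os":"android","channel":"huawei","logType":"startup","version":"1.0","ts":1600650000000}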
Create the processing module
Module name: gmall-realtime
Add the dependencies (the Spark artifacts carry no version here, so their versions are presumably managed by the parent gmall pom):
<dependencies>
    <dependency>
        <groupId>org.example</groupId>
        <artifactId>gmall-common</artifactId>
        <version>1.0-SNAPSHOT</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    </dependency>
    <dependency>
        <groupId>redis.clients</groupId>
        <artifactId>jedis</artifactId>
        <version>2.9.0</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <!-- Packaging plugin for the project -->
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
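With the assembly plugin bound to the package phase, a normal build also produces the fat jar (the exact file name depends on this module's coordinates; the name below is a guess based on gmall-common's 1.0-SNAPSHOT version):
mvn clean package
# produces target/gmall-realtime-1.0-SNAPSHOT-jar-with-dependencies.jar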
Reading data from Kafka
- Config file: config.properties
# Kafka settings
kafka.servers=hadoop12:9092,hadoop13:9092,hadoop14:9092
kafka.group.id=gmall1015
- Config file: log4j.properties
log4j.appender.donglin.MyConsole=org.apache.log4j.ConsoleAppender
log4j.appender.donglin.MyConsole.target=System.err
log4j.appender.donglin.MyConsole.layout=org.apache.log4j.PatternLayout
log4j.appender.donglin.MyConsole.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %10p (%c:%M) - %m%n
log4j.rootLogger=error,donglin.MyConsole
- ConfigUtil, for reading the config file
package com.donglin.gmall.realtime.util

import java.util.Properties

object ConfigUtil {
  private val is = ConfigUtil.getClass.getClassLoader.getResourceAsStream("config.properties")
  private val props = new Properties()
  props.load(is)

  def getProperty(key: String): String = props.getProperty(key)

  /* def main(args: Array[String]): Unit = {
       println(getProperty("kafka.servers"))
     } */
}
- MykafkaUtil
package com.donglin.gmall.realtime.util

import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

object MykafkaUtil {
  val params = Map[String, String](
    ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> ConfigUtil.getProperty("kafka.servers"),
    ConsumerConfig.GROUP_ID_CONFIG -> ConfigUtil.getProperty("kafka.group.id")
  )

  // Returns a DStream of the raw message values (the JSON strings)
  def getKafkaStream(ssc: StreamingContext, topic: String, otherTopics: String*): DStream[String] = {
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc,
      params,
      (otherTopics :+ topic).toSet
    ).map(_._2)
  }
}
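Usage from a streaming app is then a one-liner (this is exactly how DauApp below consumes the startup topic):
val sourceStream = MykafkaUtil.getKafkaStream(ssc, Constant.TOPIC_STARTUP)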
- Case class: StartupLog
To make the data easier to work with, each user startup record is wrapped in a case class.
package com.donglin.gmall.realtime.bean

import java.text.SimpleDateFormat
import java.util.Date

case class StartupLog(mid: String,
                      uid: String,
                      appId: String,
                      area: String,
                      os: String,
                      channel: String,
                      logType: String,
                      version: String,
                      ts: Long,
                      var logDate: String = "", // e.g. 2020-03-20
                      var logHour: String = "") { // e.g. 10, 11
  val d = new Date(ts)
  logDate = new SimpleDateFormat("yyyy-MM-dd").format(d)
  logHour = new SimpleDateFormat("HH").format(d)
}
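A quick sanity check of the derived fields (a standalone sketch; the JSON string is a made-up example, and fastjson comes in through the gmall-common dependency):
import com.alibaba.fastjson.JSON
import com.donglin.gmall.realtime.bean.StartupLog

object StartupLogDemo {
  def main(args: Array[String]): Unit = {
    val jsonStr = """{"mid":"mid_001","uid":"u_001","appId":"gmall1015","area":"beijing","os":"android","channel":"huawei","logType":"startup","version":"1.0","ts":1600650000000}"""
    val log = JSON.parseObject(jsonStr, classOf[StartupLog])
    println(log.logDate) // date derived from ts, e.g. 2020-09-21
    println(log.logHour) // hour derived from ts, e.g. 01
  }
}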
- DauApp
package com.donglin.gmall.realtime.app

import com.alibaba.fastjson.JSON
import com.donglin.gmall.common.Constant
import com.donglin.gmall.realtime.bean.StartupLog
import com.donglin.gmall.realtime.util.MykafkaUtil
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DauApp {
  def main(args: Array[String]): Unit = {
    // 1. Consume data from Kafka
    val conf = new SparkConf().setMaster("local[*]").setAppName("DauApp")
    val ssc = new StreamingContext(conf, Seconds(3))
    val sourceStream = MykafkaUtil.getKafkaStream(ssc, Constant.TOPIC_STARTUP)
    // 1.1 Wrap each record in the case class; fastjson makes parsing the JSON strings easy
    val startupLogStream = sourceStream.map(jsonStr => JSON.parseObject(jsonStr, classOf[StartupLog]))
    startupLogStream.print(1000)
    ssc.start()
    ssc.awaitTermination()
  }
}
If you are not sure how to start everything, go to the top and check the table of contents (module startup order).
Result: each 3-second batch prints the parsed StartupLog records to the console.
Cleaning and filtering the data (dedup with Redis)
The dedup logic
1. Save the ids of devices that have already started to Redis; with a set, each mid is kept exactly once.
Type:  set
Key:   "topic_startup:" + date, e.g. "topic_startup:" + 2020-09-21
Value: mid1, mid2, ...
How the Redis dedup is implemented
- config.properties (add the Redis settings)
# Kafka settings
kafka.servers=hadoop12:9092,hadoop13:9092,hadoop14:9092
kafka.group.id=gmall1015
# Redis settings
redis.host=hadoop12
redis.port=6379
- RedisUtil
package com.donglin.gmall.realtime.util

import redis.clients.jedis.Jedis

object RedisUtil {
  val host = ConfigUtil.getProperty("redis.host")
  val port = ConfigUtil.getProperty("redis.port").toInt

  def getClient = {
    val client = new Jedis(host, port, 60 * 1000)
    client.connect()
    client
  }
}
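Note that getClient opens a brand-new connection (60 s timeout) on every call, which is fine for this demo; for heavier use a JedisPool could be swapped in. A sketch under that assumption (not part of the original project; it sits next to ConfigUtil in the util package):
package com.donglin.gmall.realtime.util

import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

object RedisPoolUtil {
  private val pool = new JedisPool(
    new JedisPoolConfig,
    ConfigUtil.getProperty("redis.host"),
    ConfigUtil.getProperty("redis.port").toInt
  )

  // Borrow a connection from the pool; callers still call close(),
  // which returns the connection to the pool instead of destroying it
  def getClient: Jedis = pool.getResource
}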
- Completing DauApp
package com.donglin.gmall.realtime.app

import java.text.SimpleDateFormat
import java.util.Date

import com.alibaba.fastjson.JSON
import com.donglin.gmall.common.Constant
import com.donglin.gmall.realtime.bean.StartupLog
import com.donglin.gmall.realtime.util.{MykafkaUtil, RedisUtil}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DauApp {
  def main(args: Array[String]): Unit = {
    // 1. Consume data from Kafka
    val conf = new SparkConf().setMaster("local[*]").setAppName("DauApp")
    val ssc = new StreamingContext(conf, Seconds(3))
    val sourceStream = MykafkaUtil.getKafkaStream(ssc, Constant.TOPIC_STARTUP)
    // 1.1 Wrap each record in the case class; fastjson makes parsing the JSON strings easy
    val startupLogStream = sourceStream.map(jsonStr => JSON.parseObject(jsonStr, classOf[StartupLog]))
    // 2. Filter and dedup to get the daily-active detail records
    // 2.1 Dedup with the help of a third-party store: Redis
    val firstStartupLogStream = startupLogStream.transform(rdd => {
      // 2.2 Read the already-started devices from Redis. The key must match the one
      //     written below ("topic_startup:" + date), and the pattern must be
      //     yyyy-MM-dd, not yyyy-MM-DD (DD is day-of-year)
      val client = RedisUtil.getClient
      val key = Constant.TOPIC_STARTUP + ":" + new SimpleDateFormat("yyyy-MM-dd").format(new Date())
      val mids = client.smembers(key)
      client.close()
      // 2.3 Filter out the already-started devices; keep only records whose mid is not yet in Redis
      val midsBD = ssc.sparkContext.broadcast(mids)
      // 2.4 The same mid may start several times within one batch (all of them "first" starts),
      //     which would produce duplicates, so keep only the earliest record per mid
      //     (a reduceByKey variant is sketched after this listing)
      rdd
        .filter(log => !midsBD.value.contains(log.mid))
        .map(log => (log.mid, log))
        .groupByKey()
        .map {
          // case (_, it) => it.toList.sortBy(_.ts).head
          case (_, it) => it.toList.minBy(_.ts)
        }
    })
    // 3. Save the first-start devices of each batch to Redis
    firstStartupLogStream.foreachRDD(rdd => {
      rdd.foreachPartition(logIt => {
        // Get one connection per partition
        val client = RedisUtil.getClient
        logIt.foreach(log => {
          // Add each mid to the day's set
          client.sadd(Constant.TOPIC_STARTUP + ":" + log.logDate, log.mid)
        })
        client.close()
      })
    })
    firstStartupLogStream.print(1000)
    ssc.start()
    ssc.awaitTermination()
  }
}
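A small design note: groupByKey materializes every record of a mid in memory before the earliest one is picked. A reduceByKey that keeps the smaller-ts record per key does the same job while combining map-side (a sketch using the same rdd and midsBD names as in the transform above):
// Equivalent within-batch dedup: keep the record with the smallest ts per mid
rdd
  .filter(log => !midsBD.value.contains(log.mid))
  .map(log => (log.mid, log))
  .reduceByKey((a, b) => if (a.ts <= b.ts) a else b)
  .map(_._2)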
Module startup order
If you hit an error at this point, try lowering the dependency versions a bit.