This article draws on blogger yangtong123's project on GitHub:
https://github.com/yangtong123/RoadOfStudySpark
(A Road of Studying Spark). The code here has been slightly modified, and is shared for learning and exchange only.
Problem Introduction
E-commerce is booming, and analyzing users' consumption behavior is a natural way to learn their preferences and habits. Since real user data is hard to obtain, I followed yangtong123's approach and used Scala to simulate a user-behavior dataset, then analyzed the behavior in it. The results are for learning purposes only.
Simulating the Experimental Dataset
package com.spark.sql.news

import java.io.{FileOutputStream, OutputStreamWriter, PrintWriter}
import java.text.SimpleDateFormat
import java.util.{Calendar, Date}

import scala.util.Random

object OfflineDataGenerator {
  def main(args: Array[String]): Unit = {
    val buffer = new StringBuilder("")
    val sdf = new SimpleDateFormat("yyyy-MM-dd")
    val random = new Random
    val sections = Array[String]("Electronic",
      "Clothing", "Books", "Home Appliances", "Foods",
      "Sports", "Toys", "BeautyProducts", "Furniture", "DigitalMedia")
    val actions = Array[String]("view", "purchase", "add_to_Cart",
      "select", "add_to_WishList", "dislike")
    // Walk back one day at a time, starting from yesterday, for 20 days
    val cal = Calendar.getInstance()
    cal.setTime(new Date())
    for (_ <- 1 to 20) {
      cal.add(Calendar.DAY_OF_YEAR, -1)
      val day = cal.getTime
      val date = sdf.format(day)
      // Generate 3000 access records for this day
      for (_ <- 0 until 3000) {
        // Timestamp of the record
        val timestamp = new Date().getTime
        // Random userId, drawn from a pool of 100000 ids
        val userId = String.valueOf(random.nextInt(100000))
        // Random pageId
        val pageId = random.nextInt(10000)
        // Random category and action
        val section = sections(random.nextInt(10))
        val action = actions(random.nextInt(6))
        buffer.append(date).append(",")
          .append(timestamp).append(",")
          .append(userId).append(",")
          .append(pageId).append(",")
          .append(section).append(",")
          .append(action).append("\n")
      }
      // Generate 10 to 19 register records for this day
      val ra = random.nextInt(10) + 10
      for (_ <- 0 until ra) {
        val timestamp = new Date().getTime
        // Registering users have no userId, pageId, or category yet
        val userId: String = null
        val pageId: String = null
        val section: String = null
        val action = "register"
        buffer.append(date).append(",")
          .append(timestamp).append(",")
          .append(userId).append(",")
          .append(pageId).append(",")
          .append(section).append(",")
          .append(action).append("\n")
      }
      // Append this day's records to the output file
      var pw: PrintWriter = null
      try {
        pw = new PrintWriter(new OutputStreamWriter(
          new FileOutputStream("fill-in-your-output-path", true)))
        pw.write(buffer.toString)
        // Clear the buffer so earlier days are not written twice
        buffer.clear()
      } catch {
        case e: Exception => e.printStackTrace()
      } finally {
        if (pw != null) pw.close()
      }
    }
  }
}
(Each record has six fields: date, timestamp, userId, pageId, category, and action. A dozen or so new users are generated each day, and a new user's missing fields are written as null. The data is saved under the specified path.) After generation, upload the file to HDFS.
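With this generator, each line of the log looks roughly like the following (values are illustrative):

2024-05-01,1714550400000,84213,4721,Books,view
2024-05-01,1714550400000,null,null,null,register

Note that the register lines contain the literal string null; it fails to parse under the IntegerType columns declared below, so with Spark's default permissive CSV mode those fields come back as SQL NULL, which is exactly what the isNotNull filters later rely on.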
Generating the Data Table
package com.spark.sql.news

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, LongType,
  StringType, StructField, StructType}

object NewsOfflineStatSpark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[6]")
      .setAppName("spark-sql-01")
    val sparkSession = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()
    import sparkSession.implicits._

    val schema = StructType(
      Seq(
        StructField("date", StringType),
        StructField("dateLong", LongType),
        StructField("userId", IntegerType),
        StructField("pageId", IntegerType),
        StructField("category", StringType),
        StructField("action", StringType)
      )
    )
    val frmUsers = sparkSession.read
      .schema(schema)
      .option("sep", ",")      // field delimiter
      .option("header", false) // no header row
      .option("quote", "\"")
      .option("escape", "\\")
      .csv("hdfs://single01:9000/news/access_test.log/") // use your own HDFS path
      .repartition(4) // the dataset is small, so shrink from the default of 200 partitions
      .cache()        // cache to avoid recomputing the scan across the queries below
    frmUsers.createTempView("news_access")
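Before running the queries below, it may be worth a quick sanity check that the load worked; a minimal sketch using standard DataFrame methods:

    frmUsers.printSchema()
    frmUsers.show(5, truncate = false)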
Data Analysis (this article uses Spark SQL's functional DataFrame API)
    // Find the users with the most category interactions
    frmUsers
      .filter($"userId".isNotNull)
      .groupBy($"userId")
      .agg(
        count($"category").as("category_cnt")
      )
      .orderBy($"category_cnt".desc)
      .show(30)

    // Rank the most active users over the last month
    frmUsers
      .filter($"userId".isNotNull)
      .where(datediff(current_date(), $"date") < 30)
      .groupBy($"userId")
      .agg(
        count($"userId").as("cnt")
      )
      .select($"userId", $"cnt",
        dense_rank().over(Window.orderBy($"cnt".desc)).as("rank")
      )
      .where($"rank" < 3)
      .show(1000)
    // Find the most popular categories
    frmUsers
      .filter($"userId".isNotNull)
      .groupBy($"category")
      .agg(
        count($"category").as("cat_cnt")
      )
      .select($"category", $"cat_cnt",
        dense_rank().over(Window.orderBy($"cat_cnt".desc)).as("rnk")
      )
      .where($"rnk" <= 2)
      .show(1000)
  }
}
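Note that count($"category") in the first query counts interaction records, not distinct categories. If the goal is literally the users who browsed the most different categories, a countDistinct variant fits better; a minimal sketch, assuming the same frmUsers and imports as above:

    // Users ranked by how many *distinct* categories they interacted with
    frmUsers
      .filter($"userId".isNotNull)
      .groupBy($"userId")
      .agg(countDistinct($"category").as("distinct_categories"))
      .orderBy($"distinct_categories".desc)
      .show(30)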
Of course, you can also query directly in SQL.
For example (looking up one user's behavior):

    sparkSession.sql(
      """
        |select *
        |from news_access
        |where userId = 18617
        |""".stripMargin).show(10)
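For comparison, the "most popular categories" query from the DataFrame section can also be written against the registered view; a sketch assuming the same news_access temp view:

    sparkSession.sql(
      """
        |select category,
        |       count(*) as cat_cnt,
        |       dense_rank() over (order by count(*) desc) as rnk
        |from news_access
        |where userId is not null
        |group by category
        |""".stripMargin)
      .where("rnk <= 2")
      .show()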
Further Thoughts and Exercises
1. Find the three categories with the highest conversion rate (purchases divided by the total count of all actions: add-to-cart, wishlist, select, purchase, and so on) and their ratios; a possible sketch follows this list.
2. Find the user with the lowest (or highest) purchase rate and that rate.
3. Find high-value (or low-value) target users (those with purchase behavior in the last quarter) and their purchase counts.
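For exercise 1, one possible starting point is a conditional aggregation; a minimal sketch, assuming the same frmUsers DataFrame and imports as above, and taking the conversion rate to be purchases over all recorded actions per category:

    // Conversion rate per category: purchase count / total action count
    frmUsers
      .filter($"userId".isNotNull)
      .groupBy($"category")
      .agg(
        sum(when($"action" === "purchase", 1).otherwise(0)).as("purchases"),
        count($"action").as("total_actions")
      )
      .withColumn("purchase_rate", $"purchases" / $"total_actions")
      .orderBy($"purchase_rate".desc)
      .show(3)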
Corrections and feedback on any issues are welcome.