1. 数据准备
1.1 数据说明
💡 数据集见文末!
数据集为活品交易数据集,每个订单可能包含多个货品,每个订单可以产生多次交易,不同的货品有不同的单价,表结构如下图:
1.2 案例类准备
case class tbDate (dateid:String, years:Int, theyear:Int, month:Int, day:Int, weekday:Int, week:Int, quarter:Int, period:Int, halfmonth:Int) extends Serializable
case class tbStockDetail (ordernumber:String, rownum:Int, itemid:String, number:Int, price:Double, amount:Double) extends Serializable
case class tbStock (ordernumber:String,locationid:String,dateid:String) extends Serializable
1.3 加载数据
val tbDateRdd = sc.textFile("PracticalExercisesData/tbDate.txt")
val tbDateDs = tbDateRdd.map(_.split(","))
.map(attr => tbDate(attr(0), attr(1).trim().toInt, attr(2).trim().toInt, attr(3).trim().toInt, attr(4).trim().toInt, attr(5).trim().toInt, attr(6).trim().toInt, attr(7).trim().toInt, attr(8).trim().toInt, attr(9).trim().toInt)).toDS()
val tbStockRdd = sc.textFile("PracticalExercisesData/tbStock.txt")
val tbStockDs = tbStockRdd.map(_.split(","))
.map(attr => tbStock(attr(0), attr(1), attr(2))).toDS()
val tbStockDetailRdd = sc.textFile("PracticalExercisesData/tbStockDetail.txt")
val tbStockDetailDs = tbStockDetailRdd.map(_.split(","))
.map(attr => tbStockDetail(attr(0), attr(1).trim().toInt, attr(2), attr(3).trim().toInt, attr(4).trim().toDouble, attr(5).trim().toDouble)).toDS
1.4 创建视图
tbDateDs.createOrReplaceTempView("tbDate")
tbStockDs.createOrReplaceTempView("tbStock")
tbStockDetailDs.createOrReplaceTempView("tbStockDetail")
2. 业务分析
2.1 计算所有订单中每年的销售单数、销售总额
spark.sql(
"""
| select
| t3.theyear,
| count(distinct t1.ordernumber),
| sum(t2.amount)
| from tbStock t1
| join tbStockDetail t2 on t1.ordernumber = t2.ordernumber
| join tbDate t3 on t1.dateid = t3.dateid
| group by t3.theyear
| order by t3.theyear
|""".stripMargin)
2.2 计算所有订单每年最大金额订单的销售额
1️⃣ 统计每年,每个订单一共有多少销售额
spark.sql(
"""
| select
| t1.dateid,
| t1.ordernumber,
| sum(t2.amount) SumOfAmount
| from tbStock t1
| join tbStockDetail t2 on t1.ordernumber = t2.ordernumber
| group by t1.dateid, t1.ordernumber
|""".stripMargin).createOrReplaceTempView("t3")
2️⃣ 以上一步查询结果为基础表,和表 tbDate 使用 dateid join,求出每年最大金额订单的销售额
spark.sql(
"""
| select
| t4.theyear,
| max(t3.SumOfAmount) MaxOfAmount
| from t3
| join tbDate t4 on t3.dateid = t4.dateid
| group by theyear
| order by theyear desc
|""".stripMargin)
2.3 计算所有订单中每年最畅销货品
1️⃣ 求出每年每个货品的销售额
spark.sql(
"""
| select
| t3.theyear,
| t2.itemid,
| sum(t2.amount) SumOfAmount
| from tbStock t1
| join tbStockDetail t2 on t1.ordernumber = t2.ordernumber
| join tbDate t3 on t1.dateid = t3.dateid
| group by t3.theyear, t2.itemid
|""".stripMargin).createOrReplaceTempView("t4")
2️⃣ 在第一步的基础上,统计每年单个货品中的最大金额
spark.sql(
"""
| select
| t4.theyear,
| max(t4.SumOfAmount) MaxOfAmount
| from t4
| group by t4.theyear
|""".stripMargin).createOrReplaceTempView("t5")
3️⃣ 用最大销售额和统计好的每个货品的销售额join,以及用年join,集合得到最畅销货品那一行信息
spark.sql(
"""
| select
| t4.theyear,
| t4.itemid,
| t5.MaxOfAmount
| from
| t4 join t5 on t4.theyear = t5.theyear
| and t4.SumOfAmount = t5.MaxOfAmount
| order by t4.theyear
|""".stripMargin)
🌈 数据集:
链接:https://pan.baidu.com/s/1vVituBq8ZGiCVBDCYu6sgg
提取码:czck
❤️ END ❤️