Apache Spark 练习四：使用Spark分析挖掘零售交易数据

最新推荐文章于 2023-12-03 08:48:00 发布

liulizhi1996

最新推荐文章于 2023-12-03 08:48:00 发布

阅读量1k

点赞数

分类专栏： Spark 文章标签： spark 零售大数据

本文链接：https://blog.csdn.net/liulizhi1996/article/details/128399454

版权

Spark 专栏收录该内容

7 篇文章 4 订阅

订阅专栏

一、源数据

本章所分析的数据来自于一家英国的零售电商平台在01/12/2010至09/12/2011期间的交易数据（https://www.kaggle.com/datasets/carrie1/ecommerce-data），每条记录由8个属性组成，具体的含义如下表所示：

字段名称	类型	含义	举例
InvoiceNo	string	订单编号（退货订单以C开头）	536365
StockCode	string	产品代码	85123A
Description	string	产品描述	WHITE METAL LANTERN
Quantity	integer	购买数量（负数表示退货）	6
InvoiceDate	string	订单日期和时间	12/1/2010 8:26
UnitPrice	double	单价（英镑）	3.39
CustomerID	integer	客户编号	17850
Country	string	国家名称	United Kingdom

二、练习题

0. 数据预处理

我们将原始文件存放在HDFS的/SparkLearning目录下，然后从HDFS中读取数据到DataFrame格式。由于顾客编号CustomID和商品描述Description均存在部分缺失，所以需要先进行数据清洗，过滤掉有缺失值的记录。

// 加载Spark会话
val spark = SparkSession
  .builder()
  .appName("E-Commerce")
  .getOrCreate()
import spark.implicits._

// 读取源数据
val raw_data = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///SparkLearning/E_Commerce_Data.csv")
// 过滤掉有缺失值的记录
val df = raw_data
  .filter($"CustomerID" =!= 0)
  .filter($"Description" =!= "")

1. 客户数最多的10个国家

val res = df
  .select($"CustomerID", $"Country")
  .distinct()
  .groupBy($"Country")
  .count()
  .withColumnRenamed("count", "CustomerNum")
  .sort($"CustomerNum".desc)
  .limit(10)

2. 销量最高的10个国家

val res = df
  .select($"Quantity", $"Country")
  .groupBy($"Country")
  .sum("Quantity")
  .withColumnRenamed("sum(Quantity)", "totalQuantity")
  .sort($"totalQuantity".desc)
  .limit(10)

3. 各个国家的总销售额分布情况

val res = df
  .select($"Country", ($"Quantity" * $"UnitPrice").as("Revenue"))
  .groupBy($"Country")
  .sum("Revenue")
  .withColumnRenamed("sum(Revenue)", "totalRevenue")
  .orderBy($"totalRevenue".desc)

4. 销量最高的10个商品

val res = df
  .select($"StockCode", $"Description", $"Quantity")
  .groupBy($"StockCode", $"Description")
  .sum("Quantity")
  .withColumnRenamed("sum(Quantity)", "totalQuantity")
  .orderBy($"totalQuantity".desc)
  .limit(10)

5. 商品描述的热门关键词Top300

val res = df
  .select($"Description")
  .flatMap(row => row.getString(0).split(" "))
  .map(word => word.toLowerCase)
  .filter(word => word.nonEmpty)
  .toDF("keyword")
  .groupBy($"keyword")
  .count()
  .orderBy($"count".desc)
  .limit(300)

6. 退货订单数最多的10个国家

val res = df
  .filter($"Quantity" < 0)
  .select($"Country", $"InvoiceNo")
  .distinct()
  .groupBy($"Country")
  .count()
  .orderBy($"count".desc)
  .limit(10)

7. 月销售额随时间的变化趋势

val res = df
  .select(
    date_format(to_timestamp($"InvoiceDate", "M/d/yyyy H:mm"), "yyyy-MM").as("month"),
    ($"Quantity" * $"UnitPrice").as("revenue")
  )
  .groupBy($"month")
  .sum("revenue")
  .withColumnRenamed("sum(revenue)", "totalRevenue")
  .orderBy($"month")

8. 日销量随时间的变化趋势

val res = df
  .select(
    to_date(to_timestamp($"InvoiceDate", "M/d/yyyy H:mm")).as("date"),
    $"Quantity"
  )
  .groupBy($"date")
  .sum("Quantity")
  .withColumnRenamed("sum(Quantity)", "totalQuantity")
  .orderBy($"date")

9. 各国的购买订单量和退货订单量的关系

val res = df
  .select("Country", "InvoiceNo")
  .distinct()
  .selectExpr(
    "Country",
    "IF(InvoiceNo NOT REGEXP '^C', 1, 0) AS purchase",
    "IF(InvoiceNo REGEXP '^C', 1, 0) AS refund"
  )
  .groupBy("Country")
  .agg(
    Map(
      "purchase" -> "sum",
      "refund" -> "sum"
    )
  )

10. 商品的平均单价与销量的关系

val res = df
  .select("StockCode", "UnitPrice", "Quantity")
  .groupBy("StockCode")
  .agg(
    (sum_distinct($"UnitPrice") / count_distinct($"UnitPrice")).as("avgPrice"),
    sum($"Quantity").as("totalQuantity")
  )
  .select("StockCode", "avgPrice", "totalQuantity")