This article draws on blogger yangtong123's project on GitHub:
https://github.com/yangtong123/RoadOfStudySpark
(A Road of Studying Spark). The code here has been slightly modified, and is shared for learning and exchange only.
Problem Introduction
E-commerce is booming, and analyzing users' consumption behavior is a natural way to learn their preferences and habits. Since real user data is hard to obtain, I followed yangtong123's approach and used Scala to simulate a user-behavior dataset, then analyzed the behavior in it. The results are for learning purposes only.
Simulating the Experimental Dataset
package com.spark.sql.news

import java.io.{FileOutputStream, OutputStreamWriter, PrintWriter}
import java.text.SimpleDateFormat
import java.util.{Calendar, Date}

import scala.util.Random

object OfflineDataGenerator {
  def main(args: Array[String]): Unit = {
    val buffer = new StringBuilder("")
    val sdf = new SimpleDateFormat("yyyy-MM-dd")
    val random = new Random
    val sections = Array[String]("Electronic",
      "Clothing", "Books", "Home Appliances", "Foods",
      "Sports", "Toys", "BeautyProducts", "Furniture", "DigitalMedia")
    val actions = Array[String]("view", "purchase", "add_to_Cart",
      "select", "add_to_WishList", "dislike")
    // Walk back one day at a time, starting from yesterday, for 20 days
    val cal = Calendar.getInstance()
    cal.setTime(new Date())
    for (_ <- 1 to 20) {
      cal.add(Calendar.DAY_OF_YEAR, -1)
      val day = cal.getTime
      val date = sdf.format(day)
      // Generate 3000 access records for this day
      for (_ <- 0 until 3000) {
        // Timestamp of the record
        val timestamp = new Date().getTime
        // Random userId, drawn from a pool of 100000 ids
        val userId = String.valueOf(random.nextInt(100000))
        // Random pageId
        val pageId = random.nextInt(10000)
        // Random category and action
        val section = sections(random.nextInt(10))
        val action = actions(random.nextInt(6))
        buffer.append(date).append(",")
          .append(timestamp).append(",")
          .append(userId).append(",")
          .append(pageId).append(",")
          .append(section).append(",")
          .append(action).append("\n")
      }
      // Generate 10 to 19 register records for this day
      val ra = random.nextInt(10) + 10
      for (_ <- 0 until ra) {
        val timestamp = new Date().getTime
        // Registering users have no userId, pageId, or category yet
        val userId: String = null
        val pageId: String = null
        val section: String = null
        val action = "register"
        buffer.append(date).append(",")
          .append(timestamp).append(",")
          .append(userId).append(",")
          .append(pageId).append(",")
          .append(section).append(",")
          .append(action).append("\n")
      }
      // Append this day's records to the output file
      var pw: PrintWriter = null
      try {
        pw = new PrintWriter(new OutputStreamWriter(
          new FileOutputStream("fill-in-your-output-path", true)))
        pw.write(buffer.toString)
        // Clear the buffer so earlier days are not written twice
        buffer.clear()
      } catch {
        case e: Exception => e.printStackTrace()
      } finally {
        if (pw != null) pw.close()
      }
    }
  }
}
(Each record has six fields: date, timestamp, userId, pageId, category, and action. A dozen or so new users are generated each day, and a new user's missing fields are written as null. The data is saved under the specified path.) After generation, upload the file to HDFS.
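With this generator, each line of the log looks roughly like the following (values are illustrative):

2024-05-01,1714550400000,84213,4721,Books,view
2024-05-01,1714550400000,null,null,null,register

Note that the register lines contain the literal string null; it fails to parse under the IntegerType columns declared below, so with Spark's default permissive CSV mode those fields come back as SQL NULL, which is exactly what the isNotNull filters later rely on.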
Generating the Data Table
package com.spark.sql.news

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, LongType,
  StringType, StructField, StructType}

object NewsOfflineStatSpark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[6]")
      .setAppName("spark-sql-01")
    val sparkSession = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()
    import sparkSession.implicits._

    val schema = StructType(
      Seq(
        StructField("date", StringType),
        StructField("dateLong", LongType),
        StructField("userId", IntegerType),
        StructField("pageId", IntegerType),
        StructField("category", StringType),
        StructField("action", StringType)
      )
    )
    val frmUsers = sparkSession.read
      .schema(schema)
      .option("sep", ",")      // field delimiter
      .option("header", false) // no header row
      .option("quote", "\"")
      .option("escape", "\\")
      .csv("hdfs://single01:9000/news/access_test.log/") // use your own HDFS path
      .repartition(4) // the dataset is small, so shrink from the default of 200 partitions
      .cache()        // cache to avoid recomputing the scan across the queries below
    frmUsers.createTempView("news_access")
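Before running the queries below, it may be worth a quick sanity check that the load worked; a minimal sketch using standard DataFrame methods:

    frmUsers.printSchema()
    frmUsers.show(5, truncate = false)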
Data Analysis (this article uses Spark SQL's functional DataFrame API)
    // Find the users with the most category interactions
    frmUsers
      .filter($"userId".isNotNull)
      .groupBy($"userId")
      .agg(
        count($"category").as("category_cnt")
      )
      .orderBy($"category_cnt".desc)
      .show(30)

    // Rank the most active users over the last month
    frmUsers
      .filter($"userId".isNotNull)
      .where(datediff(current_date(), $"date") < 30)
      .groupBy($"userId")
      .agg(
        count($"userId").as("cnt")
      )
      .select($"userId", $"cnt",
        dense_rank().over(Window.orderBy($"cnt".desc)).as("rank")
      )
      .where($"rank" < 3)
      .show(1000)
    // Find the most popular categories
    frmUsers
      .filter($"userId".isNotNull)
      .groupBy($"category")
      .agg(
        count($"category").as("cat_cnt")
      )
      .select($"category", $"cat_cnt",
        dense_rank().over(Window.orderBy($"cat_cnt".desc)).as("rnk")
      )
      .where($"rnk" <= 2)
      .show(1000)
  }
}
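Note that count($"category") in the first query counts interaction records, not distinct categories. If the goal is literally the users who browsed the most different categories, a countDistinct variant fits better; a minimal sketch, assuming the same frmUsers and imports as above:

    // Users ranked by how many *distinct* categories they interacted with
    frmUsers
      .filter($"userId".isNotNull)
      .groupBy($"userId")
      .agg(countDistinct($"category").as("distinct_categories"))
      .orderBy($"distinct_categories".desc)
      .show(30)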
Of course, you can also query directly in SQL.
For example (looking up one user's behavior):

    sparkSession.sql(
      """
        |select *
        |from news_access
        |where userId = 18617
        |""".stripMargin).show(10)
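For comparison, the "most popular categories" query from the DataFrame section can also be written against the registered view; a sketch assuming the same news_access temp view:

    sparkSession.sql(
      """
        |select category,
        |       count(*) as cat_cnt,
        |       dense_rank() over (order by count(*) desc) as rnk
        |from news_access
        |where userId is not null
        |group by category
        |""".stripMargin)
      .where("rnk <= 2")
      .show()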
Further Thoughts and Exercises
1. Find the three categories with the highest conversion rate (purchases divided by the total count of all actions: add-to-cart, wishlist, select, purchase, and so on) and their ratios; a possible sketch follows this list.
2. Find the user with the lowest (or highest) purchase rate and that rate.
3. Find high-value (or low-value) target users (those with purchase behavior in the last quarter) and their purchase counts.
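For exercise 1, one possible starting point is a conditional aggregation; a minimal sketch, assuming the same frmUsers DataFrame and imports as above, and taking the conversion rate to be purchases over all recorded actions per category:

    // Conversion rate per category: purchase count / total action count
    frmUsers
      .filter($"userId".isNotNull)
      .groupBy($"category")
      .agg(
        sum(when($"action" === "purchase", 1).otherwise(0)).as("purchases"),
        count($"action").as("total_actions")
      )
      .withColumn("purchase_rate", $"purchases" / $"total_actions")
      .orderBy($"purchase_rate".desc)
      .show(3)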
Corrections and feedback on any issues are welcome.