A Case Study in Regional Metric Statistics for a DSP Ad Data System

Previous post: DSP ad data system, submitting the job to run on YARN

1. Data distribution by province and city

Requirement: count the number of log records for each province and city.

Analysis: read the Parquet logs through a SQLContext, aggregate with the Spark Core (RDD) API, write the result to MySQL, and also save it as JSON files.

Add the Gson dependency to the pom:

<!-- gson | jackson | ... -->
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.8.4</version>
</dependency>

Straight to the code:

dolphin-doit01\src\main\scala\cn\sheep\dolphin\report\ProvinceCityAnalysisCore.scala 

package cn.sheep.dolphin.report

import cn.sheep.dolphin.common.DolphinAppComm
import cn.sheep.dolphin.config.DolphinConfig
import com.google.gson.Gson
import org.apache.spark.sql.{SQLContext, SaveMode}

/**
 * Province/city data distribution statistics - Core
 * author: old sheep
 * Created 2021/04/04  12:22
 */
object ProvinceCityAnalysisCore {
  def main(args: Array[String]): Unit = {

    if (args.length != 2) {
      println(
        """
          |Usage: cn.sheep.dolphin.report.ProvinceCityAnalysisCore
          |Args: <parquetInputPath> <jsonOutputPath>
        """.stripMargin)
      sys.exit(-1)
    }

    // parse the arguments
    val Array(parquetInputPath, jsonOutputPath) = args

    // sparkContext
    val sc = DolphinAppComm.createSparkContext("province-city distribution statistics")

    // read the Parquet input through a SQLContext
    val sqlc = new SQLContext(sc)
    val dataFrame = sqlc.read.parquet(parquetInputPath)

    // aggregate: count records per (day, province, city)
    val result = dataFrame.map(row => {
      // day: first 10 characters of the request timestamp, e.g. "2021-04-04"
      val day = row.getAs[String]("requestdate").substring(0, 10)
      // province name
      val province = row.getAs[String]("provincename")
      // city name
      val city = row.getAs[String]("cityname")

      ((day, province, city), 1)
    }).reduceByKey(_ + _)

    // persist to MySQL
    import sqlc.implicits._
    val resultDF = result
      .map { case ((day, province, city), count) => (day, province, city, count) }
      .toDF("day", "province", "city", "cnt")

    resultDF.write.mode(SaveMode.Overwrite).jdbc(DolphinConfig._url, "r_procity_analysis", DolphinConfig.props)

    // persist as JSON
    // option 1: let Spark serialize the DataFrame directly
    // resultDF.write.json(jsonOutputPath)

    // option 2: case class => schema, serialized by hand with Gson
    result.map { case ((day, province, city), count) =>
      val gson = new Gson()
      gson.toJson(RProcityAnalysis(day, province, city, count))
    }.saveAsTextFile(jsonOutputPath)

    sc.stop()

  }

}
/* Case class wrapping a result record */
case class RProcityAnalysis(day: String, province: String, city: String, cnt: Int)
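
One nit in option 2: it creates a new Gson instance for every record. A minimal sketch of the usual fix, building one Gson per partition with mapPartitions (our own variant, not part of the original job):

// one Gson per partition instead of one per record
result.mapPartitions { iter =>
  val gson = new Gson()
  iter.map { case ((day, province, city), count) =>
    gson.toJson(RProcityAnalysis(day, province, city, count))
  }
}.saveAsTextFile(jsonOutputPath)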

Before launching the program, configure the run arguments:

C:\Users\mrman\dolphin-doit01\parquet
C:\Users\mrman\dolphin-doit01\provinceCityAnalysisCore

Run ProvinceCityAnalysisCore; the output data can then be inspected on local disk.

Inspect the part-00000 output file.

Since the MySQL write uses Overwrite mode, the result table can also be checked in the spark database in MySQL.
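
To sanity-check the table from Spark itself rather than a MySQL client, one option (a sketch, reusing the same sqlc and DolphinConfig as above) is to read it back:

// read the freshly written table back and print a few rows
val check = sqlc.read.jdbc(DolphinConfig._url, "r_procity_analysis", DolphinConfig.props)
check.show(10)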

Summary:

(1) In production, runtime parameters are generally passed in via args, which keeps the job flexible; genuinely fixed data, such as field lists, can live in a configuration file instead.

(2) Whether you implement with SQL or Core, the idea is the same: identify the dimension fields of the statistic; they become the group by columns in SQL and the reduceByKey key in Core.

(3) A SQLContext yields DataFrame objects; there are two common ways to build a DataFrame (see the sketch after this list):

a. RDD[Row] + StructType(Seq(StructField(...), ...))
b. RDD of a case class (RDD[A <: Product]) + toDF
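
A minimal sketch of both routes, reusing the sc and sqlc from the program above and a hypothetical Person record:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// a. RDD[Row] plus an explicit StructType schema
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)
))
val rowRDD = sc.parallelize(Seq(Row("tom", 20), Row("amy", 18)))
val df1 = sqlc.createDataFrame(rowRDD, schema)

// b. RDD of a case class (A <: Product) converted with toDF
// (define the case class at the top level so Spark's reflection can see it)
case class Person(name: String, age: Int)
import sqlc.implicits._
val df2 = sc.parallelize(Seq(Person("tom", 20), Person("amy", 18))).toDF()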

(4) fastjson does not handle Scala case classes particularly well; when turning Scala objects into JSON, prefer Gson or Jackson.

2. Regional metric statistics

2.1 Requirement

For each province and city, compute the following metrics: raw requests, effective requests, ad requests, bids entered (RTB), bids won, ad impressions, ad clicks, ad spend, ad cost.

2.2 Constraints

  • Store the results in MySQL, with province and city as two separate columns;

  • Implement the requirement in both ways (SQL and Core);

  • Produce the result in a single job;

  • Aggregate by day.

Spark Core implementation

Straight to the code:

dolphin-doit01\src\main\scala\cn\sheep\dolphin\bean\AreaReportResult.scala


package cn.sheep.dolphin.bean
/**
 * author: old sheep
 * Created 2021/04/04  16:13
 */
case class AreaReportResult(
                             day: String,              // statistics day, yyyy-MM-dd
                             provinceName: String,     // province name
                             cityName: String,         // city name
                             rawRequest: Int,          // raw requests
                             effectiveRequest: Int,    // effective requests
                             adRequest: Int,           // ad requests
                             isRTB: Int,               // bids entered (RTB)
                             succRTB: Int,             // bids won
                             adShow: Int,              // ad impressions
                             adClick: Int,             // ad clicks
                             expense: Double,          // ad spend (winprice / 1000)
                             cost: Double              // ad cost (adpayment / 1000)
                           )

dolphin-doit01\src\main\scala\cn\sheep\dolphin\report\AreaAnalysisCore.scala


package cn.sheep.dolphin.report

import cn.sheep.dolphin.bean.AreaReportResult
import cn.sheep.dolphin.common.DolphinAppComm
import cn.sheep.dolphin.config.DolphinConfig
import org.apache.spark.sql.{Row, SQLContext, SaveMode}

/**
 * Regional metric statistics - Core
 * author: old sheep
 * Created 2021/04/04 14:25
 */
object AreaAnalysisCore {
  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println(
        """
          |Usage: cn.sheep.dolphin.report.AreaAnalysisCore
          |  <parquetInputPath>
        """.stripMargin)
      sys.exit(-1)
    }

    val Array(parquetInputPath) = args

    val sc = DolphinAppComm.createSparkContext("regional metric statistics")
    val sQLContext = new SQLContext(sc)

    // read the Parquet input
    val dataFrame = sQLContext.read.parquet(parquetInputPath)

    import sQLContext.implicits._

    dataFrame.map(row => {
      // date and location dimensions
      val day = row.getAs[String]("requestdate").substring(0, 10)  // request time, format yyyy-MM-dd HH:mm:ss
      val proName = row.getAs[String]("provincename")  // province the device is in
      val cityName = row.getAs[String]("cityname")     // city the device is in

      // flag fields the metrics are derived from
      val requestMode = row.getAs[Int]("requestmode")  // request mode (1: request, 2: impression, 3: click)
      val processNode = row.getAs[Int]("processnode")  // process node (1: raw-request KPI, 2: effective request, 3: ad request)
      val effective = row.getAs[Int]("iseffective")    // effectiveness flag (1: valid, i.e. billable; 0: invalid)
      val billing = row.getAs[Int]("isbilling")        // billing flag (0: not billed, 1: billed)
      val isBid = row.getAs[Int]("isbid")              // whether an RTB bid was placed
      val isWin = row.getAs[Int]("iswin")              // whether the bid was won
      val adOrderId = row.getAs[Int]("adorderid")      // ad order id

      // raw request / effective request / ad request
      val (rawRequest, effectiveRequest, adRequest) =
        if (requestMode == 1 && processNode >= 3) (1, 1, 1)
        else if (requestMode == 1 && processNode == 2) (1, 1, 0)
        else if (requestMode == 1 && processNode == 1) (1, 0, 0)
        else (0, 0, 0)

      // does this log line count as a bid?
      val isRTB = if (effective == 1 && billing == 1 && isBid == 1 && adOrderId != 0) 1 else 0

      // is this a winning-bid log line?
      val (succRTB, expense, cost) = if (effective == 1 && billing == 1 && isWin == 1) {
        val winPrice = row.getAs[Double]("winprice") / 1000d    // winning bid price
        val adPayment = row.getAs[Double]("adpayment") / 1000d  // converted ad spend
        (1, winPrice, adPayment)
      } else (0, 0d, 0d)

      // (ad impression, ad click)
      val (adShow, adClick) =
        if (requestMode == 2 && effective == 1) (1, 0)
        else if (requestMode == 3 && effective == 1) (0, 1)
        else (0, 0)

      // assemble (k, v) => ((day, province, city), (raw requests, effective requests, ad requests, bids entered, bids won, impressions, clicks, spend, cost))
      (
        (day, proName, cityName),
        (rawRequest, effectiveRequest, adRequest, isRTB, succRTB, adShow, adClick, expense, cost)
      )

    }).reduceByKey((a, b) => {
      (
        a._1 + b._1,
        a._2 + b._2,
        a._3 + b._3,
        a._4 + b._4,
        a._5 + b._5,
        a._6 + b._6,
        a._7 + b._7,
        a._8 + b._8,
        a._9 + b._9
      )
    }).map{
      case ((day, pname, cname),(rawReq, effReq, adReq, isRTB, succRTB, adShow, adClick, expense, cost)) =>
        AreaReportResult(
          day, pname, cname,
          rawReq.toInt, effReq.toInt, adReq.toInt,
          isRTB.toInt, succRTB.toInt, adShow.toInt,
          adClick.toInt, expense, cost)
    }
      .toDF()
      .write.mode(SaveMode.Append)
      .jdbc(DolphinConfig._url, "r_area_analysis", DolphinConfig.props)

    sc.stop()

  }

}

Run AreaAnalysisCore; the job output is printed to the console.

Checking the spark database in MySQL shows a new table, r_area_analysis. Note that this job writes with SaveMode.Append, so re-running it appends duplicate rows instead of replacing the previous result.
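
If re-runs need to stay idempotent, one option (a sketch of our own, not part of the original job; it assumes DolphinConfig.props carries the MySQL user and password, as the jdbc writes above imply) is to delete the day being recomputed before appending:

import java.sql.DriverManager

// delete the rows for the day about to be recomputed, then let the job append
val conn = DriverManager.getConnection(DolphinConfig._url, DolphinConfig.props)
try {
  val stmt = conn.prepareStatement("DELETE FROM r_area_analysis WHERE day = ?")
  stmt.setString(1, "2021-04-04") // hypothetical day value
  stmt.executeUpdate()
} finally {
  conn.close()
}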

Code optimization: the 9-field tuple makes the reduce function verbose, so the refactor below packs the nine metrics into a List[Double] instead (the element type must be Double so the Int counters and the Double money fields fit in one list), and the nine-way sum collapses into a zip.


package cn.sheep.dolphin.report

import cn.sheep.dolphin.bean.AreaReportResult
import cn.sheep.dolphin.common.DolphinAppComm
import cn.sheep.dolphin.config.DolphinConfig
import org.apache.spark.sql.{SQLContext, SaveMode}

/**
 * Regional metric statistics - Core (after the refactor)
 * author: old sheep
 * Created 2021/04/04 19:23
 */
object AreaAnalysisCore {
  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println(
        """
          |Usage: cn.sheep.dolphin.report.AreaAnalysisCore
          |  <parquetInputPath>
        """.stripMargin)
      sys.exit(-1)
    }

    val Array(parquetInputPath) = args

    val sc = DolphinAppComm.createSparkContext("regional metric statistics")
    val sQLContext = new SQLContext(sc)

    // read the Parquet input
    val dataFrame = sQLContext.read.parquet(parquetInputPath)

    import sQLContext.implicits._

    dataFrame.map(row => {
      // date and location dimensions
      val day = row.getAs[String]("requestdate").substring(0, 10)  // request time, format yyyy-MM-dd HH:mm:ss
      val proName = row.getAs[String]("provincename")  // province the device is in
      val cityName = row.getAs[String]("cityname")     // city the device is in

      // flag fields the metrics are derived from
      val requestMode = row.getAs[Int]("requestmode")  // request mode (1: request, 2: impression, 3: click)
      val processNode = row.getAs[Int]("processnode")  // process node (1: raw-request KPI, 2: effective request, 3: ad request)
      val effective = row.getAs[Int]("iseffective")    // effectiveness flag (1: valid, i.e. billable; 0: invalid)
      val billing = row.getAs[Int]("isbilling")        // billing flag (0: not billed, 1: billed)
      val isBid = row.getAs[Int]("isbid")              // whether an RTB bid was placed
      val isWin = row.getAs[Int]("iswin")              // whether the bid was won
      val adOrderId = row.getAs[Int]("adorderid")      // ad order id

      // raw request / effective request / ad request
      val (rawRequest, effectiveRequest, adRequest) =
        if (requestMode == 1 && processNode >= 3) (1, 1, 1)
        else if (requestMode == 1 && processNode == 2) (1, 1, 0)
        else if (requestMode == 1 && processNode == 1) (1, 0, 0)
        else (0, 0, 0)

      // does this log line count as a bid?
      val isRTB = if (effective == 1 && billing == 1 && isBid == 1 && adOrderId != 0) 1 else 0

      // is this a winning-bid log line?
      val (succRTB, expense, cost) = if (effective == 1 && billing == 1 && isWin == 1) {
        val winPrice = row.getAs[Double]("winprice") / 1000d    // winning bid price
        val adPayment = row.getAs[Double]("adpayment") / 1000d  // converted ad spend
        (1, winPrice, adPayment)
      } else (0, 0d, 0d)

      // (ad impression, ad click)
      val (adShow, adClick) =
        if (requestMode == 2 && effective == 1) (1, 0)
        else if (requestMode == 3 && effective == 1) (0, 1)
        else (0, 0)

      // assemble (k, v) => ((day, province, city), List of the nine metrics)
      (
        (day, proName, cityName),
        List[Double](rawRequest, effectiveRequest, adRequest, isRTB, succRTB, adShow, adClick, expense, cost)
      )

    }).reduceByKey((a, b) => {
      // element-wise sum of the two metric lists
      a.zip(b).map { case (x, y) => x + y }
    }).map{
      case ((day, pname, cname),List(rawReq, effReq, adReq, isRTB, succRTB, adShow, adClick, expense, cost)) =>
        AreaReportResult(
          day, pname, cname,
          rawReq.toInt, effReq.toInt, adReq.toInt,
          isRTB.toInt, succRTB.toInt, adShow.toInt,
          adClick.toInt, expense, cost)
    }
      .toDF()
      .write.mode(SaveMode.Append)
      .jdbc(DolphinConfig._url, "r_area_analysis", DolphinConfig.props)

    sc.stop()

  }
}
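
The refactor hinges on one trick: with all nine metrics in a List[Double], the nine-way sum becomes an element-wise zip. A tiny standalone illustration:

val a = List[Double](1, 1, 0, 1, 0, 0, 0, 0.5, 0.3)
val b = List[Double](1, 0, 0, 0, 1, 1, 0, 0.0, 0.0)
// element-wise sum, same shape as the reduceByKey function above
val summed = a.zip(b).map { case (x, y) => x + y }
// => List(2.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.5, 0.3)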

Spark SQL implementation

Straight to the code:


package cn.sheep.dolphin.report

import cn.sheep.dolphin.common.DolphinAppComm
import cn.sheep.dolphin.config.DolphinConfig
import org.apache.spark.sql.{SQLContext, SaveMode}
/** Regional metric statistics - SQL
 * author: old sheep
 * Created 2021/04/04 21:30
 */
object AreaAnalysisSQL {
  def main(args: Array[String]): Unit = {

    if (args.length != 1) {
      println(
        """
          |Usage: cn.sheep.dolphin.report.AreaAnalysisSQL
          |  <parquetInputPath>
        """.stripMargin)
      sys.exit(-1)
    }

    val Array(parquetInputPath) = args

    val sc = DolphinAppComm.createSparkContext("regional metric statistics")
    val sQLContext = new SQLContext(sc)

    // read the Parquet input
    val dataFrame = sQLContext.read.parquet(parquetInputPath)

    // register the DataFrame as a temporary table
    dataFrame.registerTempTable("adlogs")

    // Spark SQL supports user-defined functions; sheepif mimics the built-in if
    sQLContext.udf.register("sheepif", (bool: Boolean, right: Double, wrong: Double) => if (bool) right else wrong)

    // run the aggregation query
    val result = sQLContext.sql(
      """
        |select provincename, cityname, substring(requestdate, 0, 10) day,
        |sum(case when requestmode=1 and processnode>=1 then 1 else 0 end) rawReq,
        |sum(case when requestmode=1 and processnode>=2 then 1 else 0 end) effReq,
        |sum(case when requestmode=1 and processnode=3 then 1 else 0 end) adReq,
        |
        |sum(sheepif(iseffective=1 and isbilling=1 and isbid=1 and adorderid!=0, 1, 0)) isRTB,
        |sum(if(iseffective=1 and isbilling=1 and iswin=1, 1, 0)) succRTB,
        |
        |sum(if(requestmode=2 and iseffective=1, 1, 0)) adShow,
        |sum(if(requestmode=3 and iseffective=1, 1, 0)) adClick,
        |
        |sum(if(iseffective=1 and isbilling=1 and iswin=1, winprice/1000, 0)) expense,
        |sum(if(iseffective=1 and isbilling=1 and iswin=1, adpayment/1000, 0)) cost
        |
        |from adlogs
        |group by provincename, cityname, substring(requestdate, 0, 10)
        |
      """.stripMargin)

    // write the result to the database
    result.write.mode(SaveMode.Overwrite).jdbc(DolphinConfig._url, "r_area_analysis2", DolphinConfig.props)


    sc.stop()


  }
}

Run AreaAnalysisSQL; the output is printed to the console.

Checking the spark database in MySQL shows a new table, r_area_analysis2.

To inspect the data in the console instead, just call show():

    // write the result to the database
//    result.write.mode(SaveMode.Overwrite).jdbc(DolphinConfig._url, "r_area_analysis2", DolphinConfig.props)
    result.show()

The console prints the result.

The regional-metrics job can also be inspected in the Spark Web UI.
