E-commerce Analytics Platform: Requirement 2, Randomly Extracting Sessions by Proportion

What needs to be done?

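Among the sessions that pass the filter conditions, randomly extract 100 sessions in proportion to time. When the data spans several days, the quota of 100 sessions is split evenly across the days. Within a day, the number of sessions extracted from a given hour is determined by that hour's share of the day's total sessions.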

Sessions to extract from an hour = (sessions in that hour / sessions in that day) * sessions to extract for that day
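As a worked example of the formula, the following is a minimal sketch with hypothetical numbers. The object name HourQuotaSketch, the function hourQuota, and the sample counts are assumptions made for illustration and are not part of the project code.

```scala
object HourQuotaSketch {
  // Sessions to draw from one hour, given that hour's session count, the day's
  // total session count, and the number of sessions the whole day should yield.
  def hourQuota(hourCount: Long, dayCount: Long, dayQuota: Int): Int = {
    require(dayCount > 0, "dayCount must be positive")
    // Convert to Double first: Long / Long would truncate the ratio to 0.
    val quota = (hourCount.toDouble / dayCount * dayQuota).toInt
    // An hour can never contribute more sessions than it contains.
    math.min(quota.toLong, hourCount).toInt
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical data: 2 days of data, so each day gets 100 / 2 = 50 sessions.
    val dayQuota = 100 / 2
    // The day has 1000 sessions, 250 of which fall in one hour:
    // 250 / 1000 * 50 = 12.5, truncated to 12.
    println(hourQuota(250L, 1000L, dayQuota))
  }
}
```

Because each hour's share is truncated to an integer, the hourly quotas can add up to slightly fewer sessions than the day's target.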

Requirement analysis

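Once the number of sessions to extract from an hour is known (call it N), generate N distinct random numbers. These N numbers form the index list of the sessions to extract. Assume the sessions aggregated by hour are numbered from 0: a session is extracted if its index appears in the list, and skipped otherwise.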
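The index-list idea can be sketched on a plain Scala collection as follows. This is only an illustration under stated assumptions: the names IndexSamplingSketch, randomIndices, and pick are hypothetical, and the indices are drawn by shuffling the index range and taking a prefix, which is a different technique from drawing random numbers one at a time and retrying on duplicates.

```scala
import scala.util.Random

object IndexSamplingSketch {
  // Pick n distinct indices from 0 until total by shuffling and taking a prefix.
  def randomIndices(total: Int, n: Int, rng: Random): Set[Int] =
    rng.shuffle((0 until total).toList).take(n).toSet

  // Keep only the items whose position is in the index set.
  def pick[A](items: Seq[A], indices: Set[Int]): Seq[A] =
    items.zipWithIndex.collect { case (item, i) if indices.contains(i) => item }

  def main(args: Array[String]): Unit = {
    val sessions = Seq("s0", "s1", "s2", "s3", "s4", "s5") // hypothetical session ids
    val indices = randomIndices(sessions.size, 3, new Random())
    // Prints three of the six ids, in their original order.
    println(pick(sessions, indices))
  }
}
```

Shuffling avoids the retry loop, at the cost of materializing the whole index range for the hour.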
Flowchart: [image not available]
Table transformations: [images not available]

Step-by-step walkthrough

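  1. Transform the filtered data (the output of the first requirement) into the structure (date, info),
    where date is the session start time formatted as yyyy-MM-dd_HH
     val dateHour2FullInfoRDD = filterInfo.map {
       case (sessionId, info) =>
         // Extract the session start time from the concatenated info string
         val date1 = StringUtil.getFieldFromConcatString(info, "\\|", Constants.FIELD_START_TIME)
         // Convert it to a day_hour key (yyyy-MM-dd_HH)
         val date = DateUtils.getDateHour(date1)
         (date, info)
     }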
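The body of DateUtils.getDateHour is not shown in this post. As a rough idea of what such a helper does, here is a minimal sketch using java.time. It assumes the start time is stored as "yyyy-MM-dd HH:mm:ss"; that input format, the object name DateHourKeySketch, and the function dateHourKey are all assumptions for illustration and not the project's actual implementation.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object DateHourKeySketch {
  // Assumed input format of the session start time.
  private val inputFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  // Output key: day and hour joined by an underscore (quoted as a literal).
  private val keyFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd'_'HH")

  // Reduce a full timestamp to its day_hour key.
  def dateHourKey(startTime: String): String =
    LocalDateTime.parse(startTime, inputFormat).format(keyFormat)

  def main(args: Array[String]): Unit = {
    println(dateHourKey("2019-03-08 10:15:30")) // prints 2019-03-08_10
  }
}
```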
  2. Use the countByKey operator to obtain a (date, count) map
    // 2. Count the sessions that share the same day_hour key; the result is a Map
    val hourCountMap = dateHour2FullInfoRDD.countByKey()
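To make the shape of the countByKey result concrete, here is a small sketch on a plain Scala collection rather than an RDD. The helper countByKeyLocal and the sample keys are hypothetical; Spark's own countByKey returns the per-key counts to the driver as a Map from key to Long.

```scala
object CountByKeySketch {
  // Local stand-in for countByKey: count how many pairs share each key.
  def countByKeyLocal[K, V](pairs: Seq[(K, V)]): Map[K, Long] =
    pairs.groupBy(_._1).map { case (key, group) => key -> group.size.toLong }

  def main(args: Array[String]): Unit = {
    // Hypothetical (day_hour, info) pairs.
    val pairs = Seq(
      "2019-03-08_10" -> "info-a",
      "2019-03-08_10" -> "info-b",
      "2019-03-08_11" -> "info-c"
    )
    // Two sessions fall in hour 10 and one in hour 11.
    println(countByKeyLocal(pairs))
  }
}
```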

  3. Reshape the data into day -> Map(hour, count), which gives the number of sessions in each hour of each day

    val dataHourCount = new mutable.HashMap[String, mutable.HashMap[String, Long]]
    for ((k, v) <- hourCountMap) {
      val day = k.split("_")(0)
      val hour = k.split("_")(1)
      // Create the inner map the first time a day is seen, then record the hour's count
      dataHourCount.getOrElseUpdate(day, new mutable.HashMap[String, Long]) += (hour -> v)
    }
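The same reshaping can also be written without mutable maps. The sketch below is an alternative for illustration only: the names NestedCountSketch and byDayAndHour and the sample counts are hypothetical, and it assumes every key contains exactly one underscore.

```scala
object NestedCountSketch {
  // Turn Map("day_hour" -> count) into Map(day -> Map(hour -> count)).
  def byDayAndHour(flat: Map[String, Long]): Map[String, Map[String, Long]] =
    flat.toSeq
      .map { case (key, count) =>
        val Array(day, hour) = key.split("_") // assumes exactly one underscore
        (day, hour, count)
      }
      .groupBy(_._1)
      .map { case (day, rows) => day -> rows.map(r => r._2 -> r._3).toMap }

  def main(args: Array[String]): Unit = {
    // Hypothetical counts, shaped like the result of countByKey.
    val flat = Map(
      "2019-03-08_10" -> 200L,
      "2019-03-08_11" -> 800L,
      "2019-03-09_10" -> 50L
    )
    println(byDayAndHour(flat))
  }
}
```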
  4. Generate the list of indices to extract for each hour
    // 4. Build the indices of the sessions to extract, stored as Map(day, Map(hour, list))
    val ExtractIndexListMap = new mutable.HashMap[String, mutable.HashMap[String, ListBuffer[Int]]]
    val sumday = dataHourCount.size
    val extractDay = 100 / sumday // sessions to extract per day (the 100 split evenly across days)

    for ((day, map) <- dataHourCount) {
      val oneDay = map.values.sum // total number of sessions on this day
      // Create the per-hour index map the first time a day is seen
      val hourListMap = ExtractIndexListMap.getOrElseUpdate(day, new mutable.HashMap[String, ListBuffer[Int]])
      generateRandomIndexList(extractDay, oneDay, map, hourListMap)
    }

The generateRandomIndexList function:

 def generateRandomIndexList(extractDay: Int, oneDay: Long, hourCountMap: mutable.HashMap[String, Long], hourListMap: mutable.HashMap[String, ListBuffer[Int]]): Unit = {
    val random = new Random()
    for ((hour, cnt) <- hourCountMap) {
      // Number of sessions to extract from this hour.
      // Convert to Double before dividing: cnt / oneDay on two Longs truncates to 0
      // for every hour that does not hold the whole day's sessions.
      // Cap the result at cnt so the loop below can always find enough distinct indices.
      val curHour = math.min(((cnt.toDouble / oneDay) * extractDay).toInt, cnt.toInt)
      val indexList = hourListMap.getOrElseUpdate(hour, new ListBuffer[Int])
      for (i <- 0 until curHour) {
        // Draw an index in [0, cnt) and redraw until it is not already in the list
        var index = random.nextInt(cnt.toInt)
        while (indexList.contains(index)) {
          index = random.nextInt(cnt.toInt)
        }
        indexList.append(index)
      }
    }
  }

/*
At this point we have:
1. The total number of sessions in each hour -> dataHourCount
2. The indices of the sessions to extract from each hour -> ExtractIndexListMap
*/

  5. Extract sessions according to ExtractIndexListMap

    val dateHour2GroupRDD = dateHour2FullInfoRDD.groupByKey()
    val extractSessionRDD = dateHour2GroupRDD.flatMap {
      case (dateHour, iterableFullInfo) =>
        val day = dateHour.split("_")(0)
        val hour = dateHour.split("_")(1)
        // Indices chosen for this day and hour
        val indexList = ExtractIndexListMap(day)(hour)
        val extractSessionArrayBuffer = new ArrayBuffer[SessionRandomExtract]()

        // Position of the current session within this hour's group, starting at 0
        var index = 0

        for (fullInfo <- iterableFullInfo) {
          // Keep the session only if its position was selected
          if (indexList.contains(index)) {
            val sessionId = StringUtil.getFieldFromConcatString(fullInfo, "\\|", Constants.FIELD_SESSION_ID)
            val startTime = StringUtil.getFieldFromConcatString(fullInfo, "\\|", Constants.FIELD_START_TIME)
            val searchKeywords = StringUtil.getFieldFromConcatString(fullInfo, "\\|", Constants.FIELD_SEARCH_KEYWORDS)
            val clickCategories = StringUtil.getFieldFromConcatString(fullInfo, "\\|", Constants.FIELD_CLICK_CATEGORY_IDS)

            val extractSession = SessionRandomExtract(taskUUID, sessionId, startTime, searchKeywords, clickCategories)

            extractSessionArrayBuffer += extractSession
          }
          index += 1
        }
        extractSessionArrayBuffer
    }
    // Print the extracted sessions (on a cluster the output goes to the executors' stdout)
    extractSessionRDD.foreach(println)

  6. Write the result to the database

    import session.implicits._
    extractSessionRDD.toDF().write
      .format("jdbc")
      .option("url", ConfigurationManager.config.getString(Constants.JDBC_URL))
      .option("user", ConfigurationManager.config.getString(Constants.JDBC_USER))
      .option("password", ConfigurationManager.config.getString(Constants.JDBC_PASSWORD))
      .option("dbtable", "session_extract_0308")
      .mode(SaveMode.Append)
      .save()

Complete code:

package server

import com.alibaba.fastjson.JSONObject
import commons.conf.ConfigurationManager
import commons.constant.Constants
import commons.model.SessionRandomExtract
import commons.utils.{DateUtils, StringUtil}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}

import scala.collection.mutable
import scala.collection.mutable.{ArrayBuffer, ListBuffer}
import scala.util.Random

class serverTwo extends Serializable {


  def generateRandomIndexList(extractDay: Int, oneDay: Long, hourCountMap: mutable.HashMap[String, Long], hourListMap: mutable.HashMap[String, ListBuffer[Int]]): Unit = {
    val random = new Random()
    for ((hour, cnt) <- hourCountMap) {
      // Number of sessions to extract from this hour.
      // Convert to Double before dividing: cnt / oneDay on two Longs truncates to 0
      // for every hour that does not hold the whole day's sessions.
      // Cap the result at cnt so the loop below can always find enough distinct indices.
      val curHour = math.min(((cnt.toDouble / oneDay) * extractDay).toInt, cnt.toInt)
      val indexList = hourListMap.getOrElseUpdate(hour, new ListBuffer[Int])
      for (i <- 0 until curHour) {
        // Draw an index in [0, cnt) and redraw until it is not already in the list
        var index = random.nextInt(cnt.toInt)
        while (indexList.contains(index)) {
          index = random.nextInt(cnt.toInt)
        }
        indexList.append(index)
      }
    }
  }

  def GetextraSession(session: SparkSession, filterInfo: RDD[(String,String)], task: JSONObject, taskUUID: String)={
     // 1. Convert the data to the (date, info) format, where date is a day_hour key
     val dateHour2FullInfoRDD=filterInfo.map{
       case (sessionId,info)=>{
         val date1=StringUtil.getFieldFromConcatString(info, "\\|", Constants.FIELD_START_TIME)
         val date=DateUtils.getDateHour(date1);
         (date,info);
       }
     }
    // 2. Count the sessions that share the same day_hour key; the result is a Map
    val hourCountMap = dateHour2FullInfoRDD.countByKey()

    // 3. Reshape the data into day -> Map(hour, count)
    val dataHourCount = new mutable.HashMap[String, mutable.HashMap[String, Long]]
    for ((k, v) <- hourCountMap) {
      val day = k.split("_")(0)
      val hour = k.split("_")(1)
      // Create the inner map the first time a day is seen, then record the hour's count
      dataHourCount.getOrElseUpdate(day, new mutable.HashMap[String, Long]) += (hour -> v)
    }
    // 4. Build the indices of the sessions to extract, stored as Map(day, Map(hour, list))
    val ExtractIndexListMap = new mutable.HashMap[String, mutable.HashMap[String, ListBuffer[Int]]]
    val sumday = dataHourCount.size
    val extractDay = 100 / sumday // sessions to extract per day (the 100 split evenly across days)

    for ((day, map) <- dataHourCount) {
      val oneDay = map.values.sum // total number of sessions on this day
      // Create the per-hour index map the first time a day is seen
      val hourListMap = ExtractIndexListMap.getOrElseUpdate(day, new mutable.HashMap[String, ListBuffer[Int]])
      generateRandomIndexList(extractDay, oneDay, map, hourListMap)
    }
     /*
     At this point we have:
     1. The total number of sessions in each hour -> dataHourCount
     2. The indices of the sessions to extract from each hour -> ExtractIndexListMap
      */
    // 5. Extract sessions according to ExtractIndexListMap
    val dateHour2GroupRDD = dateHour2FullInfoRDD.groupByKey()
    val extractSessionRDD=dateHour2GroupRDD.flatMap{
      case (dateHour,iterableFullInfo)=>{
        val day = dateHour.split("_")(0)
        val hour = dateHour.split("_")(1)
        val indexList=ExtractIndexListMap.get(day).get(hour);
        val extractSessionArrayBuffer = new ArrayBuffer[SessionRandomExtract]()

        var index = 0

        for(fullInfo <- iterableFullInfo){
          if(indexList.contains(index)){
            val sessionId = StringUtil.getFieldFromConcatString(fullInfo, "\\|", Constants.FIELD_SESSION_ID)
            val startTime = StringUtil.getFieldFromConcatString(fullInfo, "\\|",Constants.FIELD_START_TIME)
            val searchKeywords = StringUtil.getFieldFromConcatString(fullInfo, "\\|", Constants.FIELD_SEARCH_KEYWORDS)
            val clickCategories = StringUtil.getFieldFromConcatString(fullInfo, "\\|", Constants.FIELD_CLICK_CATEGORY_IDS)

            val extractSession = SessionRandomExtract(taskUUID , sessionId, startTime, searchKeywords, clickCategories)

            extractSessionArrayBuffer += extractSession
          }
          index += 1
        }
        extractSessionArrayBuffer

      }
    }
    extractSessionRDD.foreach(println);
    // 6. Write the result to the database
    /*import session.implicits._;
    extractSessionRDD.toDF().write
    .format("jdbc")
      .option("url", ConfigurationManager.config.getString(Constants.JDBC_URL))
      .option("user",ConfigurationManager.config.getString(Constants.JDBC_USER))
      .option("password", ConfigurationManager.config.getString(Constants.JDBC_PASSWORD))
      .option("dbtable", "session_extract_0308")
      .mode(SaveMode.Append)
      .save()*/
  }

}

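Summary

  • For random sampling in general, first generate the list of indices to sample by hand, then match that index list against the full data set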