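What does it do?
Among the sessions that pass the filter, randomly extract 100 sessions in proportion to time. When the data spans several days, the quota of 100 is split evenly across the days. Within a day, the number of sessions extracted from a given hour is determined by that hour's share of the day's total sessions.
Sessions to extract from an hour = (sessions in that hour / sessions in that day) * sessions to extract for that day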
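The formula can be tried out on its own before any Spark code is involved. The following is only a minimal sketch: the object and method names are hypothetical and do not come from the project. It multiplies before dividing, so integer arithmetic does not truncate the hour's share to zero.

```scala
object HourQuotaSketch {
  // Even split of the overall target across days (hypothetical helper).
  def dayQuota(total: Int, days: Int): Int =
    if (days <= 0) 0 else total / days

  // Sessions to draw from one hour =
  //   (sessions in the hour / sessions in the day) * quota for the day.
  // Multiplying first keeps the Long arithmetic exact up to the final floor.
  def hourQuota(hourCount: Long, dayCount: Long, quotaForDay: Int): Int =
    if (dayCount <= 0) 0
    else (hourCount * quotaForDay / dayCount).toInt
}
```

For example, with two days of data each day gets a quota of 50, and an hour holding 300 of a day's 1000 sessions contributes 15 of them.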
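Requirement analysis
Once the number of sessions to extract from an hour is known (call it N), generate N distinct random numbers. This list of N numbers is the index list of sessions to extract. Assume the sessions grouped under an hour can be numbered from 0: a session is extracted if its index is in the list, and skipped otherwise.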
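The idea of sampling by index can be sketched on an ordinary in-memory collection. This is an illustration under assumptions, not the project's code: the names are hypothetical, and it shuffles the index range instead of redrawing duplicates, which is a swapped-in way to obtain N distinct indices.

```scala
import scala.util.Random

object IndexSamplingSketch {
  // Pick n distinct indices in [0, total); n is capped at total.
  def randomIndices(n: Int, total: Int, rng: Random): Set[Int] =
    rng.shuffle((0 until total).toList).take(math.min(n, total)).toSet

  // Number the items from 0 and keep those whose index was drawn.
  def extract[A](items: Seq[A], n: Int, rng: Random): Seq[A] = {
    val wanted = randomIndices(n, items.size, rng)
    items.zipWithIndex.collect { case (item, i) if wanted.contains(i) => item }
  }
}
```

Calling `IndexSamplingSketch.extract(sessions, 3, new Random())` returns three distinct elements of `sessions` in their original order, or all of them if fewer than three exist.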
Flowchart:
Table transformation:
Step-by-step walkthrough
- Convert the filtered data (the output of the first requirement) into (date, info) pairs,
where date is the session start time reduced to day and hour: (yyyy-MM-dd_HH)
val dateHour2FullInfoRDD = filterInfo.map {
  case (sessionId, info) =>
    val date1 = StringUtil.getFieldFromConcatString(info, "\\|", Constants.FIELD_START_TIME)
    // yyyy-MM-dd_HH
    val date = DateUtils.getDateHour(date1)
    (date, info)
}
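The implementation of DateUtils.getDateHour is not shown here. As a rough stand-in, the sketch below assumes the start time looks like "yyyy-MM-dd HH:mm:ss" and produces the "yyyy-MM-dd_HH" key. Both the input format and the names `toDateHour` and `splitDateHour` are assumptions made for illustration.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object DateHourSketch {
  // Assumed input format of the start time field.
  private val inFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  // Key format used for grouping: day and hour joined by an underscore.
  private val outFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd_HH")

  def toDateHour(startTime: String): String =
    LocalDateTime.parse(startTime, inFmt).format(outFmt)

  // Split the key back into (day, hour), as the later steps do with split("_").
  def splitDateHour(key: String): (String, String) = {
    val parts = key.split("_", 2)
    (parts(0), parts(1))
  }
}
```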
- Use the countByKey operator to get a (date, count) map
// 2. Count the sessions under each date-hour key; the result is a Map
val hourCountMap = dateHour2FullInfoRDD.countByKey()
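What countByKey computes can be mimicked on a local sequence of pairs, which makes the shape of the result easy to inspect. The sketch below uses only plain Scala collections. It is an analogy with a hypothetical helper, not Spark's implementation.

```scala
object CountByKeySketch {
  // Count how many pairs share each key, ignoring the values.
  def countByKey[K, V](pairs: Seq[(K, V)]): Map[K, Long] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.size.toLong }
}
```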
- Convert the data into a date -> Map(hour, count) structure, which records how many sessions each hour contains
val dataHourCount = new mutable.HashMap[String, mutable.HashMap[String, Long]]
for ((k, v) <- hourCountMap) {
  val day = k.split("_")(0)
  val hour = k.split("_")(1)
  // Create the inner map the first time a day is seen, then record the hour's count
  dataHourCount.getOrElseUpdate(day, new mutable.HashMap[String, Long]).put(hour, v)
}
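The same regrouping can also be written with immutable collections. The sketch below is one possible alternative, assuming the keys have the "day_hour" form shown earlier. The object and method names are hypothetical.

```scala
object NestedCountSketch {
  // Turn Map("day_hour" -> count) into Map(day -> Map(hour -> count)).
  def byDay(hourCounts: Map[String, Long]): Map[String, Map[String, Long]] =
    hourCounts.toSeq
      .map { case (key, count) =>
        val parts = key.split("_", 2)
        (parts(0), parts(1), count)
      }
      .groupBy(_._1)
      .map { case (day, rows) => day -> rows.map(r => r._2 -> r._3).toMap }
}
```

The mutable version above avoids the intermediate tuples. The immutable one is easier to test in isolation.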
- Generate the index list of sessions to extract for each hour
// 4. Build the indices of the sessions to extract, stored as Map(date, Map(hour, list))
val ExtractIndexListMap = new mutable.HashMap[String, mutable.HashMap[String, ListBuffer[Int]]]
val sumday = dataHourCount.size
val extractDay = 100 / sumday // even share for each day
for ((day, map) <- dataHourCount) {
  val oneDay = map.values.sum
  // Create the per-day map on first use, then fill in its hourly index lists
  val hourListMap = ExtractIndexListMap.getOrElseUpdate(day, new mutable.HashMap[String, ListBuffer[Int]])
  generateRandomIndexList(extractDay, oneDay, map, hourListMap)
}
The generateRandomIndexList function:
def generateRandomIndexList(extractDay: Int, oneDay: Long, hourCountMap: mutable.HashMap[String, Long], hourListMap: mutable.HashMap[String, ListBuffer[Int]]): Unit = {
  val random = new Random()
  // Work out how many sessions to extract from each hour
  for ((hour, cnt) <- hourCountMap) {
    // Divide as Double: cnt / oneDay on two Longs truncates to 0 for almost every hour.
    // Cap at cnt: asking for more distinct indices than there are sessions would loop forever.
    val curHour = math.min((cnt.toDouble / oneDay * extractDay).toInt, cnt.toInt)
    val indexList = hourListMap.getOrElseUpdate(hour, new ListBuffer[Int])
    for (_ <- 0 until curHour) {
      var index = random.nextInt(cnt.toInt)
      // Redraw until the index is not already in the list
      while (indexList.contains(index)) {
        index = random.nextInt(cnt.toInt)
      }
      indexList.append(index)
    }
  }
}
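Because each hour's quota is rounded down and cannot exceed the number of sessions in that hour, the hourly quotas of a day may add up to less than the day's quota. The sketch below computes the quotas and the resulting shortfall for one day. It is an illustration with hypothetical names and uses exact Long arithmetic rather than the Double division used in the function.

```scala
object QuotaShortfallSketch {
  // Quota per hour, floored and capped at the hour's own session count.
  def hourQuotas(hourCounts: Map[String, Long], quotaForDay: Int): Map[String, Int] = {
    val dayTotal = hourCounts.values.sum
    hourCounts.map { case (hour, cnt) =>
      val raw = if (dayTotal == 0) 0L else cnt * quotaForDay / dayTotal
      hour -> math.min(raw, cnt).toInt
    }
  }

  // How many of the day's quota are lost to flooring and capping.
  def shortfall(hourCounts: Map[String, Long], quotaForDay: Int): Int =
    quotaForDay - hourQuotas(hourCounts, quotaForDay).values.sum
}
```

With hourly counts of 10, 20 and 30 and a day quota of 50, the quotas are 8, 16 and 25, so one slot goes unused. If exactly 100 sessions are required, the remainder has to be redistributed in a separate step.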
/*
So far we have:
1. The total number of sessions in each hour -> dataHourCount
2. The indices of the sessions to extract from each hour -> ExtractIndexListMap
*/
- Extract the sessions according to ExtractIndexListMap
val dateHour2GroupRDD = dateHour2FullInfoRDD.groupByKey()
val extractSessionRDD = dateHour2GroupRDD.flatMap {
  case (dateHour, iterableFullInfo) =>
    val day = dateHour.split("_")(0)
    val hour = dateHour.split("_")(1)
    // Indices drawn for this day and hour
    val indexList = ExtractIndexListMap(day)(hour)
    val extractSessionArrayBuffer = new ArrayBuffer[SessionRandomExtract]()
    var index = 0
    for (fullInfo <- iterableFullInfo) {
      // Keep the session only if its position within the hour was drawn
      if (indexList.contains(index)) {
        val sessionId = StringUtil.getFieldFromConcatString(fullInfo, "\\|", Constants.FIELD_SESSION_ID)
        val startTime = StringUtil.getFieldFromConcatString(fullInfo, "\\|", Constants.FIELD_START_TIME)
        val searchKeywords = StringUtil.getFieldFromConcatString(fullInfo, "\\|", Constants.FIELD_SEARCH_KEYWORDS)
        val clickCategories = StringUtil.getFieldFromConcatString(fullInfo, "\\|", Constants.FIELD_CLICK_CATEGORY_IDS)
        val extractSession = SessionRandomExtract(taskUUID, sessionId, startTime, searchKeywords, clickCategories)
        extractSessionArrayBuffer += extractSession
      }
      index += 1
    }
    extractSessionArrayBuffer
}
extractSessionRDD.foreach(println)
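The extraction step reads each field out of the concatenated info string with StringUtil.getFieldFromConcatString, whose source is not shown here. The sketch below is a guess at what such a helper might look like, assuming the string is laid out as "key1=value1|key2=value2|...". The layout, the sample key names and the `getField` name are all assumptions.

```scala
object ConcatFieldSketch {
  // Assumed layout: "key1=value1|key2=value2|...".
  // Returns the value of the first entry whose key equals `field`, if any.
  def getField(concat: String, delimiterRegex: String, field: String): Option[String] =
    concat.split(delimiterRegex).iterator
      .map(_.split("=", 2))
      .collectFirst { case Array(k, v) if k == field => v }
}
```

Returning an Option makes a missing field explicit instead of yielding null or an empty string.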
- Write the results to the database
import session.implicits._
extractSessionRDD.toDF().write
  .format("jdbc")
  .option("url", ConfigurationManager.config.getString(Constants.JDBC_URL))
  .option("user", ConfigurationManager.config.getString(Constants.JDBC_USER))
  .option("password", ConfigurationManager.config.getString(Constants.JDBC_PASSWORD))
  .option("dbtable", "session_extract_0308")
  .mode(SaveMode.Append)
  .save()
Complete code:
package server
import com.alibaba.fastjson.JSONObject
import commons.conf.ConfigurationManager
import commons.constant.Constants
import commons.model.SessionRandomExtract
import commons.utils.{DateUtils, StringUtil}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}
import scala.collection.mutable
import scala.collection.mutable.{ArrayBuffer, ListBuffer}
import scala.util.Random
class serverTwo extends Serializable {
  def generateRandomIndexList(extractDay: Int, oneDay: Long, hourCountMap: mutable.HashMap[String, Long], hourListMap: mutable.HashMap[String, ListBuffer[Int]]): Unit = {
    val random = new Random()
    // Work out how many sessions to extract from each hour
    for ((hour, cnt) <- hourCountMap) {
      // Divide as Double: cnt / oneDay on two Longs truncates to 0 for almost every hour.
      // Cap at cnt: asking for more distinct indices than there are sessions would loop forever.
      val curHour = math.min((cnt.toDouble / oneDay * extractDay).toInt, cnt.toInt)
      val indexList = hourListMap.getOrElseUpdate(hour, new ListBuffer[Int])
      for (_ <- 0 until curHour) {
        var index = random.nextInt(cnt.toInt)
        // Redraw until the index is not already in the list
        while (indexList.contains(index)) {
          index = random.nextInt(cnt.toInt)
        }
        indexList.append(index)
      }
    }
  }
  def GetextraSession(session: SparkSession, filterInfo: RDD[(String, String)], task: JSONObject, taskUUID: String) = {
    // 1. Convert the data to (date, info) pairs, where date is yyyy-MM-dd_HH
    val dateHour2FullInfoRDD = filterInfo.map {
      case (sessionId, info) =>
        val date1 = StringUtil.getFieldFromConcatString(info, "\\|", Constants.FIELD_START_TIME)
        val date = DateUtils.getDateHour(date1)
        (date, info)
    }
    // 2. Count the sessions under each date-hour key; the result is a Map
    val hourCountMap = dateHour2FullInfoRDD.countByKey()
    // 3. Convert the data into a date -> Map(hour, count) structure
    val dataHourCount = new mutable.HashMap[String, mutable.HashMap[String, Long]]
    for ((k, v) <- hourCountMap) {
      val day = k.split("_")(0)
      val hour = k.split("_")(1)
      // Create the inner map the first time a day is seen, then record the hour's count
      dataHourCount.getOrElseUpdate(day, new mutable.HashMap[String, Long]).put(hour, v)
    }
    // 4. Build the indices of the sessions to extract, stored as Map(date, Map(hour, list))
    val ExtractIndexListMap = new mutable.HashMap[String, mutable.HashMap[String, ListBuffer[Int]]]
    val sumday = dataHourCount.size
    val extractDay = 100 / sumday // even share for each day
    for ((day, map) <- dataHourCount) {
      val oneDay = map.values.sum
      // Create the per-day map on first use, then fill in its hourly index lists
      val hourListMap = ExtractIndexListMap.getOrElseUpdate(day, new mutable.HashMap[String, ListBuffer[Int]])
      generateRandomIndexList(extractDay, oneDay, map, hourListMap)
    }
    /*
    So far we have:
    1. The total number of sessions in each hour -> dataHourCount
    2. The indices of the sessions to extract from each hour -> ExtractIndexListMap
    */
    // 5. Extract the sessions according to ExtractIndexListMap
    val dateHour2GroupRDD = dateHour2FullInfoRDD.groupByKey()
    val extractSessionRDD = dateHour2GroupRDD.flatMap {
      case (dateHour, iterableFullInfo) =>
        val day = dateHour.split("_")(0)
        val hour = dateHour.split("_")(1)
        // Indices drawn for this day and hour
        val indexList = ExtractIndexListMap(day)(hour)
        val extractSessionArrayBuffer = new ArrayBuffer[SessionRandomExtract]()
        var index = 0
        for (fullInfo <- iterableFullInfo) {
          // Keep the session only if its position within the hour was drawn
          if (indexList.contains(index)) {
            val sessionId = StringUtil.getFieldFromConcatString(fullInfo, "\\|", Constants.FIELD_SESSION_ID)
            val startTime = StringUtil.getFieldFromConcatString(fullInfo, "\\|", Constants.FIELD_START_TIME)
            val searchKeywords = StringUtil.getFieldFromConcatString(fullInfo, "\\|", Constants.FIELD_SEARCH_KEYWORDS)
            val clickCategories = StringUtil.getFieldFromConcatString(fullInfo, "\\|", Constants.FIELD_CLICK_CATEGORY_IDS)
            val extractSession = SessionRandomExtract(taskUUID, sessionId, startTime, searchKeywords, clickCategories)
            extractSessionArrayBuffer += extractSession
          }
          index += 1
        }
        extractSessionArrayBuffer
    }
    extractSessionRDD.foreach(println)
    // 6. Write to the database
    /*import session.implicits._
    extractSessionRDD.toDF().write
      .format("jdbc")
      .option("url", ConfigurationManager.config.getString(Constants.JDBC_URL))
      .option("user", ConfigurationManager.config.getString(Constants.JDBC_USER))
      .option("password", ConfigurationManager.config.getString(Constants.JDBC_PASSWORD))
      .option("dbtable", "session_extract_0308")
      .mode(SaveMode.Append)
      .save()*/
  }
}
Summary
- For random sampling in general, first generate the list of sample indices yourself, then match that list against the full data set.